Regular expressions are a notation for describing sets of character strings. When a particular string is in the set described by a regular expression, we often say that the regular expression matches the string.
The simplest regular expression is a single literal character. Except for the special metacharacters
*+?()|
, characters match themselves. To match a metacharacter, escape it with a backslash: \+
matches a literal plus character. Two regular expressions can be alternated or concatenated to form a new regular expression: if e1 matches s and e2 matches t, then e1
|
e2 matches s or t, and e1e2 matches st. The metacharacters
*
, +
, and ?
are repetition operators: e1*
matches a sequence of zero or more (possibly different) strings, each of which match e1; e1+
matches one or more; e1?
matches zero or one. The operator precedence, from weakest to strongest binding, is first alternation, then concatenation, and finally the repetition operators. Explicit parentheses can be used to force different meanings, just as in arithmetic expressions. Some examples:
ab|cd
is equivalent to (ab)|(cd)
; ab*
is equivalent to a(b*)
. The syntax described so far is a subset of the traditional Unix egrep regular expression syntax. This subset suffices to describe all regular languages: loosely speaking, a regular language is a set of strings that can be matched in a single pass through the text using only a fixed amount of memory. Newer regular expression facilities (notably Perl and those that have copied it) have added many new operators and escape sequences. These additions make the regular expressions more concise, and sometimes more cryptic, but usually not more powerful: these fancy new regular expressions almost always have longer equivalents using the traditional syntax.
One common regular expression extension that does provide additional power is called backreferences. A backreference like
\1
or \2
matches the string matched by a previous parenthesized expression, and only that string: (cat|dog)\1
matches catcat
and dogdog
but not catdog
nor dogcat
. As far as the theoretical term is concerned, regular expressions with backreferences are not regular expressions. The power that backreferences add comes at great cost: in the worst case, the best known implementations require exponential search algorithms, like the one Perl uses. Perl (and the other languages) could not now remove backreference support, of course, but they could employ much faster algorithms when presented with regular expressions that don't have backreferences, like the ones considered above. This article is about those faster algorithm
Walang komento:
Mag-post ng isang Komento