
In this vignette, I use \. to denote the regular expression, and "\\." to denote the string that represents the regular expression.

A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, when it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit, and [^abc] matches anything except the characters a, b or c. A range of characters may be specified by giving the first and last characters, separated by a hyphen. (Because their interpretation is locale- and implementation-dependent, character ranges are best avoided.) The only portable way to specify all ASCII letters is to list them all as the character class [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]. (The current implementation uses numerical order of the encoding.)
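
The two forms of character class described above can be tried out with a short sketch. It uses Python's standard re module purely for illustration (the passage itself is not tied to Python), and the sample strings are made up.

```python
import re

# A character class matches exactly one character from the list.
digit = re.compile(r"[0123456789]")

# A leading caret negates the class: any one character NOT in the list.
not_abc = re.compile(r"[^abc]")

first_digit = digit.search("room 42").group()
non_abc_chars = not_abc.findall("aabbccxyz")

print(first_digit)     # 4
print(non_abc_chars)   # ['x', 'y', 'z']
```

Listing the digits explicitly, rather than writing a range, follows the portability advice in the paragraph above.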

There are a number of special characters that can be used when constructing a regular expression.

The very simplest pattern matched by a regular expression is a literal character or a sequence of literal characters. Anything in the target text that consists of exactly those characters in exactly the order listed will match. A lower case character is not identical with its upper case version, and vice versa. A space in a regular expression, by the way, matches a literal space in the target (this is unlike most programming languages or command-line tools, where spaces separate keywords).

We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.

A number of characters have special meanings to regular expressions. A symbol with a special meaning can be matched, but to do so you must prefix it with the backslash character (this includes the backslash character itself: to match one backslash in the target, your regular expression should include "\\").
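
As a rough illustration of escaping, the sketch below uses Python's re module (an assumption; the text is tool-neutral). Raw string literals (r"...") are used so that Python itself does not interpret the backslashes before the regular expression engine sees them. The sample strings are invented.

```python
import re

# An unescaped period is a wildcard, so this pattern also accepts "3x14".
loose = re.fullmatch(r"3.14", "3x14") is not None

# Escaping the period makes it match only a literal ".".
strict = re.fullmatch(r"3\.14", "3x14") is not None
literal = re.fullmatch(r"3\.14", "3.14") is not None

# The regular expression \\ (two backslashes) matches ONE backslash.
# The target string below contains two backslash characters.
backslash_hits = re.findall(r"\\", "C:\\temp\\file")

# re.escape() adds the backslashes for you when the text is not known in advance.
escaped = re.escape("a+b")
escaped_ok = re.fullmatch(escaped, "a+b") is not None

print(loose, strict, literal, len(backslash_hits), escaped_ok)
```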

For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.

    Regular Expressions (REs) provide a mechanism to select specific strings from a set of character strings.

    The three regular expression functions perform in a similar manner to their counterparts REPLACE, INSTR and LIKE.

    Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.

Two special characters are used in almost all regular expression tools to mark the beginning and end of a line: caret (^) and dollar sign ($). To match a caret or dollar sign as a literal character, you must escape it (i.e. precede it with a backslash "\").

Most letters and characters will simply match themselves. For example, the regular expression test will match the string test exactly. (You can enable a case-insensitive mode that would let this RE match Test or TEST as well; more about this later.)
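
A minimal sketch of literal matching and case-insensitive mode, assuming Python's re module and its re.IGNORECASE flag. The pattern and the sample strings are illustrative only.

```python
import re

# Literal characters match themselves, in order and with the same case.
exact = re.search(r"test", "a test string")

# By default, case matters, so this search finds nothing.
wrong_case = re.search(r"test", "a TEST string")

# With the case-insensitive flag, the same pattern matches.
relaxed = re.search(r"test", "a TEST string", re.IGNORECASE)

print(exact.group())      # test
print(wrong_case)         # None
print(relaxed.group())    # TEST
```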

An interesting thing about the caret and dollar sign is that they match zero-width patterns. That is, the length of the string matched by a caret or dollar sign by itself is zero (but the rest of the regular expression can still depend on the zero-width match). Many regular expression tools provide another zero-width pattern for word-boundary (\b). Words might be divided by whitespace like spaces, tabs, newlines, or other characters like nulls; the word-boundary pattern matches the actual point where a word starts or ends, not the particular whitespace characters.
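
The zero-width behaviour can be checked directly. The sketch below assumes Python's re module, where the word-boundary pattern is written \b; other tools may spell it differently. The sample words are invented.

```python
import re

# The caret anchors the pattern to the start, the dollar sign to the end.
starts_with_cat = re.search(r"^cat", "catalog") is not None
ends_with_cat = re.search(r"cat$", "catalog") is not None

# A caret on its own matches, but the match consumes no characters.
caret_match = re.search(r"^", "catalog")
zero_width = caret_match.end() - caret_match.start()

# \b matches the position between a word character and a non-word character,
# so only the standalone occurrences of "cat" are found.
whole_words = re.findall(r"\bcat\b", "cat catalog bobcat cat.")

# Escaping turns the dollar sign back into a literal character.
price = re.search(r"\$5", "it costs $5").group()

print(starts_with_cat, ends_with_cat, zero_width, whole_words, price)
```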

In regular expressions, a period can stand for any character. Normally, the newline character is not included, but most tools have optional switches to force inclusion of the newline character also. Using a period in a pattern is a way of requiring that "something" occurs here, without having to decide what.
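
The wildcard behaviour of the period, including the newline exception, might look like this in Python's re module. The choice of tool is an assumption; re.DOTALL is Python's name for the optional switch that makes the period match a newline as well.

```python
import re

# The period requires that some character is present, whatever it is.
three_chars = re.findall(r"c.t", "cat cot c t ct")

# By default the period does not match a newline character.
no_newline = re.search(r"a.b", "a\nb")

# With re.DOTALL the newline is included.
with_newline = re.search(r"a.b", "a\nb", re.DOTALL)

print(three_chars)               # ['cat', 'cot', 'c t']
print(no_newline)                # None
print(with_newline is not None)  # True
```

Note that "ct" is not matched: the period stands for exactly one character, not for zero or more.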

Users who are familiar with DOS command-line wildcards will know the question-mark as filling the role of "some character" in command masks. But in regular expressions, the question-mark has a different meaning, and the period is used as a wildcard.

A regular expression can have literal characters in it, and also zero-width positional patterns. Each literal character or positional pattern is an atom in a regular expression. You may also group several atoms together into a small regular expression that is part of a larger regular expression. One might be inclined to call such a grouping a "molecule," but normally it is also called an atom.
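
One way to see why a group counts as a single atom is to apply an operator to it. The sketch below assumes Python's re module and borrows the + quantifier ("one or more of the preceding atom"), which the text above has not introduced, only to make the difference visible.

```python
import re

# Without grouping, + applies only to the single atom "b".
ungrouped_abbb = re.fullmatch(r"ab+", "abbb") is not None
ungrouped_ababab = re.fullmatch(r"ab+", "ababab") is not None

# With parentheses, "ab" becomes one atom, and + repeats the whole group.
grouped_ababab = re.fullmatch(r"(ab)+", "ababab") is not None
grouped_abbb = re.fullmatch(r"(ab)+", "abbb") is not None

print(ungrouped_abbb, ungrouped_ababab)   # True False
print(grouped_ababab, grouped_abbb)       # True False
```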

