AlternativeUniversity.net
Alternative University

Regular Expressions

Regular expressions are a terse pattern-matching language that is embedded in other programming languages to perform text processing.

Regular expression pattern matching involves two text strings: an input string to be processed, and a second string, the “pattern” or “regular expression”, that specifies what to look for in the first.

The second text string, called the pattern or regular expression, may be literal text for simple matching, or may contain metacharacters that specify different types of pattern matching and processing.

The term “regular expression” is often abbreviated “regex” or “regexp”.


Literal Text

Consider the input text string “The quick brown fox”, and the regular expression “quick”:

text  :  The quick brown fox
regex  :  quick

The result of this regex pattern matching is True, because the literal regex “quick” is found in (matches) the text string.

text  :  The quick brown fox
regex  :  quick
result  :  True (The quick brown fox)

This regex contains only plain letters of the alphabet (literal text), so it simply searches for an occurrence of that string in the input string.

Unless otherwise specified, regex matching is case sensitive. For example, if this regex were Quick instead of quick, it would not match, because the input text has a lower case ‘q’ (not an upper case ‘Q’).
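The literal match and the case-sensitivity rule can be tried out in any host language. The following is a minimal sketch that assumes Python and its standard re module (the article does not prescribe a host language); the variable names are illustrative only.

```python
import re

text = "The quick brown fox"

# A literal regex matches if the same characters occur anywhere in the text.
found = re.search("quick", text) is not None

# Matching is case sensitive by default, so "Quick" does not match.
found_capitalized = re.search("Quick", text) is not None

# Case-insensitive matching has to be requested explicitly with a flag.
found_ignoring_case = re.search("Quick", text, re.IGNORECASE) is not None

print(found, found_capitalized, found_ignoring_case)  # True False True
```

Other languages offer an equivalent switch for case-insensitive matching, usually as a flag or modifier on the regex.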


Quantifiers

In the preceding example, the regex was literal text, which found an occurrence of the literal text in the input text. That can be denoted as follows:

The quick brown fox  +  regex/quick/  =  True

or simply:

The quick brown fox  +  /quick/  =  True

with slash (/) delimiters enclosing the regex.

Now consider searching for occurrences of the words color or colour in text input.

If we use the word color as a literal regex, it would not find occurrences of the word colour, and using the word colour as a literal regex would not find color.

To use a single regex that can match the words color or colour, we need to incorporate a metacharacter in the regex instead of simply using literal text as a regex.

The metacharacter that can accomplish this is the question mark (?), which specifies that the character preceding it may appear zero times or one time.

The correct regex to match color or colour is then:

colou?r

where the question mark specifies that the character immediately preceding it (which is ‘u’) must appear zero or one times. If it appears zero times, the regex matches color; if it appears one time, the regex matches colour.

The color was yellow  +  /colou?r/  =  True

The colour was blue  +  /colou?r/  =  True
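As a quick check of the colou?r regex, here is a small sketch that assumes Python's standard re module as the host language. The third input string, with a doubled ‘u’, is a made-up example added to show that the question mark allows at most one occurrence.

```python
import re

# The 'u' before the question mark may appear zero times or one time.
pattern = re.compile(r"colou?r")

texts = [
    "The color was yellow",
    "The colour was blue",
    "The colouur was red",  # hypothetical input: two u's is one too many
]

# Map each text to the substring that matched, or None if nothing matched.
results = {}
for text in texts:
    match = pattern.search(text)
    results[text] = match.group() if match else None

for text, matched in results.items():
    print(text, "->", matched)
```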

This type of metacharacter is called a quantifier. Other quantifiers include the asterisk (*) which matches the preceding item zero or more times (not just zero or one time), and the plus sign (+) which matches the preceding item one or more times.

?  :  match preceding item 0 or 1 time
*  :  match preceding item 0 or more times
+  :  match preceding item 1 or more times
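The difference between the three quantifiers is easiest to see side by side. The sketch below assumes Python's standard re module; the patterns ab?c, ab*c and ab+c and the sample strings are illustrative and do not come from the article. re.fullmatch is used so that the whole sample string must match the pattern.

```python
import re

# Sample strings with zero, one, and two occurrences of 'b'.
samples = ["ac", "abc", "abbc"]

# One illustrative pattern per quantifier, each applied to the letter 'b'.
patterns = {"?": r"ab?c", "*": r"ab*c", "+": r"ab+c"}

# For each quantifier, record whether each sample matches in full.
table = {
    quantifier: [re.fullmatch(regex, sample) is not None for sample in samples]
    for quantifier, regex in patterns.items()
}

for quantifier, row in table.items():
    print(quantifier, row)
# ? [True, True, False]
# * [True, True, True]
# + [False, True, True]
```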

For more information about regex quantifiers, see:

MDN Regular Expression Quantifiers


Grouping

A grouping is enclosed in parentheses. The grouping can include an or-symbol (the vertical bar, |) as a metacharacter to provide a choice between alternatives. For example, the following regex:

house(call|party)

will match housecall or houseparty:

It was a houseparty  +  /house(call|party)/  =  True

For making housecalls they charge extra  +  /house(call|party)/  =  True

The house was white and yellow  +  /house(call|party)/  =  False

The last example does not match because the regex requires that house be immediately followed by one of the alternatives in the grouping, call or party.
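The three grouping examples can be checked with a short sketch, again assuming Python's standard re module as the host language. In Python, parentheses also capture the text they matched, which is available as group(1); that capturing behavior is a detail of the host language's regex flavor that the article has not covered.

```python
import re

pattern = re.compile(r"house(call|party)")

texts = [
    "It was a houseparty",
    "For making housecalls they charge extra",
    "The house was white and yellow",
]

# For each text, record (whole match, text matched by the grouping), or None.
results = []
for text in texts:
    match = pattern.search(text)
    results.append((match.group(), match.group(1)) if match else None)

for text, result in zip(texts, results):
    print(text, "->", result)
# It was a houseparty -> ('houseparty', 'party')
# For making housecalls they charge extra -> ('housecall', 'call')
# The house was white and yellow -> None
```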

To make the grouping optional, the question mark metacharacter can immediately follow the group:

house(call|party)?

which matches house because the question mark metacharacter specifies that the grouping may appear zero times or one time after house (in this case zero times):

The house was white and yellow  +  /house(call|party)?/  =  True

Another way to make this grouping optional, without using the question mark metacharacter, would be to include another or-symbol with a null string within the grouping:

house(call|party|)

In this regex, the absence of text after the second or-symbol denotes a null string (no characters). This regex specifies that house must be followed by call, or party, or the null string:

The house was white and yellow  +  /house(call|party|)/  =  True
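Both ways of making the grouping optional can be compared directly. This sketch assumes Python's standard re module; it shows that the two regexes match the same text, and also one small difference in what Python reports for the grouping itself, which is specific to this host language.

```python
import re

text = "The house was white and yellow"

# Optional grouping: the group may appear zero times or one time.
with_question_mark = re.search(r"house(call|party)?", text)

# Empty alternative: the group always takes part, possibly matching nothing.
with_empty_alternative = re.search(r"house(call|party|)", text)

# Both regexes match the same text.
print(with_question_mark.group())      # house
print(with_empty_alternative.group())  # house

# In Python, a group that did not take part in the match is reported as None,
# while a group that matched the null string is reported as "".
print(with_question_mark.group(1))           # None
print(repr(with_empty_alternative.group(1))) # ''
```

For a simple yes-or-no match the two forms are interchangeable; the distinction only matters when the program inspects what the grouping captured.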

For more information about regex grouping, see:

Perl Regular Expression Grouping


This article is under development