Regular Expression Introduction

What are Regular Expressions?

Anywhere that you work with text, regular expressions are a powerful tool! Whether you are searching for files, editing text documents, evaluating logs, or validating program input, Regex is often the right tool for the job. Though regular expressions have a reputation for being difficult and complicated, my aim is to start with the basics and build up in layers that are easy to understand.

Wild-card Searches

If you have ever used wild-cards in searches or command prompts, the *'s and ?'s that stand in for other letters or symbols, you're familiar with the idea of Regex. If not, let me lay out a scenario to illustrate. Say you have a directory on your computer with a collection of thousands of articles about animals, one file per article, with the article title as the filename. If you wanted to find all the articles that had 'cat' in the title, you could use something like:

*cat*

or match 'cats' (and, in some command prompts, 'cat' as well) with:

cat?

This is convenient, but has some fairly severe limitations. Say you wanted to match articles about cats but not catfish? What if you wanted to find catfish but not cats? Wild-card searches don't provide much help there, but Regex is the perfect tool for the job!

Terminology & Engines

There are a few terms with very specific meanings in the context of Regex. The term you craft to match what you want to search for is referred to as the Pattern. Often this is what people call the Regular Expression. The text being searched is called the String. The pattern is applied to the string to find Matches. The section, or area, of the string which is matched by the pattern is called a region. In a text document, the text of the document is the string. If searching for files, the string consists of the file names. Any character with a special meaning, like the * or ? in wild-card searching, is referred to as a meta-character.
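To make these terms concrete, here is a minimal sketch. It assumes Python's built-in re module as the engine, which is my choice for illustration only, and the variable names and sample text are hypothetical.

```python
import re

pattern = r"cat"          # the Pattern
string = "concatenate"    # the String being searched

match = re.search(pattern, string)   # apply the pattern to the string
region = match.span()                # (start, end) offsets of the matched region
matched_text = match.group(0)        # the text found inside that region

print(region, matched_text)
```

The match object carries both the region and the text, which is how most engines report a match.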

Regex has been around for a long time, and a number of 'flavors' have evolved, which are referred to as engines. Each one supports a core set of standard symbols and features, and then they diverge, each adding its own additional symbols, syntax, and functionality. Most of what I will discuss is core, and readily supported by modern Perl, JavaScript, and Java. These are common engines. There are many more; Wikipedia has a good reference on the different Regex engines.

Escaping & Meta-Characters

We briefly defined meta-characters as those with special meaning. A quick note on this topic is necessary because we need a way to remove that special meaning and use those characters for matching. To do that we use what is called escaping. In escaping we precede a special character with a backslash (\), the slash that runs from top left to bottom right. In order to match the asterisk (*) we might use a pattern like

\*note

which would match the following string

*note

Any character with a special meaning can be escaped.

Related to escaping meta-characters is the idea of using the escape character, backslash (\), to give some characters special meaning. For instance, non-visible characters, such as tabs (\t), newlines (\n), and carriage returns (\r), are formed by escaping characters that otherwise have no special meaning.

Learning what characters have special meaning, and what characters can be escaped to give them special meaning, is mostly just experience, practice, and familiarity with the tools. If all you remember is \t and \n you will have what you need 90+% of the time.
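Both directions of escaping can be tried out in a few lines. This is a small sketch assuming Python's re module (an assumption for illustration); re.escape is a Python helper that adds the backslashes for you, and the variable names are hypothetical.

```python
import re

# \* removes the special meaning of the asterisk, so it matches a literal '*'.
literal_star = re.fullmatch(r"\*note", "*note") is not None

# re.escape builds an escaped pattern from plain text.
escaped = re.escape("*note")

# \t and \n give the letters t and n a special meaning: tab and newline.
tab_newline = re.fullmatch(r"a\tb\n", "a\tb\n") is not None

print(literal_star, escaped, tab_newline)
```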

Wild-cards & Quantifiers

As with normal searching, letters, numbers, and some symbols are matched exactly. Regex takes quite an expanded view of wild-cards, which it calls character classes, but to get there we start small, with the humble dot (., or period). The dot matches any character (in most engines, any character except a newline, by default). If you wanted to match any string three characters long, you could do it with three dots, '...'. For instance, the pattern:

Regular.Expression..Are .....

would match the following lines

  • Regular Expressions Are Great
  • Regular Expressions Are Good!
  • Regular-Expression_ Are !3%</

You can match a dot, removing the special meaning within your pattern, by escaping it, as discussed above. For example, the pattern:

Testing\. Testing\. 1\. 2\. 3\.\.\.

would match the string

Testing. Testing. 1. 2. 3...
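The two dot examples above can be checked with a short sketch, assuming Python's re module (my choice for illustration). re.fullmatch requires the pattern to cover the whole string; the variable names are hypothetical.

```python
import re

dot_pattern = r"Regular.Expression..Are ....."
lines = [
    "Regular Expressions Are Great",
    "Regular Expressions Are Good!",
    "Regular-Expression_ Are !3%</",
]
# Each dot stands in for exactly one character, whatever it is.
dot_results = [re.fullmatch(dot_pattern, line) is not None for line in lines]

escaped_pattern = r"Testing\. Testing\. 1\. 2\. 3\.\.\."
# An escaped dot only matches a literal dot.
escaped_ok = re.fullmatch(escaped_pattern, "Testing. Testing. 1. 2. 3...") is not None
escaped_rejects = re.fullmatch(escaped_pattern, "Testing! Testing. 1. 2. 3...") is None

print(dot_results, escaped_ok, escaped_rejects)
```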

The power of Regex becomes more apparent with the idea of quantifiers. Quantifiers let you choose how many times to match something. They can be used on just about anything and always act on the preceding character. The typical valid quantifiers are:

Quantifiers

Quantifier    Meaning
?             Matches the preceding character 0 or 1 times
+             Matches the preceding character 1 or more times
*             Matches the preceding character 0 or more times
{4,}          Matches the preceding character 4 or more times
{4}           Matches the preceding character exactly 4 times
{4,6}         Matches the preceding character at least 4 times, but no more than 6

Examples

Pattern                  Matches
a+h.?                    aaah,
a.?s                     ads
wo+ds*                   woods
aaa.?e+i{4}o{2,}u*       aaaaeeeeiiiioooouuuu
.*                       I could type anything I wanted here!
\d{10} - .+!             0123456789 - all the numbers!
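Each pattern in the examples can be run against its sample text. The sketch below assumes Python's re module (an assumption for illustration) and also probes the {4,6} bounds with made-up strings of a's.

```python
import re

examples = [
    (r"a+h.?", "aaah,"),
    (r"a.?s", "ads"),
    (r"wo+ds*", "woods"),
    (r"aaa.?e+i{4}o{2,}u*", "aaaaeeeeiiiioooouuuu"),
    (r".*", "I could type anything I wanted here!"),
    (r"\d{10} - .+!", "0123456789 - all the numbers!"),
]
# True where the pattern matches the entire sample string.
results = [re.fullmatch(p, s) is not None for p, s in examples]

# {4,6} accepts 4 to 6 repetitions and nothing outside that range.
too_few = re.fullmatch(r"a{4,6}", "aaa") is None
in_range = re.fullmatch(r"a{4,6}", "aaaaa") is not None
too_many = re.fullmatch(r"a{4,6}", "aaaaaaa") is None

print(results, too_few, in_range, too_many)
```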

Greedy vs. Reluctant

By default, quantifiers in Regex engines are what we call greedy. This means that they will attempt to match as many characters as they possibly can while forming a match. The opposite of greedy is reluctant (also called lazy). A reluctant quantifier will match only as many characters as are required to form a match. The subtlety of the difference between reluctant and greedy isn't apparent in the examples I've given so far because we've been matching against entire strings, but as our interest expands to locating regions within strings, this greedy behavior can be good or bad. For instance, if I were trying to match just a part of a string with the pattern

this.+is

applied against the string

this is some text that is repetitive in it's use of is

Without specifying reluctant or greedy, there are three possible matches:

  1. this is
  2. this is some text that is
  3. this is some text that is repetitive in it's use of is

The first is the reluctant match; the last is the greedy match.

As I mentioned, most engines are greedy by default. To make a quantified match reluctant you follow the quantifier with a question mark (?). For instance, to get the first match from the pattern above, I could change my pattern to the following

this.+?is

The engine knows that + is a quantifier, so a ? that follows it is read as the reluctant modifier rather than as the 'zero or one' quantifier used in the preceding examples.
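The difference is easy to see side by side. This sketch assumes Python's re module (my choice for illustration) and uses re.search, which finds the first region of the string that the pattern matches.

```python
import re

text = "this is some text that is repetitive in it's use of is"

# Greedy: .+ grabs as much as it can, so the match runs to the last 'is'.
greedy = re.search(r"this.+is", text).group(0)

# Reluctant: .+? grabs as little as it can, so the match stops at the first 'is'.
reluctant = re.search(r"this.+?is", text).group(0)

print(repr(greedy))
print(repr(reluctant))
```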

Character Classes

A class is simply a collection of like items. In Regular Expressions that simply means a group of characters. We have already seen one character class, the dot. It represents all characters. There are a lot more pre-defined character classes, such as \s for space characters (like space and tab), or \d for digits (numbers), and there is also great flexibility to define your own. Here's a list of the more common pre-defined character classes.

Pre-Defined Character Classes

Pattern    Matches                             Examples
\d         Matches any digit, 0-9              0,1,2,3,4,5,6,7,8,9
\s         Matches any whitespace character    space, tab (\t), newline (\n), return (\r), vertical tab (\v)
\w         Matches any word character          the digits 0 through 9, the letters a through z and A through Z, and the underscore (_)

These named character classes can all be negated by using the capital (upper-case) version of the letter. For instance, to match any non-digit you would use \D.
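Here is a brief sketch of the pre-defined classes and their negated forms, assuming Python's re module (an assumption for illustration). I pass re.ASCII so that \d and \w stick to the characters listed in the table; without it, Python also accepts digits and letters from other scripts. The sample strings are made up.

```python
import re

digits = re.findall(r"\d", "a1 b22_c", re.ASCII)      # every single digit
words = re.findall(r"\w+", "a1 b22_c", re.ASCII)      # runs of word characters
spaces = re.findall(r"\s", "a b\tc", re.ASCII)        # each whitespace character
non_digits = re.findall(r"\D+", "ab12cd", re.ASCII)   # \D is the negation of \d

print(digits, words, spaces, non_digits)
```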

There can be a great number of potential pre-defined character classes depending on the engine used, and each may use different syntax, but they all work the same way: they represent some group of characters that they will match against. For example, a commonly supported kind is the POSIX character class, which is written inside square brackets. In engines that support them, this form matches any upper-case letter:

[[:upper:]]

I mentioned previously that Regex allows you to create your own character classes, and you can define any combination you wish. To accomplish this you place the square-brackets ([ and ]) around any characters you want to be included in your class. For example, to create a character class that would match all the vowels you could use the following pattern.

[aAeEiIoOuU]

which would match

E

but not match

X

Defined classes can accept ranges as well by using the dash (hyphen or minus character). For example, to create a class of the numbers from 0 to 7, handy for use in validating octal (base-8), you could use the following class:

[0-7]

To define a class to match lower-case and upper-case hexadecimal symbols, you could use

[a-fA-F0-9]

You can negate a character class by using the caret (^) as the first item in the list. For example, to match any character that is not a 'greater than' (>) character, or a space, tab, or newline, useful for parsing HTML/XML tag names, you could use a character class like:

[^> \t\n]

If you would like to include the caret (^) in your class without negating everything, you simply ensure it is not the first item in the list. If you wish to negate, the caret must be the first item in the list! It is not possible to both include and negate in the same discrete class; however, some engines do allow you to nest classes.
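The custom classes above can be exercised in one short sketch, assuming Python's re module (my choice for illustration). The sample strings and variable names are hypothetical.

```python
import re

# A hand-built class of vowels.
vowels = re.findall(r"[aAeEiIoOuU]", "Regular Expressions")

# Ranges: octal digits and hexadecimal symbols.
octal_ok = re.fullmatch(r"[0-7]+", "755") is not None
octal_bad = re.fullmatch(r"[0-7]+", "789") is None
hex_ok = re.fullmatch(r"[a-fA-F0-9]+", "DeadBeef42") is not None

# Negation: read characters until a '>', space, tab, or newline shows up.
tag_name = re.match(r"[^> \t\n]+", 'div id="x">').group(0)

# A caret that is not first in the list is just another member of the class.
with_caret = re.findall(r"[a^]", "a^b")

print(vowels, octal_ok, octal_bad, hex_ok, tag_name, with_caret)
```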

Groups & Capturing

The last topic we will discuss in our introduction to Regex is groups. I hate to sound redundant, but groups do just what they sound like: group data together. They are formed with parentheses, '(' and ')'. For instance:

some text (this is a group) other text

Groups have a lot of different uses, so I'll just cover the most common. The first is to allow quantification of more than a single character, which is all we have quantified thus far. For instance, if I were looking for times where I may have typed the word 'is' more than once (I tend to do this from time to time when I type fast), I could use this pattern:

(\bis\b\s*){2,}

In the above pattern I introduced something new, \b. It looks like a character class, but it is a special kind of token we call an anchor, which doesn't match characters but instead matches on the relationship between adjacent characters. In this case it represents a word boundary, the transition between a word character (matching \w) and a non-word character (matching \W), or the start or end of the string. This allows me to actually catch the times I repeated 'is' as a word, 'is is', and not just as part of some larger word, 'isis'.
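A quick sketch of the repeated-word pattern, assuming Python's re module (an assumption for illustration). The sample sentences are made up.

```python
import re

repeated_is = r"(\bis\b\s*){2,}"

# 'is is' is the word 'is' twice in a row, so the group repeats twice.
# The 'is' inside 'This' is skipped because no word boundary precedes it.
hit = re.search(repeated_is, "This is is a test")
hit_text = hit.group(0)

# 'crisis is' contains 'is' inside a larger word, then only one standalone 'is'.
miss = re.search(repeated_is, "the crisis is over")

print(repr(hit_text), miss)
```

Note that the match includes the trailing whitespace, because \s* sits inside the quantified group.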

Another feature of groups that opens up many new avenues of use is their capturing feature. Anything grouped is saved and can be accessed later, both in your pattern, and as a matched region, the area of the string which matched the pattern. These saved groups can be accessed later in the pattern using \N, where N is the group number, counted from left to right starting at 1 (group 0 is the match for the entire pattern). For instance, the pattern

(F[a-z]{3}\s).+?\1.*

says to match an F followed by three lower-case letters and a whitespace character, capturing all five characters; then, reluctantly, one or more of any character; then the captured text again; and finally whatever remains. This would match on the string:

Fred likes to go by Fred in public

but would not match on the string

Fred likes to go by Frederic in public

because the second occurrence of Fred isn't followed by a space.

It's important to note that capturing groups capture the string, not the pattern. For instance, the pattern

(F[a-z]{3,4}) and \1

would not match on the string

Fred and Frank

because when the group is matched it captures the matched part of the string, 'Fred', and not the pattern which matched Fred, 'F[a-z]{3,4}'. When it later back-references this capture with \1 it looks to match 'Fred', not 'F[a-z]{3,4}', and fails.
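Both back-reference examples can be verified with a few lines. This sketch assumes Python's re module (my choice for illustration), where \1 works as described; the variable names are hypothetical.

```python
import re

fred = r"(F[a-z]{3}\s).+?\1.*"

m = re.fullmatch(fred, "Fred likes to go by Fred in public")
captured = m.group(1)   # the text the group captured, trailing space included

# 'Frederic' has no whitespace after 'Fred', so \1 cannot match there.
no_match = re.search(fred, "Fred likes to go by Frederic in public")

# \1 repeats the captured text, not the pattern that captured it.
same_text = re.search(r"(F[a-z]{3,4}) and \1", "Fred and Fred") is not None
other_text = re.search(r"(F[a-z]{3,4}) and \1", "Fred and Frank") is None

print(repr(captured), no_match, same_text, other_text)
```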

Captured groups also allow for much finer grained control over what surrounds the items being sought. For instance, if I wanted to pull the tag names from HTML and/or XML I could use the pattern:

<([-\w]+)[^>]*>

which, when matched against the string

<div id="div_1">

would match the entire string and the group would capture the actual tag name 'div'.

This pattern is a bit more involved than those we started with, but it doesn't use anything we haven't been exposed to. We start out by matching the tag opener (<), and then capture a group of the next 1 or more word characters or hyphens (things we might consider valid tag characters). Then we match any number of non tag closes (>), followed by a tag close. Our group captures just the tag name, despite our pattern matching the entire tag.

Note how I am using '[^>]*>' instead of something like '.*>'. The reason is that, as we mentioned before, Regex engines are greedy by default. If I were to use that pattern against a true HTML or XML document, with multiple tags, I would find that '.*>' will happily match any number of '>' characters, as long as it could end the match with one. That was not my intent! If I were to repeatedly apply the pattern to find all the tags present the first match would consume all the other tags I was interested in and I would never match against them!

It is often advantageous to define what you are not looking for as much as you define what you do want to match. This allows better control, makes the pattern more readable, and can also improve performance. I could have just as easily used a reluctant quantifier to accomplish the same thing, like this:

<([-\w]+).*?>

but '<([-\w]+)[^>]*>' is more explicit and typically easier for folks to understand by just looking at it.
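To see the three variants on a string with several tags, here is a sketch assuming Python's re module (an assumption for illustration). re.findall applies the pattern repeatedly and returns what the group captured each time; the HTML snippet is made up.

```python
import re

html = '<div id="div_1"><span class="x">text</span></div>'

# Negated class: each match stops at the first '>' after the tag name.
explicit = re.findall(r"<([-\w]+)[^>]*>", html)

# Reluctant quantifier: same result on this input.
reluctant = re.findall(r"<([-\w]+).*?>", html)

# Greedy dot: the first match swallows everything up to the last '>'.
greedy = re.findall(r"<([-\w]+).*>", html)

print(explicit, reluctant, greedy)
```

The closing tags are not captured because '/' is neither a word character nor a hyphen.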

Regex Usage Notes

Regex can be used in many different contexts. One area I find myself using them a lot is in formatting and modifying text files, such as logs, or source code. Many text editors support regular expressions for find and replace, and that can be a powerful tool. For instance, if I had a log file I wanted to remove the info lines from, and those lines were of the form:

MM/DD/YYYY-hh:mm:ss [INFO] text description

A find and replace could be used, with the replacement being empty, using the following pattern:

\d\d/\d\d/\d\d\d\d-\d\d:\d\d:\d\d \[INFO\] .*?(?:\n|$)

The log file could be quite large, potentially millions of lines, but the regex will work efficiently to identify the necessary lines and replace them with nothing. Note that we match up to and including the newline, using a reluctant match. This removes the line itself, not just the content of the line, which ensures your resulting file doesn't have empty lines left after the replacement. Also note that the last line in the file might not be terminated with a newline, so I use \n|$ to match at the end of the string as well (in a non-capturing group). This is OK because $ is an anchor: it matches a position rather than a character, so nothing extra is removed.
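The same find and replace can be sketched in code. This assumes Python's re module and its re.sub function (my choice for illustration; a text editor's find and replace would play the same role), and the log lines are invented sample data.

```python
import re

log = (
    "01/02/2024-10:00:00 [INFO] service started\n"
    "01/02/2024-10:00:01 [ERROR] something broke\n"
    "01/02/2024-10:00:02 [INFO] service stopped"   # last line has no newline
)

info_line = r"\d\d/\d\d/\d\d\d\d-\d\d:\d\d:\d\d \[INFO\] .*?(?:\n|$)"

# Replace every INFO line, including its newline, with nothing.
cleaned = re.sub(info_line, "", log)

print(repr(cleaned))
```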

Perhaps you just want to remove empty lines from a file. In that case you could use the pattern:

\n{2,}

and the replacement:

\n

Or maybe your empty lines actually have tabs or spaces, so you could instead use a pattern like:

\n\s*?\n
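A short sketch of both replacements, assuming Python's re module (an assumption for illustration) with made-up sample text. One thing I noticed while trying it: with a run of several blank lines, the reluctant \s*? stops at the first newline it can, so a single pass may leave a blank line behind, while a greedy \s* collapses the whole run.

```python
import re

# Truly empty lines: collapse runs of newlines into one.
collapsed = re.sub(r"\n{2,}", "\n", "first\n\n\nsecond")

# 'Empty' lines that contain spaces or tabs.
text = "first\n\nsecond\n \t\nthird\n"
stripped = re.sub(r"\n\s*?\n", "\n", text)

# Two blank lines in a row: reluctant versus greedy in a single pass.
reluctant_pass = re.sub(r"\n\s*?\n", "\n", "a\n\n\nb")
greedy_pass = re.sub(r"\n\s*\n", "\n", "a\n\n\nb")

print(repr(collapsed), repr(stripped), repr(reluctant_pass), repr(greedy_pass))
```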

Usage in other programming languages

A very common context in which Regex is used is in program code. This usage can have its own particular characteristics that can be confusing when first encountered. Mostly this is related to escaping. To illustrate, consider a statement such as:

String regex_pattern = "three\n\s*line\s*\nstring";

At this point nothing related to Regex is involved; I am simply declaring and instantiating a string literal. When I do this, the compiler for the language being used will parse this string looking for backslashes (\), and then look at the following character to determine if it has special meaning. For instance, in the example above, the \n has a special meaning: it represents the newline character. However, the string parser may give no special meaning to \s. It may throw an error, or it may simply remove the backslash and use the character as is, \s becoming s. The string parser replacing \n with an actual newline character is perfectly acceptable. The Regex engine will match against a newline character exactly the same as it would match against the combination \n. On the other hand, the parser replacing \s with s, or throwing an error, is unacceptable. The replacement would entirely change the meaning of the pattern, which would no longer match what we expect it to match, a worse outcome than an error.

The solution to the above is to double escape those things we wish to make it through the string parser and remain escaped for the Regex engine. For instance, let's consider the following statement:

String regex_pattern = "three\n\\s*line\\s*\nstring";

In this string literal we have escaped the \s by adding another backslash, \\s. When the string parser parses this string it will replace the \n with newline characters as before. When it encounters \\s it sees the backslash and looks at the second character, a backslash as well. Backslash has no special meaning when escaped, so the parser removes the first backslash, leaves the second and moves on. Last it encounters the s and leaves it. The result is a string which looks like we expect:

three
\s*line\s*
string

or as a string literal

three\n\s*line\s*\nstring
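The same idea can be sketched in Python, which I am assuming here purely for illustration. Python also offers raw string literals (the r prefix) that skip string-level escape processing, so the backslashes reach the Regex engine untouched. The variable names and sample text are hypothetical.

```python
import re

# Ordinary literal: \n becomes a real newline, \\s becomes backslash + s.
doubled = "three\n\\s*line\\s*\nstring"

# Raw literal: nothing is processed, so the engine itself receives \n and \s.
raw = r"three\n\s*line\s*\nstring"

target = "three\n   line \nstring"

# The two pattern strings differ, yet the engine treats them the same way.
strings_differ = doubled != raw
doubled_matches = re.fullmatch(doubled, target) is not None
raw_matches = re.fullmatch(raw, target) is not None

print(repr(doubled))
print(repr(raw))
print(strings_differ, doubled_matches, raw_matches)
```

This shows the point made earlier: the engine matches a real newline character and the two-character combination \n in the same way.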

This behavior can be particularly difficult to analyze at times. For instance, consider the following string literal:

\"\\[abc\\]\"

When parsed by the compiler for escaped characters it would first see \" and remove \ and leave ". Next it sees \\, removes one \ and leaves \. Next it hits [ and leaves it, along with the abc. Next it finds \\, removes a \ and leaves \, next it finds ] and leaves it. Last \" becomes ".
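The steps just described can be traced in a small sketch, assuming Python (my choice for illustration), whose string parser handles \" and \\ the same way. The variable names are hypothetical.

```python
import re

# The literal as it would appear in source code, wrapped in its own quotes.
pattern = "\"\\[abc\\]\""

# What the string parser leaves behind: quote, \[, abc, \], quote.
parsed = pattern
print(parsed)

# As a pattern, \[ and \] are literal brackets, not a character class.
matches_brackets = re.fullmatch(pattern, '"[abc]"') is not None
not_a_class = re.fullmatch(pattern, '"a"') is None

print(matches_brackets, not_a_class)
```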

The resulting string would look like this:

"\[abc\]"

Which when used as a pattern would match the string

"[abc]"