Introduction to Regular Expressions

Introduction

Regular expressions are awesome. Regular expressions are confusing. They are widely used to match patterns of characters within a string, commonly for email address validation, URL validation and even for stripping non-alphanumeric characters from a string. If you skim the surface of regex, it can be rather fun and easy, but dive deeper and start combining pieces of regular expression syntax, and things can get complicated and confusing quickly!

The Basics

Let's start with a very simple scenario. If we have the phrase The cat sat on the mat. and want to see if the string contains the word cat, we can use the test(string) method that every JavaScript regular expression has:

var s = 'The cat sat on the mat.';
if(/cat/.test(s)){
    alert('We found a cat!');
} 

With the regular expression of /cat/ you can see that we've found a cat - as expected.

Simple grouping

This is all well and good, but what if we want to check if the phrase has cat, sat or mat in it? For multiple options we place them in brackets () and separate them by a pipe |. So we could in theory use the regex of:

/(cat|sat|mat)/

That would work, but I personally think that's a lot of repetitiveness, so why don't we break that down a bit. We can simplify that down to:

/(c|s|m)at/

What this is matching is a c or s or m followed immediately by at, so cat, sat or mat. With me so far? Let's complicate it a bit more, but keep it to simple strings.
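To see the grouping in action, here is a minimal sketch that runs the pattern over a few words. The word list and the variable names are made up for illustration.

```javascript
// Which of these words contain c, s or m followed immediately by "at"?
const rhymePattern = /(c|s|m)at/;
const rhymeWords = ['cat', 'sat', 'mat', 'bat', 'at'];

// Keep only the words the pattern accepts.
const rhymeHits = rhymeWords.filter((word) => rhymePattern.test(word));

console.log(rhymeHits); // [ 'cat', 'sat', 'mat' ]
```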

Repetition repetition

Let's change the string that we have to Hey, look look a book on a stool. Now we want to match the word look. Simple: use /look/, no problems there. But let's think about that: we have repeated letters, and I don't like repetitiveness.

/lo{2}k/

That will work the same. It might look complicated, and yes, it's overcomplicating this, but it highlights the scenario well. We can require a number of occurrences of the preceding character or group by following it with curly braces containing a number. So for three occurrences we could use {3}, and {9} for nine. Straightforward.

Match look look (which is two occurrences of look followed by a space) with /(look ){2}/ or even /(lo{2}k ){2}/ - we're not getting complex yet! That's all cool, but what if we want to match a string that could be look, loook, looook or loooook but not looooook?

/(look|loook|looook|loooook)/

Yeah, sod that, it's just ridiculous! We can put a comma after the first number in those braces, followed by a second number, to set the maximum number of occurrences. So in our case we can use:

/lo{2,5}k/
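As a quick check of the {2,5} range, the sketch below builds each test word from a count of o's so the letter counts are unambiguous. The variable names are illustrative only.

```javascript
const stretchedLook = /lo{2,5}k/;
const lookResults = {};

// Build "l" + n o's + "k" for n = 1..6 and record whether the pattern accepts it.
for (let n = 1; n <= 6; n++) {
    const word = 'l' + 'o'.repeat(n) + 'k';
    lookResults[word] = stretchedLook.test(word);
}

// Only the words with 2, 3, 4 or 5 o's come back true.
console.log(lookResults);
```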

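Nice: neater and simple to understand. So let's throw everything together for now. We want to match the words book, look, loook and looook, so we could use any of the following, for example:

/(book|look|loook|looook)/
/(book|lo{2,4}k)/
/(b|l)o{2,4}k/

There are many possibilities (note that the last one also lets boook and booook through), so why not tweet us if you have a better alternative!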
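One way to compare alternatives is to run each against the same word list. The three candidate patterns and their labels below are assumptions for illustration, not a definitive answer.

```javascript
// Words we want (book, look, loook, looook) plus two we arguably don't.
const bookWords = ['book', 'look', 'loook', 'looook', 'boook', 'l' + 'o'.repeat(5) + 'k'];

// Hypothetical candidates, from most verbose to most compact.
const bookCandidates = {
    spelledOut: /(book|look|loook|looook)/,
    mixed: /(book|lo{2,4}k)/,
    compact: /(b|l)o{2,4}k/,
};

// For each candidate, list the words it accepts.
const bookReport = {};
for (const [name, pattern] of Object.entries(bookCandidates)) {
    bookReport[name] = bookWords.filter((word) => pattern.test(word));
}

console.log(bookReport);
```

The compact version is the shortest, but it is also the only one of the three that accepts boook, so shorter is not always stricter.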

One character from a selection

Above we've used (b|l) and (c|s|m) which, again, is pretty awesome and slender regex, but why do something half-arsed? Let's take that a step further and introduce some square brackets []. These denote a set of characters, of which exactly one must appear at that position. So (c|s|m)at can be slimmed down to:

/[csm]at/

Nice and simple, and we can of course use the braces after them to allow a number of occurrences...

/[acmst]{2,3}/

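This will match any run of two or three characters drawn from that set, such as at, cat, sat, mat or tam, and so forth. It will only match 2 or 3 characters at a time, so it will also find a match inside the word ham (the am), cat or Birmingham (the am again).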
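To see exactly which part of a word gets matched, a small sketch using String.prototype.match follows. The sample words and names are chosen for illustration.

```javascript
const setPattern = /[acmst]{2,3}/;
const setSamples = ['ham', 'cat', 'Birmingham', 'dog'];

// match() returns null when nothing matches; otherwise index 0 holds the matched text.
const setMatches = setSamples.map((word) => {
    const found = word.match(setPattern);
    return found ? found[0] : null;
});

console.log(setMatches); // [ 'am', 'cat', 'am', null ]
```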

Optional matching

Anything that's followed by a question mark ? is classed as optional. So if we wanted a regular expression to match either flow or flo, we could use the following:

/flow?/

That will mean that the w is an optional match, and isn't required to be matched.
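A short sketch of the optional w, with illustrative sample words. Note that because the pattern isn't anchored, it also finds flo inside a longer word.

```javascript
const optionalW = /flow?/;
const flowSamples = ['flow', 'flo', 'floor', 'fl'];

// Record the matched text for each sample, or null when there is no match.
const flowMatches = flowSamples.map((word) => {
    const found = word.match(optionalW);
    return found ? found[0] : null;
});

console.log(flowMatches); // [ 'flow', 'flo', 'flo', null ]
```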

Ranges

For a range of letters or numbers, we can separate the first and last by a hyphen inside square brackets. This will then match anything that's within that range, saving us from writing out all of the letters from a to z: we can use [a-z] instead. [0-9] and [A-Z] are also available. So let's match two capital letters, then one lowercase letter, then a number:

/[A-Z]{2}[a-z][0-9]/

It's as simple as that.
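Here is a minimal check of that pattern against a few made-up codes. The sample strings are assumptions for illustration.

```javascript
const codePattern = /[A-Z]{2}[a-z][0-9]/;
const codeSamples = ['ABc1', 'Abc1', 'ABC1', 'xxABc1yy'];

// true where two capitals, one lowercase letter and one digit appear in a row.
const codeResults = codeSamples.map((sample) => codePattern.test(sample));

console.log(codeResults); // [ true, false, false, true ]
```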

'Not' operator

When we're using square bracket matching, we might instead want to list characters that shouldn't be matched. We can do that by making the first character in the brackets a caret ^. So, for example, to match anything that isn't alphanumeric we could use the regex:

/[^a-zA-Z0-9]/
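This is the pattern behind the "stripping non-alphanumeric characters" use mentioned in the introduction. The sketch below assumes the g modifier (covered under Match Modifiers) so that every occurrence is replaced, not just the first; the input string is made up.

```javascript
const messyInput = 'Hello, World! 123';

// Replace every character that is not a letter or digit with nothing.
const strippedInput = messyInput.replace(/[^a-zA-Z0-9]/g, '');

console.log(strippedInput); // HelloWorld123
```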

Dot notation

If we just want to match whatever is between two characters, we can use a dot (.). This will match any single character except a newline, so it can be handy at times. It is roughly the same as using the regular expression /[^\r\n]/, and is normally followed by a multiple occurrence selector.
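A small sketch of the dot on its own, standing in for exactly one character between c and t. The sample strings are illustrative.

```javascript
const anyMiddle = /c.t/;
const dotSamples = ['cat', 'cot', 'c t', 'ct', 'c\nt'];

// The dot needs exactly one character, and that character cannot be a newline.
const dotResults = dotSamples.map((sample) => anyMiddle.test(sample));

console.log(dotResults); // [ true, true, true, false, false ]
```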

Multiple occurrences

We've already seen the brace method in use to select multiples of a string, but what if we don't know how many there will be? What if we just want a blanket catch? This is where we have two symbols that we can use, but be careful using them!

The asterisk * matches zero or more of whatever precedes it
The plus sign + matches one or more of whatever precedes it

You need to be careful because both of the above symbols are greedy, meaning that if you used the regular expression /<.*>/ against the string <strong>Testing</strong>, it wouldn't match just the first <strong> tag as you would hope. In fact, it matches the whole string.

This is where we use the previously mentioned question mark. Placed straight after the .* as in /<.*?>/, it makes the match not greedy, so it will then match just the first <strong> tag as originally hoped.
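The difference is easiest to see side by side. This sketch runs both versions against the same string; the variable names are illustrative.

```javascript
const markup = '<strong>Testing</strong>';

// Greedy: .* grabs as much as it can, so the match runs to the last >.
const greedyMatch = markup.match(/<.*>/)[0];

// Not greedy: .*? grabs as little as it can, so the match stops at the first >.
const lazyMatch = markup.match(/<.*?>/)[0];

console.log(greedyMatch); // <strong>Testing</strong>
console.log(lazyMatch);   // <strong>
```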

Start and end of line

All of the expressions that we've used so far will match at any point in a line. We might at some point want to only select a letter at the start of the line, or perhaps a number at the end of the line. We can do this by using two special symbols:

The caret ^ (outside of square brackets) matches the start of the line
The dollar sign $ matches the end of the line

So to match a line that doesn't end in a semicolon we could use some handy regex:

/[^;]$/
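Here is a minimal sketch of both anchors, applied to one line at a time. The sample lines and the extra patterns /^[A-Z]/ and /[0-9]$/ are assumptions added for illustration.

```javascript
const sampleLines = ['Total: 42', 'var a = 1;', 'var b = 2'];

// ^ pins the match to the start of the string, $ to the end.
const startsWithCapital = sampleLines.filter((line) => /^[A-Z]/.test(line));
const endsWithDigit = sampleLines.filter((line) => /[0-9]$/.test(line));
const missingSemicolon = sampleLines.filter((line) => /[^;]$/.test(line));

console.log(startsWithCapital); // [ 'Total: 42' ]
console.log(endsWithDigit);     // [ 'Total: 42', 'var b = 2' ]
console.log(missingSemicolon);  // [ 'Total: 42', 'var b = 2' ]
```

One thing to keep in mind: /[^;]$/ needs at least one character to match, so an empty line is not reported as missing a semicolon.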

Match Modifiers

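The trailing slash that closes the expression can be followed by different letters to alter the behaviour of the regular expression. The most common ones are:

i makes the match case-insensitive
g (global) finds every match rather than stopping at the first
m (multiline) makes ^ and $ match at the start and end of each line, not just of the whole string

I personally rarely use anything other than the i modifier, as it saves having to worry about matching capital letters too.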
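As a rough sketch of what the three most common JavaScript modifiers (i, g and m) change, each is compared below with the same pattern without it. The sample strings are made up for illustration.

```javascript
// i: ignore case.
const withoutI = /cat/.test('The CAT sat');   // false
const withI = /cat/i.test('The CAT sat');     // true

// g: replace every match instead of only the first.
const firstOnly = 'a cat, a cat'.replace(/cat/, 'dog');    // 'a dog, a cat'
const everyOne = 'a cat, a cat'.replace(/cat/g, 'dog');    // 'a dog, a dog'

// m: let ^ and $ match at line breaks inside the string.
const withoutM = /^b/.test('a\nb');   // false
const withM = /^b/m.test('a\nb');     // true

console.log(withoutI, withI, firstOnly, everyOne, withoutM, withM);
```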

Predefined Groups

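There are a few predefined groups for any number, whitespace or letter, and they also have negatives:

\w matches any word character (a letter, a digit or an underscore)
\s matches any whitespace character
\d matches any digit

The negation of each of these is as simple as using the capital letter: \W, \S and \D respectively.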
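A few one-liners showing the shorthand classes \w, \s and \d along with a negated one, \D. The inputs are illustrative, and the + and g used here are the "one or more" and global pieces described elsewhere in this tutorial.

```javascript
// \d: digits.
const firstNumber = 'Order 66 shipped'.match(/\d+/)[0];   // '66'

// \s: whitespace (spaces, tabs, newlines).
const splitWords = 'a b\tc'.split(/\s/);                  // ['a', 'b', 'c']

// \w: letters, digits and underscore, so the hyphen ends the match.
const firstWord = 'user_name-42'.match(/\w+/)[0];         // 'user_name'

// \D: anything that is NOT a digit.
const digitsOnly = '2024-01-15'.replace(/\D/g, '');       // '20240115'

console.log(firstNumber, splitWords, firstWord, digitsOnly);
```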

Escaping Characters

Finally, you'll realise that we've used a lot of characters as identifiers, modifiers and match helpers. But what if we want to match one of those symbols (such as $) literally in our regular expression? Simple: just precede it with a backslash, much like you would when you escape quotes inside quotes. Let's try matching a PHP variable whose name is just lowercase letters:

/\$([a-z]+)/
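The same pattern works unchanged in JavaScript. This sketch runs it over a made-up line of PHP and, assuming the g modifier and String.prototype.matchAll, collects every variable name.

```javascript
const phpLine = '$total = $price + 5;';

// Without g: only the first variable is found. Index 1 holds the captured name.
const firstVariable = phpLine.match(/\$([a-z]+)/);

// With g and matchAll: walk every match and keep just the captured names.
const variableNames = Array.from(phpLine.matchAll(/\$([a-z]+)/g), (found) => found[1]);

console.log(firstVariable[0]); // $total
console.log(firstVariable[1]); // total
console.log(variableNames);    // [ 'total', 'price' ]
```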

Further Thoughts and Ideas

A lot of what I've shown you has matchers grouped in brackets. In most cases when you're matching, you're looking to use the matched string somewhere else, and this is what the brackets do: they are classed as a capturing group. If you don't want a group to be captured, you can put ?: straight after the opening bracket. The group will still be matched as part of the regex, but what it matched won't be captured.
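To make the difference concrete, this sketch compares what match() returns for a capturing and a non-capturing version of the earlier cat pattern. The variable names are illustrative.

```javascript
const phrase = 'The cat sat on the mat.';

// Capturing group: the result holds the full match plus the captured letter.
const capturing = phrase.match(/(c|s|m)at/);

// Non-capturing group: same match, but nothing extra is stored.
const nonCapturing = phrase.match(/(?:c|s|m)at/);

console.log(capturing[0], capturing[1], capturing.length); // cat c 2
console.log(nonCapturing[0], nonCapturing.length);         // cat 1
```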

If you're not thoroughly confused yet, you will be when you look into other items such as lookaheads, lookbehinds and boundaries - but those are beyond the scope of this basic tutorial for now.

Some final examples

Let's change a string (alphanumeric characters and underscores) that is followed by .php into /testing/{string}/index.php, where {string} is what we matched. We'll first write the regular expression. There are multiple options, so take a look below:

/(\w+)\.php/
/([a-z0-9_]+)\.php/i
/((?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|_|0|1|2|3|4|5|6|7|8|9)+)\.php/i

The last one is mostly for example, and should really not be used - at all!

Now to replace what we've matched with the output. Firstly, JavaScript:

var s = 'login.php';
s = s.replace(/(\w+)\.php/, '/testing/$1/index.php');

Finally PHP:

$s = 'login.php';
$s = preg_replace('/(\w+)\.php/', '/testing/$1/index.php', $s);

The Challenge

There are a lot of examples on the internet of various matching regular expressions, but why don't you try to come up with a regular expression that matches the words bacon, dog and baton, but not baconz, in the following sentence? When you've got a solution, why don't you tweet us with it?

There was a 'baton' wielding, bacon eating man. Dogs were chasing him mostly because of his awesome tasting baconz.