
A Regular Expression Primer

by Daz

Daz has mentioned that something on the site has altered some of the examples on this page. It may no longer be necessary to put a \ character before an opening or closing parenthesis, i.e. before either ( or ).

A regular expression is a symbolic representation of the most general search query one can make to the word finder. Any letter just means itself. A . (a period) means any single letter (or other symbol). A * after any expression means 0 or more consecutive occurrences of that expression (or more accurately, of anything that MATCHES that expression).

Usually, regular expressions are case-SENSITIVE, so if you ask for a j you won't match J. (Apparently our NPL word finder has been built to override that case sensitivity if desired, but a bug in this has just been pointed out.)
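To see case sensitivity in action, here is a minimal sketch using Python's re module. Python is an assumption of this example and is not necessarily what the word finder runs on; the sample word is made up.

```python
import re

word = "Jazz"  # hypothetical sample word

# By default the pattern j only matches a lowercase j.
sensitive = re.search(r"j", word) is not None

# The re.IGNORECASE flag asks Python to treat j and J alike.
insensitive = re.search(r"j", word, re.IGNORECASE) is not None

print(sensitive, insensitive)  # False True
```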

Regular expressions can be built up of simpler regular expressions; some examples follow.

To refer to arbitrary consecutive stuff, use .* (here the period matches any single character, and the * means 0 or more occurrences thereof). (Note the occurrences need not be of the same character!)
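The difference between a lone . and .* can be sketched in Python (an assumption of this example; the word list is hypothetical). re.fullmatch is used here so that the pattern has to account for the whole word.

```python
import re

# Hypothetical sample words, used only for illustration.
words = ["cat", "cot", "coat", "ct", "cart"]

# c.t : a c, then exactly one character of any kind, then a t.
one_between = [w for w in words if re.fullmatch(r"c.t", w)]

# c.*t : a c, then zero or more characters (not necessarily
# the same character each time), then a t.
any_between = [w for w in words if re.fullmatch(r"c.*t", w)]

print(one_between)  # ['cat', 'cot']
print(any_between)  # ['cat', 'cot', 'coat', 'ct', 'cart']
```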

The expression [xyz…w] (where x,y,z,…,w are any letters) means any one of these letters.
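A bracketed set stands for exactly one letter from the set. A small Python sketch (Python and the word list are assumptions made for illustration):

```python
import re

# Hypothetical sample words.
words = ["bat", "bet", "bit", "boot", "bt", "but"]

# b[aeiou]t : a b, then exactly one of a, e, i, o, u, then a t.
# re.fullmatch requires the pattern to cover the whole word.
matches = [w for w in words if re.fullmatch(r"b[aeiou]t", w)]

print(matches)  # ['bat', 'bet', 'bit', 'but']
```

Note that boot is rejected because the set matches a single letter, and bt is rejected because the set must match something.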

^ at the left of the whole expression means the beginning of a word; $ at the right means the end of the word. So to find all words composed of just the five vowels, use the regular expression ^[aeiou]*$
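The vowels-only query can be tried out with a short Python sketch (Python's re module and the sample words are assumptions of this example, not part of the word finder):

```python
import re

# Hypothetical sample words.
words = ["aa", "eau", "oui", "you", "queue", "io"]

# ^ pins the match to the start of the word and $ to the end,
# so every letter in between must come from [aeiou].
vowel_only = [w for w in words if re.search(r"^[aeiou]*$", w)]

print(vowel_only)  # ['aa', 'eau', 'oui', 'io']
```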

But a ^ at the left INSIDE square brackets – like [^xyz…w] – means any symbol EXCEPT xyz…w. So [^aeiou] means any character except a,e,i,o,u.

To find all words that have three of these vowels in a row somewhere in the word, use:

[aeiou][aeiou][aeiou]

(Note we didn't use ^ or $ here.)
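Without ^ or $, the pattern may match anywhere inside the word. A minimal Python sketch (Python and the word list are assumptions for illustration):

```python
import re

# Hypothetical sample words.
words = ["beautiful", "queue", "banana", "street", "onomatopoeia"]

# No anchors: three vowels in a row may appear anywhere in the word.
three_vowels = [w for w in words if re.search(r"[aeiou][aeiou][aeiou]", w)]

print(three_vowels)  # ['beautiful', 'queue', 'onomatopoeia']
```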

To find all words that use NONE OF a,e,i,o,u, use:

^[^aeiou]*$
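A words-without-vowels search combines the anchors with a negated set, [^aeiou]. A short Python sketch (Python and the sample words are assumptions of this example):

```python
import re

# Hypothetical sample words.
words = ["rhythm", "fly", "crwth", "tree", "sky", "a"]

# The first ^ anchors the start of the word; the ^ inside the
# brackets negates the set, so every letter must be a non-vowel.
no_vowels = [w for w in words if re.search(r"^[^aeiou]*$", w)]

print(no_vowels)  # ['rhythm', 'fly', 'crwth', 'sky']
```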

To search for “regex1 OR regex2” just write \(regex1\)|\(regex2\) (note that the pair \( serves as a left bracket, \) as a right bracket, and the vertical line | as the symbol for OR).
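Here is a rough Python sketch of an OR query. It assumes Python's re module, where the brackets are written as plain ( and ) with no backslash; the word list is hypothetical.

```python
import re

# Hypothetical sample words.
words = ["undo", "singing", "under", "ring", "banana", "sun"]

# (^un)|(ing$) : words that start with un OR end with ing.
either = [w for w in words if re.search(r"(^un)|(ing$)", w)]

print(either)  # ['undo', 'singing', 'under', 'ring']
```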

There are a whole bunch of other things one can ask about. For example:

To ask for all words that consist of a repeated string, the symbolic query would be this: ^\(.*\)\1$

To ask for all words that contain a consecutively repeated string, you'd use \(..*\)\1 (the extra . keeps the repeated string from being empty; with \(.*\)\1 every word would match, since .* can match nothing at all).

To ask for all words containing a three-letter string that occurs twice (possibly with intervening letters) you'd use \(...\).*\1

To ask for all words with a 2-letter string occurring three times anywhere in the word, you'd use \(..\).*\1.*\1

To ask for two two-letter strings that occur alternately in the word as _A_B_A_B_, anywhere in the word, you'd use \(..\).*\(..\).*\1.*\2

Explanation: Using \( as a left bracket and \) as a right one (note the backslashes), the subexpression \(something\) captures whatever text it matches, and the part of the regular expression to its right can refer back to that same text as \n, where n is the number of that subexpression, counting left brackets from the left. (All but the last example involve only one subexpression, so only \1 is used in those.) The notation is very dense and unforgiving, but it's really not complicated.
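The repeated-string queries above can be sketched in Python. This assumes Python's re module, where the brackets are plain ( and ) rather than backslashed ones; the word list and the labels are hypothetical. The sketch uses .+ (one or more) instead of .* inside the brackets so that the captured string is never empty.

```python
import re

# Hypothetical sample words.
words = ["murmur", "banana", "hotshots", "tintinnabulation", "regular"]

# Python spellings of the queries, keyed by made-up labels.
patterns = {
    "whole word repeated": r"^(.+)\1$",
    "consecutive repeat": r"(.+)\1",
    "three letters twice": r"(...).*\1",
    "two letters thrice": r"(..).*\1.*\1",
    "A B A B": r"(..).*(..).*\1.*\2",
}

# \1 and \2 must match the same text that the first and second
# bracketed groups captured, not merely the same pattern.
results = {
    label: [w for w in words if re.search(pattern, w)]
    for label, pattern in patterns.items()
}

for label, found in results.items():
    print(label, found)
```

With this word list, hotshots is the only A B A B match (ho, ts, ho, ts), and tintinnabulation is the only word with a two-letter string (ti) occurring three times.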

You may be interested in Lucifers's notes on Cryptograms.