Monday, 18 August 2014

The one true regular expression syntax

This post has been left intentionally blank.

One of the most common questions I get about regular expressions is along the lines of "I tried this pattern and it doesn't do what I expect it to do! Why not?"

There are typically two issues. One is that the questioner has misunderstood the syntax - usually with something like /a|b$/ which (usually) means something completely different to /(a|b)$/.
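To make the difference concrete, here is a small sketch using grep -E (the extended syntax, where | and parentheses behave as they do in class). The three sample words are made up for illustration.

```shell
# Hypothetical sample words, one per line.
words=$(printf 'abc\ncab\nxyz\n')

# a|b$ : an "a" anywhere in the line, OR a "b" at the end of the line.
loose=$(printf '%s\n' "$words" | grep -E 'a|b$')

# (a|b)$ : an "a" or a "b", but only at the end of the line.
grouped=$(printf '%s\n' "$words" | grep -E '(a|b)$')

printf 'a|b$ matches:\n%s\n' "$loose"
printf '(a|b)$ matches:\n%s\n' "$grouped"
```

The first pattern picks up "abc" as well as "cab", because the $ only binds to the "b" alternative; the second picks up "cab" alone.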

The other issue is that the environment where you're using regular expressions isn't the same as the environment that we discuss in class. (I use Perl, in case you're wondering.)
Again, there are two different problems:
  • Character classes (\s, \w, etc.) and particularly back references (\1) are defined in a different way or not defined at all. This can be mitigated somewhat by reading the documentation for the regular expression package you're using.
  • If you're using a command line interpreter, for example to run grep, then you also have to worry about how the various characters are interpreted by the command line before the pattern reaches the program.
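One way to check what the command line does to a pattern is to print the argument instead of handing it to grep. A minimal sketch, assuming bash (other POSIX shells should behave the same way here); the variable names are only for illustration.

```shell
# printf '%s\n' prints its argument exactly as the shell hands it over,
# so it shows what a program such as grep would actually receive.

# Single quotes: the shell touches nothing.
single=$(printf '%s\n' '\s(\w+)\s\1')

# Double quotes: a backslash before s, w or 1 is left alone.
double=$(printf '%s\n' "\s(\w+)\s\1")

# Double quotes: each \\ collapses to a single \.
doubled=$(printf '%s\n' "\\s(\\w+)\\s\\1")

printf 'single:  %s\n' "$single"
printf 'double:  %s\n' "$double"
printf 'doubled: %s\n' "$doubled"
```

All three lines print the same pattern, so for this particular expression the choice of quotes makes no difference to what grep receives.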
I don't intend to give a primer on *nix command line interpreters, but I would like to give a quick example of what the problem looks like, and how I might go about working out what's happening.
Say that we have a file:
abc
\tdef ghi
 jkl jkl mno
\tno no
pqr stu
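To follow along, the file can be recreated from the command line. This is only a sketch: it assumes that "\t" in the listing above stands for a real tab character, and it uses the name "file" because that is what the grep commands refer to.

```shell
# Assumption: "\t" in the listing is a literal tab; printf expands it here.
printf 'abc\n\tdef ghi\n jkl jkl mno\n\tno no\npqr stu\n' > file

# Quick sanity check: the file should have five lines.
wc -l < file
```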
And we'd like to try out the exercise we had in the lectures: /\s(\w+)\s\1/
What happens if I just run grep from the command line?
> grep "\s(\w+)\s\1" file
grep: invalid back reference
What does this mean? grep thinks that I'm using the back reference \1 without instantiating register 1 (with those parentheses). Probably the command line is doing something funny with the parentheses... I vaguely recall that shell variables are interpreted within double quotes (") but not within single quotes ('), so let's try that:
> grep '\s(\w+)\s\1' file
grep: invalid back reference
No dice. What about back ticks (`)? I seem to remember that they tell the interpreter to run the enclosed text as a command of its own and substitute its output, which isn't what I want... But I might be wrong - let's try:
> grep `\s(\w+)\s\1` file
bash: command line substitution: line 1: syntax error near unexpected token `\w+'
bash: command line substitution: line 1: `\s(\w+)\s\1` file
Okay, that looks much worse. What else could be happening? Well, the interpreter might be doing something fancy with ( and ), when I just want it to pass them as regular characters to grep. Let's try escaping them:
> grep "\s\(\w+\)\s\1" file
No error message, but no output either. The error has gone, so grep now sees a group: in grep's default ("basic") syntax \( and \) are the grouping parentheses and a bare "(" is just a literal character, which is the opposite of Perl. The same goes for "+": unescaped, grep reads it as a literal "+", so the pattern is asking for a word character followed by an actual "+". (You can confirm this by adding a line with " a+ a+" to the input file.) Doubling the backslashes doesn't help, because inside double quotes the shell turns each "\\" back into "\" before grep sees it. GNU grep takes \+ to mean "one or more" in this syntax, so let's escape the + as well:
> grep "\s\(\w\+\)\s\1" file
 jkl jkl mno
    no no
Yeah! It worked! This escaping of characters in command line expressions leads to a phenomenon famously called "picket fences", particularly in file directory paths, e.g. "\/home\/subjects\/comp\[0-9\]+\/" etc.
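The fence can be thinned out a little by choosing the syntax to suit the pattern. The sketch below assumes GNU grep: \s, \w, \+ and back references under -E are GNU extensions rather than POSIX, so other greps may behave differently. It also assumes that "\t" in the sample file is a real tab, and it writes to a temporary file so nothing is overwritten.

```shell
# Recreate the sample file (assumption: "\t" in the listing is a real tab).
sample=$(mktemp)
printf 'abc\n\tdef ghi\n jkl jkl mno\n\tno no\npqr stu\n' > "$sample"

# Basic syntax (grep's default): grouping is \( \), "one or more" is \+.
basic=$(grep '\s\(\w\+\)\s\1' "$sample")

# Extended syntax (grep -E): ( ) and + work unescaped, as in class.
extended=$(grep -E '\s(\w+)\s\1' "$sample")

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
rm -f "$sample"
```

Both commands should select the same two lines, the ones containing "jkl jkl" and "no no"; the extended form simply needs fewer backslashes.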

1 comment:

  1. Very interesting blog post that shows how to work out a solution starting from the limitations of the tool in use (grep). To make things easier I would like to point out that the extended version of grep (egrep) deals better with the regular expression "\s(\w+)\s\1"
