Saturday, May 14, 2011

Regular Expressions Excluding Strings

I recently ran into a situation where it would have been very handy to write a regular expression that both looks for certain content and excludes other content. I admit this is probably not the most efficient way to go about things, but for small, quick use cases I don't see why it shouldn't be used. See below for some explanations!

Negative lookahead:


Let's say you have a set of strings: 'foobar', 'barbar', 'barfoo'. Now let's further suppose that, for some reason that is perfectly valid to you, you want only the strings in that set which contain a 'bar' that is not followed by 'foo'. (To be clear about the distinction: it's OK to have 'foo' before 'bar', just not immediately after.)

If your regular expression engine supports it (and most do, at least Perl and Python), you can write something like this:

  • bar(?!foo)
  • Python: re.search('bar(?!foo)',string)
  • Perl: $string =~ /bar(?!foo)/
  • 'foobar' and 'barbar' would match the above regular expression, 'barfoo' would not -- perfect!
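Now, as I said, this is for looking ahead. You cannot write something like (?!foo)bar; it will not do what you want, because the lookahead then inspects the text starting at 'bar' itself, which is never 'foo', so every 'bar' matches. What you are really attempting there is a lookbehind. Conveniently, see below for how to do a negative lookbehind.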

Below is a Python snippet to really flesh things out:
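
A minimal sketch of the lookahead filter, assuming the three example strings from above. The variable names are only illustrative. It also runs the misplaced (?!foo)bar form to show that it filters nothing out.

```python
import re

# The three example strings from the post.
strings = ['foobar', 'barbar', 'barfoo']

# Negative lookahead: match 'bar' only when it is not
# immediately followed by 'foo'.
not_followed = re.compile(r'bar(?!foo)')
lookahead_matches = [s for s in strings if not_followed.search(s)]
print(lookahead_matches)  # ['foobar', 'barbar']

# Misplaced lookahead: the assertion is checked at the start of 'bar',
# where the upcoming text is 'bar' and never 'foo', so it always passes.
misplaced = re.compile(r'(?!foo)bar')
misplaced_matches = [s for s in strings if misplaced.search(s)]
print(misplaced_matches)  # ['foobar', 'barbar', 'barfoo']
```

re.search is used rather than re.match because 'bar' may appear anywhere in the string, not just at the beginning.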


Negative lookbehind:

We can use the same list as above to demonstrate a lookbehind, but this time let's assume we only want strings which contain 'bar', but only where 'bar' is not preceded by 'foo'.

We can write a regular expression for negative lookbehinds like this:
  • (?<!foo)bar
  • Python: re.search('(?<!foo)bar',string)
  • Perl: $string =~ /(?<!foo)bar/
  • 'barfoo' and 'barbar' would match the above regular expression, 'foobar' would not. Again, exactly what we set out to do!
Another Python snippet below:
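
A minimal sketch of the lookbehind filter, again assuming the same three example strings. The variable names are only illustrative.

```python
import re

# The three example strings from the post.
strings = ['foobar', 'barbar', 'barfoo']

# Negative lookbehind: match 'bar' only when it is not
# immediately preceded by 'foo'.
not_preceded = re.compile(r'(?<!foo)bar')
lookbehind_matches = [s for s in strings if not_preceded.search(s)]
print(lookbehind_matches)  # ['barbar', 'barfoo']
```

One thing to keep in mind: Python's re module only accepts fixed-width patterns inside a lookbehind, so a literal like 'foo' is fine but something like 'fo+' is rejected when the pattern is compiled.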

Note:

The only thing that makes these lookaheads and lookbehinds negative is the exclamation point. You can easily turn the requirement around by replacing it with an equals sign, so bar(?=foo) would suddenly make 'barfoo' the only matching string in our set, and (?<=foo)bar would likewise match only 'foobar'. Pretty intuitive!
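
A minimal sketch of the positive forms, which Python's re module spells (?=...) for lookahead and (?<=...) for lookbehind. It assumes the same three example strings, and the variable names are only illustrative.

```python
import re

# The three example strings from the post.
strings = ['foobar', 'barbar', 'barfoo']

# Positive lookahead: 'bar' must be immediately followed by 'foo'.
followed = re.compile(r'bar(?=foo)')
positive_lookahead = [s for s in strings if followed.search(s)]
print(positive_lookahead)  # ['barfoo']

# Positive lookbehind: 'bar' must be immediately preceded by 'foo'.
preceded = re.compile(r'(?<=foo)bar')
positive_lookbehind = [s for s in strings if preceded.search(s)]
print(positive_lookbehind)  # ['foobar']
```

In both cases the match itself is still just 'bar'; the lookaround only tests the neighbouring text without consuming it.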
