How (not) to write regular expressions

Published February 15, 2015

A few days ago there was a regular expression building library featured on Hacker News: https://news.ycombinator.com/item?id=9033146

Its premise is that regular expression syntax is more confusing than an object-oriented, method-chaining approach.

The comments are overwhelmingly positive, and the library has attracted a lot of attention on GitHub. I find this strange, because using the library appears more complex than learning regular expression syntax to a fluent level and writing expressions directly. While something like LINQ for regular expressions would be very interesting, this is not it. It falls into the trap of making trivial things easy and hard things harder, which isn't very useful.

Let's try rewriting a fairly simple PCRE regex, one that matches a double-quoted string with a backslash escaping scheme, in an object-oriented construction syntax.

/ " 
  ( [^"\\]+ | \\. )* 
  ( " | $ ) 
/ xs

It's not particularly easy to read, but it's also not particularly hard. In theory it should be a good candidate for simplification by an alternative construction method. To convert this into a chained construction, after a quick glance at the API docs, I'd expect to write something like this:

r.find('"').then(
  r.maybe(
    r.anythingBut(r.find('\\').or('"')).or('\\.')
  ).then(r.find('"').or(r.endOfInput()))
)

This is hardly an improvement. It doesn't read anything like normal English, because English doesn't really handle nesting. We have swapped a small amount of PCRE syntax clutter for a larger amount of English and OO syntax clutter, which makes it extraordinarily hard to scan the expression and get a feel for what it actually matches. This would only get worse on genuinely complex expressions, which this one is not. It's not clear whether the capturing groups are preserved, so we might have to add more clutter for those, and we'd only end up deeper in the mess once we had to consider details like lazy versus greedy quantifiers and explicitly preventing backtracking.
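For comparison, the native syntax can be annotated in place. The following is a minimal sketch that assumes Python's re module rather than PCRE itself (the two agree on everything this pattern uses); the names QUOTED and find_quoted are made up for the illustration. The x flag from the original corresponds to re.VERBOSE, and the s flag to re.DOTALL.

```python
import re

# The pattern from the post, transcribed for Python's re module.
# re.VERBOSE ignores whitespace and allows comments inside the pattern;
# re.DOTALL lets "." match a newline, so an escaped newline is accepted.
QUOTED = re.compile(
    r'''
    "                      # opening quote
    ( [^"\\]+ | \\. )*     # runs of ordinary characters, or a backslash escape
    ( " | $ )              # closing quote, or end of input if unterminated
    ''',
    re.VERBOSE | re.DOTALL,
)


def find_quoted(text):
    """Return the first double-quoted string found in text, or None."""
    match = QUOTED.search(text)
    return match.group(0) if match else None


print(find_quoted(r'say "hello \"world\"" now'))  # "hello \"world\""
print(find_quoted('an "unterminated string'))     # "unterminated string
```

The inline comments carry roughly the same information as the chained method names, without changing the shape of the expression.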

In summary: regular expressions are useful; if you need them, learn them.

Filed under: programming, regular expressions
