Thursday, April 05, 2007

Faking Conditionals in Regular Expressions

Update: This post contains several major errors. Please view the updated version on my new blog:

Mimic Regular Expression Conditionals.

Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted regex features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the libraries I'm familiar with, conditionals are only supported by .NET, PCRE, PHP (when using PCRE via the preg functions), and JGSoft products (including RegexBuddy).

There are two kinds of regex conditionals in those libraries... lookaround-based and capturing-group-based. The functionality of lookaround-based conditionals is very easy to replicate. First, here's what such conditionals look like (this example uses a positive lookahead for the assertion):

(?(?=if_assertion)then|else)

To mimic that behavior in languages which don't support conditionals, just add a colon after the initial question mark to turn it into a non-capturing group, like so:

(?:(?=if_assertion)then|else)

As long as the regex engine you're using supports the specified lookaround type, those patterns do the same thing.

However, mimicking capturing-group-based conditionals proved to be more tricky. Conditionals which use an optional capturing group as their test allow you to base logic on whether a capturing group has participated in the match so far. Thus...

(a)?b(?(1)c|d)

...matches only "bd" and "abc". That pattern can be expressed as follows:

(if_matched)?inner_pattern(?(1)then|else)

Here's a comparable pattern I created which doesn't require support for conditionals:

(?=(a)()|())\1?b(?:\2c|\3d)

To use it without an "else" part, you still need to include "\3" at the end, like this:

(?=(a)()|())\1?b(?:\2c|\3)

As a brief explanation of how that works, there's an empty alternation option within the lookahead at the beginning which is used to cancel the effect of the lookahead, while at the same time, the intentionally empty capturing groups within the alternation are exploited to base the then/else part on which option in the lookahead matched. However, there are a couple issues:

  • This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.
  • It interacts with backtracking differently than a real conditonal (the "a" part is treated as if it were within an optional, atomic group... e.g., (?>(a)?) instead of (a)?), so it's best to think of this as a new operator which is similar to a conditional.

Here are the regex engines I've briefly tested this pattern with:

Language Supports "fake conditionals" Supports real conditionals Notes
.NET Yes Yes Tested using Expresso.
ColdFusion Yes No Tested using ColdFusion 7.
Java Yes No Tested using Regular Expression Test Applet.
JavaScript No No JavaScript assigns an empty string, rather than null, to backreferences for non-participating capturing groups. Unfortunately, this pattern depends on the way most other libraries handle non-participating capturing groups.
JGSoft Yes
(buggy)
Yes As of RegexBuddy version 2.3.2, it performs correctly in more cases if you change the two empty capturing groups ("()") to match a zero-length value of no impact, such as "(.{0})" or "(\b|\B)". It also has problems matching values at the end of a string when using an empty else. I've reported both issues to JGSoft, and have been told they will be fixed in the next version of RegexBuddy.
PHP Yes
(buggy)
Yes Tested using PHP Regex Tester. Performs correctly in more cases if you explicitly state the condition twice, like so: "(?=(?:a)()|())(?:a)?b(?:\1c| \2d)".

If you discover ways to improve this, or find problems not already mentioned, please let me know.

3 comments:

Anonymous said...

I had no idea that was possible! As no one had left a comment and because it was such a great post I thought I should say something. If you ever get a chance to expand this with more examples that would be cool. Thanks.

Steve said...

Thanks, Gary!

I'll try to add better examples for this eventually, but for now note that there are several errors in this post. You can see the updated, corrected version on my new blog at StevenLevithan.com: Mimic Regex Conditionals

Anonymous said...

thanks

http://www.libyanyouths.com/