Thursday, April 05, 2007

Faking Conditionals in Regular Expressions

Update: This post contains several major errors. Please view the updated version on my new blog:

Mimic Regular Expression Conditionals.

Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted regex features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the libraries I'm familiar with, conditionals are only supported by .NET, PCRE, PHP (when using PCRE via the preg functions), and JGSoft products (including RegexBuddy).

There are two kinds of regex conditionals in those libraries... lookaround-based and capturing-group-based. The functionality of lookaround-based conditionals is very easy to replicate. First, here's what such conditionals look like (this example uses a positive lookahead for the assertion):

(?(?=if_assertion)then|else)

To mimic that behavior in languages which don't support conditionals, just add a colon after the initial question mark to turn it into a non-capturing group, like so:

(?:(?=if_assertion)then|else)

As long as the regex engine you're using supports the specified lookaround type, those patterns do the same thing.

However, mimicking capturing-group-based conditionals proved to be more tricky. Conditionals which use an optional capturing group as their test allow you to base logic on whether a capturing group has participated in the match so far. Thus...

(a)?b(?(1)c|d)

...matches only "bd" and "abc". That pattern can be expressed as follows:

(if_matched)?inner_pattern(?(1)then|else)

Here's a comparable pattern I created which doesn't require support for conditionals:

(?=(a)()|())\1?b(?:\2c|\3d)

To use it without an "else" part, you still need to include "\3" at the end, like this:

(?=(a)()|())\1?b(?:\2c|\3)

As a brief explanation of how that works, there's an empty alternation option within the lookahead at the beginning which is used to cancel the effect of the lookahead, while at the same time, the intentionally empty capturing groups within the alternation are exploited to base the then/else part on which option in the lookahead matched. However, there are a couple issues:

  • This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.
  • It interacts with backtracking differently than a real conditonal (the "a" part is treated as if it were within an optional, atomic group... e.g., (?>(a)?) instead of (a)?), so it's best to think of this as a new operator which is similar to a conditional.

Here are the regex engines I've briefly tested this pattern with:

Language Supports "fake conditionals" Supports real conditionals Notes
.NET Yes Yes Tested using Expresso.
ColdFusion Yes No Tested using ColdFusion 7.
Java Yes No Tested using Regular Expression Test Applet.
JavaScript No No JavaScript assigns an empty string, rather than null, to backreferences for non-participating capturing groups. Unfortunately, this pattern depends on the way most other libraries handle non-participating capturing groups.
JGSoft Yes
(buggy)
Yes As of RegexBuddy version 2.3.2, it performs correctly in more cases if you change the two empty capturing groups ("()") to match a zero-length value of no impact, such as "(.{0})" or "(\b|\B)". It also has problems matching values at the end of a string when using an empty else. I've reported both issues to JGSoft, and have been told they will be fixed in the next version of RegexBuddy.
PHP Yes
(buggy)
Yes Tested using PHP Regex Tester. Performs correctly in more cases if you explicitly state the condition twice, like so: "(?=(?:a)()|())(?:a)?b(?:\1c| \2d)".

If you discover ways to improve this, or find problems not already mentioned, please let me know.

7 comments:

Gary Fenton said...

I had no idea that was possible! As no one had left a comment and because it was such a great post I thought I should say something. If you ever get a chance to expand this with more examples that would be cool. Thanks.

Steve said...

Thanks, Gary!

I'll try to add better examples for this eventually, but for now note that there are several errors in this post. You can see the updated, corrected version on my new blog at StevenLevithan.com: Mimic Regex Conditionals

Anonymous said...

thanks

http://www.libyanyouths.com/

دردشة said...

You wellness
Personal websites:
شات - دردشه - دردشة - شات صوتي-
دردشه صوتيه - دردشة صوتية -
شات صوتي
- منتديات -
منتدي
- منتدى - العاب -
دليل

Anonymous said...

杭州装修公司
杭州店面装修
杭州办公室装修
杭州装饰公司
杭州装饰公司

蜂王浆
芦荟
蜂胶
蜂王浆
芦荟
蜂胶

ball valve球阀
gate valve闸阀
angle valve角阀
bibcock水嘴
tap
Check valve
hot-water heating
fittings
苏州led
上海led
北京led
苏州电磁铁
苏州装修公司
苏州装饰公司
ats
ATS生产
ats
ATS开关

Anonymous said...

Making gw gold is the old question : Honestly there is no fast way to make lots of GuildWars Gold . Sadly enough a lot of the people that all of a sudden come to with millions of Guild Wars Gold almost overnight probably duped . Although there are a lot of ways to make lots of GuildWars moneyhere I will tell you all of the ways that I know and what I do to make cheap gw gold.

As a new player , you may need some game guides or information to enhance yourself.
habbo credits is one of the hardest theme for every class at the beginning . You must have a good way to manage yourhabbo gold.If yor are a lucky guy ,you can earn so many habbo coins by yourself . But if you are a not , I just find a nice way to get buy habbo gold. If you need , you can buycheap habbo credits at our website . Go to the related page and check the detailed information . Once you have any question , you can connect our customer service at any time .

Anonymous said...

網頁設計,情趣用品店,情趣用品專賣網

A片下載,成人影片下載
威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣,威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣
情惑用品性易購


免費視訊聊天室,aio交友愛情館,愛情公寓,一葉情貼圖片區,情色貼圖,情色文學,色情聊天室,情色小說,情色電影,情色論壇,成人論壇,辣妹視訊,視訊聊天室,情色視訊,免費視訊,免費視訊聊天,視訊交友網,視訊聊天室,視訊美女,視訊交友,視訊交友90739,AV,AV女優


A片,色情A片,免費A片,成人影片,色情影片,a片免費看,情色貼圖,情色文學,情色小說,色情小說


影音視訊聊天室