Tuesday, April 03, 2007

Faking Atomic Groups in Regular Expressions

So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))"), even though, of course, lookarounds don't actually match anything. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference 0)!

Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.

Here's how it's done:

(?=(pattern to make atomic))\1

Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually match anything, so the rest of the regex can't backtrack into it), followed by "\1" (the backreference you just captured), to act just like an atomic group. That appears to produce the exactly same result as "(?>pattern to make atomic)", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support positive lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking cool.

2 comments:

Unknown said...

All right, man. You've officially eclipsed my regex capacity. I considered myself pretty proficient until today. I've read this post 3 or 4 times and...my mind is officially blown.

That aside, you've introduced some pretty innovative stuff in this space. Keep it coming. I always learn something - even if I don't understand what it is.

Anonymous said...

thanks