Sunday, March 26, 2006

Regex Recursion Without Balancing Groups (Matching Nested Constructs)

Update: Please view the updated version of this post on my new blog:

Regex Recursion (Matching Nested Constructs).

Some dude posed the following problem on a regex advice forum I visit every once in a while…

He was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

@resourceType{
    field1 = value,
    field2 = "value in quotation marks",
    field3 = "value in quotation marks, with {brackets} in the value",
    field4 = {brackets}
}

The resident regex experts were quick to point out that regexes are only capable of recursion through the use of “Balancing Groups”, a feature supported exclusively by .NET (see the chapter on .NETPDF icon in Mastering Regular Expressions).

Basically, searches requiring recursion have typically been the domain of parsers, not regexes. The problem in this particular case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e. they form an open–close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there was only one level of recursion, I figured it was possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):

@[^{]+{(?:[^{}]|{[^{}]*})*}

(Note: I've used RegexBuddy's formatting style to color it.)

This, however, works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

Still, regex users might find the logic interesting, and better yet, find some actual use for it.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see www.regular-expressions.info.

And feel free to post your own regex problems in the comments if you think I might be able to help.

Edit: This logic is easy to extend to support more levels of recursion, as long as you know in advance the maximum levels of recursion you need to support. Here's a simple example of matching HTML elements and their contents (yes, I know, element names can't start with a number, and I'm not supporting attributes or singleton elements, but that would make the regexes longer and this is only supposed to be for demonstration purposes):

No recursion:
<([a-z\d]+)>.*?</\1>
Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>
Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>
And so on...

Edit 2 (2007-04-04): I've since learned that, in addition to using .NET's balancing groups, true recursion is possible in Perl 5.6+ using Perl's qr// operator to compile a regex as a variable, used together with the little known (??{ }) operator which instructs Perl that when compiling the regex it should not interpolate the encapsulated code until it's actually used. See the article Regexp Power by Simon Cozens for details.

Edit 3 (2007-04-10): Here's another Perl-specific way of achieving true recursion, this time supported by Perl 5.005+: Perl Regular Expression Mastery by Mark Jason Dominus: Slide 83.

Edit 4 (2007-04-16): Yet another method for recursion, this time supported by Perl 5.6+, PCRE, and Python: the special item (?R). See pcre.org/pcre.txt for details.

5 comments:

Anonymous said...

Hey! Love the background and fonts and all! Didn't know you were so 'pretty' :)! And today I learned what a DOM is (Yeah, I'm behind all this stuff! Just starting to catch up.)

Will keep an eye on your script comments and changes. I'm just getting into PHP and mySQL and server side stuff, basically, throwing myself full-time into learning all these languages and coding.

Take care!
PD

Anonymous said...

You know who I bet reads this blog a lot? Girls.

Steve said...

Regex Recursion Redux:

See the solution I offered for multiple recursion levels here:

http://regexadvice.com/forums/thread/17006.aspx

Anonymous said...

Hi,
I need some help.. I have this function that converts special characters into unicode escapes used for RegExp


http://pastebin.4programmers.net/1654

My problem is that Unicode escapes \u won't work if their inside RegExp word boundaries \b ..

Is there anyway I could remove the initial and ending \b from a string like this: /\bword\u0102\b|\bsecond\b/i so that it will become: /word\u0102|\bsecond\b/i

thanks

Steve said...

Yumo, unfortunately I'm not an expert with Unicode, and I've rarely used Unicode features in JavaScript regexes. Without running your code, I'm not sure what the problem is. "word\u0102|\bsecond\b" matches all but the "z" in "wordĂz" just fine. Could whatever problem you're facing be related to Unicode code points followed by combining marks?