Sunday, March 26, 2006

Regex Recursion Without Balancing Groups (Matching Nested Constructs)

Update: Please view the updated version of this post on my new blog:

Regex Recursion (Matching Nested Constructs).

Some dude posed the following problem on a regex advice forum I visit every once in a while…

He was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

    field1 = value,
    field2 = "value in quotation marks",
    field3 = "value in quotation marks, with {brackets} in the value",
    field4 = {brackets}

The resident regex experts were quick to point out that regexes are only capable of recursion through the use of “Balancing Groups”, a feature supported exclusively by .NET (see the chapter on .NETPDF icon in Mastering Regular Expressions).

Basically, searches requiring recursion have typically been the domain of parsers, not regexes. The problem in this particular case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e. they form an open–close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there was only one level of recursion, I figured it was possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):


(Note: I've used RegexBuddy's formatting style to color it.)

This, however, works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

Still, regex users might find the logic interesting, and better yet, find some actual use for it.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see

And feel free to post your own regex problems in the comments if you think I might be able to help.

Edit: This logic is easy to extend to support more levels of recursion, as long as you know in advance the maximum levels of recursion you need to support. Here's a simple example of matching HTML elements and their contents (yes, I know, element names can't start with a number, and I'm not supporting attributes or singleton elements, but that would make the regexes longer and this is only supposed to be for demonstration purposes):

No recursion:
Up to one level of recursion:
Up to two levels of recursion:
And so on...

Edit 2 (2007-04-04): I've since learned that, in addition to using .NET's balancing groups, true recursion is possible in Perl 5.6+ using Perl's qr// operator to compile a regex as a variable, used together with the little known (??{ }) operator which instructs Perl that when compiling the regex it should not interpolate the encapsulated code until it's actually used. See the article Regexp Power by Simon Cozens for details.

Edit 3 (2007-04-10): Here's another Perl-specific way of achieving true recursion, this time supported by Perl 5.005+: Perl Regular Expression Mastery by Mark Jason Dominus: Slide 83.

Edit 4 (2007-04-16): Yet another method for recursion, this time supported by Perl 5.6+, PCRE, and Python: the special item (?R). See for details.


Porceleindoll said...

Hey! Love the background and fonts and all! Didn't know you were so 'pretty' :)! And today I learned what a DOM is (Yeah, I'm behind all this stuff! Just starting to catch up.)

Will keep an eye on your script comments and changes. I'm just getting into PHP and mySQL and server side stuff, basically, throwing myself full-time into learning all these languages and coding.

Take care!

Bowenator said...

You know who I bet reads this blog a lot? Girls.

Ryan Christie said...

You will heretofore be known as

Steve said...

Regex Recursion Redux:

See the solution I offered for multiple recursion levels here:

Yumo said...

I need some help.. I have this function that converts special characters into unicode escapes used for RegExp

My problem is that Unicode escapes \u won't work if their inside RegExp word boundaries \b ..

Is there anyway I could remove the initial and ending \b from a string like this: /\bword\u0102\b|\bsecond\b/i so that it will become: /word\u0102|\bsecond\b/i


Steve said...

Yumo, unfortunately I'm not an expert with Unicode, and I've rarely used Unicode features in JavaScript regexes. Without running your code, I'm not sure what the problem is. "word\u0102|\bsecond\b" matches all but the "z" in "wordĂz" just fine. Could whatever problem you're facing be related to Unicode code points followed by combining marks?

Anonymous said...

TM产品还都支持网络广州翻译公司,报告昨日公布。比如,译员A刚刚翻译了韩语翻译共享记忆库功能。北京翻译公司也就是入深圳翻译公司说,当多人同时进行翻译时同声传译,可以通过局域网共享一个翻译记忆库"This is a file for demo.",当译员B遇到"This is a demo file."时,系统会给出A的译文"这是个演示用的文件。"翻译公司东莞翻译公司。在线翻译工具。法语翻译。B可以接受,也可以修改,修改后的译文又可供自己或他人重复使用。广州翻译公司,翻译记忆库就在这样的不断补充和完善过程中,发挥着越来越大的作同声传译设备租赁,是会议设备租赁,一项调查显示法语翻译几乎将深圳更多的是通过线翻译同声传译深圳俄语翻译
放大上海翻译公司这将导致人民币兑表决器出租,表决器销售 租赁表决器各种货币 德语翻译,,市场风险偏好升温。商务口译,料就在昨日下午稍晚时间,同传设备已经说明一切。翻译是一门严谨不容践踏的语言文化。同声传译,凡购深圳同声传译翻译部署促进房地产市场健康发展措施出台,深圳翻译.深圳英语翻译 ,无需制作炫丽的界面和复杂的操作功能深圳日语翻译,中国移动后台词库地产的阴霾情绪同声传译设备租赁,是会议设备租赁深圳手机号码,深圳手机靓号,有的用户同传设备出租会议同传系统租赁选择在线翻译会议设备租赁中美利差的一旦金融市场趋于稳定,。同声传译设备租赁存在,。新疆租车,美元汇率明年什么时候开始由强转弱, 广州翻译公司,用户的体验不能停留同声传译一扫而光”

Anonymous said...







Anonymous said...

(法新社a倫敦二B十WE四日電) 「情色二零零七」A片情趣產品大產自成人電影AV女優十三日起在倫敦的肯辛頓奧林匹亞展覽館舉行,倫敦人擺脫對性的保守態度成人網站踴躍參觀,許多成人網站穿皮衣與塑膠緊身衣的好色之徒擠進這項世界規模最大的成人生活展,估計三天展期可吸引八萬多好奇色情影片民眾參觀。





Anonymous said...

Making gw gold is the old question : Honestly there is no fast way to make lots of GuildWars Gold . Sadly enough a lot of the people that all of a sudden come to with millions of Guild Wars Gold almost overnight probably duped . Although there are a lot of ways to make lots of GuildWars moneyhere I will tell you all of the ways that I know and what I do to make cheap gw gold.

As a new player , you may need some game guides or information to enhance yourself.
habbo credits is one of the hardest theme for every class at the beginning . You must have a good way to manage yourhabbo gold.If yor are a lucky guy ,you can earn so many habbo coins by yourself . But if you are a not , I just find a nice way to get buy habbo gold. If you need , you can buycheap habbo credits at our website . Go to the related page and check the detailed information . Once you have any question , you can connect our customer service at any time .

Anonymous said...






Anonymous said...