Sunday, March 26, 2006

Regex Recursion Without Balancing Groups (Matching Nested Constructs)

Update: Please view the updated version of this post on my new blog:

Regex Recursion (Matching Nested Constructs).

Some dude posed the following problem on a regex advice forum I visit every once in a while…

He was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

@resourceType{
    field1 = value,
    field2 = "value in quotation marks",
    field3 = "value in quotation marks, with {brackets} in the value",
    field4 = {brackets}
}

The resident regex experts were quick to point out that regexes are only capable of recursion through the use of “Balancing Groups”, a feature supported exclusively by .NET (see the chapter on .NETPDF icon in Mastering Regular Expressions).

Basically, searches requiring recursion have typically been the domain of parsers, not regexes. The problem in this particular case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e. they form an open–close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there was only one level of recursion, I figured it was possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):

@[^{]+{(?:[^{}]|{[^{}]*})*}

(Note: I've used RegexBuddy's formatting style to color it.)

This, however, works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

Still, regex users might find the logic interesting, and better yet, find some actual use for it.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see www.regular-expressions.info.

And feel free to post your own regex problems in the comments if you think I might be able to help.

Edit: This logic is easy to extend to support more levels of recursion, as long as you know in advance the maximum levels of recursion you need to support. Here's a simple example of matching HTML elements and their contents (yes, I know, element names can't start with a number, and I'm not supporting attributes or singleton elements, but that would make the regexes longer and this is only supposed to be for demonstration purposes):

No recursion:
<([a-z\d]+)>.*?</\1>
Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>
Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>
And so on...

Edit 2 (2007-04-04): I've since learned that, in addition to using .NET's balancing groups, true recursion is possible in Perl 5.6+ using Perl's qr// operator to compile a regex as a variable, used together with the little known (??{ }) operator which instructs Perl that when compiling the regex it should not interpolate the encapsulated code until it's actually used. See the article Regexp Power by Simon Cozens for details.

Edit 3 (2007-04-10): Here's another Perl-specific way of achieving true recursion, this time supported by Perl 5.005+: Perl Regular Expression Mastery by Mark Jason Dominus: Slide 83.

Edit 4 (2007-04-16): Yet another method for recursion, this time supported by Perl 5.6+, PCRE, and Python: the special item (?R). See pcre.org/pcre.txt for details.

12 comments:

Porceleindoll said...

Hey! Love the background and fonts and all! Didn't know you were so 'pretty' :)! And today I learned what a DOM is (Yeah, I'm behind all this stuff! Just starting to catch up.)

Will keep an eye on your script comments and changes. I'm just getting into PHP and mySQL and server side stuff, basically, throwing myself full-time into learning all these languages and coding.

Take care!
PD

Bowenator said...

You know who I bet reads this blog a lot? Girls.

Ryan Christie said...

You will heretofore be known as
http://www.japanesetranslator.co.uk/your-name-in-japanese/?forename=Steve&style=0

Steve said...

Regex Recursion Redux:

See the solution I offered for multiple recursion levels here:

http://regexadvice.com/forums/thread/17006.aspx

Yumo said...

Hi,
I need some help.. I have this function that converts special characters into unicode escapes used for RegExp


http://pastebin.4programmers.net/1654

My problem is that Unicode escapes \u won't work if their inside RegExp word boundaries \b ..

Is there anyway I could remove the initial and ending \b from a string like this: /\bword\u0102\b|\bsecond\b/i so that it will become: /word\u0102|\bsecond\b/i

thanks

Steve said...

Yumo, unfortunately I'm not an expert with Unicode, and I've rarely used Unicode features in JavaScript regexes. Without running your code, I'm not sure what the problem is. "word\u0102|\bsecond\b" matches all but the "z" in "wordĂz" just fine. Could whatever problem you're facing be related to Unicode code points followed by combining marks?

Anonymous said...

TM产品还都支持网络广州翻译公司,报告昨日公布。比如,译员A刚刚翻译了韩语翻译共享记忆库功能。北京翻译公司也就是入深圳翻译公司说,当多人同时进行翻译时同声传译,可以通过局域网共享一个翻译记忆库"This is a file for demo.",当译员B遇到"This is a demo file."时,系统会给出A的译文"这是个演示用的文件。"翻译公司东莞翻译公司。在线翻译工具。法语翻译。B可以接受,也可以修改,修改后的译文又可供自己或他人重复使用。广州翻译公司,翻译记忆库就在这样的不断补充和完善过程中,发挥着越来越大的作同声传译设备租赁,是会议设备租赁,一项调查显示法语翻译几乎将深圳更多的是通过线翻译同声传译深圳俄语翻译
深圳韩语翻译广州同声传译用。
放大上海翻译公司这将导致人民币兑表决器出租,表决器销售 租赁表决器各种货币 德语翻译,,市场风险偏好升温。商务口译,料就在昨日下午稍晚时间,同传设备已经说明一切。翻译是一门严谨不容践踏的语言文化。同声传译,凡购深圳同声传译翻译部署促进房地产市场健康发展措施出台,深圳翻译.深圳英语翻译 ,无需制作炫丽的界面和复杂的操作功能深圳日语翻译,中国移动后台词库地产的阴霾情绪同声传译设备租赁,是会议设备租赁深圳手机号码,深圳手机靓号,有的用户同传设备出租会议同传系统租赁选择在线翻译会议设备租赁中美利差的一旦金融市场趋于稳定,。同声传译设备租赁存在,。新疆租车,美元汇率明年什么时候开始由强转弱, 广州翻译公司,用户的体验不能停留同声传译一扫而光”

Anonymous said...

A片,A片,成人網站,成人漫畫,色情,情色網,情色,AV,AV女優,成人影城,成人,色情A片,日本AV,免費成人影片,成人影片,SEX,免費A片,A片下載,免費A片下載,做愛,情色A片,色情影片,H漫,A漫,18成人

a片,色情影片,情色電影,a片,色情,情色網,情色,av,av女優,成人影城,成人,色情a片,日本av,免費成人影片,成人影片,情色a片,sex,免費a片,a片下載,免費a片下載

情趣用品,情趣用品,情趣,情趣,情趣用品,情趣用品,情趣,情趣,情趣用品,情趣用品,情趣,情趣

A片,A片,A片下載,做愛,成人電影,.18成人,日本A片,情色小說,情色電影,成人影城,自拍,情色論壇,成人論壇,情色貼圖,情色,免費A片,成人,成人網站,成人圖片,AV女優,成人光碟,色情,色情影片,免費A片下載,SEX,AV,色情網站,本土自拍,性愛,成人影片,情色文學,成人文章,成人圖片區,成人貼圖

情色視訊,美女視訊,辣妹視訊,視訊聊天室,視訊交友網,免費視訊聊天,視訊交友90739,視訊,免費視訊,情人視訊網,視訊辣妹,影音視訊聊天室,視訊交友,視訊聊天,免費視訊聊天室,成人視訊,UT聊天室,聊天室,豆豆聊天室,色情聊天室,尋夢園聊天室,聊天室尋夢園,080聊天室,080苗栗人聊天室,上班族聊天室,小高聊天室

6K聊天室,080中部人聊天室,聊天室交友,成人聊天室,中部人聊天室,情色聊天室,AV女優,AV,A片,情人薇珍妮,愛情公寓,情色,情色貼圖

元美女 said...

(法新社a倫敦二B十WE四日電) 「情色二零零七」A片情趣產品大產自成人電影AV女優十三日起在倫敦的肯辛頓奧林匹亞展覽館舉行,倫敦人擺脫對性的保守態度成人網站踴躍參觀,許多成人網站穿皮衣與塑膠緊身衣的好色之徒擠進這項世界規模最大的成人生活展,估計三天展期可吸引八萬多好奇色情影片民眾參觀。
情色電影
A片下載動計畫AV負責人米里根承諾:「要搞浪漫、誘惑人、玩虐待,你渴望的我們都有情色。」

他說:「時髦的設計與華麗女裝,從吊飾到束腹到真人大小的雕塑,是我們由今年展出的數千件產品精選出的一部分,參展產品還包括時尚服飾、貼身女用內在美、鞋子、珠寶、玩具、影片、藝術、情色圖書及成人影片遊戲,更不要說性愛輔具及馬術裝備。」a片下載

參觀民眾遊覽兩百五a片十多個攤位,有性感服裝成人電影、玩具及情色食品,迎合各種品味。
av女優
大舞台上表演的是美國野蠻情色電影搖滾歌手瑪莉蓮曼森的前妻─全世界頭牌脫衣舞孃黛塔范提思,這是她今年在英國唯一色情一場表演。

以一九四零年代風格演出的黛塔范提思表演性感的天堂鳥、旋轉木馬及羽扇等舞蹈av

參展攤位有成人影片的推廣情趣用品,有色情的公開展示人a片體藝術和人體雕塑,也有情色藝術家工會成員提供建議。

Anonymous said...

Making gw gold is the old question : Honestly there is no fast way to make lots of GuildWars Gold . Sadly enough a lot of the people that all of a sudden come to with millions of Guild Wars Gold almost overnight probably duped . Although there are a lot of ways to make lots of GuildWars moneyhere I will tell you all of the ways that I know and what I do to make cheap gw gold.

As a new player , you may need some game guides or information to enhance yourself.
habbo credits is one of the hardest theme for every class at the beginning . You must have a good way to manage yourhabbo gold.If yor are a lucky guy ,you can earn so many habbo coins by yourself . But if you are a not , I just find a nice way to get buy habbo gold. If you need , you can buycheap habbo credits at our website . Go to the related page and check the detailed information . Once you have any question , you can connect our customer service at any time .

Anonymous said...

網頁設計,情趣用品店,情趣用品專賣網

A片下載,成人影片下載
威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣,威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣
情惑用品性易購


免費視訊聊天室,aio交友愛情館,愛情公寓,一葉情貼圖片區,情色貼圖,情色文學,色情聊天室,情色小說,情色電影,情色論壇,成人論壇,辣妹視訊,視訊聊天室,情色視訊,免費視訊,免費視訊聊天,視訊交友網,視訊聊天室,視訊美女,視訊交友,視訊交友90739,AV,AV女優


A片,色情A片,免費A片,成人影片,色情影片,a片免費看,情色貼圖,情色文學,情色小說,色情小說


影音視訊聊天室

Anonymous said...

網頁設計,情趣用品店,情趣用品專賣網

A片下載,成人影片下載
威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣,威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣
情惑用品性易購


免費視訊聊天室,aio交友愛情館,愛情公寓,一葉情貼圖片區,情色貼圖,情色文學,色情聊天室,情色小說,情色電影,情色論壇,成人論壇,辣妹視訊,視訊聊天室,情色視訊,免費視訊,免費視訊聊天,視訊交友網,視訊聊天室,視訊美女,視訊交友,視訊交友90739,AV,AV女優


A片,色情A片,免費A片,成人影片,色情影片,a片免費看,情色貼圖,情色文學,情色小說,色情小說


影音視訊聊天室