Monday, April 02, 2007

Regexes in Depth: Matching Innermost HTML Elements

Update: Please view the updated, syntax-highlighted version of this post on my new blog:

Matching Innermost HTML Elements.

On a regular expression forum I visit every once in awhile, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match "<table>...</table>", but should only match the inner table within "<table>...<table>...</table>...</table>". This logic needed to support an unlimited amount of nested tables.

One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.

It's true that, unless you're working with .NET or Perl, regexes are incapable of recursion (although it's often possible to fake it to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. ;-)

Here's the solution I offered (though there were a few steps to get there):

<table(?:\s[^>]*)?>(?:(?>[^<]+)|<(?!table(?:\s[^>]*)?>))*?</table>

That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also very fast, and it can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).

To demonstrate, the above regex matches the highlighted text below:

<table><td><table><td>&nbsp;</td></table></td></table> <table><tr><td>&nbsp;</td></tr></table> <table></table>

In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed):

<table>(?:[\S\s](?!<table>))*?</table>

Basically, the way that works is it matches an opening "<table>" tag, then it looks at each following character one at a time and checks if they are followed by another instance of "<table>" before "</table>". If so, the match fails, because it's not an innermost table.

In theory, at least. Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening "<table>", but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.

To address both of those issues, I used the following regex:

<table>(?:[^<]+|<(?!table>))*?</table>

First, that increases speed (in theory... you'll see that there are problems with this as is), because within each <table> tag it will greedily jump between all characters which are not "<" in single steps (using "[^<]+"), and it will only use the negative lookahead when it encounters "<". Secondly, it solves the previously noted error by using "<(?!table>)" instead of ".(?!<table>)".

If you're wondering about table tags which contain attributes, that's not a problem. The construct is such that it can easily be extended to support element attributes. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):

<table(?:\s[^>]*)?>(?:[^<]+|<(?!table(?:\s[^>]*)?>))*?</table>

At first I thought this closed the case... The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering catastrophic backtracking. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The change to the regex is highlighted:

<table(?:\s[^>]*)?>(?:(?>[^<]+)|<(?!table(?:\s[^>]*)?>))*?</table>

And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in less than 0.01 second on my system.

However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, a solid approach which is not susceptible to catastrophic backtracking would be:

<table(?:\s[^>]*)?>(?!<table(?:\s[^>]*)?>)(?:[\S\s](?!<table(?:\s[^>]*)?>))*?</table>

That runs just a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.

If you have a copy of RegexBuddy (and if you don't, I highly recommend it), run these regexes through RegexBuddy's debugger for an under-the-hood look at how they're handled by a regex engine.

Edit: Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:

<table(?:\s[^>]*)?>(?:(?=([^<]+))\1|<(?!table(?:\s[^>]*)?>))*?</table>

Basically, that uses a capturing group inside a positive lookahead followed by "\1" to act just like an atomic group!

6 comments:

Anonymous said...

Thank you very much! :) Excellent work!

Anonymous said...

A片,A片,成人網站,成人漫畫,色情,情色網,情色,AV,AV女優,成人影城,成人,色情A片,日本AV,免費成人影片,成人影片,SEX,免費A片,A片下載,免費A片下載,做愛,情色A片,色情影片,H漫,A漫,18成人

a片,色情影片,情色電影,a片,色情,情色網,情色,av,av女優,成人影城,成人,色情a片,日本av,免費成人影片,成人影片,情色a片,sex,免費a片,a片下載,免費a片下載

情趣用品,情趣用品,情趣,情趣,情趣用品,情趣用品,情趣,情趣,情趣用品,情趣用品,情趣,情趣

A片,A片,A片下載,做愛,成人電影,.18成人,日本A片,情色小說,情色電影,成人影城,自拍,情色論壇,成人論壇,情色貼圖,情色,免費A片,成人,成人網站,成人圖片,AV女優,成人光碟,色情,色情影片,免費A片下載,SEX,AV,色情網站,本土自拍,性愛,成人影片,情色文學,成人文章,成人圖片區,成人貼圖

情色視訊,美女視訊,辣妹視訊,視訊聊天室,視訊交友網,免費視訊聊天,視訊交友90739,視訊,免費視訊,情人視訊網,視訊辣妹,影音視訊聊天室,視訊交友,視訊聊天,免費視訊聊天室,成人視訊,UT聊天室,聊天室,豆豆聊天室,色情聊天室,尋夢園聊天室,聊天室尋夢園,080聊天室,080苗栗人聊天室,上班族聊天室,小高聊天室

6K聊天室,080中部人聊天室,聊天室交友,成人聊天室,中部人聊天室,情色聊天室,AV女優,AV,A片,情人薇珍妮,愛情公寓,情色,情色貼圖

Anonymous said...

A片,A片,成人網站,成人漫畫,色情,情色網,情色,AV,AV女優,成人影城,成人,色情A片,日本AV,免費成人影片,成人影片,SEX,免費A片,A片下載,免費A片下載,做愛,情色A片,色情影片,H漫,A漫,18成人

a片,色情影片,情色電影,a片,色情,情色網,情色,av,av女優,成人影城,成人,色情a片,日本av,免費成人影片,成人影片,情色a片,sex,免費a片,a片下載,免費a片下載

情趣用品,情趣用品,情趣,情趣,情趣用品,情趣用品,情趣,情趣,情趣用品,情趣用品,情趣,情趣

A片,A片,A片下載,做愛,成人電影,.18成人,日本A片,情色小說,情色電影,成人影城,自拍,情色論壇,成人論壇,情色貼圖,情色,免費A片,成人,成人網站,成人圖片,AV女優,成人光碟,色情,色情影片,免費A片下載,SEX,AV,色情網站,本土自拍,性愛,成人影片,情色文學,成人文章,成人圖片區,成人貼圖

情色文學,色情小說,色情,寄情築園小遊戲,AIO交友愛情館,情色電影,一葉情貼圖片區,色情遊戲

言情小說,情色論壇,色情網站,微風成人,成人電影,嘟嘟成人網,成人,成人貼圖,成人交友,成人圖片,18成人,成人小說,成人圖片區,微風成人區,成人網站,免費影片,色情影片,自拍,hilive,做愛,微風成人,微風論壇,AIO

husha said...

Today, the Microsoft-owned in-game ad agency said that it has signed an exclusive multiyear agreement with Blizzard. Azerothians opposed to seeing in-game ads in their local world of warcft gold watering holes need not worry, however, because the deal is limited to Blizzard's Web sites and Battle.net,the game maker's online-gaming hub. Terms of the deal were not announced, but Massive did note that the agreement is applicable to users in the US, Canada, Europe, South Korea, and Australia.
buy wow gold


Massive also said today that it would be extending its aforementioned deal with Activision to encompass an additional 18 games appearing on the Xbox 360 and PC.cheap wow goldThe agency didn't fully delineate which would fall under this deal, though it did call out Guitar Hero: World Tour, James Bond: Quantum of Solace, and Transformers: Revenge of the Fallen,buy wow items as well as games in its Tony Hawk and AMAX Racing franchises.Shortly before Activision and Vivendi announced their deal of the decade,wow power leveling the Guitar Hero publisher signed on to receive in-game advertisements from Massive Inc for a number of its Xbox 360 and PC games. A bit more than a year later, Massive is now extending its reach to Activision's new power player, Blizzard Entertainment.buy wow gold from our site ,you'll get more surprises!

Anonymous said...

Making gw gold is the old question : Honestly there is no fast way to make lots of GuildWars Gold . Sadly enough a lot of the people that all of a sudden come to with millions of Guild Wars Gold almost overnight probably duped . Although there are a lot of ways to make lots of GuildWars moneyhere I will tell you all of the ways that I know and what I do to make cheap gw gold.

As a new player , you may need some game guides or information to enhance yourself.
habbo credits is one of the hardest theme for every class at the beginning . You must have a good way to manage yourhabbo gold.If yor are a lucky guy ,you can earn so many habbo coins by yourself . But if you are a not , I just find a nice way to get buy habbo gold. If you need , you can buycheap habbo credits at our website . Go to the related page and check the detailed information . Once you have any question , you can connect our customer service at any time .

Anonymous said...

網頁設計,情趣用品店,情趣用品專賣網

A片下載,成人影片下載
威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣,威而柔,自慰套,自慰套,SM,充氣娃娃,充氣娃娃,潤滑液,飛機杯,按摩棒,跳蛋,性感睡衣
情惑用品性易購


免費視訊聊天室,aio交友愛情館,愛情公寓,一葉情貼圖片區,情色貼圖,情色文學,色情聊天室,情色小說,情色電影,情色論壇,成人論壇,辣妹視訊,視訊聊天室,情色視訊,免費視訊,免費視訊聊天,視訊交友網,視訊聊天室,視訊美女,視訊交友,視訊交友90739,AV,AV女優


A片,色情A片,免費A片,成人影片,色情影片,a片免費看,情色貼圖,情色文學,情色小說,色情小說


影音視訊聊天室