Sunday, February 04, 2007

Regexes in Depth: Advanced Quoted String Matching

Update: Please view the updated version of this post on my new blog: Advanced Quoted String Matching.

In my previous post, one of the examples I used of when capturing groups are appropriate demonstrated how to match quoted strings:

(["'])(?:\\\1|.)*?\1

To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.

On his blog, Ben Nadel asked:

I do not follow the \\\1 in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (*?) know when to stop if it can have quotes in side of it? What am I missing?

Good question. Following is the response I gave, slightly updated to improve clarity:

First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:

  • "test"
  • 'test'
  • "t'es't"
  • 'te"st'
  • 'te\'st'
  • "\"te\"\"st\""

In other words, it allows any number of escaped quotes of the same type as the enclosure. (Due to the way the regex is written, it doesn't need special handling for inner quotes that are not of the same type as the enclosure.)
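As a quick check, here is a minimal Python sketch that runs the regex over the examples above. Python's re module is only one of several engines that accept this syntax, and the helper name find_quoted is made up for illustration.

```python
import re

# The regex under discussion, written as a raw triple-quoted string so that
# both quote characters and the backslashes can appear unescaped.
QUOTED = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

def find_quoted(text):
    """Return every quoted string found in text, delimiters included."""
    return [m.group(0) for m in QUOTED.finditer(text)]

# The six examples from the list above.
examples = [
    '"test"',
    "'test'",
    r'''"t'es't"''',
    r"""'te"st'""",
    r"""'te\'st'""",
    r'''"\"te\"\"st\""''',
]

for s in examples:
    # Each example should come back as exactly one match covering the whole string.
    print(s, find_quoted(s) == [s])
```

Because finditer resumes scanning after each match, the same helper also pulls several separate quoted values out of one longer string.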

As for how the regex works, it uses a trick similar in construct to the examples I gave in my blog post about regex recursion without balancing groups.

Basically, the inner group matches either an escaped quote OR any single character, with the escaped-quote alternative tried before the dot. So, as the lazy repetition operator (*?) steps through the string looking for the first closing quote, it jumps right past each two-character sequence that makes up an escaped quote. In other words, pairing the backslash with the quote character lets the lazy repetition operator treat them as one token and continue on its way through the string.
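To see the effect of that extra alternative, the following Python sketch compares the regex with a stripped-down variant that has only the dot. The variable names are illustrative, not from the original post.

```python
import re

# Variant without the escaped-quote alternative: the lazy dot stops at the
# very first quote character, escaped or not.
naive = re.compile(r'''(["']).*?\1''')

# The regex under discussion: an escaped quote is consumed as one token.
escape_aware = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

sample = r'"te\"st" tail'

naive_match = naive.search(sample).group(0)
aware_match = escape_aware.search(sample).group(0)

print(naive_match)   # stops at the escaped quote
print(aware_match)   # continues to the real closing quote
```

The only difference between the two patterns is the `\\\1|` alternative, so any difference in the results comes from that alternative being tried before the dot.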

Side note: If you want to support multi-line quotes in libraries that have no option to make dots match newlines, change the dot to [\S\s].
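A small Python sketch of that substitution follows. Python does have an option for this (re.DOTALL), so the example deliberately avoids it, only to show how the [\S\s] form behaves in an engine's default mode.

```python
import re

# Default mode: the dot does not match a newline, so a quote that spans
# two lines is not matched at all.
dot_version = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

# Same regex with the dot replaced by [\S\s], which matches any character.
any_char_version = re.compile(r'''(["'])(?:\\\1|[\S\s])*?\1''')

text = '"line one\nline two"'

dot_result = dot_version.search(text)
any_char_result = any_char_version.search(text)

print(dot_result)                      # no match
print(any_char_result.group(0) == text)
```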

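Also note that with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent for well-formed input like the examples above (they can diverge on edge cases, such as an escaped backslash directly before the closing quote):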

  • (["'])(?:\\\1|.)*?\1 (the regex being discussed)
  • (["']).*?(?<!\\)\1 (uses a negative lookbehind to achieve logic which is possibly simpler to understand)

Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. Also, if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.
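For anyone who wants to compare the two patterns directly, here is a rough Python sketch (Python's re module supports lookbehind). It checks the six example strings from earlier in the post, plus one malformed input where the two approaches part ways. The variable names are invented for illustration.

```python
import re

alternation = re.compile(r'''(["'])(?:\\\1|.)*?\1''')
lookbehind = re.compile(r'''(["']).*?(?<!\\)\1''')

examples = [
    '"test"',
    "'test'",
    r'''"t'es't"''',
    r"""'te"st'""",
    r"""'te\'st'""",
    r'''"\"te\"\"st\""''',
]

# On the well-formed examples both patterns match the whole string.
same_on_examples = all(
    alternation.search(s).group(0) == lookbehind.search(s).group(0) == s
    for s in examples
)
print(same_on_examples)

# On an unterminated string the alternation version can backtrack and let the
# dot consume the backslash, while the lookbehind version finds nothing.
unterminated = r'"te\"'
print(alternation.search(unterminated))
print(lookbehind.search(unterminated))
```

So the two patterns agree on the intended inputs, and which behavior is preferable on malformed input depends on what you are parsing.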

One final thing worth noting is that in neither regex did I try to use anything like [^\1] for matching the inner, quoted content. If [^\1] worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes.

First, the reason we can't greedily repeat an "any character" pattern such as a dot or [\S\s] is that we would no longer be able to distinguish between multiple discrete quotes within the same string: the match would run from the start of the first quote to the end of the last quote.

Second, the reason we can't use [^\1] is that backreferences can't be used within character classes (negated or otherwise), even though in this case the text captured by the backreference is only one character long. Also note that the patterns [\1] and [^\1] do have a meaning, though possibly not the one you would expect: they match a single character which is, or is not, the character at octal index 1 in the character set. To express that outside of a character class you'd need a leading zero (e.g., \01), but inside a character class the leading zero is optional.
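Both points can be observed in Python, whose re module also treats \1 inside a character class as an octal escape. This is a sketch of one engine's behavior; other flavors may differ in the details.

```python
import re

text = 'say "one" then "two"'

# Greedy "any character" repetition runs from the first quote to the last one.
greedy = re.compile(r'''(["']).*\1''')
greedy_match = greedy.search(text).group(0)
print(greedy_match)

# Lazy repetition keeps the two quoted values separate.
lazy = re.compile(r'''(["']).*?\1''')
lazy_matches = [m.group(0) for m in lazy.finditer(text)]
print(lazy_matches)

# Inside a character class, \1 is the character with octal code 1, not a
# backreference, so [^\1] happily matches quote characters.
octal_class_matches_ctrl = re.fullmatch(r'[\1]', '\x01') is not None
negated_class_matches_quote = re.fullmatch(r'[^\1]', '"') is not None
print(octal_class_matches_ctrl, negated_class_matches_quote)

# As a result, this attempt behaves just like the greedy dot version.
class_attempt = re.compile(r'''(["'])[^\1]*\1''')
class_match = class_attempt.search(text).group(0)
print(class_match)
```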

If anyone has questions about how other, specific regex patterns work, or why they don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.

Edit: Just for kicks, here's a Unicode-based regular expression which adds support for any kind of opening/closing quote pair in any language (including the special characters “, ”, ‘, ’, etc.). Of the regex flavors I'm familiar with, Java, the .NET framework, and Perl use Unicode-based regex engines. Of those three, only the .NET framework also supports conditionals, which I'll also need to pull this off.

(?:(["'])|\p{Pi}).*?(?<!\\)(?(1)\1|\p{Pf})

I'm not going to go into explaining that, but the more advanced regex features used are a negative lookbehind, conditional, and Unicode character properties.

Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with " or ', neither of which is designated as an opening or closing quote character in Unicode):

  • “test”
  • “te“st”
  • “te\”st”
  • ‘test’
  • ‘t‘e"s\’t’
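The Unicode categories behind \p{Pi} (initial quote punctuation) and \p{Pf} (final quote punctuation) can be inspected with Python's standard unicodedata module. This sketch only looks up the categories; Python's built-in re module does not support \p{...} itself, so it does not run the regex above.

```python
import unicodedata

# Straight quotes plus the curly quotes used in the examples above.
quote_chars = ['"', "'", '“', '”', '‘', '’']

# Map each character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in quote_chars}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

The straight quotes fall under Po (other punctuation) rather than Pi or Pf, which is why the regex handles them through the separate ["'] capturing group.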

Edit 2: Shortly after posting the above Unicode-based regex, I realized it was flawed. Although it will correctly match all strings in the two lists of examples above, the fact that I'm using the Unicode character properties for any opening / closing quote means that it will also match, e.g., ‘test”, which is not what I was going for. The only way to get around this is to not use the Unicode character properties, and instead specifically include support for “” and ‘’ pairs (however, unfortunately we will lose the ability to work with special quote characters from any language). Here's an updated regex:

(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))

Now, it will no longer match ‘test”, and will successfully match things like ‘t‘e“"”s\’t’. Note that I'm using nested conditionals in the above regex to achieve an if-elseif-else construct. Also, now that it's no longer Unicode-based, it will work with regex engines which support both lookbehinds and conditionals (PCRE, PHP, the .NET framework, and possibly others).
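Python's re module is another engine that supports both lookbehinds and conditionals, so the updated regex can be tried there as written. The following is a minimal sketch; the helper name match_quote is hypothetical.

```python
import re

# Group 1 captures a straight quote, group 2 captures a left double curly
# quote, and a left single curly quote is matched without being captured.
# The nested conditional then picks the matching closer for each case.
PAIRED = re.compile(r'''(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))''')

def match_quote(text):
    """Return the first quoted string in text, or None if there is none."""
    m = PAIRED.search(text)
    return m.group(0) if m else None

print(match_quote('‘test”'))             # mismatched pair
print(match_quote(r'‘t‘e“"”s\’t’'))      # nested and escaped quotes
print(match_quote(r'“te\”st”'))
print(match_quote('"test"'))
```

The first conditional tests whether group 1 took part in the match; if not, the inner conditional tests group 2, which gives the if-elseif-else behavior described above.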
