Sunday, February 04, 2007

Regexes in Depth: Advanced Quoted String Matching

Update: Please view the updated version of this post on my new blog:

Advanced Quoted String Matching.

In my previous post, one of the examples I used of when capturing groups are appropriate demonstrated how to match quoted strings:

(["'])(?:\\\1|.)*?\1

To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.

On his blog, Ben Nadel asked:

I do not follow the \\\1 in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (*?) know when to stop if it can have quotes in side of it? What am I missing?

Good question. Following is the response I gave, slightly updated to improve clarity:

First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:

  • "test"
  • 'test'
  • "t'es't"
  • 'te"st'
  • 'te\'st'
  • "\"te\"\"st\""

In other words, it allows any number of escaped quotes of the same type as the enclosure. (Due to the way the regex is written, it doesn't need special handling for inner quotes that are not of the same type as the enclosure.)

As for how the regex works, it uses a trick similar in construct to the examples I gave in my blog post about regex recursion without balancing groups.

Basically, the inner grouping matches escaped quotes OR any single character, with the escaped quote part before the dot in the test attempt sequence. So, as the lazy repetition operator (*?) steps through the match looking for the first closing quote, it jumps right past each instance of the two characters which together make up an escaped quote. In other words, pairing something other than the quote character with the quote character allows the lazy repetition operator to treat them as one token, and continue on it's way through the string.

Side note: If you wanted to support multi-line quotes in libraries without an option to make dots match newlines, change the dot to [\S\s]

Also note that with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent to each other:

  • (["'])(?:\\\1|.)*?\1 (the regex being discussed)
  • (["']).*?(?<!\\)\1 (uses a negative lookbehind to achieve logic which is possibly simpler to understand)

Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. Also, if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.

One final thing worth noting is that in neither regex did I try to use anything like [^\1] for matching the inner, quoted content. If [^\1] worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes. First of all, the reason we can't greedily repeat an "any character" pattern such as a dot or [\S\s] is that we would then no longer be able to distinguish between multiple discrete quotes within the same string, and our match would go from the start of the first quote to the end of the last quote. Secondly, the reason we can't use [^\1] either is because you can't use backreferences within character classes (negated or otherwise), even though in this case the match contained within the backreference is only one character in length. Also note that the patterns [\1] and [^\1] actually do have special meaning, though possibly not what you would expect. They assert: match a single character which is/is not octal index 1 in the character set. To assert that outside of a character class, you'd need to use a leading zero (e.g., \01), but inside a character class the leading zero is optional.

If anyone has questions about how other, specific regex patterns work, or why they don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.

Edit: Just for kicks, here's a Unicode-based regular expression which adds support for any kind of opening/closing quote pair in any language (including the special characters , , , , etc.). Of the regex flavors I'm familiar with, Java, the .NET framework, and Perl use Unicode-based regex engines. Of those three, only the .NET framework also supports conditionals, which I'll also need to pull this off.

(?:(["'])|\p{Pi}).*?(?<!\\)(?(1)\1|\p{Pf})

I'm not going to go into explaining that, but the more advanced regex features used are a negative lookbehind, conditional, and Unicode character properties.

Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with " or ', neither of which are designated as opening or closing quote characters in Unicode).

  • “test”
  • “te“st”
  • “te\”st”
  • ‘test’
  • ‘t‘e"s\’t’

Edit 2: Shortly after posting the above Unicode-based regex, I realized it was flawed. Although it will correctly match all strings in the two lists of examples above, the fact that I'm using the Unicode character properties for any opening / closing quote means that it will also match, e.g., ‘test”, which is not what I was going for. The only way to get around this is to not use the Unicode character properties, and instead specifically include support for “” and ‘’ pairs (however, unfortunately we will lose the ability to work with special quote characters from any language). Here's an updated regex:

(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))

Now, it will no longer match ‘test”, and will successfully match things like ‘t‘e“"”s\’t’. Note that I'm using nested conditionals in the above regex to achieve an if-elseif-else construct. Also, now that it's no longer Unicode-based, it will work with regex engines which support both lookbehinds and conditionals (PCRE, PHP, the .NET framework, and possibly others).

2 comments:

Anonymous said...

Good blog. Very useful

Anonymous said...

great help, Thanks