Sunday, February 04, 2007

Capturing vs. Non-capturing Regex Groups

I posted the following on Ben Nadel's excellent blog, in response to a question about why I use non-capturing groups in my regular expressions (e.g., (?:non-captured))...

I near-religiously use non-capturing groups whenever I do not need to reference a group's contents. There are only three reasons to use capturing groups:

  • You're using parts of a match to construct a replacement string, or otherwise referencing parts of the match in code outside the regex.
  • You need to reuse parts of the match within the regex itself. E.g., (["'])(?:\\\1|.)*?\1 would match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match, and allowing inner, escaped quotes of the same type as the enclosure.
  • You need to test if an optional group was part of the match so far, as the condition to evaluate within a conditional. E.g., (a)?b(?(1)c|d) only matches the values "bd" and "abc".
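As a rough illustration of the second and third reasons, here is a minimal sketch in Python, chosen only because its regex flavor supports both backreferences and conditionals. The helper names quoted_value and matches_fully are made up for this example.

```python
import re

# Backreference: \1 must repeat whichever quote character group 1 captured,
# and \\\1 allows an escaped quote of that same type inside the value.
quoted = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

# Conditional: (?(1)c|d) requires "c" if group 1 took part in the match,
# and "d" otherwise.
conditional = re.compile(r'(a)?b(?(1)c|d)')


def quoted_value(text):
    """Return the first quoted value found in text, or None."""
    m = quoted.search(text)
    return m.group(0) if m else None


def matches_fully(text):
    """Return True if the whole of text matches the conditional pattern."""
    return conditional.fullmatch(text) is not None
```

With these helpers, matches_fully accepts only "abc" and "bd", and quoted_value finds nothing in a string that opens with a double quote and closes with a single quote.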

There are two primary reasons to use non-capturing groups if a grouping doesn't meet one of the above conditions:

  • Capturing groups negatively impact performance, since creating backreferences requires that their contents be stored in memory. The performance hit may be tiny, especially when working with small strings, but it's there.
  • When you need to use several groupings in a single regex, only some of which you plan to reference later, it's very convenient to have the backreferences you want to use numbered sequentially. E.g., the logic in my parseUri() UDF could not be nearly as simple if I had not made appropriate use of capturing and non-capturing groups within the same regex.
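The numbering point can be sketched with a deliberately simplified pattern. This is not the parseUri() regex; the two patterns and the parts helper below are hypothetical, written in Python only for illustration. Both patterns accept an optional scheme, a host, and an optional port, but only the first keeps the interesting pieces in groups 1, 2, and 3.

```python
import re

# Non-capturing groups wrap the "://" and ":" delimiters, so only the
# scheme, host, and port are numbered (groups 1, 2, 3).
mixed = re.compile(r'(?:(\w+)://)?([^/:]+)(?::(\d+))?')

# The same pattern with every group capturing: the useful values now sit
# in groups 2, 3, and 5, with the delimiter groups in between.
all_capturing = re.compile(r'((\w+)://)?([^/:]+)(:(\d+))?')


def parts(pattern, text):
    """Return the tuple of captured groups, or None if there is no match."""
    m = pattern.match(text)
    return m.groups() if m else None
```

Calling code that uses the first pattern can read the groups in order, while code that uses the second has to remember which group numbers to skip.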

On a related note, the values of backreferences created by capturing groups that have a repetition operator on the end may not be obvious until you're familiar with how they work. If you ran the regex (.)* over the string "test", backreference 0 (i.e., the whole match) would be "test", but backreference 1 would be just "t". In most regex flavors a repeated capturing group keeps only the value from its final iteration, so this is the last "t" rather than the first, and no 2nd, 3rd, or 4th backreferences would be created for the other characters. If you wanted the entire match of a repeated grouping to be captured into a backreference, you could use, e.g., ((?:.)*). Also note that the way both of those patterns would be evaluated is fundamentally different from how regex engines would treat (.*).
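The behavior of repeated groups is easy to check for yourself. The sketch below uses Python's re module as one example flavor, with the input "abcd" so that the first and last characters are distinguishable. The groups_for helper is made up for this example.

```python
import re


def groups_for(pattern, text):
    """Return (whole match, group 1) for pattern applied at the start of text."""
    m = re.match(pattern, text)
    return m.group(0), m.group(1)


# One capturing group in the pattern means one backreference,
# no matter how many times the group repeats.
repeated_group_count = re.compile(r'(.)*').groups

# Group 1 of (.)* holds only the final iteration.
last_iteration = groups_for(r'(.)*', 'abcd')

# Wrapping the repetition in an outer capturing group keeps all of it.
whole_repetition = groups_for(r'((?:.)*)', 'abcd')
```

Here last_iteration holds "abcd" for the whole match and "d" for group 1, while whole_repetition holds "abcd" for both.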

1 comment:

Ze said...

Steve, thanks for the helpful post. I would just like to point out that when you say "there are only 3 reasons to use capturing groups" and then, right in the first reason, list an infinite set ("use the captured regex outside the regex"), you are in a bit of a contradiction.

"Outside the regex" is a set made of a billion reasons.

I just thought I'd point that out.

Regards, Jose