Wednesday, February 07, 2007

parseUri(): Split URLs in JavaScript

Update: Please see the latest version of this function on my new blog:

parseUri: Split URLs in JavaScript.

For fun, I spent the 10 minutes needed to convert my parseUri() ColdFusion UDF into a JavaScript function.

For those who haven't already seen it, I'll repeat my explanation from the other post…

parseUri() splits any well-formed URI into its parts (all are optional). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., parseUri(someUri).anchor, etc.

I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.

This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.

/* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
Splits any well-formed URI into the following parts (all are optional):
----------------------
• source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
• protocol (scheme)
• authority (includes both the domain and port)
    • domain (part of the authority; can be an IP address)
    • port (part of the authority)
• path (includes both the directory path and filename)
    • directoryPath (part of the path; supports directories with periods, and without a trailing slash)
    • fileName (part of the path)
• query (does not include the leading question mark)
• anchor (fragment)
*/
function parseUri(sourceUri){
    var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
    var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
    var uri = {};
    
    for(var i = 0; i < 10; i++){
        uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
    }
    
    // Always end directoryPath with a trailing slash if a path was present in the source URI
    // Note that a trailing slash is NOT automatically inserted within or appended to the "path" key
    if(uri.directoryPath.length > 0){
        uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
    }
    
    return uri;
}

Is there any leaner, meaner URI parser out there? :-)
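
To see what comes back, here's a minimal sketch that runs the function against the sample URI used in the test page. It repeats parseUri() unchanged (apart from dropping the comments) so the snippet can be pasted into a console on its own; the variable name parts is just for illustration.

```javascript
// Copy of parseUri() from above, repeated so this snippet runs standalone.
function parseUri(sourceUri){
    var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
    var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
    var uri = {};

    for(var i = 0; i < 10; i++){
        uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
    }

    if(uri.directoryPath.length > 0){
        uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
    }

    return uri;
}

var parts = parseUri("http://www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top");

console.log(parts.protocol);      // "http"
console.log(parts.authority);     // "www.domain.com:81"
console.log(parts.domain);        // "www.domain.com"
console.log(parts.port);          // "81"
console.log(parts.path);          // "/dir1/dir.2/index.html"
console.log(parts.directoryPath); // "/dir1/dir.2/"
console.log(parts.fileName);      // "index.html"
console.log(parts.query);         // "id=1&test=2"
console.log(parts.anchor);        // "top"
```

Note how the period in "dir.2" doesn't confuse the directory/filename split.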

To make it easier to test this function, here is some code that can be copied and pasted into a new HTML file, allowing you to easily enter URIs and see the results.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Steve's URI Parser</title>
    
    <script type="text/javascript">
    //<![CDATA[
        /* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
        Splits any well-formed URI into the following parts (all are optional):
        ----------------------
        • source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
        • protocol (scheme)
        • authority (includes both the domain and port)
            • domain (part of the authority; can be an IP address)
            • port (part of the authority)
        • path (includes both the directory path and filename)
            • directoryPath (part of the path; supports directories with periods, and without a trailing slash)
            • fileName (part of the path)
        • query (does not include the leading question mark)
        • anchor (fragment)
        */
        function parseUri(sourceUri){
            var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
            var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
            var uri = {};
            
            for(var i = 0; i < 10; i++){
                uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
            }
            
            // Always end directoryPath with a trailing slash if a path was present in the source URI
            // Note that a trailing slash is NOT automatically inserted within or appended to the "path" key
            if(uri.directoryPath.length > 0){
                uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
            }
            
            return uri;
        }
        
        // Dump the test results in the page
        function dumpResults(obj){
            var output = "";
            for (var property in obj){
                output += '<tr><td class="name">' + property + '</td><td class="result">"<span class="value">' + obj[property] + '</span>"</td></tr>';
            }
            document.getElementById('output').innerHTML = "<table>" + output + "</table>";
        }
    //]]>
    </script>
    
    <style type="text/css" media="screen">
        h1 {font-size:1.25em;}
        table {border:solid #333; border-width:1px; background:#f5f5f5; margin:15px 0 0; border-collapse:collapse;}
        td {border:solid #333; border-width:1px 1px 0 0; padding:4px;}
        .name {font-weight:bold;}
        .result {color:#aaa;}
        .value {color:#33c;}
    </style>
</head>
<body>
    <h1>Steve's URI Parser</h1>
    
    <form action="#" onsubmit="dumpResults(parseUri(document.getElementById('uriInput').value)); return false;">
        <div>
            <input id="uriInput" type="text" style="width:500px" value="http://www.domain.com:81/dir1/dir.2/index.html?id=1&amp;test=2#top" />
            <input type="submit" value="Parse" />
        </div>
    </form>
    
    <div id="output">
    </div>
    
    <p><a href="http://badassery.blogspot.com">My blog</a></p>
</body>
</html>

Edit: This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take much longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if several people leave comments asking for the support, I'll go ahead and add it. I could also add more pre-concatenated parts (e.g., "relative" for everything starting with the path) or other stuff like "tld" (for just the top-level domain) if readers think it would be useful.
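
To give a rough idea of what that support might look like, here's a minimal sketch of a separate helper that splits a raw authority string (the text between "//" and the path) into user, password, domain, and port. The function name splitAuthority and its return keys are hypothetical and not part of parseUri(); it also assumes the domain contains no colons, so IPv6 literals are out of scope.

```javascript
// Hypothetical helper, not part of parseUri(): splits "user:password@domain:port".
// Every part except the domain is optional. Returns null if the string doesn't fit.
function splitAuthority(authority){
    var m = /^(?:([^:@]*)(?::([^@]*))?@)?([^:]*)(?::(\d*))?$/.exec(authority);

    if(!m){
        return null;
    }

    return {
        user:     m[1] || "",
        password: m[2] || "",
        domain:   m[3] || "",
        port:     m[4] || ""
    };
}

var full = splitAuthority("user:password@domain.com:8080");
console.log(full.user);     // "user"
console.log(full.password); // "password"
console.log(full.domain);   // "domain.com"
console.log(full.port);     // "8080"

var plain = splitAuthority("domain.com");
console.log(plain.user);    // ""
console.log(plain.domain);  // "domain.com"
```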


You might also be looking for my script which fixes the JavaScript split method cross-browser.

Sunday, February 04, 2007

Regexes in Depth: Advanced Quoted String Matching

Update: Please view the updated version of this post on my new blog:

Advanced Quoted String Matching.

In my previous post, one of the examples I used of when capturing groups are appropriate demonstrated how to match quoted strings:

(["'])(?:\\\1|.)*?\1

To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.

On his blog, Ben Nadel asked:

I do not follow the \\\1 in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (*?) know when to stop if it can have quotes in side of it? What am I missing?

Good question. Following is the response I gave, slightly updated to improve clarity:

First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:

  • "test"
  • 'test'
  • "t'es't"
  • 'te"st'
  • 'te\'st'
  • "\"te\"\"st\""

In other words, it allows any number of escaped quotes of the same type as the enclosure. (Due to the way the regex is written, it doesn't need special handling for inner quotes that are not of the same type as the enclosure.)

As for how the regex works, it uses a trick similar in construct to the examples I gave in my blog post about regex recursion without balancing groups.

Basically, the inner grouping matches escaped quotes OR any single character, with the escaped quote part before the dot in the test attempt sequence. So, as the lazy repetition operator (*?) steps through the match looking for the first closing quote, it jumps right past each instance of the two characters which together make up an escaped quote. In other words, pairing something other than the quote character with the quote character allows the lazy repetition operator to treat them as one token, and continue on its way through the string.
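
Here's a quick JavaScript sketch of that behavior, run against the sample strings listed above. The variable names are just for illustration, and the backslashes are doubled only because the samples are written as JavaScript string literals.

```javascript
var quotedString = /(["'])(?:\\\1|.)*?\1/;

// The sample strings from the list above.
var samples = [
    '"test"',
    "'test'",
    '"t\'es\'t"',          // "t'es't"
    "'te\"st'",            // 'te"st'
    "'te\\'st'",           // 'te\'st'
    '"\\"te\\"\\"st\\""'   // "\"te\"\"st\""
];

// Each sample should be matched in full, escaped quotes included.
var allMatched = true;
for(var i = 0; i < samples.length; i++){
    var m = quotedString.exec(samples[i]);
    if(!m || m[0] !== samples[i]){
        allMatched = false;
    }
}
console.log(allMatched); // true

// With the global flag, discrete quoted strings in one text are matched separately.
var text = 'He said "hi" and then \'it\\\'s late\'.';
var found = text.match(/(["'])(?:\\\1|.)*?\1/g);
console.log(found.length); // 2
console.log(found[0]);     // "hi" (including the double quotes)
console.log(found[1]);     // 'it\'s late' (including the single quotes)
```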

Side note: If you wanted to support multi-line quotes in libraries without an option to make dots match newlines, change the dot to [\S\s].

Also note that with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent to each other:

  • (["'])(?:\\\1|.)*?\1 (the regex being discussed)
  • (["']).*?(?<!\\)\1 (uses a negative lookbehind to achieve logic which is possibly simpler to understand)

Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. Also, if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.

One final thing worth noting is that in neither regex did I try to use anything like [^\1] for matching the inner, quoted content. If [^\1] worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes. First of all, the reason we can't greedily repeat an "any character" pattern such as a dot or [\S\s] is that we would then no longer be able to distinguish between multiple discrete quotes within the same string, and our match would go from the start of the first quote to the end of the last quote. Secondly, the reason we can't use [^\1] either is because you can't use backreferences within character classes (negated or otherwise), even though in this case the match contained within the backreference is only one character in length. Also note that the patterns [\1] and [^\1] actually do have special meaning, though possibly not what you would expect. They assert: match a single character which is/is not octal index 1 in the character set. To assert that outside of a character class, you'd need to use a leading zero (e.g., \01), but inside a character class the leading zero is optional.
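
The octal behavior is easy to check in JavaScript. This is a small sketch with illustrative variable names; it assumes a regex without the Unicode flag, where legacy octal escapes are accepted inside character classes.

```javascript
// Inside a character class, \1 is an octal escape for the character U+0001,
// not a backreference to group 1.
var octalClass = /[\1]/;
console.log(octalClass.test("\x01")); // true
console.log(octalClass.test('"'));    // false

// So this attempt does not mean "anything except the opening quote".
// [^\1]* greedily matches any character other than U+0001, and the match
// runs from the first quote to the last one in the string.
var attempt = /(["'])[^\1]*\1/;
var result = attempt.exec('"a" and "b"');
console.log(result[0]); // the whole string, from the first quote to the last
```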

If anyone has questions about how other, specific regex patterns work, or why they don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.

Edit: Just for kicks, here's a Unicode-based regular expression which adds support for any kind of opening/closing quote pair in any language (including special quote characters). Of the regex flavors I'm familiar with, Java, the .NET framework, and Perl use Unicode-based regex engines. Of those three, only the .NET framework also supports conditionals, which I'll also need to pull this off.

(?:(["'])|\p{Pi}).*?(?<!\\)(?(1)\1|\p{Pf})

I'm not going to go into explaining that, but the more advanced regex features used are a negative lookbehind, conditional, and Unicode character properties.

Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with " or ', neither of which is designated as an opening or closing quote character in Unicode).

  • “test”
  • “te“st”
  • “te\”st”
  • ‘test’
  • ‘t‘e"s\’t’

Edit 2: Shortly after posting the above Unicode-based regex, I realized it was flawed. Although it will correctly match all strings in the two lists of examples above, the fact that I'm using the Unicode character properties for any opening / closing quote means that it will also match, e.g., ‘test”, which is not what I was going for. The only way to get around this is to not use the Unicode character properties, and instead specifically include support for “” and ‘’ pairs (however, unfortunately we will lose the ability to work with special quote characters from any language). Here's an updated regex:

(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))

Now, it will no longer match ‘test”, and will successfully match things like ‘t‘e“"”s\’t’. Note that I'm using nested conditionals in the above regex to achieve an if-elseif-else construct. Also, now that it's no longer Unicode-based, it will work with regex engines which support both lookbehinds and conditionals (PCRE, PHP, the .NET framework, and possibly others).

Capturing vs. Non-capturing Regex Groups

I posted the following on Ben Nadel's excellent blog, in response to a question about why I use non-capturing groups in my regular expressions (e.g., (?:non-captured))...

I near-religiously use non-capturing groups whenever I do not need to reference a group's contents. There are only three reasons to use capturing groups:

  • You're using parts of a match to construct a replacement string, or otherwise referencing parts of the match in code outside the regex.
  • You need to reuse parts of the match within the regex itself. E.g., (["'])(?:\\\1|.)*?\1 would match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match, and allowing inner, escaped quotes of the same type as the enclosure.
  • You need to test if an optional group was part of the match so far, as the condition to evaluate within a conditional. E.g., (a)?b(?(1)c|d) only matches the values "bd" and "abc".

There are two primary reasons to use non-capturing groups if a grouping doesn't meet one of the above conditions:

  • Capturing groups negatively impact performance, since creating backreferences requires that their contents be stored in memory. The performance hit may be tiny, especially when working with small strings, but it's there.
  • When you need to use several groupings in a single regex, only some of which you plan to reference later, it's very convenient to have the backreferences you want to use numbered sequentially. E.g., the logic in my parseUri() UDF could not be nearly as simple if I had not made appropriate use of capturing and non-capturing groups within the same regex.

On a related note, the values of backreferences created using capturing groups with repetition operators on the end of them may not be obvious until you're familiar with how it works. If you ran the regex (.)* over the string "test", although backreference 0 (i.e., the whole match) would be "test", backreference 1 would hold only a single "t" (the final character, since each repetition overwrites the previous capture), and there would be no 2nd, 3rd, or 4th backreferences created for the other repetitions. If you wanted the entire match of a repeated grouping to be captured into a backreference, you could use, e.g., ((?:.)*). Also note that the way both of those patterns would be evaluated is fundamentally different from how regex engines would treat (.*).
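
A short JavaScript sketch makes this visible. It uses "abcd" rather than "test" so it's obvious which character ends up in the backreference; the variable names are just for illustration.

```javascript
// A repeated capturing group keeps only the value from its last repetition.
var repeated = /(.)*/.exec("abcd");
console.log(repeated[0]);     // "abcd"
console.log(repeated[1]);     // "d"
console.log(repeated.length); // 2 (the whole match plus one backreference)

// Wrapping the repeated, non-capturing group in a capturing group
// captures everything the repetition matched.
var wrapped = /((?:.)*)/.exec("abcd");
console.log(wrapped[1]);      // "abcd"
```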

Saturday, February 03, 2007

More URI-related UDFs

To follow up my parseUri() function, here are several more UDFs I've written recently to help with URI management:

  • getPageUri()
    Returns a struct containing the relative and absolute URIs of the current page. The difference between getPageUri().relative and CGI.SCRIPT_NAME is that the former will include the query string, if present.
  • matchUri(testUri, [masterUri])
    Returns a Boolean indicating whether or not two URIs are the same, disregarding the following differences:
    • Fragments (page anchors), e.g., "#top".
    • Inclusion of "index.cfm" in paths, e.g., "/dir/" vs. "/dir/index.cfm" (supports trailing query strings).
    If masterUri is not provided, the current page is used for comparison (supports both relative and absolute URIs).
  • replaceUriQueryKey(uri, key, substring)
    Replaces a URI query key and its value with a supplied key=value pair. Works with relative and absolute URIs, as well as standalone query strings (with or without a leading "?"). This is also used to support the following two UDFs:
  • addUriQueryKey(uri, key, value)
    Removes any existing instances of the supplied key, then appends it together with the provided value to the provided URI.
  • removeUriQueryKey(uri, key)
    Removes one or more query keys (comma delimited) and their values from the provided URI.

Now that I have these at my disposal, I frequently find myself using them in combination with each other, e.g.:

<a href="<cfoutput>#addUriQueryKey(
    getPageUri().relative,
    "key",
    "value"
)#</cfoutput>">Link</a>

Let me know if you find any of these useful…

<!--- Returns the relative and absolute URIs of the current page --->
<cffunction name="getPageUri" returntype="struct" output="FALSE">
    <cfset var pageProtocol = "http" />
    <cfset var pageQuery = "" />
    <cfset var uri = structNew() />
    
    <!--- Get the protocol of the current page --->
    <cfif CGI.HTTPS IS "ON">
        <cfset pageProtocol = "https" />
    </cfif>
    
    <!--- Get the query of the current page, including the leading question mark if the query is not empty --->
    <cfset pageQuery = reReplace("?" & CGI.QUERY_STRING, "\?$", "") />
    
    <!--- Construct the relative URI of the current page (excludes the protocol and domain) --->
    <cfset uri.relative = CGI.SCRIPT_NAME & pageQuery />
    <!--- Construct the absolute URI of the current page --->
    <cfset uri.absolute = pageProtocol & "://" & CGI.SERVER_NAME & uri.relative />
    
    <cfreturn uri />
</cffunction>

<!--- Returns a Boolean indicating whether or not two URIs are the same, disregarding the following differences:
• Fragments (page anchors), e.g., "#top".
• Inclusion of "index.cfm" in paths, e.g., "/dir/" vs. "/dir/index.cfm" (supports trailing query strings).
If masterUri is not provided, the current page is used for comparison (supports both relative and absolute URIs) --->
<cffunction name="matchUri" returntype="boolean" output="FALSE">
    <cfargument name="testUri" type="string" required="TRUE" />
    <cfargument name="masterUri" type="string" required="FALSE" default="" />
    
    <!--- If a masterUri was not provided --->
    <cfif len(masterUri) EQ 0>
        <!--- If testUri is an absolute URI --->
        <cfif reFindNoCase("^https?://", testUri) EQ 1>
            <cfset masterUri = getPageUri().absolute />
        <cfelse>
            <cfset masterUri = getPageUri().relative />
        </cfif>
    </cfif>
    
    <cfreturn reReplaceNoCase(reReplace(testUri, "##.*", ""), "/index\.cfm(?=\?|$)", "/", "ONE") IS reReplaceNoCase(reReplace(masterUri, "##.*", ""), "/index\.cfm(?=\?|$)", "/", "ONE") />
</cffunction>

<!--- Replace a URI query key and its value with a supplied key=value pair.
Works with relative and absolute URIs, as well as standalone query strings (with or without a leading "?") --->
<cffunction name="replaceUriQueryKey" returntype="string" output="FALSE">
    <cfargument name="uri" type="string" required="TRUE" />
    <cfargument name="key" type="string" required="TRUE" />
    <cfargument name="substring" type="string" required="TRUE" />
    <cfset var preQueryComponents = "" />
    <cfset var currentKey = "" />
    
    <!--- Remove any existing fragment (page anchor) from uri, since it will mess with our processing, and is unlikely to be relevant and/or correct in the new URI --->
    <cfset uri = reReplace(uri, "##.*", "", "ONE") />
    <!--- Store any pre-query URI components. For this to work, the string must start with "protocol:", "//authority", or "/" (path). Otherwise, we will assume the uri is comprised entirely of a query component --->
    <cfset preQueryComponents = reReplace(uri, "^((?:(?:[^:/?.]+:)?//[^/?]+)?(?:/[^?]*)?)?.*", "\1", "ONE") />
    <!--- Remove any pre-query components and the leading question mark from uri --->
    <cfset uri = reReplace(uri, "^(?:(?:[^:/?.]+:)?//[^/?]+)?(?:/[^?]*)?\??(.*)", "\1", "ONE") />
    <!--- Remove any superfluous ampersands in the query (this cleans up the query but is not required, and in any case this function doesn't generate superfluous ampersands) --->
    <cfset uri = reReplace(uri, "&(?=&)|&$", "", "ALL") />
    
    <!--- For each key specified, remove the corresponding key=value pair from uri. Note that key names which contain regex special characters (.,*,+,?,^,$,{,},(,),|,[,],\) which are not percent-encoded may behave unpredictably --->
    <cfloop index="currentKey" list="#key#" delimiters=",">
        <cfif len(currentKey) GT 0>
            <cfset uri = reReplaceNoCase(uri, ("(?:^|&)" & currentKey & "(?:=[^&]*)?"), "", "ALL") />
        </cfif>
    </cfloop>
    
    <!--- If we still have a value in uri after the above processing (beyond what we're about to add) --->
    <cfif len(uri) GT 0>
        <!--- Ensure the query is returned with only the necessary separator characters (? and &) --->
        <cfreturn (preQueryComponents & "?" & reReplace(uri, "^&", "") & reReplace("&" & substring, "&$", "")) />
    <cfelse>
        <!--- Append substring, including a leading question mark if substring is not empty --->
        <cfreturn (preQueryComponents & reReplace("?" & substring, "\?$", "")) />
    </cfif>
</cffunction>

<cffunction name="addUriQueryKey" returntype="string" output="FALSE">
    <cfargument name="uri" type="string" required="TRUE" />
    <cfargument name="key" type="string" required="TRUE" />
    <cfargument name="value" type="string" required="TRUE" />
    
    <!--- Until proper support is included for adding multiple keys with one call, use only the first key --->
    <cfset key = listFirst(key, ",") />
    
    <!--- Remove any existing instances of the key from uri, then add the new key=value pair.
    Do not include the trailing equals sign (=) if we're assigning an empty value to the added key --->
    <cfreturn replaceUriQueryKey(removeUriQueryKey(uri, key), "", (key & reReplace("=" & value, "=$", ""))) />
</cffunction>

<cffunction name="removeUriQueryKey" returntype="string" output="FALSE">
    <cfargument name="uri" type="string" required="TRUE" />
    <!--- Use a comma-delimited list to remove multiple keys with one call --->
    <cfargument name="key" type="string" required="TRUE" />
    
    <cfreturn replaceUriQueryKey(uri, key, "") />
</cffunction>

In other news, this cracked me up.

Thursday, February 01, 2007

parseUri(): Split URLs in ColdFusion

Update: Please view the updated version of this post on my new blog:

parseUri: Split URLs in ColdFusion.

Here's a UDF I wrote recently which allows me to show off my regex skillz. parseUri() splits any well-formed URI into its components (all are optional).

The core code is already very brief, but I could replace the entire contents of the <cfloop> with one line of code if I didn't have to account for bugs in the reFind() function (tested in CF7). Note that all components are split with a single regex (using backreferences). My favorite part of this UDF is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers.

Since the function returns a struct, you can do, e.g., parseUri(someUri).anchor, etc. Check it out:

<!--- By Steven Levithan. Splits any well-formed URI into its components --->
<cffunction name="parseUri" returntype="struct" output="FALSE">
    <cfargument name="sourceUri" type="string" required="TRUE" />
    <!--- Get arrays named len and pos, containing the lengths and positions of each URI component (all are optional) --->
    <cfset var uriPattern = reFind("^(?:([^:/?##.]+):)?(?://)?(([^:/?##]*)(?::(\d*))?)?((/(?:[^?##](?![^?##/]*\.[^?##/.]+(?:[\?##]|$)))*/?)?([^?##/]*))?(?:\?([^##]*))?(?:##(.*))?", sourceUri, 1, TRUE) />
    <!--- Create an array containing the names of each key we will add to the uri struct --->
    <cfset var uriComponentNames = listToArray("source,protocol,authority,domain,port,path,directoryPath,fileName,query,anchor") />
    <cfset var uri = structNew() />
    <cfset var i = 1 />
    
    <!--- Add the following keys to the uri struct:
    • source (when using returnSubExpressions, reFind() returns backreference 0 [i.e., the entire match] as array element 1, so we might as well use it)
    • protocol (scheme)
    • authority (includes both the domain and port)
        • domain (part of the authority component; can be an IP address)
        • port (part of the authority component)
    • path (includes both the directory path and filename)
        • directoryPath (part of the path component; supports directories with periods, and without a trailing slash)
        • fileName (part of the path component)
    • query (does not include the leading question mark)
    • anchor (fragment) --->
    <cfloop index="i" from="1" to="10"><!--- Could also use to="#arrayLen(uriComponentNames)#" --->
        <!--- If the component was found in the source URI...
        • The arrayLen() check is needed to prevent a CF error when sourceUri is empty, because due to an apparent bug, reFind() does not populate backreferences for zero-length capturing groups when run against an empty string (though it does still populate backreference 0)
        • The pos[i] value check is needed to prevent a CF error when mid() is passed a start value of 0, because of the way reFind() considers an optional capturing group that does not match anything to have a pos of 0 --->
        <cfif (arrayLen(uriPattern.pos) GT 1) AND (uriPattern.pos[i] GT 0)>
            <!--- Add the component to its corresponding key in the uri struct --->
            <cfset uri[uriComponentNames[i]] = mid(sourceUri, uriPattern.pos[i], uriPattern.len[i]) />
        <!--- Otherwise, set the key value to an empty string --->
        <cfelse>
            <cfset uri[uriComponentNames[i]] = "" />
        </cfif>
    </cfloop>
    
    <!--- Always end directoryPath with a trailing slash if the path component was present in the source URI (Note that a trailing slash is NOT automatically inserted within or appended to the "path" key) --->
    <cfif len(uri.directoryPath) GT 0>
        <cfset uri.directoryPath = reReplace(uri.directoryPath, "/?$", "/") />
    </cfif>
    
    <cfreturn uri />
</cffunction>

Edit: I've written a JavaScript implementation of the above UDF. See parseUri(): Split URLs in JavaScript.

reMatch(): Improving ColdFusion's regex support

Update: Please see this post on my new blog, which includes a demo of the REMatch function:

REMatch (ColdFusion).

Following are some UDFs I wrote recently to make using regexes in ColdFusion a bit easier. The biggest deal here is my reMatch() function.

reMatch(), in its most basic usage, is similar to JavaScript's String.match() method. Compare getting the first number in a string using reMatch() vs. built-in ColdFusion functions:

  • reMatch():
    <cfset num = reMatch("\d+", string) />
  • reReplace():
    <cfset num = reReplace(string, "\D*(\d+).*", "\1") />
  • reFind():
    <cfset matchInfo = reFind("\d+", string, 1, TRUE) />
    <cfset num = mid(string, matchInfo.pos[1], matchInfo.len[1]) />

All of the above would return the same result, unless a number wasn't found in the string, in which case the reFind()-based method would throw an error since the mid() function would be passed a start value of 0. I think it's pretty clear from the above which approach is easiest to use for a situation like this.

Still, that's just the beginning of what reMatch() can do. Change the scope argument from the default of "ONE" to "ALL" (to follow the convention used by reReplace(), etc.), and the function will return an array of all matches. Finally, set the returnLenPos argument to TRUE and the function will return either a struct or array of structs (based on the value of scope) containing the len, pos, AND value of each match. This is very different from how the returnSubExpressions argument of reFind() works. When using returnSubExpressions, you get back a struct containing arrays of the len and pos (but not value) of each backreference from the first match.

Here's the code, with four additional UDFs (reMatchNoCase(), match(), matchNoCase(), and escapeReChars()) added for good measure:

<!--- UDFs by Steven Levithan --->

<cffunction name="reMatch" output="FALSE">
    <cfargument name="regEx" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfargument name="caseSensitive" type="boolean" required="FALSE" default="TRUE" />
    <cfset var thisMatch = "" />
    <cfset var matchInfo = structNew() />
    <cfset var matches = arrayNew(1) />
    <!--- Set the time before entering the loop --->
    <cfset var timeout = now() />
    
    <!--- Build the matches array. Continue looping until additional instances of regEx are not found. If scope is "ONE", the loop will end after the first iteration --->
    <cfloop condition="TRUE">
        <!--- By using returnSubExpressions (the fourth reFind argument), the position and length of the first match is captured in arrays named len and pos --->
        <cfif caseSensitive>
            <cfset thisMatch = reFind(regEx, string, start, TRUE) />
        <cfelse>
            <cfset thisMatch = reFindNoCase(regEx, string, start, TRUE) />
        </cfif>
        
        <!--- If a match was not found, end the loop --->
        <cfif thisMatch.pos[1] EQ 0>
            <cfbreak />
        <!--- If a match was found, and extended info was requested, append a struct containing the value, length, and position of the match to the matches array --->
        <cfelseif returnLenPos>
            <!--- Create a fresh struct for each match. Structs are stored by reference, so reusing one would leave every array element pointing at the last match --->
            <cfset matchInfo = structNew() />
            <cfset matchInfo.value = mid(string, thisMatch.pos[1], thisMatch.len[1]) />
            <cfset matchInfo.len = thisMatch.len[1] />
            <cfset matchInfo.pos = thisMatch.pos[1] />
            <cfset arrayAppend(matches, matchInfo) />
        <!--- Otherwise, just append the match value to the matches array --->
        <cfelse>
            <cfset arrayAppend(matches, mid(string, thisMatch.pos[1], thisMatch.len[1])) />
        </cfif>
        
        <!--- If only the first match was requested, end the loop --->
        <cfif scope IS "ONE">
            <cfbreak />
        <!--- If the match length was greater than zero --->
        <cfelseif thisMatch.pos[1] + thisMatch.len[1] GT start>
            <!--- Set the start position for the next iteration of the loop to the end position of the match --->
            <cfset start = thisMatch.pos[1] + thisMatch.len[1] />
        <!--- If the match was zero length --->
        <cfelse>
            <!--- Advance the start position for the next iteration of the loop by one, to avoid infinite iteration --->
            <cfset start = start + 1 />
        </cfif>
        
        <!--- If the loop has run for 20 seconds, throw an error, to mitigate against overlong processing. However, note that even one pass using a poorly-written regex which triggers catastrophic backtracking could take longer than 20 seconds --->
        <cfif dateDiff("s", timeout, now()) GTE 20>
            <cfthrow message="Processing too long. Optimize regular expression for better performance" />
        </cfif>
    </cfloop>
    
    <cfif scope IS "ONE">
        <cfparam name="matches[1]" default="" />
        <cfreturn matches[1] />
    <cfelse>
        <cfreturn matches />
    </cfif>
</cffunction>

<cffunction name="reMatchNoCase" output="FALSE">
    <cfargument name="regEx" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfreturn reMatch(regEx, string, start, scope, returnLenPos, FALSE) />
</cffunction>

<cffunction name="match" output="FALSE">
    <cfargument name="substring" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfreturn reMatch(escapeReChars(substring), string, start, scope, returnLenPos, TRUE) />
</cffunction>

<cffunction name="matchNoCase" output="FALSE">
    <cfargument name="substring" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfreturn reMatch(escapeReChars(substring), string, start, scope, returnLenPos, FALSE) />
</cffunction>

<!--- Escape special regular expression characters (.,*,+,?,^,$,{,},(,),|,[,],\) within a string by preceding them with a backslash (\). This allows safely using literal strings within regular expressions --->
<cffunction name="escapeReChars" returntype="string" output="FALSE">
    <cfargument name="string" type="string" required="TRUE" />
    <cfreturn reReplace(string, "[.*+?^${}()|[\]\\]", "\\\0", "ALL") />
</cffunction>
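
For comparison, the looping approach used by reMatch() with scope "ALL" and returnLenPos enabled can be sketched in JavaScript as follows. This is a rough analog, not a port: the function name findMatches is hypothetical, positions are reported as 1-based to mirror ColdFusion's pos values, and the timeout guard is left out.

```javascript
// Hypothetical JavaScript analog of reMatch(regEx, string, 1, "ALL", TRUE).
// Returns an array of {value, pos, len} objects, with pos being 1-based.
function findMatches(pattern, str){
    // Rebuild the regex with the global flag so exec() advances through the string.
    var flags = "g" + (pattern.ignoreCase ? "i" : "") + (pattern.multiline ? "m" : "");
    var re = new RegExp(pattern.source, flags);
    var matches = [];
    var m;

    while((m = re.exec(str)) !== null){
        matches.push({value: m[0], pos: m.index + 1, len: m[0].length});

        // After a zero-length match, advance by one to avoid infinite iteration.
        if(m[0].length === 0){
            re.lastIndex++;
        }
    }

    return matches;
}

var numbers = findMatches(/\d+/, "a1b22c333");
console.log(numbers.length);   // 3
console.log(numbers[1].value); // "22"
console.log(numbers[1].pos);   // 4
console.log(numbers[1].len);   // 2
```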

Now that I've got a deeply featured match function, all I need Adobe to add to ColdFusion in the way of regex support is lookbehinds, atomic groups, possessive quantifiers, conditionals, balancing groups, etc., etc. :-)