Thursday, February 01, 2007

reMatch(): Improving ColdFusion's regex support

Update: Please see this post on my new blog, which includes a demo of the REMatch function:

REMatch (ColdFusion).

Following are some UDFs I wrote recently to make using regexes in ColdFusion a bit easier. The biggest deal here is my reMatch() function.

reMatch(), in its most basic usage, is similar to JavaScript's String.match() method. Compare getting the first number in a string using reMatch() vs. built-in ColdFusion functions:

  • reMatch():
    <cfset num = reMatch("\d+", string) />
  • reReplace():
    <cfset num = reReplace(string, "\D*(\d+).*", "\1") />
  • reFind():
    <cfset matchInfo = reFind("\d+", string, 1, TRUE) />
    <cfset num = mid(string, matchInfo.pos[1], matchInfo.len[1]) />

All of the above return the same result when the string contains a number. When it doesn't, they diverge: reMatch() returns an empty string, the reReplace() version returns the original string unchanged (since its pattern never matches), and the reFind()-based method throws an error because mid() is passed a start value of 0. I think it's pretty clear from the above which approach is easiest to use for a situation like this.
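For comparison, here is a minimal sketch of the guard the reFind()-based approach would need to avoid that error. It assumes the same `string` variable as the snippets above and falls back to an empty string, mirroring what reMatch() does when nothing matches.

```cfml
<!--- Capture the position and length of the first run of digits --->
<cfset matchInfo = reFind("\d+", string, 1, TRUE) />

<!--- pos[1] is 0 when nothing matched, so only call mid() on a real match --->
<cfif matchInfo.pos[1] GT 0>
    <cfset num = mid(string, matchInfo.pos[1], matchInfo.len[1]) />
<cfelse>
    <cfset num = "" />
</cfif>
```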

Still, that's just the beginning of what reMatch() can do. Change the scope argument from the default of "ONE" to "ALL" (to follow the convention used by reReplace(), etc.), and the function will return an array of all matches. Finally, set the returnLenPos argument to TRUE and the function will return either a struct or array of structs (based on the value of scope) containing the len, pos, AND value of each match. This is very different from how the returnSubExpressions argument of reFind() works. When using returnSubExpressions, you get back a struct containing arrays of the len and pos (but not value) of each backreference from the first match.
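As a rough illustration of those arguments, the sketch below calls reMatch() three ways. It assumes the reMatch() UDF from this post is already defined on the page (on ColdFusion 8 and later, which ship a built-in REMatch(), the UDF would likely need a different name). The variable names are made up for the example.

```cfml
<cfset str = "10 apples, 200 pears" />

<!--- Default scope ("ONE"): a single string, here "10" --->
<cfset firstNum = reMatch("\d+", str) />

<!--- scope "ALL": an array holding every match, here "10" and "200" --->
<cfset allNums = reMatch("\d+", str, 1, "ALL") />

<!--- returnLenPos TRUE with scope "ONE": one struct with value, len, and pos keys --->
<cfset info = reMatch("\d+", str, 1, "ONE", TRUE) />
<cfoutput>#info.value# starts at position #info.pos# and is #info.len# characters long</cfoutput>
```

With scope "ALL" and returnLenPos TRUE, the return value is an array of such structs, one per match.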

Here's the code, with four additional UDFs (reMatchNoCase(), match(), matchNoCase(), and escapeReChars()) added for good measure:

<!--- UDFs by Steven Levithan --->

<cffunction name="reMatch" output="FALSE">
    <cfargument name="regEx" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfargument name="caseSensitive" type="boolean" required="FALSE" default="TRUE" />
    <cfset var thisMatch = "" />
    <cfset var matchInfo = structNew() />
    <cfset var matches = arrayNew(1) />
    <!--- Set the time before entering the loop --->
    <cfset var timeout = now() />
    
    <!--- Build the matches array. Continue looping until additional instances of regEx are not found. If scope is "ONE", the loop will end after the first iteration --->
    <cfloop condition="TRUE">
        <!--- By using returnSubExpressions (the fourth reFind argument), the position and length of the first match are captured in arrays named pos and len --->
        <cfif caseSensitive>
            <cfset thisMatch = reFind(regEx, string, start, TRUE) />
        <cfelse>
            <cfset thisMatch = reFindNoCase(regEx, string, start, TRUE) />
        </cfif>
        
        <!--- If a match was not found, end the loop --->
        <cfif thisMatch.pos[1] EQ 0>
            <cfbreak />
        <!--- If a match was found, and extended info was requested, append a struct containing the value, length, and position of the match to the matches array --->
        <cfelseif returnLenPos>
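            <!--- Start from a fresh struct for each match. Structs are appended by reference, so reusing one struct would leave every array element pointing at the last match --->
            <cfset matchInfo = structNew() />
            <cfset matchInfo.value = mid(string, thisMatch.pos[1], thisMatch.len[1]) />
            <cfset matchInfo.len = thisMatch.len[1] />
            <cfset matchInfo.pos = thisMatch.pos[1] />
            <cfset arrayAppend(matches, matchInfo) />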
        <!--- Otherwise, just append the match value to the matches array --->
        <cfelse>
            <cfset arrayAppend(matches, mid(string, thisMatch.pos[1], thisMatch.len[1])) />
        </cfif>
        
        <!--- If only the first match was requested, end the loop --->
        <cfif scope IS "ONE">
            <cfbreak />
        <!--- If the match length was greater than zero --->
        <cfelseif thisMatch.len[1] GT 0>
            <!--- Set the start position for the next iteration of the loop to the end position of the match --->
            <cfset start = thisMatch.pos[1] + thisMatch.len[1] />
        <!--- If the match was zero length --->
        <cfelse>
            <!--- Move one character past the zero-length match, to avoid infinite iteration and to avoid recording the same empty match twice --->
            <cfset start = thisMatch.pos[1] + 1 />
        </cfif>
        
        <!--- If the loop has run for 20 seconds, throw an error, to guard against overlong processing. Note, however, that even a single pass with a poorly written regex that triggers catastrophic backtracking could take longer than 20 seconds, and this check only runs between passes --->
        <cfif dateDiff("s", timeout, now()) GTE 20>
            <cfthrow message="Processing too long. Optimize regular expression for better performance" />
        </cfif>
    </cfloop>
    
    <cfif scope IS "ONE">
        <!--- Return the first match, or an empty string if nothing matched --->
        <cfif arrayLen(matches)>
            <cfreturn matches[1] />
        </cfif>
        <cfreturn "" />
    <cfelse>
        <cfreturn matches />
    </cfif>
</cffunction>

<cffunction name="reMatchNoCase" output="FALSE">
    <cfargument name="regEx" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfreturn reMatch(regEx, string, start, scope, returnLenPos, FALSE) />
</cffunction>

<cffunction name="match" output="FALSE">
    <cfargument name="substring" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfreturn reMatch(escapeReChars(substring), string, start, scope, returnLenPos, TRUE) />
</cffunction>

<cffunction name="matchNoCase" output="FALSE">
    <cfargument name="substring" type="string" required="TRUE" />
    <cfargument name="string" type="string" required="TRUE" />
    <cfargument name="start" type="numeric" required="FALSE" default="1" />
    <cfargument name="scope" type="string" required="FALSE" default="ONE" />
    <cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
    <cfreturn reMatch(escapeReChars(substring), string, start, scope, returnLenPos, FALSE) />
</cffunction>

<!--- Escape special regular expression characters (. * + ? ^ $ { } ( ) | [ ] \) within a string by preceding them with a backslash (\). This allows safely using literal strings within regular expressions --->
<cffunction name="escapeReChars" returntype="string" output="FALSE">
    <cfargument name="string" type="string" required="TRUE" />
    <cfreturn reReplace(string, "[.*+?^${}()|[\]\\]", "\\\0", "ALL") />
</cffunction>
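A short usage sketch for the literal-matching helpers, assuming all of the UDFs above are defined on the page. The sample text and variable names are invented for illustration.

```cfml
<cfset receipt = "Total: $5.00 (approx.)" />

<!--- Passed straight to reMatch(), "$5.00" is read as a regex, so "$" and "." act as metacharacters --->
<cfset asRegex = reMatch("$5.00", receipt) />

<!--- match() runs the substring through escapeReChars() first, so it is searched for as literal text --->
<cfset asLiteral = match("$5.00", receipt) />

<!--- escapeReChars() can also be called directly when building a larger pattern around user-supplied text --->
<cfset pattern = "Total:\s*" & escapeReChars("$5.00") />
<cfset withLabel = reMatch(pattern, receipt) />
```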

Now that I've got a deeply featured match function, all I need Adobe to add to ColdFusion in the way of regex support is lookbehinds, atomic groups, possessive quantifiers, conditionals, balancing groups, etc., etc. :-)

5 comments:

Anonymous said...

Hey Steve,

After playing with your REMatch method (which has helped me more than once) I made a little change. I added 'SUB' to the scope argument, which will loop over each match and return sub-matches. You can read all about it here. I don't have code snippets in the blog yet, but there is a download available. If you have any suggestions please let me know.

Anonymous said...

Didn't mean to be Anon on that last one!

Steve said...

Andrew,

Glad to hear this helped somebody. That is a potentially quite useful modification you made. ReMatch already had the potential to return tons of information about matches (i.e., the len, pos, and value of every match within a target string), but your modification ultimately results in a function capable of returning more info about matches via one function call than any regex-related function I personally know of in any programming language.

One note: When I wrote this, I wasn't aware that you could use underlying Java regex methods in ColdFusion. If I ever get around to releasing an updated version of ReMatch, it will use the Java methods, which offer faster speed and more powerful regular expression features (e.g., lookbehind). That would be my main suggestion for your CFC... use Java, if possible.
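As a rough sketch of what that could look like (not the eventual ReMatch rewrite, just the general idea), the java.util.regex classes can be reached through createObject(). The variable names and sample string here are illustrative only, and the pattern uses a lookbehind to show a feature the built-in functions lack.

```cfml
<!--- Load the Java Pattern class and compile a regex that uses a lookbehind --->
<cfset patternClass = createObject("java", "java.util.regex.Pattern") />
<cfset pattern = patternClass.compile("(?<=\$)\d+") />

<!--- Create a matcher for the target string --->
<cfset matcher = pattern.matcher("Cost: $45, qty 3") />

<!--- Collect every match; find() advances to the next match on each call --->
<cfset javaMatches = arrayNew(1) />
<cfloop condition="matcher.find()">
    <cfset arrayAppend(javaMatches, matcher.group()) />
</cfloop>

<!--- javaMatches now holds only "45", the digits preceded by a dollar sign --->
```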

Thanks for posting!

Anonymous said...

Hey Steve, me again. I gave the java.util.regex package a shot, and was able to get a basic version working. Check it out here

Anonymous said...

Hi Steve.

I'm trying to use CF and a RegEx to replace a part of a word/s in a long string - but only if it's not a url or an email address. Apparently this is very difficult as 'lookbehinds' are not supported.

Is there a use or combination of uses of your reMatch tag that could help?

The rest of my comment was rejected by the comment system, so I have put it here:
www.tandabui.com/steve.txt

Matt