Thursday, February 01, 2007

parseUri(): Split URLs in ColdFusion

Update: Please view the updated version of this post on my new blog: parseUri: Split URLs in ColdFusion.

Here's a UDF I wrote recently which allows me to show off my regex skillz. parseUri() splits any well-formed URI into its components (all are optional).

The core code is already very brief, but I could replace the entire contents of the <cfloop> with one line of code if I didn't have to account for bugs in the reFind() function (tested in CF7). Note that all components are split with a single regex (using backreferences). My favorite part of this UDF is its robust support for splitting the directory path and filename (it supports directory names containing periods, and paths without a trailing slash), which I haven't seen matched in other URI parsers.

Since the function returns a struct, you can do, e.g., parseUri(someUri).anchor, etc. Check it out:

<!--- By Steven Levithan. Splits any well-formed URI into its components --->
<cffunction name="parseUri" returntype="struct" output="FALSE">
    <cfargument name="sourceUri" type="string" required="TRUE" />
    <!--- Get arrays named len and pos, containing the lengths and positions of each URI component (all are optional) --->
    <cfset var uriPattern = reFind("^(?:([^:/?##.]+):)?(?://)?(([^:/?##]*)(?::(\d*))?)?((/(?:[^?##](?![^?##/]*\.[^?##/.]+(?:[\?##]|$)))*/?)?([^?##/]*))?(?:\?([^##]*))?(?:##(.*))?", sourceUri, 1, TRUE) />
    <!--- Create an array containing the names of each key we will add to the uri struct --->
    <cfset var uriComponentNames = listToArray("source,protocol,authority,domain,port,path,directoryPath,fileName,query,anchor") />
    <cfset var uri = structNew() />
    <cfset var i = 1 />
    
    <!--- Add the following keys to the uri struct:
    • source (when using returnSubExpressions, reFind() returns backreference 0 [i.e., the entire match] as array element 1, so we might as well use it)
    • protocol (scheme)
    • authority (includes both the domain and port)
        • domain (part of the authority component; can be an IP address)
        • port (part of the authority component)
    • path (includes both the directory path and filename)
        • directoryPath (part of the path component; supports directory names containing periods, and paths without a trailing slash)
        • fileName (part of the path component)
    • query (does not include the leading question mark)
    • anchor (fragment) --->
    <cfloop index="i" from="1" to="10"><!--- Could also use to="#arrayLen(uriComponentNames)#" --->
        <!--- If the component was found in the source URI...
        • The arrayLen() check is needed to prevent a CF error when sourceUri is empty, because due to an apparent bug, reFind() does not populate backreferences for zero-length capturing groups when run against an empty string (though it does still populate backreference 0)
        • The pos[i] value check is needed to prevent a CF error when mid() is passed a start value of 0, because of the way reFind() considers an optional capturing group that does not match anything to have a pos of 0 --->
        <cfif (arrayLen(uriPattern.pos) GT 1) AND (uriPattern.pos[i] GT 0)>
            <!--- Add the component to its corresponding key in the uri struct --->
            <cfset uri[uriComponentNames[i]] = mid(sourceUri, uriPattern.pos[i], uriPattern.len[i]) />
        <!--- Otherwise, set the key value to an empty string --->
        <cfelse>
            <cfset uri[uriComponentNames[i]] = "" />
        </cfif>
    </cfloop>
    
    <!--- Always end directoryPath with a trailing slash if a directory path was present in the source URI (Note that a trailing slash is NOT automatically inserted within or appended to the "path" key) --->
    <cfif len(uri.directoryPath) GT 0>
        <cfset uri.directoryPath = reReplace(uri.directoryPath, "/?$", "/") />
    </cfif>
    
    <cfreturn uri />
</cffunction>
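
To make the returned struct concrete, here is a minimal usage sketch. The sample URI and the variable names sampleUri and parts are made up for illustration and are not part of the UDF; the only thing assumed is that parseUri() is defined on the page as above.

```cfml
<!--- Hypothetical sample URI. A literal # must be doubled inside a CFML string. --->
<cfset sampleUri = "http://www.example.com:8080/dir.name/sub/file.html?q=1##top" />

<!--- parseUri() returns a struct keyed by component name --->
<cfset parts = parseUri(sampleUri) />

<!--- Print each component; keys that were absent from the URI hold empty strings --->
<cfoutput>
    protocol: #parts.protocol#<br />
    authority: #parts.authority#<br />
    domain: #parts.domain#<br />
    port: #parts.port#<br />
    path: #parts.path#<br />
    directoryPath: #parts.directoryPath#<br />
    fileName: #parts.fileName#<br />
    query: #parts.query#<br />
    anchor: #parts.anchor#<br />
</cfoutput>
```

Reading the regex by hand, I'd expect directoryPath to come back as /dir.name/sub/ and fileName as file.html for this input, since a directory name containing a period should not be mistaken for a file. I haven't run this exact URI, so treat that as an expectation rather than verified output.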

Edit: I've written a JavaScript implementation of the above UDF. See parseUri(): Split URLs in JavaScript.
