Update: Please view the updated version of this post on my new blog:
Here's a UDF I wrote recently which allows me to show off my regex skillz. parseUri()
splits any well-formed URI into its components (all are optional).
The core code is already very brief, but I could replace the entire contents of the <cfloop> with one line of code if I didn't have to account for bugs in the reFind() function (tested in CF7). Note that all components are split with a single regex (using backreferences). My favorite part of this UDF is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers.
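To make the backreference mechanics concrete, here is a minimal sketch of what reFind() hands back when its fourth argument (returnSubExpressions) is TRUE. The pattern, the sample string, and the variable names are hypothetical and are not part of parseUri().

```cfml
<!--- Hypothetical pattern and input, used only to show the shape of reFind()'s result --->
<cfset sample = "name=value" />
<cfset found = reFind("(\w+)=(\w+)", sample, 1, TRUE) />

<!--- found.pos and found.len are parallel arrays: element 1 describes the whole match
      (backreference 0), and elements 2 and up describe the capturing groups in order --->
<cfoutput>
<cfloop index="n" from="1" to="#arrayLen(found.pos)#">
    <!--- A pos of 0 marks a group that matched nothing; mid() errors on a start of 0, so skip it --->
    <cfif found.pos[n] GT 0>
        #n#: #mid(sample, found.pos[n], found.len[n])#<br />
    </cfif>
</cfloop>
</cfoutput>
```

Feeding each pos/len pair to mid() is the same move the UDF makes inside its <cfloop>, just with ten named components instead of two anonymous groups.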
Since the function returns a struct, you can do, e.g., parseUri(someUri).anchor, etc. Check it out:
<!--- By Steven Levithan. Splits any well-formed URI into its components --->
<cffunction name="parseUri" returntype="struct" output="FALSE">
    <cfargument name="sourceUri" type="string" required="TRUE" />

    <!--- Get arrays named len and pos, containing the lengths and positions of each URI component (all are optional) --->
    <cfset var uriPattern = reFind("^(?:([^:/?##.]+):)?(?://)?(([^:/?##]*)(?::(\d*))?)?((/(?:[^?##](?![^?##/]*\.[^?##/.]+(?:[\?##]|$)))*/?)?([^?##/]*))?(?:\?([^##]*))?(?:##(.*))?", sourceUri, 1, TRUE) />
    <!--- Create an array containing the names of each key we will add to the uri struct --->
    <cfset var uriComponentNames = listToArray("source,protocol,authority,domain,port,path,directoryPath,fileName,query,anchor") />
    <cfset var uri = structNew() />
    <cfset var i = 1 />

    <!---
        Add the following keys to the uri struct:
        • source (when using returnSubExpressions, reFind() returns backreference 0 [i.e., the entire match] as array element 1, so we might as well use it)
        • protocol (scheme)
        • authority (includes both the domain and port)
        • domain (part of the authority component; can be an IP address)
        • port (part of the authority component)
        • path (includes both the directory path and filename)
        • directoryPath (part of the path component; supports directories with periods, and without a trailing slash)
        • fileName (part of the path component)
        • query (does not include the leading question mark)
        • anchor (fragment)
    --->
    <cfloop index="i" from="1" to="10"><!--- Could also use to="#arrayLen(uriComponentNames)#" --->
        <!---
            If the component was found in the source URI...
            • The arrayLen() check is needed to prevent a CF error when sourceUri is empty, because due to an apparent bug, reFind() does not populate backreferences for zero-length capturing groups when run against an empty string (though it does still populate backreference 0)
            • The pos[i] value check is needed to prevent a CF error when mid() is passed a start value of 0, because of the way reFind() considers an optional capturing group that does not match anything to have a pos of 0
        --->
        <cfif (arrayLen(uriPattern.pos) GT 1) AND (uriPattern.pos[i] GT 0)>
            <!--- Add the component to its corresponding key in the uri struct --->
            <cfset uri[uriComponentNames[i]] = mid(sourceUri, uriPattern.pos[i], uriPattern.len[i]) />
        <!--- Otherwise, set the key value to an empty string --->
        <cfelse>
            <cfset uri[uriComponentNames[i]] = "" />
        </cfif>
    </cfloop>

    <!--- Always end directoryPath with a trailing slash if the path component was present in the source URI (Note that a trailing slash is NOT automatically inserted within or appended to the "path" key) --->
    <cfif len(uri.directoryPath) GT 0>
        <cfset uri.directoryPath = reReplace(uri.directoryPath, "/?$", "/") />
    </cfif>

    <cfreturn uri />
</cffunction>
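As a usage sketch of the struct access described above, the snippet below calls parseUri() and prints a few of the returned keys. It assumes the UDF is already defined or included on the page. The sample URI and the variable names sampleUri and parts are made up for illustration.

```cfml
<!--- Hypothetical sample URI. Inside a CFML string literal, a literal # must be doubled as ## --->
<cfset sampleUri = "http://www.example.com:8080/docs/v1.2/index.html?lang=en##intro" />
<cfset parts = parseUri(sampleUri) />

<!--- Every key is always present; components missing from the URI come back as empty strings --->
<cfoutput>
protocol: #parts.protocol#<br />
domain: #parts.domain#<br />
port: #parts.port#<br />
directoryPath: #parts.directoryPath#<br />
fileName: #parts.fileName#<br />
query: #parts.query#<br />
anchor: #parts.anchor#<br />
</cfoutput>
```

Reading the pattern by hand, the period in the v1.2 directory should stay in directoryPath (/docs/v1.2/) and index.html should land in fileName, which is the directories-with-periods case the post highlights. Because the keys always exist, a test such as len(parts.query) GT 0 is enough to check whether a component was present, with no need for structKeyExists().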
Edit: I've written a JavaScript implementation of the above UDF. See parseUri(): Split URLs in JavaScript.