Wednesday, February 07, 2007

parseUri(): Split URLs in JavaScript

Update: Please see the latest version of this function on my new blog:

parseUri: Split URLs in JavaScript.

For fun, I spent the 10 minutes needed to convert my parseUri() ColdFusion UDF into a JavaScript function.

For those who haven't already seen it, I'll repeat my explanation from the other post…

parseUri() splits any well-formed URI into its parts (all are optional). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., parseUri(someUri).anchor, etc.

I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.

This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.

/* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
Splits any well-formed URI into the following parts (all are optional):
----------------------
• source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
• protocol (scheme)
• authority (includes both the domain and port)
    • domain (part of the authority; can be an IP address)
    • port (part of the authority)
• path (includes both the directory path and filename)
    • directoryPath (part of the path; supports directories with periods, and without a trailing slash)
    • fileName (part of the path)
• query (does not include the leading question mark)
• anchor (fragment)
*/
function parseUri(sourceUri){
    var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
    var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
    var uri = {};
    
    for(var i = 0; i < 10; i++){
        uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
    }
    
    // Always end directoryPath with a trailing slash if a path was present in the source URI
    // Note that a trailing slash is NOT automatically inserted within or appended to the "path" key
    if(uri.directoryPath.length > 0){
        uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
    }
    
    return uri;
}
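
As a quick illustration of the object the function returns, the snippet below repeats parseUri() as posted (only the comments are trimmed) and runs it on the sample URI used in the test page further down. The variable name "parts" is only there for the example.

```javascript
// parseUri() as posted above, with the comments trimmed.
function parseUri(sourceUri){
    var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
    var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
    var uri = {};

    for(var i = 0; i < 10; i++){
        uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
    }

    if(uri.directoryPath.length > 0){
        uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
    }

    return uri;
}

// "parts" is an illustrative name, not something the function defines.
var parts = parseUri("http://www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top");

// parts.protocol      -> "http"
// parts.authority     -> "www.domain.com:81"
// parts.domain        -> "www.domain.com"
// parts.port          -> "81"
// parts.path          -> "/dir1/dir.2/index.html"
// parts.directoryPath -> "/dir1/dir.2/"
// parts.fileName      -> "index.html"
// parts.query         -> "id=1&test=2"
// parts.anchor        -> "top"
```

Parts that are missing from the input come back as empty strings rather than undefined, so each key can be read without checking for its existence first.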

Is there any leaner, meaner URI parser out there? :-)

To make it easier to test this function, here is some code that can be copied and pasted into a new HTML file, allowing you to easily enter URIs and see the results.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Steve's URI Parser</title>
    
    <script type="text/javascript">
    //<![CDATA[
        /* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
        Splits any well-formed URI into the following parts (all are optional):
        ----------------------
        • source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
        • protocol (scheme)
        • authority (includes both the domain and port)
            • domain (part of the authority; can be an IP address)
            • port (part of the authority)
        • path (includes both the directory path and filename)
            • directoryPath (part of the path; supports directories with periods, and without a trailing slash)
            • fileName (part of the path)
        • query (does not include the leading question mark)
        • anchor (fragment)
        */
        function parseUri(sourceUri){
            var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
            var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
            var uri = {};
            
            for(var i = 0; i < 10; i++){
                uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
            }
            
            // Always end directoryPath with a trailing slash if a path was present in the source URI
            // Note that a trailing slash is NOT automatically inserted within or appended to the "path" key
            if(uri.directoryPath.length > 0){
                uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
            }
            
            return uri;
        }
        
        // Dump the test results in the page
        function dumpResults(obj){
            var output = "";
            for (var property in obj){
                output += '<tr><td class="name">' + property + '</td><td class="result">"<span class="value">' + obj[property] + '</span>"</td></tr>';
            }
            document.getElementById('output').innerHTML = "<table>" + output + "</table>";
        }
    //]]>
    </script>
    
    <style type="text/css" media="screen">
        h1 {font-size:1.25em;}
        table {border:solid #333; border-width:1px; background:#f5f5f5; margin:15px 0 0; border-collapse:collapse;}
        td {border:solid #333; border-width:1px 1px 0 0; padding:4px;}
        .name {font-weight:bold;}
        .result {color:#aaa;}
        .value {color:#33c;}
    </style>
</head>
<body>
    <h1>Steve's URI Parser</h1>
    
    <form action="#" onsubmit="dumpResults(parseUri(document.getElementById('uriInput').value)); return false;">
        <div>
            <input id="uriInput" type="text" style="width:500px" value="http://www.domain.com:81/dir1/dir.2/index.html?id=1&amp;test=2#top" />
            <input type="submit" value="Parse" />
        </div>
    </form>
    
    <div id="output">
    </div>
    
    <p><a href="http://badassery.blogspot.com">My blog</a></p>
</body>
</html>

Edit: This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take much longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if several people leave comments asking for the support, I'll go ahead and add it. I could also add more pre-concatenated parts (e.g., "relative" for everything starting with the path) or other stuff like "tld" (for just the top-level domain) if readers think it would be useful.


You might also be looking for my script which fixes the JavaScript split method cross-browser.

26 comments:

Ben said...

Damn, that's a serious regex! Thanks for posting this, it will be very handy. I have been using the following until now. Maybe you can point out if there is something wrong with it:

function parseUrl(data) {
    var e = /((http|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+\.[^#?\s]+)(#[\w\-]+)?/;

    if (data.match(e)) {
        return {
            url: RegExp['$&'],
            protocol: RegExp.$1,
            host: RegExp.$2,
            path: RegExp.$3,
            file: RegExp.$5,
            hash: RegExp.$6
        };
    }
    else {
        return {url: "", protocol: "", host: "", path: "", file: "", hash: ""};
    }
}

Steve said...

Boyan, thanks! As for the code you posted, well, beyond being far less powerful/flexible, the first thing that jumps out at me when looking over the regex is that it wouldn't even match or split the URI "http://www.google.com/". In other words, it's deeply flawed.

Ben said...

Thanks! I'll be using your function from now on.

Anonymous said...

I could definitely use the "name@place" support.

Anonymous said...

A while back I wrote a CF UDF as well, which is basically the same, but passes back a little more information.

I had to make sure segments were supported and also wanted the url parameters to be returned in a more useful state.

http://blog.pengoworks.com/blogger/index.cfm?action=blog:565

Thunder Down Under said...

Might I also suggest looking at http://www.flog.co.nz/index.php/journal/prototype-uri-parser-class/

This is a Prototype-based class which is designed to slot in nicely. It will parse full URIs, such as 'http://user:password@www.flog.co.nz:80/pathname?querystring&key=value#fragment'

Thunder Down Under said...

Here's a link that works:
Prototype URI parser library

Steve said...

@thunder down under:

Poly9's URL parser is weak. Ajaxian posted the reasons I sent them for this, though in my defense I hadn't meant for them to actually publish the list. Rather, it was part of my pitch towards why they might want to feature another URI parser even though they'd done so recently.

IMO, rewriting Poly9's parser to depend on a massive library like Prototype is extra weak.

Steve said...

For those who didn't find this via Ajaxian, here's the link: Ajaxian: parseUri: Another JavaScript URL parser.

Steve said...

Nice work, Dan G. Switzer, II.

BTW, one of the fundamental differences between our two UDFs (which adds some complexity to mine) is that with, e.g., the URIs "/dir/sub" and "/dir/sub?q", your UDF will treat "sub" as the file name, while mine will treat it as part of the directory path. Since many people enter directory paths without a trailing slash (and such URIs work with every HTTP server I'm familiar with), I've found this adjustment to be a necessity.

Also, one issue I noticed during a very brief test is that, e.g., with the URI "www.foo.com:80/dir/", your UDF treats the "80" as part of the directory path, returns no authority, and returns "www.foo.com" as the scheme. Although this may be technically correct according to generic URI syntax (I understand why the scheme comes out the way it does, but I'm not so sure about "80" as part of the directory path), it prevents the common scenario of users entering URIs which start with a domain name, without the leading "//" to identify it as the authority. Other examples of differences are that your UDF will treat "www.foo.com" as a file name, and "www.foo.com/dir/" as one component comprised solely of a directory path. On the other hand, in all of the above cases parseUri() will identify "www.foo.com" as the domain, and "/dir/" as the path. I'm not noting this to claim superiority, but rather to point out additional areas where I've found that slightly diverging from the official generic URI syntax spec allows the function to become much more "real-world ready," and able to actually be tested against end user input.

Finally, I know code brevity was probably not your goal, but page weight becomes especially important with a JavaScript implementation. The over 90 lines of code (after stripping all comments and empty lines) in the post you linked to seems on the heavy side.

Nevertheless, it's a solid, fully-featured implementation, and gives me more incentive to add support for the missing pieces from my function (username/password/segment [these shouldn't add any lines of code], and param splitting).
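
The splitting behavior described in this comment can be checked against the v0.1 function from the post. The sketch below repeats parseUri() with its comments trimmed and runs it on the URIs mentioned above. The variable names are illustrative only.

```javascript
// parseUri() v0.1 as posted, with the comments trimmed.
function parseUri(sourceUri){
    var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
    var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
    var uri = {};

    for(var i = 0; i < 10; i++){
        uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
    }

    if(uri.directoryPath.length > 0){
        uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
    }

    return uri;
}

// A last path segment without a period is treated as a directory, not a file.
var noSlash = parseUri("/dir/sub");
// noSlash.path -> "/dir/sub", noSlash.directoryPath -> "/dir/sub/", noSlash.fileName -> ""

var noSlashQuery = parseUri("/dir/sub?q");
// noSlashQuery.directoryPath -> "/dir/sub/", noSlashQuery.query -> "q"

// A URI that starts with a domain name, without the leading "//".
var bareDomain = parseUri("www.foo.com:80/dir/");
// bareDomain.protocol -> "", bareDomain.domain -> "www.foo.com",
// bareDomain.port -> "80", bareDomain.path -> "/dir/"

var domainOnly = parseUri("www.foo.com");
// domainOnly.domain -> "www.foo.com", domainOnly.fileName -> ""
```

Note that "path" keeps the input as typed ("/dir/sub"), while only "directoryPath" gets the trailing slash appended.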

Anonymous said...

hi,

i'm not familiar with regular expressions, so i tried to extract the user infos as an exercise...
so i added "userInfo", "userName", "password" in between "authority" and "domain" in uriPartNames, and added this part to your regexp :
"(" + "(?:(([^:]+)?(?::)?([^:]+)?)?@)?" + "([^:/?#]*)(?::(\\d*))?)?"

well, it seems to work with :
http://userName:password@www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top
http://userName:@www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top
http://userName@www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top

please tell me if i'm wrong and/or if there is a better way to do it !

thank you.

Steve said...

Seb, that seems pretty reasonable, and after a minute testing it with several URIs it seems to hold up well (aside from when you start a URI with a username/password pair, but I'm not sure if I'd do anything to change the behavior).

BTW, here are a couple ways your addition to the regex can be tweaked, after a quick lookover. Like I mentioned, I haven't looked into this in depth.

• Change "(?::)?" to simply ":?" (the grouping is not necessary to make it optional).
• Replace both instances of "[^:]+" with "[^:@]+" (this will improve efficiency and performance when tested against certain types of values, by reducing the amount of backtracking required).
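
A minimal sketch of those two tweaks applied to Seb's fragment. To keep it short, this is a standalone pattern for the authority only, not the full parseUri() regex, and the names authorityPattern and splitAuthority are hypothetical.

```javascript
// Hypothetical standalone pattern for an authority string only.
// Capturing groups: 1 userInfo, 2 userName, 3 password, 4 domain, 5 port.
// ":?" replaces "(?::)?", and "[^:@]+" replaces "[^:]+".
var authorityPattern = /^(?:(([^:@]+)?:?([^:@]+)?)?@)?([^:\/?#]*)(?::(\d*))?/;

// Hypothetical helper that names the groups the same way parseUri() does.
function splitAuthority(authority){
    var names = ["source","userInfo","userName","password","domain","port"];
    var match = authorityPattern.exec(authority);
    var result = {};

    for(var i = 0; i < names.length; i++){
        // Groups that did not participate in the match become empty strings
        result[names[i]] = (match[i] ? match[i] : "");
    }

    return result;
}

var full = splitAuthority("userName:password@www.domain.com:81");
// full.userInfo -> "userName:password", full.userName -> "userName",
// full.password -> "password", full.domain -> "www.domain.com", full.port -> "81"

var nameOnly = splitAuthority("userName@www.domain.com");
// nameOnly.userName -> "userName", nameOnly.password -> ""

var noUser = splitAuthority("www.domain.com:81");
// noUser.userInfo -> "", noUser.domain -> "www.domain.com", noUser.port -> "81"
```

Excluding "@" from both character classes means the user name and password can never run past the "@" delimiter, which is where the reduced backtracking comes from.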

Whenever I find some time to do more extensive re-testing, I'll go ahead and add support for these and other additional URI parts.

Steve said...

BTW, I've updated my local copy of the regex to include support for usernames and passwords, while also appropriately splitting URIs which start with a username/password pair (i.e., they're not preceded by a protocol and/or "//"). I'll include this in v0.2 of this function, along with a few other minor changes/tweaks. Hopefully I'll release this within a few days (after more testing).

Also, Dan G. Switzer, I've decided against supporting filename param segments (e.g., "file.gif;p=5"), since as far as I understand they're deprecated by RFC 3986, and in any case they can easily be tested for after the fact since they're picked up as part of the file name. I've also decided against returning an array of objects containing the names and values of each discrete query parameter, since this is easy to implement in a separate function when needed (queries have only two, easily distinguishable delimiters: "&" and "="), and it would add to the function's length. I also don't want to get carried away with the idea (e.g., returning arrays containing each subdomain, directory, etc.).

Anonymous said...

I'm not sure this should be within the scope of this function, but I find it useful to be able to actually access the query string variables. As such, I added some code to the function to create an object (called queryVars) that serves as a hash of URL variables. That way you can do parseUri(window.location).queryVars.MyURLVar to access the value of a URL variable. Please note I just did this in 5 minutes and I'm sure it's not foolproof, but it's an idea... The code is as follows:

for(var i = 0; i < 10; i++) {
    uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");

    if ( uriParts[i] && uriPartNames[i] == 'query' ) {
        uri['queryVars'] = {};
        var qString = uriParts[i];
        qString = qString.split('&');
        for (var j=0; j<qString.length; j++) {
            var qVar = qString[j].split('=');
            var qKey = qVar[0];
            var qVal = qVar[1];
            uri['queryVars'][qKey] = qVal;
        }
    }
}

Anonymous said...

Thanks Thomas, I used your queryvars addition. Very nice.

(and Steve. this is a killer function. thank you.)

Derek said...

very nice... what sort of license is this covered by, if any?

Steve said...

@derek:

Thanks. License is MIT-style.

@Thomas Messier and Paul Irish:

Since it's clear that query-splitting is helpful for some users, I've gone ahead and added an implementation of this functionality (to the forthcoming version of parseUri) which uses 4 lines of code and additionally supports query keys which aren't followed by "=" as well as query values which contain "=". This, along with support for userInfo and extensive new demos, is all ready to go, but I'm hoping to release this on my own domain, and I'm currently having some trouble with my new host. I'll include an update here as soon as this is resolved (hopefully within a couple days).
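
For readers who want this before the new version is out, here is a rough sketch of query splitting done as a separate function. It is not the four-line implementation mentioned above. The name splitQuery and its return shape are assumptions made for illustration, and it does no URL decoding.

```javascript
// Hypothetical helper: splits a query string (without the leading "?")
// into an object of key/value strings.
function splitQuery(query){
    var vars = {};

    if(!query){
        return vars;
    }

    var pairs = query.split("&");

    for(var i = 0; i < pairs.length; i++){
        if(pairs[i] === ""){
            continue; // Skip empty segments, e.g. from "a=1&&b=2"
        }

        // Split on the first "=" only, so values may themselves contain "="
        var eq = pairs[i].indexOf("=");
        var key = (eq === -1 ? pairs[i] : pairs[i].substring(0, eq));
        var value = (eq === -1 ? "" : pairs[i].substring(eq + 1));

        vars[key] = value;
    }

    return vars;
}

var simple = splitQuery("id=1&test=2");
// simple.id -> "1", simple.test -> "2"

var tricky = splitQuery("flag&expr=a=b");
// tricky.flag -> "" (key without "="), tricky.expr -> "a=b" (value containing "=")
```

It can be fed the "query" key of the object returned by parseUri(), which already excludes the leading question mark.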

Anonymous said...

Yeah, bad ass piece of code and some really masterful regexery! Saved me a good hour. Keep up the good work

Kris Kowal said...

If you do write user:pass support, please let me know. I'm including your URI parser in my module loader library project. http://cixar.com/tracs/javascript

Steve said...

Well, I'm still having problems with setting up my blog the way I want it with my new host (e.g., they're still trying to resolve issues with URL rewriting, etc.), but since I don't know when everything will be resolved, here's a link to the demo page for the latest version of parseUri:

http://stevenlevithan.com/demo/parseUri/js.cfm

Anonymous said...

the js thinks this is a valid url

http:/example.com

Steve said...

@Scott:

No, it makes no such assumption. It simply splits the URI in the most logical way according to its rules. See my note on how this function intentionally does not attempt to validate the URIs it receives.

josh said...

Thanks Steve for a very useful function. There is one thing I'd like to do with this function, and I'm not sure how to do it - I need to split the hostname further, and only retain the TLD portion. So I would match google.com in mail.google.com and google.com and www.google.com. My pseudo regex for this would be [optional some characters including dots][some characters without dots].com. The first portion I wouldn't need access to. The second portion I would. I'm not quite sure how to express this in real regex, in particular it's not clear how to indicate that a piece of a match should be "named". Is it parens? Anyway, any tips you could give on this would be great. FYI I need to write this function so I can set cookies via JavaScript that can be set in one subdomain and read in another. According to the RFC for cookies you should be able to set the Domain attribute to the TLD portion, prepending a dot, and that cookie will be sent by the browser to subdomains.

Steve said...

@josh

JavaScript doesn't support named capturing groups. I'm assigning names to each part by mapping names from the uriPartNames array to the array of backreferences returned by the RegExp.exec() method. Parentheses are used to capture the backreferences, but not all of the parentheses are part of capturing groups.

As for your task, there are some cases you might not be thinking about. E.g., how would "www.google.co.uk", "64.233.287.99", or something like "localhost" be handled? (By the way, the Top Level Domain from your example would be "com", not "google.com".)
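
To make those edge cases concrete, here is a deliberately naive sketch that keeps only the last two dot-separated labels of a host name. The function name naiveBaseDomain is hypothetical, and the approach is shown to illustrate where it breaks, not as a recommended solution.

```javascript
// Hypothetical helper: returns the last two labels of a host name.
// This is a naive heuristic, shown only to illustrate its failure cases.
function naiveBaseDomain(host){
    var labels = host.split(".");

    if(labels.length <= 2){
        return host;
    }

    return labels.slice(-2).join(".");
}

var ok1 = naiveBaseDomain("mail.google.com");   // "google.com"
var ok2 = naiveBaseDomain("google.com");        // "google.com"
var bad1 = naiveBaseDomain("www.google.co.uk"); // "co.uk" (not a usable cookie domain)
var bad2 = naiveBaseDomain("192.0.2.1");        // "2.1" (an IP address has no base domain)
var single = naiveBaseDomain("localhost");      // "localhost"
```

Handling multi-part suffixes such as "co.uk" correctly needs a list of known suffixes rather than a regex alone, and IP addresses and single-label hosts would have to be detected and skipped beforehand.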

Anonymous said...

This is remarkable info you have posted mate. really will help me a lot. thanks a lot Thomas.

Anonymous said...

Can you prepend the license in the JavaScript file?