Wednesday, August 08, 2007

RegexPal, and other goodies on my new blog

Just a quick note to anyone who's still subscribing to this feed: Make sure to head on over to my new regex / JavaScript blog, where there have been lots of interesting posts and comments since I last posted here (the old stuff has been migrated as well). In particular, check out RegexPal, a new regular expression tester which raises the bar for ease of use through features like real-time regex syntax and match highlighting.

See you there. :)

Wednesday, May 30, 2007

This Blog Has Moved (and XRegExp)

Blogger was a nice and easy introduction to the world of blogging, but alas, the time has come to move on. I now have my own domain name, and a shared hosting provider which supports both PHP and ColdFusion. Woohoo! Here's my new JavaScript / regex blog.

For feed subscribers, please update the this blog's feed URL to point here, as I'll no longer be posting on blogspot. In addition to a fancy new extended JavaScript regular expression constructor (XRegExp) which I've already posted over there, most of the stuff from here has been migrated, and in the process I've updated several posts and added demos for a few more, including Leet Translator, REMatch, and both the ColdFusion and JavaScript implementations of parseUri. Check 'em out.

Thursday, May 10, 2007

Fun with JavaScript variable types and constructors

Try running the following code in Firebug (the results are shown in trailing comments):

console.log(typeof null); // object
console.log(null instanceof Object); // false
console.log(typeof [1,2,3]); // object
console.log(typeof /regex/); // function in Firefox; object in IE
console.log(typeof new String()); // object
console.log(typeof Object); // function
console.log(Object instanceof Object); // true
console.log(Object instanceof Function); // true
console.log(Function.constructor); // Function()
console.log(Function.constructor.constructor); // Function()
console.log(window.constructor); // function() [note the lowercase "f"]
console.log(window.constructor.constructor); // Object()
console.log(window.constructor.constructor.constructor); // Function()
console.log(Function()); // anonymous()

console.log(typeof NaN); // number
console.log(NaN.constructor); // Number()
console.log(NaN instanceof Number); // false
console.log(NaN == NaN); // false
console.log(null + 1); // 1
console.log(null + null); // 0
console.log(undefined + 1); // NaN
console.log(null + "string"); // nullstring
console.log(undefined + "string"); // undefinedstring
console.log({} == 0); // false
console.log([] == 0); // true
console.log(1.0.toFixed(2)); // 1.00

console.log(new Boolean(false) == false); // true
console.log(new Boolean(false) === false); // false

Surprised by any of those results? If not, you're probably either quite knowledgeable about JavaScript variable types, type conversion, and constructors, or you don't fully understand some of the peculiarities and seeming contradictions. (If you have any questions, feel free to ask in the comments.)

Monday, April 16, 2007

Mastering Regular Expressions, 3rd Edition

Book cover

So, after reading positive reviews about it since I started using regular expressions a little over a year ago, I finally purchased O'Reilly Media's Mastering Regular Expressions by Jeffrey E. F. Friedl, after discovering that a third edition came out in August 2006. The book arrived today, and it is indeed pretty damn excellent (I'm as excited about it as I can be about a tech book, anyway). I've only spent a few minutes with it so far, but I can see that it is excellently presented, and there is much to discover. Hopefully I'll post about some cool things I learn over the next few weeks, if I get a chance to actually sit down and read through a significant portion of the book.

Thursday, April 05, 2007

Faking Conditionals in Regular Expressions

Update: This post contains several major errors. Please view the updated version on my new blog:

Mimic Regular Expression Conditionals.

Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted regex features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the libraries I'm familiar with, conditionals are only supported by .NET, PCRE, PHP (when using PCRE via the preg functions), and JGSoft products (including RegexBuddy).

There are two kinds of regex conditionals in those libraries... lookaround-based and capturing-group-based. The functionality of lookaround-based conditionals is very easy to replicate. First, here's what such conditionals look like (this example uses a positive lookahead for the assertion):

(?(?=if_assertion)then|else)

To mimic that behavior in languages which don't support conditionals, just add a colon after the initial question mark to turn it into a non-capturing group, like so:

(?:(?=if_assertion)then|else)

As long as the regex engine you're using supports the specified lookaround type, those patterns do the same thing.

However, mimicking capturing-group-based conditionals proved to be more tricky. Conditionals which use an optional capturing group as their test allow you to base logic on whether a capturing group has participated in the match so far. Thus...

(a)?b(?(1)c|d)

...matches only "bd" and "abc". That pattern can be expressed as follows:

(if_matched)?inner_pattern(?(1)then|else)

Here's a comparable pattern I created which doesn't require support for conditionals:

(?=(a)()|())\1?b(?:\2c|\3d)

To use it without an "else" part, you still need to include "\3" at the end, like this:

(?=(a)()|())\1?b(?:\2c|\3)

As a brief explanation of how that works, there's an empty alternation option within the lookahead at the beginning which is used to cancel the effect of the lookahead, while at the same time, the intentionally empty capturing groups within the alternation are exploited to base the then/else part on which option in the lookahead matched. However, there are a couple issues:

  • This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.
  • It interacts with backtracking differently than a real conditonal (the "a" part is treated as if it were within an optional, atomic group... e.g., (?>(a)?) instead of (a)?), so it's best to think of this as a new operator which is similar to a conditional.

Here are the regex engines I've briefly tested this pattern with:

Language Supports "fake conditionals" Supports real conditionals Notes
.NET Yes Yes Tested using Expresso.
ColdFusion Yes No Tested using ColdFusion 7.
Java Yes No Tested using Regular Expression Test Applet.
JavaScript No No JavaScript assigns an empty string, rather than null, to backreferences for non-participating capturing groups. Unfortunately, this pattern depends on the way most other libraries handle non-participating capturing groups.
JGSoft Yes
(buggy)
Yes As of RegexBuddy version 2.3.2, it performs correctly in more cases if you change the two empty capturing groups ("()") to match a zero-length value of no impact, such as "(.{0})" or "(\b|\B)". It also has problems matching values at the end of a string when using an empty else. I've reported both issues to JGSoft, and have been told they will be fixed in the next version of RegexBuddy.
PHP Yes
(buggy)
Yes Tested using PHP Regex Tester. Performs correctly in more cases if you explicitly state the condition twice, like so: "(?=(?:a)()|())(?:a)?b(?:\1c| \2d)".

If you discover ways to improve this, or find problems not already mentioned, please let me know.

Tuesday, April 03, 2007

Faking Atomic Groups in Regular Expressions

So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))"), even though, of course, lookarounds don't actually match anything. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference 0)!

Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.

Here's how it's done:

(?=(pattern to make atomic))\1

Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually match anything, so the rest of the regex can't backtrack into it), followed by "\1" (the backreference you just captured), to act just like an atomic group. That appears to produce the exactly same result as "(?>pattern to make atomic)", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support positive lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking cool.

Monday, April 02, 2007

Regexes in Depth: Matching Innermost HTML Elements

Update: Please view the updated, syntax-highlighted version of this post on my new blog:

Matching Innermost HTML Elements.

On a regular expression forum I visit every once in awhile, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match "<table>...</table>", but should only match the inner table within "<table>...<table>...</table>...</table>". This logic needed to support an unlimited amount of nested tables.

One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.

It's true that, unless you're working with .NET or Perl, regexes are incapable of recursion (although it's often possible to fake it to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. ;-)

Here's the solution I offered (though there were a few steps to get there):

<table(?:\s[^>]*)?>(?:(?>[^<]+)|<(?!table(?:\s[^>]*)?>))*?</table>

That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also very fast, and it can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).

To demonstrate, the above regex matches the highlighted text below:

<table><td><table><td>&nbsp;</td></table></td></table> <table><tr><td>&nbsp;</td></tr></table> <table></table>

In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed):

<table>(?:[\S\s](?!<table>))*?</table>

Basically, the way that works is it matches an opening "<table>" tag, then it looks at each following character one at a time and checks if they are followed by another instance of "<table>" before "</table>". If so, the match fails, because it's not an innermost table.

In theory, at least. Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening "<table>", but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.

To address both of those issues, I used the following regex:

<table>(?:[^<]+|<(?!table>))*?</table>

First, that increases speed (in theory... you'll see that there are problems with this as is), because within each <table> tag it will greedily jump between all characters which are not "<" in single steps (using "[^<]+"), and it will only use the negative lookahead when it encounters "<". Secondly, it solves the previously noted error by using "<(?!table>)" instead of ".(?!<table>)".

If you're wondering about table tags which contain attributes, that's not a problem. The construct is such that it can easily be extended to support element attributes. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):

<table(?:\s[^>]*)?>(?:[^<]+|<(?!table(?:\s[^>]*)?>))*?</table>

At first I thought this closed the case... The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering catastrophic backtracking. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The change to the regex is highlighted:

<table(?:\s[^>]*)?>(?:(?>[^<]+)|<(?!table(?:\s[^>]*)?>))*?</table>

And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in less than 0.01 second on my system.

However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, a solid approach which is not susceptible to catastrophic backtracking would be:

<table(?:\s[^>]*)?>(?!<table(?:\s[^>]*)?>)(?:[\S\s](?!<table(?:\s[^>]*)?>))*?</table>

That runs just a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.

If you have a copy of RegexBuddy (and if you don't, I highly recommend it), run these regexes through RegexBuddy's debugger for an under-the-hood look at how they're handled by a regex engine.

Edit: Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:

<table(?:\s[^>]*)?>(?:(?=([^<]+))\1|<(?!table(?:\s[^>]*)?>))*?</table>

Basically, that uses a capturing group inside a positive lookahead followed by "\1" to act just like an atomic group!

Wednesday, February 07, 2007

parseUri(): Split URLs in JavaScript

Update: Please see the latest version of this function on my new blog:

parseUri: Split URLs in JavaScript.

For fun, I spent the 10 minutes needed to convert my parseUri() ColdFusion UDF into a JavaScript function.

For those who haven't already seen it, I'll repeat my explanation from the other post…

parseUri() splits any well-formed URI into its parts (all are optional). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing backslash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., parseUri(someUri).anchor, etc.

I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.

This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.

/* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
Splits any well-formed URI into the following parts (all are optional):
----------------------
• source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
• protocol (scheme)
• authority (includes both the domain and port)
    • domain (part of the authority; can be an IP address)
    • port (part of the authority)
• path (includes both the directory path and filename)
    • directoryPath (part of the path; supports directories with periods, and without a trailing backslash)
    • fileName (part of the path)
• query (does not include the leading question mark)
• anchor (fragment)
*/
function parseUri(sourceUri){
    var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
    var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
    var uri = {};
    
    for(var i = 0; i < 10; i++){
        uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
    }
    
    // Always end directoryPath with a trailing backslash if a path was present in the source URI
    // Note that a trailing backslash is NOT automatically inserted within or appended to the "path" key
    if(uri.directoryPath.length > 0){
        uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
    }
    
    return uri;
}

Is there any leaner, meaner URI parser out there? :-)

To make it easier to test this function, here is some code that can be copied and pasted into a new HTML file, allowing you to easily enter URIs and see the results.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Steve's URI Parser</title>
    
    <script type="text/javascript">
    //<![CDATA[
        /* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
        Splits any well-formed URI into the following parts (all are optional):
        ----------------------
        • source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
        • protocol (scheme)
        • authority (includes both the domain and port)
            • domain (part of the authority; can be an IP address)
            • port (part of the authority)
        • path (includes both the directory path and filename)
            • directoryPath (part of the path; supports directories with periods, and without a trailing backslash)
            • fileName (part of the path)
        • query (does not include the leading question mark)
        • anchor (fragment)
        */
        function parseUri(sourceUri){
            var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
            var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
            var uri = {};
            
            for(var i = 0; i < 10; i++){
                uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
            }
            
            // Always end directoryPath with a trailing backslash if a path was present in the source URI
            // Note that a trailing backslash is NOT automatically inserted within or appended to the "path" key
            if(uri.directoryPath.length > 0){
                uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
            }
            
            return uri;
        }
        
        // Dump the test results in the page
        function dumpResults(obj){
            var output = "";
            for (var property in obj){
                output += '<tr><td class="name">' + property + '</td><td class="result">"<span class="value">' + obj[property] + '</span>"</td></tr>';
            }
            document.getElementById('output').innerHTML = "<table>" + output + "</table>";
        }
    //]]>
    </script>
    
    <style type="text/css" media="screen">
        h1 {font-size:1.25em;}
        table {border:solid #333; border-width:1px; background:#f5f5f5; margin:15px 0 0; border-collapse:collapse;}
        td {border:solid #333; border-width:1px 1px 0 0; padding:4px;}
        .name {font-weight:bold;}
        .result {color:#aaa;}
        .value {color:#33c;}
    </style>
</head>
<body>
    <h1>Steve's URI Parser</h1>
    
    <form action="#" onsubmit="dumpResults(parseUri(document.getElementById('uriInput').value)); return false;">
        <div>
            <input id="uriInput" type="text" style="width:500px" value="http://www.domain.com:81/dir1/dir.2/index.html?id=1&amp;test=2#top" />
            <input type="submit" value="Parse" />
        </div>
    </form>
    
    <div id="output">
    </div>
    
    <p><a href="http://badassery.blogspot.com">My blog</a></p>
</body>
</html>

Edit: This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take much longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if several people leave comments asking for the support, I'll go ahead and add it. I could also add more pre-concatenated parts (e.g., "relative" for everything starting with the path) or other stuff like "tld" (for just the top-level domain) if readers think it would be useful.

Update: Please see the latest version of this function on my new blog:

parseUri: Split URLs in JavaScript.


You might also be looking for my script which fixes the JavaScript split method cross-browser.