tag:blogger.com,1999:blog-247443742024-02-20T07:58:54.242-05:00Flagrant BadasseryI ♥ ninjas.Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.comBlogger24125tag:blogger.com,1999:blog-24744374.post-37934698150392560722021-07-30T09:27:00.003-05:002022-06-14T03:32:00.153-05:00New Blog Slev.Life<p>I've recently launched a shiny new blog where I'm posting about everything unrelated to programming.</p><p>Want to learn about <a href="https://slev.life/hyperphantasia">aphantasia and hyperphantasia</a>, the time I unmasked cult leader <a href="https://slev.life/karen-zerby-unmasked">Karen Zerby</a>, why <a href="https://slev.life/south-dakota-residency-for-nomads">South Dakota residency</a> is an ideal choice for nomads, or the <a href="https://slev.life/best-english-teaching-books">best English teaching books</a>?</p><p>All of these and more await at <b><a href="https://slev.life/">Lifecurious</a></b>.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-4274565460811973502007-08-08T16:33:00.002-05:002021-07-30T09:09:58.115-05:00RegexPal, and other goodies on my new blog<p>Just a quick note to anyone who's still subscribing to this feed: Make sure to head on over to my new <a href="https://blog.stevenlevithan.com/">regex / JavaScript blog</a>, where there have been lots of interesting posts and comments since I last posted here (the old stuff has been migrated as well). In particular, check out <a href="https://stevenlevithan.com/regexpal/" title="regex tester">RegexPal</a>, a new <a href="https://stevenlevithan.com/regexpal/">regular expression tester</a> which raises the bar for ease of use through features like real-time regex syntax and match highlighting.</p>
<p>See you there. :)</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com1tag:blogger.com,1999:blog-24744374.post-26673522292456970272007-05-30T17:21:00.003-05:002021-07-07T20:20:01.663-05:00This Blog Has Moved (and XRegExp)<p>Blogger was a nice and easy introduction to the world of blogging, but alas, the time has come to move on. I now have my own domain name and a shared hosting provider which supports both PHP and ColdFusion. Woohoo! Here's my new <a href="http://blog.stevenlevithan.com/">JavaScript / regex blog</a>.</p>
<p>For feed subscribers, please update this blog's feed URL to point <a href="http://blog.stevenlevithan.com/feed/">here</a>. In addition to a fancy new <a href="https://xregexp.com/">extended JavaScript regular expression constructor (XRegExp)</a> which I've already posted over there, most of the stuff from here has been migrated, and in the process, I've updated several posts and added demos for a few more, including <a href="http://blog.stevenlevithan.com/archives/leet-translator">Leet Translator</a>, <a href="http://blog.stevenlevithan.com/archives/rematch-coldfusion">REMatch</a>, and both the <a href="http://blog.stevenlevithan.com/archives/parseuri-split-url-coldfusion">ColdFusion</a> and <a href="http://blog.stevenlevithan.com/archives/parseuri">JavaScript</a> implementations of parseUri. Check 'em out.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-25106092536635014422007-05-10T17:55:00.000-05:002007-05-24T10:02:52.569-05:00Fun with JavaScript variable types and constructors<p>Try running the following code in <a href="http://getfirebug.com">Firebug</a> (the results are shown in trailing comments):</p>
<pre class="code">
console.log(typeof null); <span class="comment">// object</span>
console.log(null instanceof Object); <span class="comment">// false</span>
console.log(typeof [1,2,3]); <span class="comment">// object</span>
console.log(typeof /regex/); <span class="comment">// function in Firefox; object in IE</span>
console.log(typeof new String()); <span class="comment">// object</span>
console.log(typeof Object); <span class="comment">// function</span>
console.log(Object instanceof Object); <span class="comment">// true</span>
console.log(Object instanceof Function); <span class="comment">// true</span>
console.log(Function.constructor); <span class="comment">// Function()</span>
console.log(Function.constructor.constructor); <span class="comment">// Function()</span>
console.log(window.constructor); <span class="comment">// function() <em>[note the lowercase "f"]</em></span>
console.log(window.constructor.constructor); <span class="comment">// Object()</span>
console.log(window.constructor.constructor.constructor); <span class="comment">// Function()</span>
console.log(Function()); <span class="comment">// anonymous()</span>
console.log(typeof NaN); <span class="comment">// number</span>
console.log(NaN.constructor); <span class="comment">// Number()</span>
console.log(NaN instanceof Number); <span class="comment">// false</span>
console.log(NaN == NaN); <span class="comment">// false</span>
console.log(null + 1); <span class="comment">// 1</span>
console.log(null + null); <span class="comment">// 0</span>
console.log(undefined + 1); <span class="comment">// NaN</span>
console.log(null + "string"); <span class="comment">// nullstring</span>
console.log(undefined + "string"); <span class="comment">// undefinedstring</span>
console.log({} == 0); <span class="comment">// false</span>
console.log([] == 0); <span class="comment">// true</span>
console.log(1.0.toFixed(2)); <span class="comment">// 1.00</span>
console.log(new Boolean(false) == false); <span class="comment">// true</span>
console.log(new Boolean(false) === false); <span class="comment">// false</span>
</pre>
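<p>A few of those results can be traced step by step. The following is a minimal sketch, not part of the original list, that spells out the implicit conversions behind the <code>==</code> comparisons; the variable names are only illustrative.</p>

```javascript
// [] == 0: the array is first converted to a primitive (the empty string,
// via toString), and the empty string then converts to the number 0.
const arrayAsString = String([]);
const arrayAsNumber = Number(arrayAsString);
console.log([] == 0, arrayAsString === "", arrayAsNumber === 0);

// {} == 0: a plain object converts to "[object Object]", which is NaN as a number,
// and NaN never compares equal to anything.
const objectAsString = String({});
const objectAsNumber = Number(objectAsString);
console.log(objectAsString, Number.isNaN(objectAsNumber), ({}) == 0);

// new Boolean(false) is an object wrapper: == unwraps it before comparing,
// while === sees an object on one side and a boolean on the other.
const wrapped = new Boolean(false);
console.log(typeof wrapped, wrapped == false, wrapped === false);

// NaN is a number that is not equal to itself; Number.isNaN is the reliable test.
console.log(typeof NaN, NaN == NaN, Number.isNaN(NaN));
```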
<p>Surprised by any of those results? If not, you're probably either quite knowledgeable about JavaScript variable types, type conversion, and constructors, or you don't fully understand some of the peculiarities and seeming contradictions. (If you have any questions, feel free to ask in the comments.)</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-19666029070277867632007-04-16T23:37:00.000-05:002007-04-17T13:53:29.607-05:00Mastering Regular Expressions, 3rd Edition<img style="float:right; margin:0 0 10px 10px;" src="http://www.oreilly.com/catalog/covers/0596528124_cat.gif" alt="Book cover" />
<p>So, after reading positive reviews about it since I started using regular expressions a little over a year ago, I finally purchased O'Reilly Media's <a href="http://regex.info/">Mastering Regular Expressions</a> by Jeffrey E. F. Friedl, after discovering that a third edition came out in August 2006. The book arrived today, and it is indeed pretty damn excellent (I'm as excited about it as I can be about a tech book, anyway). I've only spent a few minutes with it so far, but I can see that it is excellently presented, and there is much to discover. Hopefully I'll post about some cool things I learn over the next few weeks, if I get a chance to actually sit down and read through a significant portion of the book.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com2tag:blogger.com,1999:blog-24744374.post-86683742072430999442007-04-05T04:44:00.000-05:002007-08-08T16:23:57.831-05:00Faking Conditionals in Regular Expressions<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> This post contains several major errors. Please view the updated version on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/archives/mimic-conditionals">Mimic Regular Expression Conditionals</a>.</p>
</div>
<p>Excited by the fact that I can <a href="http://blog.stevenlevithan.com/archives/mimic-atomic-groups">mimic atomic groups</a> when using most regex libraries which don't support them, I set my sights on another of my most wanted regex features which is commonly lacking: <a href="http://www.regular-expressions.info/conditional.html">conditionals</a> (which provide an if-then-else construct). Of the libraries I'm familiar with, conditionals are only supported by .NET, PCRE, PHP (when using PCRE via the preg functions), and JGSoft products (including RegexBuddy).</p>
<p>There are two kinds of regex conditionals in those libraries... lookaround-based and capturing-group-based. The functionality of lookaround-based conditionals is very easy to replicate. First, here's what such conditionals look like (this example uses a positive lookahead for the assertion):</p>
<p><code>(?(?=if_assertion)then|else)</code></p>
<p>To mimic that behavior in languages which don't support conditionals, just add a colon after the initial question mark to turn it into a non-capturing group, like so:</p>
<p><code>(?:(?=if_assertion)then|else)</code></p>
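<p>As long as the regex engine you're using supports the specified lookaround type, those patterns behave the same whenever <code>then</code> matches after the assertion succeeds. If <code>then</code> can fail at that point, the non-capturing version falls through to <code>else</code>, which a real conditional would not do. Guarding the second alternative with the negated assertion, as in <code>(?:(?=if_assertion)then|(?!if_assertion)else)</code>, closes that gap.</p>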
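<p>Here is a small JavaScript sketch of the lookahead-based approach. The patterns (<code>\d{3}</code> as the "then" part, <code>\w+</code> as the "else" part) are made up for illustration. It also shows a guarded variant, which repeats the assertion in negated form so that the "else" part cannot be reached once the assertion has succeeded.</p>

```javascript
// Conditional being imitated (not valid JavaScript syntax):
//   ^(?(?=\d)\d{3}|\w+)$
// "If the next character is a digit, require exactly three digits; otherwise accept any word."

// Plain non-capturing group: just a colon added after the question mark.
const plain = /^(?:(?=\d)\d{3}|\w+)$/;

// Guarded variant: the else branch is only available when the assertion fails.
const guarded = /^(?:(?=\d)\d{3}|(?!\d)\w+)$/;

for (const input of ["123", "abc", "12"]) {
  console.log(input, plain.test(input), guarded.test(input));
}
// "12" is where the two differ: the assertion succeeds, \d{3} fails,
// and the plain version then falls through to \w+.
```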
<p>However, mimicking capturing-group-based conditionals proved to be more tricky. Conditionals which use an optional capturing group as their test allow you to base logic on whether a capturing group has participated in the match so far. Thus...</p>
<p><code>(a)?b(?(1)c|d)</code></p>
<p>...matches only "bd" and "abc". That pattern can be expressed as follows:</p>
<p><code>(if_matched)?inner_pattern(?(1)then|else)</code></p>
<p>Here's a comparable pattern I created which doesn't require support for conditionals:</p>
<p><code>(?=(a)()|())\1?b(?:\2c|\3d)</code></p>
<p>To use it without an "else" part, you still need to include "<code>\3</code>" at the end, like this:</p>
<p><code>(?=(a)()|())\1?b(?:\2c|\3)</code></p>
<p>As a brief explanation of how that works: the lookahead at the beginning contains an empty alternative, so the lookahead itself always succeeds and never blocks the match. At the same time, the intentionally empty capturing groups inside the alternation record which alternative matched, and the then/else part is keyed off them via <code>\2</code> and <code>\3</code>. However, there are a couple of issues:</p>
<ul>
<li>This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.</li>
<li>It interacts with backtracking differently than a real conditional (the "a" part is treated as if it were within an optional, atomic group... e.g., <code>(?>(a)?)</code> instead of <code>(a)?</code>), so it's best to think of this as a new operator which is similar to a conditional.</li>
</ul>
<p>Here are the regex engines I've briefly tested this pattern with:</p>
<table summary="Tested support for mimicking conditionals" cellspacing="0" border="1" cellpadding="3" style="margin-bottom:10px; border:1px solid #b1793e; border-collapse:collapse;">
<thead>
<tr style="background:#ce8b43; color:#f7f0e9; font-size:90%;">
<th>Language</th>
<th>Supports "fake conditionals"</th>
<th>Supports real conditionals</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>.NET</td>
<td><strong>Yes</strong></td>
<td>Yes</td>
<td>Tested using <a href="http://www.ultrapico.com/Expresso.htm">Expresso</a>.</td>
</tr>
<tr>
<td>ColdFusion</td>
<td><strong>Yes</strong></td>
<td>No</td>
<td>Tested using ColdFusion 7.</td>
</tr>
<tr>
<td>Java</td>
<td><strong>Yes</strong></td>
<td>No</td>
<td>Tested using <a href="http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html">Regular Expression Test Applet</a>.</td>
</tr>
<tr>
<td>JavaScript</td>
<td>No</td>
<td>No</td>
<td>JavaScript assigns an empty string, rather than null, to backreferences for non-participating capturing groups. Unfortunately, this pattern depends on the way most other libraries handle non-participating capturing groups.</td>
</tr>
<tr>
<td>JGSoft</td>
<td><strong>Yes</strong><br/><span class="small">(buggy)</span></td>
<td>Yes</td>
<td style="font-size:90%;">As of RegexBuddy version 2.3.2, it performs correctly in more cases if you change the two empty capturing groups ("<code>()</code>") to match a zero-length value of no impact, such as "<code>(.{0})</code>" or "<code>(\b|\B)</code>". It also has problems matching values at the end of a string when using an empty else. I've reported both issues to JGSoft, and have been told they will be fixed in the next version of RegexBuddy.</td>
</tr>
<tr>
<td>PHP</td>
<td><strong>Yes</strong><br/><span class="small">(buggy)</span></td>
<td>Yes</td>
<td style="font-size:90%;">Tested using <a href="http://www.supercrumbly.com/assets/html/phpregextester/">PHP Regex Tester</a>. Performs correctly in more cases if you explicitly state the condition twice, like so: "<code>(?=(?:a)()|())(?:a)?b(?:\1c|\2d)</code>".</td>
</tr>
</tbody>
</table>
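<p>The JavaScript row can be checked directly. In this sketch (the anchors and the test inputs are my own additions), a backreference to a group that did not participate matches the empty string instead of failing, so both branches are always available and the pattern accepts inputs that a real conditional would reject. For a pattern this small, spelling out the alternation is one possible workaround; that is an assumption on my part rather than something tested across the engines above.</p>

```javascript
// The "fake conditional" from above, anchored so that whole inputs are tested.
const fake = /^(?=(a)()|())\1?b(?:\2c|\3d)$/;

// A real (a)?b(?(1)c|d) would accept only "abc" and "bd".
const inputs = ["abc", "bd", "abd", "bc"];
const fakeResults = inputs.map((input) => fake.test(input));
console.log(fakeResults); // every entry is true in JavaScript

// Spelling the two cases out as a plain alternation gives the intended behavior.
const spelledOut = /^(?:abc|bd)$/;
const spelledOutResults = inputs.map((input) => spelledOut.test(input));
console.log(spelledOutResults);
```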
<p>If you discover ways to improve this, or find problems not already mentioned, please let me know.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com3tag:blogger.com,1999:blog-24744374.post-54066806979180765522007-04-03T19:24:00.000-05:002007-08-08T16:28:39.330-05:00Faking Atomic Groups in Regular Expressions<p>So, I was messing around with <a href="http://regexbuddy.com/">RegexBuddy</a> and discovered that capturing groups work inside lookarounds (e.g., "<code>(?=(captured))</code>"), even though, of course, lookarounds don't actually match anything. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference 0)!</p>
<p>Thinking back to the <a href="http://blog.stevenlevithan.com/archives/match-innermost-html-element">regex I just posted about</a> (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an <a href="http://www.regular-expressions.info/atomic.html">atomic grouping</a>. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.</p>
<p>Here's how it's done:</p>
<p><code>(?=(pattern to make atomic))\1</code></p>
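<p>As a concrete sketch in JavaScript (the sample pattern <code>a(bc|b)c</code> is only an illustration), compare an ordinary group with the lookahead-plus-backreference form. The ordinary group can give back the "c" it took; the fake atomic version cannot.</p>

```javascript
// Ordinary group: after "bc" is tried and the final "c" fails,
// the engine backtracks into the group and tries "b" instead.
const ordinary = /a(?:bc|b)c/;

// Fake atomic group: the lookahead captures "bc", \1 consumes exactly that text,
// and nothing after it can backtrack into the lookahead to pick "b".
const fakeAtomic = /a(?=(bc|b))\1c/;

console.log(ordinary.test("abc"), fakeAtomic.test("abc"));
console.log(ordinary.test("abcc"), fakeAtomic.test("abcc"));
```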
<p>Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually match anything, so the rest of the regex can't backtrack into it), followed by "<code>\1</code>" (the backreference you just captured), to act just like an atomic group. That appears to produce exactly the same result as "<code>(?>pattern to make atomic)</code>", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support positive lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking cool.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com2tag:blogger.com,1999:blog-24744374.post-49602345254209987442007-04-02T02:36:00.000-05:002007-08-08T16:26:54.980-05:00Regexes in Depth: Matching Innermost HTML Elements<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> Please view the updated, syntax-highlighted version of this post on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/archives/match-innermost-html-element">Matching Innermost HTML Elements</a>.</p>
</div>
<p>On a regular expression forum I visit every once in a while, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match "<code><table>...</table></code>", but should only match the inner table within "<code><table>...<table>...</table>...</table></code>". This logic needed to support an unlimited number of nested tables.</p>
<p>One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.</p>
<p>It's true that, unless you're working with .NET or Perl, regexes are incapable of recursion (although it's often possible to <a href="http://blog.stevenlevithan.com/archives/regex-recursion">fake it</a> to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. ;-)</p>
<p>Here's the solution I offered (though there were a few steps to get there):</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table(?:\s[^>]*)?>(?:(?>[^<]+)|<(?!table(?:\s[^>]*)?>))*?</table></code></p>
<p>That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also very fast, and it can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).</p>
<p>To demonstrate, the above regex matches the highlighted text below:</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table><td><span style="background:yellow;"><table><td>&nbsp;</td></table></span></td></table> <span style="background:yellow;"><table><tr><td>&nbsp;</td></tr></table></span> <span style="background:yellow;"><table></table></span></code></p>
<p>In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed):</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table>(?:[\S\s](?!<table>))*?</table></code></p>
<p>Basically, it matches an opening "<table>" tag, then steps through the following characters one at a time, checking that none of them is followed by another "<table>" before the closing "</table>" is reached. If one is, the match fails, because it's not an innermost table.</p>
<p>In theory, at least. Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening "<table>", but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.</p>
<p>To address both of those issues, I used the following regex:</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table>(?:[^<]+|<(?!table>))*?</table></code></p>
<p>First, that increases speed (in theory... you'll see that there are problems with this as is), because within each <table> element it greedily consumes each run of characters other than "<" in a single step (using "<code>[^<]+</code>"), and it only uses the negative lookahead when it encounters "<". Secondly, it solves the previously noted error by using "<code><(?!table>)</code>" instead of "<code>[\S\s](?!<table>)</code>".</p>
<p>If you're wondering about table tags which contain attributes, that's not a problem. The construct is such that it can easily be extended to support element attributes. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table<span style="background:yellow;">(?:\s[^>]*)?</span>>(?:[^<]+|<(?!table<span style="background:yellow;">(?:\s[^>]*)?</span>>))*?</table></code></p>
<p>At first I thought this closed the case... The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering <a href="http://www.regular-expressions.info/atomic.html">catastrophic backtracking</a>. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The change to the regex is highlighted:</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table(?:\s[^>]*)?>(?:<span style="background:yellow;">(?></span>[^<]+<span style="background:yellow;">)</span>|<(?!table(?:\s[^>]*)?>))*?</table></code></p>
<p>And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in less than 0.01 second on my system.</p>
<p>However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, a solid approach which is not susceptible to catastrophic backtracking would be:</p>
<p><code style="font-family: Arial,Helvetica,Sans-Serif; font-size:96%;"><table(?:\s[^>]*)?>(?!<table(?:\s[^>]*)?>)(?:[\S\s](?!<table(?:\s[^>]*)?>))*?</table></code></p>
<p>That runs just a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.</p>
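<p>To try the JavaScript-friendly version, here is a short sketch that runs it over the sample markup from the earlier demonstration. The variable names are mine, and the only change to the pattern is escaping the slash in <code></table></code> so it can be written as a regex literal.</p>

```javascript
// The JavaScript-safe pattern from above, with the global flag so all matches are returned.
const innermostTable =
  /<table(?:\s[^>]*)?>(?!<table(?:\s[^>]*)?>)(?:[\S\s](?!<table(?:\s[^>]*)?>))*?<\/table>/g;

// Same sample as the highlighted demonstration: one nested table, then two standalone tables.
const html =
  "<table><td><table><td>&nbsp;</td></table></td></table> " +
  "<table><tr><td>&nbsp;</td></tr></table> <table></table>";

// The outer table of the nested pair is skipped; only tables with no table inside are returned.
const innermost = html.match(innermostTable);
console.log(innermost);
```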
<p>If you have a copy of <a href="http://regexbuddy.com">RegexBuddy</a> (and if you don't, I highly recommend it), run these regexes through RegexBuddy's debugger for an under-the-hood look at how they're handled by a regex engine.</p>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit:</strong> Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:</p>
<p><code style="font-family:Arial, Helvetica, Sans-Serif;"><table(?:\s[^>]*)?>(?:(?=([^<]+))\1|<(?!table(?:\s[^>]*)?>))*?</table></code></p>
<p>Basically, that uses a capturing group inside a positive lookahead followed by "<code>\1</code>" to act just like an atomic group!</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com1tag:blogger.com,1999:blog-24744374.post-84604082754855854022007-02-07T23:40:00.000-05:002007-12-11T20:30:28.992-05:00parseUri(): Split URLs in JavaScript<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> Please see the latest version of this function on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/archives/parseuri">parseUri: Split URLs in JavaScript</a>.</p>
</div>
<p>For fun, I spent the 10 minutes needed to convert my <a href="/2007/01/parsing-uris-in-coldfusion.html"><code>parseUri()</code> ColdFusion UDF</a> into a JavaScript function.</p>
<p>For those who haven't already seen it, I'll repeat my explanation from the other post…</p>
<p><code>parseUri()</code> splits any well-formed URI into its parts (<strong>all are optional</strong>). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., <code>parseUri(someUri).anchor</code>, etc.</p>
<p>I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.</p>
<p>This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.</p>
<pre class="code" style="height:400px;"><span class="comment">/* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
Splits any well-formed URI into the following parts (all are optional):
----------------------
• source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
• protocol (scheme)
• authority (includes both the domain and port)
• domain (part of the authority; can be an IP address)
• port (part of the authority)
• path (includes both the directory path and filename)
• directoryPath (part of the path; supports directories with periods, and without a trailing slash)
• fileName (part of the path)
• query (does not include the leading question mark)
• anchor (fragment)
*/</span>
function parseUri(sourceUri){
var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
var uri = {};
for(var i = 0; i < 10; i++){
uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
}
<span class="comment">// Always end directoryPath with a trailing slash if a path was present in the source URI
// Note that a trailing slash is NOT automatically inserted within or appended to the "path" key</span>
if(uri.directoryPath.length > 0){
uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
}
return uri;
}</pre>
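<p>As a quick usage sketch, here is the function again (repeated only so the snippet runs on its own) applied to a sample URI. The sample URI and the <code>parts</code> variable are illustrative.</p>

```javascript
// parseUri as listed above, repeated so this snippet is self-contained.
function parseUri(sourceUri) {
  var uriPartNames = ["source", "protocol", "authority", "domain", "port", "path", "directoryPath", "fileName", "query", "anchor"];
  var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
  var uri = {};
  for (var i = 0; i < 10; i++) {
    uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
  }
  // Always end directoryPath with a trailing slash if a path was present
  if (uri.directoryPath.length > 0) {
    uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
  }
  return uri;
}

// A sample URI that includes a directory name containing a period ("dir.2").
var parts = parseUri("http://www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top");

console.log(parts.protocol);      // "http"
console.log(parts.authority);     // "www.domain.com:81"
console.log(parts.directoryPath); // "/dir1/dir.2/"
console.log(parts.fileName);      // "index.html"
console.log(parts.query);         // "id=1&test=2"
console.log(parts.anchor);        // "top"
```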
<p>Is there any leaner, meaner URI parser out there? :-)</p>
<p>To make it easier to test this function, here is some code that can be copied and pasted into a new HTML file, allowing you to easily enter URIs and see the results.</p>
<pre class="code" style="height:125px; background:#FFF; border-color:#999;"><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Steve's URI Parser</title>
<script type="text/javascript">
//<![CDATA[
/* parseUri JS v0.1, by Steven Levithan (http://badassery.blogspot.com)
Splits any well-formed URI into the following parts (all are optional):
----------------------
• source (since the exec() method returns backreference 0 [i.e., the entire match] as key 0, we might as well use it)
• protocol (scheme)
• authority (includes both the domain and port)
• domain (part of the authority; can be an IP address)
• port (part of the authority)
• path (includes both the directory path and filename)
• directoryPath (part of the path; supports directories with periods, and without a trailing slash)
• fileName (part of the path)
• query (does not include the leading question mark)
• anchor (fragment)
*/
function parseUri(sourceUri){
var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"];
var uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)?((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri);
var uri = {};
for(var i = 0; i < 10; i++){
uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
}
// Always end directoryPath with a trailing slash if a path was present in the source URI
// Note that a trailing slash is NOT automatically inserted within or appended to the "path" key
if(uri.directoryPath.length > 0){
uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
}
return uri;
}
// Dump the test results in the page
function dumpResults(obj){
var output = "";
for (var property in obj){
output += '<tr><td class="name">' + property + '</td><td class="result">"<span class="value">' + obj[property] + '</span>"</td></tr>';
}
document.getElementById('output').innerHTML = "<table>" + output + "</table>";
}
//]]>
</script>
<style type="text/css" media="screen">
h1 {font-size:1.25em;}
table {border:solid #333; border-width:1px; background:#f5f5f5; margin:15px 0 0; border-collapse:collapse;}
td {border:solid #333; border-width:1px 1px 0 0; padding:4px;}
.name {font-weight:bold;}
.result {color:#aaa;}
.value {color:#33c;}
</style>
</head>
<body>
<h1>Steve's URI Parser</h1>
<form action="#" onsubmit="dumpResults(parseUri(document.getElementById('uriInput').value)); return false;">
<div>
<input id="uriInput" type="text" style="width:500px" value="http://www.domain.com:81/dir1/dir.2/index.html?id=1&amp;test=2#top" />
<input type="submit" value="Parse" />
</div>
</form>
<div id="output">
</div>
<p><a href="http://badassery.blogspot.com">My blog</a></p>
</body>
</html></pre>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit:</strong> This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take much longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if several people leave comments asking for the support, I'll go ahead and add it. I could also add more pre-concatenated parts (e.g., "relative" for everything starting with the path) or other stuff like "tld" (for just the top-level domain) if readers think it would be useful.</p>
<hr/>
<p>You might also be looking for my script which fixes the <a href="http://blog.stevenlevithan.com/archives/cross-browser-split">JavaScript split</a> method cross-browser.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com26tag:blogger.com,1999:blog-24744374.post-77224367434556694412007-02-04T17:24:00.000-05:002007-05-30T17:47:52.988-05:00Regexes in Depth: Advanced Quoted String Matching<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> Please view the updated version of this post on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/regular-expressions/match-quoted-string/">Advanced Quoted String Matching</a>.</p>
</div>
<p>In my previous post, one of the examples I used of when capturing groups are appropriate demonstrated how to match quoted strings:</p>
<p><code>(["'])(?:\\\1|.)*?\1</code></p>
<p>To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.</p>
<p>On his blog, <a href="http://bennadel.com/">Ben Nadel</a> asked:</p>
<blockquote><p>I do not follow the <code>\\\1</code> in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (<code>*?</code>) know when to stop if it can have quotes in side of it? What am I missing?</p></blockquote>
<p>Good question. Following is the response I gave, slightly updated to improve clarity:</p>
<div style="border:1px solid #F7E8D8; border-width:6px 0; padding-top:10px; margin-bottom:10px;">
<p>First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:</p>
<ul>
<li>"test"</li>
<li>'test'</li>
<li>"t'es't"</li>
<li>'te"st'</li>
<li>'te\'st'</li>
<li>"\"te\"\"st\""</li>
</ul>
<p>In other words, it allows any number of escaped quotes of the same type as the enclosure. (Due to the way the regex is written, it doesn't need special handling for inner quotes that are not of the same type as the enclosure.)</p>
<p>As for <em>how</em> the regex works, it uses a trick similar in construct to the examples I gave in my blog post about <a href="/2006/03/regex-recursion-without-balancing.html">regex recursion without balancing groups</a>.</p>
<p>Basically, the inner grouping matches escaped quotes <em>OR</em> any single character, with the escaped-quote alternative attempted before the dot. So, as the lazy repetition operator (<code>*?</code>) steps through the match looking for the first closing quote, it jumps right past each instance of the two characters which together make up an escaped quote. In other words, pairing something other than the quote character with the quote character allows the lazy repetition operator to treat them as one token, and continue on its way through the string.</p>
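<p>To see that stepping behavior in action, here is a minimal sketch in Python, assuming its <code>re</code> module treats this pattern the same way as the flavors discussed here. The helper name <code>quoted_strings</code> is made up for illustration.</p>

```python
import re

# The pattern from the post: the same quote type must open and close the match,
# and a backslash-escaped quote of that type is consumed as a single token.
QUOTED = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

def quoted_strings(text):
    """Return every quoted string in text, including its enclosing quotes."""
    return [m.group(0) for m in QUOTED.finditer(text)]

sample = r'''say "test" then 'te\'st' and "t'es't" end'''
for found in quoted_strings(sample):
    print(found)
```

<p>Because the escaped-quote alternative comes first in the group, the lazy quantifier never gets the chance to stop at the quote that follows a backslash.</p>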
<p>Side note: If you wanted to support multi-line quotes in libraries without an option to make dots match newlines, change the dot to <code>[\S\s]</code>.</p>
<p>Also note that with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent to each other:</p>
<ul>
<li><code>(["'])(?:\\\1|.)*?\1</code> (the regex being discussed)</li>
<li><code>(["']).*?(?<!\\)\1</code> (uses a negative lookbehind to achieve logic which is possibly simpler to understand)</li>
</ul>
<p>Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. Also, if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.</p>
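<p>As a quick check of the two patterns side by side, here is a sketch in Python, whose <code>re</code> module supports negative lookbehinds. The helper <code>first_match</code> and the sample strings are illustrative only, and agreement on these samples is not a proof that the patterns agree on every possible input.</p>

```python
import re

# Alternation-based version (works without lookbehind support).
ALTERNATION = re.compile(r'''(["'])(?:\\\1|.)*?\1''')
# Lookbehind-based version (requires an engine with negative lookbehind).
LOOKBEHIND = re.compile(r'''(["']).*?(?<!\\)\1''')

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = [r'"test"', r"'te\'st'", r'''"a" and "b"''', '""']
for s in samples:
    # Both patterns should pick out the same quoted string for these inputs.
    print(first_match(ALTERNATION, s), first_match(LOOKBEHIND, s))
```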
<p>One final thing worth noting is that in neither regex did I try to use anything like <code>[^\1]</code> for matching the inner, quoted content. If <code>[^\1]</code> worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes. First of all, the reason we can't greedily repeat an "any character" pattern such as a dot or <code>[\S\s]</code> is that we would then no longer be able to distinguish between multiple discrete quotes within the same string, and our match would go from the start of the first quote to the end of the last quote. Secondly, the reason we can't use <code>[^\1]</code> either is because you can't use backreferences within character classes (negated or otherwise), even though in this case the match contained within the backreference is only one character in length. Also note that the patterns <code>[\1]</code> and <code>[^\1]</code> actually <em>do</em> have special meaning, though possibly not what you would expect. They assert: <em>match a single character which is/is not octal index 1 in the character set</em>. To assert that outside of a character class, you'd need to use a leading zero (e.g., <code>\01</code>), but inside a character class the leading zero is optional.</p>
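<p>The octal behavior is easy to confirm. The following sketch uses Python's <code>re</code> module, which (like the engines described above) reads <code>\1</code> inside a character class as a character code rather than a backreference. The name <code>naive_match</code> is hypothetical.</p>

```python
import re

# Inside a character class, \1 means the character with octal code 1 (U+0001).
print(re.fullmatch(r'[\1]', '\x01') is not None)  # the control character matches
print(re.fullmatch(r'[\1]', '"') is not None)     # a quote character does not

# So [^\1] excludes only U+0001, and a greedy "quoted string" built on it
# runs from the first quote to the last one in the subject string.
NAIVE = re.compile(r'''(["'])[^\1]*\1''')

def naive_match(text):
    """Return what the naive [^\\1]-based pattern matches first, or None."""
    m = NAIVE.search(text)
    return m.group(0) if m else None

print(naive_match('"a" and "b"'))
```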
</div>
<p>If anyone has questions about how other, specific regex patterns work, or why they don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.</p>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit:</strong> Just for kicks, here's a Unicode-based regular expression which adds support for any kind of opening/closing quote pair in any language (including the special characters <span style="font-family:Courier New;">“</span>, <span style="font-family:Courier New;">”</span>, <span style="font-family:Courier New;">‘</span>, <span style="font-family:Courier New;">’</span>, etc.). Of the regex flavors I'm familiar with, Java, the .NET framework, and Perl use Unicode-based regex engines. Of those three, only the .NET framework also supports conditionals, which I'll also need to pull this off.</p>
<p><code>(?:(["'])|\p{Pi}).*?(?<!\\)(?(1)\1|\p{Pf})</code></p>
<p>I'm not going to go into explaining that, but the more advanced regex features used are a negative lookbehind, conditional, and Unicode character properties.</p>
<p>Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with <span style="font-family:Courier New;">"</span> or <span style="font-family:Courier New;">'</span>, <strong>neither of which are designated as opening or closing quote characters in Unicode</strong>).</p>
<ul>
<li style="font-family:Courier New;">“test”</li>
<li style="font-family:Courier New;">“te“st”</li>
<li style="font-family:Courier New;">“te\”st”</li>
<li style="font-family:Courier New;">‘test’</li>
<li style="font-family:Courier New;">‘t‘e"s\’t’</li>
</ul>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit 2:</strong> Shortly after posting the above Unicode-based regex, I realized it was flawed. Although it will correctly match all strings in the two lists of examples above, the fact that I'm using the Unicode character properties for any opening / closing quote means that it will also match, e.g., <span style="font-family:Courier New;">‘test”</span>, which is not what I was going for. The only way to get around this is to not use the Unicode character properties, and instead specifically include support for <span style="font-family:Courier New;">“”</span> and <span style="font-family:Courier New;">‘’</span> pairs (however, unfortunately we will lose the ability to work with special quote characters from <em><strong>any</strong></em> language). Here's an updated regex:</p>
<p><code>(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))</code></p>
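<p>Python's <code>re</code> module also happens to support both lookbehinds and conditionals, so the pattern above can be tried there as a rough sketch. The helper name <code>first_quote</code> is made up for illustration.</p>

```python
import re

# Group 1 captures a straight quote, group 2 flags an opening curly double quote;
# if neither group participated, the opener was a curly single quote.
PAIRS = re.compile(r'''(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))''')

def first_quote(text):
    """Return the first quoted string found in text, or None."""
    m = PAIRS.search(text)
    return m.group(0) if m else None

print(first_quote('‘test”'))          # mismatched pair
print(first_quote('‘t‘e“"”s\\’t’'))   # nested and escaped quotes
print(first_quote('“te\\”st”'))
```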
<p>Now, it will no longer match <span style="font-family:Courier New;">‘test”</span>, and will successfully match things like <span style="font-family:Courier New;">‘t‘e“"”s\’t’</span>. Note that I'm using nested conditionals in the above regex to achieve an if-elseif-else construct. Also, now that it's no longer Unicode-based, it will work with regex engines which support both lookbehinds and conditionals (PCRE, PHP, the .NET framework, and possibly others).</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com2tag:blogger.com,1999:blog-24744374.post-13647761746885476932007-02-04T11:38:00.000-05:002007-02-05T12:03:04.946-05:00Capturing vs. Non-capturing Regex Groups<p>I posted the following on <a href="http://www.bennadel.com/blog/">Ben Nadel's excellent blog</a>, in response to a question about why I use non-capturing groups in my regular expressions (e.g., <code>(?:non-captured)</code>)...</p>
<p>I near-religiously use non-capturing groups whenever I do not need to reference a group's contents. There are only three reasons to use capturing groups:</p>
<ul>
<li>You're using parts of a match to construct a replacement string, or otherwise referencing parts of the match in code outside the regex.</li>
<li>You need to reuse parts of the match within the regex itself. E.g., <code>(["'])(?:\\\1|.)*?\1</code> would match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match, and allowing inner, escaped quotes of the same type as the enclosure.</li>
<li>You need to test if an optional group was part of the match so far, as the condition to evaluate within a conditional. E.g., <code>(a)?b(?(1)c|d)</code> only matches the values "bd" and "abc".</li>
</ul>
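<p>The three cases can be sketched in Python, whose <code>re</code> module supports backreferences and conditionals. The sample strings are invented for illustration.</p>

```python
import re

# 1. Using captured parts of a match to build a replacement string.
swapped = re.sub(r'(\w+)@(\w+)', r'\2 at \1', 'user@example')
print(swapped)

# 2. Reusing part of the match inside the regex itself (here: a doubled word).
doubled = re.search(r'\b(\w+) \1\b', 'it was the the best')
print(doubled.group(0))

# 3. Testing whether an optional group took part in the match so far.
CONDITIONAL = re.compile(r'(a)?b(?(1)c|d)')
accepted = [s for s in ['bd', 'abc', 'bc', 'abd'] if CONDITIONAL.fullmatch(s)]
print(accepted)
```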
<p>There are two primary reasons to use non-capturing groups if a grouping doesn't meet one of the above conditions:</p>
<ul>
<li>Capturing groups negatively impact performance, since creating backreferences requires that their contents be stored in memory. The performance hit may be tiny, especially when working with small strings, but it's there.</li>
<li>When you need to use several groupings in a single regex, only some of which you plan to reference later, it's very convenient to have the backreferences you want to use numbered sequentially. E.g., the logic in my <a href="/2007/01/parsing-uris-in-coldfusion.html"><code>parseUri()</code> UDF</a> could not be nearly as simple if I had not made appropriate use of capturing and non-capturing groups within the same regex.</li>
</ul>
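<p>The numbering point is easiest to see with two versions of the same pattern. This is a sketch in Python with a made-up URL pattern, not the regex used by <code>parseUri()</code>.</p>

```python
import re

# Every group captures, so the parts we care about end up as groups 3 and 4.
ALL_CAPTURING = re.compile(r'(https?)://(www\.)?([^/]+)(/.*)?')
# Only the interesting parts capture, so they are numbered 1 and 2.
MIXED = re.compile(r'(?:https?)://(?:www\.)?([^/]+)(/.*)?')

url = 'http://www.example.com/path'
print(ALL_CAPTURING.match(url).groups())
print(MIXED.match(url).groups())
```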
<p>On a related note, the values of backreferences created using capturing groups with repetition operators on the end of them may not be obvious until you're familiar with how they work. If you ran the regex <code>(.)*</code> over the string "test", backreference 0 (i.e., the whole match) would be "test", but backreference 1 would hold only the final "t", since each repetition of the group overwrites the value captured by the previous one. No 2nd, 3rd, or 4th backreferences would be created for the other characters. If you wanted the entire match of a repeated grouping to be captured into a backreference, you could use, e.g., <code>((?:.)*)</code>. Also note that the way both of those patterns would be evaluated is fundamentally different from how regex engines would treat <code>(.*)</code>.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com1tag:blogger.com,1999:blog-24744374.post-24258313601096717432007-02-03T02:17:00.000-05:002007-02-06T18:03:23.752-05:00More URI-related UDFs<p>To follow up my <a href="/2007/01/parsing-uris-in-coldfusion.html"><code>parseUri()</code></a> function, here are several more UDFs I've written recently to help with URI management:</p>
<ul>
<li><strong><code style="color:#000;">getPageUri()</code></strong><br />
Returns a struct containing the relative and absolute URIs of the current page. The difference between <code>getPageUri().relative</code> and <code>CGI.SCRIPT_NAME</code> is that the former will include the query string, if present.</li>
<li><strong><code style="color:#000;">matchUri(testUri, [masterUri])</code></strong><br />
Returns a Boolean indicating whether or not two URIs are the same, disregarding the following differences:
<ul>
<li>Fragments (page anchors), e.g., "#top".</li>
<li>Inclusion of "index.cfm" in paths, e.g., "/dir/" vs. "/dir/index.cfm" (supports trailing query strings).</li>
</ul>
If <code>masterUri</code> is not provided, the current page is used for comparison (supports both relative and absolute URIs).</li>
<li><strong><code style="color:#000;">replaceUriQueryKey(uri, key, substring)</code></strong><br />
Replaces a URI query key and its value with a supplied key=value pair. Works with relative and absolute URIs, as well as standalone query strings (with or without a leading "?"). This is also used to support the following two UDFs:</li>
<li><strong><code style="color:#000;">addUriQueryKey(uri, key, value)</code></strong><br />
Removes any existing instances of the supplied key, then appends it together with the provided value to the provided URI.</li>
<li><strong><code style="color:#000;">removeUriQueryKey(uri, key)</code></strong><br />
Removes one or more query keys (comma delimited) and their values from the provided URI.</li>
</ul>
<p>Now that I have these at my disposal, I frequently find myself using them in combination with each other, e.g.,<br />
<code><a href="<cfoutput>#addUriQueryKey(<br />
getPageUri().relative,<br />
"key",<br />
"value"<br />
)#</cfoutput>">Link</a></code>.</p>
<p>Let me know if you find any of these useful…</p>
<pre class="code" style="height: 500px;"><span class="comment"><!--- Returns the relative and absolute URIs of the current page ---></span>
<cffunction name="getPageUri" returntype="struct" output="FALSE">
<cfset var pageProtocol = "http" />
<cfset var pageQuery = "" />
<cfset var uri = structNew() />
<span class="comment"><!--- Get the protocol of the current page ---></span>
<cfif CGI.HTTPS IS "ON">
<cfset pageProtocol = "https" />
</cfif>
<span class="comment"><!--- Get the query of the current page, including the leading question mark if the query is not empty ---></span>
<cfset pageQuery = reReplace("?" & CGI.QUERY_STRING, "\?$", "") />
<span class="comment"><!--- Construct the relative URI of the current page (excludes the protocol and domain) ---></span>
<cfset uri.relative = CGI.SCRIPT_NAME & pageQuery />
<span class="comment"><!--- Construct the absolute URI of the current page ---></span>
<cfset uri.absolute = pageProtocol & "://" & CGI.SERVER_NAME & uri.relative />
<cfreturn uri />
</cffunction>
<span class="comment"><!--- Returns a Boolean indicating whether or not two URIs are the same, disregarding the following differences:
• Fragments (page anchors), e.g., "#top".
• Inclusion of "index.cfm" in paths, e.g., "/dir/" vs. "/dir/index.cfm" (supports trailing query strings).
If masterUri is not provided, the current page is used for comparison (supports both relative and absolute URIs) ---></span>
<cffunction name="matchUri" returntype="boolean" output="FALSE">
<cfargument name="testUri" type="string" required="TRUE" />
<cfargument name="masterUri" type="string" required="FALSE" default="" />
<span class="comment"><!--- If a masterUri was not provided ---></span>
<cfif len(masterUri) EQ 0>
<span class="comment"><!--- If testUri is an absolute URI ---></span>
<cfif reFindNoCase("^https?://", testUri) EQ 1>
<cfset masterUri = getPageUri().absolute />
<cfelse>
<cfset masterUri = getPageUri().relative />
</cfif>
</cfif>
<cfreturn reReplaceNoCase(reReplace(testUri, "##.*", ""), "/index\.cfm(?=\?|$)", "/", "ONE") IS reReplaceNoCase(reReplace(masterUri, "##.*", ""), "/index\.cfm(?=\?|$)", "/", "ONE") />
</cffunction>
<span class="comment"><!--- Replace a URI query key and its value with a supplied key=value pair.
Works with relative and absolute URIs, as well as standalone query strings (with or without a leading "?") ---></span>
<cffunction name="replaceUriQueryKey" returntype="string" output="FALSE">
<cfargument name="uri" type="string" required="TRUE" />
<cfargument name="key" type="string" required="TRUE" />
<cfargument name="substring" type="string" required="TRUE" />
<cfset var preQueryComponents = "" />
<cfset var currentKey = "" />
<span class="comment"><!--- Remove any existing fragment (page anchor) from uri, since it will mess with our processing, and is unlikely to be relevant and/or correct in the new URI ---></span>
<cfset uri = reReplace(uri, "##.*", "", "ONE") />
<span class="comment"><!--- Store any pre-query URI components. For this to work, the string must start with "protocol:", "//authority", or "/" (path). Otherwise, we will assume the uri is comprised entirely of a query component ---></span>
<cfset preQueryComponents = reReplace(uri, "^((?:(?:[^:/?.]+:)?//[^/?]+)?(?:/[^?]*)?)?.*", "\1", "ONE") />
<span class="comment"><!--- Remove any pre-query components and the leading question mark from uri ---></span>
<cfset uri = reReplace(uri, "^(?:(?:[^:/?.]+:)?//[^/?]+)?(?:/[^?]*)?\??(.*)", "\1", "ONE") />
<span class="comment"><!--- Remove any superfluous ampersands in the query (this cleans up the query but is not required, and in any case this function doesn't generate superfluous ampersands) ---></span>
<cfset uri = reReplace(uri, "&(?=&)|&$", "", "ALL") />
<span class="comment"><!--- For each key specified, remove the corresponding key=value pair from uri. Note that key names which contain regex special characters (.,*,+,?,^,$,{,},(,),|,[,],\) which are not percent-encoded may behave unpredictably ---></span>
<cfloop index="currentKey" list="#key#" delimiters=",">
<cfif len(currentKey) GT 0>
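<span class="comment"><!--- The trailing lookahead requires the key name to end at "=", "&", or the end of the query, so removing "a" does not also strip the start of a key such as "ab" ---></span>
<cfset uri = reReplaceNoCase(uri, ("(?:^|&)" & currentKey & "(?:=[^&]*)?(?=&|$)"), "", "ALL") />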
</cfif>
</cfloop>
<span class="comment"><!--- If we still have a value in uri after the above processing (beyond what we're about to add) ---></span>
<cfif len(uri) GT 0>
<span class="comment"><!--- Ensure the query is returned with only the necessary separator characters (? and &) ---></span>
<cfreturn (preQueryComponents & "?" & reReplace(uri, "^&", "") & reReplace("&" & substring, "&$", "")) />
<cfelse>
<span class="comment"><!--- Append substring, including a leading question mark if substring is not empty ---></span>
<cfreturn (preQueryComponents & reReplace("?" & substring, "\?$", "")) />
</cfif>
</cffunction>
<cffunction name="addUriQueryKey" returntype="string" output="FALSE">
<cfargument name="uri" type="string" required="TRUE" />
<cfargument name="key" type="string" required="TRUE" />
<cfargument name="value" type="string" required="TRUE" />
<span class="comment"><!--- Until proper support is included for adding multiple keys with one call, use only the first key ---></span>
<cfset key = listFirst(key, ",") />
<span class="comment"><!--- Remove any existing instances of the key from uri, then add the new key=value pair.
Do not include the trailing equals sign (=) if we're assigning an empty value to the added key ---></span>
<cfreturn replaceUriQueryKey(removeUriQueryKey(uri, key), "", (key & reReplace("=" & value, "=$", ""))) />
</cffunction>
<cffunction name="removeUriQueryKey" returntype="string" output="FALSE">
<cfargument name="uri" type="string" required="TRUE" />
<span class="comment"><!--- Use a comma-delimited list to remove multiple keys with one call ---></span>
<cfargument name="key" type="string" required="TRUE" />
<cfreturn replaceUriQueryKey(uri, key, "") />
</cffunction></pre>
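<p>For comparison, here is a rough Python sketch of the add/remove behavior built on the standard <code>urllib.parse</code> module rather than regexes. It is not a port of the UDFs above: the function names are hypothetical, it assumes a single key for adding, and unlike the ColdFusion versions it percent-encodes values and writes an empty value as <code>key=</code>.</p>

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_uri_query_key(uri, keys):
    """Remove one or more comma-delimited query keys (case-insensitive) from uri.
    Any fragment is dropped, as in the ColdFusion UDFs."""
    parts = urlsplit(uri)
    drop = {k.lower() for k in keys.split(",") if k}
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in drop]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def add_uri_query_key(uri, key, value):
    """Remove existing instances of key, then append key=value to the query."""
    parts = urlsplit(remove_uri_query_key(uri, key))
    pairs = parse_qsl(parts.query, keep_blank_values=True) + [(key, value)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))

print(remove_uri_query_key("/dir/page.cfm?a=1&b=2#top", "a"))
print(add_uri_query_key("http://example.com/index.cfm?key=old&x=1", "key", "value"))
```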
<p>In other news, <a href="http://xkcd.com/c208.html">this</a> cracked me up.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-29220244796904209762007-02-01T00:40:00.000-05:002007-05-30T17:53:16.641-05:00parseUri(): Split URLs in ColdFusion<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> Please view the updated version of this post on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/coldfusion/parseuri-split-url-coldfusion/">parseUri: Split URLs in ColdFusion</a>.</p>
</div>
<p>Here's a UDF I wrote recently which allows me to show off my regex skillz. <code>parseUri()</code> splits any well-formed URI into its components (all are optional).</p>
<p>The core code is already very brief, but I could replace the entire contents of the <code><cfloop></code> with one line of code if I didn't have to account for bugs in the <code>reFind()</code> function (tested in CF7). Note that all components are split with a single regex (using backreferences). My favorite part of this UDF is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers.</p>
<p>Since the function returns a struct, you can do, e.g., <code>parseUri(someUri).anchor</code>, etc. Check it out:</p>
<pre class="code" style="height:520px;">
<span class="comment"><!--- By Steven Levithan. Splits any well-formed URI into its components ---></span>
<cffunction name="parseUri" returntype="struct" output="FALSE">
<cfargument name="sourceUri" type="string" required="TRUE" />
<span class="comment"><!--- Get arrays named len and pos, containing the lengths and positions of each URI component (all are optional) ---></span>
<cfset var uriPattern = reFind("^(?:([^:/?##.]+):)?(?://)?(([^:/?##]*)(?::(\d*))?)?((/(?:[^?##](?![^?##/]*\.[^?##/.]+(?:[\?##]|$)))*/?)?([^?##/]*))?(?:\?([^##]*))?(?:##(.*))?", sourceUri, 1, TRUE) />
<span class="comment"><!--- Create an array containing the names of each key we will add to the uri struct ---></span>
<cfset var uriComponentNames = listToArray("source,protocol,authority,domain,port,path,directoryPath,fileName,query,anchor") />
<cfset var uri = structNew() />
<cfset var i = 1 />
<span class="comment"><!--- Add the following keys to the uri struct:
• source (when using returnSubExpressions, reFind() returns backreference 0 [i.e., the entire match] as array element 1, so we might as well use it)
• protocol (scheme)
• authority (includes both the domain and port)
• domain (part of the authority component; can be an IP address)
• port (part of the authority component)
• path (includes both the directory path and filename)
• directoryPath (part of the path component; supports directories with periods, and without a trailing slash)
• fileName (part of the path component)
• query (does not include the leading question mark)
• anchor (fragment) ---></span>
<cfloop index="i" from="1" to="10"><span class="comment"><!--- Could also use to="#arrayLen(uriComponentNames)#" ---></span>
<span class="comment"><!--- If the component was found in the source URI...
• The arrayLen() check is needed to prevent a CF error when sourceUri is empty, because due to an apparent bug, reFind() does not populate backreferences for zero-length capturing groups when run against an empty string (though it does still populate backreference 0)
• The pos[i] value check is needed to prevent a CF error when mid() is passed a start value of 0, because of the way reFind() considers an optional capturing group that does not match anything to have a pos of 0 ---></span>
<cfif (arrayLen(uriPattern.pos) GT 1) AND (uriPattern.pos[i] GT 0)>
<span class="comment"><!--- Add the component to its corresponding key in the uri struct ---></span>
<cfset uri[uriComponentNames[i]] = mid(sourceUri, uriPattern.pos[i], uriPattern.len[i]) />
<span class="comment"><!--- Otherwise, set the key value to an empty string ---></span>
<cfelse>
<cfset uri[uriComponentNames[i]] = "" />
</cfif>
</cfloop>
<span class="comment"><!--- Always end directoryPath with a trailing slash if the path component was present in the source URI (Note that a trailing slash is NOT automatically inserted within or appended to the "path" key) ---></span>
<cfif len(uri.directoryPath) GT 0>
<cfset uri.directoryPath = reReplace(uri.directoryPath, "/?$", "/") />
</cfif>
<cfreturn uri />
</cffunction></pre>
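<p>To show what the returned struct looks like for a concrete URI, here is a sketch that carries the same regex over to Python. It assumes the only change needed is un-doubling ColdFusion's <code>##</code> escapes to <code>#</code>; the Python names are illustrative.</p>

```python
import re

# Same pattern as the UDF, with ColdFusion's "##" written as a plain "#".
URI_PATTERN = re.compile(
    r"^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\d*))?)?"
    r"((/(?:[^?#](?![^?#/]*\.[^?#/.]+(?:[\?#]|$)))*/?)?([^?#/]*))?"
    r"(?:\?([^#]*))?(?:#(.*))?"
)
COMPONENT_NAMES = ["source", "protocol", "authority", "domain", "port",
                   "path", "directoryPath", "fileName", "query", "anchor"]

def parse_uri(source_uri):
    """Split a URI into a dict keyed by component name (missing parts are "")."""
    match = URI_PATTERN.match(source_uri)  # every part is optional, so this always matches
    uri = {name: (match.group(i) or "") for i, name in enumerate(COMPONENT_NAMES)}
    # Mirror the UDF: a non-empty directoryPath always ends with a slash.
    if uri["directoryPath"] and not uri["directoryPath"].endswith("/"):
        uri["directoryPath"] += "/"
    return uri

parts = parse_uri("http://www.example.com:8080/dir.name/file.cfm?q=1#top")
for name in COMPONENT_NAMES:
    print(name, "=", parts[name])
```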
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit:</strong> I've written a JavaScript implementation of the above UDF. See <a href="http://badassery.blogspot.com/2007/02/parseuri-split-urls-in-javascript.html">parseUri(): Split URLs in JavaScript</a>.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-60553331352404823322007-02-01T00:03:00.000-05:002007-12-12T01:16:57.047-05:00reMatch(): Improving ColdFusion's regex support<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> Please see this post on my new blog, which includes a demo of the REMatch function:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/archives/rematch-coldfusion">REMatch (ColdFusion)</a>.</p>
</div>
<p>Following are some UDFs I wrote recently to make using regexes in ColdFusion a bit easier. The biggest deal here is my <code>reMatch()</code> function.</p>
<p><code>reMatch()</code>, in its most basic usage, is similar to JavaScript's <code>String.match()</code> method. Compare getting the first number in a string using <code>reMatch()</code> vs. built-in ColdFusion functions:</p>
<ul>
<li><strong><code style="color:#000;">reMatch()</code>:</strong><br />
<code><cfset num = reMatch("\d+", string) /></code></li>
<li><strong><code style="color:#000;">reReplace()</code>:</strong><br />
<code><cfset num = reReplace(string, "\D*(\d+).*", "\1") /></code></li>
<li><strong><code style="color:#000;">reFind()</code>:</strong><br />
<code><cfset matchInfo = reFind("\d+", string, 1, TRUE) /><br />
<cfset num = mid(string, matchInfo.pos[1], matchInfo.len[1]) /></code></li>
</ul>
<p>All of the above would return the same result, unless a number wasn't found in the string, in which case the <code>reFind()</code>-based method would throw an error since the <code>mid()</code> function would be passed a <code>start</code> value of 0. I think it's pretty clear from the above which approach is easiest to use for a situation like this.</p>
<p>Still, that's just the beginning of what <code>reMatch()</code> can do. Change the <code>scope</code> argument from the default of "ONE" to "ALL" (to follow the convention used by <code>reReplace()</code>, etc.), and the function will return an array of all matches. Finally, set the <code>returnLenPos</code> argument to TRUE and the function will return either a struct or array of structs (based on the value of <code>scope</code>) containing the len, pos, AND value of each match. This is very different from how the <code>returnSubExpressions</code> argument of <code>reFind()</code> works. When using <code>returnSubExpressions</code>, you get back a struct containing arrays of the len and pos (but not value) of each backreference from the first match.</p>
<p>Here's the code, with four additional UDFs (<code>reMatchNoCase()</code>, <code>match()</code>, <code>matchNoCase()</code>, and <code>escapeReChars()</code>) added for good measure:</p>
<pre class="code" style="height: 500px;"><span class="comment"><!--- UDFs by Steven Levithan ---></span>
<cffunction name="reMatch" output="FALSE">
<cfargument name="regEx" type="string" required="TRUE" />
<cfargument name="string" type="string" required="TRUE" />
<cfargument name="start" type="numeric" required="FALSE" default="1" />
<cfargument name="scope" type="string" required="FALSE" default="ONE" />
<cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
<cfargument name="caseSensitive" type="boolean" required="FALSE" default="TRUE" />
<cfset var thisMatch = "" />
<cfset var matchInfo = structNew() />
<cfset var matches = arrayNew(1) />
<span class="comment"><!--- Set the time before entering the loop ---></span>
<cfset var timeout = now() />
<span class="comment"><!--- Build the matches array. Continue looping until additional instances of regEx are not found. If scope is "ONE", the loop will end after the first iteration ---></span>
<cfloop condition="TRUE">
<span class="comment"><!--- By using returnSubExpressions (the fourth reFind argument), the position and length of the first match is captured in arrays named len and pos ---></span>
<cfif caseSensitive>
<cfset thisMatch = reFind(regEx, string, start, TRUE) />
<cfelse>
<cfset thisMatch = reFindNoCase(regEx, string, start, TRUE) />
</cfif>
<span class="comment"><!--- If a match was not found, end the loop ---></span>
<cfif thisMatch.pos[1] EQ 0>
<cfbreak />
<span class="comment"><!--- If a match was found, and extended info was requested, append a struct containing the value, length, and position of the match to the matches array ---></span>
<cfelseif returnLenPos>
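<span class="comment"><!--- Create a fresh struct for each match. Structs are appended by reference, so reusing a single struct would leave every array element pointing at the last match ---></span>
<cfset matchInfo = structNew() />
<cfset matchInfo.value = mid(string, thisMatch.pos[1], thisMatch.len[1]) />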
<cfset matchInfo.len = thisMatch.len[1] />
<cfset matchInfo.pos = thisMatch.pos[1] />
<cfset arrayAppend(matches, matchInfo) />
<span class="comment"><!--- Otherwise, just append the match value to the matches array ---></span>
<cfelse>
<cfset arrayAppend(matches, mid(string, thisMatch.pos[1], thisMatch.len[1])) />
</cfif>
<span class="comment"><!--- If only the first match was requested, end the loop ---></span>
<cfif scope IS "ONE">
<cfbreak />
<span class="comment"><!--- If the match length was greater than zero ---></span>
<cfelseif thisMatch.pos[1] + thisMatch.len[1] GT start>
<span class="comment"><!--- Set the start position for the next iteration of the loop to the end position of the match ---></span>
<cfset start = thisMatch.pos[1] + thisMatch.len[1] />
<span class="comment"><!--- If the match was zero length ---></span>
<cfelse>
<span class="comment"><!--- Advance the start position for the next iteration of the loop by one, to avoid infinite iteration ---></span>
<cfset start = start + 1 />
</cfif>
<span class="comment"><!--- If the loop has run for 20 seconds, throw an error, to mitigate against overlong processing. However, note that even one pass using a poorly-written regex which triggers catastrophic backtracking could take longer than 20 seconds ---></span>
<cfif dateDiff("s", timeout, now()) GTE 20>
<cfthrow message="Processing too long. Optimize regular expression for better performance" />
</cfif>
</cfloop>
<cfif scope IS "ONE">
<cfparam name="matches[1]" default="" />
<cfreturn matches[1] />
<cfelse>
<cfreturn matches />
</cfif>
</cffunction>
<cffunction name="reMatchNoCase" output="FALSE">
<cfargument name="regEx" type="string" required="TRUE" />
<cfargument name="string" type="string" required="TRUE" />
<cfargument name="start" type="numeric" required="FALSE" default="1" />
<cfargument name="scope" type="string" required="FALSE" default="ONE" />
<cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
<cfreturn reMatch(regEx, string, start, scope, returnLenPos, FALSE) />
</cffunction>
<cffunction name="match" output="FALSE">
<cfargument name="substring" type="string" required="TRUE" />
<cfargument name="string" type="string" required="TRUE" />
<cfargument name="start" type="numeric" required="FALSE" default="1" />
<cfargument name="scope" type="string" required="FALSE" default="ONE" />
<cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
<cfreturn reMatch(escapeReChars(substring), string, start, scope, returnLenPos, TRUE) />
</cffunction>
<cffunction name="matchNoCase" output="FALSE">
<cfargument name="substring" type="string" required="TRUE" />
<cfargument name="string" type="string" required="TRUE" />
<cfargument name="start" type="numeric" required="FALSE" default="1" />
<cfargument name="scope" type="string" required="FALSE" default="ONE" />
<cfargument name="returnLenPos" type="boolean" required="FALSE" default="FALSE" />
<cfreturn reMatch(escapeReChars(substring), string, start, scope, returnLenPos, FALSE) />
</cffunction>
<span class="comment"><!--- Escape special regular expression characters (.,*,+,?,^,$,{,},(,),|,[,],\) within a string by preceding them with a backslash (\). This allows safely using literal strings within regular expressions ---></span>
<cffunction name="escapeReChars" returntype="string" output="FALSE">
<cfargument name="string" type="string" required="TRUE" />
<cfreturn reReplace(string, "[.*+?^${}()|[\]\\]", "\\\0", "ALL") />
</cffunction></pre>
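<p>To make the return shapes concrete, here is a rough Python analogue built on <code>re.finditer</code>. It is a sketch, not a port: the function and parameter names are hypothetical, <code>pos</code> is reported 1-based to mirror ColdFusion, and zero-length matches may be handled differently than in the UDF above.</p>

```python
import re

def re_match(pattern, string, scope="ONE", return_len_pos=False, case_sensitive=True):
    """Return the first match (scope "ONE") or a list of all matches (scope "ALL").
    With return_len_pos, each match is a dict holding value, len, and 1-based pos."""
    flags = 0 if case_sensitive else re.IGNORECASE
    only_first = scope.upper() == "ONE"
    results = []
    for m in re.finditer(pattern, string, flags):
        if return_len_pos:
            results.append({"value": m.group(0), "len": len(m.group(0)), "pos": m.start() + 1})
        else:
            results.append(m.group(0))
        if only_first:
            break
    if only_first:
        # Like the UDF, fall back to an empty string when nothing matched.
        return results[0] if results else ""
    return results

print(re_match(r"\d+", "abc 123 def 45"))
print(re_match(r"\d+", "abc 123 def 45", scope="ALL"))
print(re_match(r"\d+", "a 12 b", return_len_pos=True))
```

<p>Python's built-in <code>re.escape()</code> covers roughly the same ground as <code>escapeReChars()</code>.</p>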
<p>Now that I've got a deeply featured match function, all I need Adobe to add to ColdFusion in the way of regex support is lookbehinds, atomic groups, possessive quantifiers, conditionals, balancing groups, etc., etc. :-)</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com5tag:blogger.com,1999:blog-24744374.post-1146437473840309832006-04-30T17:43:00.000-05:002007-02-02T21:26:49.145-05:00Hot Naked Sushi Action<p style="float:right"><img src="http://img435.imageshack.us/img435/2105/nakedsushi0en.jpg" alt="Naked sushi" style="margin-left:10px"></p>
<p>Someone was telling me yesterday about how ancient and wonderful an art form <em>nyotaimori</em> (<acronym title="also known as">aka</acronym> naked sushi) is. This may or may <a href="http://en.wikipedia.org/wiki/Nyotaimori">not exactly</a> <span class="small">(wikipedia.org)</span> be true, but it did get me looking online for more information. The practice has actually been <a href="http://news.bbc.co.uk/2/hi/asia-pacific/4570901.stm">outlawed in China</a> <span class="small">(bbc.co.uk)</span>, but then the Chinese don’t allow <a href="http://mdn.mainichi-msn.co.jp/waiwai/www/news/20060424p2g00m0dm003000c.html">snoring in the military</a> or <a href="http://mdn.mainichi-msn.co.jp/waiwai/www/archive/news/2006/04/14/20060414p2g00m0dm029000c.html">cooking children’s arms</a> <span class="small">(mainichi-msn.co.jp)</span> either. Communists.</p>
<p>Fortunately for the sophisticated misogynist, there are a few establishments Stateside that offer <em>nyotaimori</em>, most famously <a href="http://www.globalcuisinecatering.com/">Gary Arabia’s</a> <span class="small">(globalcuisinecatering.com)</span> <a href="http://www.aolcityguide.com/losangeles/dining/venue.adp?sbid=138232">Global Cuisine</a> <span class="small">(aolcityguide.com)</span> in LA.</p>
<p>I spent a few minutes searching for local spots, and according to ClubZone.com, <a href="http://www.clubzone.com/c/Washington_DC/Lounge_Bar/Cafe_Japone.html">Café Japone</a> <span class="small">(clubzone.com)</span> in Dupont Circle does naked sushi Saturday nights. Hmm… why haven’t I heard about this before? I’m skeptical. If it is true, however, I may have to pay them a visit, being a fan of both sushi and naked women.</p>
<p>Thinking…I'm sure they have rules about not talking to the models, but really, where would a conversation with someone you were eating off go?</p>
<blockquote>
<p><strong>Me:</strong> <em>So… Sushi… you a big sushi person?</em><br/>
<strong>Table:</strong> <em>Well, not really.</em><br/>
<strong>Me:</strong> <em>Ah. Well. Being a table, then. How's that working out for you?</em><br/>
<strong>Table:</strong> <em>Not too bad. Pays the rent. I, uh, go home smelling like fish though.</em><br/>
<strong>Me:</strong> <em>Oh.</em></p>
</blockquote>
<p>…It could go on like that for a very long, awkward time.</p>
<p>For those who know I grew up in Japan, I’ll note that I haven’t tried eating <em>off</em> a naked woman yet, though I have on many occasions ea…… [Post interrupted by Poor Taste Alert®]</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com3tag:blogger.com,1999:blog-24744374.post-1144516363596095602006-04-08T11:54:00.000-05:002007-02-04T22:55:35.284-05:00Dynamic Properties<h3 class="subtitle">A cool, little-known, <acronym title="Internet Explorer version 5 or greater">IE5+</acronym> only <acronym title="Cascading Style Sheets">CSS</acronym> bastardization</h3>
<p>I'm just starting to learn about Dynamic Properties (<acronym title="also known as">aka</acronym> <acronym title="Cascading Style Sheets">CSS</acronym> Expressions). Thought this might be interesting to my hordes of readers (not) in the WebDev community…</p>
<ul>
<li><a href="http://msdn.microsoft.com/workshop/author/dhtml/overview/recalc.asp">About Dynamic Properties</a> <span class="small">(msdn.microsoft.com)</span></li>
<li><a href="http://msdn.microsoft.com/library/en-us/dndude/html/dude061198.asp">Be More Dynamic</a> <span class="small">(msdn.microsoft.com)</span></li>
<li><a href="http://www.javascriptkit.com/dhtmltutors/dynproperty.shtml">Introduction to Dynamic Properties (<acronym title="Internet Explorer">IE</acronym>)</a> <span class="small">(javascriptkit.com)</span></li>
</ul>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com2tag:blogger.com,1999:blog-24744374.post-1144377464543386802006-04-06T21:37:00.000-05:002007-02-04T18:59:16.511-05:00You Got Mightily Owned, My Friend<p>Discovered this via my buddy Dave… Check out <a href="http://www.aninote.com/">Aninote.com</a>. It allows you to dynamically generate Flash presentations … you can insert any recipient name in the animations simply by sticking it before the domain name for the animation you wish to use.</p>
<p>See, for example:</p>
<ul>
<li><a href="http://steven.segal.justgotowned.com/">http://steven.segal.justgotowned.com</a></li>
<li><a href="http://michael.jackson.youaremighty.com/">http://michael.jackson.youaremighty.com</a></li>
<li><a href="http://chocolate.balls.youaremyfriend.com/">http://chocolate.balls.youaremyfriend.com</a></li>
</ul>
<p>Mighty awesome.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-1144034940492563782006-04-02T21:41:00.000-05:002007-02-06T11:59:39.645-05:00World RPS Society<p>Check it: <a href="http://www.worldrps.com/">Worldwide Governing Body of the Sport of Rock Paper Scissors</a> <span class="small">(worldrps.com)</span>.</p>
<div style="float:right">
<img src="http://www.worldrps.com/images/archive/leadon.jpg" alt="World RPS Society - Lead on with your rock paper and scissors" style="width:250px; height:179px;"/>
</div>
<p>Looks like it’s a big thing. Weird. Still, this has to be one of the most unintentionally funny sites ever. Check out, for example, their <a href="http://www.worldrps.com/index.php?option=com_content&task=view&id=154&Itemid=37">2002 dedication to the Official Year of the Rock</a> (in particular, check out the last two photo captions within the article).</p>
<!-- <div style="float:left">
<img src="http://www.worldrps.com/images/rpsver3/Museum/youryouryour.jpg" alt="Your rock, Your paper, Your scissors, Will bring us victory"/>
</div> -->
<p>At first glance, the rules of Rock Paper Scissors seem simple. As you look deeper, however, they’re still pretty simple. Let’s not kid ourselves.</p>
<ul>
<li>Rock smashes scissors (rock wins)</li>
<li>Scissors cut paper (scissors win)</li>
<li>Paper covers rock (paper wins)</li>
<li>Flounder slaps penguin (flounder wins … for expert use only)</li>
</ul>
<p>…Or so I thought. Here is the 3-page <em><a href="http://www.worldrps.com/index.php?option=com_content&task=view&id=14&Itemid=31">How to Play – Quick Start</a></em>, and the 7-page must read for all aspiring RPS gurus: <em><a href="http://www.worldrps.com/index.php?option=com_content&task=view&id=13&Itemid=28">Advanced RPS tactics</a></em>.</p>
<p>Check out this excerpt from the World RPS Society, showing just how high-demand of a sport Rock Paper Scissors can really be:</p>
<blockquote><p>… In other events, Chad Leatherstep (Co-Chair Disciplinary Committee) in his address delivered a landmark speech pledging a crackdown on performance enhancing drugs in professional level play. “It is the worst kept secret that the dressing rooms at many tournaments have become literal ‘hotboxes’ of abuse. We will be targeting specific suspicious players for random drug testing.<!-- They should be easy to spot as they tend to spend more time hanging around the vending machine and concession stands than the drug-free players.-->”</p></blockquote>
<p>Imagine that. Your friendly, local Rock Paper Scissors tournament, unbeknownst to you, might have become a literal “hotbox” of performance-enhancing drug abuse!</p>
<p>It makes sense in a way, I guess … these people have to be on <em>something</em> potent to be at an RPS tournament in the first place.</p>
<p>Related stuff:</p>
<ul>
<li><a href="http://www.emf.net/~estephen/roshambo/">The cult of roshambo</a> <span class="small">(emf.net)</span> — devoted to the religion of Rock Paper Scissors</li>
<li><a href="http://www.brunching.com/psr.html">Play Rock Paper Scissors via email</a> <span class="small">(brunching.com)</span></li>
</ul>
<p>And here’s some stirring RPS haiku gleaned from the RPS Society’s <a href="http://www.exprod.ca/bullboard/">Bullboard</a>:</p>
<blockquote>
<p>Always throw paper.<br/>
How can you lose with paper?<br/>
Forget scissors, man.</p>
<p>You delayed your prime<br/>
Won’t synchronize your rhythm<br/>
That’s just dirty play</p>
<p>Paper is the throw<br/>
For the narcissistic fool<br/>
The masturbator</p>
<p>Few are perfect forms<br/>
The rock however is one<br/>
Likewise breasts are too</p>
</blockquote>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com1tag:blogger.com,1999:blog-24744374.post-1143861608046628942006-03-31T22:15:00.000-05:002007-02-01T20:29:39.484-05:00Ye Goode Olde Dayes<p>Check out this most excellent series of promotional computer images from the ’60s and ’70s, back when taking a photo of a computer required, as the author put it, “A wide-angle lens. And a woman in a thigh-high skirt.”</p>
<p><a href="http://www.lileks.com/institute/compupromo/1.html">Compu-promo</a> <span class="small">(lileks.com)</span></p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-1143600971504547672006-03-28T21:47:00.000-05:002007-05-30T17:58:18.644-05:00English > L337 Translator (ColdFusion)<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> A demo is available on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/coldfusion/leet-translator/">Leet Translator</a>.</p>
</div>
<p>Apparently I had time to waste writing a <a href="http://en.wikipedia.org/wiki/Leet">L337</a> hax0r translator in ColdFusion (okay, so it only took about 15 minutes). I figured I might as well pass it on...the output is different every time & it's reasonably badass. I'll try to put it on a publicly accessible ColdFusion server or rewrite it in JavaScript within a few days so you can see it in action.</p>
<pre class="code" style="overflow: auto; height: 400px;">
<h1>L337 Translator!!</h1>
<span class="comment"><!--- If form submitted with value ---></span>
<cfif isDefined("Form.message") AND len(Form.message)>
<cfset Variables.alphabet = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z" />
<cfset Variables.cipher = "4,8,[,),3,ƒ,6,##,1,_|,X,1,|v|,|\|,0,|*,()_,2,5,+,(_),\/,\/\/,×,`/,2" />
<cfset Variables.output = "" />
<span class="comment"><!--- Loop over received text, one character at a time ---></span>
<cfloop index="i" from="1" to="#len(Form.message)#">
<span class="comment"><!--- Gives 50% odds ---></span>
<cfif round(rand())>
<span class="comment"><!--- Add leet version of character to output ---></span>
<cfset Variables.output = Variables.output & replaceList(lCase(mid(Form.message, i, 1)), Variables.alphabet, Variables.cipher) />
<cfelse>
<cfif round(rand())>
<span class="comment"><!--- Add uppercase version of character to output ---></span>
<cfset Variables.output = Variables.output & uCase(mid(Form.message, i, 1)) />
<cfelse>
<span class="comment"><!--- Add unviolated character to output ---></span>
<cfset Variables.output = Variables.output & mid(Form.message, i, 1) />
</cfif>
</cfif>
</cfloop>
<cfif round(rand())>
<cfset Variables.suffixes = "w00t!,d00d!,pwnd!,!!!11!one!,teh l337!,hax0r!,sux0rs!" />
<span class="comment"><!--- Append random suffix from list to output ---></span>
<cfset Variables.output = Variables.output & " " & listGetAt(Variables.suffixes, int(listLen(Variables.suffixes) * rand()) + 1) />
</cfif>
<h2>Original Text:</h2>
<div style="background:#d2e2ff; border:2px solid #369; padding:0 10px;">
<p><cfoutput>#paragraphFormat(Form.message)#</cfoutput></p>
</div>
<h2>Translation:</h2>
<div style="color:#0f0; background:#000; border:2px solid #0f0; padding:0 10px;">
<p><cfoutput>#paragraphFormat(Variables.output)#</cfoutput></p>
</div>
</cfif>
<form action="<cfoutput>#CGI.SCRIPT_NAME#</cfoutput>" method="post" style="margin-top:20px;">
<cfparam name="Form.message" default="Enter text to translate" />
<textarea name="message" style="width:300px; height:75px;"><cfoutput>#Form.message#</cfoutput></textarea>
<br/><br/>
<input type="submit"/>
</form>
</pre>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-1143519115223273072006-03-27T22:53:00.000-05:002007-02-06T12:11:52.355-05:00God’s Pet Peeves<p>So, I was flipping through a Bible & came to <a href="http://www.biblegateway.com/passage/?search=LEV%2011&version=50;">Leviticus 11</a>, with its variety of swimming, walking, crawling & flapping abominations. ’Twas an interesting re-read.</p>
<p>Some highlights:</p>
<p>We aren’t actually supposed to be eating ostriches. I was a bit surprised by that, as I didn't think ostriches were too bountiful in Hebrew lands in those days, so much so that there were ostrich-avoidance rules in place.</p>
<p>God is <em>really</em> peeved at “unclean” animals, & takes his time calling them names like abomination & defiler.</p>
<p>Apparently no one noticed that four-footed insects are mighty scarce, & that laws banning their consumption (<a href="http://www.biblegateway.com/passage/?search=LEV%2011:20-23;&version=50;">see here</a>…it’s good stuff) are a bit superfluous. Also, there is a distinction made between jumping & non-jumping bugs. Some of the former are yummy, while the latter are abominations without exception (so if mom tries to sneak one into your dinner, pick it out, feed it to the dog & then stone her to death).</p>
<p>Special mention is made that you ought not to “boil young goats in their mother’s milk” (<a href="http://www.biblegateway.com/passage/?book_id=5&chapter=14&verse=21&version=50">Deut 14:21</a>).</p>
<p>And let us never forget that <a href="http://www.godhatesfigs.com/">god hates figs</a> & <a href="http://www.godhatesshrimp.com">shrimp</a>.</p>
<p>Pondering further, I’d bet god is a shrimp lover too, like me. Think about it, he fools us into thinking they’re nasty so we don’t eat them & they can flourish in the ocean … Thousands of years of Christianity, Judaism & Islam might boil down to nothing more than an elaborate shrimp-saving conspiracy.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com2tag:blogger.com,1999:blog-24744374.post-1143402625997118892006-03-26T14:01:00.000-05:002007-05-30T17:56:42.546-05:00Regex Recursion Without Balancing Groups (Matching Nested Constructs)<div style="background:#ffff6a; border:1px solid #fc3; padding:10px 10px 0 10px; margin-bottom:15px;">
<p><strong>Update:</strong> Please view the updated version of this post on my new blog:</p>
<p style="padding-left:25px;"><a href="http://blog.stevenlevithan.com/regular-expressions/recursion/">Regex Recursion (Matching Nested Constructs)</a>.</p>
</div>
<p>Some dude posed the following problem on a <acronym title="regular expression">regex</acronym> advice forum I visit every once in a while…</p>
<p>He was trying to scrape <a href="http://en.wikipedia.org/wiki/BibTeX">BibTeX</a> entries from Web pages using JavaScript (from within a Firefox extension).</p>
<p>A single BibTeX entry looks roughly like this:</p>
<pre class="code">@resourceType{
field1 = value,
field2 = "value in quotation marks",
field3 = "value in quotation marks, with {brackets} in the value",
field4 = {brackets}
}</pre>
<p>The resident <acronym title="regular expression">regex</acronym> experts were quick to point out that <acronym title="regular expressions">regexes</acronym> are only capable of recursion through the use of “Balancing Groups”, a feature supported exclusively by .NET (see the <a href="http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf">chapter on .NET</a><img src="http://www.netdes.com/img/icon_pdf.gif" alt="PDF icon"/> in <cite>Mastering Regular Expressions</cite>).</p>
<p>Basically, searches requiring recursion have typically been the domain of parsers, not <acronym title="regular expressions">regexes</acronym>. The problem in this particular case lies in how you distinguish between the last closing bracket of the <code>@resourceType{…}</code> block and any of the inner closing brackets. The only difference between them is which opening bracket each one is logically linked to (<acronym title="id est">i.e.</acronym> which open–close pair it belongs to). This logic is impossible to implement with a simple lookaround assertion.</p>
<p>Still, given that there was only one level of recursion, I figured it was possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced <acronym title="regular expression">regex</acronym> features, actually):</p>
<p><code style="font-size:105%; color:#000;">@<span style="background:#ffc080;">[^{]</span><span style="background:#80c0ff;">+</span>{<span style="background:#00c000; color:#fff;">(?:</span><span style="background:#ffc080;">[^{}]</span><span style="background:#00c000; color:#fff;">|</span>{<span style="background:#ffc080;">[^{}]</span><span style="background:#80c0ff;">*</span>}<span style="background:#00c000; color:#fff;">)*</span>}</code></p>
<p class="small">(Note: I've used <a href="http://www.regexbuddy.com/">RegexBuddy's</a> formatting style to color it.)</p>
<p>This, however, works only if:</p>
<ul>
<li>braces are always balanced, and</li>
<li>the level of brace nesting is no more than one.</li>
</ul>
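<p>Here's a rough JavaScript sketch of the pattern at work, including that second limitation. The sample strings and variable names are made up for illustration, and the literal braces are backslash-escaped (an assumption made for portability; it shouldn't change what the regex matches).</p>

```javascript
// The pattern from above, with its literal braces escaped. The escaping is
// an assumption for portability; the unescaped form matches the same text.
const entryPattern = /@[^{]+\{(?:[^{}]|\{[^{}]*\})*\}/;

// Hypothetical sample entry, modeled on the BibTeX outline earlier in the post.
const entry = [
  "@resourceType{",
  "    field1 = value,",
  '    field2 = "value in quotation marks",',
  '    field3 = "value in quotation marks, with {brackets} in the value",',
  "    field4 = {brackets}",
  "}",
].join("\n");

// Surround the entry with unrelated text to show that only the entry is matched.
const page = "Some intro text. " + entry + " Some trailing text.";
const found = page.match(entryPattern);
const oneLevelMatch = found ? found[0] : null;

// Two levels of brace nesting: one more than the pattern can handle.
const tooDeep = "@resourceType{ field1 = {outer {inner} outer} }";
const twoLevelMatch = tooDeep.match(entryPattern);

console.log(oneLevelMatch === entry);
console.log(twoLevelMatch);
```

<p>The first string matches as a whole entry; the second, with braces nested two deep, finds no match at all rather than a partial one.</p>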
<p>Still, <acronym title="regular expression">regex</acronym> users might find the logic interesting, and better yet, find some actual use for it.</p>
<!--
<p>Obviously, the point of this story, aside from that limited recursion is possible, is that you shouldn't mess with my <acronym title="regular expression">regex</acronym> skills, because I was trained by ninjas.</p>
<div style="float:right; margin-top:-10px;">
<img src="http://img95.imageshack.us/img95/949/regexninja5ly.gif" alt="^reg(?:ular expressions?|ex(?:p|es)?)\x20ninja!*$"/>
</div>
-->
<p>For those unfamiliar with regular expressions or who are looking for a good tutorial, see <a href="http://www.regular-expressions.info/">www.regular-expressions.info</a><!-- (created by the maker of <a href="http://www.regexbuddy.com/">RegexBuddy</a>, <a href="http://www.powergrep.com/">PowerGREP</a>, etc.)-->.</p>
<p>And feel free to post your own <acronym title="regular expression">regex</acronym> problems in the comments if you think I might be able to help.</p>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit:</strong> This logic is easy to extend to support more levels of recursion, as long as you know in advance the maximum levels of recursion you need to support. Here's a simple example of matching HTML elements and their contents (yes, I know, element names can't start with a number, and I'm not supporting attributes or singleton elements, but that would make the regexes longer and this is only supposed to be for demonstration purposes):</p>
<p><em>No recursion:</em><br />
<code><([a-z\d]+)>.*?</\1></code><br />
<em>Up to one level of recursion:</em><br />
<code><([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1></code><br />
<em>Up to two levels of recursion:</em><br />
<code><([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1></code><br />
<em>And so on...</em></p>
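<p>Since each extra level just wraps the previous pattern in the same way, these regexes could be generated rather than written by hand. Here's a minimal JavaScript sketch of that idea; <code>buildNestedElementRegex</code> is a made-up helper name, and the same caveats apply (no attributes, no singleton elements, and <code>.</code> won't cross line breaks).</p>

```javascript
// Hypothetical helper: builds the demonstration regex for a given maximum
// nesting depth by repeatedly wrapping the previous level's content pattern.
function buildNestedElementRegex(maxDepth) {
  // Depth 0 content: anything, matched lazily.
  let content = ".*?";
  for (let level = 0; level < maxDepth; level++) {
    // Each level allows either a nested same-name element or any one character.
    content = "(?:<\\1>" + content + "</\\1>|.)*?";
  }
  return new RegExp("<([a-z\\d]+)>" + content + "</\\1>");
}

// Sample input (an assumption for illustration): same-name elements nested two deep.
const nested = "<b>1<b>2<b>3</b>4</b>5</b>";
const depthOneMatch = nested.match(buildNestedElementRegex(1))[0];
const depthTwoMatch = nested.match(buildNestedElementRegex(2))[0];

console.log(depthOneMatch);
console.log(depthTwoMatch);
```

<p>With only one level allowed, the match on that sample stops short at the second closing tag; with two levels allowed, it covers the whole string.</p>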
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit 2 (2007-04-04):</strong> I've since learned that, in addition to using .NET's balancing groups, true recursion is possible in Perl 5.6+ using Perl's <code>qr//</code> operator to compile a regex as a variable, used together with the little known <code>(??{ })</code> operator which instructs Perl that when compiling the regex it should not interpolate the encapsulated code until it's actually used. See the article <a href="http://www.perl.com/pub/a/2003/06/06/regexps.html">Regexp Power</a> by Simon Cozens for details.</p>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit 3 (2007-04-10):</strong> Here's another Perl-specific way of achieving true recursion, this time supported by Perl 5.005+: <a href="http://perl.plover.com/yak/regex/samples/slide083.html">Perl Regular Expression Mastery by Mark Jason Dominus: Slide 83</a>.</p>
<p style="padding-top:10px; border-top:1px dashed #999;"><strong>Edit 4 (2007-04-16):</strong> Yet another method for recursion, this time supported by PCRE and by Perl 5.10+: the special item <code>(?R)</code>. (Python's built-in <code>re</code> module does not support it.) See <a href="http://www.pcre.org/pcre.txt">pcre.org/pcre.txt</a> for details.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com5tag:blogger.com,1999:blog-24744374.post-1143350945988545552006-03-26T00:27:00.000-05:002007-02-01T20:28:36.870-05:00Brokeback to the Future<p>Funny shit: <a href="http://www.youtube.com/watch?v=zfODSPIYwpQ">Brokeback to the Future</a> <span class="small">(youtube.com)</span>.</p>
<p>I never realized how different you could make a movie seem just by adding a mournful guitar track in the background.</p>
<!--<p>See also: <a href="http://www.youtube.com/watch?v=UKeDWCLajQk">Brokeback Snake Mountain</a> <span class="small">(youtube.com)</span></p>-->Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0tag:blogger.com,1999:blog-24744374.post-1143343945782089672006-03-25T22:11:00.000-05:002007-02-02T21:41:36.965-05:00The SNAFU Principle<h3 class="subtitle">A priceless bit from <cite><a style="font-weight:normal;" href="http://www.outpost9.com/reference/jargon/jargon_toc.html">The New Hacker’s Dictionary</a></cite></h3>
<p><strong><acronym title="Situation Normal, All Fucked Up">SNAFU</acronym> principle</strong> /sna'foo prin'si-pl/ /n./</p>
<p>[from a <acronym title="World War 2">WWII</acronym> Army acronym for ‘Situation Normal, All Fucked Up’] “True communication is possible only between equals, because inferiors are more consistently rewarded for telling their superiors pleasant lies than for telling the truth.” — a central tenet of <a href="http://www.outpost9.com/reference/jargon/jargon_19.html#TAG480">Discordianism</a>, often invoked by hackers to explain why authoritarian hierarchies screw up so reliably and systematically. The effect of the <acronym title="Situation Normal, All Fucked Up">SNAFU</acronym> principle is a progressive disconnection of decision-makers from reality. This lightly adapted version of a fable dating back to the early 1960s illustrates the phenomenon perfectly:</p>
<div style="margin:15px 0 15px 25px; font-style:italic; font-size:95%;">
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">I</span>n the beginning was the plan,<br>
and then the specification;<br>
And the plan was without form,<br>
and the specification was void.</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">A</span>nd darkness was on the faces of the implementors thereof;<br>
And they spake unto their leader, saying:<br>
"It is a crock of shit,<br>
and smells as of a sewer."</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">A</span>nd the leader took pity on them,<br>
and spoke to the project leader:<br>
"It is a crock of excrement,<br>
and none may abide the odor thereof."</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">A</span>nd the project leader<br>
spake unto his section head, saying:<br>
"It is a container of excrement,<br>
and it is very strong, such that none may abide it."</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">T</span>he section head then hurried to his department manager,<br>
and informed him thus:<br>
"It is a vessel of fertilizer,<br>
and none may abide its strength."</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">T</span>he department manager carried these words to his general manager,<br>
and spoke unto him, saying:<br>
"It containeth that which aideth the growth of plants,<br>
and it is very strong."</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">A</span>nd so it was that the general manager rejoiced<br>
and delivered the good news unto the Vice President.<br>
"It promoteth growth,<br>
and it is very powerful."</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">T</span>he Vice President rushed to the President's side,<br>
and joyously exclaimed:<br>
"This powerful new software product<br>
will promote the growth of the company!"</p>
<p><span style="font-size:190%; font-weight:bold; font-style:normal; font-family:'Edwardian Script ITC','Monotype Corsiva';">A</span>nd the President looked upon the product,<br>
and saw that it was very good.</p>
</div>
<p>After the subsequent and inevitable disaster, the <a href="http://www.outpost9.com/reference/jargon/jargon_34.html#TAG1716">suit</a>s protect themselves by saying “I was misinformed!”, and the implementors are demoted or fired.</p>Stevehttp://www.blogger.com/profile/18374441096323901069noreply@blogger.com0