Saturday, October 30, 2010

Concerning mod_rewrite as Related to URI Manifestation

While not a tenet of the Semantic Web concept, the substantial manifestation of resource URIs is certainly an ideal, insomuch as a web page can be "substantial."  As defined by Semantic Web methodologies (and RDF in general), each Resource is identified by a URI.  These URIs, being identifiers and not true locations as URLs are, are not required to point to anything substantial.  It is, however, helpful if they do, as this provides much more robust access to the Resource.

To exemplify this point, let us say that you have identified a particular resource (for the sake of example, a single comic strip) with the URI  As a URI this is completely valid, regardless of whether the identifier points to anything useful when typed into a web-browser.  Let us further say that, while attributing identifiers to comics, you also set yourself to the task of creating an application housing these identified resources and exposing access to them via a webpage.  A user of the webpage enters the URI of the resource they wish to see and, if said resource exists within your system, it is returned to them.  Here we have a situation where the resource itself is uniquely identified by a URI and is accessible by an application.

It would be much more convenient, however, if the seeker of a resource did not need to know both the URI of the desired resource and the location and usage of the application through which it may be retrieved, and could instead simply type the URI into an application with which they are already familiar, such as a web-browser.  In the particular example which we have been exploring this is somewhat trivial assuming you own the domain "" (obviously, in practice, would be replaced by a domain you do in fact own).  You could create a directory structure on your server equivalent to the URI requested and place an index file within the final child directory.  Then a request to the URI specified above would retrieve   

This is a reasonable starting place, but the pitfalls are numerous and some are not so obvious.  There is the clear downside of maintenance, be it manual or automatic.  There is also the less obvious downside of URI support.  Taking this approach forces you to use URLs as the URIs of all of your resources, limiting your naming potential.  To illustrate this, let us say that instead of the URI chosen above, you have identified your comic with the slightly different URI  The change is a single character, but how a web-browser will handle this request is vastly different from how it would handle the previous request.  As per section 2.2 of RFC 1738,

The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.
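The split described in that passage can be seen with a small Python sketch.  It is only an illustration: the URI below is hypothetical (example.org stands in for a domain you own, and "strip-42" is an invented comic name), and it assumes the character in question is the "#" discussed in the quote.

```python
from urllib.parse import urlsplit

# Hypothetical URI; example.org stands in for a domain you own.
uri = "http://example.org/comics/1.2#strip-42"

parts = urlsplit(uri)

# A browser requests only the path (and query) from the server;
# everything after "#" is treated as a fragment and kept client-side.
sent_to_server = parts.path
kept_by_browser = parts.fragment

print(sent_to_server)   # /comics/1.2
print(kept_by_browser)  # strip-42
```

In other words, the part of the identifier after "#" never reaches the server as part of the requested location.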

You might be tempted to skirt around the issue by percent-encoding such characters.  This is not a viable solution either, however, since per section 2.2 of RFC 3986 (coincidence that these notes are both from section 2.2 of their respective documents?), "#" is a reserved general delimiter and,

URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.
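A short Python sketch makes the non-equivalence concrete.  Both URIs are hypothetical (example.org and "strip-42" are invented for illustration); the only difference between them is that "#" has been replaced with its percent-encoded form "%23".

```python
from urllib.parse import urlsplit

# Hypothetical URIs that differ only in how "#" is written.
raw = "http://example.org/comics/1.2#strip-42"
encoded = "http://example.org/comics/1.2%23strip-42"

raw_parts = urlsplit(raw)
enc_parts = urlsplit(encoded)

# With a literal "#", the text after it is a fragment.
print(raw_parts.path, "|", raw_parts.fragment)

# With "%23", the same text is ordinary path data and there is no fragment.
print(enc_parts.path, "|", enc_parts.fragment)
```

The two strings parse into different components, so an application comparing identifiers would treat them as two distinct URIs.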

Given this, let us return to a previous portion of the example.  What would be ideal is if you could leverage the URI's full scope of delimiters and if, instead of the URI typed into the web-browser being treated as a URL, it were passed to a subsequent application which would properly treat it as a URI and return to the user the identified resource.


mod_rewrite is an Apache module which allows you to rewrite a requested URL based on a series of conditions and rules.  The rewriting happens on the server side, so the user is not made aware of the change in request (that is to say, the address which the user has entered into the address bar does not change).  This module gives us the ability to "catch" a requested URI before it is fully processed as a URL and rewrite the request into something the server will actually be able to process as the user expects.  mod_rewrite may be implemented at either the server or directory level.  While it is generally best to implement such modules at the server level, there are cases where it is more appropriate (or necessary, in the case of a shared server) to implement the module at the directory level via a .htaccess file.

Returning again to our example, we might write rewrite rules stating "any time a user makes a request to an address starting with, hand that request to our resource retrieval application (introduced earlier in this post) passing the fully requested URI as a GET parameter."  Such a rule might look like this (assuming implementation at the directory level starting at the root directory of

RewriteEngine on

RewriteRule ^comics/1\.2/(.*) request-resource.php?uri=$1 [L]

Breaking this down, the RewriteRule first specifies the pattern under which the rule should fire.  This pattern is written as a Perl-compatible regular expression.  The regular expression in this example states that the rule will fire whenever it encounters a request "starting with comics/1.2/ and followed by any number of characters."  The RewriteRule then specifies the action to take when the rule fires.  In this case, the request is rewritten as a request to the script "request-resource.php", and the portion of the request following comics/1.2/ is passed as the GET parameter uri (the $1 in the substitution refers to the text captured by "(.*)" in the pattern).  Finally, a set of optional flags may be given to the RewriteRule.  In this example we give [L], meaning that this is the last rule which should fire.
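To see the capture and substitution at work outside of Apache, here is a rough Python sketch.  It uses Python's re module as a stand-in for Apache's regular expression engine (the two agree for a pattern this simple), with the dot in "1.2" escaped so that it matches only a literal period.  The function name rewrite and the sample paths are invented for illustration.

```python
import re

# A pattern equivalent to the RewriteRule's, with the "." escaped.
PATTERN = re.compile(r"^comics/1\.2/(.*)")


def rewrite(request_path):
    """Return the rewritten target, or None when the rule does not fire."""
    match = PATTERN.match(request_path)
    if match is None:
        return None
    # match.group(1) plays the role of $1 in the RewriteRule.
    return "request-resource.php?uri=" + match.group(1)


print(rewrite("comics/1.2/strip-42"))  # request-resource.php?uri=strip-42
print(rewrite("about.html"))           # None
```

Note that the sample paths have no leading slash.  This mirrors the directory-level (.htaccess) case, where Apache strips the directory prefix before matching.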

Given this code (and the assumption that the previously discussed application is in fact accessible via the PHP file request-resource.php), users may now retrieve resources defined within the namespace simply by entering the URI into a web browser.
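For completeness, here is a minimal sketch of what the receiving application might do with the uri parameter.  It is written in Python rather than PHP purely for illustration, and everything in it is an assumption: the namespace, the in-memory RESOURCES table, and the handle function are all hypothetical stand-ins for whatever your real application uses.

```python
from urllib.parse import parse_qs

# Hypothetical namespace; example.org stands in for a domain you own.
NAMESPACE = "http://example.org/comics/1.2/"

# Hypothetical store mapping full resource URIs to page content.
RESOURCES = {
    NAMESPACE + "strip-42": "<html>comic strip 42</html>",
}


def handle(query_string):
    """Rebuild the full URI from the rewritten query and look it up."""
    params = parse_qs(query_string)
    remainder = params.get("uri", [""])[0]
    uri = NAMESPACE + remainder
    body = RESOURCES.get(uri)
    if body is None:
        return 404, "No resource identified by " + uri
    return 200, body


print(handle("uri=strip-42"))
print(handle("uri=no-such-strip"))
```

Because the rewrite rule passes along only the portion of the request after the namespace prefix, the application has to prepend that prefix again before treating the value as a full URI.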

Note: this technique will not handle all URIs.  For example, "urn:oasis:names:specification:docbook:dtd:xml:4.1.2" would not be accessible via this methodology, though it still stands as a valid URI.