

Grex Web Item 8: An Exercise in URL Rewriting
Entered by remmers on Thu Mar 8 17:49:56 UTC 2007:

The mod_rewrite module of the Apache webserver can be used to translate 
a requested URL to a different one.  This is useful, for example, if a 
page has been moved to a new location but you still want the old URL to 
work.

If the webserver has been configured to allow it, this facility is 
available to users on a per-directory basis.  You create a file 
named .htaccess in the directory in which you want URL translations to 
apply and put some directives in the file that specify how the 
translation should be done.

As an exercise for myself in writing .htaccess files, I've implemented a 
simplified URL scheme for read-only access to Grex conferences, items, 
and responses.  It works as follows:

  http://jremmers.org/grex/bbs               -list of all conferences
  http://jremmers.org/grex/bbs/CONF          -index of conference CONF
  http://jremmers.org/grex/bbs/CONF/ITEM     -content of an item
  http://jremmers.org/grex/bbs/CONF/ITEM/SEL -selected part of an item

Examples:

  http://jremmers.org/grex/bbs/kitchen     -index of kitchen cf.
  http://jremmers.org/grex/bbs/web/5       -item 5 of web cf.
  http://jremmers.org/grex/bbs/web/5/2     -resp 2 of item 5 of web cf.
  http://jremmers.org/grex/bbs/web/5/1-4   -resps 1-4 of that item
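
To make the scheme concrete, here is a small Python sketch that builds the short URLs from their parts.  The function name short_url and its arguments are made up for illustration; only the base URL and the CONF/ITEM/SEL layout come from the listing above.

```python
BASE = "http://jremmers.org/grex/bbs"  # base URL taken from the examples above


def short_url(conf=None, item=None, sel=None):
    """Build a short URL; each later part requires the earlier ones.

    short_url is a hypothetical helper, not part of Backtalk or Apache.
    """
    if item is not None and conf is None:
        raise ValueError("an item requires a conference")
    if sel is not None and item is None:
        raise ValueError("a selection requires an item")
    parts = [BASE]
    for part in (conf, item, sel):
        if part is None:
            break
        parts.append(str(part))
    return "/".join(parts)


print(short_url())                 # list of all conferences
print(short_url("kitchen"))        # index of the kitchen conference
print(short_url("web", 5))         # item 5 of the web conference
print(short_url("web", 5, "1-4"))  # responses 1-4 of that item
```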

Note that even though the domain given in the URLs is my website, no 
Grex conference content is actually stored there.

Feel free to play around with this.  I'll explain how I did it in a 
subsequent response.

11 responses total.



#1 of 11 by other on Fri Mar 9 13:21:28 2007:

What does that do to web logs?  Is the rewritten URL logged as referrer?


#2 of 11 by remmers on Fri Mar 9 21:08:46 2007:

Hm, dunno.  Anybody?


#3 of 11 by remmers on Fri Mar 9 21:59:38 2007:

I did some reading up on and experimentation with the HTTP referer 
header.  Typically it's sent by a browser when you follow a link; its 
value is the URL of the page on which the link occurs.  It's a reverse 
link from the target of the original link back to the source.

If you're reading this in Backtalk, you can see the value of the referer 
header by clicking on this link:  http://c2.com/cgi/test/
You'll get a display of the list of HTTP headers that your browser sent 
to the server at c2.com.  Unless your browser is configured not to send 
referer headers, one of the headers will be HTTP_REFERER; its value is 
the URL of the Backtalk page on which the link occurs.

On the other hand, if you go to a URL by simply typing it into your 
browser address window, your browser shouldn't send a referer header.  
You can try this out with c2.com too.

I think all that is completely independent of any rewriting that 
mod_rewrite does, though, since the referer header is sent by the 
browser before any rewriting on the server takes place.
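
A rough way to see that the referer is purely a client-side matter is to build the two kinds of request by hand.  This sketch uses Python's urllib.request and only constructs the request objects; nothing is sent over the network.  The example.org page is a made-up placeholder for "the page the link was on".

```python
from urllib.request import Request

target = "http://c2.com/cgi/test/"

# Following a link: the client attaches the URL of the linking page.
# (example.org is a hypothetical linking page, used only for illustration.)
followed = Request(target, headers={"Referer": "http://example.org/page-with-link.html"})

# Typing the URL directly: the client attaches no Referer header at all.
typed = Request(target)

print(followed.get_header("Referer"))  # the linking page's URL
print(typed.has_header("Referer"))     # False
```

Either way, the header is fixed before the server sees the request, so whatever rewriting the server does afterwards can't change it.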

(My starting point for the above was looking at the "HTTP referer" 
article in Wikipedia (http://en.wikipedia.org/wiki/HTTP_Referer).  The 
article points out that the correct spelling is "referrer" and that 
whoever made up the HTTP header name misspelled it.)


#4 of 11 by remmers on Fri Mar 9 22:15:26 2007:

Hm... In testing the link in resp:3, it seems that no referer header is 
sent, at least by my browser (Safari).  However, I made a page
http://grex.org/~remmers/referer.html that links to c2.com/cgi/test/;
when I click on *that* link, Safari sends the expected referer header.

Same results with Firefox.

Not sure what's going on.  If I click on the link in resp:3 from my usual 
Backtalk interface (pistachio), no referer is sent.  However, one is sent 
when I click it from the vanilla interface.


#5 of 11 by cmcgee on Sat Mar 10 13:42:51 2007:

John, I'm reading, even if it's mostly over my head.  Thank you for musing
out loud about this stuff.


#6 of 11 by remmers on Sat Mar 10 14:05:29 2007:

You're welcome.

Okay, a little more investigation seems to indicate that a referer is sent 
if you're using Backtalk in readonly mode and not if you're using it as an 
authenticated user.  Maybe it's a security feature.


#7 of 11 by remmers on Sat Mar 10 15:05:19 2007:

Getting back to the original topic, here are the details on how the URL 
rewriting is done.  In the root web directory of my website, I created a 
directory called "grex" and a subdirectory of that called "bbs" (which 
you can see via the link http://jremmers.org/grex/).

The only file in the bbs directory is a .htaccess file that specifies 
how anything following "grex/bbs/" is translated.  (The line numbers are 
supplied for ease of reference and aren't actually part of the file.  
Also, for readability I've done some line wrapping; each number 
corresponds to one line of the file.  You can see the actual .htaccess 
file at http://jremmers.org/htaccess-example.txt)

-----------------------------------------------------------------------
1. RewriteEngine on
2. RewriteRule
   ^/*$
   http://grex.org/cgi-bin/backtalk/vanilla/conflist
3. RewriteRule
   ^([^/]+)/*$
   http://grex.org/cgi-bin/backtalk/vanilla/browse?conf=$1
4. RewriteRule
   ^([^/]+)/+([0-9]+)/*$
   http://grex.org/cgi-bin/backtalk/vanilla/read?conf=$1&item=$2&rsel=all
5. RewriteRule
   ^([^/]+)/+([0-9]+)/([^/]+)/*$
   http://grex.org/cgi-bin/backtalk/vanilla/read?conf=$1&item=$2&rsel=$3
-----------------------------------------------------------------------

Line 1 tells Apache to pay attention to the rewriting rules.

The next 4 lines are rewriting rules, each of the form

  RewriteRule PATTERN REPLACEMENT

The PATTERN is a "regular expression" that specifies the form of what is 
to be replaced.  The REPLACEMENT is what any text matching the pattern 
is replaced with.  I won't attempt to explain regular expressions in 
general here (see Google or Wikipedia), but for example the regular 
expression "^([^/]+)/*$" matches any string of one or more non-slash 
characters, followed by 0 or more slashes.  The parentheses around 
"[^/]+" tell the rewrite engine to store the string of non-slashes in a 
variable named "$1" which can then be referenced in the replacement.
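
If you want to poke at that pattern without touching a webserver, Python's re module accepts the same expression.  This is only a sketch for experimenting with the regular expression itself, not a description of how Apache runs it; in Python the captured text is read with group(1) rather than "$1".

```python
import re

# The pattern from rule 3: one or more non-slashes, then optional slashes.
pattern = re.compile(r"^([^/]+)/*$")

for path in ["web", "web/", "web///", "web/5", ""]:
    m = pattern.match(path)
    # group(1) plays the role of $1 in the RewriteRule replacement
    print(repr(path), "->", m.group(1) if m else "no match")
```

The first three strings match and capture "web"; "web/5" and the empty string don't match, so this pattern leaves them for the other rules.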

For example, in the URL "http://jremmers.org/grex/bbs/coop", the string 
"coop" matches the string-of-non-slashes expression "[^/]+".  Since the 
latter is parenthesized, it's stored as "$1" and then dumped into the 
corresponding replacement.  The rewritten URL is thus

  http://grex.org/cgi-bin/backtalk/vanilla/browse?conf=coop

which is a standard Backtalk URL for generating the index to a 
conference.
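
For anyone who wants to try all four rules at once, here is a rough Python simulation.  It assumes the first matching rule wins and that the input is whatever follows "grex/bbs/" in the URL; that is a simplification of what mod_rewrite really does, and the names RULES and rewrite are my own for this sketch.

```python
import re

BACKTALK = "http://grex.org/cgi-bin/backtalk/vanilla"

# (pattern, replacement) pairs mirroring rules 2-5 of the .htaccess file.
# Python templates write the captured groups as \1, \2, \3 instead of $1, $2, $3.
RULES = [
    (r"^/*$", BACKTALK + "/conflist"),
    (r"^([^/]+)/*$", BACKTALK + r"/browse?conf=\1"),
    (r"^([^/]+)/+([0-9]+)/*$", BACKTALK + r"/read?conf=\1&item=\2&rsel=all"),
    (r"^([^/]+)/+([0-9]+)/([^/]+)/*$", BACKTALK + r"/read?conf=\1&item=\2&rsel=\3"),
]


def rewrite(path):
    """Return the rewritten URL for the path after grex/bbs/, or None.

    Assumption: the first rule whose pattern matches is applied and
    processing stops there.
    """
    for pattern, replacement in RULES:
        m = re.match(pattern, path)
        if m:
            return m.expand(replacement)
    return None


for path in ["", "coop", "web/5", "web/5/1-4"]:
    print(repr(path), "->", rewrite(path))
```

A path like "web/abc" matches none of the patterns, so rewrite returns None for it.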


#8 of 11 by remmers on Sat Mar 10 15:16:12 2007:

Late-breaking news:  I created a .htaccess file in the above-mentioned 
"grex" directory that makes the browsing scheme a little more 
hierarchical; the URL "http://jremmers.org/grex" takes you to the Grex 
homepage.

Exercise for the technically inclined:  What .htaccess file would achieve 
this effect?


#9 of 11 by fuzzball on Fri Mar 23 15:51:57 2007:

very nice john....
wouldn't have thought of doing this.....
:)


#10 of 11 by remmers on Sat Mar 24 13:05:57 2007:

What led me to do this is some recent reading about "well-designed URLs". 
This got me to thinking about what a simple, clean, human-friendly URL 
scheme for bbs items and responses might look like.

For some good ideas on the issue of well-designed URLs, see Mike 
Schinkel's post 
  http://www.mikeschinkel.com/blog/welldesignedurlsarebeautiful/
and the various references he gives.


#11 of 11 by madmike on Fri Sep 26 16:04:20 2008:

Very good stuff. Thanks for the link to mikeschinkel.com. I will have 
to mess around with this. 
Isn't .htaccess strictly an Apache thing, or is it also supported by 
the Windows server platforms? 


- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss