SAGE - Sage feature


webmaster's toolbox

Apache Modules


by Neil Kohl
<nkohl@mail.acponline.org>

Neil Kohl has been the Webmaster for the American College of Physicians — American Society of Internal Medicine for three years. He fell into a career in computers while studying neuropsychology in graduate school.



I'm the Webmaster for a nonprofit medical membership organization. We have a very large Web site — about 20,000 files — and a very small staff. Like many Webmasters, I rely on a motley collection of tools to do a lot of the daily grunt work of content management, link checking, and statistics. One of the most powerful tools in my toolbox is the server software itself. We use Apache.

Apache is the most popular server on the Internet[1] for a number of reasons: It's fast, it's stable, and it's flexible. Apache's modular architecture makes it easy to modify the server to suit your needs.

For example, the standard distribution comes with three different authentication modules that allow you a choice of back end for storing user information: mod_auth for flat files, mod_auth_db for Berkeley DB files, and mod_auth_dbm for DBM files. If that doesn't fit the bill, you can find modules for authenticating using mSQL, Postgres, Kerberos, Windows NT domain servers, or any external program (mod_external). Add cookie authentication with mod_cookie_auth. Or you can roll your own. (I've always wanted to write an authentication module that will admit only a quarter of users at random. I'd call it mod_studio_54.)

The Apache FTP site has a /contrib directory with a number of modules, written by others, available for downloading. In this article I'd like to take a closer look at three modules that come with the standard distribution that can make the Webmaster's job a whole lot easier: mod_speling (yes, that's really how it's spelled), mod_rewrite, and mod_usertrack.

Making Modules Available

Before you can use a module in Apache, it has to be compiled into the httpd binary. Building Apache is beyond the scope of this article; I assume you've already got it built and running.[2]

To see which modules are available, open the file apache_1.3.x/src/Configuration (where apache_1.3.x is the top level of the source tree for your version of Apache). The modules I describe here are not compiled in by default. If you'd like to use these modules, make sure the following lines in Configuration are uncommented:

   AddModule modules/standard/mod_speling.o
   AddModule modules/standard/mod_rewrite.o
   AddModule modules/standard/mod_usertrack.o

(If you have Apache prior to 1.3, the Module command is used to add modules, and the syntax is a little different. Check the documentation.)

Uncomment the lines for the modules you'd like to use, save Configuration, then run ./Configure and make to rebuild httpd with the modules linked in. Copy the new httpd binary over your old binary. Restart the Web server and make sure everything is working.
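One way to confirm that the rebuilt binary really contains the modules is to look at the output of httpd -l, which lists the compiled-in modules. The sketch below is only an illustration: missing_modules is a hypothetical helper, not part of Apache, and the sample listing is made up.

```shell
#!/bin/sh
# missing_modules: hypothetical helper (not part of Apache).
# Reads the output of "httpd -l" on stdin and prints the name of each
# of the three modules discussed here that is NOT compiled in.
missing_modules() {
    listing=$(cat)
    for m in mod_speling mod_rewrite mod_usertrack; do
        printf '%s\n' "$listing" | grep -q "$m\.c" || echo "$m"
    done
}

# Made-up sample listing; on a real server you would pipe in "httpd -l".
missing_modules <<'EOF'
Compiled-in modules:
  http_core.c
  mod_rewrite.c
  mod_usertrack.c
EOF
```

With the sample listing this prints mod_speling. On a live system the call would look something like /path/to/httpd -l | missing_modules, where the path to httpd depends on your installation.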

Directives and Context

Each module makes one or more configuration directives available for you to use. The documentation lists the directives that the modules provide and the context in which these directives can be used.

In the examples here, I place the directives in httpd.conf, so the entire server is affected. Most of these directives can also be placed within a VirtualHost or Directory context to limit effects to a subset of the server.

Now let's explore.

mod_speling

The problem: UNIX filesystems are case-sensitive. Most users aren't. Novices are especially fond of the caps-lock key. A caps-lock fan trying to get to the catalog directory at some site might try:

   HTTP://WWW.SOMESITE.COM/CATALOG/

While the hostname part of a URL is case-insensitive, the path and filename are not, and the address above will cause a "File not found" error, assuming that the directory is really /catalog.

mod_speling makes URLs case-insensitive. The all-caps URL above would return the catalog index page instead of a 404 error. An added bonus is that it will correct up to one spelling error in a URL. DOS users who are in the habit of using the ".htm" extension will be able to get to your ".html" pages.

mod_speling provides one directive, CheckSpelling. Add the following lines to your httpd.conf:

   # turn on mod_speling...
   CheckSpelling on

Restart Apache and turn on caps-lock!

A feature (or drawback, depending upon your point of view) of mod_speling is that under certain conditions it will present the user with multiple choices if it can't find a suitable file. For example, if a user types in http://www.yoursite.com/index, without the .html extension, mod_speling will present the following message:

Multiple Choices

The document name you requested (/index) could not be found on this server. However, we found documents with names similar to the one you requested. Available documents:
/index.html (common basename)

Apache/1.3.6 Server at neilk.acponline.org Port 80

On my workstation I have a directory that contains files named index1.htm to index25.htm. If I accidentally type in http://neilk.acponline.org/brownbag/index53.htm I get the following:

Multiple Choices

The document name you requested (/brownbag/index53.htm) could not be found on this server. However, we found documents with names similar to the one you requested.

Available documents:

/brownbag/index3.htm (extra character)
/brownbag/index5.htm (extra character)
/brownbag/index13.htm (mistyped character)

Apache/1.3.6 Server at neilk.acponline.org Port 80

Spelling correction does consume some resources, which may be a consideration if you have a busy site.

mod_rewrite

A good rule of thumb for Webmasters is that once you put a page up, it should be available at the same URL until the end of time. Even if you track down every last link to a page on your site, do an AltaVista search to find the publicly accessible Web pages that link to it[3] and send email to every Webmaster telling them to update their pages, you still don't know how many intranets and bookmark files have the old address. It looks bad when people see a "File not found" page on your site, even if they followed a link from a page that hasn't been updated since Netscape went public.

While having a policy of keeping URLs active makes for a good user experience, it makes it hard for a Webmaster to do housecleaning. Some of the common problems that crop up:

  • You decide to rename a page, and you'd like to bump anyone going to the old address to the new address.

  • Registration material for an annual meeting that happened three years ago really makes a site look stale. You'd like to remove these pages and redirect users who attempt to access them to the entry page for this year's annual meeting.

  • You have to change a directory name, but the names of the files in the directory remain the same. You'd like to point anyone looking for a file in the old directory to the corresponding file in the new directory.[4]

There are three types of redirection you can do: one-to-one, where all requests for one file are mapped to a different file; many-to-one, where requests for any one of a group of files are mapped to a single file; and many-to-many, where requests for a file in one group of files are mapped to the corresponding file in a second group.

The standard redirection directive, Redirect, can only map a URL path to a full URL. mod_rewrite allows regular-expression-based URL mapping and redirection. It's incredibly flexible and powerful. As the documentation says, "Welcome to mod_rewrite, the Swiss Army Knife of URL manipulation!"

Before you can start creating remappings, you have to turn on the rewrite engine. In httpd.conf, add the following line:

   # turn on mod_rewrite
   RewriteEngine On

Let's examine each of the different situations I've outlined.

In the first case, let's say that you have to map http://www.somesite.com/foo.htm to http://www.somesite.com/bar.htm. This is an easy one:

   # one-to-one redirect: foo.htm to bar.htm
   RewriteRule ^/foo.htm /bar.htm

Rewrite rules are based on regular expressions. The general form of a rewrite rule is RewriteRule regexp remapping. Notice the ^ before /foo.htm. It anchors the pattern, so the rule is triggered only if /foo.htm is found at the beginning of the path. Here, any time the page foo.htm in the document root directory is requested, the page /bar.htm will be served instead. Without the ^, a request for foo.htm in any directory would be mapped to /bar.htm. (Because the pattern has no trailing $ and the dot is not escaped, it is a little looser than it looks: a request for /foo.html would match as well.)

The redirection is invisible to the end user — the URL in the browser's location bar will still be http://www.somesite.com/foo.htm. All of the sleight-of-hand is happening on the server side.

In the second case, assume that you're deleting the 1998meeting directory and all of the pages in it, and you'd like to redirect requests for any page in the 1998meeting directory to the main page for this year's meeting, 1999meeting/index.html.

   # many-to-one mapping
   RewriteRule ^/1998meeting.* /1999meeting/index.html [R]

Notice the .* after 1998meeting — it's a regular-expression wildcard. Any requests for the directory alone or anything in the directory will match the regex, trigger the RewriteRule, and redirect users to the 1999 meeting home page.

In the first example, the user sees the original URL in the browser's location bar even though the server sent a different file. The [R] in this case forces a redirect and the "real" URL appears in the end user's browser. If we had left off the [R], the user would see 1998meeting as the location of the page even though content for 1999 is being displayed. It might look like a mistake.

For the final case, suppose we have a set of pages in /oldname and the directory name gets changed to /newname. The files in the directory remain unchanged. The task here is to map a request for a specific file in /oldname to the corresponding file in /newname.

   # many-to-many mapping
   RewriteRule ^/oldname(.*) /newname$1 [R]

Yep, mod_rewrite supports backreferences! In this case, any request for pages in /oldname will trigger the RewriteRule. The part of the request after oldname will be stored in $1 and will be appended to the new URL after /newname. Here again the [R] forces a redirect so users will see the new location in their browser.
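Before putting patterns like these on a live server, it can help to preview what they do to a handful of sample paths. The sketch below uses sed as a rough stand-in; preview_rewrite is a hypothetical helper, not part of Apache. It assumes sed's extended regular expressions behave like Apache's for these simple patterns, and it models neither the [R] flag nor anything else beyond the path mapping.

```shell
#!/bin/sh
# preview_rewrite: hypothetical helper (not part of Apache). Reads request
# paths on stdin and prints the path that the three rules in this article
# would map each one to. sed replaces only the matched text, whereas
# RewriteRule replaces the whole path, so ".*" is appended to the first
# pattern to make the two behave alike.
preview_rewrite() {
    sed -E \
        -e 's#^/foo.htm.*#/bar.htm#' \
        -e 's#^/1998meeting.*#/1999meeting/index.html#' \
        -e 's#^/oldname(.*)#/newname\1#'
}

# Sample request paths (made up for illustration).
printf '%s\n' \
    /foo.htm \
    /1998meeting/register.html \
    /oldname/chapter1/intro.html \
    /oldnames/x.html \
    /catalog/index.html |
preview_rewrite
```

The first three paths come out as /bar.htm, /1999meeting/index.html, and /newname/chapter1/intro.html, and /catalog/index.html passes through untouched. The fourth comes out as /newnames/x.html, a reminder that ^/oldname matches any path that merely begins with those characters; a pattern such as ^/oldname/(.*) would be stricter.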

I've only scratched the surface of what mod_rewrite can do. It's an incredibly powerful module that can do conditional URL remapping based on request headers. It can remap requests based on lookups in a file, dbm database, or calls to an external program. You can proxy requests to other Web servers.[5]

mod_usertrack

"How many visitors do you get?" It's one of the most persistent and unanswerable questions a Webmaster faces. You can explain that HTTP is an anonymous stateless protocol and that caching and proxies cloud the picture to the point of hopelessness. People will still ask.

You can count the number of distinct hosts. But how many individual visitors are hiding behind AOL's proxy servers? You could require registration and password-protect your entire site. You'd have a very accurate picture of your number of visitors — declining, most likely, because no one wants another user name and password to remember.

mod_usertrack provides a fairly accurate picture of visitors without requiring a burdensome login. When mod_usertrack is activated, Apache checks each incoming request for a cookie header containing a unique identifier. Cookies received by the server are logged. If the client doesn't pass a cookie in the request, the server generates a new unique identifier and sends a cookie containing the ID with its response.

There is no personal information in the cookie, just the client IP address and a random number. Even though cookies as used here are pretty innocuous, people still get spooked about them. You should think through the privacy implications of being able to uniquely identify users.

To enable mod_usertrack, define a log for your cookie tracking data in httpd.conf:

   # define the clickstream log format and logfile:
   LogFormat "%t %u %{cookie}n %r" clickstream
   CustomLog var/log/clickstream clickstream

The LogFormat directive shown here will record time, authenticated user name if it's available, the identifier that Apache generates, and the user's request.

Next, decide how long you want cookies to last with the CookieExpires directive. If you don't set an expiration date for cookies, they last for the duration of the user's session, and a count of unique IDs will give you the number of different sessions. If you're interested in how many repeat visitors you get, you'll want to set the expiration to some point in the future:

   # expiration date for cookie
   CookieExpires "3 months"

You can specify the expiration in seconds (one number, no label) or as "number time-period [number time-period ..]" where number is a number and time-period is one of years, months, weeks, days, hours, minutes, or seconds. The expiration time must be enclosed in quotes if it contains spaces.

Then turn user tracking on with this directive:

   # turn on user tracking...
   CookieTracking on

Restart your server. To prove that it's really sending out cookies, set your browser to ask before accepting cookies and go to your site. You should get a cookie with a value like:

   Apache=172.168.1.254.12512917405203642

You can also look in your browser's cookie file for your site's domain name and the Apache cookie. The clickstream log file will accumulate entries that look like this:

   [28/Apr/1999:00:00:28 -0400] - 172.24.48.11.3103925272028252 GET /index.html HTTP/1.1
   [28/Apr/1999:00:00:28 -0400] - 172.188.154.3.2997925271804334 GET /outline.htm
   [28/Apr/1999:00:00:28 -0400] - 172.188.154.70.3027925271992432 GET /search.htm HTTP/1.0

Want to know how many different visitors (technically, Web browsers that accepted cookies) you had yesterday? Assuming you used the format above, the answer is:

         % cut -d " " -f 4 clickstream | sort | uniq | wc -l
         -> 8698
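If cookies are set to outlive the session, the same log can also give a rough count of repeat visitors. The sketch below is one possible approach, not the only one: repeat_visitors is a hypothetical helper that counts Apache IDs seen on more than one calendar day, assuming the clickstream format defined above (date in the first field, Apache ID in the fourth). The sample data is made up.

```shell
#!/bin/sh
# repeat_visitors: hypothetical helper. Counts Apache IDs (field 4) that
# appear on more than one calendar day across the given clickstream logs.
repeat_visitors() {
    awk '{ day = substr($1, 2, 11) }          # "[28/Apr/1999:..." -> "28/Apr/1999"
         !seen[$4, day]++ { days[$4]++ }      # first time this ID is seen on this day
         END { for (id in days) if (days[id] > 1) c++; print c + 0 }' "$@"
}

# Made-up sample data in the clickstream format.
log=$(mktemp)
cat > "$log" <<'EOF'
[27/Apr/1999:10:00:00 -0400] - 172.24.48.11.3103925272028252 GET /index.html HTTP/1.1
[28/Apr/1999:09:30:00 -0400] - 172.24.48.11.3103925272028252 GET /index.html HTTP/1.1
[28/Apr/1999:09:31:00 -0400] - 172.24.48.11.3103925272028252 GET /search.htm HTTP/1.1
[28/Apr/1999:11:00:00 -0400] - 172.188.154.3.2997925271804334 GET /outline.htm HTTP/1.0
EOF
repeat_visitors "$log"
rm -f "$log"
```

The sample log yields 1: the first ID shows up on both 27 and 28 April, while the second appears on a single day only. As with the visitor count, this really counts browsers that accepted and kept the cookie.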

Our organization has over 100,000 members. Most of the Web site is publicly accessible, but some pages are available to members only. Any member can register for a user name and password; over 12,000 have. One of the questions I am frequently asked is: What percentage of traffic to public pages is by members? Before we initiated user tracking via mod_usertrack, there was no way to answer this question.

Members-only pages are protected with HTTP basic authentication. When a member logs in and views a members-only page, the REMOTE_USER name gets recorded in the clickstream log along with the Apache ID. Now we can track member hits in the public area of the Web site by looking for Apache IDs of members who have logged in. It's easy to write a quickie Perl script to estimate the percentage of member hits in the public area:

    #!/usr/bin/perl -na
    # pubhits.pl
    # usage: pubhits.pl clickstream-log [ clickstream-log .. ]
    # -a splits each line on whitespace: $F[2] is the user name, $F[3] the Apache ID

    if ($F[2] ne '-') {
        $memb_ids{$F[3]}++;       # record member's Apache ID
    } else {
        $all_pub_ids{$F[3]}++;    # count unauthenticated hits per Apache ID
    }

    END {
        foreach (keys %all_pub_ids) {
            $t_pub_hits += $all_pub_ids{$_};
            $memb_pub_hits += $all_pub_ids{$_} if defined $memb_ids{$_};
        }
        # guard against an empty log, and print a literal % with %%
        $pct = $t_pub_hits ? $memb_pub_hits / $t_pub_hits * 100 : 0;
        printf "Publ hits: %d; Memb publ hits: %d; Pct publ hits by memb: %.2f%%\n",
            $t_pub_hits, $memb_pub_hits, $pct;
    }

This doesn't give a perfect answer — members who don't log in to the private area are not counted, nor are members who logged in on a different day — but it's a decent estimate.

Combining the Apache ID with the authenticated user name also allows us to estimate the number of browsers per user.
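A rough sketch of that estimate, assuming the same clickstream format (user name in the third field, Apache ID in the fourth): browsers_per_user is a hypothetical helper, and the user names and log lines are made up for illustration.

```shell
#!/bin/sh
# browsers_per_user: hypothetical helper. For each authenticated user name
# (field 3), counts the distinct Apache IDs (field 4) logged with it.
browsers_per_user() {
    awk '$3 != "-" && !seen[$3, $4]++ { n[$3]++ }
         END { for (u in n) print u, n[u] }' "$@" | sort
}

# Made-up sample data in the clickstream format.
log=$(mktemp)
cat > "$log" <<'EOF'
[28/Apr/1999:00:00:28 -0400] - 172.24.48.11.3103925272028252 GET /index.html HTTP/1.1
[28/Apr/1999:00:01:02 -0400] jdoe 172.24.48.11.3103925272028252 GET /members/a.htm HTTP/1.1
[28/Apr/1999:09:15:40 -0400] jdoe 172.30.2.9.4411925272099999 GET /members/b.htm HTTP/1.0
[28/Apr/1999:09:16:01 -0400] asmith 172.30.7.7.5511925272011111 GET /members/a.htm HTTP/1.0
EOF
browsers_per_user "$log"
rm -f "$log"
```

For the sample data this prints "asmith 1" and "jdoe 2". The numbers are only an estimate: a browser that discards its cookie is counted again as a new browser the next time the member logs in.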

More Information

Taking the time to dig through the Apache documentation and source tree can have significant rewards — your Web server may turn out to be the most useful tool in your toolbox.

Learn more about Apache, modules, and the module API:

Books

Laurie, B., and Laurie, P. Apache: The Definitive Guide, 2e. Sebastopol, CA: O'Reilly & Associates, 1999. Compiling and configuring Apache; a tour of the Apache module API.

Stein, L., and MacEachern, D. Writing Apache Modules in Perl and C. Sebastopol, CA: O'Reilly & Associates, 1999. Brand new. The Apache module API in detail, with examples of extending Apache using C and mod_perl.

Web

Apache documentation: <http://www.apache.org/docs/index.html>

Apache module registry: <http://modules.apache.org>

Apache user-contributed modules: <ftp://ftp.apache.org/dist/contrib/modules/>

ApacheWeek: <http://www.apacheweek.com/>. Apache news and feature articles that discuss server configuration topics in detail.

Notes

[1] According to Netcraft, <http://www.netcraft.co.uk/>.

[2] Documentation for building Apache is at <http://www.apache.org/docs/install.html>.

[3] This AltaVista query will let you know how many pages have links to your site: link:www.yoursite.com -url:www.yoursite.com. The first part (link:) searches for links to your site; the second part excludes links that come from your own site from the results.

[4] True story: We have the text of a book on our Web site, The Home Care Guide to Cancer. It was living in a directory called "homecare." One day we received a letter from an intellectual-property law firm informing us that "Homecare" was a registered trademark of another publisher and we had to change the directory name to avoid legal action. They never mentioned the content of the book. Just the directory name.

[5] We have a document management system (DMS) available on our internal network. The DMS has a Web gateway that allows users to search public-policy documents and view them as HTML or PDF. I was asked to make the collection available on our public Web site. This was no easy task, since our Web server is isolated from our internal network in its own DMZ. Both the Web server and the internal network are protected by a firewall.

Moving the DMS was not an option for security and logistical reasons. We didn't have time to implement a CGI solution. After talking this through with our security consultant, we reconfigured the firewall to allow the public Web server HTTP access to the DMS Web gateway. Then I configured Apache to proxy requests for policy documents to the DMS using three RewriteRules and one ProxyPassReverse directive. One rule in the firewall and four lines of Apache configuration!



Last changed: 18 Nov. 1999 mc