The Web Master

USENIX

by Dave Taylor
<taylor@intuitive.com>

Dave Taylor has been hacking on the Net since 1980 and has created thousands of Web pages, most of which format correctly. He's also the author of Creating Cool Web Pages with HTML and Teach Yourself Unix In a Week.

Graft a Smart Error Page System to your Web Site

I usually talk about standalone CGI programs in this column. But I just set up a new Web server (Red Hat Linux 5.0 on a 300MHz Pentium II box, if you're curious) and I decided that, instead of the ugly generic error messages given to people when they encounter an error on my Web site, I'd like to offer something more useful. My error message would not just have my company logo (which is a great first step, of course), but would actually help people find what they seek on the site itself.

The process of adding this error page to the Web server, writing the page, and then writing a simple underlying search engine (using grep) is what I talk about in this column.

Hooking in Your Own Error Page

The first step is to delve into the Apache Web server configuration file. (If you're running a Web server other than Apache, which I think is fabulous, then you'll probably have to do something slightly different in this spot.)

The file, usually named /etc/httpd/conf/httpd.conf, contains quite a few lines of different configuration elements, the vast majority of which you should definitely not touch until you're an Apache configuration expert. Fortunately, what we want to do is straightforward.

Apache Web servers can serve up lots of different domains on the same IP address and Web server. Indeed, my system is host for about 15 different Web sites. The error page I'm adding here is only for the .intuitive.com domain, so the trick is to find the .intuitive.com virtual host configuration section in the file and then add a specific line.

Before my changes, the file looked like:

<VirtualHost www.intuitive.com>
ServerAdmin webmaster@intuitive.com
DocumentRoot /web/intuitive
ServerName www.intuitive.com
ErrorLog /log/intuitive/error_log
TransferLog /log/intuitive/access_log
AgentLog /log/intuitive/agent_log
RefererLog /log/intuitive/referer_log
</VirtualHost>

This defined the actual filesystem location of the root directory of this domain (/web/intuitive/) and the location of all the log files (/log/intuitive/). To hook in the new error page, I simply added:

ErrorDocument 404 /error-page.html

somewhere in this configuration section. Where you place it doesn't matter.
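
One way to double-check the edit is to print the ErrorDocument lines that sit inside a given virtual host section. The following is only a sketch: show_error_docs is a hypothetical helper, and the sample file is a throwaway copy of the section shown above rather than the real httpd.conf.

```shell
#!/bin/sh
# Sketch: print the ErrorDocument lines inside one VirtualHost block.
# show_error_docs is a hypothetical helper, not part of Apache.
show_error_docs() {
    conf=$1    # path to the configuration file
    host=$2    # name used on the <VirtualHost ...> line
    # sed -n prints only the lines from the opening tag to the closing tag;
    # grep then keeps the ErrorDocument directives.
    sed -n "/<VirtualHost $host>/,/<\/VirtualHost>/p" "$conf" | grep 'ErrorDocument'
}

# Demonstration against a throwaway copy of the section (assumed contents).
sample=/tmp/httpd-sample.$$
cat > "$sample" <<'EOF'
<VirtualHost www.intuitive.com>
ServerAdmin webmaster@intuitive.com
DocumentRoot /web/intuitive
ServerName www.intuitive.com
ErrorDocument 404 /error-page.html
</VirtualHost>
EOF
show_error_docs "$sample" www.intuitive.com
rm -f "$sample"
```

Apache reads its configuration when it starts, so the running server will not pick up the new line until it has been restarted or told to reload.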

Creating the Error Page

Though the previous configuration appears to have the error page in the topmost directory of the filesystem, Apache is smart enough to know already that you've specified a root in the system for the specified domain, so in fact this file needs to be located at DocumentRoot/ErrorDocument or, a bit more clearly, /web/intuitive/error-page.html.
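
The mapping from URL path to file is plain concatenation, which a few lines of shell can illustrate. This is a sketch only; url_to_file is a hypothetical helper name, and the readability check at the end is an added convenience rather than anything Apache requires you to run.

```shell
#!/bin/sh
# Sketch: map a URL path onto the filesystem the way DocumentRoot implies.
# url_to_file is a hypothetical helper name.
docroot=/web/intuitive

url_to_file() {
    # $1 is a URL path beginning with a slash, such as /error-page.html
    echo "$docroot$1"
}

errfile=`url_to_file /error-page.html`
echo "$errfile"
# Warn (on stderr) if the page is not where Apache will look for it.
[ -r "$errfile" ] || echo "warning: $errfile is missing or unreadable" >&2
```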

Part of the goal of the error page is to offer visitors the ability to enter a keyword or two and search through all documents on the site to try and find that which they were originally seeking. That's reflected in the middle of the simple HTML document created as the error page:

<HTML>
<HEAD>
<TITLE>Intuitive Systems: Start Making Sense</TITLE>
</HEAD>
<BODY BG TEXT=#FFFFFF LINK=#ffffff VLINK=#ffffff
ALINK=#ff0000>
<CENTER>
<IMG SRC=/Graphics/banner.gif ALT="INTUITIVE SYSTEMS" WIDTH=485
HEIGHT=62>
<P>
<TABLE BORDER=3 CELLSPACING=15 CELLPADDING=10 WIDTH=75%>
<TR><TD ALIGN=center>
<h1>Error!</h1>
<h2>You've requested a page that can't be found!</h2>
</TD></TR>
</TABLE>
<BR><BR><BR><BR>
<font size=+3><B>Search for a specific page</B></font><br>
<HR width=75% >
<FORM ACTION=/apps/search-everything.cgi METHOD=get>
<font size=+2>Enter a few key words:</font>
<INPUT TYPE=text NAME=p >
<INPUT TYPE=submit VALUE="Find Page">
</FORM>
<BR><BR><BR><BR>
<a href=http://www.intuitive.com/><font size=+1>[The Intuitive Systems Site Home]</font></a>
</CENTER>
</BODY>
</HTML>

I won't belabor the HTML here (it's all pretty straightforward) other than to note that the error page includes a form that prompts for a few key words and then feeds the user entry to the CGI script /apps/search-everything.cgi on the server. One trick worth mentioning: explicit tables can be a nice way to box an important message on the page!

Figure 1 shows you how this page looks. You can, of course, request some gobbledygook URL on my site and see it for yourself quite easily!

Figure 1: The new improved Intuitive.com error page

Without the search capability, we'd be done. New error page, much cooler than the default "404 File not found."

However . . .

Building a Search Engine

The good news about building a search system for your Web site is that you've already made the smart move: you're running a UNIX-based operating system. This means that you can let the grep command do all the work. Because we're using METHOD=GET in the form itself, the pattern entered is held in the environment variable QUERY_STRING. It arrives as name=value (here, p=pattern), so a quick invocation of sed strips it to its basics:

pattern="`echo $QUERY_STRING | sed 's/p=//g'`"
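
The sed command above removes every "p=" in the string, and browsers send a space between key words as a plus sign. A slightly more careful extraction might look like the following sketch. The helper name get_pattern is hypothetical, and %XX escapes (used for punctuation and other special characters) are deliberately left undecoded to keep it short.

```shell
#!/bin/sh
# Sketch: pull the value of the "p" field out of a query string.
# get_pattern is a hypothetical helper name.
get_pattern() {
    # ^p= removes the field name only at the start of the string, so a
    # search term that itself contains "p=" is left alone.
    # tr then turns each + (an encoded space) back into a space.
    printf '%s\n' "$1" | sed 's/^p=//' | tr '+' ' '
}

get_pattern "p=red+hat"
```

With the spaces restored, grep treats the key words as a single phrase, so any later use of the pattern has to be quoted.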

Armed with the search pattern, you then use the find command to look through all the HTML files on the site:

find /web/intuitive -name '*tml' -print | xargs grep -il "$pattern"

This gives us the ability to display matching files, and we can easily add clickable links to them all by first stripping out the actual file root (because remember that /web/intuitive/index.html is the URL /index.html) with two lines buried in a for loop that steps through the output of the above find command:

for filename in `cat $outputfile` ; do
newfilename="`echo $filename | sed 's/\/web\/intuitive//g'`"
echo "<a href=$newfilename>$newfilename</a>"
done

But we can do better than this and have output that's considerably more attractive and interesting. The solution is to extract the TITLE of each document by again using grep, stripping the HTML tags therein, then using that as the text of the link:

for filename in `cat $outputfile` ; do
newfilename="`echo $filename | sed 's/\/web\/intuitive//g'`"
title="`grep '<TITLE>' $filename | \
sed 's/<TITLE>//g;s/<\/TITLE>//g'`"
echo "<a href=$newfilename>$title</a>"
done
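
The grep '<TITLE>' in this loop is case-sensitive, so a page that writes its tag as <title> would end up with empty link text. A sketch of a more forgiving extraction follows. The helper name get_title is hypothetical, and it assumes the opening and closing tags share one line.

```shell
#!/bin/sh
# Sketch: print the text between <TITLE> and </TITLE>, whatever the case.
# get_title is a hypothetical helper name.
get_title() {
    # grep -i finds the tag in any case; head keeps only the first match;
    # the bracket expressions make sed case-insensitive in a portable way.
    grep -i '<title>' "$1" | head -n 1 |
        sed 's/.*<[Tt][Ii][Tt][Ll][Ee]>//;s/<\/[Tt][Ii][Tt][Ll][Ee]>.*//'
}

# Demonstration on a throwaway file.
demo=/tmp/title-demo.$$
echo '<html><head><title>Start Making Sense</title></head>' > "$demo"
get_title "$demo"
rm -f "$demo"
```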

There's one problem with this. Going through my Web site for a quick analysis reveals that several documents have remarkably similar titles (some of which aren't even useful, if you can believe it!). As a result, the output really needs to list both the filename and the TITLE of the document, as available.

Now all that's left is to do some error checking (What if they skipped entering a pattern? What if there are no matches to the pattern?) and wrap it in some nice HTML formatting. Again, I opt for a TABLE to have it look nice on the screen, as you can see in Figure 2.

Figure 2: The results of a search for "Linux"

The final CGI script, written as a Bourne shell script, is shown here:

#!/bin/sh -f
# -f turns off filename expansion, so a * or ? in the pattern is left alone
tempout=/tmp/searchout.$$
pattern="`echo $QUERY_STRING | \
    sed 's/p=//g'`"
#
echo "Content-type: text/html"
echo ""
echo "<HTML><TITLE>Intuitive Systems Search Results for $pattern</TITLE>"
echo "<BODY BG TEXT=white LINK=white VLINK=white ALINK=red>"
echo "<CENTER>"
echo "<P>"
echo "<IMG SRC=/Graphics/banner.gif ALT=\"INTUITIVE SYSTEMS\" WIDTH=485 HEIGHT=62"
echo " LOWSRC=/Graphics/banner-lowres.gif>"
echo "<P>"
# ---- no pattern entered?
if [ "X$pattern" = X ] ; then
    echo "No matches: no pattern entered"
    echo "<P>"
    echo "<HR width=75%><P>"
    echo "<a href=http://www.intuitive.com/>[Intuitive Systems]</a>"
    echo "</BODY></HTML>"
    exit 0
fi
#
find /web/intuitive -type f -name "*tml" -print | \
    xargs grep -il "$pattern" | sort > $tempout
#
matches="`wc -l < $tempout`"
# ---- any matches?
echo "Your search for pattern <tt>$pattern</tt> produced"
if [ $matches -eq 0 ] ; then
    echo "<b>zero</b> matches. Sorry.<P>"
else
    echo "<b>$matches</b> matches:<p>"
    echo "<TABLE BORDER=0 CELLSPACING=1 CELLPADDING=4>"
    echo "<TR BG><TH><font color=black>Filename</TH>"
    echo "<TH><font color=black>Title Link</TH></TR>"
    # now let's step through the pages, extracting TITLEs, and
    # displaying them all on-screen
    for filename in `cat $tempout` ; do
        # match the tags in either case, then strip them in either case
        title="`grep -i '<title>' $filename | grep -i '</title>' | \
            sed 's/<[Tt][Ii][Tt][Ll][Ee]>//g;s/<\/[Tt][Ii][Tt][Ll][Ee]>//g'`"
        # quoted, because a title usually contains spaces
        if [ "X$title" = X ] ; then # no title in document
            title=$filename
        fi
        newfilename="`echo $filename | sed 's/\/web\/intuitive//g'`"
        echo "<TR><TD>$newfilename</TD>"
        echo "<TD><a href=$newfilename><font size=+1>$title</font></a></TD></TR>"
    done
    echo "</TABLE>"
fi
echo "<P>"
echo "<HR width=75%><P>"
echo "<a href=http://www.intuitive.com/>[Intuitive Systems]</a>"
echo "</BODY></HTML>"
/bin/rm -f $tempout
exit 0

Conclusions

I encourage you to jump onto my Web site and enter a URL that you are sure won't work correctly. Try <http://www.intuitive.com/missing-page.html>. Once you're there, type in a word or two as a search pattern to see what kind of results you get.

It'd be nice to refine this further so that an HTML tag in specific pages could prevent them from showing up as matches to a sitewide search, and so that the search results were smart enough to show a META DESCRIPTION value, if one is present in the file, as further information.
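
As a rough starting point for both refinements, here is a sketch of two helpers. Everything in it is an assumption: the helper names, the choice of an HTML comment reading NOSEARCH as the opt-out marker, and the requirement that the META tag sit on one line with NAME before CONTENT and double-quoted values.

```shell
#!/bin/sh
# Sketch of the two refinements. The helper names and the NOSEARCH
# marker are hypothetical.

# Succeeds (exit status 0) when the page opts out of sitewide search.
is_excluded() {
    grep -i '<!-- *NOSEARCH *-->' "$1" > /dev/null
}

# Prints the CONTENT value of a one-line META DESCRIPTION tag, if any.
get_description() {
    grep -i '<meta name="description"' "$1" | head -n 1 |
        sed 's/.*[Cc][Oo][Nn][Tt][Ee][Nn][Tt]="\([^"]*\)".*/\1/'
}

# Demonstration on a throwaway file.
demo=/tmp/meta-demo.$$
cat > "$demo" <<'EOF'
<HTML><HEAD>
<META NAME="description" CONTENT="Consulting and writing">
</HEAD></HTML>
EOF
if is_excluded "$demo" ; then
    echo "skipped"
else
    get_description "$demo"
fi
rm -f "$demo"
```

In the search script, the first helper would be called at the top of the for loop to skip a file, and the second could supply an extra table cell next to the title.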

 

First posted: 8th July 1998 efc
Last changed: 8th July 1998 efc