The AltaVista Web Search Engine
By Louis Monier, Digital Equipment Corporation
Summary by Jerry Peek
The AltaVista Web search engine is popular. Louis Monier asked for a show of hands to see how many people had used it in the last week; more than half the audience had. Louis and his group had no idea AltaVista would take off like wildfire. It started in June 1995 as a research project inside Digital. It was so popular in-house that it was opened as a public service in December 1995. Although the launch was done quietly, traffic grew right away.
AltaVista ate up hardware faster than expected. Monier and his group also rewrite code frequently. Louis thinks that AltaVista can claim to be the highest-traffic Web site: the day before his talk, for instance, they had 24 million hits (from 15 or 16 million queries, plus graphics). Each time they add hardware or speed up the code, this raises the "ceiling," and traffic immediately increases. In January 1997, AltaVista was using 9 Digital Turbolasers (Alpha Server 8400s) with 80 300-MHz processors, 66 GB of main memory, 1.3 TB or more of RAID disks, "a lot of" networking gear, and five T3 networks into four large ISPs.
Now AltaVista has two international mirror sites, in Sweden and Australia. Louis would like to see a total of 12 servers around the world, especially in Europe, to avoid the bottlenecks caused by the slow links between continents.
The software is homegrown: C under Digital UNIX. The 64-bit addressing in the Alphas is a must, Louis said: there are files over 2-4 GB everywhere. Memory is very important to performance: the entire index is memory-mapped (it takes ten minutes to fill the cache!), and a special query cache for "Next" pages gets a 35% hit rate. The disks are striped on the filesystems and as RAID.
Their Web searcher, a spider named "scooter," runs 1,500 threads. It can crawl about six million pages a day. They're careful to crawl politely. Scooter respects robot exclusion files, of course, but it also uses an algorithm to avoid overloading servers: it measures how long it takes to get a page from a server, then waits some multiple of that time before hitting the site again. One of the biggest problems at AltaVista, Louis said, is that running the spider takes a lot of work: not only the technical operation, but also the flood of email he gets from sites. (AltaVista staff try to read and answer all of their email personally.) Because of that, the spider hadn't run for more than a month before Louis's January 9 talk.
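The polite-crawling rule described above can be sketched in a few lines of Python. This is only an illustration, not scooter's actual code: the class name `PoliteScheduler`, the multiplier of 10, and the one-second minimum delay are all assumptions, since the talk did not give the real values.

```python
class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent.

    Hypothetical sketch: 'multiple' and 'min_delay' are assumed values,
    not figures from the talk.
    """

    def __init__(self, multiple=10.0, min_delay=1.0):
        self.multiple = multiple      # wait this many times the fetch duration
        self.min_delay = min_delay    # floor, so fast servers still get a pause
        self.next_allowed = {}        # host -> earliest time of next request

    def record_fetch(self, host, started, finished):
        """Record one completed fetch and return the delay imposed on the host."""
        elapsed = finished - started
        delay = max(self.min_delay, self.multiple * elapsed)
        self.next_allowed[host] = finished + delay
        return delay

    def wait_time(self, host, now):
        """Seconds a crawler thread must still wait before hitting this host."""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)
```

A slow server automatically gets longer pauses: a page that took half a second to fetch keeps the crawler away from that host for five seconds under these assumed settings, while a host that has never been fetched can be contacted immediately.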
The index that scooter builds is "no big secret," Louis told us. It's an inverted index with many tricks ("clean" tricks, he insisted) for speed. It has a large set of operators, but many of them are used only by advanced users. There are no stop words, but common words (like "the") are not used in the relevance rankings. The index is fast: the average query takes less than one second.
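For readers unfamiliar with the term, here is a toy inverted index in Python. It is a sketch under stated assumptions and has none of AltaVista's speed tricks: the `COMMON_WORDS` set, the term-count scoring, and the function names are invented for illustration. It does mirror the two behaviors mentioned above: every word is indexed (no stop words), but common words contribute nothing to the ranking.

```python
from collections import defaultdict

# Assumed list of common words; the real list was not given in the talk.
COMMON_WORDS = {"the", "a", "of", "and"}

def build_index(docs):
    """Map each word to {doc_id: occurrence count}. Every word is indexed."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index

def search(index, query):
    """Return ids of documents containing all query words, best match first."""
    terms = query.lower().split()
    postings = [set(index.get(t, {})) for t in terms]
    if not postings:
        return []
    matches = set.intersection(*postings)
    # Common words are matched but ignored when scoring.
    ranking_terms = [t for t in terms if t not in COMMON_WORDS]

    def score(doc_id):
        return sum(index[t][doc_id] for t in ranking_terms)

    # Higher score first; ties broken by document id for a stable order.
    return sorted(matches, key=lambda d: (-score(d), d))

docs = {
    1: "the cat sat on the mat the end",
    2: "the cat and the other cat",
    3: "a dog",
}
index = build_index(docs)
print(search(index, "the cat"))
```

Document 1 contains "the" more often than document 2, but only "cat" counts toward the score, so document 2 ranks first. A query for "the" alone still finds both documents, because nothing is dropped from the index itself.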
One serious annoyance is "spammers" who try to turn AltaVista into a promotion tool by making their Web page come out first in the index rankings. For instance, some pages contain thousands of unrelated keywords (like "sex") found in many users' queries, or a page could have the same related keyword hundreds of times. He spends a lot of time and effort defeating these schemes.
Their Web server is specialized; they wrote it themselves. It runs on a standard workstation. They've been getting 150 queries (260 hits) per second, which is manageable.
Another problem is fair use of the server and its index. Some sites use AltaVista as a backend, repackaging users' queries and then doing their search with AltaVista. (There are some authorized backends, such as Yahoo, that do this legitimately.) Other sites try to get a huge list of URLs by sending thousands or millions of queries to AltaVista. Louis and crew are glad to cooperate to use their index more efficiently, he said (for instance, by shipping a tape of the database), if people would simply ask!
Louis mentioned other problems they face. One big problem is relevance: how well a document matches a query. For instance, what should AltaVista do with a query like "china"? Most users don't refine their queries much. He promises a new approach to relevance soon. But, as with other developments in the works at AltaVista, he refused to give hints.
In the future, they want to recognize and handle languages. Now AltaVista assumes Western European languages and the Latin-1 character set. But users should be able to say, for example, "only show me the results in Latvian" and skip pages in other languages. Proper treatment of languages is important: Russian, Polish, Hebrew, and Greek should be displayed correctly. And there are languages with multiple encodings, like Japanese, Chinese, and Korean. For example, if a user queries AltaVista with a shift-JIS encoding of Japanese, should AltaVista return only pages that were written with that encoding? Or should it translate the encoding and also find Japanese pages in other encodings?
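The second option in that question, translating encodings so a query matches pages regardless of how they were written, can be illustrated with Python's built-in codecs. This is a sketch of the idea only, not anything AltaVista announced: the function names, the example URLs, and the assumption that each page's encoding is already known are all hypothetical.

```python
def to_unicode(raw, encoding):
    """Decode bytes in a declared encoding into a Unicode string."""
    return raw.decode(encoding)

def find_pages(pages, query_bytes, query_encoding):
    """Return URLs whose text contains the query, comparing in Unicode.

    'pages' maps url -> (raw_bytes, encoding); the encoding is assumed known.
    """
    query = to_unicode(query_bytes, query_encoding)
    return [url for url, (raw, enc) in pages.items()
            if query in to_unicode(raw, enc)]

text = "日本語のページ"  # "a page in Japanese"
pages = {
    "http://sjis.example/": (text.encode("shift_jis"), "shift_jis"),
    "http://euc.example/": (text.encode("euc_jp"), "euc_jp"),
    "http://latin.example/": ("English page".encode("latin-1"), "latin-1"),
}

sjis_query = "日本語".encode("shift_jis")

# A raw byte comparison misses the EUC-JP page entirely...
print(sjis_query in pages["http://euc.example/"][0])
# ...while comparing after decoding finds both Japanese pages.
print(find_pages(pages, sjis_query, "shift_jis"))
```

The same three characters produce different byte sequences in Shift-JIS and EUC-JP, so a byte-level index would treat them as unrelated words. Decoding to a common representation before indexing and querying is one way to avoid that, at the cost of having to detect each page's encoding reliably.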
One fundamental catch-22 is that users want huge indexes (the more documents indexed, the better), but they also want to get a small number of results to their query (not, say, 40,000 matching documents).
Unlike Inktomi, AltaVista runs just one spider for the entire World Wide Web. Louis felt that the problem of coordinating snapshots of the Web is too hard. As an example, think of a Web page in the US which is the only link to a Web page somewhere in Asia: how could separate crawlers on the two continents cooperate?
An audience member asked about searches that return duplicate documents, like 40 copies of the Solaris FAQ. There are several problems with detecting these. One is that duplicates aren't exact; the pages can differ by a few words (although fuzzy matching may be possible). Another is to decide which copy of the document is best: for example, AltaVista wouldn't want to return only a link to a server in Russia that's connected by a 1,200-baud modem. Their solution for now is to try to group all matching pages close to each other in the rankings and let the user choose which link is best.
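One common way to do the kind of fuzzy matching mentioned above is to compare overlapping word sequences ("shingles") between two pages. The sketch below is a generic illustration of that technique, not a description of anything AltaVista had implemented; the shingle size of three words and the function names are assumptions.

```python
def shingles(text, k=3):
    """Return the set of all k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(text_a, text_b, k=3):
    """Score from 0.0 (nothing shared) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))

original = "the solaris faq answers common questions about solaris"
mirror = "the solaris faq answers frequent questions about solaris"
unrelated = "weather report for paris on tuesday"

print(similarity(original, original))
print(similarity(original, mirror))
print(similarity(original, unrelated))
```

Pages that differ by a few words keep a nonzero score while unrelated pages score zero, so a threshold on the score could be used to group likely copies. Choosing which copy in a group to show first, the second problem raised above, is a separate decision that this sketch does not address.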
If you've (been living in a cave and) never seen AltaVista, it's on the Web at http://altavista.digital.com/.
Originally published in ;login: Vol. 22, No. 2, April 1997.
Last changed: May 28, 1997 pc