SIFT -- A Tool for Wide-Area Information Dissemination
		  Tak W. Yan   Hector Garcia-Molina
		    Department of Computer Science
			 Stanford University
			  Stanford, CA 94305
		    {tyan, hector}@cs.stanford.edu
			       Abstract
The dissemination model is becoming increasingly important in wide-area 
information system. In this model, the user subscribes to an information 
dissemination service by submitting profiles that describe his interests. 
He then passively receives new, filtered information. The Stanford 
Information Filtering Tool (SIFT) is a tool to help provide such service. It  
supports full-text filtering using well-known information retrieval models. 
The SIFT filtering engine implements novel indexing techniques, capable of
processing large volumes of information against a large number of
profiles. It runs on several major Unix platforms and is freely available 
to the public. In this paper we present SIFT's approach 
to user interest modeling and user-server communication. We demonstrate 
the processing capability of SIFT by describing a running server 
that disseminates USENET News. We present an empirical study of SIFT's 
performance, examining its main memory requirement and ability to scale 
with information volume and user population.


1 Introduction 

Technological advances have made wide-area information sharing commonplace. 
A suite of tools have emerged for network information finding
and discovery; e.g., Wide-Area Information Servers (WAIS) [KM91], 
archie [ED92], World-Wide Web (WWW) [BLCGP92], and gopher [McC92]. 
However, these new tools have one important missing element. They 
provide a means to search for existing information, but lack a mechanism 
for continuously informing the user of new information. The exploding 
volume of digital information makes it difficult for the user, 
equipped with only search capability, to keep up with the fast pace of 
information generation. Instead of making the user go after the information, 
it is desirable to have information selectively flow to the user. 
In an information dissemination  (aka. alert, information filtering, 
selective dissemination of information) service, the user expresses 
his interests in a number of long-term, continuously evaluated queries, 
called profiles. He will then passively receive documents filtered according
to the profiles. Such a service will become increasingly important
and form an indispensable tool for the dynamic environment of wide-area 
information systems.

A very simple kind of information dissemination service is already 
available on the Internet: mailing lists (see e.g., [Kro92] ). Hundreds of 
mailing lists exist, covering a wide variety of topics. 
The user subscribes to 
mailing lists of interest to him and receives messages
on the topic via email. He may also send 
messages to the mailing list to reach 
other subscribers. LISTSERV is a software system for 
maintaining mailing lists. A problem with the 
mailing list mechanism as a tool
for information dissemination is that it provides 
a crude granularity of
interest matching. A user whose information need does not
exactly match certain mailing lists will either receive too many irrelevant
or too few relevant messages. 
The USENET News (or Netnews) system [Kro92], an
electronic bulletin board system on the Internet, 
is similar in nature to mailing lists.
While Netnews is extremely successful with millions of 
users and megabytes of daily traffic, at the same time it often 
creates an information overload. Like mailing
lists, the coarse classification
of topics into newsgroups means that a user subscribing 
to certain newsgroups
may not find all articles interesting, and also he may miss
interesting articles that are not posted in the subscribed
newsgroups.

Recent research efforts on information 
filtering focus on the filtering effectiveness , 
attempting to provide more fine-grained filtering
using relational, rule-based, information retrieval (IR),
and artificial intelligence approaches. 
With few exceptions,
they are typically small scale
(i.e., involving a small number of users or profiles) 
and thus the need to 
provide efficient  filtering is not apparent.
However, in a large-scale wide-area system where 
the number of information providers and seekers are large, 
efficiency in the dissemination process is an important issue
and must be addressed.

The Stanford Information Filtering Tool (SIFT) 
is a tool for information providers to perform
large-scale information dissemination. It can be used to
set up a clearinghouse service that gathers large 
amount of information and 
selectively disseminates the information to a large population of users.
It supports full-text filtering, using well-known and 
well-studied IR models. The SIFT filtering engine implements
novel indexing techniques, capable of scaling to 
large number of documents and profiles. 
It runs on several major Unix platforms and
is freely available to the public by anonymous ftp at URL:
   
ftp://db.stanford.edu/pub/sift/sift-1.0.tar.Z

In this paper we describe SIFT. We present its approach 
to user interest modeling and user-server communication. We demonstrate 
the processing capability of SIFT by describing a running server 
that disseminates tens of thousands of Netnews articles 
daily to some 13,000 subscriptions. We describe the implementation
of SIFT, focusing on the filtering engine. Finally we  
present an empirical study of SIFT's performance, examining its
main memory requirement and ability to scale with information volume and 
user population. 

2 Other Previous Work 

The Boston Community Information System [GBBL85] 
is an experimental information dissemination
system. Like SIFT, it allows a finer
granularity of interest matching than mailing lists or Netnews.   
Users can express their interests with 
IR-style, keyword-based profiles.
The system broadcasts new information via radio channel 
to all users, who then apply their
own filters locally. While the radio communication
channel makes broadcast inexpensive,
local processing of mostly irrelevant information
is very expensive. (For every user, a 
personal computer is dedicated for this purpose --
this is cited as a major source of complaints 
from the users.) 

Information Lens [MGT+87] provides categorization and
filtering of semi-structured messages such as email. 
The user defines rules for filtering, and the processing 
is done at the user site. It provides effective 
filtering, but local processing 
is expensive for large-scale information dissemination.
Similarly, the ``kill file'' mechanism in 
certain news reader programs 
allows the user to locally screen out irrelevant articles.
A kill file only removes specified articles from newsgroups 
that a user subscribes to, but it does not discover relevant
articles in other newsgroups. To provide the same kind of 
filtering power as SIFT would require much local processing.
It is more cost-effective to pool profiles together to 
share the processing overhead.

The Tapestry system [GNOT92]  is a research
prototype that uses the relational model for matching
user interests and documents; filtering computation is done not 
on the properties of individual documents, 
but rather on the entire append-only database.
Efficient query processing techniques are proposed
for handling this kind of queries. 
Tapestry is built on top of a commercial relational 
database system. The Pasadena system [WF91]  investigates
the effectiveness of different IR techniques in filtering.
It collects new documents from several
Internet information sources, and periodically run profiles against them.
Similar to SIFT, it uses a clearinghouse approach, but
efficient filtering is not addressed.

In [YGM94a,YGM94b] we proposed a variety of indexing techniques
for speeding up information filtering under IR models. There 
we evaluated these techniques using analysis and simulation. 
The SIFT filtering engine is a real implementation of one class
of index structures that we found to be efficient.

Figure 1: An overview of SIFT 


3 SIFT 

We begin our presentation of SIFT with an example.
Suppose a user is interested in underwater archeology (Figure 1).
He sends an email subscribe request to a SIFT server specifying
the profile ``underwater archeology,'' and optionally 
parameters that control, for example, how often he wants to be updated or
how long his subscription is for. He may alternatively 
access the SIFT server via WWW, using
a graphical WWW client interface to
fill out a form with the subscription information. (Before he
subscribes, he may test run his profile against an index of
a sample document collection.)
His subscription is stored in the subscription database. As
the SIFT server receives new documents, the filtering engine
will process them against the stored subscriptions, and
notifications will be sent out based on the user-specified parameters. 

In the following we detail the SIFT system from the perspectives of 
user interest modeling and communication protocol. We then 
illustrate SIFT using the Netnews SIFT server. 

User Interest Modeling 

A user subscribes to a SIFT server with one or more subscriptions, one 
for each topic of interest. A subscription includes an IR-style profile, and
as mentioned additional parameters to control the frequency of updates, the 
amount of information to receive, and the length of the subscription. A 
subscription is identified by the email address of the user and a 
subscription identifier.

Filtering Model 

The interest profile can be expressed in one of two IR
models: boolean  and    vector space  [Sal89]. 
We first focus on the vector space model.

In vector space model, queries and documents are identified by terms, 
usually words. If there are n terms for content identification, then a 
document is conceptually represented as an n-dimensional vector
D = <w1, ..., wm>, where weight wi for term i 
signifies its statistical importance, such as its frequency in the 
document. We may also write D as <(ti1, wi2), ..., (tik, wik)>
where wij is not equal to 0; e.g., <(underwater, 60), (archeology, 60)>. 
A query is similarly represented. For a document-query pair, a 
similarity measure (such as the dot product) can be computed to 
determine how ``similar'' the two are. In an IR
environment, the top ranked documents are retrieved for a query.

In an information filtering setting, a key question is 
how many documents to return to the user. We could have 
allowed the user to receive a fixed number of top 
ranked documents per update period, e.g., as done in [dumais . 
However, this is not very desirable: in a 
period when there are many relevant documents, 
he may miss some (low recall -- recall is the proportion
of relevant documents returned);
in a period when there are few interesting 
documents, he may receive irrelevant ones (low precision --
precision is the proportion of documents returned that are relevant). 
We instead allow the user to specify a relevance threshold, which 
is the minimum similarity score that a document must have against the profile 
for it to be delivered. A default value is supplied to the user 
for convenience. 

Instead of using the vector space model, the user may use boolean profiles
to specify words that he wants in 
documents received, and words to be excluded.
For example, the boolean profile ``fly fishing not underwater''
is for documents that contain both words ``fly'' and 
``fishing'' but not the word ``underwater.'' The 
reader may note that the SIFT boolean 
model only allows conjunction and negation of words. However, the user 
may approximate disjunction semantics by submitting multiple subscriptions
(though a document may match more than one subscriptions).

Profile Construction and Modification 

To assist the user with the construction of a profile, a SIFT server provides 
a test run facility. The user may run his initial profile against an existing, 
representative collection of documents to test the effectiveness of his 
profile. He may interactively change the profile and (for weighted profiles) 
adjust the threshold to the desired level of precision and recall. When he is 
satisfied with the performance of the filtering, he may then subscribe with 
the selected settings. 

After the user receives some periodic updates, he may decide to modify 
his profile or change the threshold for vector space profiles. He may do this 
by accessing the SIFT server. Furthermore,
for vector space profiles, relevance feedback [Sal89], 
a well-known technique in IR to improve 
retrieval effectiveness, can be used. The user simply gives SIFT
the documents that he finds interesting; after
examining them, the server adjusts the weights of the words in the user's
profile accordingly.

Communication Protocol 

There are two modes of communication between the user and a SIFT server. In 
the interactive mode, the user subscribes, test-runs a profile, 
views, updates, or cancels his subscriptions. In the passive mode, the user 
periodically receives information updates. Instead of developing a new 
communication protocol, we make use of current technologies: email 
[Cro82] and World-Wide Web HTTP [BLCGP92].

First we discuss the interactive communication mode.
Email communication is the lowest common denominator 
of network connectivity. By having an email interface, a SIFT server is 
accessible from users with less powerful machines, with limited network 
capability, or behind Internet-access firewalls.
We adopt the LISTSERV mailing list syntax wherever possible, 
to ensure that minimal learning is required. We also use default settings to 
reduce the complexity of email requests for novice users.

As the appeal of hypermedia navigation and the development
of sophisticated client interfaces are making 
WWW the preferred tool for wide-area information sharing, 
we also developed a WWW access interface for SIFT. 
Using a WWW client program, the user interacts 
with the SIFT server through a user-friendly graphical interface.
We believe the dual email and WWW access covers
the tradeoff between wide availability 
and ease of use.

In the passive mode of user notification, 
a SIFT server sends out email messages that contain excerpts 
of new, potentially relevant documents (certain number of 
lines from the beginning, as specified by the user).
After the user reads the excerpts, he may access the SIFT 
server to retrieve the entire documents. 
Currently the excerpts
are formatted to be read from regular mail readers. 
We plan to offer the option of formatting them in HTML, so that the user 
may view the notifications from sophisticated WWW viewers,
and then interactively retrieve interesting documents from SIFT
or provide feedback via HTTP.

SIFTing Netnews 

Using SIFT, we have set up a server for selectively
disseminating Netnews articles (text articles only; binary 
ones are first screened out).
Like any other SIFT server, the user accesses the Netnews SIFT server
via email or WWW. The reader is encouraged to try it out:
for email access, please
send an electronic message to netnews@db.stanford.edu, with 
the word ``help'' in the body; for WWW access, 
please connect to https://sift.stanford.edu.
In February 1994, we publicized the Netnews server in two newsgroups; 
within ten days of the announcement, 
we received well over a thousand profiles. 
The number of profiles keeps increasing and now 
(November 1994) exceeds 13,000, 
submitted by users from almost all continents. 
Table 1 shows some interesting statistics obtained from the server on
the day of November 10, 1994. The average number of articles is over
the week of November 4 - 10, 1994.

Number of subscriptions                13,381
Number of users                         5,146
Daily average number of articles       45,127  
Average notification period (in days)     1.4

Table 1: Statistics collected from Netnews SIFT server 

Apparent from these numbers is that the load on the SIFT server is 
very high. It is necessary to match an average of over 45,000 articles
against some 13,000 profiles and deliver most updates within a day. 
The efficient implementation of the SIFT filtering engine
enables the job to be done on regular hardware, a DECstation 5000/240.
Even though we believe SIFT is efficient, there is a limit 
to the load that it can handle. It is necessary to replicate the
server; in fact, we have been in contact with several 
sites (in the U.S. and Europe) that expressed interests
in providing the same service.

The Netnews SIFT server is not meant to be a replacement of the Netnews
system, which is an invaluable tool for carrying on 
distributed discussions. Rather, it provides a complementary filtering
service. Only the beginning lines of 
matched articles are sent out. 
The user may, after determining that an article is 
relevant, access his own local news host to 
access the article. In cases where he does not 
have access to a news host, he may 
request the whole article be sent from the SIFT server. 

Sifting Netnews is just one application of SIFT. We 
have set up another SIFT server, disseminating Computer Science
Technical Reports (email access: elib@db.stanford.edu, 
WWW access will soon be available).
This server has close to 1,000 subscriptions (November 1994). 

4 Implementation 

SIFT is implemented in C and has been compiled on several Unix platforms:
DEC Ultrix 4.2, HPUX 8.07, and SunOS 4.1. In the following we 
first give an overview of the components of the system and  
then focus on the implementation of the filtering engine. 

System Components 

The program mailui implements the email request handler shown in Figure 1. 
It parses email messages, assuming the format in [Cro82]. 
User requests are processed by accessing the 
subscription database and the document index (for test runs). 

The program wwwui is a so-called ``cgi-script'' for use
with the HTTP daemon released by National Center for Supercomputing
Applications. A user fills out a form (using 
a WWW graphical interface) for his request, 
which is then encoded as a Uniform Resource Locator (URL).
The URL is accepted by the HTTP daemon, which invokes the
wwwui program. Similar to mailui, it 
accesses the subscription database and
document index to process the request.

The program filter implements the filtering engine.
It accepts as input a list of documents (in the form of
Unix path names) and matches them against the stored profiles. 
The implementation details are discussed in the next 
subsection. The output from filter is a file of matchings. 

After the filtering is completed, the matchings are sorted 
by users and subscription identifiers. This is necessary 
because SIFT sends out updates on a per subscription basis. 
After sorting, the program alert can be used
to send out the message one by one, using 
Unix sendmail. Included in the messages are
excerpts from the matched documents (a few lines from the 
beginning of the document).

To provide the test run capability, we 
need to index a collection of documents.
This can be done by some publicly available software, and
we choose to use the index engine in the WAIS distribution
(called waisindex). From our experience it is a very 
robust implementation. However, since we use a different 
scoring scheme than WAIS, we made slight modifications to the WAIS 
code to make it compatible with our filtering engine (i.e., 
giving the same results). The major changes are to support 
the use of relevance threshold and the processing of boolean queries. 
The modified waisindex code is included in our 
SIFT distribution.


Filtering Engine Implementation 

In IR, to efficient process queries against a collection
of documents, an inverted index  [Sal89] of documents is built, which
associates a word with its occurrences in the documents.   
In the information filtering scenario, we may use the 
same approach: an index is built of a batch of new documents, 
and profiles are treated as stored queries that are run periodically. 
This is the approach used in an earlier SIFT prototype. 
We refer to this as the document index
approach. This approach may not be very appropriate with a 
high volume of new information. The document index
is large and resides on disk. Running a large number of profiles
against it requires a lot of disk I/Os. 

In the current SIFT implementation, we use a different approach.
The idea is to consider information filtering as the dual problem of 
IR, and treat profiles as documents and documents as 
queries. Instead of building a document index,
we build a profile index . In [YGM94a,YGM94b] we propose 
and analyze a variety of profile indexing techniques
for the vector space and boolean models respectively. 
In our SIFT implementation, we have chosen
good indexing techniques for the two models that are compatible in nature and 
combined them together in the same index structure. 
Below we briefly present this combined index structure through an example.
For a detailed description, as well as descriptions of 
other profile indexing techniques and analysis, 
the reader is referred to [YGM94a,YGM94b].

Referring to Figure 2, suppose P1 is a weighted profile 
<(underwater, 60), (archeology, 60)>
and P2 is a boolean profile ``fly fishing not underwater.'' 
For each word, we associate it with an inverted list of 
postings that reference profiles containing it. 
For example, in Figure 2 the inverted
list for word ``underwater'' is made up of two postings referencing
P1 and P2. In the posting, if the profile is weighted, we
include the weight of the word in the profile; e.g., 
60 for ``underwater'' in P1. If the profile
is a boolean one, we assign a weight of 1 to a non-negated 
word, and -1 to negated words.

Figure 2: Profile index and auxiliary data structures 

In addition, we make use of two arrays called Threshold and Score; 
each profile has an entry in each array. For each profile, its
Threshold entry stores the minimum score that a document must have 
against it to be a match. 
For a boolean profile, this minimum is equal to the number of 
non-negated words. For a weighted profile, the minimum
is derived using the relevance threshold specified by the user. 
The Score entry is used to accumulate the score that a document
has against the profile. 

Now suppose a document D containing the words ``underwater archeology'' (and 
none of the other profile words) is being processed.
Suppose the vector space representation for D is <(underwater, 80), 
(archeology, 80), ...>; i.e., 
the words ``underwater'' and ``archeology'' are both 
assigned weights of 80. 
To process the word ``underwater,'' we look up the directory and
access its inverted list. For a weighted profile like  
P1, the document weight of ``underwater'' is multiplied
with the weight in the posting (60), and 
the product is accumulated in P1's
Score entry (note that 
we are calculating the dot product of the two vectors). 
For a boolean profile like P2, we simply add its
posting weight (-1 for ``underwater'') 
to the Score entry. Similarly the word ``archeology'' is processed. Finally, 
profiles whose resulting Score entry is no less than the Threshold
entry match the document, such as P1 in Figure 2.

5 Performance Study 

A practical challenge for SIFT is to process a new batch of documents
within the smallest time unit of notification. For example, 
in the Netnews SIFT server, it is critical that 
the filter and notify process is finished within 24 hours. 
Thus, it is important to understand and characterize the 
performance of the SIFT system, and measure how it scales with the 
number of profiles and volume of information. 
 
First we study the performance of the processing time with
respect to the number of documents and number of profiles. 
We then compare the performance of 
the filtering engine with and the alternative document index
approach discussed in Section 4.2.
Finally we investigate the main memory requirement of the filtering
engine. All experiments are run on a dedicated 
DECstation 5000/240 
running Ultrix 4.2 and all times reported are real (wall clock) time.

The study was performed in early July, 1994.
The test data consisted of 38,000 articles received
on the day of July 6, 1994 by our department's news host and 7,000 randomly
selected subscriptions from the Netnews SIFT server at that time.
The average article size was 269 words 
(a word is defined as an alphanumeric string longer than 2 characters). 
The subscriptions were stored on a local disk, 
while the news articles were stored on an NFS-mounted disk 
(from the news host). We repeated the experiments with test data
from two other days and the results were within   
of those reported below.

In the result graphs that follow, we divide SIFT processing time
into these four steps:
  
Build Time   
The profile index structure is built. 
Other auxiliary data structures are allocated and initialized.

Filtering Time   
Documents are run against the index one by one. 
Document-profile matchings are written into a file.

Sorting Time   
The document-profile matching file is sorted
by user email address and subscription identifier, using Unix sort
command. 

Notify Time   
The sorted matching file is read. Excerpts of matchings 
for each subscription are prepared into an email message.
Unix sendmail is invoked to send out each message.


Coping with Information Volume 

First we investigate the relationship 
between the processing time and the number of documents.
Figure 3 shows the results of processing the 38,000 articles against
the 7,000 subscriptions. The time it 
takes to build the profile index is very small and is as expected
independent of the number of documents.  
The filtering time is linear with respect to the number of 
documents, with slope 0.21 sec./document (value obtained by
curve-fitting using Mathematica). 
The time it takes to 
sort the matching file is negligible.
The proportionality constant for the total running time is
0.63 sec./document.
(Recall that the articles are stored on a remote news host,
so some fraction of this time is spent in reading the articles via
local network. 
This is confirmed by looking at the CPU utilization, which is
found to be 42.8%.)

Figure 3: SIFT performance vs. number of documents 

Figure 4: SIFT performance vs. number of profiles 

To measure the notify time, since it would be impossible
to send the same updates repeatedly to our
real users, the notify times shown
in Figure 3 (except the rightmost data point)  
are interpolated results. We assume that
the time it takes to send out notifications  
is proportional to the number of 
matchings in the matching file. We have verified this assumption
with a separate experiment and derived the proportionality constant. 
Then in the experiment for Figure 3, we
count the number of matchings in the matching file
at the data points shown. 
We compute the notify time as the product of
the number of matchings and the proportionality constant.
We find that a large fraction (  for 38,000 documents)
of the total time is spent in sending out subscriptions. 

Coping with Subscription Growth 

In the second experiment, we look at the relationship
between the processing time and the number of profiles. 
We repeat the process of filtering 38,000 documents
against 1, 1,000, 4,000, 7,000, and 14,000 subscriptions. 
The last run is done simply by duplicating the 7,000 
subscriptions in the database. 
Figure 4 shows the results. 
Again, the profile index build time is negligible. 
For the filtering time, we find that even for one profile, 
it takes 5,105 sec. to filter. This is the time spent reading
in the articles via NFS. 
Beyond the initial start-up cost, 
filter time is apparently linear to the number of profiles, with slope
0.62 sec./profile. 
The notify time is similarly obtained as discussed above. 
We find the total running time to be linear 
with respect to the number of profiles, 
and the slope is 2.63 sec./profile. 

Performance Improvement 

To quantify the performance improvement obtained by 
using the filtering engine, 
we compare it with the earlier SIFT implementation which runs 
profiles against a document index to perform the filtering. 
This process consists of these three steps:

  
Build Time   
The document index structure is built.

Filter Time   
Profiles are run against the document index 
one by one. Document-profile matchings are written into a file.

Notify Time    
Update messages are sent out.

Comparing this with the filtering engine process, 
we note that although the first steps have similar names, they represent
totally different work. The document index takes much longer to built
since there are many more documents than 
profiles and a document is much larger than a profile.
Also the sorting of matchings
is not needed in this approach because the filtering is done on a 
per subscription basis. In fact, a notification may be sent out 
as soon as a subscription is processed. However, for comparison purposes
we opt to separate the work into the shown sequence. 

First we compare the total running time to filter and alert 
using the two approaches. Figure 5 shows the results of 
processing different number of profiles against 38,000 documents.
The slope for the document index 
total running time is 6.68 sec./profile, compared
with 2.63 sec./profile for the filtering engine. 
If we focus on the 7,000 profile runs,  
the total running time with the document index
is 57,745 sec., compared
with 21,652 sec. with the filtering engine. Thus, 
the SIFT filtering engine is more than twice as fast as the document
index approach.

Figure 6 shows the time breakdown for the document index approach.
We see that the document index build stage
makes up a significant portion (10,803 sec.) of the whole process.
The majority of the time is spent in the second stage; for the
7,000 profiles run, the filter time is 33,345 sec. or 58  of
the whole processing time. On the other hand, the notify time now takes up
a smaller fraction of the running time. 

Figure 5: Comparing performances of document index and filtering engine

Figure 6: Breakdown of running time for filtering using document index

It may be argued that the document index build time should also 
be included in computing the total running time for the 
filtering engine case, since it would be built in that case 
to provide the test run facility. However, if the document index
is used strictly for 
test run purpose, we might simply build an index with any representative
collection and not once for every batch. Even if we do include
the document index build time, the total time sums up to
32,455 sec., still just 56  of that for the document index approach.   

Main Memory Requirement 

By inspection of the filtering engine code, 
the formula for main memory requirement (in
bytes) for the SIFT filtering
engine is: 

  
Main memory requirement = # profiles x 132 + distinct # profile words x 32 +  
                          # profile word occurrences x 12

The first term on the right-hand-side is for the auxiliary
data structures (132 bytes per profile; besides keeping 
the Threshold and Score entries, we also need to 
keep the email address and other information in main memory
to efficiently output the matchings). 
The second term is for the directory
nodes (one per distinct profile word). The last term is
for the posting nodes (one per profile word occurrence).
Using the subscriptions to the Netnews SIFT server, 
we count the various numbers on the right-hand-side of the above
formula and compute the main memory requirement for various
number of subscriptions. The results are shown in Table 2.
 
# profiles # distinct # totol    Main memory 
           words      words      required (bytes)
---------- ---------- ---------- ---------- 
     1,000      1,595      2,334    211,048   
     2,000      2,806      4,756    410,864  
     3,000      3,746      6,893    598,588  
     4,000      4,657      9,244    787,952  
     5,000      5,535     11,735    977,940  
     6,000      6,278     14,102  1,162,120  
     7,000      7,001     16,489  1,345,900

Table 2: Main memory requirement for filtering engine  
 
Several interesting statistics can be obtained from these data.
The total number of word occurrences is linear 
with respect to the number of profiles, and the 
slope is found by curve-fitting to be 2.34 words/profile. 
On the other hand, the number of distinct words increases
more rapidly in the beginning, but at large number of profiles, seems
to approach the slope 1 word/profile. 
Making the rough assumption that the number of distinct words is 
also linear with the number of profiles, we use 
curve-fitting to derive the 
ratio (main memory requirement / number of profiles) and find it to be   
195 bytes/profile. This translates to 
a capacity of 5,128 profiles per megabyte. Thus, with
today's large memories, SIFT can process hundreds of thousands
of profiles. 


6 Conclusion 

The user population of our SIFT Netnews server has grown at a rate
of approximately 1,400 subscriptions a month; this indicates 
that there is a recognized need for wide-area information 
dissemination. Further, since the server was made 
public, we have received a lot of 
feedback and comments, which are predominately positive. 
The server was reported in numerous magazines and newsletters
(e.g., WIRED, Internet World, MicroTimes),
resulting in the continued growth of its population. 
A number of sites have requested the SIFT code,
which is now publicly available by anonymous ftp. 
Anticipating further subscription growth for the service, 
we are in our initial steps to replicate the server at other sites.

One of the main suggestions for improvement is to provide
more expressive filtering models to screen out irrelevant
information. Netnews 
is an extremely diverse source of information, with topics that are 
extremely disparate in nature. Thus the noise level 
of the filtered results is in some cases high. In the 
server for disseminating Computer Science Technical Reports, 
we found the noise level to be much lower. 
Still we have taken steps to 
provide more filtering capability, and the boolean model has
been added as a result. Extension to phrases and 
proximity matching are also planned. 

Evident from the SIFT Netnews server, the efficiency of the filtering 
process is critical in providing such a service. We believe
the indexing method implemented in SIFT scales 
to large number of users and high information 
generation rate. The time taken to process
a batch of documents against a collection of 
subscriptions is linear with respect to 
to the number of documents and the number of subscriptions. 
The proportionality constants are roughly half of those
based on a document index. 

The sending out of email notifications 
consumes a major portion of the SIFT running time. To improve this, 
a more sophisticated document delivery scheme 
is needed. One idea is to group users geographically.  
If a document matches one of the users in a
group, a single copy of the document is sent to the group. A local distribution
site for the group receives the copy and retransmits it to users locally. 
This is beneficial if users have overlapping interests, and is  
also useful in reducing wide-area network traffic.
However, to install such a scheme, cooperation from the user side 
is needed.

Acknowledgements 

Thanks to Anthony Tomasic for suggesting filtering Netnews as an 
application of SIFT and for providing comments to improve the readability
of this paper. 

Bibliography

Tak W. Yan is a Ph.D. student in the Department of Computer Science at
Stanford University, Stanford, California. 
His research interests are in the areas of
database systems, digital libraries, and information 
retrieval and filtering systems. He received a B.S. 
in Electrical Engineering and Computer Science
from University of California, Berkeley in 1991. 
In 1993 he received a M.S. in Computer Science 
from Stanford University.

Hector Garcia-Molina is Professor in the Departments
of Computer Science and Electrical Engineering at Stanford
University, Stanford, California. His research interests
include distributed computing systems, database systems, 
and digital libraries. He received a B.S. in Electrical
Engineering from the Instituto Tecnologico de Monterrey,
Mexico, in 1974. From Stanford University he received 
in 1975 a M.S. in Electrical Engineering and a Ph.D. 
in Computer Science in 1979.