Check out the new USENIX Web site. next up previous
Next: GreedyDual-Size Algorithm Up: Web Proxy Traces Previous: Web Proxy Traces

Locality in Web Accesses

In the search for an effective replacement algorithm, we analyzed the traces to understand the access patterns of Web requests seen by the proxies. The striking property we found is that all traces exhibit excellent long-term locality.

   figure151
Figure 1: Percentage of references to documents whose last accesses are t minutes ago, for t from 5 to 10000.

   figure159
Figure 2: Percentage of references as a function of time since last access by the same user.

Figure 1 shows the percentage of references to a document whose last reference is t minutes ago, for t from 5 to 10000, in the DEC traces for the period from Sep. 12 to Sep. 18. In other words, the figure shows the probability of a document being accessed again as a function of the time since the last access to this document. The graphs for other traces are similar to the one shown here. Clearly, the probability of reference drops significantly as the time since last reference increases (note that the y-axis is in logarithmic scale), with occasional spikes around multiples of 24 hours.

Figure 3 shows the accumulative percentage of references to documents whose last references are less than t minutes ago, for the entire DEC traces from Aug. 29 to Sep. 22. The dashed curve on the graph shows the corresponding percentage of bytes referenced. In Figure 3, which uses linear scale for the y-axis, and logarithmic scale for the x-axis, we see that the curves are nearly linear. That is, the probability of a document being referenced again within t minutes is proportionally to log(t), indicating that the probability of re-reference to documents referenced exactly t minutes ago can be modeled as k/t, where k is a constant.

A different study [LRV97] reached very similar conclusions on a different set of traces. Indeed, it is this observation that promoted the design of the function D(t) in LRV. Since the studies find similar temporal locality patterns in the Web access traces, the probability density function of k/t has been used to simulate temporal locality behavior in a recent Web proxy benchmark [WPB].

There are two reasons for the good locality in Web accesses seen by the proxy. One is that each user's accesses tend to exhibit locality -- figure 2 shows the probability that a document is accessed by a user t minutes after the last access by the same user, for DEC traces in the period from Sep. 12 to Sep. 18 (again, the figures for other traces are similar). Clearly, each user tends to re-access recently-read documents, and re-access documents that are read on a daily basis (note the spikes around 24 hours, 48 hours, etc. in the figure). Though one might expect that browsers' caches absorb the locality among the same user's accesses seen by the proxy, the results seems to indicate that this is not necessarily the case, and users are using proxy caches as an extension to the browser cache.  [LRV97] observes the same phenomenon.

The other reason is that users' interests overlap in time -- comparing figures 2 and  1, we can see that for the same t, the percentage in figure 1 is higher than that in figure 2. This indicates that part of the locality observed by the proxy comes from the fact that the proxy sees a merged stream of accesses from many independent users, who share a certain amount of common interests. Thus, we believe the locality observed is not particular to the traces described here, but rather a common characteristic of accesses seen by proxies with a large enough user community.

   figure176
Figure 3: Percentage of references and percentage of bytes referenced to documents accessed less than t minutes ago (the accumulative version of Figure 1). Note that the y-axis is in linear scale and the x-axis is in log scale.

To further demonstrate the effect of inter-user sharing on hit ratios, Figure 4 shows the hit ratio and byte hit ratio as a function of the size of the user group sharing the cache. The figures are quartile graphs [Tufte], the middle curve showing the median of the hit ratios of individual groups of clients in the DEC traces, and the other four points for each group size showing the minimum, the 25% percentile, the 75% percentile, and the maximum of the hit ratios of individual groups. The median hit ratios show an almost linear increase as the group size doubles.

 

  figure177
Figure 4: Hit ratio and byte hit ratio as a function of the size of the user group sharing the cache. The x-axis is in log scale.

In the absence of cost and size concerns, LRU is the optimal online algorithm for reference streams exhibiting good locality [CD73] (strictly speaking, those conforming to the LRU-stack model). However, in the Web context, replacing a more recently used but large file can yield a higher hit ratio than replacing a less recently used but small file. Similarly, replacing a more recently used but inexpensive file may yield a lower total cost than replacing a less recently used but expensive file. Thus, we need an algorithm that combines locality, size and cost considerations in a simple, online way that does not require tuning paramters according to the particular traces, and yet maximizes the overall performance.


next up previous
Next: GreedyDual-Size Algorithm Up: Web Proxy Traces Previous: Web Proxy Traces

Pei Cao
Thu Oct 23 18:04:42 CDT 1997