Our compression algorithms are roughly similar to the well-known MTF (``move-to-front'') algorithm, which maintains an LRU ordering, but is unusual in its use of partial matching and a fixed 32-bit word as its basic granularity of operation. (The general MTF scheme is fairly obvious and has been invented independently at least four times [BCW90] before we reinvented it yet again.)
The use of partial matching (only the high bits) can be viewed as a simple and fast approximation of delta coding, a technique used for purely numeric data (such as sensor input data or digitized audio) [Nel95].4 Delta coding (a form of differential coding) encodes a numerical value as a numerical difference from the previous numerical value. Unlike a traditional delta coder, our algorithm can encode a value by its difference (low bits) from any of the values in an MTF dictionary, rather than the unique previous value.
In [KGJ96], Kjelso, Gooch, and Jones presented a compression algorithm also designed for in-memory data. Their X-match algorithm (which is designed for hardware implementation) is similar to ours in that both use a small dictionary of recently used words. Rizzo, in [Riz97], also devised a compression algorithm specific to in-memory data. His approach was to compress away the large number of zeros found in such data. Rizzo asserts that more complex modeling would be too costly. We have shown that it is possible to find more regularities without great computational expense.
While we have not addressed the compression of machine code, others have shown that it is possible to compress machine code by a factor of 3 using a specially tuned version of a conventional compressor [Yu96] and by as much as a factor of 5 using a compressor that understands the instruction set [EEF +97]. We believe that similar techniques can be made very fast and achieve a compression ratio of at least 2, similar to the ratios we get for data, so an overall compression ratio of 2 for both code and data should generally be achievable. This is within 20% of the size reduction found by Cogswell and Russinovich using an extremely fast, simple, and untuned ``general purpose'' compression algorithm [RC96]. (Their paging data also support the assumption that full workloads exhibit the kind of locality needed for compressed paging, making our focus on data paging more reasonable.)
A significant previous study of compressed caching was done by Douglis, who implemented a compressed virtual memory for the Sprite operating system and evaluated it on a DECStation 5000, which is several times to an order of magnitude slower than the machines we used in our experiments.
Douglis's results were mixed, in that compressed virtual memory was beneficial for some programs and detrimental to others. As should be apparent from our discussion of performance modeling, we believe that this was primarily due to the slow hardware (by today's standards) used. This is supported by our sensitivity analysis, which showed that an 8 times slower machine than a 300 MHz UltraSPARC would yield mixed results, even with better compression algorithms than those available to Douglis.
As discussed earlier, Russinovich and Cogswell's study [RC96] showed that a simple compression cache was unlikely to achieve significant benefits for the PC application workload they studied. Nevertheless, their results do not seem to accurately reflect the trade-offs involved. On one hand, they reported compression overheads that seem unrealistically low (0.05ms per compression on an Intel 80486 DX2/66, which is improbable even taking only the memory bandwidth limitations into account). But the single factor responsible for their results is the very high overhead for handling a page fault that they incurred (2ms--this is overhead not containing the actual seek time). This overhead is certainly a result of using a slow processor but it is possibly also an artifact of the OS used (Windows 95) and their implementation.
A study on compressed caching, performed in 1997 but only very recently published, was done by Kjelso, Gooch, and Jones [KGJ99]. They, too, used simulations to demonstrate the efficacy of compressed caching. Additionally, they addressed the problem of memory management for the variable-size compressed pages. Their experiments used the LZRW1 compression algorithm in software and showed for most programs the same kinds of reduction in paging costs that we observed. These benefits become even greater with a hardware implementation of their X-match algorithm.
Kjelso, Gooch, and Jones did not, however, address the issue of adaptively resizing the compressed cache in response to reference behavior. Instead, they assumed that it is always beneficial to compress more pages to avoid disk faults. This is clearly not true as when more pages are compressed, many more memory accesses may suffer a decompression overhead, while only a few disk faults may be avoided. The purpose of our adaptive mechanism is to determine when the trade-off is beneficial and compression should actually be performed. Kjelso, Gooch, and Jones did acknowledge that some compressed cache sizes can damage performance. Indeed, their results strongly suggest the need for adaptivity: two of their four test programs exhibit performance deterioration under software compression for several memory sizes.