Check out the new USENIX Web site. next up previous
Next: In-Memory Data Representations Up: Background: Compression Previous: Background: Compression

Ziv-Lempel compression.

Most compression algorithms, including the overwhelmingly popular Ziv-Lempel family, are based on detection of exact repetitions of strings of atomic tokens. The token size is usually one byte, for speed reasons and because much data is in some sense byte-oriented (e.g., characters in a text file) or multiple-byte oriented (e.g., some kinds of image data, Intel Architecture machine code, unicode).

A Ziv-Lempel compressor models by reading through the input data token by token, constructing a dictionary of observed sequences, and looking for repetitions as it goes. It encodes by writing strings to its output the first time they are observed, but writing special codes when a repetition is encountered (e.g., the number of the dictionary entry). The output thus consists of appropriately labeled ``new'' data and references to ``old'' data (repetitions).

The corresponding LZ decompressor reads through this data much like an interpreter, reconstructing the dictionary created during compression. When it sees a new string, it adds it to the dictionary just as the compressor did, as well as sending it to its (uncompressed) output. When it sees a code for a repetition of a dictionary item, it copies that item to its output. In this way, its dictionary always matches the dictionary that the compressor had at the same point in the data stream, and its output replicates the original input by expanding the repetition codes into the strings they represent.

The main assumption embodied by this kind of compressor is that literal repetitions of multi-token strings will occur in the input--e.g., you'll often see several bytes in a row that are exactly the same bytes in the same order as something you saw before. This is a natural assumption in text, and reasonable in some other kinds of data, but often wrong for in-memory data.


next up previous
Next: In-Memory Data Representations Up: Background: Compression Previous: Background: Compression
Scott F. Kaplan
1999-04-27