Figure 2 shows the performance of several lossless data compression applications using metrics of compression ratio, execution time, and static memory allocation. The datasets are the first megabyte (English books and a bibliography) from the Calgary Corpus [5] and one megabyte of easily compressible web data (mostly HTML, Javascript, and CSS) obtained from the homepages of the Internet's most popular websites [32,25]. Graphics were omitted as they are usually in compressed form already and can be recognized by application-layer software via their file extensions. Most popular repositories ([4,10,11]) for comparison of data compression do not examine the memory footprint required for compression or decompression. Though static memory usage may not always reflect the size of the application's working set, it is an essential consideration in mobile computing where memory is a more precious resource. A detailed look at the memory used by each application, and its effect on time, compression ratio, and energy will be presented in Section 3.3.
Figure 2 confirms that we have chosen an array of applications that span a range of compression ratios and execution times. Each application represents a different family of compression algorithms as noted in Table 1. Consideration was also given to popularity and documentation, as well as quality, parameterizability, and portability of the source code. The table includes the default parameters used with each program. To avoid unduly handicapping any algorithm, it is important to work with well-implemented code. Mature applications such as compress, bzip2, and zlib reflect a series of optimizations that have been applied since their introduction. While PPMd is an experimental program, it is effectively an optimization of the Prediction by Partial Match (PPM) compressors that came before it. LZO represents an approach for achieving great speed with LZ77. Each of the five applications is summarized below assuming some familiarity with each algorithm. A more complete treatment with citations may be found in [36].
|
zlib combines LZ77 and Huffman coding to form an algorithm known as ``deflate.'' The LZ77 sliding window size and hash table memory size may be set by the user. LZ77 tries to replace a string of symbols with a pointer to the longest prefix match previously encountered. A larger window improves the ability to find such a match. More memory allows for less collisions in the zlib hash table. Users may also set an ``effort'' parameter which dictates how hard the compressor should try to extend matches it finds in its history buffer. zlib is the library form of the popular gzip utility (the library form was chosen as it provides more options for trading off memory and performance). Unless specified, it is configured with parameters similar to gzip.
LZO is a compression library meant for ``real-time'' compression. Like zlib, it uses LZ77 with a hash table to perform searches. LZO is unique in that its hash table can be sized to fit in 16KB of memory so it can remain in cache. Its small footprint, coding style (it is written completely with macros to avoid function call overhead), and ability to read and write data ``in-place'' without additional copies make LZO extremely fast. In the interest of speed, its hash table can only store pointers to 4096 matches, and no effort is made to find the longest match. Match length and offset are encoded more simply than in zlib.
compress is a popular Unix utility. It implements the LZW algorithm with codewords beginning at nine bits. Though a bit is wasted for each single 8-bit character, once longer strings have been seen, they may be replaced with short codes. When all nine-bit codes have been used, the codebook size is doubled and the use of ten-bit codes begins. This doubling continues until codes are sixteen bits long. The dictionary becomes static once it is entirely full. Whenever compress detects decreasing compression ratio, the dictionary is cleared and the process beings anew. Dictionary entries are stored in a hash table. Hashing allows an average constant-time access to any entry, but has the disadvantage of poor spatial locality when combining multiple entries to form a string. Despite the random dispersal of codes to the table, common strings may benefit from temporal locality. To reduce collisions, the table should be sparsely filled which results in wasted memory. During decompression, each table entry may be inserted without collision.
PPMd is a recent implementation of the PPM algorithm. Windows users may unknowingly be using PPMd as it is the text compression engine in the popular WinRAR program. PPM takes advantage of the fact that the occurrence of a certain symbol can be highly dependent on its context (the string of symbols which preceded it). The PPM scheme maintains such context information to estimate the probability of the next input symbol to appear. An arithmetic coder uses this stream of probabilities to efficiently code the source. As the model becomes more accurate, the occurrence of a highly likely symbol requires fewer bits to encode. Clearly, longer contexts will improve the probability estimation, but it requires time to amass large contexts (this is similar to the startup effect in LZ78). To account for this, ``escape symbols'' exist to progressively step down to shorter context lengths. This introduces a trade-off in which encoding a long series of escape symbols can require more space than is saved by the use of large contexts. Storing and searching through each context accounts for the large memory requirements of PPM schemes. The length of the maximum context can be varied by PPMd, but defaults to four. When the context tree fills up, PPMd can clear and start from scratch, freeze the model and continue statically, or prune sections of the tree until the model fits into memory.
bzip2 is based on the Burrows Wheeler Transform (BWT)
[8]. The BWT converts a block
of length
into a pair consisting of a permutation of
(call it
) and an
integer in the interval
. More important than the details
of the transformation is its effect. The transform collects groups of
identical input symbols such that the probability of finding a symbol
in a region of
is very high if another instance of
is
nearby. Such an
can be processed with a ``move-to-front'' coder
which will yield a series consisting of a small alphabet: runs of
zeros punctuated with low numbers which in turn can be processed with
a Huffman or Arithmetic coder. For processing efficiency, long
runs can be filtered with a run length encoder. As block size is
increased, compression ratio improves. Diminishing returns (with
English text) do not occur until block size reaches several tens of
megabytes. Unlike the other algorithms, one could consider BWT to
take advantage of symbols which appear in the ``future'', not
just those that have passed. bzip2 reads in blocks of data,
run-length-encoding them to improve sort speed. It then applies the
BWT and uses a variant of move-to-front coding to produce a
compressible stream. Though the alphabet may be large, codes are only
created for symbols in use. This stream is run-length encoded to
remove any long runs of zeros. Finally Huffman encoding is applied.
To speed sorting, bzip2 applies a modified quicksort which
has memory requirements over five times the size of the block.