In-Memory Data Representations

Next: Exploiting In-Memory Data Regularities Up: Compression Algorithms Previous: Ziv-Lempel compression.

In-Memory Data Representations

It is commonly thought that LZ-style compression is ``general purpose,'' and that in-memory data are fairly arbitrary--different programs operate on different kinds of data in different ways, so there's not much hope for a better algorithm than LZ for compressing in-memory data. The first assumption is basically false,² and the second is hasty, so the conclusion is dubious.

While different programs do different things, there are some common regularities, which is all a compression algorithm needs to work well on average. Rather than consisting of byte strings, the data in memory are often best viewed as records and data structures--the overall array of memory words is typically used to store records, whose fields are mostly one or two words. Note that fields of records are usually word-aligned and that the data in those words are frequently numbers or pointers. Pointers can be usefully viewed as numbers--they are integer indices into the array of memory itself.

Integer and pointer data often have certain strong regularities. Integer values are usually numerically small (so that only their low-order bytes have significant information content), or else similar to other integers very nearby in memory.

Likewise, pointers are likely to point to other objects nearby in memory, or be similar to other nearby pointers--that is, they may point to another area of memory, but other pointers nearby may point to the same area. These regularities are quite common and strong. One reason is that heap data are often well-clustered; common memory allocators tend to allocate mostly within a small area of memory most of the time; data structures constructed during a particular phase of program execution are often well-clustered and consist of one or a few types of similar objects [WJNB95].

Other kinds of data often show similar regularities. Examples include the hidden headers many allocators put on heap objects, virtual function table pointers in C++ objects, booleans, etc.

These regularities are strong largely because in-memory data representations are designed primarily for speed, not space, and because real programs do not usually use random data or do random things with data. (Even randomized data can be very regular in this way; consider an array of random integers less than 1000--all of them will have zeroes in their upper 22 bits.)

Next: Exploiting In-Memory Data Regularities Up: Compression Algorithms Previous: Ziv-Lempel compression.

Scott F. Kaplan
1999-04-27