University of Massachusetts
Amherst, MA 01003
Fred Douglis Jason LaVoie John M. Tracey
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Despite ever-increasing capacities, significant benefits can still be realized by reducing the number of bytes needed to represent an object when it is stored or sent. The benefits can be especially great for mobile devices with limited storage or bandwidth; reference data (data that are saved permanently and accessed infrequently); e-mail, in which large byte sequences are commonly repeated; and data transferred over low-bandwidth or congested links.
Reducing bytes generally equates to eliminating unneeded data, and there are numerous techniques for reducing redundancy when objects are stored or sent. The most longstanding example is data compression , which eliminates redundancy internal to an object and generally reduces textual data by factors of two to six. Duplicate suppression eliminates redundancy caused by identical objects which can be detected efficiently by comparing hashes of the objects' content . Delta-encoding eliminates redundancy of one object relative to another, often an earlier version of the object by the same name . Delta-encoding can in some cases eliminate an object almost entirely, but the availability of base versions against which to compute a delta can be problematic.
Recently, much work has been performed on applying these techniques to pieces of individual objects. This includes suppressing duplicate pieces of files [7,8,17,20] and web pages . Delta-encoding has also been extended to pairs of files that do not share an explicit versioning relationship [6,9,18]. There are also approaches that combine multiple techniques; for instance, the vcdiff program not only encodes differences between a ``version'' file and a ``reference'' file, it compresses redundancy within the version file . Delta-encoding that simultaneously compresses is sometimes called ``delta compression'' .
In fact, no single technique can be expected to work best across a wide variety of data sets. There are numerous trade-offs between the effectiveness of data reduction and the resources required to achieve it, i.e. its efficiency. The relative importance of these metrics, effectiveness versus efficiency, depends on the environment in which techniques are applied. Execution time for example, which is an important aspect of efficiency, tends to be more important in interactive contexts than in asynchronous ones. In this paper, we describe a new data reduction technique that achieves comparable effectiveness to current delta-encoding techniques but with greater efficiency. It simultaneously offers better effectiveness than current duplicate suppression techniques at moderately higher cost.
We argue that performing comparisons at the granularity of files can miss opportunities for redundancy elimination, as can techniques that rely on large contiguous pieces of files to be identical. (Henson's study of issues relating to comparing blocks by hashes of their content made a similar argument .) Instead, we consider what happens if some of the above techniques are further combined. Specifically, we describe a system that supports the union of three techniques: compression, elimination of identical content-defined chunks, and delta-compression of similar chunks. We refer to this technique as Redundancy Elimination at the Block Level (REBL). The key insight of this work is the ability to achieve more effective data reduction by exploiting relationships among similar blocks, rather than only among identical blocks, while keeping computational and memory overheads comparable to techniques that perform redundancy detection with coarser granularity.
We compare our new approach with a number of baseline techniques, which are summarized here and described in detail in the next section:
The remainder of this paper is organized as follows: Section 2 describes current techniques and their limitations. Details of the REBL technique are presented in Section 3. Section 4 describes the data sets and methodology used to evaluate REBL. Section 5 presents an empirical evaluation of REBL and several other techniques. Section 6 concludes.
We discuss current techniques in Section 2.1 and elaborate on their limitations in Section 2.2.
A common approach to storing a collection of files compactly is to combine the files into a single object, which is then compressed on-the-fly. In Windows, this function is served by the family of zip programs, though they do not necessarily identify inter-file redundancy in addition to intra-file redundancy. In UNIX, files can be combined using tar with the output compressed using gzip or another compression program. However, TGZ does not scale well to extremely large file sets. Access to a single file in the set can potentially require the entire collection to be uncompressed. Furthermore, traditional compression algorithms maintain a limited amount of state information. This can cause them to miss redundancy between sections of an object that are distant from one another thus reducing their effectiveness. Historically the window used to detect redundancy is small, for instance 32 KB, but at least one new compression program uses memory-mapped I/O and increased state to find repeated substrings across hundreds of MB .
There are two general methods for compressing a collection of files with greater effectiveness and scalability than TGZ. One involves finding identical chunks of data within and across files. The other involves effective encoding of differences between files. We now describe each of these approaches in turn.
Finding identical files in a self-contained collection is straightforward through the use of strong hashes of their content. For example, Bolosky et al. have described a system to save only one instance of duplicate files in a Windows file system . Mogul et al. have described a method for computing checksums over web resources (HTML pages, images, etc.) and eliminating retrieval of identical resources, even when accessed via different URIs .
Suppressing redundancy within a file is also important. One simple approach is to divide the file into fixed-length blocks and compute a checksum for each. Identical blocks are detected by searching for repeated checksums. The checksum algorithm must be ``strong'' enough to decrease the probability of a collision to an negligible value. SHA-1  (``SHA'') is commonly used for this purpose.
Venti  is a network-based storage system intended primarily for archival purposes. Each stored file is broken into fixed-sized blocks, which are represented by their SHA hashes. As files are incrementally stored, duplicate blocks, indicated by identical SHA values, are stored only once. Each file is represented as a list of SHA hashes which index blocks in the storage system.
Another algorithm used to minimize the bandwidth required to propagate updates to a file is rsync [23,29]. With rsync, the receiver divides its out-of-date copy of a file into fixed-length blocks, calculates two checksums for each block (a weak 32-bit checksum and a strong 128-bit MD4 checksum), and transmits the checksums to the sender to inform it which blocks the receiver possesses. The sender calculates a 32-bit checksum along a fixed-length window that it slides throughout the file to be sent. If the 32-bit checksum matches a value sent by the receiver, the sender confirms that the receiver already possesses the corresponding block by computing and comparing the 128-bit checksum. Use of a rolling checksum over fixed-sized blocks has recently been extended to large collections of files regardless of name . We refer to this as the SLIDINGBLOCK approach, which is one of the techniques against which we compare REBL below.
One can maintain a large replicated collection of files in a distributed environment using a technique similar to SLIDINGBLOCK . Suel et al. point out two main parameters for rsync performance: block size and location of changes within the file. To enhance the performance of rsync, they propose a multi-phase protocol in which the server sends hashes to the client and client returns a bitmap indicating the blocks it already has (similar to rsync). In this approach, the server uses the bitmap of existing blocks to create a set of reference blocks used to encode all blocks not present at the client. The delta sent to the client by the server is used in conjunction with blocks in the bitmap to recreate the original file. This technique has some similarity to Spring and Wetherall's approach to finding redundant data on a network link by caching ``interesting'' fingerprints of sliding windows of data and then finding the fingerprints in a cache of past transmissions .
The Low-Bandwidth File System (LBFS)  is a network file system designed to minimize bandwidth consumption. LBFS divides files into content-defined chunks using Rabin fingerprints . Fingerprints are computed over a sliding window; a subset of possible fingerprint values denotes chunk boundaries, with the subset determining a probabilistic average chunk size. LBFS also imposes a minimum and maximum chunk size, irrespective of fingerprint values.
LBFS computes and stores an SHA hash for each content-defined chunk stored in a given file system. Before a file is transmitted, the SHA values of each chunk in the file are sent first. The receiver looks up each hash value in a database of values for all chunks it possesses. Only chunks not already available at the receiver are sent; chunks that are sent are compressed. Content-defined chunks have also been used used in the web  and backup systems . We refer to the overall technique of eliminating duplicate content-defined chunks and compressing remaining chunks as CDC, and we compare REBL with this combined technique in the evaluation section below.
A second general approach to compressing multiple data objects is delta-encoding. This approach has been used in many applications including source control , backup , and web retrieval [14,15]. Delta-encoding has also been used on web pages identified by the similarity of their URIs .
Effective delta-encoding relies on the ability to detect similar files. Name-based file pairing works only in very limited cases. With large file sets, the best way to detect similar files is to examine the file contents. Manber  discusses a basic approach to finding similar files in a large collection of files. His technique summarizes each file by computing a set of polynomial-based fingerprints; the similarity between two files is proportional to the fraction of fingerprints common between them. Rabin fingerprints have been used for this purpose in numerous studies. Broder developed a similar approach , which he extended with an heuristic to coalesce multiple fingerprints into super-fingerprints. A single matching super-fingerprint implies high similarity, allowing similarity detection to scale to very large file sets such as web search engines .
While these techniques allow similar files to be identified, only recently have they been combined with delta-encoding to save space. Douglis and Iyengar describe ``Delta-Encoding via Resemblance Detection'' (DERD), a system that uses Rabin fingerprints and delta-encoding to compress similar files . The similarity of files is based on a subset of all fingerprints generated for each file. Ouyang et al. also study the use of delta compression to store a collection of files ; their approach to scalability is discussed in the next subsection.
Duplicate elimination exploits only files, blocks or chunks that are exactly the same. Even a single bit difference in otherwise identical content can prevent any similarity from being recognized. Thus, a version of a file that has many minor changes scattered throughout sees no benefit from the CDC or SLIDINGBLOCK techniques. Section 5.1 includes graphs of the overlap of fingerprints in CDC chunks that indicate how common this issue can be.
DERD uses delta-encoding, which eliminates redundancy at fine granularity when similar files can be identified. Resemblance detection using Rabin fingerprints is more efficient than the brute force approach of delta-encoding every possible pair of files. A straightforward approach to identifying similar files is to count the number of files that share even a single fingerprint with a given file. However, repeating this for every fingerprint of every file results in an algorithm with O(n2) complexity in the worst case, where n is the number of files. For large file sets, run time is dominated by the number of pairwise comparisons and can grow quite large even if the time for each comparison is small. DERD's performance therefore does not scale well with large data sets.
The computational complexity of delta-encoding file sets motivates cluster-based delta compression . With this approach, large file sets are first divided into clusters. The intent is to group files expected to bear some resemblance. This can be achieved by grouping files according to a variety of criteria including names, sizes, or fingerprints. (Douglis and Iyengar used name-based clusters to make the processing of a large file set tractable in terms of memory consumption ; similar benefits apply to processing time.) Once files are clustered, the techniques described above can be used to identify good candidate pairs for delta-encoding within each cluster. Clustering reduces the size of any file set to which the O(n2) algorithm described above is applied. When applied over all clusters, the technique results in an approximation to the optimal delta-encoding. How close this approximation is depends on the amount of overlap across clusters and is therefore extremely data-dependent.
Another important issue is that DERD does not detect matches between an encoded object and pieces of multiple other objects. Consider for example, an object A that consists of the concatenation of objects B-Z. Each object B-Z could be encoded as a byte range within A, but DERD would likely not detect any of the objects B-Z as being similar to A. This is due, in part, to the decision to represent each file by a fixed number of fingerprints regardless of file size. Because Rabin fingerprint values are uniformly distributed, the probability of a small file's fingerprints intersecting a large containing file's fingerprints is proportional to the ratio of their sizes. In the case of 25 files contained within a single 26th file, if the 25 files are of equal size but contain different data, each will contribute about 1/25 of the fingerprints in the container. This makes detection of overlap unlikely.
The problem arises because of the distinction between resemblance and containment. Broder's definition of containment of B in A is the ratio of the intersection of the fingerprints of the two files to the fingerprints in B, i.e. . When the number of fingerprints for a file is fixed regardless of size, the estimator of this intersection no longer approximates the full set. On the other hand, deciding there is a strong resemblance between the two is reasonably accurate, because for two documents to resemble each other, they need to be of similar size.
Finally, it is possible that extremely large data sets would not lend themselves to ``compare-by-hash'' because of the prospect of an undetected collision : hashes can be collision-resistant but there will be collisions given enough inputs. In a system that is deciding whether two local objects are identical, a hash can be used to find the two objects before expending the additional effort to compare the two objects explicitly. Our data sets are not of sufficient scale for that to pose a likely problem, so we did not include this extra step. While we chose to follow the protocols of past systems, explicit comparisons could easily be added.
We have designed and implemented a new technique that applies aspects of several others in a novel way to attain benefits in both effectiveness and efficiency. This technique, called Redundancy Elimination at the Block Level (REBL), includes features of compression, CDC, and DERD. It divides objects into content-defined chunks, which are identified using SHA hashes. First duplicate chunks are removed, and then resemblance detection is performed on each remaining chunk to identify chunks with sufficient redundancy to benefit from delta-encoding. Chunks not handled by either duplicate elimination or resemblance detection are simply compressed. A more detailed description of REBL appears in Section 4.1.
Key to REBL's ability to achieve efficiency comparable to CDC, instead of suffering the scalability problems of DERD, are optimizations that allow resemblance detection to be used more effectively on chunks rather than whole files. Resemblance detection has been optimized for use in Internet search engines to detect nearly identical files. The optimization consists of summarizing a set of fingerprints into a smaller set of super-fingerprints, possibly a single super-fingerprint. Objects that share even a single super-fingerprint are extremely likely to be nearly identical .
Optimized resemblance detection works well for Internet search engines where the goal is to detect documents that are nearly identical. Detecting objects that are merely similar enough to benefit from delta-encoding is harder. We hypothesized that applying super-fingerprints to full files in DERD would significantly improve the time needed to identify similar files but would also dramatically reduce the number of files deemed similar, resulting in lower savings than the brute force technique that counts individual matching fingerprints . In practice, we found using the super-fingerprint technique with whole files works better than we anticipated, but it is still not the most effective approach (see Section 5.1.4 for details).
In contrast, REBL can benefit from the optimized resemblance detection because it divides files into chunks and looks for near duplicates of each chunk separately. This technique can potentially sacrifice some ``marginal'' deltas that would save some space. We quantify this sacrifice by comparing the super-fingerprint approach with the DERD technique that enumerates the best matches from exact fingerprints.
We used several data sets to test REBL's effectiveness and efficiency. Table 1 lists the different data sets, giving their size, the number of files, and the number of content-defined chunks with the targeted average chunk-size set to 1 KB and 4 KB.
The Slashdot and Yahoo data sets are web pages that were downloaded and saved, as a system such as the Internet Archive  might archive web pages. Slashdot represents multiple pages downloaded over a period of about a day, wherein different pages tend to have numerous small changes corresponding mostly to updated counts of user comments. (While the Internet Archive would not currently save pages with such granularity, an archival system might if it could do so efficiently.) As the smallest data set, Slashdot is used below in several cases where there are large numbers of experiments with varying parameters to keep the total execution time within reason. Yahoo represents a number of different pages downloaded recursively at a single point in time. Emacs contains the source trees for GNU Emacs 20.6 and 20.7. The MH data set refers to individual files consisting of entire e-mail messages. Finally, Users is the contents of one partition of a shared storage system within IBM, containing data from 33 users totaling nearly 7 GB.
REBL is intended for much larger data sets than the ones presented here. However, the analysis was implemented using in-memory data structures based on the GNU C++ Standard Template Library, and as a result the application's virtual address space places limits on the metadata (particularly matching pairs of fingerprints) kept in memory. The results presented below demonstrate the scalability issues mentioned previously, and lead us to believe that one can extrapolate to larger data sets once out-of-memory data structures are used. Furthermore, one can vary parameters such as block size and the number of super-fingerprints per block to keep the meta-data requirements low. In particular, the Users data set is an order of magnitude larger than the next-largest data set, containing many large files, so the REBL analysis of it is done with an average chunk size of 4 KB rather than 1 KB.
A separate application reads the fingerprints and chunk information to perform REBL or DERD analysis. First, it computes super-fingerprints from the fingerprints, given a specific ratio of fingerprints per super-fingerprint (with a ratio of one, this step would be skipped). At this point, we have the option of FirstFit or BestFit.
Next we compress any remaining chunks that have not already been delta-encoded, including the reference chunks.
Finally, each of the files in the data set is compressed to determine if compressing the entire file is more effective than eliminating duplicate blocks and delta-encoding similar blocks. If so, the WFC size is used instead. We found that CDC in the absence of WFC was much less effective than the combination of the two, but adding WFC to REBL usually made only a small difference because most chunks already benefitted from delta-encoding (see Section 5.2).
One technique we did not explicitly evaluate is duplication detection using fixed-size blocks, like that performed by Venti : we worked under the assumption that CDC would be preferable to fixed-size blocks. Other studies that compare the two techniques head-to-head have found that CDC frequently compresses better, at the cost of increased computation [8,19].
This section provides an empirical evaluation of several data reduction techniques with an emphasis on REBL. For REBL, we study the effect of parameters such as the average chunk size, the number of super-fingerprints, the similarity threshold above which delta-encoding is applied, and the shingle size. The techniques are compared along primarily two-dimensions: effectiveness (space savings) and efficiency (run time).
The techniques evaluated in at least one scenario include:
Our experiments were performed on an unmodified RedHat Linux 2.4.18-10 kernel running on an IBM eServer xSeries 360 with dual 1.60 GHz Pentium Xeon processors, 6 GB RAM (2x2 GB plus 2x1 GB), and three 36 GB 10k-RPM SCSI disks connected to an IBM Netfinity ServeRAID 5 controller. All data sets resided in an untuned ext3 file system on local disks. Although an SMP kernel was used, our tests were not optimized to utilize both processors. All times reported are the sums of user and system time as reported by getrusage.
As discussed in Section 2.1, CDC systems such as LBFS  compute SHA hashes of content-defined chunks and use the hash value to detect duplicates. A potential limitation of this approach is that chunks with slight differences get no benefit from the overlap. For REBL to be more effective than CDC, there must be a substantial number of chunks that are similar but not identical.
Figure 1 plots a cumulative distribution of the fraction of chunks that match another chunk in a given number of fingerprints. The graph shows results for the Slashdot and Yahoo data sets with 84 fingerprints per chunk and shows curves corresponding to average chunk sizes of 1 KB, 4 KB, and whole files. Whole files correspond to an infinitely large average chunk size, which is similar to DERD. All chunks match another chunk in at least 0 fingerprints, so each curve meets the 0 value on the x-axis at y=1. The rightmost points on the graph (depicted as x=85) show the fraction of chunks that are duplicated; smaller chunks lead to greater effectiveness for CDC because they allow duplicate content to be detected with finer granularity. Between these extremes, more of the smallest chunks match other chunks in the greatest number of features. However, any chunks that are not exact duplicates but match many fingerprints are ``missed'' by CDC, but they are potentially usable by REBL for delta-encoding and result in improved space savings. A good heuristic for expecting improvement from delta-encoding is to match at least half the fingerprints .
Next we look at the use of a smaller number of super-fingerprints to approximate a larger number of fingerprints. As discussed in Section 3, super-fingerprints are generated by combining fingerprints. For instance, 84 fingerprints can be clustered into groups of six to form 14 super-fingerprints. To generate super-fingerprints, REBL concatenates fingerprints and calculates the corresponding MD5 value of the resulting string.
Note that the clustering of fingerprints into super-fingerprints is necessary to make the FirstFit variant viable. One could delta-encode any pair of chunks that matched a single fingerprint, but unless the shingle size is quite high, there is little assurance of commonality between the chunks. On the other hand, if two chunks share a specific set of fingerprints, the larger the set, the greater the likelihood of a significant overlap .
Figure 2 plots the cumulative distribution of the number of chunks that match another in at least a threshold fraction of fingerprints or super-fingerprints. The data sets used are Slashdot and MH, with 84 fingerprints, 1-KB average chunks, and 21, 14, 6 and 4 super-fingerprints per chunk. (The curve for 84 fingerprints on the Slashdot data set corresponds to the 1-KB curve for Slashdot in the preceding figure.) The results indicate that lowering the threshold for similarity between chunks results in more chunks being considered ``similar.'' The results for super-fingerprints follow a similar trend as for regular fingerprints. A useful observation from both data sets is that we can select a threshold value for super-fingerprints that corresponds to a higher number of matching fingerprints. For example, in Figure 2(a), a threshold of one out of four (25%) super-fingerprints is approximately equivalent to a threshold of 73 out of 84 (87%) fingerprints. Using super-fingerprints therefore decreases REBL's execution time by reducing the number of comparisons, as the next subsection describes.
As discussed in Section 4.1, REBL has two variants, BestFit and FirstFit. In this section, we compare them by contrasting relative effectiveness and efficiency.
Figure 3(a) plots the sizes of the MH and Slashdot data sets, encoded using the FirstFit and BestFit variants, and reported relative to the original data set sizes. The experiment uses 84 fingerprints per chunk and two chunking methods, whole files (DERD) and an average chunk size of 1 KB. In this subsection we consider only the 1-KB chunks, discussing later how this compares to other sizes. The figure demonstrates the effect of varying the number of super-fingerprints from three to 84 (with 84 meaning there is no clustering). Both FirstFit and BestFit have comparable encoding sizes for up to a number of super-fingerprints (21 for these two data sets). After this point, BestFit encodes the most effectively because taking the first fit with too few fingerprints per cluster is a poor predictor of a match.
Figure 3(b) plots the corresponding execution times. As we increase the number of super-fingerprints, the number of comparisons for BestFit to detect similar chunks increases, leading to dramatically greater execution times. The execution times using FirstFit are more or less stable and do not show the same sharp increase. In short, using FirstFit allows a little space to be sacrificed in exchange for dramatically lower execution times. For example, with Slashdot using FirstFit, 1-KB chunks, and six super-fingerprints, REBL produces a relative size of 1.52%; the best BestFit number is 1.18% with 42 super-fingerprints. However, the corresponding absolute execution times are 8.1 and 173.5 seconds respectively.
As mentioned in Section 3.1, one interesting parameter that can be modified when using BestFit is its eagerness to use the most similar reference chunk against which to delta-encode a version chunk. For instance, one might naturally assume that encoding a chunk against one that matches it in 80/84 fingerprints would be preferable to encoding it against another that matches only 70/84. However, consider a case where chunk A matches chunk B in 82 fingerprints, chunk C in 75, and chunk D in 70; C and D resemble each other in 80/84. Encoding A against B and C against D generates two small deltas and two reference chunks, but encoding B, C, and D against A results in slightly larger deltas but only one unencoded chunk. As a result of this eagerness, FirstFit often surprisingly encoded better than BestFit until we added an approximation metric to BestFit, which lets a given chunk be encoded against a specific reference chunk if the latter chunk is within a factor of the best matching chunk. Empirically, allowing matches within 80-90% of the best match improved overall effectiveness, as shown in the sensitivity analysis in Section 5.3.2.
Referring again to Figure 3(a), but this time considering the curves for DERD as well as 1-KB chunks, we see that REBL is always at least as effective as DERD for both FirstFit and BestFit. The curves for 4-KB chunks are omitted in order to keep the graphs readable, but generally follow the 1-KB chunk curves with slightly more space consumed.
The corresponding execution times for DERD and REBL are plotted in Figure 3(b). As expected, the greatest execution time is for REBL with BestFit; breaking each file into chunks results in more comparisons. Execution times for BestFit increase sharply with increasing numbers of super-fingerprints. The best overall results considering both effectiveness and efficiency are with the FirstFit variant of REBL using a small number of super-fingerprints. Using 4-KB chunks (not shown) proves to be moderately faster than 1-KB chunks.
One might ask whether it is sufficient to simply use some number (N) of fingerprints rather than combining a larger number of fingerprints into the same number (N) of super-fingerprints. In fact, with BestFit, using as few as 14 fingerprints is nearly as effective as using 84 fingerprints. However, even with only 14 fingerprints, the execution cost of BestFit is substantially greater than FirstFit. Figure 4 reports relative sizes and execution times for the Slashdot data set as a function of the number of fingerprints or super-fingerprints, using an average chunk size of 4 KB. With super-fingerprints and FirstFit, relative size increases with more super-fingerprints, while with fingerprints and BestFit relative size decreases with more fingerprints. On the other hand, execution time with BestFit increases sharply with more fingerprints. The effectiveness with super-fingerprints using FirstFit is similar to that using a larger number of fingerprints and BestFit.
To quantify these differences with the example in Figure 4, FirstFit with seven super-fingerprints has a relative size of 1.64%; the best effectiveness using fingerprints is 1.57% with 84 fingerprints per chunk. However, the former runs in just 1.4% of the time taken by the latter. The relative execution time with 14 fingerprints and BestFit, which gives comparable results to using 84 fingerprints, is 26.0%. Thus, lowering the number of fingerprints per chunk to reduce comparisons (and increase efficiency) may not yield the best encoding size and execution times. In contrast, the use of super-fingerprints and the FirstFit variant of REBL is both effective and efficient.
In this section, we compare a variety of techniques, focusing on effectiveness and briefly discussing efficiency as indicated by execution times. Table 2 reports sizes compared to the original data set. The techniques include: compression of entire collections (TGZ), individual files (WFC), and individual blocks or chunks (PBC); SLIDINGBLOCK; CDC; REBL; and DERD. The relative sizes with CDC are reported for average chunk sizes of 1 KB (with and without WFC) and 4 KB (with WFC). REBL numbers use average chunk sizes of 1 KB (except the 7-GB Users set) and 4 KB, and they include both PBC and WFC.
For the experiments using these data sets, we strove for consistency whenever possible. However, there are some cases where varying a parameter or application made a huge difference. In particular, gzip produces output somewhat smaller than vcdiff for all our data sets except Slashdot, for which it is nearly an order of magnitude larger. We report the vcdiff number in that case only.
For both REBL and DERD, the table gives numbers for 14 super-fingerprints using FirstFit. Full comparisons of regular fingerprints generally gave a smaller encoding, but at a disproportionately high processing cost as the experiments above demonstrated. REBL had the smallest encoding size in three out of the five data sets above. REBL encoded more effectively than CDC, SLIDINGBLOCK, and WFC with all data sets; it was better than TGZ except with the Yahoo data set; and it was better than DERD except with the Yahoo and Emacs data sets.
Figure 5 graphically depicts the data in Table 2, and Figure 6 shows a scatterplot of how the other techniques compare to REBL. For consistency, we compare the encoding sizes of CDC (1-KB chunks) with REBL (1-KB chunks) and CDC (4-KB chunks) with REBL (4-KB chunks). Users is compared to REBL with 4-KB chunks throughout. Otherwise, REBL with 1-KB chunks is used as the baseline.
As with REBL using 1-KB chunks, REBL using 4-KB chunks is better than TGZ (except with the Yahoo data set), WFC, SLIDINGBLOCK, CDC with either 1-KB and 4-KB chunks and DERD except the Yahoo and Emacs data sets. The effectiveness of REBL compared to TGZ varied by factors of 0.59-2.46, WFC by 1.28-14.25, SLIDINGBLOCK by 1.18-2.56, CDC by 1.03-6.67 and DERD by 0.88-2.91.
Additionally, we compare the effectiveness of REBL with and without using WFC. The relative sizes of the Slashdot, Yahoo, Emacs, and MH data sets using REBL with 1-KB chunks and without WFC are 1.9%, 14.85%, 17.8% and 36.0% respectively. REBL without WFC with 4-KB chunks using the Users data set has a relative size of 30.6%. The corresponding relative sizes of REBL with WFC are reported in Table 2. Without WFC, the relative sizes for two of the data sets (Slashdot and Users) are within 3% of the sizes that include WFC, but REBL without WFC encodes worse than REBL with WFC for the Yahoo, Emacs, MH and data sets by 7.5%, 16.2%, and 11.5% respectively.
An obvious question is how the execution time of REBL compares to the other approaches we have discussed. We have already compared REBL and DERD in some depth. Since REBL relies on CDC, it is inherently more costly than CDC, which in turn requires substantially more computation than a simple technique such as TGZ. How much more costly REBL is depends on how many deltas are computed and how much computation is performed to select chunks to delta. The FirstFit variant requires processing that scales linearly, rather than quadratically, with the input size, just as the CDC processing does. Hence the additional cost is comparable to CDC processing. The additional space savings varies across data sets; for Slashdot the additional savings seems easily warranted, while for MH it seems unlikely to be worthwhile.
One potentially important parameter for REBL is the average chunk size. As discussed in Section 5.1.1, smaller chunk sizes provide more opportunity to find similar chunks for delta-encoding. Figure 7 reports results of experiments with the Slashdot data set, 84 fingerprints per chunk, and varying the average chunk sizes from 512 to 8192 bytes. In the case of Slashdot, for both FirstFit and BestFit, increasing the average chunk size results in larger encoding sizes. The smallest relative size is obtained with the 512-byte chunk size in both variants of REBL.
The same experiment was performed with the MH data set using the FirstFit variant of REBL. In this case too, the smallest relative size (32.3%) was obtained using a chunk size of 512 bytes and other chunk sizes reported slightly larger sizes. However, there was minimal degradation in effectiveness moving to larger sizes (the maximum relative size was 32.4% with a chunk size of 8 KB).
Choosing a smaller chunk size provides more opportunities for delta-encoding and better space savings but at a higher run-time cost. For example, for the MH data set, 8-KB chunks save about 20% of the REBL post-CDC processing time, compared to 4-KB chunks, for almost identical encoding effectiveness.
Another parameter we evaluated is the threshold for approximate matching of BestFit chunks. Without this threshold, using BestFit, a chunk is encoded against another with which it has the most matching fingerprints or super-fingerprints. For a given reference chunk, the ``BestFit threshold'' determines how loose this match can be, permitting the encoding of any chunks within the specified fraction of the best match. Note that in all cases, the system is counting matches and paying the O(n2) complexity. A very small fraction (low threshold) approximates the FirstFit approach in terms of effectiveness, but not efficiency, as it counts the matches but then largely disregards them.
Figure 8 shows the effect of varying the BestFit threshold on the Slashdot data set using 84 fingerprints per chunk and an average chunk size of 1 KB. The graph indicates that as the threshold increases, effectiveness increases, up to about 90%. A threshold of one corresponds to the most precise match, but it actually misses opportunities for delta-encoding, resulting in increased encoding size. A threshold between 0.7 and 0.9 yields the smallest encoding sizes with regular fingerprints for the Slashdot data set; other data sets show similar trends. As expected, the figure also shows increasing relative times as the threshold increases.
A shingle specifies the size of a window that slides over the entire file advancing one byte a time, producing a Rabin fingerprint value for each fixed-size set of bytes. The Rabin fingerprints are used to flag content-defined chunk boundaries and to generate features for each chunk that can be used to identify similar ones. We performed experiments varying the shingle size, and using the Slashdot and MH data sets with 84 fingerprints per chunk, 14 super-fingerprints, an average chunk size of 4 KB, and the FirstFit variant.
With the Slashdot data set, shingles of four or eight bytes get much less benefit from REBL than larger sizes, but otherwise the analysis is noisy and several disjoint values give similar results. Shingle sizes of 20 and 44 bytes yield similar relative encoding sizes of 2.25% and 2.19% respectively and shingle sizes of 12 and 24 bytes yield relative sizes of 2.52% and 2.53% respectively, whereas a shingle size of 16 bytes results in a relative size of 5.12%. The maximum relative size of 6.84% is obtained with a shingle size of 8 bytes and the minimum of 2.19% with a shingle size of 44 bytes.
In the case of the MH data set, we found that varying the size of a shingle did not vary the relative sizes by as much as in the Slashdot data set, though they did impact processing time. The minimum relative size was 31.2% with a shingle size of 8 bytes and maximum relative size was 32.5% with a shingle size of 44 bytes. However, the 8-byte shingles took 156 CPU-seconds, while 12-byte shingles took 114 CPU-seconds (27% less) to encode the data set to 31.5% of the original (0.7% more).
We conclude that past work that used four-byte shingles  may have found their resemblance detection system to be noisier than necessary, but sizes of twelve bytes or more are probably equally arguable.
We have compared two variants for similarity detection among blocks, FirstFit and BestFit, and demonstrated that FirstFit with super-fingerprints produces a good combination of space reduction and execution overhead. Super-fingerprints are good approximations of regular fingerprints in all the data sets we experimented with. A low threshold of matching super-fingerprints usually results in similar effectiveness to that obtained using a higher threshold for regular fingerprints, but with a dramatically lower execution cost (orders of magnitude in some cases).
The effectiveness of REBL in our experiments is always better than WFC and CDC. However, this is because it incorporates the technology of compression at the file and block level, and the suppression of duplicate blocks, before adding delta-encoding. In fact, compressing individual chunks (or blocks) in any sort of CDC or SLIDINGBLOCK system seems an essential optimization unless the rate of duplication is substantially higher than we have seen in these data sets. This is consistent with the earlier SLIDINGBLOCK work , which found that SLIDINGBLOCK needed to incorporate block-level compression to be competitive with gzip. Compressing entire files when none of its pieces are suppressed as a duplicate similarly offers benefits.
In some cases, REBL encodes noticeably better than DERD; in the other cases, REBL is very similar to it. The key difference between REBL and DERD comes from dramatic reductions in execution times. The average block size used to mark content-defined blocks affects the encoding sizes of REBL to a limited extent; the more similarity there is in a data set, such as Slashdot, the more effective smaller blocks are.
We are currently working on extending and experimenting with the REBL and LBFS techniques to reduce network usage in communication environments. This will be helpful to determine the applicability of REBL in reducing redundant network traffic, and it can also be compared with other network-oriented mechanisms like rsync [23,29] and link-level fingerprint-based duplicate detection . We would also like to evaluate the effectiveness of these techniques in new environments, such as the Google GMail system, which may offer additional opportunities for large amounts of data with subtle variations. Finally, additional detailed comparisons of the wide variety of encoding techniques may offer the opportunity to consider new metrics, such as ``bytes saved per cycle,'' in selecting among the alternatives.
This document was generated using the LaTeX2HTML translator Version 2002 (1.62)
Copyright © 1993, 1994, 1995, 1996,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers -no_navigation -white paper.tex
The translation was initiated by Fred Douglis and Purushottam Kulkarni on 2004-05-12