DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression

Authors: 

Jisung Park, ETH Zürich; Jeonggyun Kim, Yeseong Kim, and Sungjin Lee, DGIST; Onur Mutlu, ETH Zürich

Abstract: 

Data reduction in storage systems is an effective solution to minimize the management cost of a data center. To maximize data-reduction efficiency, prior works propose post-deduplication delta-compression techniques that perform delta compression along with traditional data deduplication and lossless compression. Compared to the two traditional techniques, delta compression could be more effective because it can provide large data reduction even for non-duplicate and high-entropy data by exploiting the similarity within stored data blocks. Unfortunately, we observe that existing post-deduplication delta-compression techniques achieve significantly lower data-reduction ratios than the optimal ratio due to their limited accuracy in identifying similar data blocks.
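
To make the claim about non-duplicate, high-entropy data concrete, the following is a minimal sketch rather than the paper's implementation. It assumes zlib's preset-dictionary mode as a stand-in for a real delta encoder, and the helper names (`fingerprint`, `delta_compress`, `delta_decompress`) are hypothetical. A random block defeats both deduplication (its fingerprint differs from the reference) and plain lossless compression, yet shrinks sharply once it is encoded against a similar reference block.

```python
import hashlib
import random
import zlib


def fingerprint(block: bytes) -> str:
    """Deduplication-style fingerprint: equal only for byte-identical blocks."""
    return hashlib.sha256(block).hexdigest()


def delta_compress(block: bytes, reference: bytes) -> bytes:
    """Encode `block` relative to `reference`.

    Assumption: zlib's preset dictionary (zdict) stands in for a real
    delta encoder; it lets DEFLATE copy matching runs from the reference.
    """
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(block) + comp.flush()


def delta_decompress(delta: bytes, reference: bytes) -> bytes:
    """Rebuild the original block from its delta and the same reference."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


# A high-entropy 4 KiB reference block (seeded so the demo is repeatable).
rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(4096))

# A similar, but not identical, block: eight bytes are overwritten.
modified = bytearray(reference)
modified[100:108] = b"CHANGED!"
target = bytes(modified)

plain = zlib.compress(target, 9)            # lossless compression alone
delta = delta_compress(target, reference)   # delta against the similar block

print("duplicate of reference:", fingerprint(target) == fingerprint(reference))
print("original size:", len(target))
print("lossless size:", len(plain))
print("delta size:", len(delta))
```

The size of the delta depends entirely on how similar the chosen reference is, which is why the accuracy of reference search determines the achievable data-reduction ratio.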

In this paper, we propose DeepSketch, a new reference search technique for post-deduplication delta compression that leverages the learning-to-hash method to achieve higher accuracy in reference search for delta compression, thereby improving data-reduction efficiency. DeepSketch uses a deep neural network to extract a data block's sketch, an approximate data signature of the block that can preserve similarity with other blocks. Our evaluation using eleven real-world workloads shows that DeepSketch improves the data-reduction ratio by up to 33% (21% on average) over a state-of-the-art post-deduplication delta-compression technique.
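
As a rough illustration of sketch-based reference search, the example below swaps the paper's deep neural network for a simple SimHash over byte shingles. This is a different and much cruder technique, used here only because it also yields a compact signature that tends to stay close for similar blocks. The names `toy_sketch`, `hamming`, and `find_reference`, the 64-bit sketch width, and the `max_distance` threshold are all assumptions made for this sketch, not details from the paper.

```python
import hashlib
import random

SKETCH_BITS = 64  # assumed sketch width for this illustration


def toy_sketch(block: bytes, shingle: int = 4) -> int:
    """SimHash over overlapping byte shingles (stand-in for a learned sketch)."""
    counts = [0] * SKETCH_BITS
    for i in range(len(block) - shingle + 1):
        digest = hashlib.blake2b(block[i:i + shingle], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(SKETCH_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    sketch = 0
    for bit, votes in enumerate(counts):
        if votes > 0:
            sketch |= 1 << bit
    return sketch


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")


def find_reference(target: bytes, store: dict, max_distance: int = 16):
    """Return the key of the stored sketch closest to `target`, or None.

    `store` maps a block identifier to that block's sketch. The threshold
    `max_distance` is a hypothetical cutoff for "similar enough".
    """
    if not store:
        return None
    t = toy_sketch(target)
    best = min(store, key=lambda key: hamming(t, store[key]))
    return best if hamming(t, store[best]) <= max_distance else None


# Two unrelated stored blocks and one incoming block that resembles block "a".
rng = random.Random(0)
block_a = bytes(rng.getrandbits(8) for _ in range(1024))
block_b = bytes(rng.getrandbits(8) for _ in range(1024))
near_a = block_a[:200] + b"EDITED!!" + block_a[208:]

store = {"a": toy_sketch(block_a), "b": toy_sketch(block_b)}
dist_a = hamming(toy_sketch(near_a), store["a"])
dist_b = hamming(toy_sketch(near_a), store["b"])

print("distance to a:", dist_a)
print("distance to b:", dist_b)
print("chosen reference:", find_reference(near_a, store))
```

The linear scan over `store` keeps the example short; a real system would need an index over sketches to search a large set of stored blocks efficiently.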

BibTeX
@inproceedings {277826,
title = {{DeepSketch}: A New Machine {Learning-Based} Reference Search Technique for {Post-Deduplication} Delta Compression},
booktitle = {20th USENIX Conference on File and Storage Technologies (FAST 22)},
year = {2022},
isbn = {978-1-939133-26-7},
address = {Santa Clara, CA},
pages = {247--264},
url = {https://www.usenix.org/conference/fast22/presentation/park},
publisher = {USENIX Association},
month = feb,
}