The Dilemma between Deduplication and Locality: Can Both be Achieved?


Xiangyu Zou and Jingsong Yuan, Harbin Institute of Technology, Shenzhen; Philip Shilane, Dell Technologies; Wen Xia, Harbin Institute of Technology, Shenzhen, and Wuhan National Laboratory for Optoelectronics; Haijun Zhang and Xuan Wang, Harbin Institute of Technology, Shenzhen


Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem, which leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g. rewriting) or caching data in memory or SSD, but fragmentation continues to hurt restore and GC performance.

Investigating the locality issue, we observed that most duplicate chunks in a backup are directly from its previous backup. We therefore propose a novel management-friendly deduplication framework, called MFDedup, that maintains the locality of backup workloads by using a data classification approach to generate an optimal data layout. Specifically, we use two key techniques: Neighbor-Duplicate-Focus indexing (NDF) and Across-Version-Aware Reorganization scheme (AVAR), to perform duplicate detection against a previous backup and then rearrange chunks with an offline and iterative algorithm into a compact, sequential layout that nearly eliminates random I/O during restoration.

Evaluation results with four backup datasets demonstrates that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12x to 2.19x higher and restore throughputs that are 2.63x to 11.64x faster due to the optimal data layout we achieve. While the rearranging stage introduces overheads, it is more than offset by a nearly-zero overhead GC process. Moreover, the NDF index only requires indexes for two backup versions, while the traditional index grows with the number of versions retained.

FAST '21 Open Access Sponsored by NetApp

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

@inproceedings {264828,
author = {Xiangyu Zou and Jingsong Yuan and Philip Shilane and Wen Xia and Haijun Zhang and Xuan Wang},
title = {The Dilemma between Deduplication and Locality: Can Both be Achieved?},
booktitle = {19th USENIX Conference on File and Storage Technologies (FAST 21)},
year = {2021},
isbn = {978-1-939133-20-5},
pages = {171--185},
url = {},
publisher = {USENIX Association},
month = feb

Presentation Video