TiDedup: A New Distributed Deduplication Architecture for Ceph

Authors: 

Myoungwon Oh and Sungmin Lee, Samsung Electronics Co.; Samuel Just, IBM; Young Jin Yu and Duck-Ho Bae, Samsung Electronics Co.; Sage Weil, Ceph Foundation; Sangyeun Cho, Samsung Electronics Co.; Heon Y. Yeom, Seoul National University

Abstract: 

This paper presents TiDedup, a new cluster-level deduplication architecture for Ceph, a widely deployed distributed storage system. Ceph previously introduced a cluster-level deduplication design, but several shortcomings have made it hard to use in production: (1) deduplicating unique data incurs excessive metadata consumption; (2) its serialized tiering mechanism has detrimental effects on foreground I/O and, by design, supports only fixed-size chunking; and (3) the existing reference count mechanism resorts to an inefficient full scan of entire objects and does not work with Ceph’s snapshots. TiDedup overcomes these shortcomings by introducing three novel schemes: selective cluster-level crawling, an event-driven tiering mechanism with content-defined chunking, and a reference correction method using a shared reference back pointer. We have fully validated TiDedup and integrated it into the Ceph mainline, ready for evaluation and deployment in various experimental and production environments. Our evaluation results show that TiDedup achieves up to 34% data reduction on real-world workloads and, compared with the existing deduplication design, improves foreground I/O throughput by 50% during deduplication and reduces the scan time for reference correction by more than 50%.
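To illustrate why the abstract contrasts fixed-size chunking with content-defined chunking, the following is a minimal sketch, not TiDedup's implementation. It uses a simple gear-style rolling hash; the parameters (MIN_SIZE, AVG_MASK, MAX_SIZE), the gear table, and all function names are assumptions made up for this example. The idea it shows: when one byte is inserted at the front of a stream, fixed-size chunk boundaries all shift and nothing deduplicates, while content-defined boundaries tend to realign so that most chunks are found again.

```python
import hashlib
import random

# Hypothetical parameters, chosen only for illustration.
MIN_SIZE = 64             # never cut a chunk shorter than this
AVG_MASK = (1 << 8) - 1   # cut when the low 8 hash bits are zero (~256-byte spacing)
MAX_SIZE = 1024           # force a cut if no boundary is found by this length

# Gear table: one pseudo-random 64-bit value per byte value (fixed seed).
_gear_rng = random.Random(42)
GEAR = [_gear_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data):
    """Split data at boundaries chosen by a rolling hash over the content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & AVG_MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fixed_chunks(data, size=256):
    """Split data at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    """Identify each chunk by a SHA-256 fingerprint."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


def shared_chunks(chunker, a, b):
    """Count distinct chunks that appear in both a and b under a chunker."""
    return len(fingerprints(chunker(a)) & fingerprints(chunker(b)))


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.getrandbits(8) for _ in range(20000))
    shifted = b"X" + data  # same content, shifted by one inserted byte
    print("fixed-size shared chunks:", shared_chunks(fixed_chunks, data, shifted))
    print("content-defined shared chunks:", shared_chunks(cdc_chunks, data, shifted))
```

In a deduplicating store, chunks with matching fingerprints would be stored once and referenced many times, so a higher shared-chunk count translates into more data reduction for shifted or edited content.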

USENIX ATC '23 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)


BibTeX
@inproceedings {288727,
author = {Myoungwon Oh and Sungmin Lee and Samuel Just and Young Jin Yu and Duck-Ho Bae and Sage Weil and Sangyeun Cho and Heon Y. Yeom},
title = {{TiDedup}: A New Distributed Deduplication Architecture for Ceph},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
isbn = {978-1-939133-35-9},
address = {Boston, MA},
pages = {117--131},
url = {https://www.usenix.org/conference/atc23/presentation/oh},
publisher = {USENIX Association},
month = jul
}
