The Logic of Physical Garbage Collection in Deduplicating Storage

Authors: 

Fred Douglis, Abhinav Duggal, Philip Shilane, and Tony Wong, Dell EMC; Shiqin Yan, Dell EMC and University of Chicago; Fabiano Botelho, Rubrik, Inc.

Abstract: 

Most storage systems that write in a log-structured manner need a mechanism for garbage collection (GC), reclaiming and consolidating space by identifying unused areas on disk. In a deduplicating storage system, GC is complicated by the possibility of numerous references to the same underlying data. We describe two variants of garbage collection in a commercial deduplicating storage system, a logical GC that operates on the files containing deduplicated data and a physical GC that performs sequential I/O on the underlying data. The need for the second approach arises from a shift in the underlying workloads, in which exceptionally high duplication ratios or the existence of millions of individual small files result in unacceptably slow GC using the file-level approach. Under such workloads, determining the liveness of chunks becomes a slow phase of logical GC. We find that physical GC decreases the execution time of this phase by up to two orders of magnitude in the case of extreme workloads and improves it by approximately 10–60% in the common case, but only after additional optimizations to compensate for its higher initialization overheads.
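As a rough illustration of why file-level liveness enumeration slows down when many files share the same data, the toy sketch below contrasts a naive per-file walk with a level-by-level scan in which each chunk is processed once. It is not the authors' implementation: the in-memory `store` layout, the `level` field, and the names `logical_mark` and `physical_mark` are all assumptions made for this example, and the visit counts are only a stand-in for I/O cost.

```python
from collections import defaultdict

# Hypothetical toy layout (an assumption, not the paper's on-disk format):
# fingerprint -> (level, child fingerprints). Level 0 chunks hold data;
# higher levels hold references to chunks at strictly lower levels.

def logical_mark(roots, store):
    """Walk each file's tree depth-first; shared subtrees are revisited per file."""
    live, visits = set(), 0
    for root in roots:
        stack = [root]
        while stack:
            fp = stack.pop()
            visits += 1
            live.add(fp)
            stack.extend(store[fp][1])
    return live, visits

def physical_mark(roots, store):
    """Scan chunks level by level, top down; each live chunk is processed once."""
    by_level = defaultdict(list)
    for fp, (level, _children) in store.items():
        by_level[level].append(fp)
    live, visits = set(roots), 0
    for level in sorted(by_level, reverse=True):
        for fp in by_level[level]:  # stands in for a sequential scan of storage
            if fp in live:
                visits += 1
                live.update(store[fp][1])
    return live, visits

# Three files that all reference the same metadata chunk, plus one dead chunk.
store = {
    "d1": (0, []), "d2": (0, []), "dead": (0, []),
    "m1": (1, ["d1", "d2"]),
    "f1": (2, ["m1"]), "f2": (2, ["m1"]), "f3": (2, ["m1"]),
}
roots = ["f1", "f2", "f3"]

logical_live, logical_visits = logical_mark(roots, store)
physical_live, physical_visits = physical_mark(roots, store)
```

Both functions arrive at the same live set, but in this toy model the per-file walk's visit count grows with the number of files that reference a shared subtree, while the level-by-level scan's count grows with the number of distinct live chunks.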

BibTeX
@inproceedings {202322,
author = {Fred Douglis and Abhinav Duggal and Philip Shilane and Tony Wong and Shiqin Yan and Fabiano Botelho},
title = {The Logic of Physical Garbage Collection in Deduplicating Storage},
booktitle = {15th USENIX Conference on File and Storage Technologies (FAST 17)},
year = {2017},
isbn = {978-1-931971-36-2},
address = {Santa Clara, CA},
pages = {29--44},
url = {https://www.usenix.org/conference/fast17/technical-sessions/presentation/douglis},
publisher = {USENIX Association},
month = feb
}
