DupHunter: Flexible High-Performance Deduplication for Docker Registries

Authors: 

Nannan Zhao, Hadeel Albahar, Subil Abraham, and Keren Chen, Virginia Tech; Vasily Tarasov, Dimitrios Skourtis, Lukas Rupprecht, and Ali Anwar, IBM Research—Almaden; Ali R. Butt, Virginia Tech

Abstract: 

Containers are increasingly used in a broad spectrum of applications, from cloud services to storage to supporting the emerging edge computing paradigm. This has led to an explosive proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of registries, which store and serve images. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the severe storage requirements of the growing registries. However, existing deduplication techniques largely degrade the performance of registries because of layer restore overhead. In this paper, we propose DupHunter, a new Docker registry architecture, which not only natively deduplicates layers for space savings but also reduces layer restore overhead. DupHunter supports several configurable deduplication modes, which provide different levels of storage efficiency, durability, and performance, to support a range of uses. To mitigate the negative impact of deduplication on image download times, DupHunter introduces a two-tier storage hierarchy with a novel layer prefetch/preconstruct cache algorithm based on user access patterns. Under real workloads, in the highest data reduction mode, DupHunter reduces storage space by up to 6.9x compared to the current implementations. In the highest performance mode, DupHunter can reduce the GET layer latency by up to 2.8x compared to the state-of-the-art.
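
To make the trade-off in the abstract concrete, the following is a minimal sketch of file-level layer deduplication and the restore step it forces on every download. It is an illustration under stated assumptions, not DupHunter's actual design: the `DedupStore` class, the `make_layer` helper, and the use of uncompressed in-memory tar archives as stand-ins for layers are all hypothetical.

```python
import hashlib
import io
import tarfile


def make_layer(files):
    """Build an in-memory tar archive (a stand-in for an image layer)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


class DedupStore:
    """Hypothetical file-level deduplicating layer store (assumption)."""

    def __init__(self):
        self.blobs = {}    # sha256 hex digest -> file content, stored once
        self.recipes = {}  # layer id -> list of (file name, digest)

    def put_layer(self, layer_id, layer_bytes):
        """Unpack a layer and keep only file contents not seen before."""
        recipe = []
        with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                data = tar.extractfile(member).read()
                digest = hashlib.sha256(data).hexdigest()
                self.blobs.setdefault(digest, data)
                recipe.append((member.name, digest))
        self.recipes[layer_id] = recipe

    def restore_layer(self, layer_id):
        """Rebuild the layer archive from its recipe; this is the restore cost."""
        files = {name: self.blobs[d] for name, d in self.recipes[layer_id]}
        return make_layer(files)

    def stored_bytes(self):
        """Total bytes of unique file content held by the store."""
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
store.put_layer("a", make_layer({"bin/sh": b"x" * 1000, "etc/a": b"a"}))
store.put_layer("b", make_layer({"bin/sh": b"x" * 1000, "etc/b": b"bb"}))
print(store.stored_bytes(), "bytes stored for", len(store.recipes), "layers")
```

The shared `bin/sh` content is stored only once, but each GET must reassemble an archive in `restore_layer`. That reassembly is the kind of overhead the abstract says a cache of prefetched or preconstructed layers is meant to hide.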

BibTeX
@inproceedings {254473,
author = {Nannan Zhao and Hadeel Albahar and Subil Abraham and Keren Chen and Vasily Tarasov and Dimitrios Skourtis and Lukas Rupprecht and Ali Anwar and Ali R. Butt},
title = {{DupHunter}: Flexible {High-Performance} Deduplication for Docker Registries},
booktitle = {2020 USENIX Annual Technical Conference (USENIX ATC 20)},
year = {2020},
isbn = {978-1-939133-14-4},
pages = {769--783},
url = {https://www.usenix.org/conference/atc20/presentation/zhao},
publisher = {USENIX Association},
month = jul
}
