Estimating Unseen Deduplication—from Theory to Practice

Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov, IBM Research—Haifa

Estimating the deduplication ratio of a very large dataset is both extremely useful and genuinely very hard to perform. In this work we present a new method for accurately estimating deduplication benefits that runs 3X to 15X faster than the state of the art to date. The level of improvement depends on the data itself and on the storage media that it resides on. The technique is based on breakthrough theoretical work by Valiant and Valiant from 2011, which gives a provably accurate method for estimating various measures while seeing only a fraction of the data. However, for the use case of deduplication estimation, putting this theory into practice runs into significant obstacles. In this work, we find solutions and novel techniques to enable the use of this new and exciting approach. Our contributions include a novel approach for gauging the estimation accuracy, techniques to run it with low memory consumption, a method to evaluate the combined compression and deduplication ratio, and ways to perform the actual sampling in real storage systems in order to reap the benefits of these algorithms. We evaluated our work on a number of real-world datasets.
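To make the estimation problem concrete, the following is a minimal illustrative sketch and not the authors' estimator or the Valiant and Valiant algorithm. It assumes the deduplication ratio is defined as distinct chunks divided by total chunks, builds a hypothetical toy dataset, and compares the exact ratio with the naive ratio computed on a small random sample. All names (`chunk_hashes`, `dedup_ratio`, `frequency_histogram`) and the toy data are assumptions made for illustration.

```python
import hashlib
import random
from collections import Counter


def chunk_hashes(chunks):
    """Hash each chunk so that identical chunks get identical fingerprints."""
    return [hashlib.sha1(c).hexdigest() for c in chunks]


def dedup_ratio(hashes):
    """Assumed definition: distinct chunks / total chunks (lower = more duplication)."""
    return len(set(hashes)) / len(hashes)


def frequency_histogram(hashes):
    """Map multiplicity j -> number of distinct hashes that occur exactly j times."""
    return Counter(Counter(hashes).values())


# Hypothetical toy dataset: 100 distinct chunks, each stored 10 times.
chunks = [f"chunk-{i % 100}".encode() for i in range(1000)]
hashes = chunk_hashes(chunks)
exact = dedup_ratio(hashes)          # computed over the full dataset

# Naive approach: look at only 5% of the chunks and reuse the same formula.
rng = random.Random(0)
sample = rng.sample(hashes, 50)
naive = dedup_ratio(sample)          # duplicates are under-represented in a sample
hist = frequency_histogram(sample)   # multiplicities observed in the sample only

print(f"exact ratio: {exact:.3f}, naive sample ratio: {naive:.3f}")
print("sample histogram:", dict(hist))
```

In this toy setup the sample sees far fewer repeats than the full dataset contains, so the naive ratio understates the deduplication benefit. Closing that gap from the sample alone is the problem that a sample-based estimator has to solve.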

BibTeX

@inproceedings {194446,
author = {Danny Harnik and Ety Khaitzin and Dmitry Sotnikov},
title = {Estimating Unseen {Deduplication{\textemdash}from} Theory to Practice},
booktitle = {14th USENIX Conference on File and Storage Technologies (FAST 16)},
year = {2016},
isbn = {978-1-931971-28-7},
address = {Santa Clara, CA},
pages = {277--290},
url = {https://www.usenix.org/conference/fast16/technical-sessions/presentation/harnik},
publisher = {USENIX Association},
month = feb
}
