Skip to main content
Back to USENIX
  • Conferences
  • Students
Sign in
  • Home
  • Attend
    • Registration Information
  • Program
  • Participate
    • Instructions for Participants
    • Call for Participation
  • Sponsorship
  • About
    • Summit Organizers
    • Help Promote
    • Questions
    • Past Summits
  • Home
  • Attend
  • Program
  • Activities
  • Sponsorship
  • Participate
  • About

sponsors

Platinum Sponsor
Gold Sponsor
Gold Sponsor
Gold Sponsor
Gold Sponsor
Silver Sponsor
Silver Sponsor
Silver Sponsor
Silver Sponsor
Silver Sponsor
Bronze Sponsor
Bronze Sponsor
Bronze Sponsor
Bronze Sponsor
Bronze Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Industry Partner
Industry Partner

help promote

FAST '17 CFP

Get
Help Promote graphics!

USENIX Conference Policies

  • Event Code of Conduct
  • Conference Network Policy
  • Statement on Environmental Responsibility Policy

Estimating Unseen Deduplication—from Theory to Practice

Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov, IBM Research—Haifa

Estimating the deduplication ratio of a very large dataset is both extremely useful, but genuinely very hard to perform. In this work we present a new method for accurately estimating deduplication benefits that runs 3X to 15X faster than the state of the art to date. The level of improvement depends on the data itself and on the storage media that it resides on. The technique is based on breakthrough theoretical work by Valiant and Valiant from 2011, that give a provably accurate method for estimating various measures while seeing only a fraction of the data. However, for the use case of deduplication estimation, putting this theory into practice runs into significant obstacles. In this work, we find solutions and novel techniques to enable the use of this new and exciting approach. Our contributions include a novel approach for gauging the estimation accuracy, techniques to run it with low memory consumption, a method to evaluate the combined compression and deduplication ratio, and ways to perform the actual sampling in real storage systems in order to actually reap benefits from these algorithms. We evaluated our work on a number of real world datasets.

Danny Harnik, IBM Research—Haifa

Ety Khaitzin, IBM Research—Haifa

Dmitry Sotnikov, IBM Research—Haifa

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {194446,
author = {Danny Harnik and Ety Khaitzin and Dmitry Sotnikov},
title = {Estimating Unseen {Deduplication{\textemdash}from} Theory to Practice},
booktitle = {14th USENIX Conference on File and Storage Technologies (FAST 16)},
year = {2016},
isbn = {978-1-931971-28-7},
address = {Santa Clara, CA},
pages = {277--290},
url = {https://www.usenix.org/conference/fast16/technical-sessions/presentation/harnik},
publisher = {USENIX Association},
month = feb
}
Download
Harnik PDF
View the slides

Presentation Audio

MP3 Download

Download Audio

  • Log in or register to post comments

Gold Sponsors

Silver Sponsors

Bronze Sponsors

Media Sponsors & Industry Partners

Open Access Publishing Partners

© USENIX
EIN 13-3055038

  • Privacy Policy
  • Contact Us