SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Authors: 

Raúl Gracia-Tinedo, Universitat Rovira i Virgili; Danny Harnik, Dalit Naor, and Dmitry Sotnikov, IBM Research Haifa; Sivan Toledo and Aviad Zuck, Tel-Aviv University

Abstract: 

Storage system benchmarks either use samples of proprietary data or synthesize artificial data in simple ways (such as using zeros or random data). However, many storage systems behave completely differently on such artificial data than they do on real-world data. This is the case with systems that include data reduction techniques, such as compression and/or deduplication.

To address this problem, we propose a benchmarking methodology called mimicking and apply it in the domain of data compression. Our methodology is based on characterizing the properties of real data that influence the performance of compressors. Then, we use these characterizations to generate new synthetic data that mimics the real data in many aspects of compression. Unlike current solutions that only address the compression ratio of data, mimicking is flexible enough to also emulate compression times and data heterogeneity. We show that these properties matter to the system’s performance.
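
The characterize-then-generate idea can be illustrated with a deliberately simplified sketch. It is not SDGen's algorithm: the abstract does not describe the actual characterization, so this version records only a per-chunk zlib compression ratio and reproduces it by mixing random bytes with zeros, which is exactly the ratio-only approach the abstract says mimicking goes beyond. All function names here (`characterize`, `synthesize_chunk`, `generate`) are hypothetical, and the code assumes Python 3.9 or later.

```python
import random
import zlib

CHUNK_SIZE = 64 * 1024  # 64KB, the example chunk size mentioned in the abstract


def compression_ratio(chunk: bytes) -> float:
    """Compressed size divided by original size under zlib (lower = more compressible)."""
    return len(zlib.compress(chunk)) / len(chunk)


def characterize(data: bytes) -> list[tuple[int, float]]:
    """Record (length, zlib ratio) for every chunk of the input.

    Assumption: a single ratio per chunk stands in for a richer characterization.
    """
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [(len(chunk), compression_ratio(chunk)) for chunk in chunks]


def synthesize_chunk(length: int, target_ratio: float, rng: random.Random) -> bytes:
    """Build a chunk whose zlib ratio is close to target_ratio.

    Random bytes are essentially incompressible and zeros compress to almost
    nothing, so the fraction of random bytes approximates the resulting ratio.
    """
    n_random = min(length, max(0, round(length * target_ratio)))
    return rng.randbytes(n_random) + bytes(length - n_random)


def generate(profile: list[tuple[int, float]], seed: int = 0) -> bytes:
    """Generate synthetic data chunk by chunk from a characterization."""
    rng = random.Random(seed)
    return b"".join(synthesize_chunk(n, ratio, rng) for n, ratio in profile)
```

Because the characterization is kept per chunk, the synthetic output preserves how compressibility varies across the dataset (heterogeneity) rather than only its average. It makes no attempt to match compression times, which the abstract identifies as a property that also matters.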

In our implementation, called SDGen, characterizations take at most 2.5KB per data chunk (e.g., 64KB) and can be used to efficiently share benchmarking data in a highly anonymized fashion; sharing it carries few or no privacy concerns. We evaluated our data generator’s accuracy on compressibility and compression times using real-world datasets and multiple compressors (lz4, zlib, bzip2 and lzma). As a proof-of-concept, we integrated SDGen as a content generation layer in two popular benchmarks (LinkBench and Impressions).
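
As a rough illustration of the two quantities evaluated here, compressibility and compression time, the sketch below measures both for several compressors on the two artificial inputs the abstract mentions (zeros and random data). It is not the paper's evaluation harness: the helper names are hypothetical, it uses only the compressors available in Python's standard library (zlib, bz2, lzma), and lz4 is left out because it would require a third-party binding.

```python
import bz2
import lzma
import os
import time
import zlib

# Standard-library compressors only; lz4 is omitted (assumption noted above).
COMPRESSORS = {
    "zlib": zlib.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}


def sample_inputs(size: int = 64 * 1024) -> dict[str, bytes]:
    """The two simple artificial data types named in the abstract."""
    return {"zeros": bytes(size), "random": os.urandom(size)}


def profile_compressors(data: bytes) -> dict[str, tuple[float, float]]:
    """Return {compressor: (compression ratio, elapsed seconds)} for one buffer."""
    results = {}
    for name, compress in COMPRESSORS.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed) / len(data), elapsed)
    return results


if __name__ == "__main__":
    for label, data in sample_inputs().items():
        for name, (ratio, seconds) in profile_compressors(data).items():
            print(f"{label:>6} {name:>5}: ratio={ratio:.4f} time={seconds * 1e3:.2f} ms")
```

Running the same measurement on a real dataset and on its synthetic counterpart gives a simple way to compare the two on both ratio and time. Single-shot timings are noisy, so repeated runs would be needed for any serious comparison.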

BibTeX
@inproceedings {188460,
author = {Ra{\'u}l Gracia-Tinedo and Danny Harnik and Dalit Naor and Dmitry Sotnikov and Sivan Toledo and Aviad Zuck},
title = {{SDGen}: Mimicking Datasets for Content Generation in Storage Benchmarks},
booktitle = {13th USENIX Conference on File and Storage Technologies (FAST 15)},
year = {2015},
isbn = {978-1-931971-201},
address = {Santa Clara, CA},
pages = {317--330},
url = {https://www.usenix.org/conference/fast15/technical-sessions/presentation/gracia-tinedo},
publisher = {USENIX Association},
month = feb,
}