Five Pitfalls for Benchmarking Big Data Systems
Yanpei Chen and Gwen Shapira, Cloudera, Inc.
Performance is an increasingly important attribute of Big Data systems as the focus shifts from batch processing to real-time analysis and to consolidated multi-tenant systems. One of the least-understood challenges in scaling data systems is properly defining and measuring performance. The complexity, diversity, and scale of Big Data systems make this a difficult task, and we frequently encounter haphazard benchmarks that lead to bad technology choices, poor purchasing decisions, and suboptimal cluster operations. This talk draws on performance engineering and field services experience at a leading Big Data vendor. We will discuss the most common performance benchmarking pitfalls and share practical advice on how to avoid them through rigorous metrics and measurement methods.
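To make that kind of rigor concrete, below is a minimal timing-harness sketch (illustrative only, not taken from the talk). It adds warm-up runs, repeats the measured trials, and reports the median and spread rather than a single number, which is one common way haphazard benchmarks go wrong. The run_query function is a hypothetical stand-in for whatever workload is actually under test, such as a query submitted through a command-line shell.

# Minimal sketch of a rigorous benchmark harness (illustrative, not from the talk).
import statistics
import subprocess
import time

def run_query():
    # Hypothetical stand-in for the workload under test; replace with a
    # real job submission, e.g. a query run through a CLI client.
    subprocess.run(["sleep", "1"], check=True)

def benchmark(workload, warmups=2, trials=5):
    # Warm-up runs let caches, JIT compilation, and daemons reach steady
    # state, so the measured trials reflect typical rather than first-run behavior.
    for _ in range(warmups):
        workload()
    durations = []
    for _ in range(trials):
        start = time.monotonic()
        workload()
        durations.append(time.monotonic() - start)
    # Report the median and spread rather than one run; a single measurement
    # hides run-to-run variance, a classic benchmarking pitfall.
    return {
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "stdev_s": statistics.stdev(durations),
    }

if __name__ == "__main__":
    print(benchmark(run_query))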
Yanpei Chen, Cloudera Inc.

Yanpei Chen is a member of the Performance Engineering Team at Cloudera, where he works on internal and competitive performance measurement and optimization. His work touches on multiple interconnected computation frameworks, including Cloudera Search, Cloudera Impala, Apache Hadoop, Apache HBase, and Apache Hive. He is the lead author of the Statistical Workload Injector for MapReduce (SWIM), an open source tool for synthesizing and replaying production MapReduce workloads. SWIM has become a standard MapReduce performance measurement tool and is used to certify many Cloudera partners. He received his doctorate from UC Berkeley, where he worked in the AMP Lab on performance-driven, large-scale system design and evaluation.
Gwen Shapira, Cloudera Inc.

Gwen Shapira is a Solutions Architect at Cloudera. She has 15 years of experience working with customers to design scalable data architectures, having worked as a data warehouse DBA, ETL developer, and senior consultant. She specializes in migrating data warehouses to Hadoop, integrating Hadoop with relational databases, building scalable data processing pipelines, and scaling complex data analysis algorithms.

@inproceedings{chen-lisa14,
author = {Yanpei Chen and Gwen Shapira},
title = {Five Pitfalls for Benchmarking Big Data Systems},
booktitle = {LISA14},
year = {2014},
address = {Seattle, WA},
publisher = {USENIX Association},
month = nov
}