Towards Pre-Deployment Detection of Performance Failures in Cloud Distributed Systems

Riza O. Suminto; Agung Laksono; Anang D. Satria; Thanh Do; Haryadi S. Gunawi


Riza O. Suminto, University of Chicago; Agung Laksono and Anang D. Satria, Surya University; Thanh Do, Microsoft Gray Systems Lab; Haryadi S. Gunawi, University of Chicago

Modern distributed systems ("cloud systems") have emerged as a dominant backbone for many of today's applications. They come in different forms, such as scale-out file systems, key-value stores, computing frameworks, and synchronization and cluster management services. As these systems collectively become the "cloud operating system", users expect high dependability, including performance stability. Unfortunately, the complexity of the software and of the environment in which it must run has outpaced existing testing and debugging tools. Cloud systems must run at scale with different topologies, execute complex distributed protocols, face load fluctuations and a wide range of hardware faults, and serve users with diverse job characteristics.

One important type of failure is the performance failure, a situation where a system (e.g., Hadoop) does not deliver the expected performance (e.g., a job takes 10x longer than usual). Conversations with cloud engineers suggest that performance stability is often more important than performance optimization; when performance failures happen, users are frustrated, systems waste and underutilize resources, and long debugging efforts are required to find and fix the problems. Sadly, performance failures are still common; our previous work shows that 22% of vital issues reported by cloud system developers relate to performance bugs.

In this paper, we focus on three questions: What is the root-cause anatomy of performance bugs that appear in cloud systems? What is missing from the state of the art in detecting performance bugs? What new directions can prevent performance failures from happening in the field?


BibTeX

@inproceedings {190601,
author = {Riza O. Suminto and Agung Laksono and Anang D. Satria and Thanh Do and Haryadi S. Gunawi},
title = {Towards {Pre-Deployment} Detection of Performance Failures in Cloud Distributed Systems},
booktitle = {7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15)},
year = {2015},
address = {Santa Clara, CA},
url = {https://www.usenix.org/conference/hotcloud15/workshop-program/presentation/suminto},
publisher = {USENIX Association},
month = jul
}
