Skip to main content
USENIX
  • Conferences
  • Students
Sign in
  • OSDI '14 Home
  • Symposium Organizers
  • At a Glance
  • Registration Information
    • Registration Discounts
    • Venue, Hotel, and Travel
  • Technical Sessions
  • Co-Located Workshops
  • Activities
    • Birds-of-a-Feather Sessions
    • Poster Sessions
  • Sponsorship
  • Students and Grants
  • Co-located Workshops
  • Questions?
  • Help Promote!
  • For Participants
  • Call for Papers
  • Past Symposia

sponsors

Diamond Sponsor
Diamond Sponsor
Gold Sponsor
Gold Sponsor
Gold Sponsor
Silver Sponsor
Silver Sponsor
Silver Sponsor
Silver Sponsor
Bronze Sponsor
Bronze Sponsor
Bronze Sponsor
General Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Media Sponsor
Industry Partner
Industry Partner

twitter

Tweets by @usenix

usenix conference policies

  • Event Code of Conduct
  • Conference Network Policy
  • Statement on Environmental Responsibility Policy

You are here

Home » Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Tweet

connect with us

http://twitter.com/usenix
https://www.facebook.com/usenixassociation
http://www.linkedin.com/groups/USENIX-Association-49559/about
https://plus.google.com/108588319090208187909/posts
http://www.youtube.com/user/USENIXAssociation

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Thursday, August 7, 2014 - 12:45pm
Authors: 

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, University of Toronto

Abstract: 

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. RunningAspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

Ding Yuan, University of Toronto

Yu Luo, University of Toronto

Xin Zhuang, University of Toronto

Guilherme Renna Rodrigues, University of Toronto

Xu Zhao, University of Toronto

Yongle Zhang, University of Toronto

Pranay U. Jain, University of Toronto

Michael Stumm, University of Toronto

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

Yuan PDF
View the slides

Presentation Video 

Presentation Audio

MP3 Download

Download Audio

  • Log in or    Register to post comments

Diamond Sponsors

Gold Sponsors

Silver Sponsors

Bronze Sponsors

General Sponsors

Media Sponsors & Industry Partners

© USENIX

  • Privacy Policy
  • Contact Us