• Donate
  • Log In
Home
  • About
    • About
      • About Us
      • Our Board of Directors
      • Board Meeting Minutes
      • Board Elections
      • Updates & Announcements
      • Our Staff
      • Governance & Financials
      • Lifetime Achievement Award
  • Events
    • Events
      • Upcoming
      • Past
      • Conference FAQ
      • Conference Policies
      • Code of Conduct
      • Calls for Papers
      • Author Resources
      • Grant Opportunities
      • Best Papers
      • Test of Time Awards
  • Join & Support
    • Join & Support
      • Become a Member
      • Ways to Give
      • Our Supporters
      • Student Opportunities
      • Sponsorship Opportunities
  • Archive
    • Archive
      • Proceedings
      • Multimedia
      • ;login: Archive
      • Short Topics in System Administration Series
      • Journal of Education in System Administration (JESA)
      • Journal of Election Technology and Systems (JETS)
      • Computing Systems Journal
  • Search

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Author(s): 

Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm

Large, production-quality distributed systems still fail periodically, sometimes catastrophically where most or all users experience an outage or data loss. Conventional wisdom has it that these failures can only manifest themselves on large production clusters and are extremely difficult to prevent a priori, because these systems are designed to be fault tolerant and are well-tested. By investigating 198 user-reported failures that occurred on production-quality distributed systems, we found that almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors, and, surprisingly, many of them are caused by trivial mistakes such as error handlers that are empty or that contain expressions like "FIXME" or "TODO" in the comments. We therefore developed a simple static checker, Aspirator, capable of locating trivial bugs in error handlers; it found 143 new bugs and bad practices that have been fixed or confirmed by the developers.

Download Article: 
PDF icon Simple Testing Can Prevent Most Critical Failures
Article Section: 
PROGRAMMING
;login: issue: 
February 2015, Vol. 40, No. 1
USENIX logo
  • Contact USENIX
  • Privacy Policy

© USENIX 2025
EIN 13-3055038

Website designed and built by Giant Rabbit LLC
Powered by Backdrop CMS

We need contributions from individuals like you.

USENIX conferences directly influence the development of computing systems and products used worldwide. Contribute today to support this vital work for the next 50 years.

Secure the Future of USENIX

Donate
Close