USITS '03 Abstract
Why Do Internet Services Fail, and What Can Be Done About It?
David Oppenheimer, Archana Ganapathi, and David A. Patterson, University of California, Berkeley
In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems and the techniques Tandem used to prevent such failures See J. Gray. Why do computers stop and what can be done about it? Symposium on Reliability in Distributed Software and Database Systems, 1986... Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24x7-availability block. Using data from three large-scale Internet services, we analyzed the causes of their failures and the (potential) effectiveness of various techniques for preventing and mitigating service failure. We find that (1) operator error is the largest single cause of failures in two of the three services, (2) operator errors often take a long time to repair, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service. Qualitatively we find that improvement in the maintenance tools and systems used by service operations staff would decrease time to diagnose and repair problems.
- View the full text of this paper in HTML and PDF.
Until March 2004, you will need your USENIX membership identification in order to access the full papers. The Proceedings are published as a collective work, © 2003 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.
- If you need the latest Adobe Acrobat Reader, you can download it from Adobe's site.