NSDI '06 Abstract
Pp. 225–238 of the Proceedings
Subtleties in Tolerating Correlated Failures in Wide-area Storage Systems
Suman Nath, Microsoft Research; Haifeng Yu and Phillip B. Gibbons, Intel Research Pittsburgh; Srinivasan Seshan, Carnegie Mellon University
High availability is widely accepted as an explicit requirement for
distributed storage systems. Tolerating correlated failures is a key
issue in achieving high availability in today's wide-area
environments. This paper systematically revisits previously proposed
techniques for addressing correlated failures. Using several
real-world failure traces, we qualitatively answer four important
questions regarding how to design systems to tolerate such failures.
Based on our results, we identify a set of design principles that
system builders can use to tolerate correlated failures. We show how
these lessons can be effectively used by incorporating them into
IrisStore, a distributed read-write storage layer that provides high
availability. Our results using IrisStore on PlanetLab over an 8-month
period demonstrate its ability to withstand large correlated
failures and meet preconfigured availability targets.
Until May 2007, you will need your USENIX membership identification in order to access the full papers. The Proceedings are published as a collective work, © 2006 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.