Check out the new USENIX Web site.
USENIX, The Advanced Computing Systems Association

NSDI '08 Abstract

Pp. 423437 of the Proceedings

D3S: Debugging Deployed Distributed Systems

Xuezheng Liu and Zhenyu Guo, Microsoft Research Asia; Xi Wang, Tsinghua University; Feibo Chen, Fudan University; Xiaochen Lian, Shanghai Jiaotong University; Jian Tang and Ming Wu, Microsoft Research Asia; M. Frans Kaashoek, MIT CSAIL; Zheng Zhang, Microsoft Research Asia

Abstract

Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D3S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and that checks these predicates while the system is running. When D3S finds a problem it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause.

Developers write predicates in a simple and sequential programming style, while D3S checks these predicates in a distributed and parallel manner to allow checking to be scalable to large systems and fault tolerant. By using binary instrumentation, D3S works transparently with legacy systems and can change predicates to be checked at runtime. An evaluation with 5 deployed systems shows that D3S can detect non-trivial correctness and performance bugs at runtime and with low performance overhead (less than 8%).

  • View the full text of this paper in HTML and PDF. Listen to the presentation in MP3 format.
    The Proceedings are published as a collective work, 2008 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.
To become a USENIX member, please see our Membership Information.

Last changed: 11 Aug 2008 mn