Efficient Exposure of Partial Failure Bugs in Distributed Systems with Inferred Abstract States

Authors: 

Haoze Wu and Jia Pan, Johns Hopkins University; Peng Huang, University of Michigan

Abstract: 

Many distributed system failures, especially the notorious partial service failures, are caused by bugs that are triggered only by subtle faults occurring at rare timings. Existing testing is inefficient at exposing such bugs. This paper presents Legolas, a fault injection testing framework designed to address this gap. To precisely simulate subtle faults, Legolas statically analyzes the system code and instruments hooks within the system. To efficiently explore the numerous possible faults, Legolas introduces a novel notion of abstract states and automatically infers abstract states from code. During testing, Legolas uses an algorithm that leverages the inferred abstract states to make careful fault injection decisions. We applied Legolas to the latest releases of six popular, extensively tested distributed systems. Legolas found 20 new bugs that result in partial service failures.
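
To make the idea of state-guided fault injection concrete, here is a minimal sketch in Python. It is not Legolas's actual algorithm or API: the `BudgetedInjector` class, the string labels for abstract states and faults, and the per-state budget policy are all hypothetical, chosen only to illustrate how an injector might use abstract states to avoid spending every injection on the same system state.

```python
from collections import defaultdict


class BudgetedInjector:
    """Hypothetical injector: grants at most `budget` injections per abstract state."""

    def __init__(self, budget=1):
        self.budget = budget
        self.used = defaultdict(int)  # abstract state -> injections granted so far
        self.tried = set()            # (abstract state, fault id) pairs already injected

    def should_inject(self, abstract_state, fault_id):
        """Decide at an instrumented hook whether to inject `fault_id` now."""
        if (abstract_state, fault_id) in self.tried:
            return False              # this combination was already exercised
        if self.used[abstract_state] >= self.budget:
            return False              # budget for this state is spent
        self.used[abstract_state] += 1
        self.tried.add((abstract_state, fault_id))
        return True


def run_trial(injector, trace):
    """Replay (abstract_state, fault_id) hook hits and return the ones injected."""
    return [hit for hit in trace if injector.should_inject(*hit)]


# Hypothetical hook hits; the state and fault names are made up for illustration.
trace = [
    ("SyncThread.idle", "io_error@write"),
    ("SyncThread.idle", "io_error@write"),
    ("SyncThread.idle", "delay@flush"),
    ("SyncThread.flushing", "delay@flush"),
]
injected = run_trial(BudgetedInjector(budget=1), trace)
print(injected)
```

With a budget of one per state, the sketch injects once in each of the two states instead of three times in the first, which is the intuition behind grouping injection points by state rather than treating every hook hit as equally worth exploring.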

NSDI '24 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {295609,
author = {Haoze Wu and Jia Pan and Peng Huang},
title = {Efficient Exposure of Partial Failure Bugs in Distributed Systems with Inferred Abstract States},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {1267--1283},
url = {https://www.usenix.org/conference/nsdi24/presentation/wu-haoze},
publisher = {USENIX Association},
month = apr
}