Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems

Chang Lou, University of Virginia; Dimas Shidqi Parikesit, University of Virginia and Bandung Institute of Technology; Yujin Huang, The Pennsylvania State University; Zhewen Yang and Senapati Diwangkara, Johns Hopkins University; Yuzhuo Jing, University of Michigan; Achmad Imam Kistijantoro, Bandung Institute of Technology; Ding Yuan, University of Toronto; Suman Nath, Microsoft Research; Peng Huang, University of Michigan

Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without raising explicit errors. Such failures have serious consequences, yet they are extremely challenging to detect because writing good checkers requires deep domain knowledge and substantial manual effort.

In this paper, we explore a novel approach that directly derives semantic checkers from system test code. We first present a large-scale study of existing system test cases. Guided by the study findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C to four large, popular distributed systems and successfully derive tens to hundreds of checkers. These checkers detect 15 of the 20 real-world silent failures we reproduce and incur small runtime overhead.
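
To make the idea of turning a test into a runtime checker concrete, the following is a minimal hand-written sketch, not T2C's actual implementation or output. The abstract says T2C performs this transformation with static and dynamic analysis; here the generalization is done manually, and every name (`KVStore`, `CheckedStore`, `check_put_then_get`, the `drop_writes` fault switch) is a hypothetical stand-in invented for illustration.

```python
class KVStore:
    """Toy in-memory store standing in for a real system (hypothetical)."""

    def __init__(self):
        self._data = {}
        self.drop_writes = False  # assumption: switch that simulates a silent failure

    def put(self, key, value):
        # When the fault is active the write is lost, but no error is raised.
        if not self.drop_writes:
            self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def test_put_then_get():
    """An ordinary test: fixed inputs, run once before deployment."""
    store = KVStore()
    store.put("k1", "v1")
    assert store.get("k1") == "v1"


def check_put_then_get(store, key, value):
    """The same assertion with the hard-coded "k1"/"v1" replaced by parameters."""
    return store.get(key) == value


class CheckedStore:
    """Wraps a store and evaluates the generalized check on live operations."""

    def __init__(self, store):
        self.store = store
        self.violations = []

    def put(self, key, value):
        self.store.put(key, value)
        # Record a violation instead of raising, so the workload is not interrupted.
        if not check_put_then_get(self.store, key, value):
            self.violations.append(("put_then_get", key))


store = KVStore()
checked = CheckedStore(store)
checked.put("a", "1")        # healthy write: check passes
store.drop_writes = True     # inject the silent failure
checked.put("b", "2")        # write is lost without any error
print(checked.violations)
```

In this sketch the checker reports the lost write even though `put` never raises, which is the kind of silent semantic violation the abstract describes. A real checker would also have to decide when and how often to run and how to avoid interfering with production traffic; the sketch ignores those concerns.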

OSDI '25 Open Access sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {308712,
author = {Chang Lou and Dimas Shidqi Parikesit and Yujin Huang and Zhewen Yang and Senapati Diwangkara and Yuzhuo Jing and Achmad Imam Kistijantoro and Ding Yuan and Suman Nath and Peng Huang},
title = {Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems},
booktitle = {19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)},
year = {2025},
isbn = {978-1-939133-47-2},
address = {Boston, MA},
pages = {19--38},
url = {https://www.usenix.org/conference/osdi25/presentation/lou},
publisher = {USENIX Association},
month = jul
}
