Understanding and Detecting Fail-Slow Hardware Failure Bugs in Cloud Systems

Gen Dong and Yu Hua, Huazhong University of Science and Technology; Yongle Zhang, Purdue University; Zhangyu Chen and Menglei Chen, Huazhong University of Science and Technology

Fail-slow hardware is still running and functional, but operates in a degraded mode and is therefore slower than its expected performance. Bugs triggered by fail-slow hardware cause severe cloud system failures. Existing testing tools fail to detect these bugs efficiently because they overlook the bugs' characteristics. To address this problem, this paper presents a bug study that analyzes 48 real-world fail-slow hardware failures from typical cloud systems. We observe that (1) fail-slow hardware makes high-level software components vulnerable, including synchronized and timeout mechanisms; and (2) fine-grained fail-slow hardware faults are necessary to trigger these bugs. Based on these two observations, we propose Sieve, a fault injection testing framework for detecting fail-slow hardware failure bugs. Sieve statically analyzes the target system's code to identify synchronized and timeout-protected I/O operations as candidate fault points, and instruments hooks before these points to enable fail-slow hardware injection. To explore candidate fault points efficiently, Sieve adopts grouping and context-sensitive injection strategies. We have applied Sieve to three widely deployed cloud systems: ZooKeeper, Kafka, and HDFS. Sieve has detected six previously unknown bugs, two of which have been confirmed.
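
To make the idea of a hook placed before a timeout-protected I/O operation concrete, the following is a minimal sketch in Python. It is not Sieve's implementation, which the abstract does not detail. The class `SlowFaultInjector`, the fault point name `write_snapshot`, and the delay and timeout values are all hypothetical, and the "I/O" is a stand-in list append.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class SlowFaultInjector:
    """Hypothetical hook registry that delays execution at enabled fault points."""

    def __init__(self):
        self.delays = {}  # fault point id -> injected delay in seconds

    def enable(self, point_id, delay_s):
        self.delays[point_id] = delay_s

    def hook(self, point_id):
        # Called immediately before a candidate I/O operation.
        delay = self.delays.get(point_id)
        if delay:
            time.sleep(delay)  # emulate a device that works, but slowly


injector = SlowFaultInjector()


def write_snapshot(buf):
    injector.hook("write_snapshot")  # instrumented hook before the I/O
    buf.append("snapshot")           # stand-in for a real disk write
    return True


def timeout_protected_write(buf, timeout_s):
    """Run the write under a timeout, as timeout-protected code would."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(write_snapshot, buf)
        try:
            return "ok" if future.result(timeout=timeout_s) else "failed"
        except FutureTimeout:
            return "timeout"
```

With no fault enabled, `timeout_protected_write([], 0.5)` completes normally. After `injector.enable("write_snapshot", 0.3)`, a call with a 0.05-second timeout takes the timeout path even though the write itself eventually succeeds, which is the kind of behavior a fail-slow fault is meant to expose.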

USENIX ATC '25 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {308544,
author = {Gen Dong and Yu Hua and Yongle Zhang and Zhangyu Chen and Menglei Chen},
title = {Understanding and Detecting {Fail-Slow} Hardware Failure Bugs in Cloud Systems},
booktitle = {2025 USENIX Annual Technical Conference (USENIX ATC 25)},
year = {2025},
isbn = {978-1-939133-48-9},
address = {Boston, MA},
pages = {1127--1142},
url = {https://www.usenix.org/conference/atc25/presentation/dong},
publisher = {USENIX Association},
month = jul
}
