Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Authors: 

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; H. Birali Runesha, Mingzhe Hao, and Huaicheng Li, University of Chicago

Abstract: 

Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 12 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

BibTeX
@inproceedings {210508,
author = {Haryadi S. Gunawi and Riza O. Suminto and Russell Sears and Casey Golliher and Swaminathan Sundararaman and Xing Lin and Tim Emami and Weiguang Sheng and Nematollah Bidokhti and Caitie McCaffrey and Gary Grider and Parks M. Fields and Kevin Harms and Robert B. Ross and Andree Jacobson and Robert Ricci and Kirk Webb and Peter Alvaro and H. Birali Runesha and Mingzhe Hao and Huaicheng Li},
title = {{Fail-Slow} at Scale: Evidence of Hardware Performance Faults in Large Production Systems},
booktitle = {16th USENIX Conference on File and Storage Technologies (FAST 18)},
year = {2018},
isbn = {978-1-931971-42-3},
address = {Oakland, CA},
pages = {1--14},
url = {https://www.usenix.org/conference/fast18/presentation/gunawi},
publisher = {USENIX Association},
month = feb
}