Biswaranjan Panda and Deepthi Srinivasan, Nutanix Inc.; Huan Ke, University of Chicago; Karan Gupta and Vinayak Khot, Nutanix Inc.; Haryadi S. Gunawi, University of Chicago
We address the problem of “fail-slow” faults, in which a hardware or software component still functions (does not fail-stop) but at much lower performance than expected. To address this, we built IASO, a peer-based, non-intrusive fail-slow detection framework that has been deployed for more than 1.5 years across 39,000 nodes at our customer sites and has helped our customers reduce major outages caused by fail-slow incidents. IASO works primarily from timeout signals, which add negligible monitoring overhead, and converts them into a stable and accurate fail-slow metric. IASO can isolate a slow node quickly and accurately, within minutes. Within a 7-month period, IASO caught 232 fail-slow incidents in our large field deployment. In this paper, we also assemble a large dataset of 232 fail-slow incidents along with our analysis. We found that the fail-slow annual failure rate in our field is 1.02%.
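As a rough sanity check on the figures in the abstract, the reported 1.02% annual failure rate can be reproduced by scaling 232 incidents over 7 months linearly to a full year and dividing by the 39,000 deployed nodes. The sketch below assumes this simple annualization and assumes the 39,000-node deployment is the relevant population; the abstract does not state how the authors actually computed the rate, and the function name is hypothetical.

```python
def annual_failure_rate(incidents: int, months: float, nodes: int) -> float:
    """Return annualized incidents per node as a fraction.

    Assumption (not stated in the abstract): incidents scale linearly
    with observation time, and every node is observed for the full period.
    """
    if months <= 0 or nodes <= 0:
        raise ValueError("months and nodes must be positive")
    incidents_per_year = incidents * 12.0 / months
    return incidents_per_year / nodes


# Figures taken from the abstract: 232 incidents, 7 months, 39,000 nodes.
afr = annual_failure_rate(incidents=232, months=7, nodes=39_000)
print(f"{afr:.2%}")  # prints 1.02%
```

Under these assumptions the result agrees with the rate quoted in the abstract, which suggests the rate is counted per node rather than per component.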
@inproceedings{panda2019iaso,
author = {Biswaranjan Panda and Deepthi Srinivasan and Huan Ke and Karan Gupta and Vinayak Khot and Haryadi S. Gunawi},
title = {{IASO}: A {Fail-Slow} Detection and Mitigation Framework for Distributed Storage Services},
booktitle = {2019 USENIX Annual Technical Conference (USENIX ATC 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {47--62},
url = {https://www.usenix.org/conference/atc19/presentation/panda},
publisher = {USENIX Association},
month = jul
}