Metastable Failures in the Wild

Authors: 

Lexiang Huang, The Pennsylvania State University and Twitter; Matthew Magnusson and Abishek Bangalore Muralikrishna, University of New Hampshire; Salman Estyak, The Pennsylvania State University; Rebecca Isaacs, Twitter; Abutalib Aghayev and Timothy Zhu, The Pennsylvania State University; Aleksey Charapko, University of New Hampshire

Abstract: 

Recently, Bronson et al. introduced a framework for understanding a class of failures in distributed systems called metastable failures. The examples of metastable failures presented in that work are simplified versions of failures observed at Facebook. In this work, we study the prevalence of such failures in the wild by scouring publicly available incident reports from many organizations, ranging from hyperscalers to small companies.

Our main findings are threefold. First, metastable failures are universally observed—we present an in-depth study of 22 metastable failures from 11 different organizations. Second, metastable failures are a recurring pattern in many severe outages—e.g., at least 4 out of 15 major outages in the last decade at Amazon Web Services were caused by metastable failures. Third, we extend the model by Bronson et al. to better reflect the metastable failures seen in the wild by categorizing two types of triggers and two types of amplification mechanisms, which we confirm through developing multiple example applications that reproduce different types of metastable failures in a controlled environment. We believe our work will aid in a deeper understanding of metastable failures and in coming up with solutions to them.

OSDI '22 Open Access Sponsored by NetApp

BibTeX
@inproceedings {280934,
author = {Lexiang Huang and Matthew Magnusson and Abishek Bangalore Muralikrishna and Salman Estyak and Rebecca Isaacs and Abutalib Aghayev and Timothy Zhu and Aleksey Charapko},
title = {Metastable Failures in the Wild},
booktitle = {16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)},
year = {2022},
isbn = {978-1-939133-28-1},
address = {Carlsbad, CA},
pages = {73--90},
url = {https://www.usenix.org/conference/osdi22/presentation/huang-lexiang},
publisher = {USENIX Association},
month = jul
}
