Passive Realtime Datacenter Fault Detection and Localization

Authors: 

Arjun Roy, University of California, San Diego; Hongyi Zeng and Jasmeet Bagga, Facebook; Alex C. Snoeren, University of California, San Diego

Abstract: 

Datacenters are characterized by their large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with small but non-zero failure rates mean that datacenters are subject to significant numbers of failures, impacting the performance of the services that rely on them. To make matters worse, these failures are not always obvious; network switches and links can fail partially, dropping or delaying various subsets of packets without necessarily delivering a clear signal that they are faulty. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors.

We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate transport-layer flow metrics and network-I/O system call delay at end hosts with the path that traffic takes through the datacenter and apply statistical analysis techniques to identify outliers and localize the faulty link and/or switch(es). We evaluate our approach in a production Facebook front-end datacenter.

NSDI '17 Open Access Videos Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

Presentation Video

Download Video

Presentation Audio

BibTeX
@inproceedings {201561,
author = {Arjun Roy and Hongyi Zeng and Jasmeet Bagga and Alex C. Snoeren},
title = {Passive Realtime Datacenter Fault Detection and Localization},
booktitle = {14th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 17)},
year = {2017},
isbn = {978-1-931971-37-9},
address = {Boston, MA},
pages = {595--612},
url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/roy},
publisher = {{USENIX} Association},
}