Debugging Transient Faults in Data Centers using Synchronized Network-wide Packet Histories

Authors: 

Pravein Govindan Kannan, IBM Research - India; Nishant Budhdev, Raj Joshi, and Mun Choon Chan, National University of Singapore

Abstract: 

Faults in data center networks are hard to debug because of the networks' scale and complexity. Hard-to-reproduce transient faults are prevalent, and their root-cause analysis is extremely difficult because historical data is unavailable and the data distributed across the network cannot be correlated. Often, it is not possible to find the root cause using only switch-local information. To find the root cause of such transient faults, we need: 1) Visibility: fine-grained, packet-level, and network-wide observability; 2) Retrospection: the ability to retrieve historical information from before the fault occurred; and 3) Correlation: the ability to correlate that information across the network. In this work, we present the design and implementation of SyNDB, a tool with these three capabilities that enables root-cause analysis of network faults. We implement and evaluate SyNDB on realistic topologies using large-scale simulation and programmable switches. Our evaluations show that SyNDB can capture and correlate packet records over sufficiently large time windows (∼4 ms), thus facilitating the root-cause analysis of various network faults.
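The three requirements named in the abstract can be illustrated with a small sketch. This is not SyNDB's implementation, which the abstract does not describe in detail. It is a minimal Python model under stated assumptions: each switch keeps a fixed-size ring buffer of packet records (retrospection), every record carries a timestamp from a clock assumed to be synchronized across switches (visibility), and the per-switch buffers are merged into one network-wide timeline (correlation). The names `PacketRecord`, `SwitchHistory`, and `correlate` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class PacketRecord:
    """One per-packet observation at one switch (hypothetical record layout)."""
    timestamp_ns: int  # assumed to come from a network-wide synchronized clock
    switch_id: str
    packet_id: int


class SwitchHistory:
    """Fixed-size history of recent packet records at a single switch."""

    def __init__(self, switch_id: str, capacity: int) -> None:
        self.switch_id = switch_id
        # A deque with maxlen silently drops the oldest record when full.
        self._ring = deque(maxlen=capacity)

    def record(self, timestamp_ns: int, packet_id: int) -> None:
        self._ring.append(PacketRecord(timestamp_ns, self.switch_id, packet_id))

    def snapshot(self) -> list:
        """Return the retained history, oldest first (e.g. when a fault is detected)."""
        return list(self._ring)


def correlate(histories) -> list:
    """Merge per-switch histories into one timeline ordered by timestamp.

    Assumes each switch recorded its packets in timestamp order.
    """
    snapshots = [h.snapshot() for h in histories]
    return list(heapq.merge(*snapshots, key=lambda r: r.timestamp_ns))


# Usage: two switches, each retaining only its 3 most recent records.
s1 = SwitchHistory("s1", capacity=3)
s2 = SwitchHistory("s2", capacity=3)
for t, pid in [(100, 1), (200, 2), (300, 3), (400, 4)]:
    s1.record(t, pid)  # the record at t=100 is evicted by the fourth append
for t, pid in [(150, 1), (250, 2)]:
    s2.record(t, pid)

timeline = correlate([s1, s2])
for r in timeline:
    print(r.timestamp_ns, r.switch_id, r.packet_id)
```

In this toy model the buffer capacity bounds how far back in time a snapshot can reach, which is the role played by the time window in the abstract.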

NSDI '21 Open Access Sponsored by NetApp

BibTeX
@inproceedings {265027,
author = {Pravein Govindan Kannan and Nishant Budhdev and Raj Joshi and Mun Choon Chan},
title = {Debugging Transient Faults in Data Centers using Synchronized Network-wide Packet Histories},
booktitle = {18th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 21)},
year = {2021},
isbn = {978-1-939133-21-2},
url = {https://www.usenix.org/conference/nsdi21/presentation/kannan},
publisher = {{USENIX} Association},
month = apr,
}