Passive Realtime Datacenter Fault Detection and Localization
Arjun Roy, Hongyi Zeng, Jasmeet Bagga, and Alex C. Snoeren
Datacenters operate at large scale, with many network links and switches. These hardware components can develop intermittent faults that cause seemingly random packet drops or delays and harm application performance; several such faults occur daily in large production datacenters. Because the effects are intermittent, traditional detection techniques based on host and router statistics or active probe traffic can fail to identify and locate these errors. In this article, we present our passive hybrid approach, which combines network path information with host-based statistics to rapidly detect and pinpoint network faults inside a production Facebook datacenter.
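
To make the idea of combining path information with host-based statistics concrete, here is a minimal illustrative sketch. It is not the authors' system: it assumes each host reports, per flow, the links the flow traversed, the packets sent, and the TCP retransmissions observed, and it uses retransmissions as a stand-in for whatever host statistics are actually collected. The names `link_retransmit_rates` and `most_suspect_link`, and the link labels, are hypothetical.

```python
from collections import defaultdict


def link_retransmit_rates(flows):
    """Aggregate per-flow host statistics onto the links each flow crossed.

    flows: iterable of (path, packets_sent, retransmits), where path is a
    list of link identifiers (assumed to be known for every flow).
    Returns {link: retransmits / packets_sent} over all flows on that link.
    """
    sent = defaultdict(int)
    retx = defaultdict(int)
    for path, packets, retransmits in flows:
        for link in path:
            sent[link] += packets
            retx[link] += retransmits
    return {link: retx[link] / sent[link] for link in sent if sent[link] > 0}


def most_suspect_link(rates):
    """Return the link with the highest aggregate retransmit rate, or None."""
    if not rates:
        return None
    return max(rates, key=rates.get)


# Hypothetical two-switch topology: only flows crossing "s2-h3" see loss.
flows = [
    (["h1-s1", "s1-h3"], 1000, 0),
    (["h1-s2", "s2-h3"], 1000, 30),
    (["h2-s2", "s2-h3"], 1000, 28),
    (["h2-s1", "s1-h3"], 1000, 1),
    (["h1-s2", "s2-h4"], 1000, 0),
    (["h2-s2", "s2-h4"], 1000, 1),
]
rates = link_retransmit_rates(flows)
print(most_suspect_link(rates))
```

Because every link on a lossy path shares the blame, healthy links that carry the same flows as a faulty one also show elevated rates in this naive scheme. The faulty link stands out only when enough flows cross its neighbors without crossing it, which is why having path information for many flows matters.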