Network Fault Finding System: Packet Loss Triangulation

Tuesday, October 29, 2019 - 9:45 am–10:30 am

Jose Leitao and Daniel Rodriguez, Facebook

Abstract: 

Most network monitoring relies on the devices themselves to provide the signals used to calculate health, whether through syslog messages, SNMP, telemetry, or custom APIs.

In large-scale networks, we can’t trust the devices to accurately report their health in every possible failure case. At Facebook, in addition to standard monitoring tools, we also actively probe our networks with test traffic to ensure the platforms are behaving as we expect. With active monitoring, we can find misbehaving network elements even when they sit several layers deep inside the network.

During the presentation, we will show how we built a sample system that achieves similar results using open source tools. We will then perform a live demo on a lab network from start to finish, introducing packet loss and showing how the system identifies where the loss is occurring in real time.
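
To make the idea of triangulation concrete, here is a minimal Python sketch. It is not the system presented in the talk. It assumes that each probe records the hops it traversed along with the packets sent and received, and it scores every hop by the fraction of paths through it that showed loss. The function name `triangulate`, the `loss_threshold` parameter, and the hop names are all hypothetical.

```python
from collections import defaultdict


def triangulate(probes, loss_threshold=0.01):
    """Rank hops by the share of probed paths through them that lost packets.

    probes: iterable of (path, sent, received), where path is a sequence
    of hop names. All names and the threshold are illustrative assumptions.
    """
    lossy = defaultdict(int)  # number of lossy paths seen per hop
    total = defaultdict(int)  # number of all paths seen per hop
    for path, sent, received in probes:
        if sent == 0:
            continue  # no traffic sent, so there is no signal on this path
        is_lossy = (sent - received) / sent > loss_threshold
        for hop in set(path):
            total[hop] += 1
            if is_lossy:
                lossy[hop] += 1
    scores = {hop: lossy[hop] / total[hop] for hop in total}
    # Most suspicious first; ties go to hops with more observations, then by name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], -total[kv[0]], kv[0]))


# Hypothetical lab topology: every lossy path crosses "agg2".
probes = [
    (("rack1", "agg1", "core1", "agg3", "rack3"), 100, 100),
    (("rack1", "agg2", "core1", "agg3", "rack3"), 100, 80),
    (("rack2", "agg2", "core2", "agg4", "rack4"), 100, 75),
    (("rack2", "agg1", "core2", "agg4", "rack4"), 100, 100),
]
ranked = triangulate(probes)
print(ranked[0])  # the hop shared by all lossy paths ranks first
```

In this sample data, hops that appear on both clean and lossy paths receive a middling score, while the one hop common to every lossy path stands out. A production system would likely also need to account for path diversity, probe volume, and noise, which this sketch ignores.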

Jose Leitao, Facebook

Jose Leitao is a production network engineer in the Network org at Facebook. His team's responsibilities include maintaining, monitoring, and improving the global production network infrastructure.

Daniel Rodriguez, Facebook

Daniel Rodriguez is a production network engineer in the Network org at Facebook. His team's responsibilities include maintaining, monitoring, and improving the global production network infrastructure.

BibTeX
@conference {240846,
author = {Jose Leitao and Daniel Rodriguez},
title = {Network Fault Finding System: Packet Loss Triangulation},
year = {2019},
address = {Portland, OR},
publisher = {{USENIX} Association},
month = oct,
}
