Automatic Traffic Scheduling for Internet Connectivity Failures

Thursday, June 07, 2018 - 4:45 pm–5:10 pm

Liuqing Zhang, Baidu Inc.


When it comes to high availability (HA) or user experience (UE), people often think about the stability of backend services or about product design. Network connectivity, especially Internet connectivity, is neglected. This may partly come from the impression that the network is usually stable. Our observations show that network failures are far from rare, at least in China. Every week, we detect 3-5 PoP failures and 5-10 backbone failures that break connectivity at the province level. Most of these failures can be remedied by modifying DNS settings to bypass the broken path. The remediation depends on two systems: the detection system and the traffic scheduling system.

The detection system must detect failures precisely and promptly. Besides dedicated monitoring agents, we recruit volunteer agents to improve coverage and timeliness. Dealing with their heterogeneity and unpredictable presence is crucial to detection performance.
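
One way to picture how reports from heterogeneous agents might be combined is a weighted vote. The sketch below is an illustration only, not the system presented in the talk: the `ProbeReport` type, the per-agent `weight`, and the `min_weight` and `fail_ratio` thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeReport:
    agent_id: str
    reachable: bool
    # Hypothetical trust score: e.g. 1.0 for a dedicated agent,
    # lower for a volunteer agent with a noisier vantage point.
    weight: float


def path_is_down(reports: List[ProbeReport],
                 min_weight: float = 3.0,
                 fail_ratio: float = 0.6) -> bool:
    """Declare a path down only when enough trusted evidence agrees."""
    total = sum(r.weight for r in reports)
    # Volunteer agents come and go; with too little evidence, do not alarm.
    if total < min_weight:
        return False
    failed = sum(r.weight for r in reports if not r.reachable)
    return failed / total >= fail_ratio


reports = [
    ProbeReport("dedicated-1", False, 1.0),
    ProbeReport("dedicated-2", False, 1.0),
    ProbeReport("volunteer-1", False, 0.5),
    ProbeReport("volunteer-2", True, 0.5),
    ProbeReport("volunteer-3", True, 0.5),
]
print(path_is_down(reports))
```

The minimum-weight check stands in for the "unpredictable presence" problem: when few agents happen to be online for a region, the sketch stays silent rather than raising a false alarm.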

The traffic scheduling system is responsible for detouring traffic to the correct path. It must consider not only the connectivity of external network links, but also user experience and the load on the target IDCs.
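
The three constraints named above (connectivity, user experience, IDC load) can be sketched as a simple selection rule. This is a minimal sketch under stated assumptions, not the scheduler described in the talk: the `Idc` type, its fields, and the use of latency as a proxy for user experience are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Idc:
    name: str
    reachable: bool    # is the external link from the affected region up?
    latency_ms: float  # assumed proxy for user experience
    load: float        # current traffic, arbitrary units
    capacity: float    # maximum traffic the IDC can absorb


def pick_idc(idcs: List[Idc], extra_load: float) -> Optional[Idc]:
    """Pick the lowest-latency IDC that is reachable and has headroom."""
    candidates = [
        i for i in idcs
        if i.reachable and i.load + extra_load <= i.capacity
    ]
    if not candidates:
        return None  # no safe target; leave the decision to an operator
    return min(candidates, key=lambda i: i.latency_ms)


idcs = [
    Idc("idc-a", reachable=False, latency_ms=10, load=50, capacity=100),
    Idc("idc-b", reachable=True, latency_ms=20, load=90, capacity=100),
    Idc("idc-c", reachable=True, latency_ms=35, load=40, capacity=100),
]
target = pick_idc(idcs, extra_load=20)
print(target.name if target else "no target")
```

In this example the nearest IDC is unreachable and the next one lacks headroom for the detoured traffic, so the sketch selects the third. In a DNS-based setup, the chosen IDC would be the one the affected region's records are pointed at.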

In this talk, we will explain how to implement and use these two systems to handle Internet connectivity failures.

