How We Drained Every Backbone Router Simultaneously

Wednesday, 26 October, 2022 - 09:45–10:30

Francois Richard, Meta


On October 4, 2021, we experienced a severe outage lasting approximately 6 hours.

Our engineering teams learned that commands issued as part of routine infrastructure maintenance changed the configuration of the backbone routers that coordinate network traffic between our data centers, and that these changes interrupted this communication. This disruption to network traffic had a cascading effect on how our data centers communicate, bringing our services to a halt.

During this presentation, we aim to describe the chain of events that led us to this situation and how the underlying cause of this outage also impacted the internal tools and systems we use in our day-to-day operations. We will also delve into our reflections after the event, how continuous validation of support structures, DR capabilities, tooling, and processes has helped us, and how we are thinking about the future.

Francois currently supports the Reliability Infra team at Meta. The team focuses on both proactive and reactive reliability: from reacting to and managing incidents, to planning and testing for disaster, to validating for resilience and fault tolerance, including delivering realistic environments that enable services to truly certify their recovery procedures. Francois has been at Meta for 5+ years, previously working in Core Systems, where he focused on the reliability and security of the control plane components. He is an incident manager oncall and also a crisis manager.

