Sub-Region Failure: How to Handle the Partial Loss of a Data Center

Tuesday, October 29, 2019 - 4:00 pm–4:45 pm

Joe Gasperetti and Yang Xia, Facebook

Abstract: 

Large internet companies like Facebook operate out of multiple geo-distributed data centers (DCs) connected via global backbone networks. At this scale, it is common to experience large-scale failures, such as submarine network cable disconnections, as well as localized physical failures including flipped power breakers, water intrusion, electrical fires, cooling failures, and more. Previous research in Disaster Recovery (DR) focuses on minimizing the impact of losing entire data centers by quickly moving traffic and data away from an affected DC.

But what if we could endure physical failures while keeping the DC online? Facebook’s Sub-Region DR initiative aims to handle the partial loss of a data center without expanding the failure scope to the entire DC. Our approach is to work with software teams to make systems durable to partial failure. We will describe how we built an “auditor” which understands stateless, stateful and storage systems and can simulate the effects of power outages without pulling the plug. We will also share testing stories about disconnecting machines on purpose, and war stories about power plugs pulled by accident.
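The abstract does not describe how the auditor works internally. As a rough illustration of the general idea of simulating a power outage on paper, the following minimal Python sketch assumes a hypothetical mapping from hosts to power fault domains and a hypothetical per-service replica requirement; it then checks which services would fall below that requirement if any single domain lost power. All names (`Service`, `audit`, `min_healthy`, the breaker labels) are invented for this example and are not taken from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Service:
    """Hypothetical service description, not Facebook's actual model."""
    name: str
    hosts: Tuple[str, ...]  # hosts that run a replica of the service
    min_healthy: int        # replicas required to keep serving


def audit(services: List[Service],
          host_to_domain: Dict[str, str]) -> Dict[str, List[str]]:
    """Simulate losing each fault domain in turn, without touching hardware.

    Returns a mapping from fault domain to the services that would drop
    below their minimum healthy replica count if that domain went dark.
    """
    at_risk = defaultdict(list)
    for domain in sorted(set(host_to_domain.values())):
        for svc in services:
            # Replicas that sit outside the simulated failed domain.
            survivors = [h for h in svc.hosts if host_to_domain[h] != domain]
            if len(survivors) < svc.min_healthy:
                at_risk[domain].append(svc.name)
    return dict(at_risk)


# Illustrative placement data (assumed, not from the talk).
HOST_TO_DOMAIN = {
    "h1": "breaker-A",
    "h2": "breaker-A",
    "h3": "breaker-B",
    "h4": "breaker-C",
}
SERVICES = [
    Service("web", ("h1", "h3", "h4"), min_healthy=2),
    Service("db", ("h1", "h2", "h3"), min_healthy=2),
]

REPORT = audit(SERVICES, HOST_TO_DOMAIN)
print(REPORT)
```

In this toy data set, only the `db` service is flagged, because two of its three replicas share `breaker-A`. A real auditor would presumably need different rules for stateless, stateful, and storage systems; this sketch treats them all as replica counts.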

Joe Gasperetti, Facebook

Joe Gasperetti is a Production Engineer at Facebook. He currently works on the Web Foundation team, which is responsible for the uptime and reliability of Facebook's infrastructure. Before Web Foundation, he spent five years working on media storage.

Yang Xia, Facebook

Yang Xia is a Software Engineer at Facebook. He currently works on the Disaster Recovery team. He pulls the plug on data centers on purpose to test their resiliency. Before Disaster Recovery, he spent a year running Red Teams to test the physical security of Facebook data centers.

BibTeX
@conference {240826,
author = {Joe Gasperetti and Yang Xia},
title = {{Sub-Region} Failure: How to Handle the Partial Loss of a Data Center},
year = {2019},
address = {Portland, OR},
publisher = {USENIX Association},
month = oct
}
