Delete This: Decommissioning Servers at Scale

Friday, August 31, 2018 - 12:15–12:40

Anirudh Ra, Facebook

Abstract: 

Facebook's datacenter footprint has increased significantly; we now have 12 locations across the USA and Europe. As these new locations come online, we have had to plan for the end-of-life process: decommissioning server racks and replacing them in a timely and streamlined manner. Until recently, decommissioning a cluster entailed a lot of manual work: project managers ticketed service oncalls, who then migrated their services off the old hardware onto new hardware, after which the old hardware was unplugged and rolled out.

We realized the need for automation that covered all of this. We started with a framework that allows for automated service migration, given a list of retiring machines and a list of replacements. We moved on to an automated process that looks at a decommission schedule and kicks off jobs to drain server clusters on time so that old racks can be taken away and new racks rolled into their place.
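The two pieces described above can be illustrated with a minimal sketch. It is not the framework from the talk: the names `Decommission`, `clusters_to_drain`, and `plan_migration`, the 14-day lead time, and the one-to-one pairing of machines are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Decommission:
    """One entry in a hypothetical decommission schedule."""
    cluster: str        # illustrative cluster name
    removal_date: date  # day the racks are due to leave the datacenter


def clusters_to_drain(schedule, today, lead_time=timedelta(days=14)):
    """Return clusters whose drain should have started by `today`.

    A drain is due once `today` is within `lead_time` of the removal
    date. Overdue clusters are included too. The 14-day default is an
    assumption, not a figure from the talk.
    """
    return [
        entry.cluster
        for entry in sorted(schedule, key=lambda e: e.removal_date)
        if today >= entry.removal_date - lead_time
    ]


def plan_migration(retiring, replacements):
    """Pair each retiring machine with one replacement, in list order."""
    if len(replacements) < len(retiring):
        raise ValueError("not enough replacement machines")
    return dict(zip(retiring, replacements))
```

A scheduler could call `clusters_to_drain` once a day and start a drain job for each cluster it returns, with `plan_migration` supplying the old-to-new machine mapping for that job.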

With this automated process in place, we have learned lessons and figured out how to minimize the time that old servers spend without services running on them before being rolled out of the datacenter. We are also exploring how to reuse parts of this framework elsewhere to increase efficiency.

Anirudh Ra, Facebook

Customer support tech turned production engineer, Anirudh tries to remember that his job is still about helping people succeed. He builds frameworks for service owners to run their services with minimal bother and enjoys baking bread, using Oxford commas, and reading fiction, histories, and fictional histories.

BibTeX
@inproceedings {218893,
author = {Anirudh Ra},
title = {Delete This: Decommissioning Servers at Scale},
booktitle = {SREcon18 Europe/Middle East/Africa (SREcon18 Europe)},
year = {2018},
address = {Dusseldorf},
url = {https://www.usenix.org/node/218894},
publisher = {{USENIX} Association},
}