Delete This: Decommissioning Servers at Scale

Friday, August 31, 2018 - 12:15-12:40

Anirudh Ra, Facebook

Abstract: 

Facebook's datacenter footprint has increased significantly; we now have 12 locations across the USA and Europe. As these new locations come online, we have had to plan for the end-of-life process: decommissioning server racks and replacing them in a timely and streamlined manner. Until recently, decommissioning a cluster entailed a lot of manual work: project managers ticketed service on-calls, who then migrated their services off the old hardware onto new hardware, after which the old hardware was unplugged and rolled out.

We realized the need for automation that covered all of this. We started with a framework that allows for automated service migration, given a list of retiring machines and a list of replacements. We moved on to an automated process that looks at a decommission schedule and kicks off jobs to drain server clusters on time so that old racks can be taken away and new racks rolled into their place.
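As a rough illustration of the two pieces described above, the following is a minimal sketch, not the framework presented in the talk. It assumes a one-to-one pairing of retiring machines with replacements and a schedule in which each cluster has a removal date and an estimated drain time; the names DecomEntry, plan_migration, and drains_due are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DecomEntry:
    """One row of a hypothetical decommission schedule."""
    cluster: str         # illustrative cluster name
    removal_date: date   # day the old racks are rolled out
    drain_days: int      # assumed number of days needed to drain the cluster


def plan_migration(retiring, replacements):
    """Pair each retiring machine with a replacement, one to one."""
    if len(replacements) < len(retiring):
        raise ValueError("not enough replacement machines")
    # zip stops at the shorter list, so spare replacements stay unused.
    return dict(zip(retiring, replacements))


def drains_due(schedule, today):
    """Return clusters whose drain should be running as of `today`.

    A drain is due once the latest safe start date (removal date minus
    drain time) has arrived and the racks have not yet been removed.
    """
    due = []
    for entry in schedule:
        start_by = entry.removal_date - timedelta(days=entry.drain_days)
        if start_by <= today < entry.removal_date:
            due.append(entry.cluster)
    return sorted(due)
```

A scheduler could call drains_due once a day and start a drain job for each returned cluster, using plan_migration to decide where each service lands. A real system would also need to track drains already in progress and handle mismatched hardware, which this sketch ignores.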

With this automated process in place, we have learned lessons and figured out how to minimize the time that old servers spend without services running on them before being rolled out of the datacenter. We are also exploring ways to reuse parts of this framework elsewhere to increase efficiency.

Anirudh Ra, Facebook

Customer support tech turned production engineer, Anirudh tries to remember that his job is still about helping people succeed. He builds frameworks for service owners to run their services with minimal bother and enjoys baking bread, using Oxford commas, and reading fiction, histories, and fictional histories.

BibTeX
@inproceedings {218893,
author = {Anirudh Ra},
title = {Delete This: Decommissioning Servers at Scale},
booktitle = {SREcon18 Europe/Middle East/Africa (SREcon18 Europe)},
year = {2018},
address = {Dusseldorf},
url = {https://www.usenix.org/node/218894},
publisher = {{USENIX} Association},
}