Designing for Failure: How to Manage Thousands of Hosts Through Automation

Monday, October 29, 2018 - 2:00 pm to 2:30 pm

Brandon Bercovich, Uber

Abstract: 

At Uber, we run thousands of services on top of many thousands of hosts using Apache Mesos with the Apache Aurora framework. This setup ensures that when a host breaks, its services are automatically rescheduled to another host, but what happens to the host itself? What happens when a host is still running services but is misconfigured or has a hardware fault that may be affecting their performance? What about when you want to upgrade the kernel or other software across your fleet? At Uber, we created the Cluster Lifecycle Manager (CLM) to answer these questions in a safe and automated way. In this talk, we will go through the architecture that makes this possible and how we ensure our actions don't impact services.

Brandon Bercovich, Uber

I've been working in the industry for over 18 years as a Systems Administrator, a DBA, and now an SRE. At Uber, my team manages our Compute platform through automation.

BibTeX
@conference {221726,
author = {Brandon Bercovich},
title = {Designing for Failure: How to Manage Thousands of Hosts Through Automation},
year = {2018},
address = {Nashville, TN},
publisher = {{USENIX} Association},
month = oct,
}
