Designing for Failure: How to Manage Thousands of Hosts Through Automation

Brandon Bercovich

Monday, October 29, 2018 - 2:00 pm–2:30 pm

Brandon Bercovich, Uber

At Uber, we run thousands of services on top of many thousands of hosts using Apache Mesos with the Apache Aurora framework. This setup ensures that when a host breaks, its services are automatically rescheduled onto another host, but what happens to the host itself? What happens when a host is still running services but is misconfigured or has a hardware fault that may be degrading the performance of those services? What about when you want to upgrade the kernel or other software across your fleet? At Uber, we created the Cluster Lifecycle Manager (CLM) to answer these questions in a safe and automated way. In this talk, we will go through the architecture that makes this possible and how we ensure that our actions don't impact services.

I've been working in the industry for over 18 years as a Systems Administrator, a DBA, and now an SRE. At Uber, my team manages our Compute platform through automation.

Connect:

@Draajen

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.

BibTeX

@conference {221726,
author = {Brandon Bercovich},
title = {Designing for Failure: How to Manage Thousands of Hosts Through Automation},
year = {2018},
address = {Nashville, TN},
publisher = {USENIX Association},
month = oct
}
