Help Protect Your Data Centers with Safety Constraints

Wednesday, March 28, 2018 - 4:50 pm–5:10 pm

Christina Schulman and Etienne Perot, Google

Abstract: 

Running a multi-tenant, multi-datacenter compute infrastructure requires automating the management of machines throughout their lifecycles. We look at how Google keeps its own infrastructure safe in the face of rogue automation and human error, as well as ever-changing machine management software.

We’ll discuss common failure patterns that we’ve encountered in Google’s automation systems, and ways to avoid and mitigate them. We’ll also cover principles of a good production safety constraint checking service: when to use it, what constraints it should have, and how to make that system safe from itself.

These principles apply at any scale, and it’s easier to apply them if you start early.

Christina Schulman, Google

Christina Schulman is an SRE at Google, where she works on datacenter machine management and system dependency control. Prior to joining Google in 2008, she wrote software for medical imaging systems, early Internet startups, and game companies. She has a B.A. in Computer Science from Princeton.

Etienne Perot, Google

Etienne Perot is an SRE at Google working on Borg, Google's cluster orchestration system. He builds systems that make managing large-scale infrastructure at Google safe and reliable.

BibTeX
@conference {213084,
author = {Christina Schulman and Etienne Perot},
title = {Help Protect Your Data Centers with Safety Constraints},
year = {2018},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = mar,
}
