Help Protect Your Data Centers with Safety Constraints

Wednesday, March 28, 2018 - 4:50 pm–5:10 pm

Christina Schulman and Etienne Perot, Google


Running a multi-tenant, multi-datacenter compute infrastructure requires automating the management of machines throughout their lifecycles. We look at how Google keeps its own infrastructure safe in the face of rogue automation and human error, as well as ever-changing machine management software.

We’ll discuss common failure patterns that we’ve encountered in Google’s automation systems, and ways to avoid and mitigate them. We’ll also cover principles of a good production safety constraint-checking service: when to use it, what constraints it should have, and how to make that service safe from itself.

These principles apply at any scale, and it’s easier to apply them if you start early.

Christina Schulman, Google

Christina Schulman is an SRE at Google, where she works on datacenter machine management and system dependency control. Prior to joining Google in 2008, she wrote software for medical imaging systems, early Internet startups, and game companies. She has a B.A. in Computer Science from Princeton.

Etienne Perot, Google

Etienne Perot is an SRE at Google working on Borg, Google's cluster orchestration system. He works on systems that make the management of large-scale systems at Google safe and reliable.

