Christina Schulman and Etienne Perot, Google
Running a multi-tenant, multi-datacenter compute infrastructure requires automating the management of machines throughout their lifecycles. We look at how Google keeps its own infrastructure safe in the face of rogue automation and human error, as well as ever-changing machine management software.
We’ll discuss common failure patterns that we’ve encountered in Google’s automation systems, and ways to avoid and mitigate them. We’ll also cover principles of a good production safety constraint checking service: when to use it, what constraints it should have, and how to make that system safe from itself.
These principles apply at any scale, and it’s easier to apply them if you start early.
Christina Schulman is an SRE at Google, where she works on datacenter machine management and system dependency control. Prior to joining Google in 2008, she wrote software for medical imaging systems, early Internet startups, and game companies. She has a B.A. in Computer Science from Princeton.
SREcon18 Americas Open Access Videos