Going from 30 to 30 Million SLOs

Wednesday, 26 October, 2022 - 14:45–15:30 CEST

Alex Palcuie, Google


I will present the evolution of Service Level Objectives (SLOs) for the GCE Compute API over the past 6 years: starting from the initial 30 or so SLOs, going through a mid-term phase of about a thousand, and ending with millions of per-customer SLOs. I will share anecdotes, better techniques for handling low QPS (think continuous over discrete metrics), and ways to aggregate the data for better leadership visibility.

Alex has been working as a Site Reliability Engineer in the team that takes care of the GCE Compute API for over 5 years. He has also been part of the team that built a control plane framework that now powers over 20 products in Google Cloud. His current 20% project is with the Tech Incident Response Team (Tech IRT), helping with huge outages, such as powering down computers in a data centre when the weather is too hot.

