A Theory and Practice of Alerting with Service Level Objectives

Friday, June 08, 2018 - 2:20 pm-3:15 pm

Jamie Wilkinson, Google


As systems grow, they gain more components and more ways to fail. Alerts carried over from the previous system's design can slowly "boil the frog", and suddenly no one has time to help the system scale further because everyone is constantly firefighting. Alert fatigue sets in and the team burns out.

The way to avoid this is to only page when the SLO is not met, or when the "error budget" is being burned at a rate requiring immediate action.
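
As a rough illustration of the second condition, the sketch below computes a burn rate, taken here to mean the observed error ratio divided by the error ratio the SLO allows, and pages when it crosses a threshold. This is one common way of expressing the idea, not necessarily the method presented in the talk; the function names and the threshold value are hypothetical and chosen only for this example.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be used up exactly over the
    SLO period; higher values exhaust it proportionally sooner.
    """
    if requests == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.01 for a 99% SLO
    return (errors / requests) / budget


def should_page(errors: int, requests: int, slo_target: float,
                threshold: float) -> bool:
    """Page only when the budget burns at or above the given rate.

    The threshold is an assumption of this sketch; a team would pick it
    based on how quickly a human needs to respond.
    """
    return burn_rate(errors, requests, slo_target) >= threshold
```

With a 99% SLO and a hypothetical threshold of 10, a window with 150 errors in 1000 requests burns the budget about 15 times too fast and would page, while 50 errors in 1000 requests (about 5 times) would not.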

Perhaps you've moved on from a check-based alerting system with lots of spammy alerts to a timeseries-based monitoring system like Prometheus. You've heard about SLOs and error budgets, but they sound like a unicorn dream -- at the very least, you can't visualise how they might even be constructed in a monitoring system. Fear not! In this talk, Jamie Wilkinson, a well-rested champion of work/life balance, will cover the ideas behind alerting on SLOs and error budgets, how the implementation changes as systems scale, and the tools you'll need once the alerts themselves no longer tell you which part is broken.

@conference {214907,
author = {Jamie Wilkinson},
title = {A Theory and Practice of Alerting with Service Level Objectives},
year = {2018},
publisher = {{USENIX} Association},