Jamie Wilkinson, Google
As systems grow, they gain more components and more ways to fail. Alerts inherited from the previous system's design can slowly "boil the frog", until suddenly no one has time to help the system scale further because they're constantly firefighting. Alert fatigue sets in and the team burns out.
The way to avoid this is to only page when the SLO is not met, or when the "error budget" is being burned at a rate requiring immediate action.
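As a rough illustration of the paging rule above, the sketch below computes a burn rate: the observed error ratio divided by the error ratio the SLO allows. The function names, the 99.9% target, and the paging threshold of 2.0 are assumptions made for this example and do not come from the talk; a real system would evaluate this over one or more time windows inside the monitoring system itself.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1.0 would spend the whole budget exactly over the SLO period.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was burned
    return (errors / total) / error_budget(slo_target)


def should_page(errors: int, total: int, slo_target: float, threshold: float) -> bool:
    """Page only when the budget is burning at or above the chosen threshold.

    The threshold is a hypothetical tuning knob, not a recommended value.
    """
    return burn_rate(errors, total, slo_target) >= threshold


if __name__ == "__main__":
    # 50 failures out of 10,000 requests against a 99.9% SLO (assumed numbers).
    rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.2f}")
    print("page:", should_page(50, 10_000, 0.999, threshold=2.0))
```

With these assumed numbers the error ratio is 0.5% against an allowed 0.1%, so the budget is burning about five times faster than sustainable and the check would page.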
Perhaps you've moved on from a check-based alerting system with lots of spammy alerts to a time-series-based monitoring system like Prometheus. You've heard about SLOs and error budgets, but they sound like a unicorn dream, or at the very least you can't visualise how they might be constructed in a monitoring system. Fear not! In this talk, Jamie Wilkinson, a well-rested champion of work/life balance, will talk about the ideas behind alerting on SLOs and error budgets, how the implementation changes as systems scale, and the tools you'll need once the alerts themselves no longer tell you which part is broken.
Jamie Wilkinson is a Site Reliability Engineer at Google. A contributing author to the "SRE Book," he has presented on contemporary topics at prominent conferences such as linux.conf.au, Monitorama, PuppetConf, Velocity, and SREcon. His interests began in monitoring and automation of small installations, but continue with human factors in automation and systems maintenance on large systems. Despite over 15 years in the industry, he is still trying to automate himself out of a job.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.