Latency and Availability Error Budgets Done Right at Scale

Tuesday, December 08, 2020 - 4:05 pm–4:25 pm

Fred Moyer, Zendesk

Abstract: 

Learn how Zendesk developed formulas for implementing SLIs, SLOs, and Error Budgets at scale across a team of 1,000 engineers.

Error Budgets tell us when we should stop working on features and instead work on reliability. Because we use them to prioritize expensive resources (not to mention protect our revenue streams), we want them to be as accurate as possible. How do you empower 1,000+ engineers to solve these problems correctly in systems at scale?

Fred Moyer, Zendesk

Fred is an SRE and resident SLOgician (like statistician, not magician) at Zendesk. He previously worked with high-scale telemetry at Circonus and scaled large web systems at Turnitin. Fred developed the first Istio community adapter in 2018 and was a White Camel Award winner in 2013. He likes to daydream about SLOs and Error Budgets while riding his mountain bike.

BibTeX
@inproceedings {262241,
author = {Fred Moyer},
title = {Latency and Availability Error Budgets Done Right at Scale},
booktitle = {SREcon20 Americas (SREcon20 Americas)},
year = {2020},
url = {https://www.usenix.org/conference/srecon20americas/presentation/moyer},
publisher = {{USENIX} Association},
month = dec,
}