Unified Theory of SRE

Thursday, 27 October, 2022 - 09:45–10:30 CEST

Emil Stolarsky, Wave Mobile Money

Abstract: 

Site Reliability Engineering was born at a large company, but it was enshrined at a massive company. When you have over 70 SREs, from a single organization, contributing to a book that documents their approach to running infrastructure, you’re going to be getting a very particular snapshot of the world: the perspective of a place that’s big enough to support over 70 SREs. This is the crux of Niall Murphy’s call to action from SREcon21: what we’ve been treating as gospel can’t possibly be true for everyone.

Since the start, our dialogue has always been rooted in this level of scale. Even when we venture into discussions of solo SREs or bootstrapping a team, it’s stretching the “traditional” model down. We’ve never truly grappled with what SRE would look like at a startup (the 4-person kind, not the billion-dollar “startup” with hundreds of people). But forcing ourselves to work from first principles, to understand what SRE is like at that scale, could be just as insightful as it was for the physics community to attempt to bridge their classical and quantum models.

In this talk, I’d like to do just that. Let’s start at a hypothetical 4-person startup where the default is running out of money within 18 months. From all our cherished SRE ideals, what truly matters at this stage? Do you run incident reviews, do you make SLOs, what is on-call even like? From there, we’ll start turning the dial and watch our startup grow. At a certain point, we’ll reach a size where we have everything one might expect: established SLOs, proper incident response, etc. Throughout that growth, if we look at every significant stage and ask ourselves what’s changed, what matters, and what needs to be done, we can glean a better understanding of what SRE really is and continue to outline the elephant.

Emil is an SRE at Wave Mobile Money, helping make Africa the first cashless continent. Previously he worked on caching, performance, & disaster recovery at Shopify, the internal Kubernetes platform at DigitalOcean, and everything in between at Cheddar. In addition to speaking at & organizing a number of conferences, he was a contributor to Seeking SRE and co-authored 97 Things Every SRE Should Know.

BibTeX
@conference {284647,
author = {Emil Stolarsky},
title = {Unified Theory of {SRE}},
year = {2022},
address = {Amsterdam},
publisher = {USENIX Association},
month = oct
}
