What SRE Could Be: Systems Reliability Engineering

Tuesday, 25 October, 2022 - 17:00–17:45

Laura Nolan, Stanza


As a profession, SRE is still in its infancy, but it isn't too young to be experiencing a profound identity crisis.

Google's Ben Treynor originally said that SRE is what you get when you ask software engineers to do operations (but recently contradicted himself by saying that SRE isn't operations). Google later described SRE as an implementation of DevOps. Lots of people see SRE as holding the pager, or as the use of SLOs.

SRE is bigger than any of the definitions above. I define SRE as systems thinking applied to software in production, including the sociotechnical aspects of running software systems. Systems thinking has a set of associated tools and methodologies, but, at its core, it is a philosophy that is aware that systems have underlying structures that cause particular patterns of behaviour. Systems thinking aims to model complex problems and make them tractable.

The best SREs and the best SRE organisations are thoroughly immersed in systems thinking, but it largely remains an implicit organising principle, not something we make a first-class citizen in our profession.

Niall Murphy, at SREcon 2021, asked what SRE 2.0 should be. My answer is that SRE 2.0 should be a profession where we put systems thinking first. This talk will explore how we might do that.

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal (and principled) Engineer at Stanza Systems, where she is building software to help humans understand and control their production systems. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

