STPA for Software Workshop: Finding the Outages Waiting to Happen

Tuesday, March 24, 2026 - 1:50 pm–5:30 pm

Theo Klein, Ruben Barroso, and Garrett Holthaus, Google

SREs know about some of the flaws and vulnerabilities in their systems. They might also have intuition about where to look for additional issues–"known unknowns." But what about the "unknown unknowns"–outages waiting to happen that nobody is even looking for? With the vast complexity of modern software systems, this dark space of unknowns can be huge. And, what's worse, most of the outages in this space happen due to complex interactions between various parts of the system, even when everything is working according to specification, i.e., when there are no implementation bugs.

What if we had a way to shine a light into the unknown unknowns? What if we could understand our systems enough to be able to methodically explore these complex interactions and build a comprehensive list of possible outage scenarios? In STPA, we model systems based on control-feedback loops, creating a hierarchical control structure, or HCS. In this session, we'll use a real Google system to show how an HCS can help you gain a new perspective and understanding of your system. We'll note similarities to common patterns in software design so you can start thinking about similar vulnerabilities in your own systems.

This session will be an interactive workshop. Attendees should plan to actively participate in the small group exercises in order to get the most benefit from the session.

Theo Klein is a Staff Site Reliability Engineer (SRE) at Google. Over the past two years, he has led an effort to improve the safety and reliability of road disruption data on Google Maps. He also has several years of experience applying safety engineering methods like STPA and CAST to proactively identify risks in complex socio-technical systems at Google. His primary interests are in systems thinking, dependency management and horizontal analyses of large-scale systems.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including Systems-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

Garrett Holthaus is a technical writer for Site Reliability Engineering at Google. He has a background in electrical and computer engineering, as well as experience teaching and designing science and technology curricula. In addition to writing and maintaining SRE documentation, Garrett develops and gives training in System-Theoretic Process Analysis (STPA) at Google.

BibTeX
@conference {316475,
author = {Theo Klein and Ruben Barroso and Garrett Holthaus},
title = {{STPA} for Software Workshop: Finding the Outages Waiting to Happen},
year = {2026},
address = {Seattle, WA},
publisher = {USENIX Association},
month = mar
}