STPA for Software Systems – Illuminate the Unknown Unknowns

Thursday, 9 October, 2025 - 11:00–12:35

Theo Klein, Garrett Holthaus, and Ruben Barroso, Google

SREs know about some of the flaws and vulnerabilities in their systems. They might also have intuition about where to look for additional issues: the "known unknowns." But what about the "unknown unknowns," the outages waiting to happen that nobody is even looking for? With the vast complexity of modern software systems, this dark space of unknowns can be huge. What's worse, most of the outages in this space stem from complex interactions between parts of the system, even when every part works according to specification, that is, with no implementation bugs.

What if we had a way to shine a light into the unknown unknowns? What if we could understand our systems well enough to methodically explore these complex interactions and build a comprehensive list of possible outage scenarios? In System-Theoretic Process Analysis (STPA), we model systems as control-feedback loops arranged in a hierarchical control structure, or HCS. In this session, we'll use a real Google system to show how an HCS can help you gain a new perspective on and understanding of your system. We'll note similarities to common patterns in software design so you can start thinking about similar vulnerabilities in your own systems.

This session will be an interactive workshop. Attendees should plan to actively participate in the small group exercises to get the most benefit from the session.

Theo Klein is a Staff Site Reliability Engineer working on Google Maps. Over the past two years, he has led an effort to improve the safety and reliability of road disruptions data on Google Maps. Previously, he led efforts to remove unneeded dependencies on critical systems, which de-risked Google's many serving layers from global outages.

His primary interests are in systems thinking, dependency management, and horizontal analyses of large-scale systems.

Garrett Holthaus is a technical writer for Site Reliability Engineering at Google. He has a background in electrical and computer engineering, as well as experience teaching and designing science and technology curricula. In addition to writing and maintaining SRE documentation, Garrett develops and gives training in System-Theoretic Process Analysis (STPA) at Google.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including System-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

BibTeX
@conference {311910,
author = {Theo Klein and Garrett Holthaus and Ruben Barroso},
title = {{STPA} for Software {Systems{\textendash}Illuminate} the Unknown Unknowns},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
