Principled Identification of "Root Causes" Using Techniques from Safety Engineering

Wednesday, 26 October, 2022 - 16:45–17:10 CEST

Laura de Vesine, Datadog Inc


Industry approaches to "root cause analysis" are often ad-hoc and unprincipled. Causes lie somewhere in the space between "the last change before it broke" and "The Big Bang", but how do we decide which to focus on? "Five whys" - or relying on our best guesses - are unsatisfying: we try to identify and prevent specific occurrences of worst-case conditions, and end up playing "whack-a-mole" with outage triggers. This talk draws from safety engineering to present a framework for choosing the "right" root causes, with a worked example. We sharply distinguish our system from our environment, allowing us to design and build safe system behavior under any reasonable conditions - including worst-case.

Laura de Vesine, Datadog

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 6 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.

