Tanner Lund, Indeed
Outage pattern analysis is hard! There have been many attempts to learn across multiple incidents. Folks look for categories, tags, causes, etc. to identify what's brittle or risky in their system, sometimes even using statistical models to help make sense of the data. However, the results often prove unsatisfying or non-actionable, or tell you nothing you didn't already know from other sources.
An alternative approach is to find patterns via Christopher Alexander's "Pattern-Centered Inquiry". Complex systems fail according to certain patterns or fundamental laws. We can identify and learn from these patterns and then see how their individual, diverse manifestations develop in our systems. An understanding of patterns and how to spot them then underpins better-informed reliability decision-making.
Tanner Lund has been studying incidents and what they can tell us about systems for the better part of a decade. During his time supporting cloud platforms, building data pipelines, managing crises, and improving site reliability, he's found there is a lot more to understand about how software and people work (and don't work) together. Throughout it all, his focus has been on understanding complex systems and how we achieve our goals through them, seeking to unlock their secrets. That may take a while...
author = {Tanner Lund},
title = {Patterns, Not Categories: Learning Across Incidents},
year = {2023},
address = {Singapore},
publisher = {USENIX Association},
month = jun
}