Todd Porter, Meta; Aleksey Charapko, University of New Hampshire
Dealing with failures is an inevitable part of operating large distributed systems. Luckily, such systems are designed to handle failures and recover from their effects. In this talk, we explore the unfortunate cases in which recovery actions intended to address problems, unbeknownst to the operators, become the cause of even larger failures. This process occurs through natural recovery cascades in large systems, in which the recovery of one system or component triggers recovery in the next. We show that, via recovery cascades, systems may amplify the recovery cost at each step as the process crosses from one system to another. Moreover, these amplifications can propagate backwards into the systems that have already recovered, creating positive feedback loops that reintroduce and reinforce the failure.
Using a global-scale message bus that experienced such problems as an example, we explain our findings, the failure causes and contributing factors, and mitigation strategies.
Note: David Maier from Portland State University made significant contributions to the contents of this talk.

Todd Porter is a Software Engineer at Meta working on stream processing and streaming ingestion systems. He focuses on studying emergent behavior of large-scale systems in order to make them safer to operate.

Aleksey Charapko is an assistant professor at the University of New Hampshire. He broadly works at the intersection of performance, reliability, and efficiency of distributed systems. Aleksey has won the NSF CAREER award for his ongoing work on Metastable Failures. In addition to his academic career, Aleksey has substantial industrial and consulting experience.

