Turning an Incident Report into a Design Issue with TLA+

Wednesday, March 22, 2023 - 2:45 pm3:30 pm

A. Finn Hackett, University of British Columbia; Markus Alexander Kuppe, Microsoft Research

Abstract: 

This talk will discuss our experience using modeling-driven techniques as part of a postmortem deep dive into a long-lasting, high-impact outage at Microsoft. We built a precise specification of the micro-service architecture, most notably its foundational distributed database service CosmosDB. Modeling allowed us to go beyond the standard postmortem analysis and accurately determine the outage's root cause. The key to our approach was using TLA+, a formal specification language that let us concisely and formally describe the database's user-visible behavior under concurrency. By building a specification, we could thoroughly understand the underlying design issue and find bugs in the database's documentation. Through this case study, we demonstrate the value of modeling-driven techniques in postmortem analysis and illustrate how they can help avoid similar issues in the future. We also explain how you can use TLA+ to build precise and reusable specifications.

A. Finn Hackett, University of British Columbia

Finn is a Ph.D. student exploring practical applications of domain-specific language design and formal verification. Formal verification can help people with reasoning exercises that unassisted humans often get wrong in practice, where each mistake can become a bug or a feature that's too hard to implement. That is why it's important to make these reasoning tools more accessible, by improving their design and by demonstrating different ways to apply them in practice. Presentation topic aside, Finn's most recent work is PGo, a TLA+-based domain-specific language for developing formally verified distributed systems. For older and/or musical projects, see Finn's personal site.

Markus Alexander Kuppe, Microsoft Research

Markus is a principal research software engineer at Microsoft Research. He has been a member of the TLA+ project for over a decade. In this role, Markus has made significant contributions to the development of the TLA+ tools and has helped engineers at Microsoft formalize their systems using TLA+. Markus also teaches TLA+ to engineers at Microsoft and elsewhere, sharing his knowledge and experience. Last summer, Markus served as Finn's mentor, working alongside fellow engineer Josh Rowe on using TLA+ to model incidents.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@conference {286274,
author = {A. Finn Hackett and Markus Alexander Kuppe},
title = {Turning an Incident Report into a Design Issue with {TLA+}},
year = {2023},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}

Presentation Video