Modeling Reliability for Distributed Systems

Due to the evolving Coronavirus/COVID-19 situation, SREcon20 Americas West has been rescheduled to June 2–4, 2020.

Tuesday, March 24, 2020 - 2:00 pm–2:50 pm

Narayan Desai, Google

Core Principles
Abstract: 

Many of our core SRE processes exist as rituals, with much of their implementation derived from the kind of service and business environment in which they emerged. While these processes seem to work, we don't necessarily understand why they work, or how they should be altered for different situations. The same is true of the reliability of distributed systems in general: while we have keen intuition about how to improve our services, we have no coherent theory of why our services are reliable in a deep sense. In short, SRE suffers from a lack of rigor. This problem is the largest and most consequential one facing our profession today.

In this talk, I'll discuss the importance of modeling in engineering disciplines in general, and walk through a basic model of distributed system reliability. Models are useful for encoding intuition in a way that enables validation and projection, both key capabilities for engineers. I'll present a model based on a theory of reliability that couples the reliability properties of the Central Limit Theorem to a representation of the toil that characterizes our services. I'll use this to explain why some problems can be easily mitigated, while others are terrifying. While all models are wrong, some are useful: this one enables the systematic examination and classification of risks, and suggests a way that we may be able to hoist difficult reliability problems into a more tractable regime.

Narayan Desai, Google

Narayan Desai is an SRE at Google, where he focuses on the reliability of Google Cloud Platform Data Analytics products. He has a checkered past, having worked on scheduling, configuration management, supercomputers, and metagenomics—always in the context of production systems.

BibTeX
@conference {247257,
author = {Narayan Desai},
title = {Modeling Reliability for Distributed Systems},
year = {2020},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = mar,
}