Beyond Goldilocks Reliability

Note: Presentation times are in Coordinated Universal Time (UTC).

Friday, 15 October, 2021 - 01:00–01:45

Narayan Desai, Google


Reliability engineering is still in its infancy, with best practices stemming largely from community experiences and hard knocks. SRE practices, including alerting and SLOs, are built around subjective thresholds, institutionalizing an esoteric model of reliability. Increasingly, cracks are showing that this Goldilocks approach is insufficient.

Reliability is currently an amorphous concept. If we hope to tackle it robustly, we must first frame reliability concisely. A concrete model provides a foundation for answering complex questions about our services in a principled way. So what is reliability, anyway?

Once we have an understanding of what reliability is, we can scrutinize our current best practices and mitigation strategies. Why do they work, and why are they so effective? Why is aggregation pervasive? Why do backend drains work so well? Identifying underlying mechanisms enables us to reinforce the reliability properties we want, and identify new mitigation strategies when needed.

Narayan Desai, Google

Narayan is an SRE at Google Cloud, where he is responsible for the reliability of GCP Data Analytics products.

SREcon21 Open Access Sponsored by Indeed

@conference {276759,
author = {Narayan Desai},
title = {Beyond Goldilocks Reliability},
year = {2021},
publisher = {USENIX Association},
month = oct
}
