Beyond Goldilocks Reliability

Narayan Desai

Friday, 15 October, 2021 - 01:00–01:45

Narayan Desai, Google

Reliability engineering is still in its infancy, with best practices stemming largely from community experiences and hard knocks. SRE practices, including alerting and SLOs, are built around subjective thresholds, institutionalizing an esoteric model of reliability. Increasingly, cracks are showing that this Goldilocks approach is insufficient.

Reliability is currently an amorphous concept. If we hope to tackle it robustly, we must first frame reliability concisely. A concrete model provides a foundation for answering complex questions about our services in a principled way. So what is reliability, anyway?

Once we have an understanding of what reliability is, we can scrutinize our current best practices and mitigation strategies. Why do they work, and why are they so effective? Why is aggregation pervasive? Why do backend drains work so well? Identifying underlying mechanisms enables us to reinforce the reliability properties we want, and identify new mitigation strategies when needed.

Narayan is an SRE at Google Cloud, where he is responsible for the reliability of GCP Data Analytics products.

Connect:

@nldesai

BibTeX

@conference {276759,
author = {Narayan Desai},
title = {Beyond Goldilocks Reliability},
year = {2021},
publisher = {USENIX Association},
month = oct
}
