Looking at SRE Needs and Trends over Two Decades with a Single Service

Thursday, 12 October, 2023 - 14:50–15:30

Salim Virji, Google LLC and Murali Suriar, Snowflake

Abstract: 

We all have experienced in our organisations the case where we build a quick solution to solve an immediate problem, and eventually find the software fulfilling other needs. This is the story of Chubby, Google's distributed lock service, and how it began as a mechanism to provide leader election for infrastructure and evolved rapidly to provide service discovery, config-file distribution, and other production-critical services.

During this talk, the presenters will explore the evolution and maturity of the field of Site Reliability Engineering through the lens of this specific piece of infrastructure software. The audience will hear about our foundational experiences with monitoring, caching, proxying, and isolation, both good and bad. The audience will also hear suggestions for the direction that SRE practice will take in the near future.

Salim Virji, Google LLC

Salim Virji develops reliable engineering practices and processes for Google’s SRE organization, and has previously built distributed consensus and storage systems. Salim’s other interests include machine learning and composting.

Murali Suriar, Snowflake

Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. He works on traffic management at Snowflake after 12 years at Google, and is currently learning what "the cloud is just someone else's computer" means.

BibTeX
@conference {292093,
author = {Salim Virji and Murali Suriar},
title = {Looking at {SRE} Needs and Trends over Two Decades with a Single Service},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
