SRE for ML: The First 10 Years and the Next 10

Note: Presentation times are in Coordinated Universal Time (UTC).

Thursday, 14 October, 2021 - 01:00–01:45

Todd Underwood, Google


Over 10 years ago, we started building SRE for a large multi-model ML service at Google. We faced many interesting challenges, including:

  • Defining scope: Why do these services need ML anyway?
  • Unclear SLOs: What are we measuring and how can we actually be responsible for those things?
  • Fuzzy demarcation with our modeling teams: What is a model quality problem caused by infrastructure vs a model quality problem caused by the model or the data?

With the explosion of ML training and serving platforms, the choices we faced are now confronting many SRE teams across the industry. I will review the history, focusing on the decisions we made, why they made sense to us at the time, and why they might make sense for others. I'll also try to answer the question of whether there is a real need for SRE for ML at all.

Todd Underwood, Google

Todd Underwood is a Director at Google. He leads Machine Learning for Site Reliability Engineering (SRE) for Google. ML SRE teams build and scale internal and external ML services and are critical to almost every Product Area at Google. He is also the Engineering Site Lead for Google's Pittsburgh office.

SREcon21 Open Access Sponsored by Indeed

@conference {276737,
author = {Todd Underwood},
title = {{SRE} for {ML}: The First 10 Years and the Next 10},
year = {2021},
publisher = {USENIX Association},
month = oct
}
