Nothing to Recommend It: An Interactive ML Outage Fable

Note: Presentation times are in Coordinated Universal Time (UTC).

Thursday, 14 October, 2021 - 02:30–03:00

Todd Underwood, Google


This is the story of an ML outage, based on a real outage, but anonymized and recast to protect the innocent (and the guilty). An ML model is misbehaving, causing serious damage to the company. As with many outages, the underlying cause is unclear. In fact, even the timeline of the outage is unclear. This talk walks the audience through how the outage was detected, how troubleshooting worked, how it was mitigated and resolved, and what follow-up work was scheduled. The talk will aim to be interactive (on the honor system, given the asynchronous format)!


Todd Underwood is a Director at Google. He leads Machine Learning for Site Reliability Engineering (SRE) for Google. ML SRE teams build and scale internal and external ML services and are critical to almost every Product Area at Google. He is also the Engineering Site Lead for Google's Pittsburgh office.

SREcon21 Open Access Sponsored by Indeed

@conference {276749,
author = {Todd Underwood},
title = {Nothing to Recommend It: An Interactive {ML} Outage Fable},
year = {2021},
publisher = {USENIX Association},
month = oct
}
