How ML Breaks: A Decade of Outages for One Large ML Pipeline

Daniel Papasian and Todd Underwood, Google


Reliable management of continuous or periodic machine learning pipelines at large scale presents significant operational challenges. Drawing on almost 15 years of experience operating some of the largest ML pipelines, we examine the characteristics of one of the largest and oldest continuous pipelines at Google. We look at the actual outages it experienced and try to understand what caused them.

We examine failures in detail, categorizing them as ML vs. Non-ML and Distributed vs. Non-Distributed. We demonstrate that a majority of the outages are not ML-centric and are more closely related to the distributed character of the pipeline.

Daniel Papasian, Google

Daniel Papasian is a Staff Software Engineer at Google, working in Site Reliability Engineering. He has spent ten years at Google working on large-scale data processing and machine learning systems, in both Site Reliability Engineering roles and as a Software Engineer in the Ads Quality organization. Before Google, he worked as a Network System Engineer for Carnegie Mellon's Computing Services, writing software to automate network reconfiguration. Prior to that, he was on staff at the Chronicle of Higher Education, in charge of all things technical for its website. He holds a BS from Carnegie Mellon University with majors in the decision sciences and a minor in engineering.

Todd Underwood, Google

Todd Underwood is a lead Machine Learning for Site Reliability Engineering Director at Google and is a Site Lead for Google’s Pittsburgh office. ML SRE teams build and scale internal and external ML services and are critical to almost every product area at Google. Todd was in charge of operations, security, and peering for Renesys’s Internet intelligence services, which are now part of Oracle’s cloud service. Before that, Todd was Chief Technology Officer of Oso Grande in New Mexico. Todd has a BA in philosophy from Columbia University and an MS in computer science from the University of New Mexico. He is interested in how to make computers and people work much, much better together.

OpML '20 Open Access Sponsored by NetApp


@conference {256668,
author = {Daniel Papasian and Todd Underwood},
title = {How {ML} Breaks: A Decade of Outages for One Large {ML} Pipeline},
year = {2020},
publisher = {USENIX Association},
month = jul
}
