Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform

Note: Presentation times are in Coordinated Universal Time (UTC).

Wednesday, 13 October, 2021 - 17:00-17:30

Mary McGlohon, Google


Machine Learning is often treated as mysterious or unknowable. This can lead to SREs choosing to work around ML-related reliability problems in their systems rather than through them. This avoidance is not only risky but also unnecessary: Any given SRE operates with systems that they themselves may not know in great depth. To manage risk, they use a series of generalized techniques to understand the properties of the system and its failure modes.

In this talk, we apply this outside-in approach towards ML reliability, drawing from experiences with a large-scale ML production platform. We describe common failure modes (spoiler alert: they tend to be the same things that happen in other large systems), and based on these failure modes, recommend best practices for productionization: Monitor systems and protect them from human error. Understand data integrity needs, and meet them. Prioritize pipeline workloads for efficiency and backlog recovery.


Mary McGlohon is a Site Reliability Engineer at Google, who has worked on large-scale ML systems for the past 4 years. Prior to that, her career included data mining research, software development, and distributed pipeline systems. She completed a B.S. in computer science from the University of Tulsa and a Ph.D. in machine learning from Carnegie Mellon University. She is interested in how production techniques can make ML better for human operators and users.

SREcon21 Open Access Sponsored by Indeed

@conference {276683,
author = {Mary McGlohon},
title = {Demystifying Machine Learning in Production: Reasoning about a {Large-Scale} {ML} Platform},
year = {2021},
publisher = {USENIX Association},
month = oct
}
