Automating Operations with ML

Todd Underwood and Steven Ross, Google

Abstract: 

Engineers have been attracted to the idea of using Machine Learning to control their applications and infrastructure. Unfortunately, the majority of proposed uses of ML for production engineering are unsuited to the stated purpose. They generally fail to account for several structural limitations of the proposed application, including error rate, cost versus failure, and, most generally, an insufficient number of labeled examples.

We will review the commonly proposed applications of Machine Learning to production control, including anomaly detection, monitoring/alerting, capacity prediction, security, and resource scaling. For each, we will draw on experience to demonstrate the limitations of ML modeling techniques. We will also identify the one application with the best results.

We will end with specific recommendations for how organizations can get ready to take advantage of ML for their production operations in the future.

Todd Underwood, Google

Todd Underwood leads Machine Learning for Site Reliability Engineering (ML SRE) as a Director at Google and is a Site Lead for Google’s Pittsburgh office. ML SRE teams build and scale internal and external ML services and are critical to almost every product area at Google. Previously, Todd was in charge of operations, security, and peering for Renesys’s Internet intelligence services, which are now part of Oracle’s cloud service. Before that, Todd was Chief Technology Officer of Oso Grande in New Mexico. Todd has a BA in philosophy from Columbia University and an MS in computer science from the University of New Mexico. He is interested in how to make computers and people work much, much better together.

Steven Ross, Google

Steven Ross is a tech lead in Site Reliability Engineering for Google in Pittsburgh and has worked on Machine Learning at Google since Google acquired Pittsburgh Pattern Recognition in 2011. Before that, he worked as a Software Engineer for Dart Communications, Fishtail Design Automation, and then Pittsburgh Pattern Recognition. Steven has a B.S. from Carnegie Mellon University (1999) and an M.S. in Electrical and Computer Engineering from Northwestern University (2000). He is interested in mass-producing Machine Learning models.

OpML '20 Open Access Sponsored by NetApp

BibTeX
@conference {256664,
author = {Todd Underwood and Steven Ross},
title = {Automating Operations with {ML}},
year = {2020},
publisher = {USENIX Association},
month = jul
}
