The Power of Metrics—How to Monitor and Improve ML Efficiency

Yan Yan and Zhilan Zweiger, Facebook

Abstract: 

This talk presents an ML operational tool that grew out of the rapid growth of ML training workloads, the need for visibility into performance issues, and the search for best practices in ML training. It helps make the most of limited computing resources and helps ensure that production models are efficient, reliable, and scalable. You will learn our motivation for developing the tool, the challenges we faced, its main features and use cases, and how diverse users have leveraged it in their work to improve productivity.

Yan Yan, Facebook

Yan Yan is a production engineer at Facebook. She is on the Ads Ranking PE team, which improves efficiency, reliability, and scalability for machine learning at Facebook. Her mission is to share her knowledge to help society by anticipating ML problems and solving existing ones. Previously, she received an M.S. degree in computer science from UCLA.

Zhilan Zweiger, Facebook

Zhilan Zweiger is a staff engineer and tech lead on the Production Engineering team at Facebook. She is primarily responsible for the reliability, efficiency, and scalability of the Ads Machine Learning infrastructure stack. Before that, she worked on the Data Platform SRE team at Twitter, where she focused on the reliability of the big data, batch compute, and streaming compute environments. Zhilan holds a master's degree in Computer Science.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.

BibTeX
@conference {232955,
author = {Yan Yan and Zhilan Zweiger},
title = {The Power of Metrics{\textemdash}How to Monitor and Improve {ML} Efficiency},
year = {2019},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = may,
}