TonY: An Orchestrator for Distributed Machine Learning Jobs

Anthony Hsu, Keqiu Hu, Jonathan Hung, Arun Suresh, and Zhe Zhang, LinkedIn

Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, we describe TonY, an open-source orchestrator for distributed ML jobs built at LinkedIn to address these challenges.

BibTeX

@inproceedings {232985,
author = {Anthony Hsu and Keqiu Hu and Jonathan Hung and Arun Suresh and Zhe Zhang},
title = {{TonY}: An Orchestrator for Distributed Machine Learning Jobs},
booktitle = {2019 USENIX Conference on Operational Machine Learning (OpML 19)},
year = {2019},
isbn = {978-1-939133-00-7},
address = {Santa Clara, CA},
pages = {39--41},
url = {https://www.usenix.org/conference/opml19/presentation/hsu},
publisher = {USENIX Association},
month = may
}
