Litz: Elastic Framework for High-Performance Distributed Machine Learning

Authors: 

Aurick Qiao, Petuum, Inc. and Carnegie Mellon University; Abutalib Aghayev, Carnegie Mellon University; Weiren Yu, Petuum, Inc. and Beihang University; Haoyang Chen and Qirong Ho, Petuum, Inc.; Garth A. Gibson, Carnegie Mellon University and Vector Institute; Eric P. Xing, Petuum, Inc. and Carnegie Mellon University

Abstract: 

Machine Learning (ML) is an increasingly popular application in the cloud and data-center, inspiring new algorithmic and systems techniques that leverage unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic, or offer programming models which are incompatible with the techniques employed by high-performance ML applications.

Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes --- stateful workers, model scheduling, and relaxed consistency --- which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently enables elasticity and low-overhead execution.

We implement several popular ML applications using Litz, and show that they can scale in and out quickly to adapt to changing resource availability, as well as how a scheduler can leverage elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, achieving competitive performance with state-of-the-art non-elastic ML frameworks.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {216041,
author = {Aurick Qiao and Abutalib Aghayev and Weiren Yu and Haoyang Chen and Qirong Ho and Garth A. Gibson and Eric P. Xing},
title = {Litz: Elastic Framework for High-Performance Distributed Machine Learning},
booktitle = {2018 {USENIX} Annual Technical Conference ({USENIX} {ATC} 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Boston, MA},
pages = {631--644},
url = {https://www.usenix.org/conference/atc18/presentation/qiao},
publisher = {{USENIX} Association},
month = jul,
}

Presentation Audio