Gandiva: Introspective Cluster Scheduling for Deep Learning

Wencong Xiao; Romil Bhardwaj; Ramachandran Ramjee; Muthian Sivathanu; Nipun Kwatra; Zhenhua Han; Pratyush Patel; Xuan Peng; Hanyu Zhao; Quanlu Zhang; Fan Yang; Lidong Zhou

Wencong Xiao, Beihang University & Microsoft Research; Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, and Nipun Kwatra, Microsoft Research; Zhenhua Han, The University of Hong Kong & Microsoft Research; Pratyush Patel, Microsoft Research; Xuan Peng, Huazhong University of Science and Technology & Microsoft Research; Hanyu Zhao, Peking University & Microsoft Research; Quanlu Zhang, Fan Yang, and Lidong Zhou, Microsoft Research

We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve best-fit a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, as they perform numerous repetitive iterations called mini-batch iterations. Gandiva exploits intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low-latency. This predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and micro-benchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and achieves better utilization by transparently migrating and time-slicing jobs to achieve better job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@inproceedings {222611,
author = {Wencong Xiao and Romil Bhardwaj and Ramachandran Ramjee and Muthian Sivathanu and Nipun Kwatra and Zhenhua Han and Pratyush Patel and Xuan Peng and Hanyu Zhao and Quanlu Zhang and Fan Yang and Lidong Zhou},
title = {Gandiva: Introspective Cluster Scheduling for Deep Learning},
booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
year = {2018},
isbn = {978-1-939133-08-3},
address = {Carlsbad, CA},
pages = {595--610},
url = {https://www.usenix.org/conference/osdi18/presentation/xiao},
publisher = {USENIX Association},
month = oct
}

Download

Xiao PDF

View the slides

Presentation Audio

Download Audio