Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Myeongjae Jeon; Shivaram Venkataraman; Amar Phanishayee; Junjie Qian; Wencong Xiao; Fan Yang

Authors:

Myeongjae Jeon, UNIST and Microsoft Research; Shivaram Venkataraman, University of Wisconsin and Microsoft Research; Amar Phanishayee and Junjie Qian, Microsoft Research; Wencong Xiao, Beihang University and Microsoft Research; Fan Yang, Microsoft Research

Abstract:

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across many of their products. These models are typically trained on shared, multi-tenant GPU clusters. As with existing cluster computing workloads, scheduling frameworks aim to provide features such as high efficiency, resource isolation, and fair sharing across users. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, which reduces the flexibility of scheduling and makes the jobs themselves inelastic to failures at runtime. In this paper, we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster at Microsoft. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.
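
As an illustration of the gang scheduling constraint mentioned in the abstract, the following is a minimal sketch, not the scheduler studied in the paper. It assumes a strict FIFO queue and a single pool of interchangeable GPUs; the `Job` class, its fields, and the `gang_schedule` function are hypothetical names introduced here. A job starts only when all of the GPUs it requests are free at the same time, so partially satisfiable jobs wait rather than run on a subset.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record used only for this illustration.
    name: str
    gpus_needed: int  # GPUs the job must hold simultaneously


def gang_schedule(queue, free_gpus):
    """Admit jobs in FIFO order with all-or-nothing GPU allocation.

    A job starts only if every GPU it needs is free at once; there are
    no partial allocations. Strict FIFO is an assumption of this sketch:
    the first job that does not fit blocks the jobs behind it.
    """
    started = []
    waiting = list(queue)
    while waiting and waiting[0].gpus_needed <= free_gpus:
        job = waiting.pop(0)
        free_gpus -= job.gpus_needed
        started.append(job.name)
    return started, [j.name for j in waiting], free_gpus


jobs = [Job("a", 4), Job("b", 8), Job("c", 2)]
started, waiting, remaining = gang_schedule(jobs, free_gpus=8)
```

In this toy run, job `b` cannot start with only 4 GPUs left, and under the strict FIFO assumption it also holds back the smaller job `c` even though `c` would fit, which is one simple way queuing delay can arise from all-or-nothing allocation.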

View Trace Repository on GitHub: https://github.com/msr-fiddle/philly-traces

BibTeX

@inproceedings {234916,
author = {Myeongjae Jeon and Shivaram Venkataraman and Amar Phanishayee and Junjie Qian and Wencong Xiao and Fan Yang},
title = {Analysis of {Large-Scale} {Multi-Tenant} {GPU} Clusters for {DNN} Training Workloads},
booktitle = {2019 USENIX Annual Technical Conference (USENIX ATC 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {947--960},
url = {https://www.usenix.org/conference/atc19/presentation/jeon},
publisher = {USENIX Association},
month = jul
}
