Elastic Resource Sharing for Distributed Deep Learning

Authors: 

Changho Hwang and Taehyun Kim, KAIST; Sunghyun Kim, MIT; Jinwoo Shin and KyoungSoo Park, KAIST

Abstract: 

Resource allocation and scheduling strategies for deep learning training (DLT) jobs have a critical impact on their average job completion time (JCT). Unfortunately, traditional algorithms such as Shortest-Remaining-Time-First (SRTF) often perform poorly for DLT jobs. This is because blindly prioritizing only the short jobs is suboptimal and job-level resource preemption is too coarse-grained for effective mitigation of head-of-line blocking.

We investigate the algorithms that accelerate DLT jobs. Our analysis finds that (1) resource efficiency often matters more than short job prioritization and (2) applying greedy algorithms to existing jobs inflates average JCT due to overly optimistic views toward future resource availability. Inspired by these findings, we propose Apathetic Future Share (AFS) that balances resource efficiency and short job prioritization while curbing unrealistic optimism in resource allocation. To bring the algorithmic benefits into practice, we also build CoDDL, a DLT system framework that transparently handles automatic job parallelization and efficiently performs frequent share re-adjustments. Our evaluation shows that AFS outperforms Themis, SRTF, and Tiresias-L in terms of average JCT by up to 2.2x, 2.7x, and 3.1x, respectively.

NSDI '21 Open Access Sponsored by NetApp

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {265013,
author = {Changho Hwang and Taehyun Kim and Sunghyun Kim and Jinwoo Shin and KyoungSoo Park},
title = {Elastic Resource Sharing for Distributed Deep Learning},
booktitle = {18th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 21)},
year = {2021},
isbn = {978-1-939133-21-2},
pages = {721--739},
url = {https://www.usenix.org/conference/nsdi21/presentation/hwang},
publisher = {{USENIX} Association},
month = apr,
}

Presentation Video