Low-latency Job Scheduling with Preemption for the Development of Deep Learning

Authors: 

Hidehito Yabuuchi, The University of Tokyo; Daisuke Taniwaki and Shingo Omura, Preferred Networks, Inc.

Abstract: 

Efficient scheduling of trial-and-error (TE) jobs is a challenging problem in deep learning projects. Existing job schedulers either do not balance the scheduling of TE and best-effort (BE) jobs well, or handle the mixture only in limited situations. To fill this gap, we present an algorithm that efficiently schedules both TE and BE jobs by selectively preempting the BE jobs that can later be resumed without much delay. In a simulation study with synthetic workloads, we reduced the 95th percentile of the slowdown rates for TE jobs under the standard FIFO strategy by 96.6%, while increasing the median of the BE slowdown rates by only 18.0% and their 95th percentile by only 23.9%.
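
To make the core idea concrete, below is a minimal sketch in Python of a scheduler that, when a TE job cannot fit, preempts BE jobs ordered by their estimated resume delay. This is an illustration only, not the authors' implementation or API; all names (Job, Cluster, schedule, resume_delay) are hypothetical assumptions for this sketch.

from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int                  # GPUs requested
    is_te: bool                # True for trial-and-error, False for best-effort
    resume_delay: float = 0.0  # hypothetical estimate of time lost if preempted and resumed


class Cluster:
    def __init__(self, total_gpus: int):
        self.total_gpus = total_gpus
        self.running: list[Job] = []

    def free_gpus(self) -> int:
        return self.total_gpus - sum(j.gpus for j in self.running)

    def schedule(self, job: Job) -> list[Job]:
        """Place `job`; if it is a TE job that does not fit, preempt BE jobs
        with the smallest estimated resume delay until it does.
        Returns the list of preempted jobs (to be re-queued)."""
        preempted: list[Job] = []
        if job.is_te and self.free_gpus() < job.gpus:
            # Only BE jobs are preemption victims; cheapest-to-resume first.
            victims = sorted(
                (j for j in self.running if not j.is_te),
                key=lambda j: j.resume_delay,
            )
            for victim in victims:
                if self.free_gpus() >= job.gpus:
                    break
                self.running.remove(victim)
                preempted.append(victim)
        if self.free_gpus() >= job.gpus:
            self.running.append(job)
        return preempted


if __name__ == "__main__":
    cluster = Cluster(total_gpus=8)
    cluster.schedule(Job("be-train-1", gpus=4, is_te=False, resume_delay=120.0))
    cluster.schedule(Job("be-train-2", gpus=4, is_te=False, resume_delay=15.0))
    evicted = cluster.schedule(Job("te-experiment", gpus=4, is_te=True))
    print([j.name for j in evicted])          # ['be-train-2'] -- cheapest to resume
    print([j.name for j in cluster.running])  # ['be-train-1', 'te-experiment']

In this toy example, the TE job evicts the BE job whose estimated resume delay is smallest, which mirrors the selective-preemption idea described in the abstract.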

BibTeX
@inproceedings{232969,
  author    = {Hidehito Yabuuchi and Daisuke Taniwaki and Shingo Omura},
  title     = {Low-latency Job Scheduling with Preemption for the Development of Deep Learning},
  booktitle = {2019 USENIX Conference on Operational Machine Learning (OpML 19)},
  year      = {2019},
  isbn      = {978-1-939133-00-7},
  address   = {Santa Clara, CA},
  pages     = {27--30},
  url       = {https://www.usenix.org/conference/opml19/presentation/yabuuchi},
  publisher = {USENIX Association},
  month     = may
}