ModelKeeper: Accelerating DNN Training via Automated Training Warmup

Website Maintenance Alert

Due to scheduled maintenance, the USENIX website will not be available on Saturday, April 13, from 12:00 am–12:30 am Pacific Daylight Time (UTC-7). We apologize for the inconvenience.

If you are trying to register for NSDI '24 or register for PEPR '24, please complete your registration before or after this time period.

Authors: 

Fan Lai, Yinwei Dai, Harsha V. Madhyastha, and Mosharaf Chowdhury, University of Michigan

Abstract: 

With growing deployment of machine learning (ML) models, ML developers are training or re-training increasingly more deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirement while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often result in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities.

We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously-trained models in a shared cluster. Our key insight is that initializing a training job's model by transforming an already-trained model's weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×–4.3× faster training completion with little overhead and no reduction in model accuracy.

NSDI '23 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

This content is available to:

BibTeX
@inproceedings {285169,
author = {Fan Lai and Yinwei Dai and Harsha V. Madhyastha and Mosharaf Chowdhury},
title = {{ModelKeeper}: Accelerating {DNN} Training via Automated Training Warmup},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {769--785},
url = {https://www.usenix.org/conference/nsdi23/presentation/lai-fan},
publisher = {USENIX Association},
month = apr
}
Lai Paper (Prepublication) PDF

Presentation Video