Jayashree Mohan, UT Austin; Amar Phanishayee, Microsoft Research; Vijay Chidambaram, UT Austin and VMware research
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation at the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure, therefore results in the loss of several hours worth of GPU work on recovery.
We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a light-weight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
FAST '21 Open Access Sponsored by NetApp
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Jayashree Mohan and Amar Phanishayee and Vijay Chidambaram},
title = {{CheckFreq}: Frequent, {Fine-Grained} {DNN} Checkpointing},
booktitle = {19th USENIX Conference on File and Storage Technologies (FAST 21)},
year = {2021},
isbn = {978-1-939133-20-5},
pages = {203--216},
url = {https://www.usenix.org/conference/fast21/presentation/mohan},
publisher = {USENIX Association},
month = feb
}