SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

Authors: 

Redwan Ibne Seraj Khan and Ahmad Hossein Yazdani, Virginia Tech; Yuqi Fu, University of Virginia; Arnab K. Paul, BITS Pilani; Bo Ji and Xun Jian, Virginia Tech; Yue Cheng, University of Virginia; Ali R. Butt, Virginia Tech

Abstract: 

Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive because data samples must be fetched continuously from remote storage. Accelerators such as GPUs have been used extensively to support these applications, but as accelerators become more powerful and more data-hungry, I/O performance increasingly lags behind. This creates a crucial performance bottleneck, especially in distributed DLT. At the same time, exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today's DLT frameworks typically use a random sampling policy that treats all samples as equally important, recent findings indicate that not all samples are equally important: different data samples contribute differently to improving a model's accuracy. This observation creates an opportunity for DLT I/O optimizations that exploit the data locality enabled by importance sampling.

To this end, we design and implement SHADE, a new DLT-aware caching system that detects fine-grained importance variations at the per-sample level and leverages this variance to make informed caching decisions for a distributed DLT job. SHADE adopts a novel, rank-based approach that captures the relative importance of data samples across different minibatches, and it dynamically updates the importance scores of all samples during training. With these techniques, SHADE significantly improves the cache hit ratio of the DLT job, and thus its training performance. Evaluation with representative computer vision (CV) models shows that, with a small cache, SHADE improves the cache hit ratio by up to 4.5x compared to the LRU caching policy.
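To make the caching idea above concrete, the sketch below is a minimal importance-aware cache in Python. It is not SHADE's actual implementation; the class and method names (ImportanceAwareCache, update_scores, get, put) are illustrative assumptions. Per-sample losses from each minibatch are converted into normalized ranks, and on insertion the cache evicts its currently least-important sample only if the incoming sample scores higher.

class ImportanceAwareCache:
    """A minimal, illustrative importance-aware sample cache (not SHADE's
    actual implementation): evict the least-important cached sample instead
    of the least recently used one, and refresh importance scores from
    per-sample training losses as minibatches are processed."""

    def __init__(self, capacity):
        assert capacity > 0
        self.capacity = capacity
        self.data = {}    # sample_id -> cached sample (e.g., decoded tensor)
        self.scores = {}  # sample_id -> current relative importance score

    def update_scores(self, sample_ids, per_sample_losses):
        # Rank samples within the minibatch by loss and use the normalized
        # rank as the importance score, so scores remain comparable across
        # minibatches even when loss magnitudes differ.
        order = sorted(range(len(sample_ids)), key=lambda i: per_sample_losses[i])
        for rank, idx in enumerate(order):
            self.scores[sample_ids[idx]] = (rank + 1) / len(sample_ids)

    def get(self, sample_id):
        # Returns the cached sample, or None on a cache miss.
        return self.data.get(sample_id)

    def put(self, sample_id, sample):
        if sample_id in self.data:
            self.data[sample_id] = sample
            return
        if len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda k: self.scores.get(k, 0.0))
            # Admit the new sample only if it outranks the least-important one.
            if self.scores.get(sample_id, 0.0) <= self.scores.get(victim, 0.0):
                return
            del self.data[victim]
        self.data[sample_id] = sample

In a training loop, one would call update_scores with the minibatch's sample IDs and per-sample losses after each forward pass, and route sample reads through get/put; under this policy, consistently high-loss (high-importance) samples tend to stay cached, which is the locality the abstract refers to.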

BibTeX
@inproceedings {285774,
author = {Redwan Ibne Seraj Khan and Ahmad Hossein Yazdani and Yuqi Fu and Arnab K. Paul and Bo Ji and Xun Jian and Yue Cheng and Ali R. Butt},
title = {{SHADE}: Enable Fundamental Cacheability for Distributed Deep Learning Training},
booktitle = {21st USENIX Conference on File and Storage Technologies (FAST 23)},
year = {2023},
isbn = {978-1-939133-32-8},
address = {Santa Clara, CA},
pages = {135--152},
url = {https://www.usenix.org/conference/fast23/presentation/khan},
publisher = {USENIX Association},
month = feb
}
