Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks

Authors: 

Joshua Romero, NVIDIA, Inc.; Junqi Yin, Nouamane Laanait, Bing Xie, and M. Todd Young, Oak Ridge National Laboratory; Sean Treichler, NVIDIA, Inc.; Vitalii Starchenko and Albina Borisevich, Oak Ridge National Laboratory; Alex Sergeev, Carbon Robotics; Michael Matheson, Oak Ridge National Laboratory

Abstract: 

This work develops new techniques within Horovod, a generic communication library supporting data parallel training across deep learning frameworks. In particular, we improve the Horovod control plane by implementing a new coordination scheme that takes advantage of the characteristics of the typical data parallel training paradigm, namely the repeated execution of collectives on the gradients of a fixed set of tensors. Using a caching strategy, we execute Horovod's existing coordinator-worker logic only once during a typical training run and replace it, for the remaining training duration, with a more efficient decentralized orchestration strategy that uses the cached data and a global intersection of bitvectors. Next, we introduce a feature that lets end users explicitly group collective operations, enabling finer-grained control over communication buffer sizes. To evaluate our proposed strategies, we conduct experiments on a world-class supercomputer, Summit. Compared to Horovod's original design, we observe a 2x performance improvement at a scale of 6,000 GPUs; compared to tf.distribute and torch.DDP, we achieve 12% better and comparable performance, respectively, using up to 1,536 GPUs; compared to BytePS in typical HPC settings, we achieve about 20% better performance at a scale of 768 GPUs. Finally, we test our strategies on a scientific application (STEMDL) using up to 27,600 GPUs (the entire Summit) and show near-linear scaling of 0.93 with a sustained performance of 1.54 exaflops (standard error ± 0.02) in FP16 precision.
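The abstract describes two mechanisms: a cached, decentralized coordination scheme based on a global intersection of bitvectors, and explicit grouping of collective operations. The following is a minimal Python sketch of the bitvector idea only, written with mpi4py rather than Horovod's internal C++ control plane; the tensor names, bit assignments, and helper function are illustrative assumptions, not Horovod's actual API.

# Hypothetical sketch of the cached coordination idea: each worker keeps a
# stable mapping from tensor name to bit position (built once during the
# first, coordinator-driven step), marks locally ready gradients in a
# bitvector, and a single bitwise-AND allreduce yields the tensors that are
# ready on every worker.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Assumed identical on all workers once the cache has been populated.
tensor_bits = {"conv1.weight": 0, "conv1.bias": 1, "fc.weight": 2}

def globally_ready(locally_ready_names):
    bits = np.zeros(1, dtype=np.uint64)
    for name in locally_ready_names:
        bits[0] |= np.uint64(1) << np.uint64(tensor_bits[name])
    global_bits = np.zeros_like(bits)
    # Bitwise AND across all workers: a bit survives only if every worker set it.
    comm.Allreduce(bits, global_bits, op=MPI.BAND)
    return [name for name, bit in tensor_bits.items()
            if global_bits[0] & (np.uint64(1) << np.uint64(bit))]

For explicit grouping, recent Horovod releases expose a num_groups argument on hvd.DistributedOptimizer (introduced alongside the work described here); the snippet below is a usage sketch under that assumption, and the exact argument name may differ across Horovod versions.

# Usage sketch: fuse gradient allreduces into a fixed number of groups to
# control communication buffer sizes (assumes Horovod's num_groups argument).
import horovod.torch as hvd
import torch

hvd.init()
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Gradients are fused into 4 groups; each group is reduced once all of its
# member gradients are ready.
opt = hvd.DistributedOptimizer(opt,
                               named_parameters=model.named_parameters(),
                               num_groups=4)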


BibTeX
@inproceedings {276984,
author = {Joshua Romero and Junqi Yin and Nouamane Laanait and Bing Xie and M. Todd Young and Sean Treichler and Vitalii Starchenko and Albina Borisevich and Alex Sergeev and Michael Matheson},
title = {Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks},
booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
isbn = {978-1-939133-27-4},
address = {Renton, WA},
pages = {1027--1040},
url = {https://www.usenix.org/conference/nsdi22/presentation/romero},
publisher = {USENIX Association},
month = apr
}