TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

Weiyang Wang; Moein Khazraee; Zhizhen Zhong; Manya Ghobadi; Zhihao Jia; Dheevatsa Mudigere; Ying Zhang; Anthony Kewitsch

Weiyang Wang, Moein Khazraee, Zhizhen Zhong, and Manya Ghobadi, Massachusetts Institute of Technology; Zhihao Jia, Meta and CMU; Dheevatsa Mudigere and Ying Zhang, Meta; Anthony Kewitsch, Telescent

We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training workloads. TopoOpt co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. We demonstrate the mutability of AllReduce traffic, and leverage this property to construct efficient network topologies for DNN training jobs. TopoOpt then uses an alternating optimization technique and a group theory-inspired algorithm called TotientPerms to find the best network topology and routing plan, together with a parallelization strategy. We build a fully functional 12-node direct-connect prototype with remote direct memory access (RDMA) forwarding at 100 Gbps. Large-scale simulations on real distributed training models show that, compared to similar-cost Fat-Tree interconnects, TopoOpt reduces DNN training time by up to 3.4x.
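
The abstract does not spell out how TotientPerms works, so the following is only a minimal sketch of the number-theoretic idea the name suggests, not the authors' algorithm. It assumes that "mutability" of AllReduce traffic means a ring-AllReduce over n nodes can use any node ordering, and that stepping through the nodes with a stride p coprime to n yields one such valid ring. The function names `coprime_strides` and `ring_order` are hypothetical, and n = 12 is chosen only to match the size of the prototype.

```python
from math import gcd

def coprime_strides(n):
    """Strides p in [1, n) with gcd(p, n) == 1 (an illustrative assumption)."""
    return [p for p in range(1, n) if gcd(p, n) == 1]

def ring_order(n, p):
    """Order in which nodes are visited when starting at node 0 and stepping by p mod n."""
    return [(i * p) % n for i in range(n)]

n = 12  # matches the prototype size mentioned in the abstract
strides = coprime_strides(n)
rings = {p: ring_order(n, p) for p in strides}

for p, order in rings.items():
    # A stride coprime to n visits every node exactly once before returning to node 0.
    assert sorted(order) == list(range(n))
    print(f"stride {p}: {order}")
```

The number of such strides equals Euler's totient of n. Each stride gives a different set of neighbor links for the same logical AllReduce, which is the kind of freedom a topology optimizer could exploit when deciding which direct links to build.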

BibTeX

@inproceedings {285119,
author = {Weiyang Wang and Moein Khazraee and Zhizhen Zhong and Manya Ghobadi and Zhihao Jia and Dheevatsa Mudigere and Ying Zhang and Anthony Kewitsch},
title = {{TopoOpt}: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {739--767},
url = {https://www.usenix.org/conference/nsdi23/presentation/wang-weiyang},
publisher = {USENIX Association},
month = apr
}
