SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization


Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai, Tsinghua University


Deep neural networks keep growing in size to gain stronger model ability, and training them consumes enormous computational resources. Sparsely activated models have been increasingly proposed and deployed to reduce training costs while enlarging model size. Unfortunately, previous auto-parallelization approaches designed for dense neural networks can hardly be applied to these sparse models, because sparse models are data-sensitive and have barely been considered by prior work.

To address these challenges, we propose SmartMoE, which automatically performs distributed training for sparsely activated models. We find optimization opportunities in an enlarged space of hybrid parallelism that takes the workload of data-sensitive models into account. The space is decomposed into static pools, which are constructed offline, and choices within a pool, which are made online. To construct an optimal pool ahead of training, we introduce a data-sensitive prediction method for performance modeling. An efficient search algorithm enables dynamic runtime selection of the optimal parallel strategy. We evaluate SmartMoE on three platforms with up to 64 GPUs. It achieves up to 1.88x speedup in end-to-end training over the state-of-the-art MoE model training system FasterMoE.
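
The offline/online split described above can be pictured with a small toy sketch. It is not SmartMoE's implementation: the placement encoding, the `POOL` contents, the function names, and the cost model (the busiest device bounds the step time) are all assumptions made only for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical encoding: placement[e] is the device index that hosts expert e.
Placement = Tuple[int, ...]

# Assumed to be the result of an offline search done before training;
# here it is simply a hand-written pool of candidate expert placements.
POOL: Sequence[Placement] = [
    (0, 0, 1, 1),
    (0, 1, 0, 1),
    (0, 1, 1, 0),
]


def estimated_cost(placement: Placement,
                   tokens_per_expert: Sequence[int],
                   num_devices: int) -> int:
    """Toy cost model (an assumption): the most loaded device sets the step time."""
    load = [0] * num_devices
    for expert, device in enumerate(placement):
        load[device] += tokens_per_expert[expert]
    return max(load)


def select_online(pool: Sequence[Placement],
                  tokens_per_expert: Sequence[int],
                  num_devices: int) -> Placement:
    """Pick the cheapest placement in the pool for the currently observed workload."""
    return min(pool, key=lambda p: estimated_cost(p, tokens_per_expert, num_devices))


if __name__ == "__main__":
    # The expert workload shifts during training, so the best choice changes.
    print(select_online(POOL, [100, 90, 10, 5], num_devices=2))
    print(select_online(POOL, [10, 100, 90, 5], num_devices=2))
```

The point of the sketch is only the division of labor: the expensive enumeration happens once ahead of training, while the per-step decision is a cheap lookup over a small pool driven by the observed token distribution.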

USENIX ATC '23 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

@inproceedings {288691,
author = {Mingshu Zhai and Jiaao He and Zixuan Ma and Zan Zong and Runqing Zhang and Jidong Zhai},
title = {{SmartMoE}: Efficiently Training {Sparsely-Activated} Models through Combining Offline and Online Parallelization},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
isbn = {978-1-939133-35-9},
address = {Boston, MA},
pages = {961--975},
url = {},
publisher = {USENIX Association},
month = jul
}
