Yuhang Zhou, Zibo Wang, Zhibin Wang, Ruyi Zhang, Chen Tian, Xiaoliang Wang, Wanchun Dou, and Guihai Chen, Nanjing University; Bingqiang Wang, Yonghong Tian, Yan Zhang, and Hui Wang, Peng Cheng Laboratory; Fuchun Wei, Boquan Sun, Jingyi Zhang, Bin She, Teng Su, Yifan Yao, Chunsheng Li, Ziyang Zhang, and Yaoyuan Wang, Huawei; Bin Zhou, Shandong University; Guyue Liu, Peking University
Training large-scale deep learning (DL) models is resource-intensive and time-consuming, and optimizing training efficiency poses significant challenges. Sporadic performance fluctuations during long training runs call for advanced profiling capabilities. Comprehensive and accurate bottleneck analysis is difficult amid the many influencing factors. Selecting effective optimization strategies without proper guidance complicates the process further. This paper shares our practical insights on optimizing training on Huawei Ascend chips, based on three years of experience with 135 typical cases. We propose Hermes, a systematic optimization system comprising a lightweight profiling approach, a hierarchical bottleneck analysis framework, and an optimization advisor. Our real-world experiments demonstrate significant training acceleration for PanGu-α, MobileNetV1, and MoE (Mixture of Experts) models, with speedups of 3.05×, 1.91×, and 1.19×, respectively.

author = {Yuhang Zhou and Zibo Wang and Zhibin Wang and Ruyi Zhang and Chen Tian and Xiaoliang Wang and Wanchun Dou and Guihai Chen and Bingqiang Wang and Yonghong Tian and Yan Zhang and Hui Wang and Fuchun Wei and Boquan Sun and Jingyi Zhang and Bin She and Teng Su and Yifan Yao and Chunsheng Li and Ziyang Zhang and Yaoyuan Wang and Bin Zhou and Guyue Liu},
title = {Accelerating Model Training on Ascend Chips: An Industrial System for Profiling, Analysis and Optimization},
booktitle = {2025 USENIX Annual Technical Conference (USENIX ATC 25)},
year = {2025},
isbn = {978-1-939133-48-9},
address = {Boston, MA},
pages = {1387--1408},
url = {https://www.usenix.org/conference/atc25/presentation/zhou},
publisher = {USENIX Association},
month = jul
}