Jinkun Lin, New York University; Ziheng Jiang, Zuquan Song, Sida Zhao, and Menghan Yu, ByteDance Seed; Zhanghan Wang, New York University; Chenyuan Wang, ByteDance Seed; Zuocheng Shi, Zhejiang University; Xiang Shi, ByteDance; Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, and Xin Liu, ByteDance Seed; Aurojit Panda and Jinyang Li, New York University
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance, we find that stragglers are not always simply caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?

@inproceedings{lin2025stragglers,
author = {Jinkun Lin and Ziheng Jiang and Zuquan Song and Sida Zhao and Menghan Yu and Zhanghan Wang and Chenyuan Wang and Zuocheng Shi and Xiang Shi and Wei Jia and Zherui Liu and Shuguang Wang and Haibin Lin and Xin Liu and Aurojit Panda and Jinyang Li},
title = {Understanding Stragglers in Large Model Training Using What-if Analysis},
booktitle = {19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)},
year = {2025},
isbn = {978-1-939133-47-2},
address = {Boston, MA},
pages = {483--498},
url = {https://www.usenix.org/conference/osdi25/presentation/lin-jinkun},
publisher = {USENIX Association},
month = jul
}

