Di-PS: System-Algorithm Co-Design for Asynchronous and Heterogeneous Cross-cluster LLM Training at Scale

Shengwei Li, National Key Laboratory of Parallel and Distributed Computing; Qiaoling Chen, Shanghai Artificial Intelligence Laboratory and Nanyang Technological University; Zhiquan Lai, National Key Laboratory of Parallel and Distributed Computing; Penglong Jiao, Wenwen Qu, Kun Cai, Jiaxing Li, Peng Sun, and Xingcheng Zhang, Shanghai Artificial Intelligence Laboratory; Xiaoge Deng, Dongsheng Li, and Kai Lu, National Key Laboratory of Parallel and Distributed Computing; Tianwei Zhang, Nanyang Technological University

Large language models (LLMs) have revolutionized artificial intelligence, exhibiting remarkable performance in various tasks. Training these models demands extensive computational resources, which are often economically and physically prohibitive. Cross-cluster training can balance infrastructure costs, alleviate physical and resource constraints, better match workload demands, and sustain higher efficiency through geo-distributed deployment. However, challenges arise from network variability, heterogeneous computational resources, and intrinsic training instability.

To address these issues, we present Di-PS, a novel framework for cross-cluster LLM training at scale. The core of Di-PS is the system-algorithm co-design of a parameter server paradigm to achieve heterogeneous, asynchronous, and resilient training across decentralized clusters. We make several contributions in Di-PS, including (i) an efficient parameter server design for cross-cluster communication of LLM parameters, (ii) a pseudo-gradient penalty strategy that enhances the convergence stability of asynchronous two-stage optimization, and (iii) a resilience mechanism for fault tolerance in cross-cluster training. Results from a controlled experimental setting demonstrate that Di-PS improves training efficiency by up to 4.67× over synchronous cross-cluster approaches while maintaining model quality and achieving near-linear scalability across heterogeneous training resources. Di-PS has been deployed in a production environment, involving dynamic training scales with up to 9 clusters and more than 10,000 NPUs. At this scale, Di-PS enables successful cross-cluster training of a 100B-parameter LLM with only 6% overhead compared to single-cluster training, and effectively handles frequent failures and resource changes.

Category: Operational Systems Paper

NSDI '26 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {316658,
author = {Shengwei Li and Qiaoling Chen and Zhiquan Lai and Penglong Jiao and Wenwen Qu and Kun Cai and Jiaxing Li and Peng Sun and Xingcheng Zhang and Xiaoge Deng and Dongsheng Li and Kai Lu and Tianwei Zhang},
title = {{Di-PS}: {System-Algorithm} {Co-Design} for Asynchronous and Heterogeneous Cross-cluster {LLM} Training at Scale},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {377--395},
url = {https://www.usenix.org/conference/nsdi26/presentation/li-shengwei},
publisher = {USENIX Association},
month = may
}
