Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation

Fei Gui; Kaihui Gao; Li Chen; Dan Li; Vincent Liu; Ran Zhang; Hongbing Yang; Dian Xiong

Fei Gui, Tsinghua University, BNRist, and Tsinghua Shenzhen International Graduate School; Kaihui Gao and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Vincent Liu, University of Pennsylvania; Ran Zhang and Hongbing Yang, Zhongguancun Laboratory; Dian Xiong, Tsinghua University

The rapid expansion of large language models (LLMs) requires the development of extensive GPU clusters, and companies now deploy clusters of tens to hundreds of thousands of GPUs. This growth significantly expands the design space for LLM training systems, requiring thorough exploration of parallelization strategies, communication parameters, congestion control, fabric topology, and more. Current methods require up to 10k simulation experiments to identify optimal configurations, and inadequate exploration leads to significant degradation of training performance.

In this paper, we tackle the overlooked problem of efficiently conducting parallel simulation experiments for design space exploration. Our analysis and experiments show that Single-process Multi-experiment (SPME) achieves superior performance by reducing scheduling overhead and optimizing resource utilization, yet remains insufficient for current AI cluster scales. To enhance SPME’s efficacy, we introduce Multiverse, a novel GPU-based AI training simulator. Multiverse leverages the computing throughput of GPUs efficiently with optimizations such as pull-based synchronization, high-fidelity intra-server communication, and a kernel-fusion technique. Extensive experiments validate the accuracy and efficiency of Multiverse, demonstrating less than 3.0% discrepancy with real-world LLM training on clusters of up to 54,000 GPUs, and achieving a 43.1-73.2x speedup over state-of-the-art CPU-based simulators in various use cases.

BibTeX

@inproceedings {305961,
author = {Fei Gui and Kaihui Gao and Li Chen and Dan Li and Vincent Liu and Ran Zhang and Hongbing Yang and Dian Xiong},
title = {Accelerating Design Space Exploration for {LLM} Training Systems with Multi-experiment Parallel Simulation},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {473--488},
url = {https://www.usenix.org/conference/nsdi25/presentation/gui},
publisher = {USENIX Association},
month = apr
}
