Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation

Jianxing Qin; Jingrong Chen; Xinhao Kong; Yongji Wu; Tianjun Yuan; Liang Luo; Zhaodong Wang; Ying Zhang; Tingjun Chen; Alvin R. Lebeck; Danyang Zhuo

Jianxing Qin, Duke University; Jingrong Chen, Uber; Xinhao Kong, NVIDIA; Yongji Wu, University of California, Berkeley; Tianjun Yuan, Duke University; Liang Luo, Zhaodong Wang, and Ying Zhang, Meta; Tingjun Chen, Alvin R. Lebeck, and Danyang Zhuo, Duke University

Modern machine learning (ML) training workloads place substantial demands on both computational and communication resources. Consequently, accurate performance estimation has become increasingly critical for guiding system design decisions, such as the selection of parallelization strategies, cluster configurations, and hardware provisioning. Existing simulation-based performance estimation requires reimplementing the ML framework in a simulator, which demands significant manual effort and is hard to maintain as ML frameworks evolve rapidly.

This paper introduces Phantora, a hybrid GPU cluster simulator designed for performance estimation of ML training workloads. Phantora executes unmodified ML frameworks within a distributed, containerized environment. Each container emulates the behavior of a GPU server in a large-scale cluster, while Phantora intercepts and simulates GPU- and communication-related operations to provide high-fidelity performance estimation. We call this approach hybrid simulation of ML systems, in contrast to traditional methods that simulate static workloads. The primary advantage of hybrid simulation is that it allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Our evaluation shows that Phantora provides accuracy comparable to static workload simulation while supporting three state-of-the-art LLM training frameworks out of the box. In addition, Phantora runs on a single GPU, eliminating the resource-intensive trace collection and workload extraction steps required by traditional trace-based simulators. Phantora is open-sourced at https://github.com/QDelta/Phantora.
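
The interception idea described above can be illustrated with a toy sketch. This is not Phantora's code or API: the `SimulatedDevice` class, the `all_reduce` and `training_step` functions, and all durations, bandwidths, and latencies are hypothetical, and the communication cost uses the textbook ring all-reduce latency-bandwidth model as a stand-in for a real network simulator. The point is only that the "framework" code runs unchanged while GPU and communication calls advance a virtual clock instead of doing real work.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedDevice:
    """Hypothetical stand-in for an intercepted GPU: tracks simulated time only."""
    clock_us: float = 0.0
    log: list = field(default_factory=list)

    def launch(self, op_name: str, est_duration_us: float) -> None:
        # Instead of running the kernel, advance the virtual clock
        # by the kernel's estimated cost (an assumed, made-up number here).
        self.clock_us += est_duration_us
        self.log.append((op_name, est_duration_us))


def all_reduce(devices, nbytes, bandwidth_bytes_per_us, latency_us):
    """Simulated collective: ranks synchronize, then pay a modeled cost.

    Assumes the standard ring all-reduce model: 2*(n-1) latency-bound steps,
    and each rank transfers 2*(n-1)/n of the buffer.
    """
    n = len(devices)
    transfer_us = 2 * (n - 1) / n * nbytes / bandwidth_bytes_per_us
    cost_us = 2 * (n - 1) * latency_us + transfer_us
    # The collective cannot start until the slowest rank arrives.
    start_us = max(d.clock_us for d in devices)
    for d in devices:
        d.clock_us = start_us + cost_us
        d.log.append(("all_reduce", cost_us))


def training_step(device, slow=False):
    """'Framework' code: it only sees the device interface, so it is reused as is."""
    device.launch("forward", 150.0 if slow else 100.0)
    device.launch("backward", 200.0)


# Four simulated ranks; rank 0 is a straggler.
devices = [SimulatedDevice() for _ in range(4)]
for rank, dev in enumerate(devices):
    training_step(dev, slow=(rank == 0))
all_reduce(devices, nbytes=4_000_000, bandwidth_bytes_per_us=10_000.0, latency_us=5.0)
step_time_us = devices[0].clock_us
```

In this sketch the estimated step time is the shared clock value after the collective, so a straggler on one rank delays every rank, which is the kind of cross-rank effect a cluster simulator needs to capture.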

BibTeX

@inproceedings {316690,
author = {Jianxing Qin and Jingrong Chen and Xinhao Kong and Yongji Wu and Tianjun Yuan and Liang Luo and Zhaodong Wang and Ying Zhang and Tingjun Chen and Alvin R. Lebeck and Danyang Zhuo},
title = {Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1809--1825},
url = {https://www.usenix.org/conference/nsdi26/presentation/qin},
publisher = {USENIX Association},
month = may
}
