Enabling AI Network Cross-Layer Design and Operations with Arcadia: A Simulation Platform at Scale

Zhaodong Wang, Satyajeet Singh Ahuja, Xu Zhang, Yuhui Zhang, Max Noormohammadpour, Gregory R. Steinbrecher, Thomas Fuller, Xin Liu, Kevin Quirk, Mikel Jimenez Fernandez, Abhinav Triguna, Yan Cai, and Steve Politis, Meta Platforms; Petr Lapukhov and Naader Hasani, Nvidia; Ying Zhang, Meta Platforms

The rapid evolution of Artificial Intelligence (AI) technology is fueling significant investments by hyperscalers, making AI networks crucial for large-scale training. Understanding how network design choices affect AI training requires systematic, cross-layer evaluation. Production experience highlights the need for a robust simulation platform to guide network design and operations. This paper defines the platform requirements, addresses complex design challenges, and shares our experience building Arcadia, a scalable, high-fidelity simulation platform for AI networks. Arcadia operates at the cluster level, focusing on overall cluster performance rather than individual job performance. Using fast-forwarding, lock-free, and synchronization-cost-reduction mechanisms, Arcadia achieves the scalability and speed needed to faithfully simulate real-world-scale training clusters, and it plays an important role in guiding Meta's AI network evolution.

NSDI '26 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {316720,
author = {Zhaodong Wang and Satyajeet Singh Ahuja and Xu Zhang and Yuhui Zhang and Max Noormohammadpour and Gregory R. Steinbrecher and Thomas Fuller and Xin Liu and Kevin Quirk and Mikel Jimenez Fernandez and Abhinav Triguna and Yan Cai and Steve Politis and Petr Lapukhov and Naader Hasani and Ying Zhang},
title = {Enabling {AI} Network {Cross-Layer} Design and Operations with Arcadia: A Simulation Platform at Scale},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1791--1807},
url = {https://www.usenix.org/conference/nsdi26/presentation/wang-zhaodong},
publisher = {USENIX Association},
month = may
}
