SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling

Athinagoras Skiadopoulos; Mark Zhao; Swapnil Gandhi; Thomas Norrie; Shrijeet Mukherjee; Christos Kozyrakis

Athinagoras Skiadopoulos, Stanford University; Mark Zhao, University of Colorado Boulder; Swapnil Gandhi, Stanford University and NVIDIA; Thomas Norrie, OpenAI; Shrijeet Mukherjee, NVIDIA; Christos Kozyrakis, Stanford University and NVIDIA

Mixture-of-Experts (MoE) models have become a widely adopted solution for continuing to scale model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts (sparsely activated feed-forward networks) within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle the resulting load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, which degrades convergence, or frequently rebalance the resources allocated to each expert based on popularity, which incurs high state migration overheads.
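To make the token-dropping side of this tradeoff concrete, the following is a minimal sketch of capacity-limited routing. It is an illustration only: the function name `route_with_capacity`, the fixed per-expert `capacity`, and the hard-coded router choices are assumptions for this example and are not taken from the paper or from any particular framework.

```python
def route_with_capacity(assignments, num_experts, capacity):
    """Place tokens on their chosen expert until that expert is full.

    assignments: expert id picked by a (hypothetical) router, one per token.
    Returns (kept, dropped): kept maps expert id -> token indices it processes,
    dropped lists the token indices that overflowed a full expert.
    """
    kept = {expert: [] for expert in range(num_experts)}
    dropped = []
    for token_idx, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(token_idx)
        else:
            # The expert is over capacity, so this token is skipped.
            dropped.append(token_idx)
    return kept, dropped


# A skewed batch of 8 tokens in which expert 0 is far more popular.
assignments = [0, 0, 0, 0, 0, 1, 2, 0]
kept, dropped = route_with_capacity(assignments, num_experts=4, capacity=2)
print("kept:", kept)
print("dropped:", dropped)
```

In this toy batch, half of the tokens are dropped while expert 3 sits idle, which is the imbalance that motivates rebalancing resources instead of dropping tokens.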

To break this performance-accuracy tradeoff, we introduce SYMI, an adaptive MoE training system. The key insight of SYMI is to decouple the placement of expert parameters from their large optimizer state. SYMI statically partitions the optimizer state of each expert across all training nodes. Meanwhile, SYMI dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SYMI right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overhead. Compared to the state-of-the-art MoE training systems DeepSpeed and FlexMoE, SYMI achieves 30.5% and 25.9% faster time-to-convergence, respectively.
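The decoupling idea can be pictured with a small toy model. The sketch below is not SYMI's actual algorithm, which the abstract does not specify. It assumes a hypothetical greedy policy that gives every expert one GPU slot and hands each remaining slot to the expert with the highest token load per replica, while the optimizer shard layout is computed once and never changes. All names (`static_optimizer_shards`, `replicas_for_iteration`, `num_slots`) are invented for illustration.

```python
def static_optimizer_shards(num_experts, num_nodes):
    """Fixed layout: every node owns one shard of every expert's optimizer state.

    Returns node id -> list of (expert id, shard id) pairs. The layout does not
    depend on expert popularity, so it never has to migrate.
    """
    return {
        node: [(expert, node) for expert in range(num_experts)]
        for node in range(num_nodes)
    }


def replicas_for_iteration(token_counts, num_slots):
    """Assumed greedy policy: pick how many parameter replicas each expert gets."""
    num_experts = len(token_counts)
    if num_slots < num_experts:
        raise ValueError("need at least one slot per expert")
    replicas = [1] * num_experts
    for _ in range(num_slots - num_experts):
        # The expert with the highest tokens-per-replica gets the next slot.
        busiest = max(range(num_experts),
                      key=lambda e: token_counts[e] / replicas[e])
        replicas[busiest] += 1
    return replicas


shards = static_optimizer_shards(num_experts=4, num_nodes=2)

# Expert popularity shifts between two iterations; only the replica counts move.
placement_iter1 = replicas_for_iteration([600, 200, 100, 100], num_slots=8)
placement_iter2 = replicas_for_iteration([100, 100, 600, 200], num_slots=8)
print(placement_iter1, placement_iter2)
```

The point of the sketch is the separation of concerns: `shards` is built once, whereas the replica counts are recomputed whenever the token distribution changes.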

BibTeX

@inproceedings {316086,
author = {Athinagoras Skiadopoulos and Mark Zhao and Swapnil Gandhi and Thomas Norrie and Shrijeet Mukherjee and Christos Kozyrakis},
title = {{SYMI}: Efficient {Mixture-of-Experts} Training via Model and Optimizer State Decoupling},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {75--92},
url = {https://www.usenix.org/conference/nsdi26/presentation/skiadopoulos},
publisher = {USENIX Association},
month = may
}
