SwiftEP: Accelerating MoE Inference with Buffer Fusion and TMA Offloading

Xingyi Li; Yadong Liu; Xiaojie Huang; Yiran Zhang; Shuai Wang; Shangguang Wang; Zhehao Lin; Yinben Xia; Chang Yu; Qihang Liu; Xuan Zhang; Hao Lu; Xiang Li; Zekun He; Yachen Wang; Xianneng Zou

Xingyi Li, unaffiliated; Yadong Liu and Xiaojie Huang, Tencent; Yiran Zhang, Shuai Wang, and Shangguang Wang, unaffiliated; Zhehao Lin and Yinben Xia, Tencent; Chang Yu, Nanjing University; Qihang Liu, Xuan Zhang, Hao Lu, Xiang Li, Zekun He, Yachen Wang, and Xianneng Zou, Tencent

Large Language Models (LLMs) increasingly rely on Mixture-of-Experts (MoE) architectures to scale computation efficiently. Expert Parallelism (EP), which distributes experts across GPUs, introduces all-to-all communication overhead during the dispatch and combine phases. This overhead is most pronounced in the prefill stage, which dominates inference performance. Existing communication libraries, such as DeepEP, suffer from excessive GPU SM utilization and underutilized interconnect bandwidth, limiting prefill performance.

In this paper, we identify two root causes: redundant buffer copies and inefficient intra-server transfers over NVLink. To address these, we propose SwiftEP, an all-to-all communication library tailored for MoE prefill, combining buffer fusion and Tensor Memory Accelerator (TMA) offloading. Buffer fusion eliminates redundant staging copies, enabling true zero-copy communication, while TMA offloading maximizes NVLink utilization and supports efficient multicast/reduce operations. SwiftEP further incorporates RDMA scatter-gather lists, QP transmission parallelization, and CUDA IPC to handle dynamic token placement and inter-GPU memory access. Evaluation on 16- and 32-GPU clusters shows that SwiftEP achieves up to 119.7% higher algorithm bandwidth, reduces SM occupancy by up to 66.7%, and improves request serving capacity by 21.2% compared to DeepEP.
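For readers unfamiliar with the dispatch and combine phases mentioned in the abstract, the following is a toy, single-process sketch of the data movement they imply. It is not SwiftEP or DeepEP code: the function names (`dispatch`, `run_experts`, `combine`), the routing format, and the stand-in expert function are all hypothetical, and the per-rank Python lists merely stand in for the buffers that a real library would exchange over NVLink or RDMA.

```python
from collections import defaultdict

def dispatch(tokens, routes, experts_per_rank):
    """Group each (token, expert) assignment by the rank that hosts the expert.

    routes[t] is a list of (expert_id, gate_weight) pairs for token t.
    The returned dict maps rank -> list of (token_id, expert_id, weight, vector),
    which models the per-destination send buffers of an all-to-all.
    """
    send = defaultdict(list)
    for tid, (vec, route) in enumerate(zip(tokens, routes)):
        for expert, weight in route:
            rank = expert // experts_per_rank  # assumed contiguous expert placement
            send[rank].append((tid, expert, weight, vec))
    return send

def run_experts(inbox, expert_fn):
    """Apply the local experts to every token a rank received."""
    return [(tid, weight, expert_fn(expert, vec))
            for tid, expert, weight, vec in inbox]

def combine(outboxes, num_tokens, dim):
    """Return expert outputs to their tokens and sum them with gate weights."""
    out = [[0.0] * dim for _ in range(num_tokens)]
    for outbox in outboxes:
        for tid, weight, vec in outbox:
            for i, x in enumerate(vec):
                out[tid][i] += weight * x
    return out

# Hypothetical stand-in for an expert network: scale the input by (expert_id + 1).
def toy_expert(expert, vec):
    return [(expert + 1) * x for x in vec]

tokens = [[1.0, 2.0], [3.0, 4.0]]
routes = [[(0, 0.5), (3, 0.5)],  # token 0 goes to experts 0 and 3
          [(1, 1.0)]]            # token 1 goes to expert 1
send = dispatch(tokens, routes, experts_per_rank=2)  # 4 experts on 2 ranks
outboxes = [run_experts(send[rank], toy_expert) for rank in sorted(send)]
result = combine(outboxes, num_tokens=len(tokens), dim=2)
print(result)
```

In this sketch token 0 is sent to two different ranks, so its vector is copied once per destination. That per-destination duplication is the kind of traffic that makes all-to-all cost grow with the number of experts selected per token.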

BibTeX

@inproceedings {316662,
author = {Xingyi Li and Yadong Liu and Xiaojie Huang and Yiran Zhang and Shuai Wang and Shangguang Wang and Zhehao Lin and Yinben Xia and Chang Yu and Qihang Liu and Xuan Zhang and Hao Lu and Xiang Li and Zekun He and Yachen Wang and Xianneng Zou},
title = {{SwiftEP}: Accelerating {MoE} Inference with Buffer Fusion and {TMA} Offloading},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1073--1089},
url = {https://www.usenix.org/conference/nsdi26/presentation/li-xingyi},
publisher = {USENIX Association},
month = may
}
