RollArt: Disaggregated Multi-Task Agentic RL Training at Scale

Wei Gao; Yuheng Zhao; Tianyuan Wu; Shaopan Xiong; Weixun Wang; Dakai An; Lunxi Cao; Dilxat Muhtar; Zichen Liu; Haizhou Zhao; Ju Huang; Siran Yang; Yongbin Li; Wenbo Su; Jiamang Wang; Lin Qu; Bo Zheng; Wei Wang

Wei Gao, Yuheng Zhao, and Tianyuan Wu, Hong Kong University of Science and Technology; Shaopan Xiong and Weixun Wang, Alibaba Group; Dakai An and Lunxi Cao, Hong Kong University of Science and Technology; Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, and Siran Yang, Alibaba Group; Yongbin Li, Tongyi Lab, Alibaba; Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng, Alibaba Group; Wei Wang, Hong Kong University of Science and Technology

Agentic Reinforcement Learning (RL) trains LLMs through multi-turn interactions with environments, producing workloads that mix compute-bound prefill, bandwidth-bound decoding, CPU-heavy environment execution, and bursty reward evaluation. Existing systems either colocate all stages on a single GPU cluster or decouple them only at a coarse granularity, overlooking hardware heterogeneity and incurring substantial synchronization overhead across stages.

We present ROLLART, a system for multi-task agentic RL on disaggregated infrastructure. ROLLART maps each pipeline stage to best-fit hardware, routing prefill-heavy tasks to compute-optimized GPUs, decode-heavy tasks to bandwidth-optimized GPUs, and environments to CPU clusters. It decouples rollout at the trajectory level, allowing generation, environment interaction, and reward scoring to proceed independently, so that slow or failed environments never block the others. ROLLART offloads stateless reward computation to serverless infrastructure and overlaps rollout with training via staleness-bounded asynchronous weight synchronization. Our results show that ROLLART improves training throughput and achieves a 1.31–2.05× training time reduction compared to various RL systems. We also evaluated ROLLART by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, demonstrating its stability and scalability.
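As a rough illustration of two ideas named in the abstract, trajectory-level decoupling and a staleness bound, the following minimal Python sketch runs each trajectory as an independent task and filters trajectories by policy-version lag. It is not ROLLART's implementation: the function names, the `MAX_STALENESS` value, the timeout-based collection, and the placeholder reward are all assumptions made for illustration.

```python
import asyncio

MAX_STALENESS = 2  # hypothetical bound on policy-version lag


async def run_trajectory(task_id, env_delay, policy_version):
    """One trajectory as an independent unit of work (illustrative only)."""
    if env_delay is None:
        # Stand-in for an environment that crashes.
        raise RuntimeError(f"environment for {task_id} failed")
    # Stand-in for generation plus environment interaction.
    await asyncio.sleep(env_delay)
    reward = 1.0  # placeholder; a real reward would come from a scorer
    return {"task": task_id, "version": policy_version, "reward": reward}


async def collect(delays, policy_version, timeout):
    """Gather whatever trajectories finish within `timeout` seconds.

    Slow trajectories are cancelled and failed ones are skipped, so
    neither holds up the trajectories that completed normally.
    """
    tasks = [
        asyncio.create_task(run_trajectory(name, delay, policy_version))
        for name, delay in delays.items()
    ]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let the cancellations settle before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    finished = [t.result() for t in done if t.exception() is None]
    return sorted(finished, key=lambda r: r["task"])


def usable(trajectory, trainer_version, max_staleness=MAX_STALENESS):
    """Accept a trajectory only if its policy version is recent enough."""
    return trainer_version - trajectory["version"] <= max_staleness


if __name__ == "__main__":
    envs = {"a": 0.01, "b": 0.02, "slow": 5.0, "broken": None}
    batch = asyncio.run(collect(envs, policy_version=3, timeout=0.5))
    print([r["task"] for r in batch])
    print([usable(r, trainer_version=5) for r in batch])
```

In this sketch the slow and failed environments simply drop out of the batch; a production system would more plausibly retry or reschedule them, which the abstract does not specify.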

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@inproceedings {318475,
author = {Wei Gao and Yuheng Zhao and Tianyuan Wu and Shaopan Xiong and Weixun Wang and Dakai An and Lunxi Cao and Dilxat Muhtar and Zichen Liu and Haizhou Zhao and Ju Huang and Siran Yang and Yongbin Li and Wenbo Su and Jiamang Wang and Lin Qu and Bo Zheng and Wei Wang},
title = {{RollArt}: Disaggregated {Multi-Task} Agentic {RL} Training at Scale},
booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
year = {2026},
isbn = {978-1-939133-55-7},
address = {Seattle, WA},
pages = {863--881},
url = {https://www.usenix.org/conference/osdi26/presentation/gao},
publisher = {USENIX Association},
month = jul
}
