TrainMover: An Interruption-Resilient Runtime for ML Training

ChonLam Lao; Jiaqi Gao; Jiamin Cao; Zhipeng Zhang; Pengcheng Zhang; Jiangfei Duan; Zhilong Zheng; Yu Guan; Yichi Xu; Yong Li; Zhengping Qian; Aditya Akella; Minlan Yu; Ennan Zhai; Dennis Cai; Jingren Zhou

ChonLam Lao, Harvard University and Alibaba Group; Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, and Zhengping Qian, Alibaba Group; Aditya Akella, The University of Texas at Austin; Minlan Yu, Harvard University; Ennan Zhai, Dennis Cai, and Jingren Zhou, Alibaba Group

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions such as checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and a general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.

BibTeX

@inproceedings {318599,
author = {ChonLam Lao and Jiaqi Gao and Jiamin Cao and Zhipeng Zhang and Pengcheng Zhang and Jiangfei Duan and Zhilong Zheng and Yu Guan and Yichi Xu and Yong Li and Zhengping Qian and Aditya Akella and Minlan Yu and Ennan Zhai and Dennis Cai and Jingren Zhou},
title = {{TrainMover}: An {Interruption-Resilient} Runtime for {ML} Training},
booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
year = {2026},
isbn = {978-1-939133-55-7},
address = {Seattle, WA},
pages = {2047--2064},
url = {https://www.usenix.org/conference/osdi26/presentation/lao},
publisher = {USENIX Association},
month = jul
}
