Continuum: An Interruption-Resilient Runtime for ML Training

ChonLam Lao, Alibaba Cloud & Harvard University; Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, and Jiangfei Duan, Alibaba Cloud; Minlan Yu, Harvard University; Aditya Akella, UT Austin; Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Ennan Zhai, Dennis Cai, Zhengping Qian, and Jingren Zhou, Alibaba Cloud