Safeguarding LLM Training at Scale: Online SDC Detection and Insights from 35 Million GPU Hours

Kinman Lei; Liyan Zheng; Xiang Li; Hongmin Chen; Yun Zhang; Gaohong Liu; Zuquan Song; Zixuan Ma; Zhiyu Xue; Minghui Yu; Shuguang Wang; Wencong Xiao; Haibin Lin; Yuyang Jin; Jidong Zhai; Bo Liu; Xin Liu

Kinman Lei, Tsinghua University; Liyan Zheng, Xiang Li, Hongmin Chen, Yun Zhang, Gaohong Liu, Zuquan Song, and Zixuan Ma, ByteDance; Zhiyu Xue, Tsinghua University; Minghui Yu, Shuguang Wang, Wencong Xiao, and Haibin Lin, ByteDance; Yuyang Jin and Jidong Zhai, Tsinghua University; Bo Liu and Xin Liu, ByteDance

Silent Data Corruption (SDC) poses a critical threat to large-scale LLM training. Existing offline tests and online detection methods provide solutions for large-scale systems, yet they suffer from high overhead or low detection accuracy in LLM training. This paper presents AEGIS, an online SDC detection framework for large-scale LLM training. We introduce a two-stage cSensor-cVerifier abstraction that decouples SDC detection into lightweight corruption sensing and definitive corruption verification. Based on this abstraction, AEGIS co-designs new detection techniques by integrating the inherent features of LLM training with GPU characteristics, enabling practical online SDC detection. In a production deployment spanning 3.5 × 10⁷ GPU-hours, AEGIS identified 18 real-world SDC incidents and 13 faulty GPUs while incurring only 0.86% performance overhead, enabling a systematic empirical characterization of SDCs in large-scale LLM training.

BibTeX

@inproceedings {318525,
author = {Kinman Lei and Liyan Zheng and Xiang Li and Hongmin Chen and Yun Zhang and Gaohong Liu and Zuquan Song and Zixuan Ma and Zhiyu Xue and Minghui Yu and Shuguang Wang and Wencong Xiao and Haibin Lin and Yuyang Jin and Jidong Zhai and Bo Liu and Xin Liu},
title = {Safeguarding {LLM} Training at Scale: Online {SDC} Detection and Insights from 35 Million {GPU} Hours},
booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
year = {2026},
isbn = {978-1-939133-55-7},
address = {Seattle, WA},
pages = {1369--1384},
url = {https://www.usenix.org/conference/osdi26/presentation/lei},
publisher = {USENIX Association},
month = jul
}
