Safeguarding LLM Training at Scale: Online SDC Detection and Insights from 35 Million GPU Hours

Kinman Lei, Tsinghua University; Liyan Zheng, Xiang Li, Hongmin Chen, Yun Zhang, Gaohong Liu, Zuquan Song, and Zixuan Ma, ByteDance; Zhiyu Xue, Tsinghua University; Minghui Yu, Shuguang Wang, and Wencong Xiao, ByteDance; Haibin Lin, ByteDance; Yuyang Jin and Jidong Zhai, Tsinghua University; Bo Liu and Xin Liu, ByteDance