Yuxuan Jiang and Ryan Huang, University of Michigan
AI training systems are now critical production infrastructure. Training large models requires thousands of GPUs for weeks, so silent failures waste enormous compute and engineering time. Despite their importance, the observability practices for AI training lag behind. Current practices rely on coarse, noisy signals that are sampled periodically and provide little help for catching or diagnosing many training errors.
This talk introduces TrainCheck, an open-source framework for deep observability inside the training process. TrainCheck introduces training invariants: semantic rules about expected internal behavior, such as consistency across parallel ranks or whether optimizer steps actually update parameters. We will describe how TrainCheck instruments training pipelines efficiently, automatically infers invariants from execution traces using relation templates, and derives any necessary preconditions. By continuously checking invariants during execution, TrainCheck detects subtle training errors early and provides actionable debugging hints. We will present evidence of TrainCheck's effectiveness on real-world issues.
Yuxuan Jiang is a PhD Candidate in the Department of Electrical Engineering and Computer Science (EECS) at the University of Michigan, Ann Arbor, advised by Dr. Ryan Huang. He is a member of the Ordered Systems Lab, where his research focuses on computer systems reliability, with an emphasis on detecting and preventing silent failures in large-scale machine learning, agentic and distributed systems.
Dr. Ryan Huang is an Associate Professor in the EECS Department at the University of Michigan, Ann Arbor, where he leads the Ordered Systems Lab. He conducts research broadly in computer systems, with specialties in designing principled methods to improve the reliability and performance of large-scale systems. He is a recipient of the NSF CAREER Award.

author = {Yuxuan Jiang and Ryan Huang},
title = {Beyond Loss and Accuracy: Closing the Observability Gaps in {AI} Training with {TrainCheck}},
year = {2026},
address = {Seattle, WA},
publisher = {USENIX Association},
month = mar
}
