OpGuard: Bitwise Alignment for Precise and General Debugging of Production LLM Training

Ziming Zhou; Yinjie Zhao; Hang Zhu; Wenxiao Wang; Zhihao Bai; Yun Zhang; Shuguang Wang; Haibin Lin; Peng Huang

Ziming Zhou and Yinjie Zhao, University of Michigan; Hang Zhu, Wenxiao Wang, Zhihao Bai, Yun Zhang, Shuguang Wang, and Haibin Lin, ByteDance Seed; Peng Huang, University of Michigan

Large-scale LLM training runs on many GPUs for weeks atop rapidly evolving software stacks. Bugs or hardware glitches can silently corrupt the computation and surface only much later, so debugging becomes a search for a needle in a haystack across time. Developers often launch a second training run and compare aggregate signals such as loss and gradient norms. But these signals are noisy and easily diluted across millions of operations, offering little guidance on why the divergence occurs. This paper introduces bitwise alignment as a correctness oracle and debugging primitive for LLM training, and OpGuard, a practical system that realizes it at production scale. OpGuard discovers semantically stable operator boundaries across heterogeneous training stacks and wraps them with lightweight fingerprinting. A schedule-tolerant mapper computes the longest prefix over which two executions produce bitwise-identical tensors. The first mismatching point becomes a pivot for debugging and is presented with rich context. By carefully controlling benign nondeterminism, OpGuard makes the first mismatch strong evidence of error. OpGuard has been deployed at ByteDance across pre-training and post-training workloads. It has diagnosed over twenty production issues, including subtle kernel races and silent data corruptions missed by existing checks, reducing debugging time from days to minutes.
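The idea of fingerprinting operator outputs and locating the first bitwise mismatch between two runs can be illustrated with a small sketch. This is not OpGuard's implementation: the names `fingerprint` and `first_mismatch`, the operator keys, and the choice of BLAKE2 hashing are all assumptions made for illustration, and plain Python floats stand in for tensor data. The sketch also assumes each operator key appears once per run.

```python
import hashlib
import struct
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical record type: (operator key, fingerprint of its output).
Record = Tuple[str, str]


def fingerprint(values: Sequence[float]) -> str:
    """Hash the raw float64 bit patterns, so any single-bit change alters the digest."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.blake2b(raw, digest_size=16).hexdigest()


def first_mismatch(reference: List[Record], candidate: List[Record]) -> Optional[int]:
    """Return the index in `reference` of the first operator whose fingerprint
    differs in `candidate` (or is missing from it), or None if all match.

    Records are matched by operator key rather than by position, so a
    different interleaving of the same operators is not reported as a mismatch.
    """
    by_key: Dict[str, str] = dict(candidate)
    for index, (key, digest) in enumerate(reference):
        if by_key.get(key) != digest:
            return index
    return None


# Two illustrative runs: same operators, different order, one differing value.
run_a = [
    ("step0/embed", fingerprint([1.0, 2.0])),
    ("step0/mlp", fingerprint([0.3])),
    ("step0/loss", fingerprint([3.0])),
]
run_b = [
    ("step0/mlp", fingerprint([0.1 + 0.2])),  # numerically close, not bitwise equal
    ("step0/embed", fingerprint([1.0, 2.0])),
    ("step0/loss", fingerprint([3.0])),
]

print(first_mismatch(run_a, run_b))                  # 1 (the "step0/mlp" record)
print(first_mismatch(run_a, list(reversed(run_a))))  # None (reordering alone is tolerated)
```

The records before the returned index form the longest matching prefix of the reference run. Note that `0.1 + 0.2` and `0.3` would pass a tolerance-based comparison yet fail the bitwise one, which is the kind of difference an aggregate signal such as loss tends to hide.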

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.

BibTeX

@inproceedings {318527,
author = {Ziming Zhou and Yinjie Zhao and Hang Zhu and Wenxiao Wang and Zhihao Bai and Yun Zhang and Shuguang Wang and Haibin Lin and Peng Huang},
title = {{OpGuard}: Bitwise Alignment for Precise and General Debugging of Production {LLM} Training},
booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
year = {2026},
isbn = {978-1-939133-55-7},
address = {Seattle, WA},
pages = {1385--1406},
url = {https://www.usenix.org/conference/osdi26/presentation/zhou-ziming},
publisher = {USENIX Association},
month = jul
}
