FLARE: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale

Weihao Cui, Shanghai Jiao Tong University and National University of Singapore; Ji Zhang, Independent Researcher; Han Zhao, Shanghai Jiao Tong University; Chao Liu, Independent Researcher; Jian Sha, Tsinghua University; Bo Sang, Ant Group; Bingsheng He, National University of Singapore; Minyi Guo and Quan Chen, Shanghai Jiao Tong University

The rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, operating such clusters efficiently is challenging because training anomalies occur frequently. Existing diagnostic tools are narrowly tailored to specific issues, which leaves gaps in their ability to address anomalies spanning the entire training stack. In response, we introduce FLARE, a diagnostic framework designed for distributed LLM training at scale. FLARE integrates a lightweight tracing daemon for full-stack, backend-extensible tracing, along with a diagnostic engine that automatically diagnoses anomalies, with a focus on performance regressions. Deployed across 6,000 GPUs and operating continuously for over eight months, FLARE has demonstrated significant improvements in pinpointing deficiencies in real-world scenarios.

BibTeX
@inproceedings {316592,
author = {Weihao Cui and Ji Zhang and Han Zhao and Chao Liu and Jian Sha and Bo Sang and Bingsheng He and Minyi Guo and Quan Chen},
title = {{FLARE}: Anomaly Diagnostics for Divergent {LLM} Training in {GPU} Clusters of {Thousand-Plus} Scale},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1021--1035},
url = {https://www.usenix.org/conference/nsdi26/presentation/cui},
publisher = {USENIX Association},
month = may
}