SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation

Authors: 

Yifan Xiong, Yuting Jiang, Ziyue Yang, and Lei Qu, Microsoft Research; Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, and Joe Chau, Microsoft; Peng Cheng, Yongqiang Xiong, and Lidong Zhou, Microsoft Research

Awarded Best Paper!

Abstract: 

Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure", for AI workloads. Such degradation significantly affects end-to-end performance and conceals performance issues, which complicates root cause analysis for failures and regressions.

We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite capable of evaluating individual hardware components and representing most real AI workloads. It includes a Validator that learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector that balances validation time against issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61×. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.
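
The abstract does not specify how the Validator or the Selector work internally, so the following is only a minimal illustrative sketch of the two ideas, not the paper's algorithms. It assumes a simple median-based criterion for flagging slow nodes and a greedy, time-budgeted choice of benchmarks. The function names `flag_outliers` and `select_benchmarks`, the `tolerance` parameter, and all node names, benchmark names, and numbers are hypothetical.

```python
from statistics import median


def flag_outliers(results, tolerance=0.05):
    """Flag nodes scoring more than `tolerance` below the fleet median.

    `results` maps a node name to a benchmark score (higher is better).
    This median rule is an assumed stand-in for a learned criterion.
    """
    baseline = median(results.values())
    threshold = baseline * (1.0 - tolerance)
    return sorted(node for node, score in results.items() if score < threshold)


def select_benchmarks(benchmarks, time_budget):
    """Greedily pick benchmarks by expected penalty avoided per minute.

    `benchmarks` is a list of (name, runtime_minutes, penalty_avoided)
    tuples. Greedy selection is an assumption made for illustration.
    """
    ranked = sorted(benchmarks, key=lambda b: b[2] / b[1], reverse=True)
    chosen, used = [], 0.0
    for name, runtime, _ in ranked:
        # Keep a benchmark only if it still fits in the remaining budget.
        if used + runtime <= time_budget:
            chosen.append(name)
            used += runtime
    return chosen


# Hypothetical data: one node is clearly slower than its peers.
scores = {"node-a": 100.0, "node-b": 101.0, "node-c": 99.0, "node-d": 80.0}
flagged = flag_outliers(scores)

# Hypothetical benchmark costs and benefits, with a 15-minute budget.
suite = [("gemm", 5, 10), ("allreduce", 10, 30), ("disk", 20, 10)]
subset = select_benchmarks(suite, time_budget=15)

print(flagged, subset)
```

In this toy setup, a node is flagged only when it falls well below its peers, and the benchmark subset favors tests that avoid the most expected penalty per minute of validation time. A production system would presumably learn both the criteria and the cost model from data.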

USENIX ATC '24 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {298589,
author = {Yifan Xiong and Yuting Jiang and Ziyue Yang and Lei Qu and Guoshuai Zhao and Shuguang Liu and Dong Zhong and Boris Pinzur and Jie Zhang and Yang Wang and Jithin Jose and Hossein Pourreza and Jeff Baxter and Kushal Datta and Prabhat Ram and Luke Melton and Joe Chau and Peng Cheng and Yongqiang Xiong and Lidong Zhou},
title = {{SuperBench}: Improving Cloud {AI} Infrastructure Reliability with Proactive Validation},
booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
year = {2024},
isbn = {978-1-939133-41-0},
address = {Santa Clara, CA},
pages = {835--850},
url = {https://www.usenix.org/conference/atc24/presentation/xiong},
publisher = {USENIX Association},
month = jul
}