Operating Tens of Thousands of GPUs on Hyperscalers: Failure, Firmware, and the Illusion of Capacity

Tuesday, March 24, 2026 - 11:50 am–12:35 pm

Abe Hoffman and Martin Smith, NVIDIA

At the scale of 10,000+ GPUs, a 0.01% failure rate is a daily guarantee. While hyperscalers market an image of uniform capacity, SREs know that hardware heterogeneity and "invisible" substrate states are the real challenges at this scale.

This vendor-neutral session distills two years of experience managing multi-region GPU fleets. We strip away cloud provider illusions to reveal the ground truth of hardware saturation and orchestration challenges. Leave with a practical "AI-scale checklist" for maintaining a strong large-scale cluster posture, and with a better understanding of how to operate within the reality of your underlying infrastructure.

Abe Hoffman is a Principal Staff Engineer, working at the intersection of large-scale GPU infrastructure, reliability engineering, and secure platform automation. His current work focuses on hyperscale-level observability and mitigation. Abe brings a systems-first perspective grounded in both theoretical computer science and hands-on operations, with prior experience founding and scaling infrastructure platforms in highly regulated domains.

Martin Smith is a Principal Architect for Site Reliability at NVIDIA, where he focuses on DGX Cloud and bridges the gap between engineering teams and cloud service providers. With more than 20 years of experience in reliability and cloud infrastructure at companies like HashiCorp and Rackspace, he specializes in building scalable, resilient systems and infrastructure automation. Beyond his technical work, Martin is a dedicated mentor, speaker, and activist committed to making the world more awesome through engineering.

BibTeX
@conference {316282,
author = {Abe Hoffman and Martin Smith},
title = {Operating Tens of Thousands of {GPUs} on Hyperscalers: Failure, Firmware, and the Illusion of Capacity},
year = {2026},
address = {Seattle, WA},
publisher = {USENIX Association},
month = mar
}