Todd Underwood, Anthropic
Running frontier AI systems creates significant reliability challenges while simultaneously offering new tools to address them. This talk explores both sides of this equation. We'll examine the unique SRE challenges of large-scale AI/ML systems: how training creates distributed systems nightmares, how accelerator scheduling defies traditional patterns, and how LLM serving poses problems of its own. But we'll also explore the flip side: how these models are becoming part of the reliability toolkit itself.
This talk surveys what works in model-assisted technical operations: the functions that work well now and those where human expertise remains critical. It draws on real-world experience and reports from a variety of environments. Whether you're running ML workloads or considering LLMs as operational tools, you'll leave with concrete strategies for navigating the new reality where AI is both the challenge and part of the solution.

Todd Underwood leads reliability at Anthropic, a company trying to create AI systems that are safe, reliable, and beneficial to society. Before that, he briefly led reliability for the Research Platform at OpenAI. Earlier, he was a Senior Engineering Director at Google, leading ML capacity engineering at Alphabet. He also founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services, and was the Site Lead for Google’s Pittsburgh office. Along with several colleagues, he published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).

author = {Todd Underwood},
title = {{SRE} for {AI} and {AI} for {SRE}},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}