Dashboards & Dragons: Reliability Magic for AI Platforms

Alexa Griffith; Sal Furino

Tuesday, 7 October, 2025 - 13:50–14:35

Alexa Griffith and Sal Furino, Bloomberg

Scaling a generative AI platform is no fairy tale. Instead, it’s an epic battle through dungeons spanning training workloads, inference services, and infrastructure. GenAI has introduced a whole new level of complexity for infrastructure, bringing heavier resource demands, new requirements for the scaling of token-based patterns, and questions about how to monitor and manage it all.

Building GenAI systems is hard, but keeping them reliable is even harder.

In this talk, we’ll recount our journey taming the complexity of multi-cluster AI platforms using actionable SLOs as our compass. Whether you’re building your first AI platform or defending the reliability of a cluster, you’ll complete your quest equipped with practical, open source-friendly strategies to help make your systems observable, debuggable, and resilient.

Alexa Griffith is a Senior Software Engineer at Bloomberg. She works on building Bloomberg’s AI Inference Platform and the open source KServe & Envoy AI Gateway projects. She enjoys solving engineering challenges at scale, working in open source, and speaking about AI, as well as engaging with the community through her personal podcast, Alexa’s Input.

Sal Furino is a Customer Reliability Engineer at Bloomberg. During his career he’s worked as a TPM, SRE, Developer, Sys Admin, and in IT support. When not working, he enjoys cooking, gaming, and traveling. Sal lives in Queens and has a bachelor’s degree in applied mathematics from Marist College.

BibTeX

@conference {311816,
author = {Alexa Griffith and Sal Furino},
title = {Dashboards \& Dragons: Reliability Magic for {AI} Platforms},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
