Aravind S Kumar and Scott Johnson, NVIDIA
SREs are under pressure to "just add AI" without increasing risk. But where do you start, and how do you avoid hype-driven dead ends? This talk shares lessons from 18 months of applying LLMs in a real, high-stakes environment that powers live gaming, simulation, and ML inference. We’ll walk through use cases that worked, the pitfalls we encountered, and how we addressed challenges like hallucinations, alert fatigue, and on-prem LLM deployment. The session includes hands-on demos of playbook indexing, live signal integration, and role-specific outputs, all built into a practical AI stack that keeps your data within your firewall. You’ll leave with honest takeaways, working examples, and clear steps for responsibly integrating AI into your operations.

Aravind Kumar is a Senior Software Engineer at NVIDIA working on AI agents and infrastructure for cloud gaming and cloud inference. Previously, he worked on deep learning models for malaria surveillance in East Africa, under a Johns Hopkins and Gates Foundation collaboration.

Scott Johnson is a Senior Technical Program Manager at NVIDIA working on the GeForce NOW service with a focus on capacity management and AI solutions in the SRE space. He has over 20 years of experience in SRE, working on large-scale online services at companies like Microsoft, Intuit, and Yahoo!.

author = {Aravind Sunil Kumar and Scott Johnson},
title = {Why Risk Management Requires Taking Risks: A Practical Guide to Getting {SRE} Teams {AI-Ready}},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
