From Chaos to Confidence: How SREs Can Leverage 50 (and Counting) Failure Scenarios to Test AI Readiness

Rohan Arora; Bhavya

Wednesday, March 26, 2025 - 1:55 pm–2:40 pm

Rohan R. Arora and Bhavya, IBM Research

Using sandboxed Kubernetes environments, we created 50+ production-inspired failure scenarios that put AI assistants to the test across the full SRE toolkit. The results? Current AI models resolve only 13.8% of scenarios, a reality check for anyone evaluating these tools.

This session introduces our evaluation framework and shows how you can use it to benchmark AI assistants against real failure patterns, chaos-test your own applications with production-inspired scenarios, and assess whether AI-assisted approaches fit your operational needs.

We're building a community-driven repository where SREs contribute real incidents and advance the field together. Come learn what AI can (and can't) do today, and help shape what it could do tomorrow.

Rohan R. Arora is a Senior Software Engineer at IBM Research. He joined IBM Research in 2016 after graduating from the University of Illinois at Urbana-Champaign with a Master's Degree in Electrical and Computer Engineering. In his early career at IBM, he co-led the effort on developing augmented and virtual reality-based solutions for the enterprise. Since 2021, he has been working at the intersection of machine learning (ML) and IT operations, particularly in the areas of incident management and resource optimization.

Bhavya is a Research Scientist at IBM Research, where she works on LLM-based agentic systems for IT automation, with a focus on areas like incident management. She holds a Ph.D. from the University of Illinois, Urbana-Champaign (2024), where her research explored LLM-driven approaches to novel NLP tasks, particularly in educational contexts. Prior to her doctorate, she spent two years at Gartner as a Data Scientist building Recommender Systems and Text Mining solutions.
