Modernizing Incident Response with LLMs, RAG, and the MCP

Wednesday, 8 October, 2025 - 16:45–17:30

Theofilos Papapanagiotou, Amazon

In traditional SRE workflows, incident response often relies on fragmented tools and tribal knowledge. This talk shares how a large-scale SRE team transitioned to secure, LLM-powered workflows using the Model Context Protocol (MCP), a novel pattern for injecting real-time, local system data (logs, configs, tickets) into LLM prompts without violating security controls. We’ll cover how we paired MCP with a domain-specific retrieval system built on OpenSearch and Bedrock Titan embeddings, enabling semantic search over incidents, dashboards, and playbooks. Learn how we tackle adoption, safeguard sensitive data, and measurably reduce resolution times.

Theofilos Papapanagiotou is a Senior Applied Scientist at Amazon, specializing in serving large language models (LLMs) at scale with a focus on performance, reliability, and cost-efficiency. He brings deep expertise in ML infrastructure, Kubernetes, and GPU optimization to help organizations deploy custom GPT-based models in secure, production environments. His work supports large-scale Site Reliability Engineering (SRE) operations, where he leads the integration of LLM-powered workflows that combine real-time logs, metrics, and incident data using the Model Context Protocol (MCP). Theofilos’ contributions bridge the gap between AI innovation and resilient system operations.

BibTeX
@conference {311880,
author = {Theofilos Papapanagiotou},
title = {Modernizing Incident Response with {LLMs}, {RAG}, and the {MCP}},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
