SREcon26 Americas Conference Program

Monday, March 23

5:00 pm–7:00 pm

Welcome Get-Together

Grand Foyer

Tuesday, March 24

7:30 am–8:45 am

Continental Breakfast

Grand Foyer

8:45 am–9:00 am

Opening Remarks

Grand Ballroom II & III

Program Co-Chairs: Patrick Cable, DraftKings, and Laura Maguire, Trace Cognitive Engineering

9:00 am–10:30 am

Opening Plenary Session

Grand Ballroom II & III

Taming the Unpredictable: Reliability in Chaos

Tuesday, 9:00 am–9:45 am

Michelle Brush, Google Inc.

The increasing volume of code written by AI will only accelerate the complexity of our systems. We are moving beyond predictable systems that can be managed with traditional methods like thorough project plans, runbooks, and unit tests, into an era of truly complex systems that are vast and difficult to comprehend fully. These immensely complex systems will behave almost nondeterministically. We need new strategies.

This presentation will delve into why robust reliability practices are not just helpful but essential for navigating this explosion in complexity. It will share strategies for conquering unpredictability including building rigorous evaluations, implementing generic mitigations, recreating reproducibility, and developing software with a risk-first approach. Most importantly, it will discuss why humans will always be critical to taming the ever-growing complexity.

Michelle Brush has 25 years of software experience working across embedded software, distributed systems, enterprise software, and consumer devices. In her current role as an Engineering Director at Google, she leads the global teams of SREs that ensure Compute Engine and Persistent Disk are reliable. She is also the author of 2 of the 97 Things Every SRE Should Know.

Previously, Michelle worked for Cerner Corporation as the Engineering Director responsible for the data engineering platform for Cerner’s Population Health products. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm and later led the map technology department responsible for Garmin’s various spatial search, data compression, and shortest pathfinding algorithms. If you’ve ever gotten a route you didn’t like using a Garmin device, it was probably Michelle’s fault.

Mean Time to WTF: Why Developer Experience Frameworks Belong in Your Incident Retrospectives

Tuesday, 9:45 am–10:30 am

Nicole Forsgren

SREs are developers who happen to focus on reliability. You write code, build tools, manage complex workflows, and experience friction every single day. That friction has a measurable cost: slower incident response, more mistakes under pressure, higher on-call burden, and ultimately, degraded system reliability.

We should also address the elephant: AI is generating way more code, leading to more deployments and bigger changes to your systems. If your operational friction is already high, AI is about to amplify it exponentially.

This talk applies developer experience frameworks (SPACE, DORA) to operational work. You'll learn how to measure SRE experience, connect it to reliability outcomes, and identify high-leverage changes to reduce friction. Walk away with practical things you can do tomorrow and a framework for making the business case that investing in SRE experience is a reliability strategy.

Dr. Nicole Forsgren is considered an expert in DevOps and developer productivity. She has led efforts to improve DevEx and AI-native developer productivity at some of the largest companies in the world. She is author of two best-selling, award-winning books: Accelerate and DevOps Handbook 2e. Her latest book, Frictionless, was recently released. She has been an entrepreneur (with an exit to Google), professor, developer, sysadmin, and performance engineer.

Nicole’s work has been published in several peer-reviewed journals and conferences. Nicole earned her PhD in Management Information Systems and Masters in Accounting from the University of Arizona. She lives in the PNW and recharges her brain with gym time, tacos, and Diet Coke.

10:30 am–11:00 am

Coffee and Tea Break

Grand Foyer

11:00 am–12:35 pm

Track 1

Grand Ballroom II

Infinity Is Not a Strategy: Right‑Sizing the Cloud

Tuesday, 11:00 am–11:45 am

Praval Panwar, Microsoft

SRE teams are often told the cloud is “infinitely scalable”—until costs spike or systems fall over. In practice, many teams oscillate between over-provisioning and panic-scaling, reacting to incidents instead of planning for uncertainty.

Other industries like airlines, power grids, and logistics have faced this problem for decades. They don’t assume infinite resources. They forecast demand, build buffers, and design explicitly for failure. This talk borrows those proven ideas and applies them to cloud systems. It explores why assumptions of infinite scale lead to wasted spend and fragile reliability, and how SRE teams can reason more clearly about capacity, cost, and performance together. The focus is not on new tools, but on better mental models—so teams can plan for spikes, failures, and uncertainty without relying on guesswork or excess capacity.

Praval Panwar is a Principal Software Engineer at Microsoft, where he works on large-scale observability and telemetry platforms. He has spent over seven years designing and operating hyperscale distributed systems that need to work reliably under messy, real-world conditions. His interests include how systems fail under pressure, how assumptions break at scale, and why many operational problems turn out to be design problems in disguise.

Operating Tens of Thousands of GPUs on Hyperscalers: Failure, Firmware, and the Illusion of Capacity

Tuesday, 11:50 am–12:35 pm

Abe Hoffman and Martin Smith, NVIDIA

At the scale of 10,000+ GPUs, a 0.01% failure rate is a daily guarantee. While hyperscalers market an image of uniform capacity, SREs know that hardware heterogeneity and "invisible" substrate states are true challenges at these kinds of scales.

This vendor-neutral session distills two years of experience managing multi-region GPU fleets. We strip away cloud provider illusions to reveal the ground truth of hardware saturation and orchestration challenges. Leave with a practical "AI-scale checklist" to maintain the best large-scale cluster posture and to better understand and operate within the reality of your underlying infrastructure.

Abe Hoffman is a Principal Staff Engineer, working at the intersection of large-scale GPU infrastructure, reliability engineering, and secure platform automation. His current work focuses on hyperscale-level observability and mitigation. Abe brings a systems-first perspective grounded in both theoretical computer science and hands-on operations, with prior experience founding and scaling infrastructure platforms in highly regulated domains.

Martin Smith is a Principal Architect for Site Reliability at NVIDIA, where he focuses on DGX Cloud and bridges the gap between engineering teams and the cloud service providers. With more than 20 years of experience in reliability and cloud infrastructure at companies like HashiCorp and Rackspace, he specializes in building scalable, resilient systems and infrastructure automation. Beyond his technical work, Martin is a dedicated mentor, speaker, and activist committed to making the world more awesome through engineering.

Track 2

Grand Ballroom III

The Zero Trust Odyssey: Our Journey to Modernize Internal Access

Tuesday, 11:00 am–11:45 am

Nathan Handler and Pratik Lotia, Reddit

For years, Reddit’s internal services were protected by a traditional, perimeter-based security model using an NGINX proxy. This talk explores our journey toward a zero trust architecture and the process of replacing that legacy system. We cover why we chose Cloudflare, the challenges of migrating at Reddit’s scale, and the hard-won lessons that shaped our approach. We show you how we made it simple and fast for developers to onboard new applications without getting bogged down in complex security configurations. Leave with practical insights to guide your own zero trust transitions.

Nathan is a Staff Infrastructure Security Engineer at Reddit, working within the SPACE (Security, Privacy, Assurance, Corporate Engineering) organization. He focuses on ensuring Reddit’s infrastructure is launched securely by default, partnering closely with infrastructure, product, and compliance teams to build tooling that surfaces misconfigurations early, automates common security controls, and provides clear visibility into security posture without slowing development.

Prior to his current role in Security, Nathan worked as a Site Reliability Engineer at Reddit, supporting multiple critical platforms. In that role, he operated large-scale production infrastructure such as the RPAN video platform and ran the infrastructure powering r/place in 2022 and 2023. Alongside his current work, he continues to help shape Reddit’s IAM strategy and Infrastructure as Code practices.

Before Reddit, Nathan led infrastructure at the crypto startup Orchid, building systems for a decentralized bandwidth marketplace on Ethereum. Earlier in his career, he was a Site Reliability Engineer at Yelp, where he was a core contributor to PaaSTA, Yelp’s internal open-source Platform as a Service built on Apache Mesos. He has also been deeply involved in open source as an Ubuntu and Debian developer and as a former member of the freenode IRC staff.

Pratik Lotia is an infrastructure security engineer at Reddit, where he is responsible for building tools and processes for implementing security best practices for cloud native environments. He has extensive experience working on security projects for public and private clouds and telco security. He actively contributes to CNCF TAG security projects and runs the DDoS Community at DEFCON.

Shift-Left Compliance

Tuesday, 11:50 am–12:35 pm

Bharath Modhipalli and Karanveer Anand, Google

The traditional "compliance tax" of reactive auditing is failing as systems scale. This talk introduces "Shift-Left Compliance," a framework that transforms compliance from an afterthought into an intrinsic system property. We will demonstrate how to implement Compliance-as-Code using tools like Open Policy Agent (OPA), and how to integrate automated checks into your CI/CD pipeline to stop violations before they deploy. Join us to learn how to design "born compliant" systems, prevent configuration drift, and significantly reduce engineering toil by treating regulatory requirements as code.

Bharath Modhipalli is a Staff Site Reliability Engineer (SRE) and Tech Lead at Google Workspace in Sunnyvale, California. He currently serves as the lead for the Apps Core Infrastructure SRE team, where he focuses on the reliability and scalability of foundational systems for Google Workspace.

Karanveer Anand is a Technical Program Manager for Google SRE, based in Sunnyvale, California. He oversees Workspace reliability, managing a portfolio of projects aimed at enhancing the reliability of infrastructure and ensuring customer compliance.

Workshop

Elliott Bay Room

Using Staged World Exercises to Practice Effective Incident Response and Analysis Techniques

Tuesday, 11:00 am–12:35 pm

Moderator: Courtney Nash, The VOID
Panelists: Sarah Butt, Salesforce; Alex Elman, Slack; Eric Dobbs; Hamed Silatani, Uptime Labs

This hands-on session combines an interactive workshop with a panel of four experts from different companies, moderated by an industry expert, who discuss an incident within a Staged World platform and its subsequent post-incident analysis. This approach offers a level of transparency and detail that typical software incidents don’t provide, including the rare opportunity to see exceptional incident response in action. Participants will review chat transcripts, videos of responders, and analyst notes, enabling a candid discussion free from legal or reputational concerns. Along the way, participants will encounter common challenges in incident management, such as the multi-party dilemma, managing saturation, and hidden work complexities. The panel discussion that follows the workshop will cover effective teamwork, the nuances of expertise in high-pressure situations, the nature of surprises in incidents, and how to derive meaningful insights from complex events. Designed for incident responders, commanders, and SREs, this session merges expertise from various fields to better inform how we respond to software incidents.

Courtney Nash is the Co-founder and CEO of The VOID. Her research focuses on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she has held a variety of editorial, program management, research, and management roles at Verica, Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon.

Sarah Butt is a Principal Engineer within Salesforce's Reliability Engineering group, where she helps lead Salesforce's Centralized Incident Response organization. She is fascinated by scale, complexity, systems thinking, and non-functional requirements, particularly those around reliability. You'll likely find her talking about topics such as resilience engineering, observability, and incident management and response. Prior to working at Salesforce, Sarah worked in the SRE organizations at both Dell and SentinelOne. In her free time, Sarah enjoys freelancing as an audio engineer, competing as an equestrian, and spending time with her husband and their three labradors.

Alex Elman helps software organizations cope with ever increasing complexity and interconnectedness. His focus is on learning from incidents and incident management processes. Alex spent thirteen years at Indeed and was a founding member of their Site Reliability Engineering team. In 2024 Alex joined Slack and works in the reliability organization on incident management.

Eric Dobbs is helping bring resilience engineering and cognitive systems engineering into practice in the software industry. A self-taught programmer, he's been writing code recreationally for over four decades and professionally for three. He has also practiced aikido for over three decades and taught for two of those. His career has meandered through education, consultancies, government, non-profit, and businesses from startup size to Internet scale. He started deliberately practicing learning from incidents in 2018 as part of the SNAFUcatchers consortium. He holds a bachelor's degree in environmental design from the University of Colorado and a fifth-degree black belt from Boulder Aikikai.

Hamed Silatani is the Co-founder and CEO of Uptime Labs, a software engineer with over 20 years of experience supporting high-pressure financial trading platforms, and a resilience engineering enthusiast. Having managed major incidents under intense scrutiny, he learned that tools alone don’t create resilience; prepared teams do. He focuses on practical approaches to building confidence and incident readiness in engineering teams.

Discussion Track

Vashon Room

Varieties of SRE

Tuesday, 11:00 am–12:35 pm

Kurt Andersen, Clari, and Sebastian Vietz, Compass Digital and Compass Group

Ask 10 people what SRE is, and you’ll get 11 answers. There is no single "correct" way to do SRE. From embedded engineers to centralized consulting teams, every organization adapts the core concepts to fit their reality. This unconference discussion session invites you to peel back the label and examine the reality of the role. Join our facilitators to discuss the different flavors of SRE, share how your organization defines it, and debate whether the title still fits the work we do today.

By day, Kurt works as an infrastructure software architect at Clari. In addition, he serves on the USENIX Board and has had the pleasure to work with amazing people around the globe in the SREcon conferences. He also helps with the annual SRE survey and report.

Sebastian Vietz is Director of Reliability Engineering at Compass Digital, where he builds the SRE discipline to help the organization move from operational friction to sustainable reliability. Over the past decade, Seb has launched SRE teams from the ground up and stepped into established organizations at a reliability impasse, realigning structure, ownership, and operating models to restore clarity, impact, and momentum.

Seb is also the creator of SRE Insights (sreinsights.io) and co-host of the Reliability Enablers podcast with Ash Patel, contributing to the broader reliability community through practical, experience-driven conversations.

12:35 pm–1:50 pm

Luncheon

Grand Ballroom I & Cascade Ballroom

1:50 pm–3:25 pm

Track 1

Grand Ballroom II

Low Latency Serving of Offline Data: Efficient, Safe, and Reliable Data Loading At Scale

Tuesday, 1:50 pm–2:35 pm

William Schor

Transferring massive datasets (e.g., 50 TB) into online serving systems—without disrupting live applications—poses unique challenges. Traditional methods often lead to high costs, slow transfers, and performance bottlenecks. In this talk, we present a novel system that preprocesses data offline into RocksDB SST files, stages them in the cloud, and loads them into isolated online infrastructure for seamless, near-instantaneous cutovers. Our approach enables robust validation, rapid rollback, and high-scale, low-latency reads, all while dramatically reducing deployment time (by 99%) and cost (by 70%). Attendees will learn about the architectural innovations, technical lessons, and practical strategies that make this system a game-changer for large-scale data operations.

William Schor is a Senior Software Engineer at Netflix, where he works on the KeyValue team building and scaling distributed data systems. He focuses on resilience and high availability, playing a key role in evolving Netflix's data infrastructure to support diverse and fast-growing use cases across the company.

William holds a BS in Computer Science from Brown University and has published research in cryptographic security. His work spans distributed systems, data infrastructure, and security, bringing both academic rigor and practical experience to solving complex engineering challenges at scale.

Autonomous Policy Validation: Building AI Agents to Analyze Logs and Identify User Data Policy Violations

Tuesday, 2:40 pm–3:25 pm

Paul Huang, Meta

Dive into the architecture and operational lessons of our AI-driven system for User Data Access Policy (UDAP) violation validation. This talk details how we built AI agents that use log analysis and artifact reasoning to automate and augment investigations. We share specific learnings and evaluation approaches for deploying such applied ML/AI systems in low risk-tolerance environments.

Paul is the ML/AI tech lead (TL) in Meta's Security Detection & Response organization, where he leads AI for Security efforts. He is a serial founder and technology organization leader in the applied ML/AI product and security space.

Track 2

Grand Ballroom III

Executing Chaos Engineering in Production at a Critical Financial Institution

Tuesday, 1:50 pm–2:35 pm

Luiz Siqueira and Leonardo Marques, Bradesco

Discover how Chaos Engineering transformed a high-stakes financial ecosystem processing thousands of transactions per second. This real-world case study unveils a reproducible framework for risk-averse organizations, blending fault injection, automation, and observability.

Key takeaways include safe experiment design with governance guardrails, automated chaos workflows, and multidisciplinary GameDays. Results: 73% reduction in MTTD, 10 hidden vulnerabilities exposed, 5 new metrics, and a shift to proactive reliability.

Learn a compliance-friendly methodology to turn failures into insights, bridging theory and measurable business impact in critical systems. Perfect for SREs, Developers, and Ops teams seeking production-ready resilience.

Luiz Siqueira is a specialist in Information Technology with over 15 years of experience in managing critical systems and operational reliability. His background includes an MBA in Site Reliability Engineering (SRE) and a degree in IT Management. He has worked on large-scale projects in companies such as IBM, Kyndryl, and Banco Bradesco, focusing on support, automation, and digital transformation. Currently, he serves as SRE Manager at Bradesco, leading initiatives to ensure scalability, resilience, and efficiency in digital environments. He holds certifications in SRE Foundation℠, Gremlin Chaos Engineering, ITIL, Agile, and Cloud Service Management. Beyond his technical expertise, he is recognized for building high-performance teams and advancing modern reliability practices in complex environments.

Leonardo Siqueira Marques is a Senior IT Operations and Site Reliability Engineering leader with over 25 years of experience in information technology, building and operating highly available, large-scale systems in the financial sector. He holds a degree in Computer Science and an MBA in Digital Transformation from the University of São Paulo (USP). Leonardo has been actively driving reliability transformation initiatives focused on Chaos Engineering, incident response, and operational maturity, emphasizing the use of controlled experimentation and real production failures as continuous learning mechanisms to build resilient, high-performing engineering teams.

From Thundering Herd to Zero Outages: Building Reliable Inventory Sync

Tuesday, 2:40 pm–3:25 pm

Rushikesh Ghatpande, Broadcom Inc

Managing accurate inventory across distributed infrastructure is critical for security policy enforcement and operational reliability. Enterprise datacenter software requires centralized policy management across thousands of servers and hundreds of thousands of VMs and containers, yet resources are distributed across multiple data centers.

This talk shares a battle-tested inventory synchronization protocol that evolved over 6 years of production experience, handling real-world challenges from thundering herd problems during full datacenter restarts to fairness in queue processing. The protocol uses a 5-stage finite state machine to ensure reliable, consistent inventory sync while preventing system overload.

You'll learn how we evolved from a naive 3-step process to a robust 5-stage protocol, how we solved the thundering herd problem, ensured fairness in queue processing, and separated connection establishment from application readiness. We'll share empirical analysis that led to specific timeout values and demonstrate how bidirectional communication patterns eliminated message ordering complexity.

This protocol has been validated at scale across 10,000+ servers with zero customer escalations over 4 years. The patterns are immediately applicable to any distributed state synchronization challenge - whether managing VMs, containers, network devices, or any distributed resources.

Rushikesh Ghatpande is a Principal Engineer at Broadcom, where he leads critical architectural initiatives in network virtualization and security.

With over a decade of experience in distributed systems and software engineering, he has risen from a new graduate to a principal engineer through his work on VMware's flagship NSX product.

His expertise spans software architecture, cross-team engineering leadership, and building highly available, planet-scale systems. Beyond his engineering contributions, Rushikesh has published research papers and filed multiple patents in the networking and virtualization space, reflecting his passion for innovation in cloud infrastructure and distributed systems.

Workshop

Elliott Bay Room

STPA for Software Workshop: Finding the Outages Waiting to Happen

Tuesday, 1:50 pm–5:30 pm

Theo Klein, Ruben Barroso, and Garrett Holthaus, Google

SREs know about some of the flaws and vulnerabilities in their systems. They might also have intuition on where to look for additional issues–"known unknowns." But what about the "unknown unknowns"–outages waiting to happen that nobody is even looking for? With the vast complexity of modern software systems, this dark space of unknowns can be huge. And, what's worse, most of the outages in this space happen due to complex interactions between various parts of the system, even when everything is working according to specification, i.e. no implementation bugs.

What if we had a way to shine a light into the unknown unknowns? What if we could understand our systems enough to be able to methodically explore these complex interactions and build a comprehensive list of possible outage scenarios? In STPA, we model systems based on control-feedback loops, creating a hierarchical control structure, or HCS. In this session, we'll use a real Google system to show how an HCS can help you gain a new perspective and understanding of your system. We'll note similarities to common patterns in software design so you can start thinking about similar vulnerabilities in your own systems.

This session will be an interactive workshop. Attendees should plan to actively participate in the small group exercises in order to get the most benefit from the session.

Theo Klein is a Staff Site Reliability Engineer (SRE) at Google. Over the past two years, he has led an effort to improve the safety and reliability of road disruption data on Google Maps. He also has several years of experience applying safety engineering methods like STPA and CAST to proactively identify risks in complex socio-technical systems at Google. His primary interests are in systems thinking, dependency management and horizontal analyses of large-scale systems.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including Systems-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

Garrett Holthaus is a technical writer for Site Reliability Engineering at Google. He has a background in electrical and computer engineering, as well as experience teaching and designing science and technology curricula. In addition to writing and maintaining SRE documentation, Garrett develops and gives training in Systems-Theoretic Process Analysis (STPA) at Google.

Discussion Track

Vashon Room

Architecture Decisions and Tradeoffs

Tuesday, 1:50 pm–3:25 pm

Steve McGhee, Google, and Michael Rembetsy, Bloomberg

Architecture is the art of choosing your tradeoffs. Do you accept the "vendor lock-in" to move faster, or do you drown in the complexity of a "cloud-agnostic" abstraction layer? Do you chase five-nines availability across regions, or do you prioritize data consistency? In this session, we’ll create a forum for open discussion on topics such as Multi-Cloud strategies, the reality of Hybrid/On-Prem, managing critical vendor dependencies, and the unique constraints of highly regulated environments. Join us to compare notes on what works, what failed, and the hidden costs of complexity.

Steve loves to help Google Cloud customers build reliable systems with sustainable teams. He was a Google SRE for over a decade, then a customer, then a Reliability Advocate. He recently returned to his roots in SRE as a member of the Google Cloud Incident Response Core Team. He is also a host of the Prodcast, Google’s podcast on SRE and Production Software.

Michael is the Global Head of Network Engineering and Operations at Bloomberg. His teams are responsible for everything from physical hardware to SREs who focus on global connectivity for customers. Before Bloomberg, Michael was the VP of Infrastructure at Etsy. Michael has spoken around the world, participated in various industry program committees, and is a past conference program chair for LISA and SREcon. Additionally, he has served on advisory boards throughout his career.

3:25 pm–3:55 pm

Coffee and Tea Break

Grand Foyer

3:55 pm–5:30 pm

Track 1

Grand Ballroom II

From ESXi to Kubernetes: Modernizing 1,400 Edge Locations with Open Source

Tuesday, 3:55 pm–4:15 pm

Jasjit Singh and Kanika Nayyar

When virtualization licensing costs increased tenfold, a large retail organization rethought how it ran infrastructure across 1,400 edge locations. This session shares how existing store hardware was transformed from a VMware-based setup into a fully open-source, Kubernetes-powered platform without adding new hardware or disrupting business operations. Learn how ESXi was replaced with Kubernetes, how clusters were centrally managed using GitOps with FluxCD, and how open-source tools like Dex, Helm, Grafana, and KubeVirt were used to deploy and observe workloads at scale. The talk focuses on practical SRE lessons around automation, resilience at the edge, and cost-efficient modernization that attendees can apply to their own distributed environments.

Jasjit Singh is a Technical Program Manager at Loblaw with over 18 years of experience delivering enterprise-scale solutions across large, distributed environments. His expertise spans monitoring, observability, and Site Reliability Engineering, with a strong focus on building resilient and scalable platforms.

As a Team Lead in Site Reliability Engineering at Loblaw Companies Limited, I specialize in building and scaling resilient, high-performing systems that power critical retail and digital experiences at enterprise scale. With a strong foundation in cloud infrastructure, automation, and observability, I lead cross-functional teams focused on improving system reliability, reducing operational toil, and enabling rapid, confident deployments.

My work sits at the intersection of engineering and operations, where I drive initiatives around incident management, service level objectives (SLOs), and infrastructure as code to ensure consistent, measurable reliability. I hold a Master of Science in Advanced Computing from the University of Nottingham, which shapes my approach to solving complex engineering challenges.

At this SREcon26 event in Seattle, I’m excited to share insights from operating large-scale systems in the retail sector, discuss practical approaches to reliability engineering, and connect with peers who are shaping the future of SRE.

Beyond Blanket Freezes: Enabling Safe Innovation During Critical Events at Netflix

Tuesday, 4:20 pm–4:40 pm

Prachi Jain and Sandhya Narayan, Netflix

Deployment freezes are a common safety lever during high-risk periods, but they often slow teams down and create bottlenecks. In this talk, Prachi Jain and Sandhya Narayan share how Netflix is replacing blanket freezes with data-driven, service-specific risk management that enables safe, continuous delivery even during critical events. Attendees will learn how to classify services by risk, integrate this into CI/CD, and empower teams to ship quickly without sacrificing reliability.

Prachi Jain is an experienced SRE with expertise in building and managing scalable, reliable services in Public Key Infrastructure (PKI), cryptography, secrets management, and security. Currently an SRE at Netflix, she previously held key roles at Fastly and Cisco, where she addressed complex, large-scale reliability challenges for high-availability security applications. An active contributor to the Internet Engineering Task Force (IETF), Prachi helps shape security standards and is recognized for her passion for solving intricate technical problems.

Sandhya Narayan is an accomplished Technical Program Manager (TPM) with extensive experience in leading and managing global, cross-functional teams to deliver complex security initiatives. She is an expert in information security, compliance, and risk management, with a deep understanding of secure product lifecycles and regulatory requirements. Currently, Sandhya serves as a TPM at Netflix, where she drives critical security programs and fosters cross-team collaboration. Previously, she held strategic roles at Adobe, SAP, eBay, and the Stanford Research Institute, leading key global security initiatives and strengthening security engagement across organizations.

Reliability in the Big Leagues: How SRE Powers America’s Pastime

Tuesday, 4:45 pm–5:30 pm

Jessica Johnson and Chris Alexander, Major League Baseball

Available Media

In professional sports, a "timeout" is strategic, but "downtime" is a disaster. At Major League Baseball, SRE covers the whole field – ensuring real-time data integrity for millions of fans and powering the critical on-field technology that impacts every play. This session goes beyond the dugout to reveal how we built a Major League SRE practice from the ground up. We will share our journey of rebuilding trust through psychological safety, how we "score the game" using advanced SLOs, and how we are now going on the offensive as we approach Opening Day. Join us for a look at the unique hurdles of sports tech and walk away with a gameplan for fielding a championship-caliber SRE team.

Jessica Johnson is the Senior Director of Site Reliability Engineering at Major League Baseball, a tenacious leader who drives the adoption of reliability best practices at MLB through education, enablement and engagement. Recent accomplishments include leading teams to deliver over 300 reliability improvements in one quarter, and establishing the Operational Excellence program to empower leaders to reflect on SLOs and incident data. She is dedicated to building a psychologically safe culture where mistakes are learning opportunities and where information is democratized to support growth and innovation.

Chris Alexander is an Engineering Manager and Technical Lead for the Baseball Data Platform at Major League Baseball. Operating at the intersection of real-time sports analytics and high-availability infrastructure, Chris architects the pipelines behind MLB’s Emmy Award-winning Statcast system. He is responsible for delivering the critical, low-latency live game data that powers global broadcasts and sportsbooks. Beyond the tech stack, Chris leads MLB’s Learning From Incidents program, championing a blameless culture that prioritizes resilient systems over assigning fault.

Track 2

Grand Ballroom III

The Critical Resource Is You: Practical Destressing for On-Call Engineers

Tuesday, 3:55 pm–4:15 pm

Beth Adele Long, Adaptive Capacity Labs

Available Media

You know how to triage, diagnose, and resolve problems in your software and hardware systems — but can you do the same with your own biological stress response? In this talk we’ll look at the physiological experience of chronic and acute stress: the slow burn of on-call rotations and the hyper-alertness of incident response. We’ll address the observability challenge that makes it hard for many of us to notice stress before it becomes burnout. And we’ll unpack a few simple but powerful ways to use biology in your favor to recover from even the most intense incident.

Beth Adele Long is a Principal at Adaptive Capacity Labs and founding board member of the Resilience in Software Foundation. She has held engineering and product roles at New Relic, Jeli.io, and Gruntwork. She coaches individuals and organizations in incident response and analysis, adaptive delivery and sustainable practices for operational work.

AI Agents for Incident Investigation: The Good, The Bad, and The Ugly

Tuesday, 4:20 pm–4:40 pm

Vladyslav Budichenko, Vocaly AI

Available Media

AI agents can help with production incidents, but most demos skip the messy reality. This session covers using LLM-based agents and coding engines like Claude Code during real on-call work. You'll learn how agents can compress investigation time, improve MTTR, and help you understand what broke without drowning in alerts and dashboards. We'll look at what works: pulling context, correlating signals, suggesting fixes. And what doesn't: where agents give bad advice or waste your time.

Vladyslav Budichenko is a software engineer with 11 years of experience building distributed systems, AI applications, and production infrastructure at scale. An AWS- and GCP-certified professional architect, Vladyslav is also the founder of Vocaly AI, a platform to create voice AI agents for businesses. He is passionate about discovering new AI tools and integrating them into everyday workflows, especially for vibe coding and automations.

Three Lies We Tell Ourselves about Disaster Recovery and What to Do about Them

Tuesday, 4:45 pm–5:30 pm

Colette Alexander, Resilience in Software Foundation

Available Media

While many engineers and leaders participate in and sponsor disaster recovery (DR) activities at their companies, few ever need to wrestle with whether their DR systems will work in reality. This is because of the resilience and robustness of the systems they work within (and maybe a bit of luck). But what if we needed to use our DR plans in reality one day? What would happen then? Using real-life stories from the movies, the US defense department, and nuclear power we can illuminate what some of the weaknesses of our own DR plans are, and then come up with tactical and philosophical approaches to making DR exercises better for everyone.

Colette has been working as an engineering leader in the software industry for 10+ years. Her obsession with learning from incidents and Resilience Engineering began while managing teams at Spotify. It eventually led her to pursue her Masters in Science at Lund University in Human Factors and Systems Safety. She has led organizations in SRE and observability at HashiCorp and Cognite. She also maintains an active composition and recording career as a rock cellist, and lives with her rescue dog, 2 kids and husband in Ann Arbor, Michigan.

Workshop (continued)

Elliott Bay Room

STPA for Software Workshop: Finding the Outages Waiting to Happen

Tuesday, 1:50 pm–5:30 pm

Theo Klein, Ruben Barroso, and Garrett Holthaus, Google

SREs know about some of the flaws and vulnerabilities in their systems. They might also have intuition on where to look for additional issues–"known unknowns." But what about the "unknown unknowns"–outages waiting to happen that nobody is even looking for? With the vast complexity of modern software systems, this dark space of unknowns can be huge. And, what's worse, most of the outages in this space happen due to complex interactions between various parts of the system, even when everything is working according to specification, i.e. no implementation bugs.

What if we had a way to shine a light into the unknown unknowns? What if we could understand our systems enough to be able to methodically explore these complex interactions and build a comprehensive list of possible outage scenarios? In STPA, we model systems based on control-feedback loops, creating a hierarchical control structure, or HCS. In this session, we'll use a real Google system to show how an HCS can help you gain a new perspective and understanding of your system. We'll note similarities to common patterns in software design so you can start thinking about similar vulnerabilities in your own systems.

This session will be an interactive workshop. Attendees should plan to actively participate in the small group exercises in order to get the most benefit from the session.

Theo Klein is a Staff Site Reliability Engineer (SRE) at Google. Over the past two years, he has led an effort to improve the safety and reliability of road disruption data on Google Maps. He also has several years of experience applying safety engineering methods like STPA and CAST to proactively identify risks in complex socio-technical systems at Google. His primary interests are in systems thinking, dependency management and horizontal analyses of large-scale systems.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including Systems-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

Garrett Holthaus is a technical writer for Site Reliability Engineering at Google. He has a background in electrical and computer engineering, as well as experience teaching and designing science and technology curricula. In addition to writing and maintaining SRE documentation, Garrett develops and gives training in Systems-Theoretic Process Analysis (STPA) at Google.

Discussion Track

Vashon Room

Monitoring and Observability

Tuesday, 3:55 pm–5:30 pm

Daria Barteneva, Microsoft Azure, and Liz Fong-Jones, Honeycomb

Drowning in logs but still struggling to get answers? In this open-format session, we invite you to drive the conversation. Whether you're debating the cost of cardinality, struggling with alert fatigue, or actively implementing OpenTelemetry at scale, this unconference discussion session is your space to compare notes. Join us to discuss the tools, culture, and realities of monitoring and observability in complex systems.

Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organisational culture, processes, and platforms to improve service reliability and on-call experience. She has spoken at conferences on various aspects of reliability and human factors that play a key role in engineering practices, and has written for O'Reilly.

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently a Technical Fellow at honeycomb.io, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

She lives in Vancouver, BC with her wife Elly and partners, and in Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

5:30 pm–7:00 pm

Conference Reception at the Sponsor Showcase

Grand Foyer

Sponsored by IBM

Wednesday, March 25

8:00 am–9:00 am

Continental Breakfast

Grand Foyer

9:00 am–10:35 am

Track 1

Grand Ballroom II

Escaping Version Skew: Formalizing Compatibility in a World of Partial Rollouts

Wednesday, 9:00 am–9:45 am

Robbie Ostrow, OpenAI

Available Media

Developers interact with codebases one commit at a time, but the distributed systems we manage change much more chaotically. Services are constantly deploying at different versions and rollouts are never instant. A simple change to a wire format can break inflight requests, data in queues, caches, or databases.

Data formats like Protocol Buffers exist to solve some of these problems, but they aren’t expressive enough to encode higher-level invariants. Keeping track of breaking changes and managing multi-step deployments by hand is too much overhead for any engineer.

In this talk, we get a bit more formal about what it means to actually make a “breaking change” and introduce simple tooling that shifts rollout safety from individual developers into the system, while enabling stricter API contracts that prevent entire classes of failure.

Robbie Ostrow works on infrastructure and reliability at OpenAI, where he spends far too much time trying to formally express the invariants he expects from his systems. Before OpenAI, Robbie led engineering teams at Q Bio and Vanta, where he preached the same ideas at much smaller scales. When he isn’t thinking about type systems or breaking changes, you can find him in San Francisco playing word games or hunting down merch from failed companies.

The Case of the Misnamed Cities: CAST Analysis of a Google Maps Incident

Wednesday, 9:50 am–10:35 am

Ruben Barroso, Google

Available Media

Traditional Root Cause Analysis (RCA) is inadequate for learning from complex software incidents, often resulting in only proximal, sharp-end mitigations. For the past three years at Google, we have successfully applied Causal Analysis based on Systems Theory (CAST) to analyze major incidents. In this talk I will demonstrate the power of CAST using a real incident involving incorrect data being displayed to external users. I will contrast the findings of the original postmortem, which focused on the proximal chain of events, with the CAST analysis, which uncovered deeper, systemic environmental factors. I will show you how we use CAST to identify blunt-end systemic factors and translate those insights into recommendations that deliver durable safety improvements.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including Systems-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

Track 2

Grand Ballroom III

Ghosts in the Interview Loop and Avoiding AI Taylorism

Wednesday, 9:00 am–9:45 am

Andrew Hatch, Cisco ThousandEyes

Available Media

AI tools are now living in your interview loop, reducing once-rigorous SRE coding interviews to something an LLM can solve in seconds. Remote interviewing only makes cheating easier and harder to detect, and System Design interviews are fast-following as tools grow in sophistication.

In 2025, ThousandEyes collided with this reality: AI-assisted cheating and broken coding rounds required us to pivot to take-home challenges that test thinking, not just output. This talk will discuss interviewing and the emerging threat that these tools are leading us to a new dawn of Taylorism, albeit wrapped in cheerful Agentic/LLM chatter (as opposed to checklists and a stopwatch), reducing skilled engineers to mindless prompt operators, eroding expertise, lowering morale, and leaving fragile systems and codebases drowning in a sea of AI slop.

Weaving humor, critique, and lived experience, this talk will encourage debate on what we are hiring for in this new era of our industry.

Andrew Hatch is an engineering leader and SRE manager at Cisco ThousandEyes, with over 25 years in the technology industry across Australia, India, and the United States. He moved to the Bay Area in 2020 to join LinkedIn as an SRE Manager before taking up his current role at ThousandEyes. His work spans software engineering, consulting, operations, and building SRE and platform teams for large-scale systems. Andrew has previously spoken at SREcon on learning from complex systems and the realities of SRE management, and continues to explore how organisations can hire, lead, and learn more effectively in an AI-augmented world.

So You Want a New Incident Commander—Lessons from Building Incident Response Teams

Wednesday, 9:50 am–10:35 am

Vanessa Huerta Granda, Enova International

Available Media

Incident Command isn’t a badge for the most senior engineer. It’s a sociotechnical leadership skill that keeps teams aligned, reduces cognitive load, and builds trust during outages. This talk shares lessons from a decade building IC programs across SRE organizations, including how to identify, train, and support effective Incident Commanders without burning out your best responders.

Vanessa is a Technology Manager for Resilience Engineering at Enova. Previously she worked at Jeli.io helping companies make the most of their incidents. Vanessa has built and scaled incident command programs across multiple engineering organizations, training hundreds of engineers in high-pressure leadership, on-call operations, and outage communication. Her work focuses on the human side of reliability: creating sustainable practices that reduce burnout while improving resilience. She speaks frequently on incident command, SRE culture, and decision-making under uncertainty.

10:35 am–11:05 am

Coffee and Tea Break

Grand Foyer

11:05 am–12:40 pm

Track 1

Grand Ballroom II

Beyond Loss and Accuracy: Closing the Observability Gaps in AI Training with TrainCheck

Wednesday, 11:05 am–11:50 am

Yuxuan Jiang and Ryan Huang, University of Michigan

Available Media

AI training systems are now critical production infrastructure. Training large models requires thousands of GPUs for weeks, so silent failures waste enormous compute and engineering time. Despite their importance, the observability practices for AI training lag behind. Current practices rely on coarse, noisy signals that are sampled periodically and provide little help for catching or diagnosing many training errors.

This talk introduces TrainCheck, an open-source framework for deep observability inside the training process. TrainCheck introduces training invariants: semantic rules about expected internal behavior, such as consistency across parallel ranks or whether optimizer steps actually update parameters. We will describe how TrainCheck instruments training pipelines efficiently, automatically infers invariants from execution traces using relation templates, and derives any necessary preconditions. By continuously checking invariants during execution, TrainCheck detects subtle training errors early and provides actionable debugging hints. We will present evidence of TrainCheck's effectiveness on real-world issues.

Yuxuan Jiang is a PhD Candidate in the Department of Electrical Engineering and Computer Science (EECS) at the University of Michigan, Ann Arbor, advised by Dr. Ryan Huang. He is a member of the Ordered Systems Lab, where his research focuses on computer systems reliability, with an emphasis on detecting and preventing silent failures in large-scale machine learning, agentic and distributed systems.

Dr. Ryan Huang is an Associate Professor in the EECS Department at the University of Michigan, Ann Arbor, where he leads the Ordered Systems Lab. He conducts research broadly in computer systems, with specialties in designing principled methods to improve the reliability and performance of large-scale systems. He is a recipient of the NSF CAREER Award.

Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation

Wednesday, 11:55 am–12:40 pm

David R. Morrison, Applied Computing Research Labs

Available Media

Kubernetes has an extremely aggressive release cycle, with a new Kubernetes version released 3 times per year. Keeping up with this release schedule is a difficult and thankless task, exacerbated by the fact that there is no safe rollback path for Kubernetes upgrades. In this talk, we present SimKube, an open-source Kubernetes simulation environment that can be used to “shift left” the Kubernetes upgrade process. We will provide an overview of SimKube’s capabilities, which enable platform engineers to record a “trace” (i.e., a timestamped stream of events) collected from a production cluster, and replay it in a simulated setting. We will show how users can use this capability to identify components of their infrastructure that are incompatible with a new Kubernetes version before any live clusters are upgraded. Lastly, we will present a demonstration of SimKube, showing how it can detect upgrade issues in an example drawn from real-world experience.

David Morrison (drmorr online) is the founder and a research scientist at Applied Computing Research Labs, a small business focusing on Kubernetes scheduling, optimization, and simulation. He previously worked as a software engineer on the platform teams at Airbnb and Yelp, helping to manage their Kubernetes infrastructure and cloud operations. He received his PhD in computer science (operations research focus) from the University of Illinois, Urbana-Champaign in 2014, under the supervision of Dr. Sheldon Jacobson.

Track 2

Grand Ballroom III

Epistemology of Incidents and Problem Solving

Wednesday, 11:05 am–11:50 am

Jack Kingsman, Atlassian

Available Media

In high-pressure incident response, critical fundamentals of thought, action, and communication matter more than ever. Engineers need concrete criteria and examples of how to think and problem-solve during incidents, and answers to big questions: What fundamental decision-making loops should we orient ourselves in when disaster strikes? How do we reason about trapping the location and cause of unknown issues? What specific qualities make a hypothesis good or worthwhile, and how do we construct effective tests to prove or disprove them? How can we structure notes and progress updates to provide the most signal and the least noise in fast-paced situations?

Drawn from nearly a decade of experience in Site Reliability Engineering, from small startup to publicly traded SaaS firm, this talk will level up how you think and act when it matters, and equip you with the concepts to teach those skills to others.

Jack Kingsman is a Site Reliability Engineer at Atlassian, EMT, scuba diver, and serial hobbyist. He draws on his unique blend of experiences in infrastructure engineering, incident command, and emergency medical care and instruction to provide practical, battle-tested, people-oriented tools for operations and infrastructure. Jack believes that the most important part of any technical system is the people who build and tend to it, and that excellence in engineering begins with excellence in people, and providing them the tools and methods to rise to that excellence.

Human Factors in the Age of AI Ops: Re-Engineering Trust between Humans and Machines

Wednesday, 11:55 am–12:40 pm

Eddie Redick, CTC Ops

Available Media

When everything fails at once (cascading service degradation, overlapping automations, and an over-eager AI auto-remediator), you don’t rise to the level of your architecture; you fall to the level of your systems thinking.

As AI and automation become deeply woven into the fabric of reliability engineering, teams are learning that convergence isn’t just technical — it’s cultural, cognitive, and procedural. What happens when human intuition collides with machine logic in the middle of a P1?

Countless times, I have watched precious outage minutes wasted chasing false positives. AI is garbage in, garbage out, or, as I always say, "Only as smart as you feed it."

Your engineers can spend countless Scrum hours planning and building the best next-gen tool, only to have it fall flat. Unstructured data is tricky: if the tool is not architected effectively, or is starved of data, its ROI will waste away.

Eddie Redick is a Site Reliability and Incident Management leader known for his philosophy of “Commanding the Chaos,” a framework focused on psychological composure, systems thinking, and automation at scale. With over 15 years of experience managing large-scale distributed systems and high-severity outages, he has led reliability, observability, and response transformations across complex tech ecosystems.

He is passionate about bridging the gap between human intuition and machine intelligence, helping teams build systems that are not only reliable but resilient under stress. Eddie blends deep technical experience with a human-first leadership approach, bringing clarity to chaos when it matters most.

"Ask Me Anything" (AMA) Sessions

Elliott Bay Room

"Ask Me Anything" sessions are an opportunity for attendees to engage directly with SREcon26 Americas presenters.

AMA with Martin Smith

Wednesday, 11:05 am–11:50 am

Martin Smith, NVIDIA

This session brings back one of the speakers from the talk “Operating Tens of Thousands of GPUs on Hyperscalers: Failure, Firmware, and the Illusion of Capacity” as an AMA session host. Designed as an informal extension of the presentation, this is an open space to dive deeper into the technical nuances and follow-up questions that didn't fit.

Martin Smith is a Principal Architect for Site Reliability at NVIDIA, where he focuses on DGX Cloud and bridges the gap between engineering teams and the cloud service providers. With more than 20 years of experience in reliability and cloud infrastructure at companies like HashiCorp and Rackspace, he specializes in building scalable, resilient systems and infrastructure automation. Beyond his technical work, Martin is a dedicated mentor, speaker, and activist committed to making the world more awesome through engineering.

AMA with John Allspaw

Wednesday, 11:55 am–12:40 pm

John Allspaw, Adaptive Capacity Labs

John Allspaw didn’t just witness the birth of the DevOps and SRE movements - he helped write the script. He offers a rare blend of deep technical experience and academic rigor. This is a strictly off-the-record, no-AV session designed for high-signal Q&A. Ask him anything: about the history of the movement, the future of resilience engineering, and beyond.

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Discussion Track

Vashon Room

Management and Mentoring

Wednesday, 11:05 am–12:40 pm

Andrew Hatch, Cisco ThousandEyes, and Scott Nelson Windels, Notion

SRE isn't just about systems, it's also about the people running them. Join our knowledgeable facilitators for an interactive discussion about all things management and mentoring. Whether you're an IC looking to grow your influence or a manager navigating the world of people leadership in SRE, this session has a place for you.

Andrew Hatch is an engineering leader and SRE manager at Cisco ThousandEyes, with over 25 years in the technology industry across Australia, India, and the United States. He moved to the Bay Area in 2020 to join LinkedIn as an SRE Manager before taking up his current role at ThousandEyes. His work spans software engineering, consulting, operations, and building SRE and platform teams for large-scale systems. Andrew has previously spoken at SREcon on learning from complex systems and the realities of SRE management, and continues to explore how organisations can hire, lead, and learn more effectively in an AI-augmented world.

Scott is an Incident Program Manager at Notion. Previously, he spent nearly eight years at Slack, most recently as an Engineering Manager. He built out Slack's core incident response ecosystem, established a company-wide incident management program, and transformed incident culture by training over half of Slack's engineering organization. He brings hands-on experience building high-performing teams, scaling incident programs, and fostering the kind of cross-functional collaboration that makes organizations more resilient.

12:40 pm–1:55 pm

Luncheon

Grand Ballroom I & Cascade Ballroom

1:55 pm–3:30 pm

Track 1

Grand Ballroom II

From Chaos to Confidence: How SREs Can Leverage 50 (and Counting) Failure Scenarios to Test AI Readiness

Wednesday, 1:55 pm–2:40 pm

Rohan R. Arora and Bhavya, IBM Research

Available Media

Using sandboxed Kubernetes environments, we created 50+ production-inspired failure scenarios that put AI assistants to the test across the full SRE toolkit. The results? Current AI models resolve only 13.8% of scenarios, a reality check for anyone evaluating these tools.

This session introduces our evaluation framework and shows how you can use it to benchmark AI assistants against real failure patterns, chaos-test your own applications with production-inspired scenarios, and assess whether AI-assisted approaches fit your operational needs.

We're building a community-driven repository where SREs contribute real incidents and advance the field together. Come learn what AI can (and can't) do today, and help shape what it could do tomorrow.

Rohan R. Arora is a Senior Software Engineer at IBM Research. He joined IBM Research in 2016 after graduating from the University of Illinois at Urbana-Champaign with a Master's Degree in Electrical and Computer Engineering. In his early career at IBM, he co-led the effort on developing augmented and virtual reality-based solutions for the enterprise. Since 2021, he has been working at the intersection of machine learning (ML) and IT operations, particularly in the areas of incident management and resource optimization.

Bhavya is a Research Scientist at IBM Research, where she works on LLM-based agentic systems for IT automation, with a focus on areas like incident management. She holds a Ph.D. from the University of Illinois, Urbana-Champaign (2024), where her research explored LLM-driven approaches to novel NLP tasks, particularly in educational contexts. Prior to her doctorate, she spent two years at Gartner as a Data Scientist building Recommender Systems and Text Mining solutions.

When the Cure Is Worse than the Disease: Metastability in Recovery

Wednesday, 2:45 pm–3:30 pm

Todd Porter, Meta; Aleksey Charapko, University of New Hampshire

Available Media

Dealing with failures is an inevitable part of operating large distributed systems. Luckily, such systems are designed to handle failures and recover from their effects. In this talk, we explore the unfortunate cases in which recovery actions intended to address problems, unbeknownst to the operators, become the cause of even larger failures. This process occurs through natural recovery cascades in large systems, in which the recovery of one system or component triggers recovery in the next. We show that, via recovery cascades, systems may amplify the recovery cost at each step as the process crosses from one system to another. Moreover, these amplifications can propagate backwards into the systems that have already recovered, creating positive feedback loops that reintroduce and reinforce the failure.

We explain our findings, failure causes and contributing factors, and mitigation strategies using a global-scale message bus that experienced such problems as an example.

Note: David Maier from Portland State University made significant contributions to the contents of this talk.

Todd Porter is a Software Engineer at Meta working on stream processing and streaming ingestion systems. He focuses on studying emergent behavior of large-scale systems in order to make them safer to operate.

Aleksey Charapko is an assistant professor at the University of New Hampshire. He broadly works at the intersection of performance, reliability, and efficiency of distributed systems. Aleksey has won the NSF CAREER award for his ongoing work on Metastable Failures. In addition to his academic career, Aleksey has substantial industrial and consulting experience.

Track 2

Grand Ballroom III

The Unconspicuous Role of Conntrack in Kubernetes Networking

Wednesday, 1:55 pm–2:40 pm

Ricard Bejarano, Cisco

Available Media

One very common assumption of Kubernetes practicioners, even those in the network side of things, is that Kubernetes Services behave like good old load balancers. And to a certain extent, Services do behave in a similar fashion to what one would typically classify as a round-robin load balancer. However, the reality of it is much deeper than that. Both simpler and more complex at the same time.

In this talk we'll go over an incident we had in September 2025, where a mix of this misconception, Istio's behavior, CoreDNS' failure, kube-proxy's silence, and iptables and conntrack interoperability made it look like everything was OK, yet DNS (it's always DNS) was failing.

We will go deep into how brilliantly simple Kubernetes' networking is, how Istio's DNS works on top of Kubernetes' DNS, and how both broke each other.

Ricard is a Lead Site Reliability Engineer at Cisco ThousandEyes' SRE team. You can often find him investigating the weirdest incidents, such as the one that motivated this talk. Ricard is currently writing a book about homelabbing, so go talk to him if you have a homelab!

It's Not Always the Network (But Here's How to Prove It): Kubernetes Packet Capture for SREs

Wednesday, 2:45 pm–3:30 pm

Mitsuhiro Shibuya, Mercari

Available Media

As reliability engineers, many of us would prefer to avoid confronting the network directly. Its behavior is inherently best-effort, lacking the deterministic nature we expect from our application layer. Yet, because we are responsible for the end-to-end reliability of our complex systems, the day inevitably comes when we are forced to face this uncertain domain. This session is for every SRE who decided to dive in.

Instead of just theory, this talk provides a practical playbook for capturing packets in Kubernetes. We'll tackle the key challenges and offer a strategic guide to choosing the right tool for the job, whether it's kubectl debug, ksniff, or modern eBPF-based solutions. You won't leave as a network expert, but as an SRE armed with the toolkit to prove your case with data. Join us to earn the right to say, "See? It's not the network." (Or finally prove that it is!)

Mitsuhiro Shibuya is a Tokyo-based Site Reliability Engineer with nearly a decade of experience dedicated to the art of keeping complex systems running. He began his career as a Backend Engineer with a lasting love for Ruby, and keeps that passion alive through modest contributions to open-source projects. This talk was born from his recent experience on a Platform Network team, where he aims to empower fellow engineers with the network troubleshooting skills he was once forced to learn. Cat enthusiast.

"Ask Me Anything" (AMA) Sessions

Elliott Bay Room

"Ask Me Anything" sessions are an opportunity for attendees to engage directly with SREcon26 Americas presenters. Information about this AMA session will be available here soon.

AMA with Ruben Barroso

Wednesday, 1:55 pm–2:40 pm

Ruben Barroso, Google

This session brings back the speaker from the talk "The Case of the Misnamed Cities: CAST Analysis of a Google Maps Incident" as an "Ask Me Anything" session host. Designed as an informal extension of the presentation, this is an open space to dive deeper into the technical nuances and follow-up questions that didn't fit.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including Systems-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

AMA with Andrew Hatch

Wednesday, 2:45 pm–3:30 pm

Andrew Hatch, Cisco ThousandEyes

This session brings back a speaker from the talk "Ghosts in the Interview Loop and Avoiding AI Taylorism" as an "Ask Me Anything" session host. Designed as an informal extension of the presentation, this is an open space to dive deeper into the technical nuances and follow-up questions that didn't fit.

Andrew Hatch is an engineering leader and SRE manager at Cisco ThousandEyes, with over 25 years in the technology industry across Australia, India, and the United States. He moved to the Bay Area in 2020 to join LinkedIn as an SRE Manager before taking up his current role at ThousandEyes. His work spans software engineering, consulting, operations, and building SRE and platform teams for large-scale systems. Andrew has previously spoken at SREcon on learning from complex systems and the realities of SRE management, and continues to explore how organisations can hire, lead, and learn more effectively in an AI-augmented world.

Discussion Track

Vashon Room

Learning in SRE

Wednesday, 1:55 pm–3:30 pm

John Allspaw, Adaptive Capacity Labs, and Colette Alexander, Resilience in Software Foundation

When there is 99.95% availability with a service, there’s a tendency to focus almost exclusively on the 0.05% that is keeping us from the promised land: 100%. But have you ever wondered what makes the 99.95% happen? You know, the “non-incident” time?

The one thing that makes non-incidents happen is learning. People learn in different ways, at different times, and asynchronously. Come and talk with us about the most critical, and yet invisible, thing we do every day: learning.

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Colette has been working as an engineering leader in the software industry for 10+ years. Her obsession with learning from incidents and Resilience Engineering began while managing teams at Spotify. It eventually led her to pursue a Master of Science in Human Factors and Systems Safety at Lund University. She has led organizations in SRE and observability at HashiCorp and Cognite. She also maintains an active composition and recording career as a rock cellist, and lives with her rescue dog, 2 kids and husband in Ann Arbor, Michigan.

3:30 pm–4:00 pm

Coffee and Tea Break

Grand Foyer

4:00 pm–5:35 pm

Track 1

Grand Ballroom II

The Ironies of AI²

Wednesday, 4:00 pm–4:45 pm

J. Paul Reed, Chime

Available Media

We'll explore some of the "ironies" of automation—and now, artificial intelligence—in their interactions with software operators (i.e. you), especially during high consequence, high tempo situations (aka incidents).

We'll also look at considerations when building automation and integrating AI into your systems and workflows, including how we reason about them when they go awry and some food for thought on the role both AI and automation play in your next incident. Also? "Fun" incident stories!

J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful boutique consulting firm, he now spends his days as a Staff Incident Operations Manager at Chime, focusing on incident response, analysis, and systemic risk identification. He's worked with such organizations as VMware, Mozilla, Symantec, and Netflix.

Resilient Observability at the Retail Edge: A Lightweight, Scalable, and Cost-Efficient Framework

Wednesday, 4:50 pm–5:35 pm

Prakash Velusamy, CVSHealth

Available Media

This technical presentation addresses one of the most critical challenges in retail infrastructure: implementing comprehensive observability across distributed edge computing environments while operating with limited resources and network bandwidth. The session presents a novel, production-validated framework that combines lightweight Kubernetes distributions, OpenTelemetry, and optimized logging to achieve enterprise-grade monitoring with dramatically reduced resource consumption and cost.

Prakash Velusamy is a Principal Software Development Engineer at CVSHealth in Phoenix, USA. With over two decades of demonstrated excellence in the industry, Prakash is a technical leader in AI Operations (AIOps), Machine Learning, Observability, and Site Reliability Engineering (SRE), ensuring that the reliability, stability, and resiliency of services are maintained at their peak. He leads multiple key initiatives, including integrating AI into operations to improve productivity and service quality, as well as modernizing and migrating mission-critical flagship applications.

Track 2

Grand Ballroom III

Precision Over Proliferation: SRE Approach for Leaner, Smarter and Data-Driven Observability

Wednesday, 4:00 pm–4:45 pm

Md Shaghil, Rubrik

Available Media

Tame Your Metrics Monster: Cut Observability Costs by 40% Without Sacrificing Reliability
Observability costs eating 30% of your infrastructure budget? You're not alone. Learn battle-tested strategies that deliver 40%+ cost savings while maintaining full visibility.

Discover how we tackled exploding metrics cardinality using three game-changing approaches:
"Dollar per Query": A simple framework that measures each metric's cost vs. actual usage, helping you identify waste and optimize ruthlessly.
Batch Metrics: Process non-critical metrics offline and load them on-demand in Grafana, cutting costs by 16% while preserving historical visibility.
On-Demand Metrics: Enable metrics via API only when needed for debugging or testing, saving 4% immediately with massive potential as we scale to production.

Plus quarterly audits that root out stale metrics and save another 16%.

You'll leave with practical guidance on building usage-tracking dashboards, designing cost frameworks, creating self-service tools for engineering teams, and rolling out these solutions step by step. We'll share real successes, pitfalls to avoid, and how to empower teams to optimize their own metrics. Perfect for SREs and platform engineers drowning in observability costs who need actionable strategies to reduce spend without compromising reliability. Stop letting metrics costs spiral, and learn to make every metric justify its existence.
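
The abstract does not define how "Dollar per Query" is computed, so the following is only one plausible reading, sketched under assumptions: each metric has a known monthly cost and a count of queries that touched it, and the ratio of the two is used to rank candidates for cleanup. The `MetricUsage` type, its field names, and the sample figures are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricUsage:
    """Hypothetical record: what a metric costs and how often it is queried."""
    name: str
    monthly_cost_usd: float
    monthly_queries: int


def dollar_per_query(metric: MetricUsage) -> float:
    """Cost divided by usage; a metric nobody queries is treated as infinitely wasteful."""
    if metric.monthly_queries == 0:
        return float("inf")
    return metric.monthly_cost_usd / metric.monthly_queries


def rank_by_waste(metrics: List[MetricUsage]) -> List[MetricUsage]:
    """Most expensive per query first, so audit candidates appear at the top."""
    return sorted(metrics, key=dollar_per_query, reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only.
    sample = [
        MetricUsage("http_requests_total", 120.0, 6000),
        MetricUsage("legacy_cache_debug", 80.0, 0),
        MetricUsage("gc_pause_seconds", 50.0, 25),
    ]
    for m in rank_by_waste(sample):
        print(f"{m.name}: {dollar_per_query(m):.2f} USD/query")
```

Under this reading, never-queried metrics sort to the top, which is where a stale-metric audit would likely start.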

Md Shaghil is a Site Reliability Engineer with over 5 years of experience, currently working at Rubrik India Pvt Limited in Bangalore. He holds a Master's degree in Computer Science from the Indian Institute of Technology Bhubaneswar (IIT BBS). At Rubrik, Shaghil is part of the observability team, where he specializes in metrics services and cost optimization strategies. His work focuses on building scalable, cost-effective solutions that balance reliability with infrastructure efficiency.

Observability for LLMs: Understanding What’s Happening Under the Hood

Wednesday, 4:50 pm–5:35 pm

Salman Munaf, TikTok

Available Media

As LLMs and AI systems move into the core of modern products, keeping them reliable requires a new way of thinking. Monitoring large language models is fundamentally different from monitoring traditional web services. Latency and error rates alone no longer tell the full story.

This talk explores how observability changes when systems are driven by LLMs and GPU inference rather than REST APIs and CPU workloads. It breaks down the unique behaviors of these systems, including unpredictable model outputs, long context chains, token drift, embedding stores, and GPU bound execution.

Using real world examples, the session shows which signals actually matter, from token throughput and model latency to GPU utilization, memory pressure, and energy efficiency. Attendees will leave with a clear mental model for understanding LLM system health and a new perspective on reliability when your most critical component is a model that learns, drifts, and scales very differently from code.

Salman Munaf is a Lead Site Reliability Engineer at TikTok, where he builds and operates large-scale video infrastructure serving millions of users. He specializes in distributed systems, observability, and reliability at scale, with prior experience as a Software Engineer at Meta. Salman is passionate about helping developers embed reliability into their workflows from day one, making complex systems more resilient and easier to operate.

"Ask Me Anything" (AMA) Sessions

Elliott Bay Room

"Ask Me Anything" sessions are an opportunity for attendees to engage directly with SREcon26 Americas presenters.

AMA with Laura de Vesine

Wednesday, 4:00 pm–4:45 pm

Laura de Vesine, Reddit

Laura de Vesine is a staff engineer and 25-year industry veteran who specializes in the intersection of technical systems and organizational culture. She thrives in ambiguous spaces - whether that involves untangling complex technical problems or helping teams align on a shared direction. With a background in incident analysis and chaos engineering, Laura is passionate about building resilience in a way that genuinely improves the lives of both users and engineers. This is an informal, off-the-record session with no slides or microphones - just a space for candid conversation about prevention, resilience, or the realities of building engineering cultures.

Laura de Vesine is a 25+ year software industry veteran. She has spent the last 10 years in SRE working in incident analysis and prevention, systems understanding, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her kittens nap on her diploma.

Discussion Track

Vashon Room

Wellbeing and Burnout

Wednesday, 4:00 pm–5:35 pm

Beth Adele Long, Adaptive Capacity Labs, and Sarah Butt, Salesforce

Is your on-call rotation sustainable, or is it just "fine for now"? What happens after the incident is resolved and the adrenaline wears off? How do we avoid praising heroics while ignoring the toll they take on the humans behind the screens? Join facilitators Beth Long and Sarah Butt for an open and honest discussion about wellness, burnout, and caring for the humans in our systems.

Beth Adele Long is a Principal at Adaptive Capacity Labs and founding board member of the Resilience in Software Foundation. She has held engineering and product roles at New Relic, Jeli.io, and Gruntwork. She coaches individuals and organizations in incident response and analysis, adaptive delivery and sustainable practices for operational work.

Sarah Butt is a Principal Engineer within Salesforce's Reliability Engineering group, where she helps lead Salesforce's Centralized Incident Response organization. She is fascinated by scale, complexity, systems thinking, and non-functional requirements, particularly those around reliability. You'll likely find her talking about topics such as resilience engineering, observability, and incident management and response. Prior to working at Salesforce, Sarah worked in both Dell and SentinelOne's SRE organizations. In her free time, Sarah enjoys freelancing as an audio engineer, competing as an equestrian, and spending time with her husband and their three labradors.

5:45 pm–7:00 pm

Lightning Talks

Grand Ballroom II

Lightning Talks are four-minute talks by different speakers addressing a variety of SRE-relevant topics.

The Convergence of Correctness and Chaos: A Certificate Rotation Story
Bala Subrahmanyam Kambala
Zero Trust vs SLOs: What Broke First in Production
Charit Upadhyay, Adobe
From Brake Lights to Bottlenecks: What Traffic Teaches Us About Reliability
Apparao Boddeti, Fidelity Investments
Failure Nodes of Agentic AI Assistants
Philip Rowlands, Jane Street
Unit Typing as Reliability Infrastructure
Emmanuel I. Obi, The Radiativity Company
Telemetry Debt: When More Signals Make Incidents Harder
Khushboo Nigam, Oracle

Available Media

Thursday, March 26

8:00 am–9:00 am

Continental Breakfast

Grand Foyer

9:00 am–10:35 am

Track 1

Grand Ballroom II

The Gashlycrumb Tinies of AI Networking You Must Know (or Languish!)

Thursday, 9:00 am–9:45 am

Lerna Ekmekcioglu, Clockwork.io

Available Media

As the worlds of AI workloads and SRE practices converge, more and more teams are expected to ramp up on a whole new vocabulary. Terms like "NCCL", "PXN", "MoE" and "queue pairs" are thrown around, yet these concepts can remain elusive, buried in dense academic papers or scattered vendor documentation.

Inspired by Edward Gorey's whimsical illustrated alphabets, this talk provides a structured, approachable tour through essential AI networking concepts. We'll demystify the specialized terminology and building blocks of AI networking, explaining each concept in clear language with practical context and a touch of darkly playful illustration to help the concepts stick.

Lerna is a Senior Solutions Engineer at Clockwork Systems where she helps customers meet their performance goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Senior Solutions Architect serving Global Financial Services customers at AWS for 3 years. Before that, Lerna spent 17 years as an infrastructure engineer in large financial services companies working on authentication systems, distributed caching, and multi-region deployments using IaC and CI/CD, to name a few. In her spare time, she enjoys hiking, sightseeing and backyard astronomy.

How We Built Protosockets to Go Beyond HTTP and gRPC

Thursday, 9:50 am–10:35 am

Pratik Agarwal, Figma

Available Media

Modern systems often default to HTTP-based RPC frameworks like gRPC, but at high message rates, those abstractions can become the bottleneck. At Momento, Pratik and team built Protosocket, a lightweight, message-oriented layer over raw TCP that avoids the overhead of HTTP, gRPC, and WebSockets. By rethinking synchronization and message flow, they achieved 3× throughput, reduced server count by 65%, and cut client-side latencies by 70%. This talk explores how stripping away layers of abstraction can unlock surprising efficiency gains in production systems.

Pratik is a distributed systems engineer at Figma, where he architects and scales infrastructure that powers millions of designers and developers worldwide. With deep expertise spanning the entire backend stack - from client SDKs to storage systems - he has consistently delivered transformational performance improvements and cost optimizations across his career. Previously, Pratik worked on caching and high performance at Momento, on databases at DynamoDB, and on event processing at AWS Marketplace.

Track 2

Grand Ballroom III

5 Wrong Hypotheses about PostgreSQL Multi-Transaction Locks

Thursday, 9:00 am–9:45 am

Clint Byrum, HashiCorp, an IBM Company

Available Media

Working on complex systems and expecting to be right are not always compatible things. PostgreSQL is an incredibly complex component of complex systems, and how we use it is just as important as how it works internally.

We have spent several years operating a very busy AWS Aurora PostgreSQL database with varying levels of reliability, and we got pretty good at finding ways to break it with MultiXacts. This meant incidents, experiments, and learning to get a little less wrong with each iteration. Join us to learn about two very important things:

Several decreasingly wrong ideas about how MultiXact locks work in PostgreSQL, with data and analysis from real incidents.
How to persevere through the stress of incidents and keep learning in a complex system despite knowing that you're at least a little wrong, all the time.

Clint Byrum is a Staff SRE at IBM, working on the reliability and performance of HashiCorp Terraform, IBM's cloud offering for running Terraform. Clint has decades of experience in operations, open source, and software engineering, including working as a core developer on Ubuntu and OpenStack. More recently Clint has been a full-time reliability engineer, leading efforts to stabilize and scale systems inside GoDaddy, Spotify, and HashiCorp, with a particular focus on Resilience and Learning from Incidents. Clint co-hosts a podcast about Resilience in Software called "This is Fine!" with Colette Alexander.

Designing Layered High‑Availability Architectures for PostgreSQL on a Budget

Thursday, 9:50 am–10:35 am

Umair Shahid, Stormatics Pte Ltd

Available Media

This presentation treats high availability for PostgreSQL as a layered set of replication and failover patterns. Many teams deploy synchronous replication without understanding its trade-offs or rely on single-region clusters that can’t survive a data-center outage. Worse, budget constraints often force compromises that jeopardise recovery objectives. The talk proposes a framework for composing HA layers (local synchronous replication, asynchronous standbys, hot standbys, and cross-region disaster-recovery nodes) to meet specific RPO/RTO targets while controlling costs. It will cover practical experiences from multiple deployments where achieving four-nines availability was critical, yet enterprise clustering licences were not an option.

Umair Shahid is the founder of Stormatics, a PostgreSQL-focused professional services firm that designs, operates, and scales production database platforms for fintechs, SaaS companies, and large enterprises. He works on high availability architectures, performance optimisation, database security, and migrations from expensive proprietary databases to PostgreSQL, both on premises and in the cloud.

Umair is a veteran of the PostgreSQL community. A recognized subject matter expert, he speaks regularly at conferences worldwide and publishes blog posts that draw 15,000+ visitors each month.

Discussion Track

Vashon Room

System Rescue: Restoring Operational Excellence

Thursday, 9:00 am–10:35 am

Marianne Bellotti, Bloomberg Government, and Lorin Hochstein, Airbnb

Nobody knows what this code does. That service was never instrumented correctly. The runbook is stale and out of date. When the incident is over but the system is still broken, what happens next? The gap between the SRE ideal and the SRE reality can be wide, and once an incident is marked “resolved,” the work required to prevent the next one can feel overwhelming or even impossible. Join facilitators Marianne Bellotti and Lorin Hochstein for a discussion on how to tackle persistent, long-term operational problems and what it takes to rescue failing systems.

Marianne Bellotti is a software engineer who specializes in restoring legacy systems to operational excellence. She's the author of the bestselling book on the topic: Kill It With Fire. As part of the original United States Digital Service, she responded to incidents at the State Department, the IRS, and the Department of Defense, among others. Prior to federal service she worked for the United Nations on data sharing during ongoing humanitarian crises. She's currently focusing on the application of formal logic in triage and incident response, including developing the modeling language Fault.

Lorin Hochstein is a Staff Software Engineer, Reliability at Airbnb. He was previously Senior Staff Software Engineer at Coupang, Senior Software Engineer at Netflix, Senior Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.

Lorin has a B.Eng. in Computer Engineering from McGill University, an M.S. in Electrical Engineering from Boston University, and a PhD in Computer Science from the University of Maryland. He is also a proud member of the Resilience in Software Foundation and the Resilience Engineering Association.

10:35 am–11:05 am

Coffee and Tea Break

Grand Foyer

11:05 am–12:40 pm

Track 1

Grand Ballroom II

Operationalizing Key Management for Regulatory Compliance and Emergency Response

Thursday, 11:05 am–11:50 am

Swetha Srinivasan, Google

Available Media

Modern key management is a core SRE challenge, demanding verifiable control, continuous compliance, and a ready pipeline for emergency key rotations. Most large-scale systems fail to meet the strict SLOs and non-repudiable audit trails required by regulators and incident response teams.

Drawing from experience in the world’s most stringent regulatory environments, this talk presents an SRE framework for building and operating "compliance-native" cryptographic systems. We deconstruct common failure modes in high-stakes operations like root key rotation, especially when driven by external triggers.

Attendees will learn generalizable architectural patterns, including public key distribution and monitoring for identity keys. Discover how infrastructure built for deliberate compliance becomes your fastest tool in a security incident. This is for SREs and security architects moving beyond simply using cryptographic tools to guaranteeing the reliability and compliance of critical key management workflows.

Swetha Srinivasan is a Staff Software Engineer at Google with over a decade of experience in computer software, specializing in Cloud Infrastructure Security and Cybersecurity, with expertise spanning cryptography, secure coding, and systems design. She holds a Master's degree in Electrical and Computer Engineering from the University of Wisconsin-Madison and a Stanford Advanced Computer Security Certificate. At Google, she has led critical projects including Key Rotation for Google Sovereign Cloud and work on ensuring reliability while enforcing Boot Integrity on Google Production machines. Prior roles included contributing to security features like TPM attestation and key management at VMware, and optimizing OS capabilities at Intel.

How Security Incidents Are Different ... and How They're Exactly the Same

Thursday, 11:55 am–12:40 pm

Laura de Vesine, Reddit, and Alec Randazzo, Datadog

Available Media

In most ways, security incidents are extremely similar to the incidents that we regularly manage as SREs: they are emergencies that involve coordinating many responders, carry specific expectations for stopping the bleeding and then resolving the issue, and require teams to engage in long-term follow-ups to prevent recurrences. However, in some key ways they are meaningfully different. This is especially true around requirements to document and not document particular things, primarily because of legal concerns; it’s also often the case that investigation for a security incident requires a much more detailed lens than many reliability incidents. This talk will explore what experts on both sides of security & SRE/reliability incidents can learn from each other, and help SREs build a mental model of how and why the security space is different (and the same).

Laura de Vesine is a 25+ year software industry veteran. She has spent the last 10 years in SRE working in incident analysis and prevention, systems understanding, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her kittens nap on her diploma.

Alec Randazzo has been in the security detection and response space for 13+ years. He's split this time between incident response consulting, detection and response engineering, and internal detection and response. He's responded to dozens of high severity security incidents at organizations ranging from SMBs to Fortune 500s. Alec is passionate about real world practical detection and response. He is currently a staff security incident response engineer at Datadog.

Track 2

Grand Ballroom III

Intelligent Load Balancing in Kubernetes

Thursday, 11:05 am–11:50 am

Gaurav Nanda and Vincent Cheng, Databricks

Available Media

Kubernetes relies on kube-proxy and DNS for simple Layer 4 load balancing, which works for short-lived HTTP traffic but fails for persistent connections and high-throughput gRPC workloads. With thousands of requests multiplexed over a single TCP connection, clusters often see uneven load, pod hot-spotting, and rising tail latency.

This talk presents a client-side, control-plane-driven approach that removes kube-proxy and DNS from the data path. A lightweight control plane tracks Service and EndpointSlice updates, while client libraries receive live endpoint changes through xDS and make per-request routing decisions at Layer 7. We show how strategies like Power of Two Choices and zone-affinity routing improve load balance, stabilize tail latency, and reduce resource waste in production.

SREs and platform engineers will learn why default Kubernetes routing breaks down, how to design intelligent client-side load balancing, and what operational challenges emerge when deploying these systems at scale.
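
For readers unfamiliar with the Power of Two Choices strategy named above, the following minimal sketch illustrates the general idea: sample two endpoints at random and send the request to the one with less load. It is not the speakers' implementation; the endpoint names, the `in_flight` counter used as the load signal, and the simplified simulation (in which requests never complete) are all assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def pick_endpoint(in_flight: Dict[str, int],
                  rng: Optional[random.Random] = None) -> str:
    """Power of Two Choices: sample two distinct endpoints, return the less loaded one."""
    rng = rng or random.Random()
    endpoints = list(in_flight)
    if not endpoints:
        raise ValueError("no endpoints available")
    if len(endpoints) == 1:
        return endpoints[0]
    first, second = rng.sample(endpoints, 2)
    # Ties go to the first sampled endpoint.
    return first if in_flight[first] <= in_flight[second] else second


def simulate(num_endpoints: int = 10, num_requests: int = 10_000,
             seed: int = 0) -> Dict[str, int]:
    """Assign requests one by one and return how many each endpoint received.

    Simplification: requests are never completed, so the counter doubles as
    both the load signal and the final tally.
    """
    rng = random.Random(seed)
    in_flight = {f"pod-{i}": 0 for i in range(num_endpoints)}
    for _ in range(num_requests):
        in_flight[pick_endpoint(in_flight, rng)] += 1
    return in_flight


if __name__ == "__main__":
    counts = simulate()
    print("min:", min(counts.values()), "max:", max(counts.values()))
```

A real client would need a live load signal (for example, outstanding requests per endpoint) and an endpoint list kept current by the control plane; both are outside the scope of this sketch.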

Gaurav Nanda leads the Application Traffic and Networking Platform Infrastructure group at Databricks, where he focuses on multi cloud connectivity, intelligent load balancing, and overload protection for large scale Data and AI systems. He brings more than fifteen years of experience and previously held engineering leadership roles at Google and Harness.

Vincent is a software engineer on the Application Traffic team at Databricks. Much of his present day work involves ensuring that internal services can seamlessly reach out to each other and distribute network traffic efficiently. Prior to Databricks, he worked on large scale configuration distribution for Google Cloud.

Reliable OpenTelemetry at Scale: No Queue, No Problem

Thursday, 11:55 am–12:40 pm

Tommy Li and Vlad Seliverstov, ClickHouse

Available Media

SREs are expected to ingest and analyze massive streams of metrics, logs and traces across multi-tenant Kubernetes environments. Traditional approaches rely on central message queues and/or vendor pipelines, which struggle with cost, reliability and operational overhead at GB/s scale. In this talk, we present a practical reference architecture for a queue-less OpenTelemetry pipeline built entirely on Kubernetes using the OpenTelemetry collector, operator and OpAMP. We explore how we run a large fleet of collectors, roll out config changes safely without disrupting ingestion, and handle failure without data loss. Our system uses backpressure, autoscaling and object storage for overflow to support high throughput without needing Kafka or Pulsar. You’ll learn the concrete pipeline configuration, schema design choices and practices that make it possible to support trillions of events per day reliably at reasonable cost.

Tommy Li is a Senior Software Engineer at ClickHouse, working on the massive-scale observability platform supporting ClickHouse Cloud. Prior to ClickHouse, Tommy built Postgres infrastructure at scale at companies like Brex and Datadog.

Vlad Seliverstov leads the internal observability team at ClickHouse, overseeing the monitoring of ClickHouse Cloud, which handles over 200 petabytes of data and over a quadrillion events. With more than a decade of SRE experience at Datadog, Dropbox, and Facebook, Vlad now focuses on building and scaling ClickHouse’s in-house observability platform.

"Ask Me Anything" (AMA) Sessions

Elliott Bay Room

"Ask Me Anything" sessions are an opportunity for attendees to engage directly with SREcon26 Americas presenters.

AMA with Michelle Brush

Thursday, 11:05 am–11:50 am

Michelle Brush, Google, Inc.

This session brings back the speaker of the plenary talk “Taming the Unpredictable: Reliability in Chaos” as an "Ask Me Anything" session host. Designed as an informal extension of the presentation, this is an open space to dive deeper into the technical nuances and follow-up questions that didn't fit.

Michelle Brush has 25 years of software experience working across embedded software, distributed systems, enterprise software, and consumer devices. In her current role as an Engineering Director at Google, she leads the global teams of SREs that ensure Compute Engine and Persistent Disk are reliable. She is also the author of 2 of the 97 Things Every SRE Should Know.

Previously, Michelle worked for Cerner Corporation as the Engineering Director responsible for the data engineering platform for Cerner’s Population Health products. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm and later led the map technology department responsible for Garmin’s various spatial search, data compression, and shortest pathfinding algorithms. If you’ve ever gotten a route you didn’t like using a Garmin device, it was probably Michelle’s fault.

AMA with Alex Hidalgo

Thursday, 11:55 am–12:40 pm

Alex Hidalgo, Nobl9

Alex Hidalgo has spent years helping the industry navigate the practical and philosophical challenges of reliability. As the author of Implementing Service Level Objectives, he brings a thoughtful, human-centric perspective to how we measure and manage our systems. This is an informal, off-the-record session with no slides or microphones. It’s a space for candid conversation about the realities of error budgets, the nuances of SLOs, or any other reliability challenges you’re currently facing.

Alex Hidalgo is the Field CTO at Nobl9 and author of Implementing Service Level Objectives. During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching Premier League football. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Discussion Track

Vashon Room

AI in SRE

Thursday, 11:05 am–12:40 pm

Courtney Nash, The VOID, and Robbie Ostrow, OpenAI

From writing scripts to analyzing incidents, AI promises to change how SREs work. But how much of that is practical, and how much is vaporware? In this interactive discussion, engage with other conference participants to talk through what is actually working (real tools and use cases), what is failing (where it creates complexity and noise), and the messy reality of interacting with non-deterministic systems while supporting production environments.

Courtney Nash is the Co-founder and CEO of The VOID. Her research focuses on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she has held a variety of editorial, program management, research, and management roles at Verica, Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon.

Robbie Ostrow works on infrastructure and reliability at OpenAI, where he spends far too much time trying to formally express the invariants he expects from his systems. Before OpenAI, Robbie led engineering teams at Q Bio and Vanta, where he preached the same ideas at much smaller scales. When he isn’t thinking about type systems or breaking changes, you can find him in San Francisco playing word games or hunting down merch from failed companies.

12:40 pm–1:55 pm

Luncheon

Grand Ballroom I & Cascade Ballroom

1:55 pm–3:30 pm

Track 1

Grand Ballroom II

Unlock High-Frequency Deployments without Blowing Up Prometheus

Thursday, 1:55 pm–2:15 pm

Ganesh Vernekar, Reddit

Available Media

High-frequency deployments often create a hidden bottleneck in Kubernetes: time-series churn. As pods come and go, Prometheus accumulates "stale" series in memory, leading to dangerous memory spikes and OOM crashes.

This talk introduces stale-series compaction, a feature that proactively flushes stale data from memory to disk and protects your Prometheus during high series churn. Beyond the design, I will share critical learnings from production experiments at Reddit, including what to expect in terms of resource usage and what this feature is not for. Attendees will leave with a clear playbook for enabling this feature to unblock high-frequency rollouts without destabilizing their monitoring infrastructure.

Ganesh is a Staff Engineer at Reddit working on observability infrastructure and has been contributing to Prometheus for 8 years. He is also a maintainer of the Prometheus TSDB and a member of the Prometheus team. In his previous stint at Grafana Labs, he also worked on Mimir, Cortex, and Grafana.

How We Debug 1000s of Databases with AI: Lessons from an AI-Assisted Database Debugging Platform

Thursday, 2:20 pm–2:40 pm

Annie Zhou and Sophie Zhang, Databricks

Available Media

Engineers can now vibe-code with AI, so why aren’t we vibe-debugging yet? At Databricks, we built an AI-assisted debugging system that helps engineers reason about thousands of databases across multiple clouds. It consolidates metrics, logs, and system signals, explains anomalies, and recommends safe next steps – reducing investigation time by up to 90%.

In this talk, we will discuss how to build AI agents that are reliable in production, how to design the platform foundation that makes them safe, and how to drive adoption through fast iteration and user-centered design. Attendees will leave with concrete patterns for reducing incident toil and scaling SRE workflows with AI.

Building SRE Culture (without SREs, Technically)

Thursday, 2:45 pm–3:05 pm

Chris Harding, The Pokémon Company International

Available Media

Are you having a hard time establishing a culture of quality at your company? This talk gives contrasting examples of companies without SREs and tips on how to build an SRE culture, mainly from a software release pipeline perspective.

Chris Harding is a Release Manager at The Pokémon Company International and has held a similar role for Turn 10 Studios. Before that, he was a game designer for titles you are too young to remember. Ask him about cars, trading cards, or video games at your peril!

Reliability Engineering for Hybrid Robot-Cloud Systems

Thursday, 3:10 pm–3:30 pm

Jeff Corpuz and Rian Bogle, Agility Robotics

Available Media

Robots as a Service (RaaS): we deploy autonomous humanoid robots to perform tasks in warehouses. Our robot management software, which handles designing and running task workflows on fleets of robots, runs in AWS. In this talk, we will describe our journey developing a hybrid robot-cloud system, including how we used SRE practices to improve the reliability and stability of our RaaS deployments.

Jeff Corpuz is a Senior Engineering Manager at Agility Robotics, where he leads the Compute Platform and Internal Services teams within the Infrastructure organization. With more than a decade of experience across startups and large enterprises, Jeff has driven the adoption of cloud-native patterns and thrived in creating a bit of organized chaos with Kubernetes. He joined the robotics industry for the fun of controlling robots from the cloud, and today focuses on building strong, empowered engineering teams that keep the infrastructure running smoothly, so he can spend more time sailing, snowboarding, or scuba diving in peace.

Rian Bogle is the Principal Architect for Cloud and Infrastructure at Agility Robotics, where he designs and builds scalable, cloud-native systems for next-generation robotics. With over 20 years of experience in software engineering, platform engineering, and high-performance computing, he has led high-performing teams driving digital transformation and cloud adoption. A product-focused engineering leader, Rian is passionate about helping organizations maximize the impact and value of their software engineering teams.

Track 2

Grand Ballroom III

Achieving Secure, Scalable Multi-Tenancy in Kubernetes with Kyverno and Advanced Namespace Management

Thursday, 1:55 pm–2:15 pm

Roshan Subudhi, Jaison Paul, and Jeff Spahr, Capital One

Available Media

As enterprises scale their cloud-native footprint, multi-tenancy in Kubernetes is essential but presents major challenges: the 'Noisy Neighbor' problem, security drift, and scaling operational governance. This session details how we successfully consolidated over 120 development teams onto shared clusters by implementing a layered, policy-driven architecture.

We detail our systemic approach leveraging UserID vs. AD Group namespaces and Kyverno policies to automatically enforce Resource Quotas and default-deny NetworkPolicies. This strategy delivered a 99.9% reduction in resource contention incidents, achieved 100% security compliance, and cut onboarding time from hours to under 5 minutes.
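As a rough illustration of what "automatically enforce Resource Quotas and default-deny NetworkPolicies" means per namespace, the sketch below builds the two Kubernetes manifests such a policy might generate. It is not the presenters' policy: in Kyverno this would normally be expressed as YAML generate rules, and the function name, object names, and quota values here are hypothetical.

```python
def namespace_guardrails(namespace, cpu="4", memory="8Gi"):
    """Build the two manifests a generate-style policy might create per namespace.

    The quota values and object names are illustrative assumptions.
    """
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "default-quota", "namespace": namespace},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        # An empty podSelector selects every pod in the namespace; listing both
        # policy types with no allow rules blocks all ingress and egress.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    return [quota, default_deny]


manifests = namespace_guardrails("team-a")
```

With a default-deny baseline in place, each tenant adds narrower NetworkPolicies that allow only the traffic it needs.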

Roshan Subudhi is an Engineering Manager in the AI/ML org at Capital One. He has been working with Kubernetes over the past several years with flavors like AWS EKS, Azure AKS, GKE/Anthos, Rancher, Mirantis, Red Hat OCP and KOTS. Based in Northern Virginia, you'll find him sneaking a ride on his HD Sportster or building his Lego collection when not babysitting (some might say parenting) his two kids.

Jaison Paul is a Sr. Lead Engineer at Capital One, specializing as a Site Reliability Engineer (SRE) with a focus on Kubernetes. With 8 years of dedicated experience in the Kubernetes ecosystem, he is an expert in managing and scaling containerized environments. He excels at building resilient infrastructure and optimizing system reliability for large-scale enterprise operations.

Jeff Spahr is a Director in the AI/ML organization at Capital One. He has a background in cloud computing and Kubernetes focused on scaling high performing infrastructure and platform teams. He is a long-time Kubernetes enthusiast, and his passion for infrastructure extends from global-scale cloud deployments to the Raspberry Pi cluster running in his home office. Jeff enjoys sharing his technical journey and lessons learned with the broader community at conferences like KubeCon and All Things Open. He lives in Raleigh, NC, with his wife and two children.

Keeping a Hypervisor Fleet Up to Date with Minimal Customer Disruption

Thursday, 2:20 pm–2:40 pm

Atalay Kutlay, Akamai

Available Media

Maintaining a large hypervisor fleet at a cloud provider requires prioritizing critical infrastructure updates while limiting customer disruption. Software updates often require hypervisor restarts, triggering potentially disruptive migrations of guest VMs. Planning maintenance is complex for SREs, who must balance customer disruption, datacenter capacity, and timely completion of updates. Traditional batch-based rollout strategies often fail to account for workload distribution, leading to uneven impact and significant manual intervention.

This talk presents a production system that uses optimization-based scheduling to plan VM migrations and host updates together. The model decides when each action should occur, ensuring that only a limited number of hosts are updated at once and no customer experiences excessive simultaneous VM migrations, while completing fleet-wide updates efficiently. The resulting schedule is reviewed by SREs and executed automatically by the control plane, reducing manual planning efforts and increasing efficiency.
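To make the two constraints in the abstract concrete (a cap on hosts updated at once and a cap on simultaneous migrations per customer), here is a minimal sketch. It uses a simple greedy heuristic rather than the optimization model the talk describes, and the function name, input shape, and limits are all assumptions for illustration.

```python
from collections import defaultdict


def schedule_updates(hosts, max_hosts_per_slot, max_migrations_per_customer):
    """Greedily place each host in the earliest slot that keeps both limits.

    hosts maps a host name to the list of customers owning its guest VMs
    (one entry per VM, so a customer may appear more than once).
    A host that alone exceeds the per-customer limit still gets its own slot.
    """
    slots = []  # each slot: {"hosts": [...], "migrations": {customer: count}}
    # Place hosts with the most VMs first; they are the hardest to fit.
    for host, customers in sorted(hosts.items(), key=lambda kv: -len(kv[1])):
        demand = defaultdict(int)
        for customer in customers:
            demand[customer] += 1
        for slot in slots:
            fits = len(slot["hosts"]) < max_hosts_per_slot and all(
                slot["migrations"][c] + n <= max_migrations_per_customer
                for c, n in demand.items()
            )
            if fits:
                break
        else:
            # No existing slot can take this host, so open a new one.
            slot = {"hosts": [], "migrations": defaultdict(int)}
            slots.append(slot)
        slot["hosts"].append(host)
        for c, n in demand.items():
            slot["migrations"][c] += n
    return [s["hosts"] for s in slots]


fleet = {
    "h1": ["acme", "acme"],
    "h2": ["acme", "beta"],
    "h3": ["beta"],
    "h4": ["gamma"],
}
plan = schedule_updates(fleet, max_hosts_per_slot=2, max_migrations_per_customer=2)
# plan == [["h1", "h3"], ["h2", "h4"]]
```

A greedy pass like this gives a feasible plan quickly, while an optimization model can additionally minimize the total number of slots or weigh datacenter capacity.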

Atalay Kutlay is a Senior Software Engineer at Akamai. He likes to work on developing solutions to mathematical optimization problems, including virtual machine placement, scheduling, and capacity planning.

The Evolution of Postgres Performance

Thursday, 2:45 pm–3:05 pm

Ben Dicken, PlanetScale

Available Media

Postgres evolves quickly. Each major version brings new performance improvements. The best recent example is the new io_method setting introduced in version 18, which lets database engineers choose between sync, worker, and io_uring to control how I/O is handled. This and many other changes have a huge impact on performance, which in turn impacts reliability.

This talk will cover the performance enhancements that have come to Postgres over the past few major versions, including detailed benchmarks and recommendations for which settings to use for different types of workloads.

Ben Dicken spends his days researching, benchmarking, and writing about all things databases and distributed systems. He's currently in developer education at PlanetScale, and was formerly both a computer science faculty member and a research software engineer at a small database company.

Talking to/with Machines

Thursday, 3:10 pm–3:30 pm

Audrey Simonne

Available Media

Humans and machines hold different contexts about the world. The sudden rise of AI and the subsequent human-machine teams that have formed has created a need for engineers to learn a new language in order to talk with and to machines.

Audrey has held engineering and product roles at Ultimate Kronos Group, DigitalOcean, and Valon Mortgage. She focuses mainly on reliability, platform engineering, and healthy on-call rotations.

"Ask Me Anything" (AMA) Sessions

Elliott Bay Room

"Ask Me Anything" sessions are an opportunity for attendees to engage directly with SREcon26 Americas presenters.

AMA with Colette Alexander

Thursday, 1:55 pm–2:40 pm

Colette Alexander, Resilience in Software Foundation

This session brings back the speaker of the "Three Lies We Tell Ourselves about Disaster Recovery and What to Do about Them" talk as an "Ask Me Anything" Session host. Designed as an informal extension of the presentation, this is an open space to dive deeper into the technical nuances and follow-up questions that didn't fit.

Colette has been working as an engineering leader in the software industry for 10+ years. Her obsession with learning from incidents and Resilience Engineering began while managing teams at Spotify. It eventually led her to pursue her Master of Science at Lund University in Human Factors and Systems Safety. She has led organizations in SRE and observability at HashiCorp and Cognite. She also maintains an active composition and recording career as a rock cellist, and lives with her rescue dog, 2 kids, and husband in Ann Arbor, Michigan.

Discussion Track

Vashon Room

Infrastructure Management

Thursday, 1:55 pm–3:30 pm

Clint Byrum, HashiCorp, an IBM Company, and Chris Jones, Google

Whether you're wrestling with the eternal "buy vs. build" debate, fighting configuration drift, or figuring out how to scale without setting money on fire, there's a seat for you here. This session is a loose, open gathering with facilitators, but the agenda is ultimately up to you. Whether you want to talk about wrangling Kubernetes, the depths of networking, or anything in between, this session is designed to be an engaging, community conversation across multiple tables with space to cover the full spectrum of infrastructure challenges. Bring your topic or question of choice and pull up a chair!

Clint Byrum is a Staff SRE at IBM, working on the reliability and performance of HashiCorp Terraform, IBM's cloud offering for running Terraform. Clint has decades of experience in operations, open source, and software engineering, including working as a core developer on Ubuntu and OpenStack. More recently Clint has been a full-time reliability engineer, leading efforts to stabilize and scale systems inside GoDaddy, Spotify, and HashiCorp, with a particular focus on Resilience and Learning from Incidents. Clint co-hosts a podcast about Resilience in Software called "This is Fine!" with Colette Alexander.

Chris Jones helps make Google Maps more reliable. Based in San Francisco, he's been an SRE for several of Google's systems, including App Engine, and was an editor of the Google SRE Book (2016). He's also been the technical lead for Google Cloud Platform's privacy engineering team and is a licensed professional engineer.

3:30 pm–4:00 pm

Coffee and Tea Break

Grand Foyer

4:00 pm–5:30 pm

Closing Plenary Session

Grand Ballroom II & III

Reliability Equilibrium: The Hidden Playbook behind SRE Influence

Thursday, 4:00 pm–4:45 pm

Daria Barteneva, Microsoft Azure

Available Media

Reliability and velocity often feel like opposing forces - but what if we treat them as strategic games? This talk reframes sociotechnical trade-offs through a game theory lens, using Nash equilibria, Stag Hunt, Public Goods, and Shapley value to model real-world SRE dynamics.

We’ll explore why, without shared decision models, teams default to fragile equilibria like “freeze all changes,” and how mechanism design - error budgets, canary deployments, and progressive rollouts - can shift incentives toward safer, higher-utility outcomes.

Grounded in SRE practice and backed by DevOps research, this session equips you to diagnose bad equilibria, design guardrails, and influence system-level behavior - not just symptoms. Learn how to apply cooperative and non-cooperative game theory to reliability engineering and craft strategies that scale across teams, products, and platforms.
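For readers unfamiliar with the Stag Hunt, the sketch below finds the pure-strategy Nash equilibria of a two-team version of the game. The payoff numbers and the mapping of "stag" to shared reliability work are illustrative assumptions, not taken from the talk.

```python
from itertools import product

# Hypothetical payoffs (row team, column team); higher is better.
# "stag" = invest in shared reliability work, "hare" = go it alone.
PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
ACTIONS = ("stag", "hare")


def pure_nash_equilibria(payoffs, actions):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in actions)
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria(PAYOFFS, ACTIONS))
# prints [('stag', 'stag'), ('hare', 'hare')]
```

Both outcomes are stable, but only mutual cooperation is the high-payoff one. In this framing, mechanisms such as error budgets are ways to make the cooperative equilibrium the easier one to reach.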

Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organisational culture, processes, and platforms to improve service reliability and on-call experience. She has spoken at conferences on various aspects of reliability and human factors that play a key role in engineering practices, and has written for O'Reilly.

The Power of Stories

Thursday, 4:45 pm–5:30 pm

Lorin Hochstein, Airbnb

Available Media

We humans love stories, so much so that we both tell them and listen to them for fun. As SREs, our community has a particular fondness for incident stories.

In this talk, I'll discuss how these kinds of incident stories make for a more effective tool for learning from incidents than bullet points or metric trends. We'll see how stories provide us with glimpses into the complexity of our system that we'd otherwise never see, and enable us to learn from the experiences of others. We'll explore what makes for an effective story, as well as the dangers of stories that oversimplify the nature of complex system failure. And we'll look at how to foster an internal incident storytelling culture within an organization.

Lorin Hochstein is a Staff Software Engineer, Reliability at Airbnb. He was previously Senior Staff Software Engineer at Coupang, Senior Software Engineer at Netflix, Senior Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.

Lorin has a B.Eng. in Computer Engineering from McGill University, an M.S. in Electrical Engineering from Boston University, and a PhD in Computer Science from the University of Maryland. He is also a proud member of the Resilience in Software Foundation and the Resilience Engineering Association.

5:30 pm–5:35 pm

Closing Remarks

Grand Ballroom II & III

Program Co-Chairs: Patrick Cable, DraftKings, and Laura Maguire, Trace Cognitive Engineering