SREcon25 Europe/Middle East/Africa Program

Track 2

Systems Engineering and Principles

The Liffey B

ACK to the Future: TCP in 2025

Tuesday, 11:00–11:45

Philip Rowlands, Jane Street

Available Media

Are you making the most of that fancy global network? If the answer is between "no" and "shrug", then this talk aims to shine a light on understanding how to drive TCP to higher performance.

It’s hard to get network traffic to behave, even when you’re not running a huge public-facing website.

This talk will cover TCP in 2025, features which should be on everyone’s radar, the bad smells to look for in performance terms, and effective use of diagnostic utilities.

What is a congestion control algorithm, and why are there >10 of them?
How big should a receive window be?
What do all these offload options do?

TCP "out of the box" can leave a lot to be desired. Let's discuss some of the key indicators of health, and which tunables are worth touching.
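
As a rough illustration of the receive-window question above (a general rule of thumb, not material from the talk): a sender can have at most one window of data in flight per round trip, so the window needed to fill a path is its bandwidth-delay product. The function names and the 10 Gbit/s, 80 ms path below are assumptions chosen for the example.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return bandwidth_bps * rtt_ms // 8000  # bits * ms -> bytes (8 bits, 1000 ms)


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> int:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 * 1000 // rtt_ms


# Hypothetical path: 10 Gbit/s with an 80 ms round-trip time.
needed_window = bdp_bytes(10_000_000_000, 80)
# A 65,535-byte window (the maximum without window scaling) on the same path.
unscaled_limit = max_throughput_bps(65_535, 80)
print(needed_window, unscaled_limit)
```

On this assumed path the window would need to be about 100 MB, while an unscaled 64 KiB window caps a single connection at roughly 6.5 Mbit/s regardless of link speed.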

Philip Rowlands has been an SRE since before he really understood what it meant. He has worked over the years on automated telephony, Google Production SRE, Mainframe Linux, and more recently for various financial firms, all of which used TCP.

IPv6 in 2025: Quo Vadis?

Tuesday, 11:50–12:35

Alexandros Kosiaris, Wikimedia Foundation

Available Media

IPv6 was standardized in RFC 1883 in 1995, 30 years ago. It was envisioned as the next Internet Protocol, eventually replacing IPv4. In late 2024, worldwide adoption was around 40% and a linear projection, based on the rate of adoption during the last 10 years, ended up predicting that migration would hit 100% around 2045. It can be argued that this is an optimistic take.

Where does this leave an SRE team in 2025 and thereafter? We will broach a number of aspects to try to answer this question.

A Linux enthusiast, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a Devops hat as well), turned SRE, Alexandros has been in the space since 1999, starting as a hobbyist, then a professional. Currently working with the Wikimedia Foundation in a Principal SRE role, he has pushed forward for more virtualization, embracing the containerization and orchestration paradigm while putting out his share of fires.

Track 3

Liffey Hall 2

The Bitter and the Sweet of Running a Planet-Scale Build & CI Stack at Google

Tuesday, 11:00–11:45

Tomasz Koczorowski, Google Germany

Available Media

This talk offers a deep dive into managing Google's planet-scale Build & CI stack from an SRE perspective. We'll explore the complexities of supporting over 100,000 monthly users and massive workloads, maintaining a 98% cache hit rate across diverse computing environments. Discover resource management strategies for doubling year-over-year growth, the application of the Pareto principle, and critical caching layers (local, remote, P2P). We'll cover output storage managing hundreds of petabytes daily through deduplication and compression, and trace a user's build journey from desktop to production. Gain insights into optimizing the build stack and maintaining a 24/7 service with a small SRE team through strategic planning, monitoring, and collaboration. Finally, we'll discuss availability risks, standardization, and the pros and cons of a monolithic Build & CI stack.

Tomasz is an Engineering Manager at Google, leading the BATS SRE team, which operates one of the largest Build & CI stacks on the planet. He worked with Sun Starcat and IBM Regatta UNIX systems back in the mid-2000s, and in tech across multiple industries, including telco, rolling stock, data, and gaming, before joining Google. He is based in Munich and is a fan of airshows and Krampus.

Honey, I Shrunk the Data Center!

Tuesday, 11:50–12:35

Przemysław Lalak, Allegro

Available Media

Ever wondered what happens if you actually pull the plug on an entire data center? For five years, we've done just that. This talk shares our journey through intense chaos experiments to keep our services running seamlessly, even with one of the data centers completely offline. We'll explore our evolution, the art of physical server shutdowns/startups, capacity testing for critical paths and building reliable dependency maps. Learn how we secured executive buy-in for these daring drills and the new challenges in a hybrid environment. Discover our interesting pitfalls and big wins, offering practical insights for SREs and technical leaders.

Discussion Track

Liffey Hall 1

Tea, Pipelines, and Retries: A Practical Guide to MLOps

Tuesday, 11:00–12:35

Maria Vechtomova, Marvelous MLOps, and Sylvain Kalache, Rootly

The LLM narrative focuses on how AI empowers developers to write code faster, but far less attention is given to its impact on SREs and platform engineers.

This session will explore how AI is reshaping the SDLC after the code is written: from CI/CD pipelines, deployments, scaling, and monitoring, to incident management, reliability tooling, and emerging disciplines like LLMOps. We’ll discuss what's genuinely new, what remains unchanged, and what might be more hype than substance.

Questions we could explore:

As developers write more code, faster, how does this impact CI/CD pipelines?
Can deployment velocity match the new development rhythm?
ML applied to monitoring isn’t new; are LLMs bringing new capabilities?
How will these shifts affect incident responders?

Maria Vechtomova has worked in Data and AI for almost 12 years, focusing most of her career on MLOps. Maria believes that everyone should learn MLOps, as machine learning models only start living once in production. She is a co-founder of Marvelous MLOps, teaching courses on Maven: MLOps and LLMOps with Databricks, and writing a book for O'Reilly.

Sylvain leads AI Labs at Rootly, an initiative that aims to augment reliability engineering in the era of AI. Under his leadership, the lab has developed open-source prototypes, tools, and research collaborations, sponsored by organizations like Anthropic, Google DeepMind, and Google Cloud.

12:35–13:50

Luncheon

The Forum

13:50–15:25

Track 1

InFocus: Data and AI Reliability

The Liffey A

Dashboards & Dragons: Reliability Magic for AI Platforms

Tuesday, 13:50–14:35

Alexa Griffith and Sal Furino, Bloomberg

Available Media

Scaling a generative AI platform is no fairy tale. Instead, it’s an epic battle through dungeons spanning training workloads, inference services, and infrastructure. GenAI has introduced a whole new level of complexity for infrastructure, bringing heavier resource demands, new requirements for the scaling of token-based patterns, and questions about how to monitor and manage it all.

Building GenAI systems is hard, but keeping them reliable is even harder.

In this talk, we’ll recount our journey taming the complexity of multi-cluster AI platforms using actionable SLOs as our compass. Whether you’re building your first AI platform or defending the reliability of a cluster, you’ll complete your quest equipped with practical, open source-friendly strategies to help make your systems observable, debuggable, and resilient.

Alexa Griffith is a Senior Software Engineer at Bloomberg. She works on building Bloomberg’s AI Inference Platform and the open source KServe & Envoy AI Gateway projects. She enjoys solving engineering challenges at scale, working in open source, and speaking about AI, as well as engaging with the community through her personal podcast, Alexa’s Input.

Sal Furino is a Customer Reliability Engineer at Bloomberg. During his career he’s worked as a TPM, SRE, Developer, Sys Admin, and in IT support. When not working, he enjoys cooking, gaming, and traveling. Sal lives in Queens and has a bachelor’s degree in applied mathematics from Marist College.

Resilience for AI Workloads at Scale: The Fast and the Finicky!

Tuesday, 14:40–15:25

Lerna Ekmekcioglu, Clockwork.io

Available Media

A Formula 1 car at high speeds maintains a firm grip on the track, yet a piece of debris can force it to slow down and even bring it to a halt. Similarly, AI workloads rely on network fabrics such as RoCE and InfiniBand not only for high throughput but also for reliable communication paths.

When the path is littered with "debris" like NIC and link flaps, the results are just as frustrating: performance degrades, jobs crash, and costly rollbacks erode ROI.

Join me as we peek under the hood into the key networking challenges that hold back AI workloads. Through demos, we’ll see how these problems impact performance and reliability of AI jobs. Just as a pit crew ensures the race car thrives despite the obstacles on the track, let’s dive into these networking challenges to ensure AI workloads power through to the finish line at peak performance just like a Formula 1 champion’s car!

Lerna is a Sr. Solutions Engineer at Clockwork Systems where she helps customers meet performance and reliability goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Sr. Solutions Architect at AWS for 3 years. Before that, Lerna spent 17 years as an infrastructure engineer in large financial services companies focused on problems of scale like centralized authentication systems, distributed caching, and multi region cloud native deployments to name a few. In her spare time, she enjoys hiking, sightseeing and backyard astronomy.

Track 2

Systems Engineering and Principles

The Liffey B

Fifty Shades of Caching and How LLMs Paint It Blαck

Tuesday, 13:50–14:35

Effie Mouzeli, Wikimedia Foundation

Available Media

Caching is the quiet workhorse of the web. It makes things faster. And mostly not broken. From edge CDNs to per-process memory stores, caching spoils end-users by accelerating content delivery and helping websites remain resilient under heavy traffic.

Sounds great. Until you try to explain it to someone.

This session offers a practical deep dive into industry-standard caching technologies and explores how the ongoing ~~rise of the machines~~ surge in LLM traffic is steadily sabotaging them. Structured around chapters and real-world, open-source examples, this talk covers both the underlying technology and the current challenges.

Effie spent several years in small organisations. Currently an SRE at the Wikimedia Foundation, she is counting Wikipedia’s rabbit holes so you don’t have to. [citation needed] She’s co-chaired SREcon23 and SREcon24 EMEA, and has been a long-time contributor. Her limited written work includes a thesis no one read, a defunct Twitter account and sneaking a couple of articles into 97 Things Every SRE Should Know.

The Computer Wants to Lose Your Data

Tuesday, 14:40–15:25

Chris Sinjakli, PlanetScale

Available Media

Storing data is something we expect computers to just do. When your application writes data to a database, you trust it to give you that data back later, but what does it take to make that reliable?

In this session, we'll explore the ways that computers can surprise us by failing to save or corrupting our data. We'll do this through the lens of databases, with a focus on MySQL and Postgres.

Specifically, we'll cover:

The MySQL doublewrite buffer: the mechanism MySQL uses to guarantee writes make it safely to disk
The Postgres fsyncgate incident: where the Postgres team realised that the guarantees around Linux's fsync syscall weren't as strong as they thought
Write-through caches on disks: how manufacturers win benchmarks at the cost of data safety

We'll also look at how database replication can partially paper over these problems for us and the limits of what it can do.

Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems.

All his programs are made from organic, hand-picked, artisanal keypresses.

Track 3

Liffey Hall 2

Preventing Avalanche Failures in Large-Scale Distributed Systems

Tuesday, 13:50–14:35

Zhen Zhen, Baidu Online Network Technology, Beijing Co., Ltd.

In complex microservices architectures, a single frontend request can traverse dozens of service nodes. Under high concurrency, this creates tightly coupled scheduling chains where minor performance jitters—when combined with retries and queuing—can trigger avalanche failures. These occur when slow or failed services cause upstream retries and thread pool exhaustion, leading to blocked queues and cascading system-wide outages. Even after the initial fault is resolved, stuck requests may still clog the system, prolonging downtime.

This work presents a systematic approach to prevent and mitigate such failures. Key strategies include: fine-tuning timeout and retry settings, introducing full-link retry budgeting to reduce retry storms, and applying adaptive degradation and queue-shedding techniques during overloads. By breaking the positive feedback loop of retries and queuing, systems can recover faster and maintain availability—even under fault conditions.
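
To make the retry-budget idea above concrete, here is a minimal single-client sketch. It is not the speaker's full-link design: the class name, the ten-requests-per-retry ratio, and the credit cap are all assumptions for illustration. Each first-attempt request earns one credit, and a retry is only allowed when enough credits have accumulated, which bounds the retry amplification a failing dependency can receive.

```python
class RetryBudget:
    """Hypothetical retry budget: at most one retry per N first-attempt requests."""

    def __init__(self, requests_per_retry: int = 10, max_credits: int = 100) -> None:
        self.requests_per_retry = requests_per_retry
        self.max_credits = max_credits  # cap so idle periods cannot bank unlimited retries
        self.credits = 0

    def record_request(self) -> None:
        """Call once for every first-attempt request."""
        self.credits = min(self.max_credits, self.credits + 1)

    def try_retry(self) -> bool:
        """Spend credits for one retry; return False when the budget is exhausted."""
        if self.credits >= self.requests_per_retry:
            self.credits -= self.requests_per_retry
            return True
        return False


# Simulate 50 requests that all fail and all want to retry.
budget = RetryBudget(requests_per_retry=10)
allowed_retries = 0
for _ in range(50):
    budget.record_request()
    if budget.try_retry():
        allowed_retries += 1
print(allowed_retries)
```

With these assumed settings, retries add at most 10% extra load during a total outage, instead of doubling or tripling it as a fixed "retry N times" policy would.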

Zhen Zhen is a Senior R&D Engineer at Baidu, responsible for the availability of Baidu's search system. His work focuses on stability engineering, infrastructure technologies, and data engineering.

Gaining Insights from a Black Box System

Tuesday, 14:40–15:25

Thiara Ortiz, Netflix

Available Media

In this talk, I'll discuss how we monitor service health as black boxes. SREs often face ambiguity, and I'll show how we use multiple measurement techniques to understand system behavior, aligning with the need for robust observability tools. These strategies are crucial for system reliability and user experience. By proactively identifying and resolving issues, we ensure a smoother playback experience and maintain user trust, even as the platform continues to evolve and gain maturity. The principles shared within this talk can be expanded to other applications such as AI reliability in data quality and model deployments.

Thiara is a Cloud Gaming SRE Manager at Netflix. Over the last five years, Thiara has been working on Open Connect, improving the resilience of the Netflix service for members around the world. Most recently, Thiara has been heavily involved with the introduction of Cloud Gaming on the Netflix platform.

Discussion Track

Liffey Hall 1

Oops to Ops: Establishing SRE Practices

Tuesday, 13:50–15:25

Stephane Dudzinski

Establishing SRE practices, gaining buy-in, and building momentum around reliability can be a significant hurdle for many organizations. This interactive discussion session, designed for both SRE newcomers and seasoned practitioners, will explore common challenges and effective strategies for implementing SRE principles based on our collective experience. We'll delve into practical approaches for demonstrating the value of SRE, fostering a culture of shared ownership, and overcoming resistance to change. Participants are encouraged to share their experiences, ask questions, and contribute to a collaborative exchange of ideas on how to successfully integrate SRE into their organizations and build a strong foundation for operational excellence.

15:25–15:55

Coffee and Tea Break

The Forum

15:55–17:30

Track 1

InFocus: Data and AI Reliability

The Liffey A

Utilization Is the Key to Efficiency: What It Takes to Run Inference on the Geographically Distributed Network

Tuesday, 15:55–16:40

Aleksei Semiglazov, Cloudflare

Available Media

Running AI inference on the geographically distributed network presents a unique set of challenges that traditional cloud-centric SRE practices often fail to address. Unlike centralized data centers, edge deployments involve geographically dispersed processing units, vast model catalogs, and a complex interplay of resource constraints and network variability. A critical, yet often elusive, objective in this domain is achieving high processing unit utilization, which directly impacts both the operational cost (CapEx efficiency) and service quality (latency trade-offs). Underutilized models are a significant financial burden, while over-utilization can degrade user experience.

Aleksei Semiglazov is a Senior Systems Engineer at Cloudflare. With over 15 years of experience in different areas, he is currently driving large-scale deployments of real-time AI inference infrastructure across 300+ global edge locations. With deep expertise in edge orchestration, observability, and model lifecycle management, Aleksei bridges the gap between infrastructure resilience and AI performance and brings practical insights from building and operating mission-critical edge systems that prioritize both latency and cost-efficiency.

Experimenting with AI-Driven Systems

Tuesday, 16:45–17:30

Jay Lees and Javier Martin Montull, Meta

Available Media

In this talk, we'll delve into the world of continuous online A/B-test experiments, where hundreds of engineers are constantly iterating, tweaking, and tuning models that directly impact business outcomes. We'll discuss the challenges of instilling a reliability mindset into this rapidly changing environment, and share our experiences with implementing mechanisms for preventing, detecting, and quickly mitigating issues triggered by AI-experiments in large-scale systems.

Jay is a Production Engineer at Meta in the Monetization Infra & Ranking organization, where he focuses on Reliability problems across the Ads Delivery, Events & Reporting stacks. He spent the last 2 years working on Experiment Safety, making it easier and safer for engineers across the company to test and iterate on their ideas across both ML & non-ML applications.

Javier is a Production Engineering Manager at Meta, supporting teams in the Ads ML space. During his 7+ year career at Meta, Javier has been both an engineer and a manager and has always focused on the Ads infrastructure space. Prior to Meta, Javier spent 10 years working as a software engineer at the physics lab CERN.

Track 2

The Liffey B

Catch Me If You Can: Hunting Misconfigurations Before They Break Prod

Tuesday, 15:55–16:40

Jagadeesh Devaraj and Marcel Punselie, ING

Available Media

"It worked on my machine." Famous last words before prod goes up in smoke. This talk explores how we built a smart "drift radar system" that catches misconfigurations, rogue YAMLs, infra drift, and sneaky app bugs before they hit production. Learn how we used control loops, runtime policy enforcement, and CI/CD sensors to stop reliability failures in their tracks. Expect real war stories, platform-agnostic tactics, and practical takeaways for developers, SREs, and platform engineers who want to make resilience effortless and scalable.

Jagadeesh Devaraj is a seasoned Cloud Architect with over 20 years of experience in IT, specialising in cloud-native platforms, infrastructure automation, and site reliability. Currently at ING, he focuses on building resilient systems using technologies like VMware, AWS, Kubernetes, and DevOps tooling. A long-time VMware vExpert and speaker at VMworld and vForum events, Jagadeesh is also an active blogger and community contributor. His work blends deep technical expertise with a passion for operational excellence and continuous learning.

Marcel Punselie has over 25 years of experience in IT, and his career has always been driven by a passion for reliability. He began as a Cobol and Java programmer—even carrying a beeper (remember those?) while on standby. For the past 15 years, he has worked as a Solution Architect across various companies, consistently focusing on resilience and high availability. More recently, he has also taken on the role of managing several teams responsible for guardrails, including well-architected reviews, conformity bots, and static code analysis.

What I Wish I Knew Before Choosing Spot Instances

Tuesday, 16:45–17:30

Lasse Canth Hels, Maersk

Available Media

How do you run a massive observability platform on nothing but spot instances? With extreme difficulty.

In this talk, I'll cover real-world stories from the front lines and offer honest insight into the pros and—mostly—cons of the setup. I will discuss how we minimize disruption from evictions as well as the band-aid solutions we've inevitably had to set up to keep our platform steady in a constant stream of chaos. I will also reveal the issue that we couldn't crack and which finally forced us to move partially away from spot instances.

Join to hear what it takes to realise the extraordinary cost savings promised by spot instances at scale, and the many ways in which we failed to make it work.

Lasse is a software engineer at Maersk. As a member of the telemetry team, he took part in building the Maersk Observability Platform, and now spends much of his time keeping it running. Outside of computing, his interests include speedrunning, powerlifting, etymology, and box office performance.

Track 3

Liffey Hall 2

Making Reproducible Builds Faster with Docker Bake

Tuesday, 15:55–16:15

Liz Fong-Jones, Honeycomb

Available Media

SREs working on build and engineers working in DevOps are likely familiar with container orchestration and build systems using Docker, but may not necessarily be familiar with some of the latest developments in build tooling. Docker Bake is a powerful new toolchain embedded into the Docker buildkit daemon that allows for decomposing Docker images into reusable steps and parallel building of multiple Docker images from the same sources at once. This allows for faster builds, with better provenance. Learn how Honeycomb slashed our build times in half using Docker Bake.

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

Speeding Up Terraform Caching with OverlayFS

Tuesday, 16:20–16:40

Ricard Bejarano, Cisco

Available Media

The Terraform plugin cache, unfortunately, does not support concurrent Terraform inits. This leaves us no choice other than disabling caching or serializing our inits, both of which would significantly slow down our Terraform pipelines, which are expected to make hundreds of plans in single-digit minutes.

In this talk we go over how we solved that with OverlayFS, a tool we rarely see outside of its niche, but so powerful that it helped us cut our plan times by 90%!

Ricard is a Lead Site Reliability Engineer on Cisco ThousandEyes' SRE team. His background is mostly networking, observability, incident management, infrastructure automation and hunting down the weirdest of bugs.

Ricard is responsible for a Terraform pipeline with 500+ developers, over 120k resources under management, and multiple thousands of plans a day.

Service Messy: Pagerduty's Service Mesh Migration

Tuesday, 16:45–17:30

Liz Frost, Pagerduty

Available Media

Service meshes are extremely trendy right now, and the benefits are many. Traffic management, flow control, multi-cluster load balancing, and mTLS are all advertised benefits. But of course all these are benefits of having a service mesh, and unless your company is brand new, nobody starts with one.

Pagerduty had around 500 individual services that needed to be migrated. What's worse, we had a very lean team to do it with. We used the "away team" model, where we made all the initial code changes required, then asked other teams nicely to review those changes.

It seemed like a nearly impossible task, but a year later we're looking at 95% completion and making good progress on the long tail. Along the way we've learned a lot about how to make a migration successful, what's good in a service mesh, and how to keep a team from burning out on an extremely monotonous task.

Liz Frost is a Senior SRE on Pagerduty's Core Infrastructure team. She's previously worked at Heroku, Heptio and Buzzfeed, and she's passionate about building platforms in any form they take. She's based in beautiful Vancouver, Canada.

Discussion Track

Liffey Hall 1

On Building Systems Where Normal Engineers Can Do Great Work

Tuesday, 15:55–17:30

Charity Majors, Honeycomb.io

As an industry, we are obsessed with being (or hiring) the best, the smartest, the most wildly productive individuals. But engineering teams own production software, not individuals. And placing too much focus on individuals has a way of letting leaders off the hook for crafting—and investing in—the sociotechnical systems that unlock higher productivity, which is a big part of what forges great software engineers over time.

What if the greatest engineering orgs in the world are the ones where normal engineers can show up, ship code, be productive, and move the needle materially forward on business priorities, each and every day?

Open Questions

Is this just an excuse for mediocrity, for not valuing excellence? Why or why not?
How should we define “excellence”? Is it the same everywhere?
How can managers distinguish between low performers and engineers with low metrics, who quietly make everyone around them better?
Can you change a culture or improve productivity by hiring in different people? How does a culture change, how do productivity standards go up?

Charity Majors is the co-founder and CTO of honeycomb.io. She pioneered the concept of modern Observability, drawing on her years of experience building and managing massive distributed systems at Parse (acquired by Facebook), Facebook, and Linden Lab building Second Life. She is the co-author of Observability Engineering and Database Reliability Engineering (O'Reilly). She loves free speech, free software and single malt scotch.

17:30–19:00

Conference Reception at the Sponsor Showcase

The Forum

Enjoy hors d'oeuvres and beverages while networking with other attendees and visiting the exhibits as we close out the first day of sessions!

Wednesday, 8 October

08:00–09:00

Morning Coffee and Tea

The Forum

09:00–10:30

Opening Plenary Session

InFocus: Platform Engineering

The Liffey

HyperRouter: Lessons Learnt from Building an L4 Load Balancing Service

Wednesday, 09:00–09:45

Linhua Tang and Jayaganesh Kalyanasundaram, Huawei Ireland Research Center

Available Media

This talk shares hard-won lessons from designing and operating a large-scale Level 4 load balancing service built for high performance, resilience, and reliability. We’ll cover critical design decisions—including choosing DPDK over eBPF/XDP for the data plane, using BGP path prepending for safer node degradation, adopting local health checks, and building a decentralized peer-to-peer control plane to survive network partitions. Beyond architecture, we’ll explore how focusing observability on Critical User Journeys (CUJs) enhanced monitoring and incident response. Intended for engineers, SREs, and architects, this session offers practical insights into building robust, scalable infrastructure with real-world trade-offs and operational strategies that can be applied across distributed systems.

Linhua Tang (also known as James) is a software engineer and tech lead for global server load balancing and formal methods at Huawei Ireland Research Center. Before that, he worked at Microsoft and Amazon in different distributed systems.

Jayaganesh Kalyanasundaram is a principal software engineer for the observability space in Huawei Ireland Research Center. Before that he worked at Google as a tech lead for the CI/CD team.

SRE for AI and AI for SRE

Wednesday, 09:45–10:30

Todd Underwood, Anthropic

Running frontier AI systems creates significant reliability challenges while simultaneously offering new tools to address them. This talk explores both sides of this equation. We'll examine the unique SRE challenges of large scale AI/ML systems - how training creates distributed systems nightmares, how accelerator scheduling defies traditional patterns, and the unique challenges of LLM serving. But we'll also explore the flip side: how these models are becoming part of the reliability toolkit itself.

This talk covers the general idea of what works in model-assisted technical operations, covering the functions that work well now and those where human expertise remains critical. It relies on real-world experience and reports across a variety of environments. Whether you're running ML workloads or considering LLMs as operational tools, you'll leave with concrete strategies for navigating the new reality where AI is both the challenge and part of the solution.

Todd Underwood leads reliability at Anthropic, a company trying to create AI systems that are safe, reliable, and beneficial to society. Prior to that, he briefly led reliability for the Research Platform at OpenAI. Before that he was a Senior Engineering Director at Google leading ML capacity engineering at Alphabet. He also founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services. He was also the Site Lead for Google’s Pittsburgh office. Along with several colleagues, he published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).

10:30–11:00

Coffee and Tea Break

The Forum

11:00–12:35

Track 1

InFocus: Platform Engineering

The Liffey A

Leader Election: Pitfalls and Alternatives

Wednesday, 11:00–11:45

Andrew Medworth, Google UK

Available Media

Leader election is a popular design pattern for distributed systems managing critical state. But despite its simple and innocent appearance, hidden dangers lurk. A reliable leader-elected service requires more than just a proven consensus implementation and a superficial strategy for handling the lower availability that comes with strong consistency.

This talk will present the theory and practice of fencing, illustrated by a serious outage of a Google service which we thought was doing everything right. It will also discuss some challenges with the operation of leader elected services, and some alternatives to leader election.
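
For readers unfamiliar with fencing, the general pattern can be sketched as follows. This is a generic illustration, not the mechanism used by the Google service in the talk, and the `FencedStore` class and token values are hypothetical. The election service hands each new leader a strictly larger token, and the storage layer refuses writes that carry a token older than the newest one it has seen, so a paused or partitioned ex-leader cannot overwrite its successor's state.

```python
from typing import Optional


class FencedStore:
    """Hypothetical storage layer that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        """Accept the write only if the token is at least as new as any seen so far."""
        if token < self.highest_token:
            return False  # writer was superseded by a later leader
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
accepted_old = store.write(1, "from leader A")         # leader A holds token 1
accepted_new = store.write(2, "from leader B")         # leader B is elected with token 2
accepted_stale = store.write(1, "late write from A")   # A wakes up after a long pause
print(accepted_old, accepted_new, accepted_stale, store.value)
```

The key design choice is that the check lives in the resource being protected, not in the leader: a leader cannot reliably know that it has been deposed, but the store can always compare tokens.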

Andrew Medworth is a Staff SRE at Google. He manages the London half of Traffic Interconnect SRE, a team responsible for hybrid connectivity and NAT for Google Cloud Networking. He is currently Tech Lead for Torpedo, which is the system responsible for transport-layer egress from Borg, Google's cluster manager.

Lessons from 25 Years in Infra: What’s Changed, What Hasn’t, and What’s Next

Wednesday, 11:50–12:35

Michal Gryko, Tempest

Available Media

Infrastructure has evolved from hand-managed servers to YAML pipelines and internal developer platforms. But even with new tools, challenges around speed, scale, and safety still persist.

This talk explores 25 years of change, from sysadmins and DevOps to cloud sprawl and the rise of platform engineering. We’ll look at how internal platforms aimed to simplify infrastructure, where they fell short, and what lessons teams can carry forward.

We’ll also explore what's next. AI agents are beginning to let developers request infrastructure in natural language. Platform and SRE teams continue to play a critical role by defining safe systems and enforcing guardrails.

You’ll walk away with:

A clear view of how infrastructure has evolved
Insights into the challenges of building and scaling internal platforms
An introduction to agentic platform engineering
Practical ideas on where human oversight remains essential

Since childhood, Michal has been fascinated by electronics, often disassembling devices to understand how they worked. This early curiosity led to a degree in Biomedical Engineering, providing a strong foundation in electronics, signal processing, and low-level programming. Over the past 20 years, Michal has worked as what today would be called a systems engineer or infrastructure specialist, building and maintaining complex systems across a wide range of environments.

Connect:

Track 2

Full-Stack Observability

The Liffey B

Level Up Your Edge: Do You Know Where Your Traffic Comes From?

Wednesday, 11:00–11:45

Radha Kumari, Slack

Available Media

Many companies struggle with understanding their traffic sources, leading to incidents caused by malicious activity, faulty automation, or legitimate but overwhelming use. This is often due to authentication occurring too deep within the network stack, making vital information unavailable at the ingress.

This talk details how Slack addressed this by implementing an Authentication Service at the L7 edge load balancers, resulting in increased visibility into traffic patterns, more fine-grained control, and fewer incidents, leading to happier SREs.

Radha is a Staff Software Engineer for the Demand Engineering team at Slack (Ireland) where she focuses on ensuring "bytes" move in and out of Slack as expected.

Outside work, she loves travelling around the world and has been to nearly 40 countries since 2013. She enjoys playing keyboard, knitting and also has a passion for collecting shoes.

Connect:

Auto-Instrumentation for GPU Performance using eBPF

Wednesday, 11:50–12:10

Nikola Grcevski, Grafana Labs

Available Media

This talk explores the potential of leveraging eBPF to capture CUDA calls made to GPUs, including kernel launches and memory allocations. Data from these probes can be used to export Prometheus metrics, facilitating a detailed analysis of kernel launch patterns and associated memory usage. This approach offers significant benefits as eBPF imposes minimal overhead and requires no intrusive instrumentation. By leveraging eBPF, the instrumentation can be enabled (or disabled) while the GPU application is running; for example, AI/ML training monitoring or profiling can be enabled after the training has started.

Nikola Grcevski has worked as a software engineer for more than 20 years, mostly in the field of compilers, managed runtimes and performance optimization. Most recently he has been working on low-level application instrumentation with eBPF at Grafana Labs, and he is a maintainer of the OpenTelemetry eBPF Instrumentation project.

Track 3

InFocus: Data and AI Reliability

Liffey Hall 2

Introduction to MLOps

Wednesday, 11:00–12:35

Maria Vechtomova, Marvelous MLOps

MLOps isn’t just about getting models into production. It’s about making them reliable, traceable, and easy to maintain. In this workshop, we’ll explore the core principles of MLOps and put them into practice, focusing on traceability and reproducibility in real-world ML systems.

While the principles are tool-agnostic, we’ll demonstrate them using MLflow and Databricks: tracking models, deploying pipelines, and showing how these practices help you move fast without breaking production.

Maria Vechtomova has worked in Data and AI for almost 12 years, focusing most of her career on MLOps. Maria believes that everyone should learn MLOps, as machine learning models only start living once in production. She is a co-founder of Marvelous MLOps, teaches courses on Maven (MLOps and LLMOps with Databricks), and is writing a book for O'Reilly.

Connect:

Discussion Track

Liffey Hall 1

Keep Calm and Handle the Incident

Wednesday, 11:00–12:35

Chris Sinjakli, PlanetScale, and Laura de Vesine, Datadog Inc

As long as software keeps breaking, incident management will be a core skill for SREs, but it’s one that’s always evolving.

Has the new wave of incident management platforms changed everything, or are they only as good as the practitioners using them? What’s different about how we handle incidents in the face of constrained budgets and layoffs? How does the industry desire to “just sprinkle some AI on it” interact with social norms built up over years?

The session will run as an unconference-style discussion. We could discuss the topics above, or take it in a completely different direction! We’ll spend the first part of the session writing up topics we’re interested in, group them together, and then open up the floor for discussions.

Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems.

All his programs are made from organic, hand-picked, artisanal keypresses.

Connect:

Laura de Vesine is a 25+ year software industry veteran. She has spent the last 9 years in SRE working in incident analysis and prevention, systems understanding, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her kittens nap on her diploma.

Connect:

12:35–13:50

Luncheon

The Forum

13:50–15:25

Track 1

InFocus: Platform Engineering

The Liffey A

The Chicken or the Egg: Creating a Stable Preprod Environment as a Public Cloud Provider

Wednesday, 13:50–14:35

Christine Pukropski, Akamai

Available Media

As a public cloud company, we needed a reliable internal environment to safely test changes before shipping to Production. Supporting our infrastructure requires validating an ever-widening array of components and services. This required our SRE team to build and maintain a full-featured replica of Production from scratch. This talk tells the story of how we brought "Testcloud" online. We’ll give a technical overview of the challenges we faced building "Testcloud", including how we tackled the unique operational challenges that come with building isolated ecosystems. These include how we evaluated and adapted existing industry examples, challenges in achieving parity with production, partitioning production from lower environments, and managing circular dependencies. The target audience is anyone who finds themselves having to test software code or physical components while managing customer expectations.

Christine Pukropski has been a Senior SRE at Akamai (formerly Linode) for 8 years. She loves tackling scaling problems and creating repeatable and reliable platforms on which the Compute team can build a solid foundation. She lives in Philadelphia with her partner, two vocal cats, and a parrot who just turned 27!

Connect:

ARM Migration Made Practical

Wednesday, 14:40–15:25

Nati Cohen, AWS

Available Media

ARM processors are shaking up the cloud by delivering faster performance, lower costs, and greener computing. But with all these benefits, why do so many teams still hesitate to make the leap? This session makes ARM migration practical: we’ll clarify the architecture, identify easy-to-migrate workloads, and share proven steps for evaluation and transition. Learn why real-world testing matters, discover essential tools, and build multi-arch containers without increasing CI time. Whether you’re starting a new project or updating legacy apps, get actionable insights for a smooth, successful migration.

Nati is a Solutions Architect with AWS. He delights in helping customers simplify complex systems, teaching them about the inner workings of cloud services and debugging annoying technical oddities. When he is not at his computer he is soldering electronic kits, tinkering with smaller computers and drumming on a Taiko.

Connect:

Track 2

AI and Automation in SRE

The Liffey B

From Vibes to Outages: Riding the AI Code Wave

Wednesday, 13:50–14:35

Sylvain Kalache, Rootly

Available Media

AI-assisted coding is exploding. Cursor is the fastest-growing SaaS company, and over a third of the Fortune 500 are using Copilot. But this acceleration doesn't translate into reliability. Quite the opposite.

This talk dives into the quirks of LLM-assisted development, with real-world examples: hard-to-trace bugs, AI-generated tests that mirror flawed logic, and hallucinated dependencies or slopsquatting that open up security gaps.

We'll examine the operational fallout: skyrocketing code churn, higher incident rates from shipping more changes faster, and large batch deployments that make debugging harder.

As developers become less familiar with their own code, leaner SRE teams are left to deal with the consequences.

We'll also explore the cultural shift: blaming outages on "the AI," eroding accountability. This talk offers actionable strategies for SREs, from adopting AI-powered incident tools to developing "incident vibing".

Sylvain leads AI Labs at Rootly, an initiative that aims to augment reliability engineering in the era of AI. Under his leadership, the lab has developed open-source prototypes, tools, and research collaborations, sponsored by organizations like Anthropic, Google DeepMind, and Google Cloud.

Connect:

How We Used Statistics to Find Toil among 36,000 Changes

Wednesday, 14:40–15:00

Dylan Ratcliffe, Overmind

Available Media

How much precious engineering time are you spending on changes that don’t really need it? In this talk, we’ll reveal how we analysed nearly 37,000 infrastructure modifications to rigorously quantify "toil": the work that slows teams down without adding real value. We’ll share the statistical techniques and practical models we developed to find the best opportunities for safe auto-approval, and show how you can use similar methods to identify (and eliminate) wasted effort in your own workflows. We’ll also discuss pitfalls, trade-offs, and how AI can take your approval process even further. Whether you’re a platform engineer or an SRE, you’ll learn how to free up your team to focus on what really matters while keeping your systems safe and reliable.
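
One way an analysis like this could be framed is sketched below. It is a hypothetical illustration, not the method presented in the talk: changes are grouped by type, and a type is treated as an auto-approval candidate only when a conservative lower bound on its historical success rate (here the Wilson score interval) clears a threshold. The change types, counts, and the 99% threshold are invented for the example.

```python
import math


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


# Hypothetical history: change type -> (changes with no incident, total changes).
history = {
    "tag-only update": (4000, 4000),
    "security group rule": (950, 1000),
    "new dashboard": (20, 20),
}

THRESHOLD = 0.99  # assumed bar for "safe enough to auto-approve"


def auto_approval_candidates(history: dict, threshold: float) -> list:
    """Return the change types whose lower bound clears the threshold."""
    return sorted(
        name
        for name, (ok, total) in history.items()
        if wilson_lower_bound(ok, total) >= threshold
    )


if __name__ == "__main__":
    print(auto_approval_candidates(history, THRESHOLD))
```

With these made-up numbers only "tag-only update" qualifies. "new dashboard" has a perfect record, but 20 samples are too few for the lower bound to reach 99%, which is the reason for using an interval rather than the raw success rate.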

Dylan Ratcliffe is the Founder & CEO of Overmind, where he leads efforts to harness AI for preventing outages and giving platform teams real confidence in deploying infrastructure changes. Before founding Overmind, Dylan spent nearly seven years at Puppet, holding senior engineering and leadership roles across Australia and the UK, including Senior Manager of Professional Services for EMEA. At Overmind, he has raised funding from some of Silicon Valley’s most successful venture capitalists, fueling his mission to revolutionise change management and deployment safety. Based between London and San Francisco, Dylan is also passionate about motorcycle racing, various electronics projects, and travel.

Connect:

Track 3

Culture and SRE Maturity

Liffey Hall 2

Training New Incident Commanders: Pokemon Style!

Wednesday, 13:50–14:10

Laura de Vesine, Datadog Inc

Available Media

Training engineers on the skills to be effective incident commanders (communication, coordination, delegation, etc.) can be a challenge—we tend to get bogged down in whether a scenario is "realistic" enough—and then derailed by debugging, instead of focusing on incident command itself. What if we solved for that by removing engineering from practice scenarios entirely? This talk presents a real, in-use training and practical exercise for incident command skills doing just that, by using a made-up, lighthearted scenario (with Pokemon!).

Laura de Vesine is a 25+ year software industry veteran. She has spent the last 9 years in SRE working in incident analysis and prevention, systems understanding, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her kittens nap on her diploma.

Connect:

Run, Walk, Crawl, or How We Failed Our Way to SLO Readiness

Wednesday, 14:15–14:35

Rob Durst, Spring Health

Available Media

Scaling a site-reliability culture from the ground up at a hyper-growth, resource-constrained startup is a uniquely challenging endeavor: plenty of playbooks to scale SRE teams exist, yet every startup’s socio-technical reality is its own puzzle. And while "reliability is the most important feature" rings true, tight deadlines and shifting priorities often sideline proactive reliability initiatives. Thus, since these cycles are a precious commodity, ensuring their success is paramount.

By retracing our SLO adoption journey, highlighting failures, missteps, and near wins en route to our eventual breakthrough, we uncover an effective litmus test for gauging readiness (or recognizing when a team isn’t quite there yet).

Today this readiness framework guides how we assess the timing of reliability investments at Spring Health. It also serves as a practical tool for teams in fast-growing engineering orgs still early in their reliability journey, especially those navigating similar constraints.

Rob is a Site Reliability Engineer at Spring Health where he leads the engineering organization’s SLO rollout and all things perf lab. He transitioned into his current role from the software engineering side, bringing with him experience across a range of domains from edge networking to blockchain systems. On the side he is also a programming language enthusiast who has spent entirely too much time tinkering with a declarative DSL for defining service expectations.

Rob lives at the foot of the beautiful Wasatch Front, where he and his wife spend their time outside of work cheering on the Mammoth and paddleboarding with their watermelon-obsessed Yorkie.

Connect:

Taming the Cost of Telemetry: How Riot Games Reined In Observability Costs

Wednesday, 14:40–15:25

Maxfield Stewart

Available Media

Every growing company hits the same wall: skyrocketing costs from metrics and logs. Riot Games was no exception: ingesting over a petabyte a month, we had to rethink everything.

In this talk, we’ll share how we tackled our ballooning observability bill. Not just with tools, but by reshaping engineering culture to make cost a shared responsibility. From clever process tweaks to accountability frameworks, we’ll show you what actually moved the needle.

Whether you're an IC drowning in dashboards or a leader staring down a budget cliff, this session will give you the insights (and battle scars) to take control of your telemetry spend.

Maxfield Stewart has been shipping software and supporting production environments for over 25 years. His experience ranges from private consulting for Fortune 500 companies like Goldman Sachs and Sprint to over a decade and a half in the game industry. For the last 12 years, Max has been helping Riot transition to continuous delivery and micro-services, and changing the culture around production availability, RCAs, post-mortems and observability.

Discussion Track

Liffey Hall 1

Responsible Use of ML in SRE Work: Between a (Thinking) Rock and a Hard Place

Wednesday, 13:50–15:25

Laura Nolan and Niall Murphy, Stanza Systems

This session explores the moral responsibilities of SREs in an AI-driven world. As AI becomes part of technology work, including SRE, new questions of professional responsibility emerge.

What can we reasonably delegate to AI tools? What do we need to keep doing ourselves and why? How do we manage more advanced tools, particularly those which may 'hallucinate' or otherwise be unpredictable? What opportunities exist to reduce the human cost of software operations work without creating unacceptable production risks?

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Relevant to this session, she also holds an MA in Ethics from Dublin City University. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

Niall Richard Murphy has worked in computing infrastructure since the mid-1990s, and has been employed by every major cloud provider (specifically Amazon, Google, and Microsoft) from their Dublin, Ireland offices in a variety of roles from IC to Director. He is currently CEO/Co-founder of Stanza Systems, a small startup in the ML/AI/reliability space. He is the instigator, co-author, and editor of multiple award-winning books on networking, reliability, and machine learning, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.

Connect:

15:25–15:55

Coffee and Tea Break

The Forum

15:55–17:30

Track 1

InFocus: Platform Engineering

The Liffey A

Is My Thing Reliable? Adventures in Driving Consistency of Reliability across the Organisation

Wednesday, 15:55–16:40

Dominic Hutton, HashiCorp

Available Media

In a statement unsurprising to just about everybody, achieving reliability isn’t exactly straightforward. What would you do if your mission is to enable the teams in your organisation to progress towards it? You’re outnumbered and competing for service owners' finite attention against alternate incentives.

This is an honest reflection on techniques we relied on to enable teams to understand and make steps towards improving their reliability posture. We’ll look at some of the major activities we undertook along our journey as the sociotechnical system grew around us. Scaling our SRE-in-the-loop operational readiness assessments whilst trying to mitigate common reliability issues with platform primitives had us stretched thin. We invested in building out a reliability data platform for continuous attestation to help us scale our efforts. We'll also discuss our service review programme that involved collaborating with teams to surface architectural and operational readiness issues that they were experiencing.

Dominic is an engineer who has been doing engineer-y things long enough to understand the issue with appeal to accomplishment or tenure. He’s spent time within early stage startups trying to do it all through to large organisations with teams just for enabling groups of other teams.

He’s been privileged enough to work within all sorts of domains in software and reliability adjacent functions. He likes to point to the time he worked on large software systems and ground control hardware for satellite constellations but would feel he’s leading people astray if he neglected to mention his various stints serving fast food and coffee.

Tales from the Platform Adoption Mines

Wednesday, 16:45–17:30

Dave O'Connor, Himself

Available Media

We've built a beautiful platform. It's got everything people need - it will improve utilisation, make reliability easy, and generally reduce headaches for all involved. It has wonderful documentation, and we've worked out the more obvious pitfalls and solved those problems for everyone once and for all.

So, why aren't people adopting it? We've done the hard part, so why is there so much resistance?

Turns out, no matter how beautiful a platform we build, our approach to adoption can be the difference between enthusiasm and resistance when it comes to people actually using it. The methods we imagine for driving adoption don't seem to work, and the buy-in we had seems to have evaporated.

What's next?

This talk will run through a set of pitfalls, thought processes, practices and war stories from the platform adoption mines across several companies of various sizes, and give practical advice on approaches and methods to try, if the front door seems to be jammed.

Dave O’Connor has been an SRE (or SRE-adjacent) since 2004, and has built and operated teams, services and organisations at Google, Elastic.co, Twilio, and as a freelancer.

His principal areas of interest are working with busy teams and organisations, care and maintenance of the business end of Reliability as a practice, and building high-functioning organisations who don't hate their lives. He is based in Dublin, Ireland.

Connect:

Track 2

AI and Automation in SRE

The Liffey B

From 4 Hours to 8 Minutes with AI Agents That Transform SRE Incident Response

Wednesday, 15:55–16:40

Peter Jausovec, Solo.io

Available Media

Tired of spending hours troubleshooting certificate rotation failures, load balancer misconfigurations, and database connection issues? This session introduces AI Reliability Engineering (AIRE) - a framework of specialized AI agents designed to automate incident response and reduce SRE toil.

Learn how to build three core agents that form a modern reliability engineering backbone: a Terraform agent that generates configurations following AWS practices, a GitOps agent that manages complete PR workflows from creation to deployment, and an infrastructure validation agent that verifies post-deployment resources.

The talk covers implementation details any platform engineering team can adopt, including agent instruction design, MCP server integration, and testing strategies. You'll see a demo of how these agents work together to potentially save you hours of manual work.

Peter Jausovec is an engineer at Solo.io with over 17 years of experience spanning software development, QA, and engineering leadership. A recognized expert in cloud-native technologies, he specializes in Kubernetes, Istio, and AI infrastructure. Peter is a maintainer of the kagent project and has been at the forefront of AI gateway and agents development at Solo.

His recent work bridges traditional cloud-native architectures with emerging AI workloads, helping organizations navigate the intersection of service mesh, API management, and artificial intelligence.

Connect:

Modernizing Incident Response with LLMs, RAG, and the MCP

Wednesday, 16:45–17:30

Theofilos Papapanagiotou, Amazon

Available Media

In traditional SRE workflows, incident response often relies on fragmented tools and tribal knowledge. This talk shares how a large-scale SRE team transitioned to secure, LLM-powered workflows using the Model Context Protocol (MCP), a novel pattern for injecting real-time, local system data (logs, configs, tickets) into LLM prompts without violating security controls. We’ll cover how we paired MCP with a domain-specific retrieval system built on OpenSearch and Bedrock Titan embeddings, enabling semantic search over incidents, dashboards, and playbooks. Learn how we tackle adoption, safeguard sensitive data, and measurably reduce resolution times.

Theofilos Papapanagiotou is a Senior Applied Scientist at Amazon, specializing in serving large language models (LLMs) at scale with a focus on performance, reliability, and cost-efficiency. He brings deep expertise in ML infrastructure, Kubernetes, and GPU optimization to help organizations deploy custom GPT-based models in secure, production environments. His work supports large-scale Site Reliability Engineering (SRE) operations, where he leads the integration of LLM-powered workflows that combine real-time logs, metrics, and incident data using the Model Context Protocol (MCP). Theofilos’ contributions bridge the gap between AI innovation and resilient system operations.

Track 3

Culture and SRE Maturity

Liffey Hall 2

The Un-Incident: Extracting Value from the Gray Area of Incident Response

Wednesday, 15:55–16:40

Andreas Deuschl, Dynatrace

Available Media

Not every learning moment comes from a declared incident. Across many years in SRE and incident response, I’ve seen that a significant portion — often between 30% and 60% — of potential incidents never make it into formal tracking. While this reduces the immediate load on response teams, it also means we miss critical opportunities to learn.

This talk explores the "Un-Incident" — those ambiguous, gray-zone events that don’t meet formal criteria but still reveal gaps in systems, processes, or assumptions. You’ll learn how to recognize these moments, reduce friction in triage and reviews, and shift from debating classification to extracting insight.

With real-world examples and practical strategies I’ve learned over the years, this session offers a fresh lens on incident response — one that values learning over labels.

With over 25 years in tech leadership across Operations, SRE, and Security, Andi Deuschl has encountered (and occasionally caused) his fair share of incidents. As a Product Lead for Delivery, Reliability & Security at Dynatrace, he focuses on sustainable reliability and security practices, incident prevention and response, and continuous learning from both Incidents and Un-Incidents.

Connect:

Learning Inside the Box: The Scope of Incident Reviews

Wednesday, 16:45–17:05

Rachel Silber

Available Media

Effective incident reviews are those that have the right scope - the right people in the room, the right systems under consideration, a view of events that may start before the incident was declared. What can we take away about the challenges of getting the most out of the limited time we have to prepare and discuss our learnings from incidents?

Rachel Silber provides technical leadership for people, processes, and technology that govern complex systems. Standing on the pillars of attainable goals and sustainable processes, Rachel focuses on safety engineering to help teams make their work better.

Connect:

Embedding Platform SREs: A Hybrid Model for Driving Adoption

Wednesday, 17:10–17:30

Jorge Lainfiesta, Rootly

Available Media

Driving platform adoption is hard, especially when feature teams are laser-focused on delivery metrics and SRE teams are overstretched. In this talk, we introduce a hybrid model that blends platform engineering and SRE to tackle both challenges: the Platform SRE.

Platform SREs focus on building self-service reliability features directly into the platform, so new services are production-ready from day one.

But for legacy or business-critical systems, they take a different approach: embedding temporarily into feature teams to lead focused reliability initiatives like observability migrations, alerting, or SLO alignment.

This time-boxed, tactical engagement accelerates platform adoption while reducing the operational burden on central SRE teams.

Attendees will hear real-world insights from implementing this model inside a mid-size engineering org, along with lessons on team design, cultural prerequisites, and common pitfalls to avoid.

Ideal for platform engineers, SREs, and engineering leaders driving operational excellence at scale.

Jorge is a Reliability Advocate at Rootly and the author of the Linux Foundation Introduction to Backstage (LFS142) course. He has a background in software engineering (ex-PayPal) and digital communication (UCLA). He's also a certified sommelier (CETT Barcelona).

Connect:

Discussion Track

Liffey Hall 1

Your Observability Is Expensive (and So Are Your Feelings)

Wednesday, 15:55–17:30

Alex Hidalgo, Nobl9

The last decade has seen an explosion of new technologies in the observability and monitoring space. Today we have more options than ever before in terms of producing, analyzing, and storing telemetry about how our services are operating and performing. However, this has also led to an ever-increasing drain on our wallets. More telemetry, more tools, more vendors… it all ends up costing us more money; however, there are plenty of ways to address these concerns. Join this session to discuss ways to bring things under control. This will be a led discussion, but everyone should be ready to share their own experiences, ideas, and proposed solutions.

Alex Hidalgo is the Field CTO at Nobl9 and author of Implementing Service Level Objectives. During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching Premier League football. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

17:40–18:40

Lightning Talks

The Liffey A

Lightning Talks are four-minute talks by different speakers addressing a variety of SRE-relevant topics. The lightning talks session will conclude with Slide Karaoke, a chance for any attendee to show off their improv skills by presenting a slide deck that they have never seen before.

"Thundering Herd Testing: A CI-friendly Micro Loadtest Technique" by Laura Nolan
"Parquet: The Pillar of Modern Telemetry Data Analysis" by Nitish Tiwari
"AI is like Chocolate" by Liz Fong-Jones
"The Executive Signal: How to Cut Through the Noise and Drive Change" by Eric D. Schabell, Chronosphere
"Mama Look, I Compiled SLOs!" by Rob Durst
"A Day in the Life of an SRE... With Some Help from AI" by Marcel Birkner, Dash0 Inc
"Computers Were a Mistake, We Should Burn it All Down and Live in Yurts" by Laura de Vesine, Datadog

Thursday, 9 October

08:00–09:00

Morning Coffee and Tea

The Forum

09:00–10:35

Track 1

InFocus: Reliability in Finance

The Liffey A

Lessons from an Asset Manager’s First Embedded SRE

Thursday, 09:00–09:45

Callum Donald, BlackRock

Available Media

Our decentralised operational model effectively supported our system for years, but we recognised it might not scale with increasing complexities and reliability expectations. To explore a solution, we initiated a deliberate trial embedding an SRE within a key trading systems engineering team. Working directly alongside the engineers enabled us to drive operational accountability, align closely with organisational OKRs, and build the trust necessary for meaningful reliability improvements from within.

We navigated unique constraints: traditional tools like error budgets and gradual SLO-based alerting were incompatible with finance's zero tolerance for delays. Instead, we reshaped alerting around practical telemetry and revitalised incident retrospectives, enhancing their effectiveness and drawing out actionable insights. Over 12 months, this approach rebuilt trust, significantly reduced incidents and required no additional headcount.

This session shares practical insights for embedding SRE in your organisation, demonstrating how to adapt standard practices, deliver immediate value, and foster a sustainable reliability culture.

Callum is a Senior Site Reliability Engineer at BlackRock. He was the first embedded SRE within the company's Aladdin platform, where he helped pioneer an approach that has since been scaled across dozens of product and platform teams. His work focuses on building reliability practices that are practical, sustainable and tailored to the realities of financial systems. Callum is a first-time speaker at SREcon.

Practical DORA (Digital Operational Resilience Act) for SREs

Thursday, 09:50–10:35

Laura Woo, DORA.report

Available Media

Practical DORA for SREs will dive into the Digital Operational Resilience Act, the EU regulation effective from January 2025. This session is tailored to SREs working in the finance sector and those working in companies that are critical third parties to the finance sector.

We'll demystify DORA by covering:

An overview of the DORA regulation
The role of SREs in helping organisations navigate DORA
Key focus areas of DORA relevant to SREs (Incident Management and Operational Resilience)

You will leave with a basic understanding of DORA and practical implementation advice to ensure your incident management processes are DORA compliant.

Laura has a background in Platform Engineering and has worked across various industries from FinTech to Renewable Energy. Building on her experience working on operational resilience and incident management and a desire to reduce the complexities of regulatory reporting, she founded DORA.report - a platform that simplifies reporting for the Digital Operational Resilience Act.

Connect:

Track 2

The Liffey B

CPU Utilization: The Hidden Cost of Running Hot

Thursday, 09:00–09:45

Andreas Strikos, GitHub

Available Media

As we are in an AI era where demand for computation increases and traffic/load increases rapidly, performance and efficiency are more important than ever. This talk presents a systematic exploration of how CPU utilization impacts system performance, grounded in a controlled experiment. We designed and executed a series of tests on a production-grade workload to directly measure the effect of CPU utilization on latency and throughput. The talk culminates in the derivation and validation of a practical formula ("the golden formula") that connects CPU utilization, request latency, and system capacity. Attendees will leave with concrete, actionable insights for capacity planning and CPU-bound workload optimization, based on real data and reproducible methods.

This presentation is based on a previously published blog post: https://github.blog/engineering/architecture-optimization/breaking-down-cpu-speed-how-utilization-impacts-performance/

Andreas Strikos is a seasoned software and platform engineer with almost 20 years of diverse experience. His professional journey has taken him from the heart of non-profit organizations to the dynamic landscapes of large corporations, where he has seamlessly navigated through various engineering roles. Currently he is part of the Performance Engineering team at GitHub.

Lessons from a Year with Backstage: What Worked, What Didn’t, and What’s Next

Thursday, 09:50–10:35

Nick McKenzie, Mauritius Commercial Bank

Available Media

What happens when you adopt Backstage without planning to? In this talk, we’ll share what we learned from rolling out Backstage across 40 teams and 350+ developers in a highly regulated environment. What started as a simple effort to improve service discoverability quickly grew into a broader platform initiative — one that now supports engineering health checks, skills tracking, and cross-functional alignment across Architecture, QA, and SRE.

We’ll share the good, the hard, and the unexpected. You’ll walk away with practical, hard-earned advice on adoption, driving engagement, avoiding common traps, and keeping things simple.

Whether you're just starting your Backstage journey or trying to scale it, this talk will give you the real story — and the tools to make it work in your context.

Nick McKenzie is a technologist with 20 years’ experience in designing and implementing software solutions. Over the course of his career he has worked as a developer, architect, analyst and development manager delivering systems in myriad industries and platforms.

He is passionate about working with teams to deliver great software and building the right culture to let that happen. He has been fortunate to have worked with and learned from some of the best thinkers and practitioners in Agile & DevOps, both in South Africa and internationally. His view is that approaches such as Clean Code, Test Driven Development and Continuous Delivery are fundamental to delivering great software.

Track 3

Liffey Hall 2

Quality Gates in Production: How We Turn OpenTelemetry Signals into Deployment Decisions

Thursday, 09:00–09:20

Marcel Birkner, Dash0

Available Media

Engineering teams constantly balance deployment velocity with system reliability. This challenge is particularly relevant for critical infrastructure like monitoring platforms that need to be more reliable than the systems they observe.

This talk demonstrates how to build automated quality gates that make deployment decisions based on production monitoring data. I'll show a practical implementation using open-source tools including GitHub Actions, ArgoCD, Testcontainers, Playwright, and OpenTelemetry that validates deployments before they reach users.

You'll see the actual pipeline in action, including:

How we correlate deployment events with error rates, latency, and business metrics using OpenTelemetry traces, logs, and metrics
Quality gate criteria that catch regressions across data pipelines and application services
Open-source tooling integration that teams can adapt to their environments

This approach is particularly useful for data-heavy and AI workloads where traditional health checks provide limited insight. Using MLflow for experiment tracking and model management, we've also implemented quality gates that validate model accuracy, detect drift, and verify inference performance before promoting AI services to production.

Key takeaways include practical quality gate patterns you can implement with existing tools, the specific metrics that indicate deployment success, and lessons learned from operating this system in production. Whether you're deploying traditional applications or AI workloads, you'll gain concrete strategies to improve deployment confidence while maintaining development velocity.
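As a rough illustration of the kind of quality gate described above, the following minimal sketch compares error rate and latency before and after a deployment. The `WindowStats` shape, the `gate_passes` function, and both thresholds are hypothetical and chosen for illustration only; they are not taken from the talk or from any of the tools it names.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated signals for one observation window (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th-percentile latency in milliseconds


def gate_passes(before: WindowStats, after: WindowStats,
                max_error_increase: float = 0.005,
                max_latency_ratio: float = 1.2) -> bool:
    """Return True if the post-deploy window shows no regression.

    The default thresholds are illustrative assumptions.
    """
    # Absolute increase in error rate must stay within the allowed budget.
    error_ok = after.error_rate - before.error_rate <= max_error_increase
    # p95 latency may grow by at most the allowed ratio.
    latency_ok = after.p95_latency_ms <= before.p95_latency_ms * max_latency_ratio
    return error_ok and latency_ok
```

In a pipeline, the two windows would be filled from telemetry queries around the deployment event, and a `False` result would block promotion or trigger a rollback.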

Marcel Birkner is a Founding Engineer and Head of Platform at Dash0, where he architects and operates the infrastructure that powers Dash0. With extensive experience as a Site Reliability Engineer at ClickHouse and Instana (IBM), Marcel specializes in building secure and scalable cloud infrastructure.

Maximizing Utilization for LLM Accelerators

Thursday, 09:25–09:45

John Lunney, Google

Available Media

Accelerators for serving LLMs are a very scarce resource, both globally and inside your organization. You must show you're making good use of the resources, otherwise someone else will.

John Lunney is a Senior Staff Reliability Engineer at Google. He is the technical lead for Workspace AI SRE, running a platform for LLM-powered features. He holds a degree in Computational Linguistics from Trinity College in Dublin, Ireland. Before Google, he worked on several lexicography projects for the Irish language.

Finding the Needle in a Haystack: Tenant-Level Network Impact Analysis in Global-Scale Infrastructure

Thursday, 09:50–10:35

Giscard Fernandes Faria

Available Media

A single switch failure in a cloud provider's network can impact millions of users. So why is it still so difficult to trace the network path of a single customer? While large-scale network monitoring is a solved problem, observing individual tenant flows is not. This talk dives into the complexities of the cloud backbone and introduces a powerful tool for end-to-end flow tracking. You'll learn how this approach helps us rapidly troubleshoot issues and resolve failures that are nearly invisible to traditional, large-scale monitoring systems.

With over 20 years of experience as a Software Engineer, Giscard has a diverse background in building and scaling critical systems. His accomplishments include leading a major cloud migration for one of Latin America's largest ERP providers, implementing VoIP protocols for soft-switches, and maintaining and evolving the primary network monitoring system for AWS. In his current role at Huawei, he focuses on enhancing infrastructure observability and reliability for the Cloud team.

Discussion Track

Liffey Hall 1

SRE 101

Thursday, 09:00–10:35

Kurt Andersen, Clari, and Laura Nolan

Are you confused by the alphabet soup: PRRs, ICS, MTTR, o11y, SLOs, SLIs? Is SRE the same thing as DevOps? Will doing everything the SRE books say lead to success and a good night's sleep oncall?

By day, Kurt works as an infrastructure software architect at Clari. In addition, he serves on the USENIX Board and has had the pleasure to work with amazing people around the globe in the SREcon conferences. He also helps with the annual SRE survey and report that is graciously supported by Catchpoint.

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Relevant to this session, she also holds an MA in Ethics from Dublin City University. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

10:35–11:00

Coffee and Tea Break

The Forum

11:00–12:35

Track 1

InFocus: Reliability in Finance

The Liffey A

Failover Without Fear: A Practical Guide to Active-Active Multi-Region Readiness

Thursday, 11:00–11:45

Sina Moghaddas, Mollie B.V.

Available Media

Mollie undertook a major infrastructure initiative to expand from a single-region setup to a multi-region architecture on Google Cloud, aiming to enhance disaster recovery and platform resilience. With rapid growth, high transaction volumes, and a commitment to uptime, the shift was driven by the need for better fault tolerance and business continuity. The project involved significant technical preparation, including updating application code for multi-region awareness, upgrading edge load balancers, re-architecting database failover strategies, and refining disaster recovery runbooks. Execution was carefully planned, following incident response best practices and rigorous testing across environments. The initiative represented a critical step in Mollie’s evolution toward a more resilient, scalable cloud infrastructure—one capable of maintaining performance and availability even in the face of regional disruptions. This case offers valuable insights into the complexities of cloud expansion and the importance of preparation, iteration, and operational discipline when building for resilience at scale.

Sina is an infrastructure engineer with over 15 years of experience, spanning from early work setting up Linux servers—some of which are still in operation—to designing cloud architectures and leading technical projects for fintech companies. In recent years, his focus has been on building resilient systems and ensuring high service reliability at scale. With a strong background in both traditional and modern infrastructure, Sina brings a practical, experience-driven perspective to cloud engineering and platform reliability.

Defending the Global Financial Backbone: Euroclear's Reliability Journey

Thursday, 11:50–12:35

Haytham Elkhoja, Euroclear / Kyndryl, and Erik Van Nooten, Euroclear

Available Media

In the heart of global financial markets, where transactions worth billions flow daily, reliability isn't just a best practice—it's a regulatory mandate and global economic necessity.

As a Central Securities Depository (CSD) handling over €40 trillion in assets, Euroclear faces unique and escalating challenges in reliability engineering. Because Euroclear is one of the backbones of the world’s economy and plays a critical role in the global financial ecosystem, any operational disruption could have cascading effects across international markets, potentially impacting thousands of financial institutions and millions of transactions daily.

Euroclear’s reliability strategy is evolving to a model that must simultaneously address scenarios ranging from natural disasters, technical outages and application failures to coordinated state-sponsored cyber-attacks while ensuring data integrity preservation and rapid recovery capabilities.

This talk will show how Euroclear is building an ‘Always On’ strategy to achieve application-level resilience and regulatory compliance whilst managing a major multi-data center transformation.

We’ll outline how our MVFMI (Minimum Viable Financial Market Infrastructure) model secures top-priority business services, how our Application Reliability Factory addresses reliability, resilience and distributed consistency at the architectural level to support self-healing operations and minimize all kinds of outages for both legacy and modern applications, and how regulatory compliance is seamlessly integrated through core architectural principles that guide our decision-making.

This talk targets reliability architects, engineers, and technology leaders in critical or heavily regulated industries seeking to future-proof critical platforms, and showcases Euroclear's approach to ensuring mission-critical financial applications maintain the utmost levels of availability while navigating a complex data center migration program.

Principal Architect and Director in Kyndryl’s Office of the CTO, Haytham leads the Always On Practice, focusing on transforming and modernizing mission-critical applications. He also heads the Euroclear Application Reliability Factory, where he architects and drives the enhancement of application-level reliability, ensuring technical integration and viability.

Erik is Resilience Program Transformation Leader at Euroclear. He leads Euroclear’s flagship resilience program, overseeing its strategic design, planning, and overall governance. He works closely with regulators to ensure the program’s compliance and executability.

Track 2

The Liffey B

The SRE’s Crystal Ball: Predicting System Performance with Queues and USL

Thursday, 11:00–11:45

Aravindh Sampathkumar, Booking.com

Available Media

Ever grapple with systems exhibiting perplexing slowdowns or hitting unseen capacity ceilings under load? This talk empowers SREs to move beyond reactive troubleshooting with a practical, analytical framework for understanding and predicting system performance.

We'll dive into the fundamentals of Queueing Theory, learning how concepts like arrival rates, service times, and the impactful "hockey stick" utilisation curve can help you precisely diagnose delays and comprehend system behaviour under stress. Then, harness the Universal Scalability Law (USL) to quantitatively predict scalability limits, uncovering critical contention and coherency bottlenecks.

Through a guided example using a sample service, you'll see these powerful theoretical models applied in practice, demonstrating how to collect relevant metrics, interpret performance insights, and drive informed architectural and capacity planning decisions. Move beyond common analysis pitfalls and equip yourself with the toolkit to shift from reactive firefighting to proactive, data-driven performance engineering.
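For readers unfamiliar with the two models named in this abstract, the sketch below shows their standard textbook forms: the mean response time of an M/M/1 queue, which produces the "hockey stick" curve, and the Universal Scalability Law. The function names and the sample coefficients in the usage note are illustrative assumptions, not material from the talk.

```python
def mm1_response_time(service_time: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def usl_relative_capacity(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)).

    alpha models contention (serialisation) and beta models coherency
    (crosstalk) between the N workers.
    """
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


if __name__ == "__main__":
    # Response time climbs steeply as utilization approaches 1.
    for rho in (0.5, 0.8, 0.9, 0.95):
        print(f"rho={rho:.2f}  R={mm1_response_time(0.010, rho) * 1000:.1f} ms")
    # With a non-zero beta, capacity peaks and then declines as N grows.
    for n in (1, 8, 32, 64):
        print(f"N={n:3d}  C(N)={usl_relative_capacity(n, 0.05, 0.001):.2f}")
```

With the assumed coefficients (alpha = 0.05, beta = 0.001), relative capacity at N = 64 is lower than at N = 32, which is the retrograde scaling the USL is used to predict.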

I am currently a Site Reliability Engineer at Booking.com. My 17-year career has spanned diverse environments: operating mainframes, building HPC clusters, optimising enterprise storage for peak throughput and latency, and of course skilfully engineering YAML files for Kubernetes. This deep understanding of how systems behave under load, from the bare metal to Kubernetes, underpins my passion for applying analytical models to predict and proactively manage performance.

Profiling Your Code: 5 Tips to Significantly Boost Performance

Thursday, 11:50–12:10

Sonam Gupta and Kunal Nawale, SigLens

Available Media

Optimizing performance isn’t about guessing — it’s about measuring. Yet, many teams skip systematic profiling and jump straight into rewriting code or scaling infrastructure. The result? Wasted time, minimal gains, and recurring issues. In this talk, we’ll break down five high-impact tips to make profiling a practical, results-driven part of your performance workflow. You’ll learn:

How to choose the right profiler for your language and environment.
Where performance bottlenecks really hide (hint: it’s not always your code).
What to look for in flame graphs, CPU/memory profiles, and I/O traces.
How to set up lightweight, continuous profiling without slowing down production.
Ways to prioritize performance fixes that actually improve user experience.

Whether you’re debugging a slow backend or shaving milliseconds off a critical path, these techniques will help you go from vague complaints to concrete wins — backed by data, not guesswork.
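As a small illustration of "measuring rather than guessing", the sketch below profiles a function with Python's standard-library `cProfile` and `pstats` modules. The `slow_sum` workload and the `profile_call` helper are hypothetical names introduced here; the talk itself may use different tools and languages.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive workload (hypothetical) for the profiler to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return a report of the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_call(slow_sum, 200_000))
```

Sorting by cumulative time is one reasonable default because it surfaces the call paths where the most wall-clock time is spent; a deterministic profiler like this adds noticeable overhead, so it suits local investigation rather than continuous production profiling.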

Sonam Gupta is a Founding Software Engineer at SigLens, where she focuses on building high-performance observability solutions that are both fast and cost-effective. With a passion for distributed systems and developer experience, she is dedicated to making observability accessible to all.

Kunal Nawale is Founder/CEO of SigScalr. Previously he worked as an Observability Architect at Salesforce, responsible for providing monitoring solutions to 10,000+ Salesforce engineers. Kunal has spent 20+ years in software development and has been mentoring young engineers for the past 12+ years. He earned his Bachelor's from the University of Pune and his M.S. from UMass Lowell, USA.

Performance Consistency in the Cloud

Thursday, 12:15–12:35

Nati Cohen and Sercan Karaoğlu, AWS

Available Media

Cloud environments offer incredible flexibility, but sometimes workloads experience unexpected performance fluctuations. While many applications can handle these variations, tasks such as high-frequency trading, online gaming, and performance regression testing require consistent performance. This session explores the root causes of performance variation in the cloud and provides actionable strategies to minimize unpredictability in processing, networking, and storage. We’ll also discuss the trade-offs and costs involved, helping you design and optimize cloud-based systems with confidence.

Nati is a Solutions Architect with AWS. He delights in helping customers simplify complex systems, teaching them about the inner workings of cloud services and debugging annoying technical oddities. When he is not at his computer he is soldering electronic kits, tinkering with smaller computers and drumming on a Taiko.

Sercan Karaoğlu is a Principal Solutions Architect at Amazon Web Services, specializing in designing and implementing cutting-edge solutions for High Frequency Trading Companies. With extensive experience in data-intensive solutions, big data, low-latency, high-throughput distributed systems, and machine learning, Sercan brings unique insights to performance optimization in cloud environments.

Track 3

Liffey Hall 2

STPA for Software Systems–Illuminate the Unknown Unknowns

Thursday, 11:00–12:35

Theo Klein, Garrett Holthaus, and Ruben Barroso, Google

Available Media

SREs know about some of the flaws and vulnerabilities in their systems. They might also have intuition on where to look for additional issues–"known unknowns." But what about the "unknown unknowns"–outages waiting to happen that nobody is even looking for? With the vast complexity of modern software systems, this dark space of unknowns can be huge. And, what's worse, most of the outages in this space happen due to complex interactions between various parts of the system, even when everything is working according to specification, i.e. no implementation bugs.

What if we had a way to shine a light into the unknown unknowns? What if we could understand our systems enough to be able to methodically explore these complex interactions and build a comprehensive list of possible outage scenarios? In STPA, we model systems based on control-feedback loops, creating a hierarchical control structure, or HCS. In this session, we'll use a real Google system to show how an HCS can help you gain a new perspective and understanding of your system. We'll note similarities to common patterns in software design so you can start thinking about similar vulnerabilities in your own systems.

This session will be an interactive workshop. Attendees should plan to actively participate in the small group exercises in order to get the most benefit from the session.

Theo Klein is a Staff Site Reliability Engineer working on Google Maps. Over the past two years, he has led an effort to improve the safety and reliability of road disruptions data on Google Maps. Previously, he led efforts to remove unneeded dependencies on critical systems, which de-risked Google's many serving layers from global outages.

His primary interests are in systems thinking, dependency management and horizontal analyses of large-scale systems.

Garrett Holthaus is a technical writer for Site Reliability Engineering at Google. He has a background in electrical and computer engineering, as well as experience teaching and designing science and technology curricula. In addition to writing and maintaining SRE documentation, Garrett develops and gives training in System-Theoretic Process Analysis (STPA) at Google.

Ruben Barroso is a Staff Site Reliability Engineer (SRE) at Google. For five years, he has applied advanced systems safety engineering methods, including Systems-Theoretic Process Analysis (STPA) and Causal Analysis based on Systems Theory (CAST), to rigorously analyze and secure dozens of critical internal software systems at Google.

Discussion Track

Liffey Hall 1

Between Roles: Navigating the Pause in a Fast Moving Industry

Thursday, 11:00–12:35

Effie Mouzeli, Wikimedia Foundation, and Lerna Ekmekcioglu, Clockwork.io

Effie spent several years in small organisations. Currently an SRE at the Wikimedia Foundation, she is counting Wikipedia’s rabbit holes so you don’t have to. [citation needed] She’s co-chaired SREcon23 and SREcon24 EMEA, and has been a long-time contributor. Her limited written work includes a thesis no one read, a defunct Twitter account, and a couple of articles she sneaked into 97 Things Every SRE Should Know.

Lerna is a Sr. Solutions Engineer at Clockwork Systems where she helps customers meet performance and reliability goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Sr. Solutions Architect at AWS for 3 years. Before that, Lerna spent 17 years as an infrastructure engineer in large financial services companies, focused on problems of scale such as centralized authentication systems, distributed caching, and multi-region cloud-native deployments. In her spare time, she enjoys hiking, sightseeing and backyard astronomy.

12:35–13:50

Luncheon

The Forum

13:50–14:35

Track 1

InFocus: Reliability in Finance

The Liffey A

Reliability in Finance: Delivering Quality Market Data at Scale

Thursday, 13:50–14:35

Alan Campbell, Bloomberg

Available Media

In the world of financial systems, data is everything, but not all data is created equal and not all systems are built to handle its volatility. At Bloomberg Engineering, managing market data is a uniquely complex challenge. This plenary will dive into how, as Reliability Engineers, we build, monitor and evolve our systems to ensure high quality data delivery amidst the chaos of global markets.

We’ll ask the uncomfortable question: what if we didn’t catch it? Through real-world scenarios, we’ll highlight the cost of bad data: from regulatory risk to wasted engineering hours and eroded customer trust.

From daily issues like file delays, malformed data, and repeated deliveries to the abnormal market occurrences that cause excess load and require bespoke handling, there are many tools and techniques familiar to Reliability Engineers that can contribute to the delivery of quality data.

Join us to discover how Reliability Engineers at Bloomberg handle these challenges.

Alan Campbell is a Reliability Engineering Team Lead with over 12 years of experience. He began his career at Bloomberg as a full-stack engineer, developing in Java, Python and JavaScript, before transitioning into the SRE space. In his current role, he focuses on defining and maintaining SLOs, implementing anomaly detection, scaling systems, and conducting performance testing. Alan also configures, deploys and maintains monitoring and observability tools such as Prometheus, OpenTelemetry, and Grafana to ensure Bloomberg’s systems meet the highest standards of reliability and availability.

Track 2

InFocus: Platform Engineering

The Liffey B

From Toil to Empowerment: Building Self-Service Ingress with GitOps

Thursday, 13:50–14:10

Jeswin Koshy Ninan, Workday, Inc.

Available Media

This talk provides a technical overview of building a declarative, GitOps-driven platform for self-service public ingress at scale. We'll show how we transformed manual, error-prone ingress management (load balancers, TLS, DNS) into an automated, streamlined process. Discover the operational challenges of managing ingress across thousands of Kubernetes nodes, and how our solution, powered by custom Kubernetes operators and tools like cert-manager and external-dns, empowers both platform engineers and service teams with zero-touch ingress provisioning.

Jeswin Koshy Ninan is an experienced software engineer specializing in connectivity, traffic management, Kubernetes, and service mesh technologies.

Systematising Production Operations: Reddit Control Tower

Thursday, 14:15–14:35

Kelly Dodd, Reddit

Available Media

Many environments start out being lovingly maintained by hand, and then grow more complex maintenance functions over time. We present Reddit's approach to taming this complexity - a framework which allows teams to incrementally turn complex sequences of privileged manual steps into well-defined, high-level API calls. Safety, testability, repeatability and accuracy are all improved.

Kelly Dodd is a Senior SRE at Reddit with a multidisciplinary SRE background thanks to an 8-year career spent mostly in the startup world. Most of her time at Reddit, one of the world’s most visited websites, has been focused on developing controls for edge traffic management. Today she is working with Reddit’s Risk team to build a platform for governing high-risk production operations.

When she isn’t staring at a screen she loves to sew (badly) and hang out with her three cats.

Kelly is thrilled to be speaking at SREcon for the first time.

Track 3

InFocus: Data and AI Reliability

Liffey Hall 2

Cross-Platform Data Lineage with OpenLineage, A Foundational Layer for Data Reliability

Thursday, 13:50–14:35

Maciej Obuchowski, Datadog

Available Media

As Data & AI systems become foundational to modern software, the pipelines that power them deserve the same engineering rigor as production services. Yet many software engineers still treat data pipelines as someone else’s responsibility. We’ll discuss why end-to-end lineage is essential for observability, debugging, and trust in Data & AI workflows, how it can allow software engineers to maintain the health of data pipelines, and how this aligns with principles familiar to the SRE community. The talk will introduce the OpenLineage standard (a project under LF AI & Data), explain how it compares with and complements OpenTelemetry, and present current integrations with Airflow, Spark, dbt, Flink, and more. Presented by a Technical Steering Committee (TSC) member of OpenLineage, this session is both a practical introduction and a call to action: to treat Data & AI pipelines as first-class citizens in reliable, scalable systems.

Maciej is a Senior Software Engineer, OpenLineage TSC member and Apache Airflow committer, currently working on data observability at Datadog. In his free time, he likes petting his cat, rock climbing and contributing to Open Source projects.

Discussion Track

Liffey Hall 1

Hardware & Datacenter Reliability

Thursday, 13:30–14:30

Panos Christeas and John Looney, Crusoe.ai

The hardware space has not traditionally had a lot of attention from SRE, but that's changing. SREs in the datacenter automation space are working with hardware and firmware teams at their vendors to improve that layer, and make it easier to run for the many years after hardware leaves the factory.

Come and ask questions of John Looney, Panos Christeas, and other SREs with decades of experience working with bare metal and new product introductions, and hear what the old hyperscalers can teach the new clouds that are springing up.

John Looney has been a full stack SRE for 20 years, working at every layer from hardware design to 100 million RPC/s revenue booking services. The last year has been spent building the fastest AI training clusters possible, and learning they are very different to typical datacenters.

14:35–15:10

Coffee and Tea Break

The Forum

15:10–16:40

Closing Plenary Session

InFocus: Reliability in Finance

The Liffey

Zero-Regression Network CI/CD for Finance-Grade Reliability

Thursday, 15:10–15:55

David Ferrandez and Damian Krogul, TransFICC Ltd

Available Media

Learn how we eliminated post-deployment network issues in a live trading environment by building a CI/CD pipeline for network config. We use a full virtual replica of our network to test every change against real traffic patterns and simulated failures, before it ever reaches production. The result: faster, safer, zero-touch deployments with no out-of-hours surprises. This talk shares our tools, process, and lessons learned building finance-grade network reliability without bureaucracy or fear.

David Ferrandez: Ex-F5 Networks, Ex-Facebook, Ex-VMware. David has always been involved in networking to some degree, but from different angles: Systems, Support, Development, Automation, and Infosec. David likes working on the big-picture, global network architecture and new PoPs/DCs, but also getting into the detail of anything going over the wire (protocol, formats, payloads).

Damian Krogul is a seasoned Site Reliability Engineer with deep expertise in networking, automation, and large-scale infrastructure operations. He currently focuses on designing and maintaining resilient, high-performance systems across hybrid environments, blending bare metal, cloud, and automation tooling like Ansible and Terraform.

What Do We Do Now, Now That We’re Happy?

Thursday, 15:55–16:40

Niall Murphy, Stanza Systems

Available Media

Niall Richard Murphy has worked in computing infrastructure since the mid-1990s, and has been employed by every major cloud provider (specifically Amazon, Google, and Microsoft) from their Dublin, Ireland offices in a variety of roles from IC to Director. He is currently CEO/Co-founder of Stanza Systems, a small startup in the ML/AI/reliability space. He is the instigator, co-author, and editor of multiple award-winning books on networking, reliability, and machine learning, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.
