SREcon24 Americas Conference Program

Monday, March 18

7:30 am–8:45 am

Continental Breakfast

Grand Foyer

8:45 am–9:00 am

Opening Remarks

Grand Ballroom
Program Co-Chairs: Sarah Butt, SentinelOne, and Dan Fainstein, The D. E. Shaw Group

9:00 am–10:30 am

Opening Plenary Session

Grand Ballroom

20 Years of SRE: Highs and Lows

Monday, 9:00 am–9:25 am

Niall Murphy, Stanza Systems

SREcon is 10 years old this year. The precise starting point for SRE as a profession is a bit harder to pin down, but it's certainly considerably more than ten years ago. In this session, Niall Murphy will tell the story of SRE as he has seen it evolve, from its beginnings within Google and other early adopter organisations to its subsequent spread throughout the tech industry and beyond.

Niall Murphy, Stanza Systems

Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.

Scam or Savings? A Cloud vs. On-Prem Economic Slapfight

Monday, 9:25 am–9:50 am

Corey Quinn, The Duckbill Group

Since not giving a single crap about money turned out to be a purely zero-interest-rate phenomenon, the "cloud is overpriced vs. no it is not" argument has reignited—usually championed on either side by somebody with something to sell you. The speaker has nothing to sell you, and many truths to tell. It's time to find out where the truth is hiding.

Corey Quinn, The Duckbill Group

Corey is the Chief Cloud Economist at The Duckbill Group, where he specializes in helping companies improve their AWS bills by making them smaller and less horrifying. He also hosts the "Screaming in the Cloud" and "AWS Morning Brief" podcasts and curates "Last Week in AWS," a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark and thoughtful analysis in roughly equal measure.

Is It Already Time To Version Observability? (Signs Point To Yes.)

Monday, 9:50 am–10:30 am

Charity Majors, honeycomb.io

Pillars, cardinality, metrics, dashboards ... the definition of observability has been debated to death, and I'm done with it. Let's just say that observability is a property of complex systems, just like reliability or performance. This definition feels both useful and true, and I am 100% behind it.

However, there has recently been a generational sea change in data types, usability, workflows, and cost models, along with what users report is a massive, discontinuous leap in value. In the parlance of semantic versioning, it is a breaking, backwards-incompatible change. Which means it’s time to bump the major version number. Observability 1.0, meet Observability 2.0.

In this presentation, we will outline the technical and sociotechnical characteristics of each generation of tooling and describe concrete steps you can take to advance or improve. These changes are being driven by the relentless increase in complexity of our systems, and none of us can afford to ignore them.

Charity Majors, Honeycomb.io

Charity is the cofounder and CTO of honeycomb.io, the O.G. observability company, and the coauthor of O'Reilly books "Database Reliability Engineering" and "Observability Engineering". She writes about tech, leadership and other random stuff at https://charity.wtf.

10:30 am–11:00 am

Break with Refreshments

Pacific Concourse

11:00 am–12:35 pm

Track 1

Grand Ballroom A

Capacity Constraints Unveiled: Navigating Cloud Scaling Realities

Monday, 11:00 am–11:45 am

Kevin Sonney and Marc-Andre Dufresne, Elastic

Do you believe cloud capacity is limitless? Have you ever run into an out-of-capacity error from your cloud service provider? How do you satisfy the capacity demands of your application? Can you imagine a world where you can't scale up your auto-scaling group anymore? Have you felt desperate and powerless when it happens?

You are not alone.

We believe that capacity planning should become a more prominent skill set of SREs. This talk explores what challenges we have faced growing capacity for Elastic Cloud, how we overcame those issues, and what we will not repeat in future products.

Kevin Sonney, Elastic

Kevin Sonney is a technology professional, media producer, and podcaster. An SRE and Open Source advocate, Kevin has over 30 years in the IT industry, with over 20 years in Open Source. Kevin and his wife, award-winning author and illustrator Ursula Vernon (aka T. Kingfisher), co-host the Productivity Alchemy podcast and routinely attend sci-fi and comic conventions. Kevin also voiced Rev. Mord on The Hidden Almanac, wrote articles for opensource.com, and keeps (many) chickens.

Marc-Andre Dufresne, Elastic

Marc is an SRE, dad and mountain biker. He has over 15 years of experience in the industry in multiple roles, from infrastructure consulting to leading developer teams. He has a passion for optimisation and automation. He currently helps Elastic manage capacity and optimize resource usage for Elastic Cloud.

Sharding: Growing Systems from Node-scale to Planet-scale

Monday, 11:50 am–12:35 pm

Adam Mckaig, Stripe

Sharding is an important part of scaling systems. Most start life without it, rightly preferring the simplicity of the monolith on the single-node database. But sooner or later, most systems need to be split up: the database is too big, the workload is too diverse, the risk (and consequences) of total outages is too dire. But when, what, and how? Sharding can increase cost, latency, and — most perniciously — complexity, so the trade-offs must be considered carefully.

This talk aims to provide SREs with a map to embark on this journey, showing the problems commonly encountered as systems grow in various dimensions, and the sharding patterns which can address each. We then present a highly opinionated Golden Path outlining how these patterns can be combined into the default route from node-scale to planet-scale, along with some traps and anti-patterns to avoid.
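
The trade-offs above can be made concrete with the simplest pattern on that journey: stable hash-based routing of keys to shards. This is an illustrative sketch only (the function and key names are invented, not taken from the talk):

```python
import hashlib

def shard_for(key, num_shards):
    """Route a key to a shard with a stable hash, so every caller
    agrees on placement without a central lookup."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Modulo routing is simple, but changing num_shards remaps almost
# every key -- one reason directory- or range-based sharding patterns
# tend to appear later as systems grow.
shard = shard_for("merchant-42", 16)
```

The resharding cost of the modulo scheme is exactly the kind of trap the talk's "Golden Path" framing weighs against the simplicity it buys.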

Adam Mckaig, Stripe

Adam is a staff engineer at Stripe, where he works on a petabyte-scale document database targeting five nines of availability. Previously he worked on datastores at Datadog and Google, and on other backend systems at the NYT, Bloomberg, and the UN.

Track 2

Grand Ballroom BC

Product Reliability for Google Maps

Monday, 11:00 am–11:45 am

Micah Lerner and Joe Abrams, Google

As our organization has gotten very good at protecting server SLOs with reliability best practices (globally distributed at-scale architectures, toil mitigation, and continuous reliability improvements), we noticed that a majority of incidents impacting our end users were not showing up as SLO misses.

In many cases these outages were not even observable from the server side. For example, the rollout of a new version of the consumer mobile application (that our services power) to an app store could break one or more critical features due to bugs in client code. This reality has led to a change in the way we approach reliability: we're shifting our focus from server reliability to product reliability.

We’re not yet finished with the transition, but we’re starting to see very positive results. Our talk shares challenges we've solved so far, lessons we've learned, and our vision for the future.

Micah Lerner, Google

Micah Lerner is a tech lead at Google, focused on consumer Geo products. Previously, Micah helped build the Geospatial datasets powering Mapbox and was an early employee at Strava (where he first read Google's book on SRE).

Joe Abrams, Google

Joe leads site reliability engineering for Google Maps products. He and his team are constantly looking for new ways to protect users from potential production issues. As a self-professed outage nerd, he enjoys hearing about interesting failure tales from inside and outside of Google. When he is not poring over last month's postmortem reports, you can find him on a tennis court trying to make his serve more fault-tolerant.

Using Generative AI Patterns for Better Observability

Monday, 11:50 am–12:35 pm

John Feminella, Nuvalence

Observability has always been a cornerstone for understanding and maintaining complex digital systems. Our familiar friends of metrics, logs, and traces have been enriched over the years with more powerful tools for greater understanding. Now, a new emerging superpower for practitioners has been added to that toolbox: generative AI.

In this talk, we'll cover several new generative AI patterns of interest that we've found to be of particular relevance to observability practitioners in production settings. We'll show how even relatively basic generative AI scaffolding can be extraordinarily helpful for practitioners, how to leverage this in your day-to-day work, and how to get started right now. By the end of the talk, we hope to leave you with a mix of practical advice and solid theoretical grounding.

John Feminella, Nuvalence

John Feminella is a technology leader, occasional public speaker, and curiosity advocate. At Nuvalence, he helps enterprises transform the way they write, operate, and deploy software and organize software organizations so that businesses and people alike can be more effective. Before that, he led technology organizations at Pivotal, VMware, and ThoughtWorks; helped build the world's largest database of pollutants and air emissions; authored a distributed-sensor technology; and helped design FedNow, an instant-payments network.

12:35 pm–1:50 pm

Luncheon

The Atrium
Sponsored by Cortex

1:50 pm–3:25 pm

Track 1

Grand Ballroom A

Build vs. Buy in the Midst of Armageddon

Monday, 1:50 pm–2:35 pm

Reggie Davis

What happens when you've got a team of SREs who get downsized and smacked with an increasing workload? This is the story of my small team's wild ride of revamping our in-house incident tool.

Spoiler: It turned into a "make or buy" debate real quick!

Picture this: a team of 10 becomes a merry band of 3, drowning in a sea of issues, requests, and feedback. Unexpectedly, our team changed like a game of musical chairs, skills shuffling all over. In our pursuit to be the heroes, we hit the brakes and pitched buying a ready-made tool for a change. It wasn't just a business decision; it was a cultural rollercoaster for everyone involved! So, let's take a stroll down memory lane, exploring how we switched gears to evaluate third-party tools and pave the way for improving the reliability of our platform.

Reggie Davis, Elastic

Reggie's a laid-back, jovial, and curious Senior SRE and technical lead for the Platform Core SRE team at Elastic. Focusing on service management, incident management, and operational excellence in cloud-native environments, Reggie enjoys working with leaders across the platform to develop processes that help service teams "shift left" on their reliability efforts. Outside of work, Reggie's an avid yogi, classic hip-hop vinyl collector, and frequenter of coffee shops in whatever city he finds himself in.

Compliance & Regulatory Standards Are NOT Incompatible with Modern Development Best Practices

Monday, 2:40 pm–3:25 pm

Charity Majors, honeycomb.io

Modern software development is all about fast feedback loops, with best practices like testing in production, continuous delivery, observability driven development, and feature flags. Yet I often hear people complaining that only startups can get away with doing these things; real, grown-up companies are subject to regulatory oversight, which prevents engineers from deploying their own code due to separation of concerns, requires managers to sign off on changes, etc.

This is categorically false: there is NOTHING in ANY regulation or standard to prevent you from using modern development best practices. Let's take a stroll through the regulatory landscape and do some mythbusting about what they do and don't say, and what this means for you. Teams that figure out how to follow modern best practices can build circles around teams that don't, which is a huge competitive advantage. Your competition is working on this right now: you should be too.

Charity Majors, Honeycomb.io

Charity is the cofounder and CTO of honeycomb.io, the O.G. observability company, and the coauthor of O'Reilly books "Database Reliability Engineering" and "Observability Engineering". She writes about tech, leadership and other random stuff at https://charity.wtf.

Track 2

Grand Ballroom BC

The Ticking Time Bomb of Observability Expectations

Monday, 1:50 pm–2:35 pm

David Caudill, Capital One

Explore the fundamental problems with the popular "monitor everything" maxim and with allowing vendors to control our discourse about monitoring. You'll leave with some fundamental principles to guide a cost-conscious approach to observability.

David Caudill, Capital One

David Caudill is a Site Reliability Engineer and recovering Engineering Manager. He has wide ranging experience in distributed datastores, healthcare, adtech, biotech, education, ophthalmology, logistics, and most recently, fintech. In his spare time he makes strange sounds with synthesizers and is an avid collector of folk and ethnic art.

Synthesizing Sanity with, and in Spite of, Synthetic Monitoring

Monday, 2:40 pm–3:25 pm

Daniel O'Dea, Atlassian

Synthetic monitoring, particularly browser-based monitoring, is hard to do well. When tests pass, synthetic monitoring provides a uniquely intuitive kind of psychological safety - human-like, verified confidence, compared to other forms of monitoring. When tests fail, synthetic monitoring is often blamed as flaky, misconfigured, or unreliable. If not properly implemented, it can not only be financially, mentally and organisationally draining, but damaging to real customer experience.

This talk is a conceptual and technical story of 4 years working with Atlassian's in-house synthetic monitoring solution, as the owning developer of a tool actively used by 30–40 internal teams to build and manage synthetic monitoring for Jira. How can we make synthetic monitoring better serve its purpose of providing useful signal?

Daniel O'Dea, Atlassian

Daniel O’Dea is part of the Jira Site Reliability Engineering team at Atlassian, where he leads database improvements, drives incident resolution, and builds tools used by many teams. Daniel is also a classical pianist, composer, and artist. He previously spoke at SREcon22 in APAC, about high-cardinality monitoring (and AI-generated ice cream).

3:25 pm–3:55 pm

Break with Refreshments

Pacific Concourse

3:55 pm–5:30 pm

Track 1

Grand Ballroom A

Migrating a Large Scale Search Dataset in Production in a Highly Available Manner

Monday, 3:55 pm–4:15 pm

Leila Vayghan, Shopify

Shopify is an ecommerce platform supporting over 3 million global merchants which uses Google Kubernetes Engine to run Elasticsearch on Google Cloud Platform. The COVID-19 pandemic led to an increase in global clients, causing latency issues and GDPR compliance challenges. To address this, the search infrastructure team at Shopify migrated European merchants’ search data to European regions. However, this migration was complex due to the mixed storage of European and non-European merchants’ data and the constraints of the indexing pipeline. Moreover, the scale of data that needed to be migrated was large and would lead to outages for merchants’ search services which would negatively impact their revenue. This talk tells the story of how this team designed an architecture to migrate a large dataset to European regions without impacting merchants’ sales. We will review the technical decisions and the tradeoffs that were made to address the challenges faced.

Leila Vayghan, Shopify

Leila is an engineer at Shopify, where she spends her days enabling millions of merchants to grow by making sure buyers are able to search and find their products. She does this by running a large-scale search infrastructure on Kubernetes in many regions of the world. Leila has completed her master’s degree on the availability of stateful applications running on Kubernetes and has presented her work at conferences.

OIDC and CICD: Why Your CI Pipeline Is Your Greatest Security Threat

Monday, 4:20 pm–4:40 pm

Mark P Hahn, Qualys, and Ted Hahn, TCB Technologies

Your CI/CD process is chock-full of credentials, and almost everyone in your company has access to it. Configuring your CI correctly is vital to supply chain security. We discuss how to reduce that attack surface by enforcing proper branch permissions and by using OIDC to eliminate long-lived credentials and tie branches to roles.
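
The tie-branches-to-roles idea can be sketched as a policy lookup: once a CI job's OIDC token has been verified, its claims select a short-lived role instead of a stored secret. The claim names below follow the GitHub Actions OIDC token shape; the repository and role names are invented, and signature verification is assumed to have happened upstream:

```python
# Hypothetical policy: (repository, ref) pairs from verified OIDC
# claims map to the only role that branch may assume. No long-lived
# credential is stored anywhere in the pipeline.
ROLE_POLICY = {
    ("acme/shop", "refs/heads/main"): "deploy-prod",
    ("acme/shop", "refs/heads/staging"): "deploy-staging",
}

def role_for(claims):
    """Return the role this CI job may assume, or None for untrusted refs."""
    return ROLE_POLICY.get((claims.get("repository"), claims.get("ref")))

assert role_for({"repository": "acme/shop", "ref": "refs/heads/main"}) == "deploy-prod"
assert role_for({"repository": "acme/shop", "ref": "refs/heads/feature-x"}) is None
```

Because the role is derived from verified claims rather than a stored secret, a compromised feature branch gets no deployment credentials at all.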

Mark P Hahn, Qualys

Mark Hahn is Qualys's Solutions Architect for Cloud and DevOps Security. In this role he works with Qualys's clients to ensure that cloud applications and infrastructure are secure and reliable. Mark uses DevSecOps and Site Reliability Engineering practices to ensure that software and applications are deployed with high velocity and with the utmost security. He shows clients how to build security into software using agile methods and cloud-native distributed systems built for DevOps and rapid change.

Ted Hahn, TCB Technologies

Ted Hahn is an experienced Site Reliability Engineer who has worked at Google, Facebook, and Uber, most recently as the primary SRE for Houseparty, maintaining an infrastructure serving thousands of QPS for millions of users at a company of fewer than 50 people. He is currently an independent consultant.

When Your Open Source Turns to the Dark Side

Monday, 4:45 pm–5:30 pm

Dotan Horovits, Logz.io, CNCF Ambassador

Imagine waking up one morning to find out that your beloved open source tool, which lies at the heart of your system, is being relicensed. What does it mean? Can you still use it as before? Could the new license be infectious and require you to open source your own business logic?

This doomsday nightmare scenario isn't hypothetical. It is, in fact, very real: for databases, for Infrastructure-as-Code tooling, and for other OSS, with several examples over the past year alone.

In this talk Horovits will review some of the lesser-known risks of open source and share his lessons learned facing such relicensing moves, as well as other case studies from the past few years. If you use OSS, you'll learn how to safeguard yourself. If you're in the process of evaluating a new OSS, you'll learn to look beyond the license and consider additional criteria. If you're debating open-sourcing a project, you'll gain important perspectives to consider.

Dotan Horovits, Logz.io, CNCF Ambassador

Horovits lives at the intersection of technology, product and innovation. With over 20 years in the hi-tech industry as a software developer, a solutions architect and a product manager, he brings a wealth of knowledge in cloud and cloud-native solutions, DevOps practices and more.

Horovits is an international speaker and thought leader, as well as an Ambassador of the Cloud Native Computing Foundation (CNCF). Horovits is an avid advocate of open source and communities, an organizer of the CNCF Tel-Aviv meetup group and of Kubernetes Community Days and DevOpsDays local events, a podcaster at OpenObservability Talks, and a blogger, among others.

Currently working as the principal developer advocate at Logz.io, Horovits evangelizes on Observability in IT systems using popular open source projects such as Prometheus, OpenSearch, Jaeger and OpenTelemetry.

Track 2

Grand Ballroom BC

The Sins of High Cardinality

Monday, 3:55 pm–4:15 pm

Jef Spaleta, Isovalent

High cardinality is a sin in observability metrics collection. Cardinality in the form of granular labels or other metadata can cause exponential growth in your observability time-series storage and compute resources, costing money and slowing down queries. It's not always economical or practical to collect all possible metrics with all possible labels and then worry about how to extract value using queries after the fact. Sure, we want as much granularity as possible in our observability data, but it's a trade-off: we need to be strategic in using metric cardinality to get the granularity needed to discover and remediate problems.

This talk will present different strategies to constrain the impact of high metric cardinality, referencing applicable open-source Prometheus metrics collection examples.
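
The "exponential growth" the abstract warns about is multiplicative in the label set: the number of stored series is the product of each label's distinct-value count. A back-of-the-envelope sketch (the label names and counts here are invented for illustration):

```python
import math

def series_count(label_cardinalities):
    """Total time series produced by one metric = the product of the
    number of distinct values for each of its labels."""
    return math.prod(label_cardinalities.values())

base = {"service": 20, "endpoint": 50, "status": 5}   # 5,000 series
# Adding one granular label (e.g. a per-user id) multiplies the cost
# of every existing label combination:
with_user = {**base, "user_id": 10_000}               # 50,000,000 series
assert series_count(base) == 5_000
assert series_count(with_user) == 50_000_000
```

This is why cardinality-limiting strategies typically target a single offending label rather than the metric as a whole: dropping one multiplier collapses the entire blow-up.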

Jef Spaleta, Isovalent

Jef Spaleta has more than a decade of experience in the technology industry as a software engineer, open source contributor, IoT hardware developer, and operations engineer, and most recently as a community advocate at Isovalent.

Optimizing Resilience and Availability by Migrating from JupyterHub to the Kubeflow Notebook Controller

Monday, 4:20 pm–4:40 pm

David Hoover and Alexander Perlman, Capital One

This presentation details our transition from JupyterHub to the Kubeflow Notebook Controller.

JupyterHub was architected in a backend-agnostic way that "supports" Kubernetes but isn't truly Kubernetes-native. As a result, it has significant shortcomings with respect to resilience and high availability. In particular, the core component, the hub API, can only run one replica at any given time.

In contrast, the Kubeflow Notebook Controller is built from the ground up to be Kubernetes-native, using the operator pattern. There's far less complexity, fewer components, less brittleness, and improved resilience and high availability.

As a result, our platform has been able to scale to four times as many users, including ten times as many concurrent executions. Our users are happier and there's less operational overhead for platform engineers. Our journey illustrates how properly leveraging Kubernetes-native architecture confers significant benefits.

David Hoover, Capital One

David is a Sr. Lead DevOps Engineer at Capital One. He works on an enterprise-scale Machine Learning Platform to facilitate superior outcomes for Data Scientists and machine learning engineers. His professional interests include Docker, Cybersecurity, Python and Kubernetes and he spends his free time listening to heavy metal and as a cinema-phile.

Alexander Perlman, Capital One

Alexander Perlman is a senior lead software engineer at Capital One's Machine Learning Experience organization. His areas of focus include distributed compute and workflow orchestration. He lives in the NYC metro area with his wife and three young children. He believes that the correct pronunciation of "kubectl" is "kube-control," not "kube-cuddle." His favorite bubble tea flavor is taro.

99.99% of Your Traces Are (Probably) Trash

Monday, 4:45 pm–5:30 pm

Paige Cruz, Chronosphere

Distributed tracing is still finding its footing in many organizations today, despite being an industry concept for 14 years! One challenge to overcome is managing the data volume. Keeping 100% of traces is expensive and unnecessary - enter sampling. From head-based, tail-based or mixing and matching there is a buffet of configuration options available with OpenTelemetry. Let’s review the tradeoffs associated with different sampling configurations and their impact to the cost and value of your tracing data.
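
The simplest entry in that buffet, head-based sampling, decides a trace's fate at its start, deterministically on the trace ID so every participating service agrees. A minimal sketch (the rate and hashing scheme are illustrative, not OpenTelemetry's actual sampler implementation):

```python
import random

def head_sample(trace_id, rate=0.01):
    """Head-based sampling: keep/drop is decided when the trace begins,
    before anyone knows whether it will succeed or fail. Hashing the
    trace_id makes every service in the trace reach the same decision."""
    return (trace_id % 10_000) < rate * 10_000

# Keeps roughly 1% of traces -- cheap and simple, but failed traces are
# dropped at the same rate as healthy ones. That blind spot is the gap
# tail-based sampling (deciding after the trace completes) tries to close.
kept = sum(head_sample(random.getrandbits(64)) for _ in range(100_000))
```

The cost/value trade-off in the abstract falls out directly: head-based sampling is the cheapest to run but the most likely to discard the rare, interesting traces.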

Paige Cruz, Chronosphere

Paige is passionate about realizing the dream of “instrument once, observe anywhere” powered by standards like OpenTelemetry. As a developer advocate at Chronosphere, she focuses on making observability and cloud native concepts approachable for all.

Starting out in technical recruiting at New Relic, she discovered her inner data nerd and transitioned to software engineering working on their first distributed tracing product. Then she followed her curiosity about operations, containers and cloud native technologies to hold the pager as an SRE for several startups before foraying into advocacy and marketing.

5:30 pm–6:30 pm

Showcase Happy Hour

Pacific Concourse

Tuesday, March 19

8:00 am–9:00 am

Continental Breakfast

Pacific Concourse

9:00 am–10:30 am

Tuesday Plenary Session

Grand Ballroom

Meeting the Challenge of Burnout

Tuesday, 9:00 am–9:45 am

Christina Maslach, University of California, Berkeley

Burnout is an occupational phenomenon that results from chronic workplace stressors that have not been successfully managed. Research on burnout has identified the value of fixing the job, and not just the person, within six areas of job-person mismatch. Improving the match between people and their jobs is the key to managing the chronic stressors, and can be done on a routine basis as part of regular organizational checkups. Better matches enable people to work smarter, rather than just harder, and to thrive rather than to get beaten down.

Christina Maslach, University of California, Berkeley

Christina Maslach, PhD, is a Professor of Psychology (Emerita) and a researcher at the Healthy Workplaces Center at the University of California, Berkeley. She received her A.B. from Harvard, and her Ph.D. from Stanford. She is best known as the pioneering researcher on job burnout, producing the standard assessment tool (the Maslach Burnout Inventory, MBI), books, and award-winning articles. Her latest book is The Burnout Challenge: Managing People’s Relationships with their Jobs (2022). The impact of her work is reflected by the official recognition of burnout, as an occupational phenomenon with health consequences, by the World Health Organization in 2019. She has been honored with multiple awards, both academic and public, including an award from the National Academy of Sciences for scientific reviewing (2020), and her inclusion in Business Insider’s 2021 list of the top 100 people transforming business.

What We Want Is 90% the Same: Using Your Relationship with Security for Fun and Profit

Tuesday, 9:45 am–10:30 am

Lea Kissner, Lacework

Security and SRE are natural allies in the fight against randomness, terrible systems, and how those systems can hurt people and give us yet another reason to hate surprises. This talk goes over where our interests overlap, where they don’t, why, and how to take advantage of this in a world of infinite possibilities and limited, prioritized time.

Lea Kissner, Lacework

Lea is the CISO of Lacework and on the board of USENIX. They work to build respect for users into products and systems through deeply-integrated security and privacy. They were previously the CISO and Head of Privacy Engineering at Twitter, Global Lead of Privacy Technology at Google, came in to fix security and privacy at Zoom, and CPO of Humu. They earned a Ph.D. in computer science (with a focus on cryptography) at Carnegie Mellon University and a BS in electrical engineering and computer science from UC Berkeley.

10:30 am–11:00 am

Break with Refreshments

Pacific Concourse

11:00 am–12:35 pm

Track 1

Grand Ballroom A

Thawing the Great Code Slush

Tuesday, 11:00 am–11:45 am

Maude Lemaire, Slack

October 2020 wasn't a great month for Slack. Plagued with multiple hours-long outages, our engineering leadership team called for a code slush: all pull requests to our build and configuration mono-repo, aptly named "chef-repo", would need to be reviewed live, over Zoom, by a change advisory board (CAB) until further notice.

Five months later, CAB was still alive and well. We'd made some ergonomic improvements to the process, but development for our infrastructure team had slowed to a crawl.

Armed with a beginner's mindset (I'd only committed to the repo once under the new process), I decided to take matters into my own hands. What follows is a tale of building trust, making compromises, and slowly but surely restoring our long-lost productivity without eroding reliability.

Maude Lemaire, Slack Technologies, Inc.

Maude is a Senior Staff Engineer at Slack where she is a founder and technical lead for the backend performance infrastructure team. She's responsible for large-scale load test tooling, performance regression monitoring, and successfully onboarding the world's largest companies to Slack. Over the past six years, she's helped the product scale from just 60,000 users per team to over 2 million. When she doesn't have her nose in a flamegraph, you can find Maude building strong, empathetic engineering cultures.

In October 2020, Maude published "Refactoring at Scale" with O'Reilly Media, a blueprint for how technical leaders can successfully navigate large, complex refactors.

Resilience in Action

Tuesday, 11:50 am–12:35 pm

Will Gallego

We can only improve on things we introspect on. We often say that we want to build resilient socio-technical systems, ones in which the human operators are able to respond and adapt to unexpected failure modes, so how do we develop this resilience?

In this session we'll look at common, everyday tasks we all participate in to build resilience within our organizations without even realizing it. Coupled with research from many other areas outside of technology, we can continue to grow our ability to develop deep resilience in preparation for the unexpected incidents just around the corner.

Will Gallego

Will Gallego is a systems engineer with 20+ years of experience in the web development field. Comfortable with several parts of the stack, he focuses now on building scalable, distributed backend systems and tools to help engineers grow. He believes in a free and open internet, blame aware incident retrospectives, and pronouncing gif with a soft “G”.

Track 2

Bayview Room

"Logs Told Us It Was Kernel – It Wasn't"

Tuesday, 11:00 am–11:45 am

Valery Sigalov, Bloomberg

In this presentation, we will demonstrate that the Linux kernel is not always responsible for application performance problems. We will review various techniques that can be used to investigate application performance issues. The audience will learn how to write cache-friendly code to optimize application performance and how to use compilation flags to achieve further performance optimizations.

Valery Sigalov, Bloomberg

I am a Software Engineer on the System Internals team at Bloomberg. Our team works on certifying new Linux kernels and related subsystems, troubleshooting performance issues, and developing tools for performance analysis. Outside of work, I enjoy reading, listening to classical music, and travelling with my family around the world.

Autopsy of a Cascading Outage from a MySQL Crashing Bug

Tuesday, 11:50 am–12:35 pm

Jean-François Gagné, Aiven, and Swetha Narayanaswamy, HubSpot

Once upon a time, an application query triggered a crashing bug. After automated failure recovery, the application re-sent the query, and MySQL crashed again. This was the beginning of a cascading failure that led to full datastore unavailability and some partial data loss.

MySQL's stability means that we can easily forget to implement operational best practices like cascading-failure prevention and testing of unlikely recovery scenarios. It happened to us, and this talk is about how we recovered and what we learned from the situation.

Come to this talk for a full post-mortem of a cascading outage caused by a crashing bug. This talk will not only share the operational details of the incident, but also what we could have done differently to reduce its impact (including avoiding data loss), and what we changed in our infrastructure to keep it from happening again (including cascading-failure prevention).

Jean-François Gagné, Aiven

Jean-François is a System / Infrastructure Engineer currently working as a MySQL Open Source Developer in Aiven’s Open Source Program Office (OSPO). Before that, his missions were improving operations and scaling MySQL and MariaDB infrastructures at HubSpot, MessageBird and Booking.com. J-F is also the maintainer of Planet for the MySQL Community: a news aggregator for the MySQL Ecosystem. Before being involved with MySQL, he worked as a System / Network / Storage Administrator in a Linux and VMWare environment, as an Architect for a Mobile Telco Service Provider, and as a C & Java Programmer in an IT Service Company. Even before that, while learning computer science, Jeff studied Cache and Memory Consistency in Distributed Systems and Network Group Communication Protocols (yes, the same as Group Replication).

Swetha Narayanaswamy, HubSpot

Swetha Narayanaswamy is a Director of Engineering leading the Data Infrastructure team at HubSpot. The HubSpot application platform is made up of over 15,000 components that are deployed 3,000+ times a day, and its systems make hundreds of billions of requests per day to HBase, Kafka, Elasticsearch, and MySQL. Prior to HubSpot, Swetha held leadership roles at a variety of infrastructure companies, including NetApp, Microsoft, and EMC, bringing innovative services to market in high-growth environments. In addition, Swetha is a non-profit board member, diversity advocate, and patent holder.

12:35 pm–1:50 pm

Luncheon

The Atrium
Sponsored by Incident

1:50 pm–3:25 pm

Track 1

Bayview Room

Navigating the Kubernetes Odyssey: Lessons from Early Adoption and Sustained Modernization

Tuesday, 1:50 pm–2:35 pm

Raul Benencia, ThousandEyes, part of Cisco

Available Media

The cloud-native ecosystem has evolved. Infrastructures self-heal. Features roll out seamlessly. Rollbacks happen automatically. Speed, quality, cost: all optimized. But you're in a different boat: you're dealing with an outdated legacy system, and a migration is essential to stay in the game.

Embark with us as we tackle this challenge, moving from scavenged servers to a global Kubernetes infrastructure. Our path took us through the complexities of modernizing legacy systems: managing a bare-metal Kubernetes cluster, migrating it to a self-managed Kubernetes cluster in the cloud, and migrating it once again to a managed installation using a service mesh. We faced tough trade-offs and learned to balance resources with operational efficiency. Now, as we reflect on our journey, we're left with valuable insights and a compelling narrative, one that promises to resonate with all SREs facing similar challenges as we continue to evolve in this rapidly changing ecosystem.

Raul Benencia, ThousandEyes, part of Cisco

Hailing from Argentina and now living in San Francisco, Raul Benencia has navigated a dynamic 15-year tech journey, from system and network administration, to operating systems development, to security engineering. He's currently an SRE Technical Leader at ThousandEyes, part of Cisco, where he has assumed a myriad of roles and is currently leading its Kubernetes and containers strategy. Off duty, he immerses himself in compelling books, contributes to the Debian project, and nimbly M-x's through Emacs.

Kube, Where’s My Metrics? The Challenges of Scaling Multi-Cluster Prometheus

Tuesday, 2:40 pm–3:25 pm

Niko Smeds and Iain Lane, Grafana Labs

Available Media

Service and systems monitoring is crucial for healthy, happy applications. For small to medium-sized Kubernetes environments, a single Prometheus HA pair is often sufficient. But when you scale your environments, new challenges arise. As Prometheus ingests and stores significantly more time-series metrics, it can struggle to keep up. And if cluster numbers increase, it becomes difficult to locate the metrics you need. These are known problems with a few common solutions.

In this talk, Niko and Iain will discuss the challenges of scaling Prometheus by walking you through the history of monitoring Grafana Labs' internal services. As they scaled to 40+ clusters over five cloud providers, the Grafana Labs teams frequently iterated on their internal monitoring architecture. They'll cover the basics of Prometheus for monitoring and alerting before moving into remote write and monitoring of multiple clusters.
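A common starting point for the multi-cluster pattern described above is running a small Prometheus per cluster and forwarding samples to a central store via remote write. A minimal, hypothetical config sketch (the endpoint URL and labels are placeholders, not Grafana Labs' actual setup):

```yaml
# Per-cluster Prometheus: scrape locally, forward to a central store.
global:
  external_labels:
    cluster: prod-us-east-1   # lets the central store tell clusters apart
remote_write:
  - url: https://metrics-central.example.com/api/v1/push
    queue_config:
      max_samples_per_send: 2000   # batch size per remote-write request
```

The `external_labels` block is what makes metrics findable again once dozens of clusters share one backend: every forwarded series carries its cluster of origin.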

Niko Smeds, Grafana Labs

Niko is a senior software engineer at Grafana Labs, where he helps build and monitor the Kubernetes and cloud platforms. From OpenStack to K8s, he has experience with both private and public cloud infrastructure.

Iain Lane, Grafana Labs

Iain is a senior software engineer at Grafana Labs. A member of the Cloud Platform team, his focus is on maintaining the infrastructure (Kubernetes clusters) that runs Grafana Cloud, and helping build tools and processes for engineers to deploy their software into this environment with maximum confidence.

Track 2

Grand Ballroom BC

What Is Incident Severity, but a Lie Agreed Upon?

Tuesday, 1:50 pm–2:35 pm

Em Ruppe, Jeli.io/PagerDuty

Is your incident severity a complex word problem? Do you arbitrate which severity an incident should be during the incident itself? Do you know what the differences between your severities are? Do you even have incident severities in your org? Let's talk about the lies we tell ourselves about incident severity, and how to find the kind of severities that might actually work for your organization.

Em Ruppe, Jeli.io/PagerDuty

Em Ruppe is a Product Manager at Jeli.io (recently acquired by PagerDuty) who has been referred to as “the Bob Ross of incident reviews.” Previously they wrote hundreds of status posts, incident timelines, and analyses at SendGrid, and were a founding member of the Incident Command team at Twilio. Em believes the most important thing in both life and incidents is having enough snacks.

Hard Choices, Tight Timelines: A Closer Look at Skip-level Tradeoff Decisions during Incidents

Tuesday, 2:40 pm–3:25 pm

Dr. Laura Maguire, Trace Cognitive Engineering, and Courtney Nash, The VOID

Available Media

Unexpected outages in software service delivery—also known as incidents—often require making rapid tradeoff decisions on the road to recovery. Tradeoffs can be relatively minor—rolling back a recent change or temporarily disabling a certain feature—or they can represent significant threats to reliability or reputation, such as when facing a loss of customer data. While the resolution of incidents is unquestionably in the hands of engineers, senior management also has an active role in making tradeoff decisions during significant incidents. As researchers interested in software incidents, we recognized a gap in the industry’s understanding of how different levels across the organization work together to resolve challenging incidents.

Our objective in this research is to examine the kinds of tradeoff decisions management faces during incidents, the patterns in how and when they become involved, and the strategies used to coordinate effectively with their incident response teams.

During this talk you’ll get a behind the scenes (and between the ears!) look at management tradeoff decisions and how this knowledge can be used to increase an organization's capacity to handle unexpected events.

Laura Maguire, Trace Cognitive Engineering

Dr. Laura Maguire is the Principal Research Engineer at Trace Cognitive Engineering where she works with software organizations to bring new insights to their hard problems and to support the design of software for complex, cognitively demanding work. She is a Fellow with the Cognitive Systems Engineering Lab at Ohio State University and a Thesis Supervisor at Lund University in Sweden. Laura is an engaging and thought-provoking presenter who routinely speaks to software companies around the world on data driven strategies for enhancing cognitive and team performance.

Courtney Nash, The VOID

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. She created the VOID in 2021 to help shine a light on how we can more effectively learn from software incidents. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon.

Track 3

Seacliff Room

Workshop: Cloud-Native Observability with OpenTelemetry

Tuesday, 1:50 pm–5:30 pm

Liz Fong-Jones, Honeycomb

Available Media

Modern systems architecture often splits functionality into microservices for adaptability and velocity. The challenge of managing infrastructure for microservices has led to the cloud native ecosystem, including Kubernetes, Envoy, gRPC, and other projects. Observability, including application performance management (APM), is an essential component of a cloud-native stack. Without observability, application developers and operators cannot understand the behavior of their applications and ensure the reliability of those applications.

OpenTelemetry (the successor to OpenCensus and OpenTracing) is a standardized library and specification that collects distributed traces, metrics, and logs from instrumented services. By instrumenting once with OpenTelemetry, developers and operators can understand how data and events flow through their applications through a variety of different visualization backends.

In this workshop, you'll learn how to instrument a distributed set of microservices for traceability using OpenTelemetry, and how to analyze your service's traces using open source backends like Jaeger or Zipkin. You'll also learn to leverage OpenTelemetry's vendor-neutral flexibility to try out various other tracing backends without recompiling. You'll go home feeling comfortable implementing OpenTelemetry in your own applications and prepared to choose how you will store and visualize traces.

Tech requirement: a laptop with Wi-Fi is sufficient; we use Gitpod to provide IDEs for the workshop.

Liz Fong-Jones, Honeycomb

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 18+ years of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

She lives in Vancouver, BC with her wife Elly, partners, and a Samoyed/Golden Retriever mix, and in Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

3:25 pm–3:55 pm

Break with Refreshments

Pacific Concourse

3:55 pm–5:30 pm

Track 1

Bayview Room

Kubernetes: The Most Graceful Termination™

Tuesday, 3:55 pm–4:40 pm

Harrison Katz, ngrok

Available Media

As engineers, we’re constantly aware of topics like high availability, reliability, and defensive programming, but often we gloss over what happens to our applications when they behave as expected! Graceful termination and shutdown are common expectations of our container workloads in an orchestrated world. How does termination really work? Can we take advantage of a deeper understanding of Kubernetes’ termination pathways to improve our apps’ behaviours?

In this talk, I will review the Kubernetes approach to Pods, containers, and their runtime behaviours. I’ll cover the various termination pathways that exist via kubectl, kubelet, and the k8s API. The talk will finish with how to use this knowledge to implement The Most Graceful Termination for Kubernetes Apps™.

Harrison Katz, ngrok

Harrison is giving his first-ever conference talk as a Senior SRE at ngrok, on the topic of Kubernetes. He has worked in the DevOps and SRE spaces for just under 10 years and is familiar with CNCF tools, container orchestrators, and Infrastructure as Code tooling. Harrison's hobbies include rock climbing, learning languages, and puzzling. He also enjoys digging deep into the technical details of systems such as Kubernetes, and enjoys sharing this knowledge by teaching people of all levels of familiarity.

How We Went from Being Astronauts to Being Mission Control

Tuesday, 4:45 pm–5:30 pm

Laura Nolan, Stanza

Available Media

Laura looks at the common architectural shapes of dynamic control planes and some examples of how they fail spectacularly—many major cloud outages are caused by dynamic control plane issues. Why are dynamic control planes so hard to run, and what can be done about it?

Laura Nolan, Stanza

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand and control their production systems. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

Track 2

Grand Ballroom BC

Triage with Mental Models

Tuesday, 3:55 pm–4:40 pm

Marianne Bellotti, bellotti.tech

What powers those amazing insights that certain engineers bring to the conversation during triage? How do some people just look at a monitoring dashboard and immediately have a hypothesis of what's really going wrong? Experts carry around their own libraries of mental models of systems built up over years of operational experience, but you can learn how to form them intentionally, how to extract them from others, and how to manipulate them to see complex systems clearly.

Marianne Bellotti, bellotti.tech

Marianne Bellotti is a software engineer and relapsed anthropologist. Her work focuses on how culture influences the implementation and development of software. She runs engineering teams and teaches other people how to tackle complex systems. Most of her work has focused on restoring old systems to operational excellence, but she also works on the safety of cutting-edge systems and artificial intelligence. She has worked with the UN, served two Presidents, and is proud to clear blockers for a team of amazing engineers.

Defence at the Boundary of Acceptable Performance

Tuesday, 4:45 pm–5:30 pm

Andrew Hatch, LinkedIn

In the 1990s, Jens Rasmussen (Danish system safety, human factors, and cognitive systems engineering researcher) published "Risk Management in a Dynamic Society." The Dynamic Safety Model was a key element of this, used to illustrate how socio-technical organizations cope with pressure from competing economic, workload, and performance forces. We will use this model to show how these forces act within large technology organizations, continually pushing the point of operations closer to the Boundary of Acceptable Performance, and how, as we approach or cross that boundary, our lives as SREs become negatively impacted.

We will unpack the ruthless nature of the forces protecting economic boundaries, which manifest as layoffs and budget cuts, and how pushing people to exhaustion at the workload boundary decreases system safety and, ultimately, profitability. Lastly, we will examine how this model forms the underlying theory behind "chaos engineering" to detect and reinforce risk boundaries through feedback loops to build more resilient systems.

Andrew Hatch, LinkedIn

I have worked in the technology industry for over 25 years, predominantly in Australia, with time spent in India and, for the last three years, in the USA. My experience ranges from small to large-scale projects in multiple roles and industries spanning software engineering, consulting, and operations. In 2020, I migrated to the San Francisco Bay Area to take up a role at LinkedIn as an SRE Manager. Before this, I spent six years working at Australia's biggest online jobs and recruitment platform with the critical role of moving the business into AWS and up-leveling their Platform Engineering and Incident Management practices to support this. Since 2013, I have worked primarily in SRE Management roles and, through this experience, developed a passion for learning and adapting to complex systems and helping teams and organizations learn more from incidents to create better software, more resilient systems, and happier, empowered teams. I am a lifelong surfer and can now be found adapting to the crowds at Santa Cruz in California.

Track 3 (continued)

Seacliff Room

Workshop: Cloud-Native Observability with OpenTelemetry

Tuesday, 1:50 pm–5:30 pm

Liz Fong-Jones, Honeycomb


5:30 pm–7:00 pm

Conference Reception

The Atrium
Sponsored by FireHydrant

7:00 pm–8:00 pm

Lightning Talks

Grand Ballroom BC

Wednesday, March 20

8:00 am–9:00 am

Continental Breakfast

Grand Foyer

9:00 am–10:35 am

Track 1

Grand Ballroom A

System Performance and Queuing Theory - Concepts and Application

Wednesday, 9:00 am–9:45 am

Jeff Poole, Vivint / NRG

Available Media

What is queueing theory, and how does it help us understand how software systems perform? This talk will cover the basics of queueing theory and how it applies to software systems. It will then build intuition from the math about how we can expect our systems to perform in different configurations. Finally, it will walk through examples of how to use this information with real-world data and metrics. Attendees should leave the talk with an understanding of how to configure systems in the real world and how to tell when those systems are approaching a breaking point.
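As a taste of the intuition involved, the classic M/M/1 result shows why latency explodes near saturation; a minimal sketch (standard queueing theory, not material taken from the talk):

```python
def mm1_wait(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).

    As utilization rho = lambda/mu approaches 1, latency grows
    nonlinearly, which is why systems fall over near saturation.
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrivals >= service capacity")
    return 1.0 / (service_rate - arrival_rate)

# Utilization vs. latency: going from 50% to 90% busy quintuples the wait.
w50 = mm1_wait(50, 100)   # rho = 0.5 -> 0.02 s mean time in system
w90 = mm1_wait(90, 100)   # rho = 0.9 -> 0.10 s mean time in system
```

The same server, handed 1.8x the traffic, delivers 5x the latency, which is the kind of non-intuitive behavior the talk's math helps predict.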

Jeff Poole, Vivint / NRG

Jeff Poole is currently a Sr. Director of Engineering at Vivint Smart Home, leading back-end software and operations teams that manage the customer-facing infrastructure. Professionally, he has worked in digital hardware development, software engineering, and operations. His personal interests include collecting certifications, road cycling, and skiing.

It Is OK to Be Metastable

Wednesday, 9:50 am–10:35 am

Aleksey Charapko, University of New Hampshire

Available Media

Metastable failures are self-perpetuating performance failures characterized by a positive feedback loop that keeps systems in a degraded state. For a system to enter a metastable failure state, it first needs to be in a metastable vulnerable state in which some event triggers an overload condition and starts the feedback mechanism. A naïve way to dodge metastable failures is to avoid operating in the metastable vulnerable state, precluding the "trigger, overload, feedback loop" failure sequence. Unfortunately, avoiding the metastable vulnerable state is rarely a workable solution: in some cases it is simply impossible, and in others it leads to running systems with a high degree of overprovisioning, resulting in poor resource utilization and high cost.

In this talk, I will discuss why it is OK to be in a metastable vulnerable state and what strategies we can use to mitigate the risk of developing a metastable failure. I will present three cornerstones of metastable failure risk mitigation for large systems. The first one is understanding the environments, algorithms, and workloads. The second and third cornerstones — metastable failure trigger-resistant design and protection of vulnerable components — build on the insight developed in the understanding phase.
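The trigger-overload-feedback sequence can be illustrated with a toy discrete-time simulation (our own sketch, not the speaker's model), in which retries amplify load enough to keep a system overloaded long after a one-off spike has passed:

```python
def simulate(steps=20, demand=80, capacity=100, retry_factor=2,
             spike_at=5, spike=50):
    """Toy metastability model: each failed request is retried
    `retry_factor` times the next step. A single one-step spike tips
    the system into overload, and the retry feedback loop then
    sustains the overload even though the trigger is gone."""
    backlog, failed_history = 0, []
    for t in range(steps):
        offered = demand + backlog + (spike if t == spike_at else 0)
        failed = max(0, offered - capacity)
        backlog = retry_factor * failed   # the positive feedback loop
        failed_history.append(failed)
    return failed_history

history = simulate()
# Zero failures before the trigger; failures persist long after it.
```

With `retry_factor` below the tipping point the same system recovers on its own, which is exactly the vulnerable-vs-stable distinction the abstract describes.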

Aleksey Charapko, University of New Hampshire

Aleksey Charapko is an assistant professor at the University of New Hampshire. He broadly works at the intersection of performance, reliability, and efficiency of distributed systems. Before settling for an academic career, Aleksey had a nearly decade-long engineering career, ranging from freelance and consulting to working in big tech.

Track 2

Grand Ballroom BC

The Art of SRE: Building People Networks to Amplify Impact

Wednesday, 9:00 am–9:45 am

Daria Barteneva, Microsoft

If SRE were a well-defined science, we wouldn't need much mentoring, coaching, or cross-team groups. We could just do "the thing" and be done with it. But the truth is that our profession needs a significant component of human interaction to unlock a degree of success that would be unattainable without it.

In this talk we will look at some learnings from another field: choir direction. As a trained opera singer, I always sang in choirs—there is no better way to improve your technique! Going deeper into how choirs are balanced to elevate overall choir performance and help inexperienced singers, we will uncover an opportunity not broadly formalized in the engineering field: implicit coaching. Common in music and sports, implicit coaching is a powerful way to help engineers build and practice critical skills.

Expanding beyond traditional coaching techniques and understanding their pros and cons will allow you to find a solution that works for your individual situation and help you and your team to learn and effectively improve across different dimensions.

Daria Barteneva, Microsoft

Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organisational culture, processes, and platforms to improve service reliability and the on-call experience. She has spoken at conferences on various aspects of reliability and the human factors that play a key role in engineering practices, and has written for O'Reilly. Daria is originally from Moscow, Russia; she spent 20 years in Portugal and 10 years in Ireland, and now lives in the Pacific Northwest.

Teaching SRE

Wednesday, 9:50 am–10:35 am

Mikey Dickerson, Layer Aleph LLC and Pomona College

Available Media

After 15 years complaining about how we can never find enough SREs to hire, I designed and taught a class for computer science majors called "Managing Complex Systems." I will talk about how this went and what it says about the long-term future of the industry.

Mikey Dickerson, Layer Aleph LLC and Pomona College

Mikey Dickerson was an SRE and manager at Google from 2006-2014. He led the ad-hoc healthcare.gov repair effort in 2013, then founded the U.S. Digital Service in the Obama White House. Currently, he does consulting work with Layer Aleph LLC, teaches part time, and operates an astronomical observatory in rural Arizona.

10:35 am–11:05 am

Break with Refreshments

Pacific Concourse

11:05 am–12:40 pm

Track 1

Grand Ballroom A

Cross-System Interaction Failures: Don't Fail through the Cracks

Wednesday, 11:05 am–11:50 am

Tianyin Xu and Xudong Sun, University of Illinois Urbana–Champaign

Available Media

Modern cloud systems are orchestrated by many independent and interacting subsystems, each specializing in important services such as scheduling, data processing and storage, resource management, etc. Hence, overall cloud system reliability is determined not only by the reliability of each individual subsystem, but also by the interactions between them. With the recent adoption of microservice and serverless architectures, each individual subsystem becomes simpler and smaller, while the interactions between them grow in complexity and diversity. We observe that many recent production incidents in large-scale cloud systems manifest through failures of interactions across system boundaries, which we term "cross-system interaction failures", or "CSI failures". However, understanding and addressing such failures requires new techniques and practices that are often unavailable or under-developed.

In this talk, we will present our recent work on understanding and preventing cross-system interaction failures. Specifically, we will characterize the many faces of cross-system interactions and their diverse failure modes in modern cloud-native system stacks. We will then discuss the gaps in the current practices and existing techniques in the context of software testing and verification. Lastly, we will present some of the new techniques we developed at the University of Illinois at Urbana-Champaign to address cross-system interaction failures.

Tianyin Xu, University of Illinois Urbana–Champaign

Tianyin Xu is an Assistant Professor of Computer Science at the University of Illinois at Urbana-Champaign (UIUC). His research focuses on techniques for the design and implementation of reliable computer systems, especially those that operate at cloud and datacenter scale. He has been on the UIUC List of Teachers Ranked as Excellent five times since joining the CS department in 2018. He is the recipient of a Jay Lepreau Best Paper Award at OSDI 2016, a Best Paper Award at ASPLOS 2020, a Best Student Paper Award at SIGCOMM 2021, two SIGSOFT Distinguished Paper Awards at ISSTA 2021 and FSE 2021, a Gilles Muller Best Artifact Award at EuroSys 2023, and a CACM Research Highlight. He is also a recipient of an NSF CAREER Award, an Intel Rising Star Faculty Award, a Facebook Distributed Systems Research award, and a Doctoral Award for Research from the Department of CSE at the University of California San Diego. He is an editor of the SIGOPS Blog and an area chair of the Journal of Systems Research. More details can be found on his webpage: https://tianyin.github.io/.

Xudong Sun, University of Illinois Urbana–Champaign

Xudong Sun is a 5th-year Ph.D. student in the Computer Science Department at the University of Illinois at Urbana–Champaign. He is broadly interested in any topics related to system correctness and reliability. Currently, his research focuses on (1) improving the quality of existing cloud systems using systematic testing and (2) building provably correct cloud systems using formal verification.

Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

Wednesday, 11:55 am–12:40 pm

Ze Li, Microsoft Azure, and Ryan Huang, University of Michigan

Available Media

Cloud scale provides the vast resources necessary to repair and fix failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this talk, we discuss our experiences with gray failure in Microsoft Azure, showing its broad scope and consequences through several case studies. We also argue that a key feature of gray failure is differential observability: the system’s failure detectors may not notice problems even when applications are afflicted by them. We will show how Microsoft Azure applied differential observability in practice and bridged the gap between different components’ perceptions of what constitutes failure.
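The notion of differential observability can be caricatured in a few lines (our illustration, not Azure's implementation): a component is gray-failing when the infrastructure's own probes and the applications' experience disagree:

```python
def differential_health(infra_probe_ok, app_error_rate, threshold=0.05):
    """Classify a component by comparing two views of its health:
    the infrastructure's failure detector (e.g. a heartbeat probe)
    and the error rate observed by the applications that use it.
    Names and threshold are illustrative."""
    if not infra_probe_ok:
        return "failed"    # fail-stop: both views agree it is down
    if app_error_rate > threshold:
        return "gray"      # detectors see health, applications see errors
    return "healthy"
```

The "gray" branch is precisely the blind spot the talk describes: a heartbeat that still answers while the component drops or delays real work.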

Ze Li, Microsoft Azure

Dr. Ze Li is a principal data scientist manager at Microsoft Azure. Currently, he is focused on using data-driven and AI/LLM technologies to build and operate cloud services efficiently and effectively, including safe deployment in large-scale systems and intelligent anomaly detection and auto-diagnosis through data mining in cloud services. Previously, he worked as a data scientist/engineer at Capital One and MicroStrategy, where he provided data-driven solutions to improve efficiency in financial services and business intelligence services. He has published more than 40 peer-reviewed papers in the fields of data mining, distributed networks/systems, and mobile computing.

Ryan Huang, University of Michigan

Dr. Ryan Huang is an Associate Professor in the EECS Department at the University of Michigan, Ann Arbor, where he leads the Ordered Systems Lab. He conducts research broadly in computer systems, specializing in principled methods to improve the reliability and performance of large-scale systems. His work has received multiple best paper awards at top conferences. He is a recipient of the NSF CAREER Award and a Meta research award. More information about him can be found at https://web.eecs.umich.edu/~ryanph/

Track 2

Grand Ballroom BC

The Invisible Door: Reliability Gaps in the Front End

Wednesday, 11:05 am–11:50 am

Isobel Redelmeier, Discord

It's hard for anyone to experience our services' many nines if our apps keep crashing. Front-end reliability is critical to keeping our users happy, so how can we reliability engineers help improve it?

This talk will explore the unique technical and organizational challenges facing the front end, with particular emphasis on observability. Through specific examples from my experience as an SRE working on mobile reliability, you will learn about hurdles such as how difficult it can be to add even basic metrics, and get inspired by how well-connected front-end tooling is to business analytics. Ultimately, we'll imagine what a more holistic approach to reliability could look like and consider a path to achieving it.

No CSS knowledge required :)

Isobel Redelmeier, Discord

Isobel Redelmeier is currently at Discord, where as the first site reliability engineer she helped grow the reliability practice and has worked with a variety of teams to make their services more resilient. Her previous endeavours include getting OpenTelemetry off the ground while at Lightstep and making Cloud Foundry more secure during her time at Pivotal. She loves pointing out shiny things, from weird graphs to spectacular sunsets, and hanging out with her dog Mischa.

Automating Disaster Recovery: The Ultimate Reliability Challenge

Wednesday, 11:55 am–12:40 pm

Ricard Bejarano, Cisco Systems Inc.

Available Media

Here's how I explain my job to non-techies: if a meteor struck our servers, it's on my team to fix it. But what if it did? Realistically, what would happen if a meteor struck your datacenter?

Here's the story of a vision to fully automate disaster recovery away: how I pushed back on it, claiming it was impossible, and how we still executed on it to great success.

Ours is also a case study in why looking at these wide-surface problems through a sociotechnical lens will set you up for success in places you could never have anticipated.

So if a metaphorical meteor hit our datacenter, we would just press our metaphorical big red button.

Ricard Bejarano, Cisco Systems Inc.

Ricard is a Lead Site Reliability Engineer on ThousandEyes' SRE team. His background is mostly in networking, observability, incident management, infrastructure automation, and hunting down the weirdest of bugs. He captained the execution of the company's vision to fully automate disaster recovery away.

12:40 pm–1:55 pm

Luncheon

The Atrium
Sponsored by Sentry

1:55 pm–3:30 pm

Track 1

Grand Ballroom A

From Chaos to Clarity: Deciphering Cache Inconsistencies in a Distributed Environment

Wednesday, 1:55 pm–2:15 pm

Akashdeep Goel and Prudhviraj Karumanchi, Netflix Inc

In this session, we'll discuss a distributed caching system used at Netflix for streaming, live events, games, ads, and more, across multiple regions on a public cloud. Several components make the system highly resilient: a control plane, a replication engine, a proxy, a cache warmer, and others. We will focus on the replication engine, which processes around 30 million requests per second while keeping response times for 95% of those requests under 2 seconds across regions. We present and deep dive into a problem that nearly put us at risk of delaying the global launch of a business-critical initiative. We will walk you through the entire process from start to finish: the debugging journey, our takeaways, and how these debugging techniques are applicable to any organization. The talk will focus on how and when problems get introduced, how a single assumption can break the entire stack, and how such issues can be caught sooner.

Akashdeep Goel, Netflix Inc

Akashdeep Goel is a Senior Software Engineer at Netflix working on distributed systems that handle large-scale caching deployments for both streaming and gaming workloads across Netflix. Prior to this, Akashdeep worked on a distributed control plane at Azure Cosmos DB (Microsoft), delivering standby and failover infrastructure. Outside of work, he enjoys road trips, playing snooker, and exploring different cuisines.

Prudhviraj Karumanchi, Netflix Inc

Prudhviraj Karumanchi is a Staff Software Engineer on the Data Platform team at Netflix, building large-scale distributed storage systems and cloud services. Prudhvi currently leads the caching infrastructure at Netflix. Prior to Netflix, he worked at large enterprises such as Oracle, NetApp, and EMC/Dell, building cloud infrastructure and contributing to file, block, and object storage systems.

Patching Your Way to Compliance with a Small Team and a Pile of Technical Debt

Wednesday, 2:20 pm–2:40 pm

Filipe Felisbino, Udemy Inc

Feeling overwhelmed by a mountain of overdue system patches and a tiny team? You're not alone!

Our team was in a similar position a few years ago: buried in tech debt, unable to keep up with patching new vulnerabilities, and consumed by toil, all while trying to keep up with business growth and increasingly strict security compliance requirements.

In this talk you'll learn how we used a three-pronged strategy to break this vicious cycle without growing the team, the risks we faced, and the trade-offs and hard decisions we made.

Filipe Felisbino, Udemy Inc

Filipe is a site reliability engineer at Udemy working on infrastructure, automation, Kubernetes, and more. Before that, he spent several years developing software for network security solutions. Originally from Brazil, currently living in California!

Strengthening Apache Pinot's Query Processing Engine with Adaptive Server Selection and Runtime Query Killing

Wednesday, 2:45 pm–3:30 pm

Jia Guo, LinkedIn

Apache Pinot, a real-time, distributed OLAP database, encountered resiliency challenges in its large production deployment due to server failures, slowness, and unpredictable query patterns. To ensure adherence to strict SLAs, enhancements were made to its Query Processing framework through two key features: Intelligent Query Routing and Runtime Query Killing.

Intelligent Query Routing replaces the traditional round-robin server query distribution with a dynamic approach based on continuous server performance metrics. This adaptive method proactively routes queries to more efficient servers, reducing the need for manual intervention and preserving availability during transient server issues.

Runtime Query Killing addresses the issue of complex queries causing Out of Memory errors and availability dips. By monitoring resource consumption in real-time and employing a lockless sampling algorithm for minimal overhead, this framework can terminate resource-intensive queries preemptively, safeguarding system availability and maintaining high SLAs.

This technical talk will delve into the design, deployment practices, and observed benefits of these features in a large-scale Pinot deployment, offering insights for developers and administrators of similar large-scale distributed data management systems.
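To make the routing idea concrete, here is a minimal hypothetical sketch of adaptive server selection; this is not Pinot's actual implementation, and the class name, scoring formula, and smoothing factor are all assumptions made for illustration.

```python
class AdaptiveRouter:
    """Hypothetical sketch of adaptive server selection (not Pinot's code).

    Instead of round-robin, each server carries an exponentially weighted
    moving average (EWMA) of observed query latency plus an in-flight
    request count; queries go to the server with the lowest score, so
    slow or struggling servers naturally receive less traffic.
    """

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # EWMA smoothing factor (assumed value)
        self.stats = {s: {"ewma_ms": 0.0, "in_flight": 0} for s in servers}

    def pick_server(self):
        # Penalize smoothed latency by current load; lowest score wins.
        def score(s):
            st = self.stats[s]
            return st["ewma_ms"] * (1 + st["in_flight"])

        best = min(self.stats, key=score)
        self.stats[best]["in_flight"] += 1
        return best

    def record_response(self, server, latency_ms):
        # Called when a query completes; recent latencies dominate, old
        # observations decay, so the router forgets transient slowness.
        st = self.stats[server]
        st["in_flight"] = max(0, st["in_flight"] - 1)
        st["ewma_ms"] = (1 - self.alpha) * st["ewma_ms"] + self.alpha * latency_ms
```

Under this scheme a slow server's EWMA rises, subsequent queries drain away from it automatically, and traffic returns once its latency recovers, which mirrors the talk's point that routing adapts without manual intervention.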

Jia Guo, LinkedIn

Jia Guo is a System and Infrastructure Engineer at LinkedIn's Online Analytics team, focusing on Pinot query execution, performance, and reliability. Prior to LinkedIn, Jia specialized in advancing computing systems through his academic research, and received his PhD in Distributed and Parallel Computing from The Ohio State University.

Track 2

Grand Ballroom BC

Taming the Linux Distribution Sprawl: A Journey to Standardization and Efficiency

Wednesday, 1:55 pm–2:15 pm

Raj Shekhar and James Fotherby, Quantcast

We encountered a significant challenge: the proliferation of numerous Linux distributions in our production environment following our cloud migration, which complicated management. In this session, we will reveal how we identified the right distributions, persuaded our team to embrace them, and navigated the crucial trade-offs that facilitated this important transition. This talk is specifically designed for Site Reliability Engineers (SREs) who are striving for system standardization.

Raj Shekhar, Quantcast

Raj Shekhar is a Staff System Engineer on the Cloud Infrastructure team at Quantcast. He enjoys poking sleeping dragons in large-scale distributed systems. When not working, you can find him planning his next getaway, usually to a place accessible by motorcycle.

James Fotherby, Quantcast

James Fotherby is a Systems Engineer at Quantcast, adept in Kubernetes Infrastructure and Core Platform Tools, maintaining smooth infrastructure operations. He possesses a passion for reliability, monitoring, and developing self-healing systems with vendor agnostic components. His dedication to innovation and excellence plays a pivotal role in enhancing system efficiency and resilience.

Frontend Design in SRE

Wednesday, 2:20 pm–2:40 pm

Andreas F. Bobak, Google

Available Media

Historically, SREs built tools that were often useful only to a small set of users, leading companies to focus their attention (and money) on training users rather than on building frontends that are easily comprehensible to a wide range of users. As a company grows, the percentage of people who understand the intricacies of the full production stack shrinks relative to the percentage who want to understand the state of production. This requires a shift in tradeoffs: away from scrappy tooling and toward a more deliberate approach to designing user experiences for SRE. This talk summarizes three important aspects to keep in mind when making those tradeoffs.

Andreas Bobak, Google

Andreas has developed scalable frontends for more than 20 years. He joined Google SRE five years ago to help improve the frontend of the internal monitoring application. Prior to that, he worked for several banks, focusing on developing modern, user-friendly trading software (FX, structured products) for internal and external users.

Measuring Reliability Culture to Optimize Tradeoffs: Perspectives from an Anthropologist

Wednesday, 2:45 pm–3:05 pm

Kathryn (Casey) Bouskill, Meta

Understanding how culture influences engineering practices is often a black box. At Meta, we worked to transform our mantra from "move fast and break things" to "move fast with stable infrastructure" by paying attention to the cultural elements of reliability work. This talk describes that process and decodes how to systematically measure on-the-ground perspectives on reliability work, so that we can make informed decisions about how to optimize for the right degree of reliability. The audience will take away actionable practices for evaluating their underlying reliability culture; data-driven approaches to measuring reliability sentiment, along with the barriers and facilitators to performing the work; and ways to identify practices that allow for a more holistic, data-driven prioritization of reliability efforts aligned with cultural values, especially when there are competing demands and increasing pressure to optimize efficiency.

Kathryn (Casey) Bouskill, Meta

Kathryn (Casey) Bouskill is a researcher on Meta's Reliability Engineering team. An anthropologist by training, she applies qualitative and quantitative research methods to support reliability engineering work. She also continues a portfolio of public health and health security research. Bouskill has a Ph.D. in anthropology and an M.P.H. in epidemiology from Emory University.

Storytelling as an Incident Management Skill

Wednesday, 3:10 pm–3:30 pm

Laura de Vesine, Datadog, Inc

Available Media

Managing incidents well requires a number of skills — debugging and systems understanding, strong communication, and high-speed project management — but we rarely talk about the power of storytelling in our incident management loop. From oncall preparation, to incident handling, to postmortem creation, skill at storytelling can support and even improve our engineering skills. Establishing a setting, building a coherent narrative, and understanding your characters builds memorable and compelling communication that can level up all stages of your incident management.

Laura de Vesine, Datadog, Inc

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 8 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.

3:30 pm–4:00 pm

Break with Refreshments

Grand Foyer

4:00 pm–5:30 pm

Closing Plenary Session

Grand Ballroom

Real Talk: What We Think We Know — That Just Ain’t So

Wednesday, 4:00 pm–4:45 pm

John Allspaw, Adaptive Capacity Labs

Available Media

Dr. David Woods once said to me “We cannot call it a scientific field unless we can admit we’ve gotten things wrong in the past.” Do we, in this community, do that? Well-formed critique is critical for any field — including SRE — to progress.

I’d like to talk about a few ideas, assumptions, and concepts often talked about in this community but whose validity is rarely questioned or explored.

John Allspaw, Adaptive Capacity Labs

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to The DevOps Handbook. His 2009 Velocity talk with Paul Hammond, 10+ Deploys Per Day: Dev and Ops Cooperation, helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

What Can You See from Here?

Wednesday, 4:45 pm–5:30 pm

Tanya Reilly, Squarespace

Available Media

What you can see depends on where you stand. Your vantage point plays a big part in how you work, what you think is important, and how you interpret what's going on around you. Even your job title or team name can influence–and limit!–how you see the world. It's easy to find yourself stuck: no longer learning, polishing a service nobody else cares about, or wondering why it's so hard to get anything done outside your own team.

A change in perspective can change what you think is important, how you influence the decisions that you care about, and even what you think is possible for yourself. In this talk, we'll look at how to get a broader view.

5:30 pm–5:35 pm

Closing Remarks

Grand Ballroom
Program Co-Chairs: Sarah Butt, SentinelOne, and Dan Fainstein, The D. E. Shaw Group