9:30 am–9:40 am
Opening Remarks
Program Co-chairs: Mohit Suley, Microsoft, and Heidi Waterhouse, LaunchDarkly
9:40 am–11:00 am
Plenary Session
The 'Success' in SRE Is Silent
Casey Rosenthal, Verica.io
Establishing the Return-on-Investment (ROI) for Availability work/projects/investments is really hard. There are some strategies for doing it, which I'll explain in the meat of the presentation. If we don't change the course of common discourse regarding ROI, we're going to miss huge opportunities to invest in SRE, or worse, end up in a regulatory hell that sucks all the fun out of it and produces worse outcomes.
Building and Running a Diversity-focused Pre-internship Program for SRE
Andrew Ryan, Meta Inc.
At Meta, almost all of our early-career Production Engineering hires are former university interns. However, most university students do not even know that PE/SRE exists as a career, nor do they have an opportunity to get any training or coursework toward an internship or career in SRE.
To address these gaps, in 2021 we started our first SRE-focused, fully remote pre-internship program, with 96 college students, focusing on diverse participants who would not normally be recruited through our traditional recruiting pipelines.
In this talk, we will explain why and how we started this program, give extensive details about program administration, curriculum, and our learnings, as well as some preliminary results from the program. It is our hope that audience members will walk away with an understanding of what we did, and be able to go back to their own organizations to implement a similar program.
Andrew Ryan, Meta, Inc.
Andrew has been a Production Engineer at Meta since 2009. During this time, he has worked on teams across Meta's infrastructure, including big data, large scale caches, and CDN infrastructure. He has also been heavily involved with building successful large-scale programs for hiring and developing engineers at Meta, including our Production Engineering intern program.
11:00 am–11:30 am
Break with Refreshments
In-person attendees: Visit the Sponsor Showcase in the Pacific Concourse, Lower Level.
11:30 am–12:50 pm
A Postmortem of SRE Interviewing
Michael Kehoe, Confluent
After applying at over 40 companies and participating in more than 110 interviews, I found my role. In this talk, I want to discuss how we can make the interviewing experience better for all parties involved (interviewers, interviewees, hiring managers and recruiters). The session will cover everything from application to offer acceptance and discuss actions we can all take to make SRE hiring better.
Michael Kehoe, Confluent
Michael is an author, speaker, and Sr. Staff Security Engineer currently architecting Confluent’s cloud security. Previously, Michael worked at LinkedIn, architecting the company’s move to Microsoft Azure. Before graduating with a Bachelor of Electrical Engineering from the University of Queensland (Australia), Michael interned at NASA Ames Research Center building small satellites known as Phonesats. Michael has spoken at numerous events all over the world and in 2018 was a co-author of the book “Reducing MTTD for High Severity Incidents” and the recently released “Cloud Native Architecture with Azure”.
Self-Destructing Feature Flags
Jamie Gaskins, Forem
You deployed an optimization behind a feature flag, tested it yourself in production, and QA signed off. After enabling the feature flag, you compare error rates between this version and the last for a while and see no significant change, so you decide to go to lunch. Right after you walk out the door, the error rate skyrockets because a group of power users from your largest clients have all just come online.
Obviously the answer is to disable the feature flag, but you're AFK so you have to hop onto Slack to tell someone on your team that that feature flag is the likely culprit so they can disable it. What if they didn't have to disable it manually? What if we automated that? This would be a complete non-event. This approach is what you'll learn about in this talk.
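To make the idea in this abstract concrete, here is a minimal sketch of a flag that disables itself when the error rate of the code path it guards climbs too high. It is an illustration only, not the speaker's implementation: the class name `SelfDestructingFlag`, the sliding-window size, the minimum sample count, and the 5% default threshold are all assumptions chosen for the example.

```python
from collections import deque


class SelfDestructingFlag:
    """Hypothetical feature flag that turns itself off on a high error rate."""

    def __init__(self, name, max_error_rate=0.05, window=100, min_samples=20):
        self.name = name
        self.enabled = True
        self.max_error_rate = max_error_rate  # assumed threshold, tune per service
        self.min_samples = min_samples        # avoid tripping on a tiny sample
        # Keep only the most recent outcomes (True means the call failed).
        self.outcomes = deque(maxlen=window)

    def record(self, failed):
        """Record one outcome of the flagged code path and re-check the rate."""
        self.outcomes.append(bool(failed))
        if not self.enabled or len(self.outcomes) < self.min_samples:
            return
        error_rate = sum(self.outcomes) / len(self.outcomes)
        if error_rate > self.max_error_rate:
            # Self-destruct: nobody has to flip the flag by hand.
            self.enabled = False

    def run(self, new_path, old_path):
        """Run new_path while the flag is on, falling back to old_path."""
        if not self.enabled:
            return old_path()
        try:
            result = new_path()
        except Exception:
            self.record(failed=True)
            return old_path()
        self.record(failed=False)
        return result
```

A caller would wrap the optimized code as `flag.run(optimized_handler, original_handler)`, where both handlers are placeholders for the application's own functions. A real deployment would also need to persist the flag state across processes and notify the team when the flag trips, which this sketch leaves out.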
Jamie Gaskins, Forem
Jamie is a Principal Site Reliability Engineer at Forem, the company behind the DEV community and the Forem open-source software on which it runs.
12:50 pm–1:50 pm
Luncheon
Sponsored by DBS
1:50 pm–3:10 pm
Tales from the VOID: The Scary Truth about Incident Metrics
Courtney Nash, Verica
This talk presents research collected from the VOID, a new open database of public incident reports. Containing nearly 2,000 reports for 660 organizations, the database allows for more structured review and research about software-related incident reporting. Key results from our research challenge standard industry practices for incident response and analysis, like tracking Mean Time To Resolve (MTTR) and using Root Cause Analysis (RCA) methodology. In particular, we demonstrate how unreliable MTTR can be, and how RCA can lead to environments where people are less likely to admit mistakes and speak up about things that could lead to future incidents. We propose alternate metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents, and make their systems safer and more resilient.
Courtney Nash, Verica
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.
How We Survived (and Thrived) During The Pandemic and Helped Millions of Students Learn Remotely
Chinmay Tripathi, McGraw Hill
Everything was going well until COVID-19 emerged. A pandemic-induced traffic surge across all our platforms, which consist of hundreds of services, was something unexpected. Not only did we survive this traffic surge, but we thrived without incurring huge infrastructure costs or compromising security. Was it adding capacity or implementing core SRE principles that saved the day? Learn how we applied 'Upstream Thinking' to be proactive in dealing with crisis situations in the context of SRE.
Chinmay Tripathi, McGraw Hill
Chinmay Tripathi is Sr. Director, Engineering at McGraw Hill. He has 20 years of experience operating applications and services in production with a focus on reliability and security. He has a passion for automation, observability, and software-defined operation. He joined McGraw Hill in 2014 as a DevOps Engineer and successfully performed his first cloud migration. Currently, he is growing SRE and Security capabilities at McGraw Hill.
3:10 pm–3:40 pm
Break with Refreshments
In-person attendees: Visit the Sponsor Showcase in the Pacific Concourse, Lower Level.
3:40 pm–5:00 pm
The Pandemic and The Classroom—Enabling Education for Millions
Alper Selcuk, Group Engineering Manager, Microsoft Teams
COVID-19 impacted education from kindergarten to college in unforeseen ways almost overnight. The world had to adapt to an online-only mode of education. What happened behind the scenes to make that happen is unlike any typical growth story for a large product or service. This talk will describe the problems, solutions, and design patterns that Microsoft Teams leveraged to be successful at re-enabling education for millions of remote students. We will focus on education-specific scenarios ranging from seasonal tenant onboarding to optimizing for school hardware to changing customer expectations.
Alper Selcuk, Microsoft Teams
Alper Selcuk is the Group Engineering Manager for Microsoft Teams for Education. He leads the strategic and operational engineering activities within Microsoft to enable online and hybrid learning in schools. He partners with many teams within Microsoft to meet the ever-changing requirements of the education world.
Applied Science Fiction: Operating a Research-Led Product
Noah Kantrowitz, Geomagical Labs/IKEA
Scientists have increasingly become a part of many product teams, bringing many new skills to the table. This has been especially true in our augmented reality (AR) products, with our core computer vision pipelines being entirely researcher-led. With new opportunities come new challenges, and we have had to adapt many of our SRE and product engineering practices to ensure everyone feels productive and supported. This talk shares the lessons we have learned in building a healthy and happy product team in a researcher-first environment.
Noah Kantrowitz, Geomagical Labs/IKEA
Noah Kantrowitz is a web developer turned infrastructure automation enthusiast, and all around engineering rabble-rouser. By day he runs infrastructure at Geomagical Labs/IKEA and by night he makes candy and stickers. He is an active member of the DevOps community, and enjoys merge commits, cat pictures, and beards.
Taking the 737 to the Max
Nickolas Means, Sym
Ten years ago, Boeing faced a difficult choice. The Airbus A320neo was racking up orders faster than any plane in history because of its fuel efficiency improvements, and Boeing needed to compete. Should they design a new plane from scratch or just update the tried-and-true 737 with new engines?
The 737 MAX entered service seven years later as the result of that and hundreds of other choices along the way. Let's look at some of those choices in context to understand how the 737 MAX went so very wrong. We'll learn a thing or two along the way about making better decisions ourselves and as teams.
Nickolas Means, Sym
Nickolas Means is infatuated with disasters of all kinds and the amazing things we can learn from them. When he's not stuck in a Wikipedia binge loop reading about plane crashes, he leads the engineering team at Sym, helping create the building blocks engineering teams need to build effective security and compliance workflows. He works remotely from Austin, TX, and spends his spare time hanging out with his wife and kids, going for runs, and trying to brew the perfect cup of coffee.
5:00 pm–6:00 pm
Showcase Happy Hour
Sponsored by Lightstep
9:00 am–10:30 am
Securing Your Software Delivery Chain with Process Auditing
Shaun Mouton, Mastercard
Tasked with "securing the supply chain" for your employer due to a high profile CVE or breach? Overwhelmed by vendor pitches and trying to find some data to start tackling the problem? Curious about what's happening when an application is executing for some other reason? Want to know what you can discover about un-instrumented applications?
Let's go over how you can use strace and eBPF to discover what applications are doing. Then, we'll cover how to improve your security posture with that knowledge.
Shaun Mouton, Mastercard
Shaun Mouton has been using computers for well over 30 years and is beginning to wonder why nobody has made him stop. He is a Principal engineer at Mastercard on an enterprise automation frameworks team and pays the bills for some silly websites.
The Future of above-the-line Tooling
Richard I. Cook, MD, and John Allspaw, Adaptive Capacity Labs
Above-the-Line (AtL) tooling helps people working above-the-line of representation keep track of what is happening AtL. This sort of tooling is quite different from Below-the-Line stuff, e.g., top or htop. There are only a few AtL tools available but rapid growth seems likely. What should AtL tooling look like? What are the challenges to making it useful? What functions can AtL tooling support? This talk sets up a framework that you can use to assess and plan for future tooling and may even spur you to build some yourself.
Richard I. Cook, Adaptive Capacity Labs
Dr. Richard Cook is a Principal with Adaptive Capacity Labs. He is an internationally recognized expert on complex system failures, post-accident reactions to failure, and human performance at the sharp end of these systems. He has investigated a variety of problems in such diverse areas as urban mass transportation, semiconductor manufacturing, military software, and internet-based business systems. His publications include "Above the Line, Below the Line," "How Complex Systems Fail," "Gaps in the continuity of patient care and progress in patient safety," "Operating at the Sharp End: The complexity of human error," "Adapting to New Technology in the Operating Room," "A Tale of Two Stories: Contrasting Views of Patient Safety," and "Going Solid: A Model of System Dynamics and Consequences for Patient Safety."
John Allspaw, Adaptive Capacity Labs
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to The DevOps Handbook. His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.
Tracing Bare Metal with OpenTelemetry
Amy Tobey and Shelby Spees, Equinix
Equinix Metal runs two dozen software services deployed across 70+ Kubernetes clusters on six continents. The path from plucky startup to global cloud infrastructure player was rocky, to say the least. There were frequent incidents that lasted hours, with engineers poring over logs and dashboards and often walking away unsatisfied.
Principal Engineer Amy Tobey and SRE Shelby Spees share how the Equinix Metal Engineering team deployed OpenTelemetry tracing for the bare metal provisioning process. After initial efforts from the SRE team to open PRs adding instrumentation for each service, they gained momentum by creating on-ramps for engineers across the org to instrument their own code. The shared effort facilitated knowledge transfer for the globally-distributed, multidisciplinary team, empowering veterans and newbies alike to debug issues more quickly and easily.
Amy and Shelby close with examples of system issues they only identified because of tracing, plus a few major reliability wins.
Amy Tobey, Equinix
Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she spends her time building an innovative Site Reliability Engineering program at Equinix, where she is a principal engineer. When she's not working, she can be found with her nose in a book, watching anime with her son, making noise with electronics, or doing yoga poses in the sun.
Shelby Spees, Equinix
Shelby Spees is a site reliability engineer who's been making the tech industry more accessible and equitable through better engineering practices since 2015. Shelby joined Equinix Metal in 2021 to implement distributed tracing and service-level objectives for the bare-metal provisioning process. Her goal is to help to build a healthy cloud engineering org without firefighting and burnout. Shelby lives in Los Angeles, CA, where she enjoys drinking iced lattes and making up songs about her rescue pitbull, Nova.
Are We There Yet? Metrics-Driven Prioritization for Your Reliability Roadmap
Christina Tan and Mindy Stevenson, Blameless
Reliability is a journey for any organization and unfortunately, we can't just skip to the end. There is what feels like an endless amount of work that could be done to help improve reliability, and we often find ourselves in situations with more things to do than there is time to do them. It can be overwhelming to figure out where to start, what to do next, and how to get support (and funding) from leadership for it. Prioritization and tradeoffs have to happen whether you are building out a new SRE team, diving in when everything is on fire, or even leading an existing SRE team that is performing well. The goal of this talk is to give SRE leaders a metric-driven framework to prioritize what will make the most impact based on real data and highlight the successes throughout the journey.
Christina Tan, Blameless
Christina is on the strategy team at Blameless, architecting interpersonal dynamics for conflict resolution, high-performance teams, and executive alignment. Prior to Blameless, Christina coached TED speakers for public speaking and startup founders for fundraising. Her clients have collectively raised over $240M. In her spare time, she runs the mindfulness community Serenity Lounge.
Mindy Stevenson, Blameless
Mindy is Director of Engineering at Blameless, responsible for Platform, Infrastructure, and Site Reliability Engineering. Before joining Blameless, Mindy had 15 years of experience in software engineering at various companies including work with startups, non-profits, government, and large corporations. She enjoys building an engineering culture with a focus on continuous improvement and team empowerment.
10:30 am–11:00 am
Break with Refreshments
In-person attendees: Visit the Sponsor Showcase in the Pacific Concourse, Lower Level.
11:00 am–12:30 pm
SRE stands for...Skydiving Resilience Engineer
Victor Lei, USPA A-96761
Skydiving, especially more than once, can be viewed as an exercise in risk management (and also a lot of fun!)
In this talk, we take a look at the safety procedures practiced by the skydiving community that relate to the guiding principles of SRE. We discuss the evolution of altimeters and software metric systems, highlight the dangers of hero mentality and overconfidence, as well as draw parallels between gamedays and incident management with pre-jump preparations and free fall plan execution. This talk will share concrete examples and parables that will give attendees a visceral understanding of soft concepts such as the "bus factor", blameless postmortems, the Goldilocks approach to tribal knowledge, and the importance of developing emotional resilience.
tl;dr: we will explore core SRE concepts through skydiving goggles; come jump with me!
Victor Lei, USPA A-96761
After a 3-year tenure as an SRE for the internal telemetry team at Bloomberg, I helped build an internal self-service chaos engineering platform as a resilience engineer.
I think curiosity is a core value of SREs and action to investigate such curiosity a natural impulse. I found myself wondering what safety systems the skydiving community must have developed and so, naturally, acquired my "A license in a week" at Skydive Spaceland.
I was trained as an electrical engineer ’17 at Cooper Union, so please talk to me about how stationarity relates to time series analysis. Connect with my personal brand "Skydiving Resilience Engineer" or find me as @Aergonus.
Building a Path to the Future: Mentoring New SREs
Chastity Blackwell
Mentorship has long been an important part of bringing people into the operations and reliability space, especially people early in their careers, due to the relative lack of formal training available in these fields. However, being a good mentor is not always easy—and the pandemic, with many of us moving to remote work for the long term, has made it even more difficult. This talk will discuss strategies you can use to be a more effective mentor, even in a remote work setting.
Chastity Blackwell
Chastity Blackwell has been working in operations for more than twenty years, in everything from universities and civic tech to startups and large tech companies, and will have this whole computer thing down any day now. She is a strong believer in the character building qualities of Midwest winters and deep dish pizza.
eBPF: The Next Power Tool of SREs
Michael Kehoe, Confluent
Created in 2013, eBPF is a newer Linux kernel technology that will revolutionize SREs' ability to observe, serve, and protect the infrastructure we're responsible for. While eBPF has the potential to completely change the way we do our daily work, only a minority of companies are taking advantage of its power. This session will be an introduction to eBPF that will allow the attendee to see its uses and capabilities firsthand, and see how to get started using it.
Michael Kehoe, Confluent
Michael is an author, speaker, and Sr. Staff Security Engineer currently architecting Confluent’s cloud security. Previously, Michael worked at LinkedIn, architecting the company’s move to Microsoft Azure. Before graduating with a Bachelor of Electrical Engineering from the University of Queensland (Australia), Michael interned at NASA Ames Research Center building small satellites known as Phonesats. Michael has spoken at numerous events all over the world and in 2018 was a co-author of the book “Reducing MTTD for High Severity Incidents” and the recently released “Cloud Native Architecture with Azure”.
12:30 pm–2:00 pm
Luncheon
Sponsored by Cisco
2:00 pm–3:30 pm
How the Metrics Backend Works at Datadog
Adam Mckaig and Tahia Khan, Datadog
Datadog is a popular cloud monitoring service which operates at scale in all three major cloud providers, ingesting 10s of GB/s of points across many billions of timeseries into PiBs of hot and cold storage. Naturally, reliability is paramount.
In this talk, we'll show how our very large distributed system works today, and how it grew from a very small not-distributed system. We'll share the most interesting scaling and reliability challenges we faced along the way, how we solved them (for now), and some important lessons and strategies which emerged. We'll also share a couple of bonus problems which are still very much unsolved today, and what we're planning next.
Adam Mckaig, Datadog
Adam Mckaig is a Staff Engineer at Datadog in New York, where he runs Metrics Reliability. Previously he has built things at Google, the New York Times, Bloomberg, and UNICEF. His favorite sound is a pager not going off.
Tahia Khan, Datadog
Tahia Khan is a Toronto-based SRE at Datadog. Before settling on SRE, she worked on everything but frontend at a bunch of startups, Mozilla, and Amazon. Outside of work, Tahia draws bad art.
Disney Global SRE—Creating Digital Magic
Jason Cox, Brian Scott, and Alexi Varanko, The Walt Disney Company
Disney is one of the world’s largest media companies and home to some of the most respected and beloved brands around the globe. Embracing the latest technology is an important strategic focus at Disney, allowing guests to better connect with Disney and allowing Disney to better connect with guests in innovative and delightful ways.
We will tell you a story about a century-old organization that has scaled its SRE practice to ignite digital magic across the globe. This team of SRE Jedi Knights is on a mission to foster curiosity, communities of practice, and technology awesomeness while venturing where no SRE has gone before.
In this talk, we will deliver epic stories of successes, setbacks, and failures while pushing large-scale platforms to their limit and delivering the best in-seat, digital experiences, products, and content to our guests and subscribers across the globe. We will also showcase some of the technology and automation we have built.
Jason Cox, The Walt Disney Company
Jason Cox is the Director of SRE at The Walt Disney Company. Jason is a champion of SRE practices, collaboration, curiosity, automation, agile and lean methodologies. He spent several years using technology to design and build public infrastructure. He later co-founded an ISP and web hosting startup, managing datacenters, and business operations until it was sold, shortly after joining Disney. He has had the privilege of speaking at several tech conferences and enjoys writing on leadership and DevOps topics.
Brian Scott, The Walt Disney Company
Brian Scott is Disney’s first Technology Evangelist & Engineering Advocate, helping teams with new technology, building best practices in the Cloud & Automation. He majored in computer science, helped build, and has managed many SRE Teams within The Walt Disney Company. He is an avid evangelist of Go & Automation.
Alexi Varanko, The Walt Disney Company
Alexi Varanko is the Vice President of Cloud, Infrastructure, DB/Systems Reliability Engineering, for The Walt Disney Company. He has had a long career of leading technology transformation efforts, and continues to innovate new ways for Disney to use technology to engage guests, support cast members, and manage business systems across the Company (Content Creation, Media Distribution, Parks & Resorts, and Interactive Experiences), including the implementation and adoption of Public and Private Cloud services as a company standard.
Automated Operating System and Environment Certification at LinkedIn—Reducing Toil and Increasing Velocity
Adam Debus, LinkedIn
In 2019 and early 2020 LinkedIn encountered some issues with the Linux kernel that necessitated in-depth troubleshooting and a rapid process to validate new kernels against our tooling and application stack. Around the same time, LinkedIn started the process of migrating to Azure. These two things, together, revealed some gaps in the existing operating system and tooling certification process that needed to be addressed to better drive LinkedIn's continued growth.
The initial focus of the project was targeted toward kernel certification and, through many discussions with stakeholders, rapidly expanded to encompass a wide range of custom tooling and environments ranging from our physical data centers to Azure.
This is an ongoing project which has seen its initial release. In this talk, I intend to discuss early direction, mistakes made, and lessons learned, and share some early results from the enhanced certification process.
Adam Debus, LinkedIn
Adam Debus has been a Staff Site Reliability Engineer at LinkedIn since 2019. Following over a decade of experience leading systems engineers in the eDiscovery and Compliance fields, he now leads the Fleet Compliance initiative at LinkedIn. Currently hailing from San Jose, California with his husband, he's frequently found repeatedly removing his cats from his keyboard in the middle of meetings.
Triaging Real-time Security Threats with eBPF-powered Observability
Daniel Kim and Robert Prast, New Relic
Time is of the essence for Incident Commanders when they are working to resolve a security threat. Unfortunately, valuable time can be wasted manually aggregating and querying logs in different sources and formats as data becomes increasingly siloed in large, complex systems. Without proper observability, security teams are handicapped, not being able to fully contextualize the impact of the security threat.
In this talk, you will learn how observability-first principles can be adopted to triage ongoing security threats leveraging Pixie, a CNCF sandbox observability project. Pixie uses eBPF to leverage the Linux Kernel to extract observability data into a single source of truth, providing end-to-end traces and performance insights. With Pixie, engineers no longer have to hunt for data across multiple layers of the OSI model from raw DNS queries down to process stats. Being able to analyze data flow from high-level user space down to low-level system calls across an entire environment can help pinpoint the root cause of an attack.
Daniel Kim, New Relic
Daniel Kim (He/Him) is a Senior Developer Relations Engineer at New Relic and the founder of Bit Project, a 501(c)(3) nonprofit dedicated to making tech accessible to underserved communities. He wants to inspire generations of students in tech to be the best they can be through inclusive, accessible developer education. He is passionate about diversity & inclusion in tech, good food, and dad jokes.
Robert Prast, New Relic
Robert Prast is on the Application Security team at New Relic. As an AppSec engineer, he works with developers to write secure code and review New Relic's security posture across all products. He is a huge security nerd who tried to hack video games as a kid to make sure he won against his brother.
3:30 pm–4:00 pm
Break with Refreshments
In-person attendees: Visit the Sponsor Showcase in the Pacific Concourse, Lower Level.
4:00 pm–5:30 pm
Exemplars in Practice: Finding the Needle in Your Observability Haystack
Gibbs Cullen, Chronosphere
A common desire among SREs is to have the ability to quickly jump between different data types. For example, being able to jump from a dashboard showing a metric to the distributed traces that the metric represents. Vendors have come up with various proprietary approaches to meet this need with varying degrees of success, but open-source solutions have started to gravitate towards the idea of exemplars to provide a common solution to this problem.
In this talk, Gibbs will give an introduction to exemplars and the open-source tracing landscape (e.g. OpenTelemetry), and will discuss how exemplars give SREs a straightforward path to jump from a metric on a dashboard to a trace related to that metric. She will explore some limitations around exemplars, and present an alternative method for linking metrics and distributed traces—ultimately making "finding the needle in the observability haystack" easier and more efficient for SREs.
Gibbs Cullen, Chronosphere
Gibbs Cullen is a developer advocate at Chronosphere who helps the community understand the concepts behind Prometheus and the use of M3 as long-term storage, and who shares best practices for alerting, monitoring, and configuring deployments of Prometheus and M3 in Kubernetes. Prior to Chronosphere she was a product manager on the AWS Data Lab team.
Dark Sky Camping: Reducing Alert Pollution with Modern Observability Practices
Kristin Smith, Campspot
Over the course of the pandemic, several factors converged to create an amazing problem at Campspot: more traffic! Increased load stressed our applications and unpleasant customer-facing incidents stressed our engineering teams. In response, we doubled down on existing tools and processes: increased alerting, beefed up on-call rotations, more dashboards, and more high-urgency Slack channels. We put spotlights on so many areas of the system it became hard to see where issues were.
Recognizing the chaos, we pivoted in Spring 2021 to unify teams around a single observability tool and implemented Service Level Objectives. The result: fewer alerts, faster troubleshooting, and clearer indicators of when to focus on performance vs. features. Come hear how we cleared out the alert pollution so we could see the constellations we were actually searching for all along. If you're building a case for the move to observability, this talk is for you.
Kristin Smith, Campspot
Kristin Smith (she/her) serves as a DevOps Services Team Lead for a distributed team of cloud and data engineers at Campspot. She transitioned into the technical industry seven years ago, bringing with her a background in history and archival sciences. Along the way she has worked in technical organizations ranging from three people to over 700, in both the private and public spheres. Her professional interests include infrastructure provisioning, monitoring and traceability in distributed systems, and writing documentation that people actually read.
Ten-year Journey to 10,000 Production Machines
Rob Hirschfeld, RackN
What does it take to scale provisioning and automation management to 10,000 machines? We'll cover a decade of improvements, starting from humble wrappers around Chef with a 50-machine limit, through API optimizations, to multi-threaded maintenance systems and shared indexes that handle 10,000 concurrent operations.
Rob Hirschfeld, RackN
Do you keep wondering why building automation is so hard and even harder to share as a community? That really bugs Rob too! He has been creating software to collaboratively automate infrastructure for over 20 years. His latest startup, RackN, focuses on providing Distributed IaC automation and abstraction layers for provisioning Cloud, Edge and Enterprise data centers. He is also building a forward-looking operator community at the2030.cloud with weekly DevOps and future hallway-type discussions.
Beyond Distributed Tracing
Kyusoon Lee, Google
In the era of microservice architecture, distributed tracing solutions offer visibility across services. However, this visibility is at the level of individual requests, failing to deliver any sort of aggregated observability to average users.
In this talk, we introduce a novel yet simple method (“CUI attribution”) of creating an aggregated end-to-end view at the level of what we call “Critical User Interaction (CUI)” (e.g. “play a video”, or “purchase an item”) using the baggage mechanism from Google’s Census.
The aggregated end-to-end view is intuitive for average users to grok, reducing time to root-cause failures and outages. The method is applicable to many other areas, such as dependency analysis and fault-tolerance testing in production. Any open source projects or enterprise distributed tracing solutions that support a similar baggage mechanism can easily adopt our method with little effort to offer richer insights to their users.
Kyusoon Lee, Google
Kyusoon Lee is a Site Reliability Engineer at Google, whose passion lies in acquiring visibility from internal systems and applying it to improve reliability for the external users.
Since 2019, he has been leading CUI attribution efforts with primary focus on the application in automated root-causing, impact assessment, and outage prevention. He currently drives a few long-term cross-org technical roadmaps at Google based on CUI attribution, while continuing to further explore the value of CUI attribution via experimentation.
He would love to exchange experiences and insights with anyone. Feel free to catch him during the conference or email him at qsoonlee@google.com.
History-based Latency Prober Tuning
Jeff Borwey, Google
Probers are an indispensable tool in monitoring production. When configured correctly, they offer high-fidelity insight into a system's performance and can provide fast detection and alerting for regressions. Performance, however, is not static, and environments/deployments can behave radically differently from one another. This talk will present some simple techniques for tuning latency prober alerts based on historical data. These techniques can increase sensitivity and reduce manual configuration toil while limiting false positives.
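One simple way to derive an alert threshold from history, offered here only as an illustrative sketch and not as the technique presented in the talk, is to take a high percentile of past probe latencies for each environment and add some headroom. The function names and the default values (`percentile`, `headroom`, `floor_ms`, `min_breaches`) are assumptions chosen for the example.

```python
import statistics


def tuned_threshold(history_ms, percentile=99, headroom=1.2, floor_ms=50.0):
    """Derive a latency alert threshold from historical probe results.

    Takes the given percentile of the history, multiplies it by a headroom
    factor, and never goes below a fixed floor.
    """
    if len(history_ms) < 2:
        raise ValueError("need at least two historical samples")
    # n=100 yields 99 cut points; index percentile-1 is that percentile.
    cuts = statistics.quantiles(history_ms, n=100, method="inclusive")
    return max(floor_ms, cuts[percentile - 1] * headroom)


def should_alert(recent_ms, threshold_ms, min_breaches=3):
    """Fire only if several recent probes exceed the tuned threshold."""
    breaches = sum(1 for latency in recent_ms if latency > threshold_ms)
    return breaches >= min_breaches
```

Computing the threshold separately per environment lets a slow deployment and a fast one each alert relative to their own history, and requiring several breaches before firing is one way to limit false positives from a single slow probe.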
Jeff Borwey, Google
Jeff has been an SRE at Google for four years. Initially a BigQuery SRE, he now focuses on improving understanding and modeling of production performance more generally.
5:30 pm–7:00 pm
Conference Reception
Sponsored by FireHydrant
9:00 am–10:20 am
Using Serverless Functions for Real-time Observability
Liz Fong-Jones and Jessica Kerr, honeycomb.io
In this talk, you will learn how a sub-second latency query engine inspired by Facebook's Scuba was extended from running in RAM only, to querying the most recent data that could still fit on local SSD, to querying months of data at a time using cloud storage and serverless functions. We'll describe the pitfalls of managing lambdas at scale, including impatience, maximum concurrency, runtime and architecture configuration/experimentation, and the price/performance of renting 20,000 parallel workers.
Liz Fong-Jones, Honeycomb
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
Jessica Kerr, Honeycomb
Jess is a symmathecist, in the medium of code. She sees development teams as learning systems made of people and running software. If we make that software teach us what's happening, it's a better teammate. And if this process makes us into systems thinkers, we can be better persons in the world.
Improving How We Observe Our Observability Data: Techniques for SREs
Dan Shoop, iWiring
Time-series charts have been around for hundreds of years, yet were originally created with a narrative intent often missed by many engineers today. This session will examine some historic time-series examples, explore what we can learn as SREs, and look at concrete charting techniques we can use to improve the cognition of our SLO narratives, engineering reports and incident retrospectives using multi-variate relationships, small multiples and sparklines, while avoiding some common pitfalls we often find in engineering presentations.
Dan Shoop, iWiring
Dan Shoop is a Systems Reliability Engineering Manager with over 30 years of experience building distributed systems and infrastructure that are performant, highly-available, scalable and fault tolerant. While working at HBO solving Internet problems at Game of Thrones scale when that Google SRE book came out, he and his team realized they had actually been SREs most of their engineering careers, embraced the new paradigm to restructure their production operations teams, and drove their focus heavily on telemetry & observability as a critical component for understanding, measuring, monitoring and alerting on key indicators of systems health as related to its service impacts and architectural improvements. He went on to lead SRE at Venmo and has also worked at Sesame Street, United States Technical Services and operated his own consulting company. Having taken Edward Tufte's course on Information Visualization three times, he both recognizes and enjoys sharing techniques for improving the cognition and presentation of our observability data in terms of SLI/SLO narratives, and enhancing our engineering retrospectives and reports. He lives in New York City and enjoys good food, mountaineering, photography and UAVs.
10:20 am–10:50 am
Break with Refreshments
In-person attendees: Visit the Sponsor Showcase in the Pacific Concourse, Lower Level.
10:50 am–12:10 pm
Principled Performance Analytics
Narayan Desai and Brent Bryan, Google
This talk presents an exciting analytical method that is successfully delivering high-fidelity insights useful in analyzing and diagnosing distributed systems. It has been used in production, with good results, in a variety of complex services at scale (up to 1.4T events/day) where traditional methods have failed. We will sketch out the problem domain in detail, present the statistical methods used, as well as the intuition behind the approach.
Attendees will gain an alternative lens through which they can analyze performance, as well as an understanding of pitfalls.
Narayan Desai, Google
Narayan is an SRE at Google Cloud, where he is responsible for the reliability of GCP Data Analytics products. He has a checkered past, having worked on scheduling, configuration management, supercomputers, and metagenomics—always in the context of production systems.
Brent Bryan, Google
Brent is an SRE at Google Cloud focused on developing statistical and ML approaches to monitor service reliability. Prior to GCP SRE, Brent worked on ads optimization, serving, and measurement, as well as founding Google Domains.
12:10 pm–1:40 pm
Luncheon
1:40 pm–3:00 pm
Modeling Alert Quality
Moshe Zadka
What are good alerts? What are bad ones?
The difference is important for reliability. But how do you measure it? What kind of trade-offs are possible?
A model of alert quality will be presented, including parameters like cost and accuracy.
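The abstract does not spell out the model, so the following is only one possible way to make "cost and accuracy" concrete, not the speaker's model: accuracy is expressed as the standard precision and recall of an alert, and cost as a weighted sum of pages sent and incidents missed. The class name, function name, and cost weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Outcome counts for one alert over some review period."""
    true_positives: int   # alert fired and there was a real problem
    false_positives: int  # alert fired but nothing was wrong
    false_negatives: int  # real problem, alert stayed silent

    @property
    def precision(self):
        """Fraction of pages that pointed at a real problem."""
        fired = self.true_positives + self.false_positives
        return self.true_positives / fired if fired else 0.0

    @property
    def recall(self):
        """Fraction of real problems that produced a page."""
        real = self.true_positives + self.false_negatives
        return self.true_positives / real if real else 0.0


def total_cost(stats, cost_per_page, cost_per_miss):
    """Cost of every page sent plus the cost of every missed problem."""
    pages = stats.true_positives + stats.false_positives
    return pages * cost_per_page + stats.false_negatives * cost_per_miss
```

Under this framing, loosening a threshold usually trades a lower miss count for more pages, and comparing `total_cost` before and after a change is one way to reason about that trade-off.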
Moshe Zadka
Moshe has been doing SRE since before the word existed. From build pipeline to monitoring, and from 5 engineers to 20,000, Moshe has seen SRE from many different perspectives. They are the author of "DevOps in Python."
Emergent Organizational Failure: Five Disconnections
Mattie Toia, Shopify Inc.
A look at five successive assumptions or mistakes that leaders can easily make when trying to guide teams building reliable systems. This talk focuses on the people and organizational issues that can contribute to well meaning teams building software that is insufficiently reliable. This knowledge is helpful for both technical and managerial leaders guiding their organizations, as well as folks stretching to think more broadly about reliability at any level in their team.
Mattie Toia, Shopify Inc.
Mx. Toia is a leader in the reliability and infrastructure space with over a decade and a half of experience. They are currently a Director of Production Engineering at Shopify, where they lead the Production Platform organization in delivering reliable compute and network infrastructure to developers across the company. Prior to this, they were an SRE Director at Google leading several different infrastructure and GCP SRE organizations over the years in areas such as Observability, Developer Infrastructure, Enterprise IT, and Storage. They hold degrees in Electrical Engineering and Engineering Management.
Mx. Toia resides in New York City with their family. They enjoy experiencing theater and live performances, exploring neighborhood restaurants, and spending long afternoons in Central Park when the weather is nice.
3:00 pm–3:30 pm
Break with Refreshments
3:30 pm–4:50 pm
DO, RE, Me: Measuring the Effectiveness of Site Reliability Engineering
Dave Stanke, Google
The DevOps Research and Assessment group, or DORA, has conducted broad research on engineering teams’ use of DevOps for nearly a decade. Meanwhile, Site Reliability Engineering (SRE) has emerged as a methodology with similar values and goals to DevOps. How do these movements compare? In 2021, for the first time, DORA studied the use of SRE across technology teams, to evaluate its adoption and effectiveness. We found that SRE practices are widespread, with a majority of teams surveyed employing these techniques to some extent. We also found that SRE works: higher adoption of SRE practices predicts better results across the range of DevOps success metrics. In this talk, we’ll explore the relationship between DevOps and SRE and how even elite software delivery teams can benefit through the continuous modernization of technical operations.
Dave Stanke, Google
Dave Stanke is a Developer Advocate for Google Cloud Platform, specializing in DevOps, Site Reliability Engineering (SRE), and other flavors of technical relationship therapy. He loves chatting with practitioners: listening to stories, telling stories, sharing a healthy cry. Prior to Google, he was the CTO of OvationTix/TheaterMania, a SaaS startup in the performing arts industry, where he specialized in feeding memory to Java servers. He chose on purpose to live in New Jersey, where he enjoys baking, indie rock, and fatherhood.
The Scientific Method for Resilience
Christina Yakomin, Vanguard
Do you remember the Scientific Method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, you'll be re-introduced to the Scientific Method, and learn how Vanguard's software engineers and IT architects draw inspiration from it in their resilience testing efforts. They leverage a "Failure Modes and Effects Analysis" technique, in which engineers ask themselves questions about the failure modes of various technical components and develop hypotheses based on their expectations of how the system would behave. They use these conjectures as inputs into experimentation, and select chaos experiments accordingly to validate (or disprove!) their hypotheses.
Christina Yakomin, Vanguard
Christina is a Senior Site Reliability Engineering Specialist in Vanguard's Chief Technology Office. She has worked at the company's Malvern, PA headquarters since graduating from Villanova University with an undergraduate degree in Computer Science. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, with a specialization in Site Reliability Engineering. She has earned several Amazon Web Services certifications, including the Solutions Architect - Professional. Christina has also worked closely with the Women's Initiative for Leadership Success at Vanguard, both internally at the company and externally in the local community, to further the career advancement of women and girls, in particular within the tech industry. In her spare time (and when it is safe to do so!), Christina is passionate about traveling; she has visited over 20 different countries and 25 U.S. states so far!
A Fresh Look at Operational Debt
David Owczarek
This session will be an inspirational, structured talk where I converge two perspectives on operational debt into a single model based on process gaps and risk. I define operational debt as, “work required to fix process gaps that present a risk to business operations.” This talk is intended to inspire senior SREs and SRE managers to think more holistically about process problems and related automation opportunities. Weaving risk into the approach provides a mechanism to set priority on a risk/reward basis. Creating a backlog of operational debt is also helpful when collaborating with other engineering managers. We will also re-examine Fowler’s quadrants in the operational debt context, compare and contrast operational debt with other forms of debt (financial and technical), and examine how much debt is also toil.
David Owczarek
Dave Owczarek (he/him), recently ex-Adobe, was the senior reliability manager for Adobe's Document Cloud—one of the largest digital signature platforms on the planet. His latest mission is to use the arts—speaking, writing, and illustrating—to bring clarity to confusing concepts and inspire innovation in how we build and operate services. Dave has operated lots of services at scale over a 30+ year career, from the first versions of Monster board back in the 90s to some of the largest web hosting providers. When not focused on the SRE experience, you will probably find him playing a guitar somewhere, most likely a Fender.
4:50 pm–4:55 pm
Closing Remarks
Program Co-chairs: Mohit Suley, Microsoft, and Heidi Waterhouse, LaunchDarkly