SREcon20 Americas Conference Program

Due to the evolving Coronavirus/COVID-19 situation, SREcon20 Americas West has been rescheduled to June 2–4, 2020.

Note: The SREcon20 Americas West conference program is subject to change due to updated conference dates.
Please check back here soon for the latest program.

Tuesday, March 24

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:30 am

Opening Plenary Session

Program Co-Chairs: Tammy Butow, Gremlin, and Emil Stolarsky, Incident Labs

Testing in Production: The Hard Parts

Tuesday, 9:15 am–10:05 am

Cindy Sridharan

Testing in Production has numerous benefits when it comes to building confidence in the correctness, reliability, and resilience of large scale distributed systems. However, it's also fraught with many perils and pitfalls. Curtailing the blast radius of any potential impact of a test gone wrong is a prerequisite for being able to test effectively in production. This talk will explore architectural patterns that will allow for minimizing the blast radius.

Telling Stories About Incidents

Tuesday, 10:05 am–10:30 am

Lorin Hochstein, Netflix

It isn't hard to get SREs to tell stories about incidents; you'll likely overhear some in the hallways of this conference. Yet, when we do our formal incident write-ups for our organizations, we leave out many of the details that make the informal stories we tell both compelling and useful.

In this talk, I'll discuss the benefits of using a narrative description over filling out a traditional incident template. I'll talk about my experiences at Netflix as an incident investigator, documenting incidents as stories that unfold over time.

Lorin Hochstein, Netflix

Lorin Hochstein is a Sr. Software Engineer on the Delivery Orchestration Team at Netflix. Previously at Netflix, he was on the CORE (Critical Operations and Reliability Engineering) team and the Resilience Engineering (née Chaos) team.

Before Netflix, Lorin was a Sr. Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.

10:30 am–11:00 am

Break with Refreshments

11:00 am–12:30 pm

Track 1

If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident Response Using ICS

Tuesday, 11:00 am–11:50 am

Thai Wood, Resilience Roundup

I'll outline what my experience managing physical emergencies in an ambulance and ER taught me about responding to incidents: starting with a small team, scaling the response up, and, perhaps more importantly, scaling it back down, all based on the Incident Command System (ICS) framework. I'll cover approaches that apply whether or not you already have an incident response plan.

Thai Wood, Resilience Roundup

Thai helps teams build more resilient systems and improve their ability to effectively respond to incidents. A former EMT, he applies his experience managing emergency situations to the software industry. He writes about resilience engineering each week at ResilienceRoundup.com.

Everything I Need to Know about Incident Response I Learned in EMT School

Tuesday, 11:50 am–12:30 pm

Matt Linton, Google; Elle Armageddon, Independent

In the last decade, the tech industry has seen a rise in the use of the term "Incident Commander" for the person who is in charge of fixing a problem. The use of this term stems from companies that have begun adopting incident management terms and techniques from the Incident Command System (ICS).

ICS was originally developed by CalFire and matured over ~40 years in the emergency services community to address the three most basic and frequent challenges in crisis management: Command, Control, and Communication.

This talk will explore the fundamentals behind ICS and why it developed the way it has, how that directly relates to system security and reliability work, and what applicable skills the tech industry can still learn from Fire and EMS professionals.

We will provide a bit of an ICS history and then some fundamental advice on incident response practices that teams can immediately put to use.

Track 2

Effective Regression Testing Framework Built with Open Source Tools

Tuesday, 11:00 am–11:50 am

Jay Qiu and Chi Chin Chen, Bloomberg

Core Principles

As agile development methodology becomes more popular, releasing new versions of components into distributed systems without introducing new faults or degrading system performance is a big challenge. In this presentation, we discuss an effective testing framework we built using record-and-replay and log analysis techniques for regression tests and capacity tests on large-scale distributed systems. A stack of open source tools is used to build this testing framework. By replaying the incoming events and comparing system responses, we can verify the correctness of the systems. System performance metrics are collected through log analysis during replay, both to check performance criteria and to compare performance with the previous version of the systems. This testing framework is also an example of effectively using open source tools to build a practical application.
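As a rough illustration of the record-and-replay idea in this abstract, the sketch below replays a list of recorded events against a baseline handler and a candidate handler and collects any responses that differ. The event format, the two pricing functions, and the `replay_and_compare` helper are hypothetical and exist only for illustration; the abstract does not describe the presenters' actual framework or tool stack.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Handler = Callable[[Event], Any]


def replay_and_compare(
    events: List[Event], baseline: Handler, candidate: Handler
) -> List[Tuple[Event, Any, Any]]:
    """Replay each recorded event against both versions and collect differences."""
    mismatches = []
    for event in events:
        expected = baseline(event)   # response from the previous version
        actual = candidate(event)    # response from the new version
        if expected != actual:
            mismatches.append((event, expected, actual))
    return mismatches


def price_v1(event: Event) -> float:
    """Hypothetical previous version of a component."""
    return round(event["qty"] * event["unit_price"], 2)


def price_v2(event: Event) -> float:
    """Hypothetical new version that changes behavior for large orders."""
    total = event["qty"] * event["unit_price"]
    if event["qty"] >= 100:
        total *= 0.9
    return round(total, 2)


# Events that would normally come from a recording of production traffic.
recorded_events = [
    {"qty": 1, "unit_price": 9.99},
    {"qty": 100, "unit_price": 2.0},
]

mismatches = replay_and_compare(recorded_events, price_v1, price_v2)
for event, expected, actual in mismatches:
    print(f"mismatch for {event}: baseline={expected} candidate={actual}")
```

Whether a reported difference is a regression or an intended change still has to be decided by a person or by an explicit allow-list; the sketch only surfaces the differences.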

Jay Qiu, Bloomberg

Jay Qiu is a senior SRE at Bloomberg. He has worked in the testing and monitoring fields for many years and has been involved in architectural design and implementation for equity, fixed income, and commodity trading and valuation systems at major financial companies for over 20 years.

Chi Chin Chen, Bloomberg

Chi Chin Chen is a senior SRE at Bloomberg. He has been a software engineering consultant for financial companies in NYC for over three decades, specializing in application system design and implementation using the latest technologies.

Software-based Mitigations for Hardware Vulnerabilities

Tuesday, 11:50 am–12:30 pm

Antonio Gomez, Intel

Core Principles

Recently disclosed side-channel methods that target internal structures and hardware abstractions of most modern CPUs have received significant attention and have increased the awareness of different actors about hardware vulnerabilities. This presentation introduces the basic concepts behind these methods, presents a threat model, and discusses some of the software mitigations that have been implemented to help protect production hardware against these methods. These mitigations give system administrators a number of options to configure, from boot-time options to real-time changes. This presentation focuses on some ongoing efforts to improve process isolation in Linux. The talk describes the efforts to improve the reliability of the Linux kernel on a large variety of platforms. We consider how the new functionality is validated, the variety of workloads used to test the changes, and life testing or studying potential undesired side effects.

Antonio Gomez, Intel

Antonio is a software engineer at Intel, where he focuses on security software mitigations. He holds a Ph.D. in computer science and has worked in different roles in the areas of performance, computer architecture, parallel programming, and security for the last 15 years.

Track 3

Blameless Postmortems: How to Actually Do Them

Tuesday, 11:00 am–12:30 pm

George Miranda and Nasim Yazdani, PagerDuty

Everyone talks about blameless postmortems and their value. But human nature is to place blame. How can you and your teams get around that natural instinct? How do you actually do blameless in a step-by-step replicable way? This workshop gives attendees actionable steps they can follow with their own teams to start building a culture of blamelessness today.

George Miranda, PagerDuty

George Miranda is a Community Advocate at PagerDuty, an infrastructure engineer, and a former EMT & First Responder. He is passionate about the systems we use to effectively manage crisis situations and how we learn to continuously improve our practices. He is the author of the O'Reilly eBook, "The Service Mesh: Resilient Service-to-Service Communication for Cloud-Native Applications". Prior to working for software vendors like Buoyant and Chef Software, he worked as an on-call engineer for over 15 years. He enjoys his time roaming the world as a Digital Nomad (with a permanent home in the Pacific Northwest), small-batch artisanal whiskey, and writing third-person biographies no one reads.

Nasim Yazdani, PagerDuty

Nasim Yazdani is a Program Manager for the Customer Education team at PagerDuty, where she designs and delivers curriculum aimed at helping people on-call make the best out of challenging experiences. Prior to starting the Customer Education team, she worked with customers to help them reach success at a handful of other software companies in the Bay Area. She lives in San Francisco, where she enjoys being a coffee connoisseur/crazy cat lady and looking at homes she can't afford.

Track 4

Exploring Sensemaking and Cognitive Load through a Collaborative Card Game: A Hanabi Adventure!

Tuesday, 11:00 am–12:30 pm

Jaime Woo, Incident Labs; Justin Li, Shopify

SREs have to solve problems every single day, sometimes under very stressful conditions. While some of these problems can be solved individually, most are solved in a group setting, requiring collaboration. Research has shown that solving complex problems in groups, while necessary, increases the cognitive load on individuals, as they have to not only tackle the problem at hand but also navigate interpersonal interactions. Problem solving is done often, but its mechanics are usually not deeply understood. It is invaluable, then, to improve awareness around how problem solving works, which tools and techniques are effective, and how that understanding can lead to better problem solving and, ultimately, better solutions.

Card games are a bounded, approachable system of problem-solving, and their use in illustrating human behaviour and decision-making is well-established. Hanabi is a collaborative problem-solving card game that is easily understandable but rich in complex interactions.

Jaime Woo, Incident Labs

Jaime began his career as a molecular biologist before following his passion for writing. He is an award-nominated writer, focusing his work on the locus between culture and technology, with recent works in the Advocate, the Globe and Mail, and StarTrek.com. He is co-founder of Incident Labs and is co-writing a book on incident response and post-incident reviews. He is always down for dumplings.

Justin Li, Shopify

It was rumored that Justin's first words as a baby were "production excellence." For Justin, the phrase could easily apply to both distributed systems and music: he's just as happy to talk about performance sharding as he is about proto-future-funk. At Shopify, he is a Staff Production Engineer. If you were able to buy a lip kit during a massive flash sale, or to ride a go kart in headquarters, he helped make that happen.

12:30 pm–2:00 pm

Luncheon

2:00 pm–3:30 pm

Track 1

Off the Beaten Path: Moving Observability Focus from Your Service to Your Customer

Tuesday, 2:00 pm–2:50 pm

Mohit Suley, Microsoft

Observability systems are usually designed to answer two broad questions, "Is my service doing okay?" and "Is my business doing okay?" There is a third perspective that often doesn't get enough attention (unless it's clearly linked to the first two), "Is my customer experience okay?" It is fair to say that there's never a clear metric for this question.

This talk explains our motivation for stepping out of our metrics-centered "comfort zone" and the journey that ensued: developing a habit of engaging face-to-face with some of our customers, figuring out ways to experience what they did, open-sourcing a high-scale tool to capture this data, setting up broader direct-to-team channels of communication from customers, and re-thinking performance metrics.

If you are curious about why being a customer advocate makes you a better SRE, this talk is for you.

Mohit Suley, Microsoft

Mohit is an engineer on Bing's Live Site Engineering team. Designing systems to proactively improve availability and make customers happy is a core mission for the team. In his spare time, he loves to go for long walks, tinker with hardware, and chase his unachievable goal of reading more books than Bill Gates.

Delivering Business Impact through Culture Change. How We Saved Millions by Celebrating Failure through Learnings

Tuesday, 2:50 pm–3:30 pm

Aniket Kulkarni, PayPal

As a leader in the Fintech space, we've come far in the reliability journey. We implemented Service Level Objectives based on Failed Customer Interactions and drove systematic improvements through tooling and automation to get to an availability of 99.98%. The last mile to get to 99.99% required scaling the SRE culture across an organization of over 7000 engineers. Through a strategic plan that involved participation from dozens of developers, SREs, product managers, and influencers, we encouraged and rewarded a culture that is comfortable with discussing and analyzing failure in a blameless way. Now every single one of our 500+ P0/P1 issues annually goes through a grassroots-driven analysis, and the learnings are published to the entire company. The rich dataset from this is mined to identify themes that drive reliability investment plans. The 0.01% increase in availability translates to millions of dollars in annual bottom-line revenue for the company. In this talk, we will cover the strategy, initiatives, incentives, setbacks, learnings, and iterations we went through to get here.
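To put the 99.98% and 99.99% figures in this abstract in perspective, the short calculation below converts each availability level into failed interactions per million. It assumes availability is the fraction of customer interactions that succeed, which is one plausible reading of an SLO "based on Failed Customer Interactions"; the abstract does not give the exact definition, and the function name is illustrative.

```python
from fractions import Fraction


def failures_per_million(availability: str) -> Fraction:
    """Failed interactions per million, assuming availability is the success fraction."""
    return (1 - Fraction(availability)) * 1_000_000


before = failures_per_million("0.9998")  # 99.98% availability
after = failures_per_million("0.9999")   # 99.99% availability

print(f"99.98% -> {before} failed interactions per million")
print(f"99.99% -> {after} failed interactions per million")
print(f"reduction: {before - after} per million")
```

Under that assumption, the 0.01-point improvement halves the number of failed interactions.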

Track 2

Modeling Reliability for Distributed Systems

Tuesday, 2:00 pm–2:50 pm

Narayan Desai, Google

Core Principles

Many of our core SRE processes exist as rituals, with many aspects of their implementation derived from the broad type of service and business environment they emerged from. While these processes seem to work, we don't necessarily have a deep understanding of why, or how they should be altered to work in different situations. The same is true of the reliability of distributed systems in general - while we have keenly developed intuition about how to improve our services, we have no coherent theory of why our services are reliable in a deep sense. In short, SRE suffers from a lack of rigor. This problem is the largest and most consequential one facing our profession today.

In this talk, I'll discuss the importance of modeling in engineering disciplines in general, and walk through a basic model of distributed system reliability. Models are useful for encoding intuition in a way that enables validation and projection—key capabilities for engineers. I'll present a model based on a theory of reliability that couples the reliability properties of the Central Limit Theorem to a representation of the toil that characterizes our services. I use this to explain why some problems can be easily mitigated, while others are terrifying. While all models are wrong, some are useful—this one enables the systematic examination and classification of risks, as well as suggesting a way that we may be able to hoist difficult reliability problems into a more tractable regime.

Narayan Desai, Google

Narayan Desai is an SRE at Google, where he focuses on the reliability of Google Cloud Platform Data Analytics products. He has a checkered past, having worked on scheduling, configuration management, supercomputers, and metagenomics—always in the context of production systems.

Give Your PXE Wings! Bootstrapping Explained

Tuesday, 2:50 pm–3:30 pm

Rob Hirschfeld, RackN

Core Principles

What is PXE anyway and why does it work? System bootstrapping is one of the great mysteries in IT. In this talk, we'll break the entire bootstrapping process down into components. That will allow us to discuss a way to make it faster, simpler, and more reliable. Most importantly, we'll show how you can improve the automation options involved in day two operations for anything that boots.

Rob Hirschfeld, RackN

Rob has been in the cloud and infrastructure space for 20 years and has done everything from working with early ESX betas to serving four terms on the OpenStack Foundation Board and as an executive at Dell. As a co-founder of the Digital Rebar project, Rob created a new generation of DevOps orchestration to leverage containers and service-oriented ops. He believes that the technology of running data centers and applications cannot work without also addressing process and people challenges. He trained as an engineer, with degrees from Duke (BS Mechanical) and LSU (MS Industrial).

Track 3

Chaos Engineering Bootcamp

Tuesday, 2:00 pm–5:30 pm

Ana Margarita Medina and Jason Yee, Gremlin

Chaos engineering is the practice of conducting thoughtful, planned experiments designed to reveal weaknesses in our systems. This hands-on workshop will share how you can get started practicing Chaos Engineering. We will cover the tools and practices needed to implement Chaos Engineering in your organization. Even if you're already using chaos engineering, you'll learn to identify new ways to use chaos engineering within your engineering organization and discover how other companies are using chaos engineering—and the positive results they have had using chaos to create reliable distributed systems.

Ana Margarita Medina, Gremlin

Ana is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. Previously, she worked at Uber as an engineer on the SRE and Infrastructure teams, where she specifically focused on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.

Jason Yee, Gremlin

Jason is the Director of Advocacy @ Gremlin. Previously he was a Technical Evangelist at Datadog, the Community Manager for DevOps & Performance at O'Reilly Media and a Software Engineer at MongoDB. He spends much of his time traveling, drinking whiskey, and catching pokémon.

Track 4

Incident Management for IT

Tuesday, 2:00 pm–5:30 pm

Chris Hawley, Rob Schnepp, and Ron Vidal, Blackrock 3

Many companies have experienced the fear, pain, and embarrassment of handling a technology failure so significant it shook the core of the business both at the time and into the future. Without a standardized way to organize the people responding to incidents and solving technical problems, the time to restore services gets longer and longer.

This session dives into the nuts and bolts of the Incident Management System, which has a long history in the fire and emergency services. We have translated this system and optimized it for the IT world, and it is in use by a number of Site Reliability teams and other operational IT response organizations. Effective use of the Incident Management System can substantially reduce both the MTTR and the Mean Time To Assemble (MTTA). Companies that employ incident management see a reduction of their MTTR by 35%–65%, and it is not unusual to see MTTA reductions of greater than 90%. Incident management uses the fire department model of getting the right people to the problem as rapidly as possible.

Chris Hawley, Blackrock 3

The Blackrock 3 speakers have deep global experience in Incident Management (Fire Department, Special Operations), Anti-Terrorism Operations, and Critical Infrastructure (fiber networks, data centers, oil and gas, power systems). We combine a unique mix of expertise and ingenuity to maximize IT Uptime in your organization.

3:30 pm–4:00 pm

Break with Refreshments

4:00 pm–5:30 pm

Track 1

Building an Effective Team When You Have No Idea What You're Doing

Tuesday, 4:00 pm–4:35 pm

Peter Sahlstrom, Mailchimp; Chris Donnelly, CodeMettle

When Peter became an SRE manager, he had little management experience and no idea what SRE was. By pairing up with Chris, he was able to build a team with a reputation for quality results and for being an excellent place to work.

This isn't a talk about "best practices for a rockstar team". Rather, we'll be talking about the tools we utilized to figure out what next steps to take at any given time. We'll also offer resources that individual managers and technical leads can use to make their teams happy, comfortable, and more productive.

Come learn about how to better understand your team through modeling, mapping, and complexity theory. We'll teach you about the value chain, the Jevons paradox, and how you can apply utilization curves to prevent burnout. Finally, we'll help you learn to recognize when it's time to make a change – and how to implement it.

Peter Sahlstrom, Mailchimp

Peter Sahlstrom manages the Site Reliability Engineering team at Mailchimp, but has worked in the past as a Product Manager, a Software Developer, and a Semiconductor Fabrication Technician. He has a degree in Computer Engineering and an MBA and is currently pursuing a Master's in Analytics. He enjoys conversations about just about anything, but people often talk to him about organizational behavior, technical communication, personal finance, retirement planning, complexity theory, history, organization, contentment, negotiation, parenting, and Star Trek.

Leading without Managing: Becoming an Engineering Technical Leader

Tuesday, 4:35 pm–5:10 pm

Todd Palino, LinkedIn

Increasingly, technical organizations are developing career paths to build and recognize leaders outside of the traditional management roles. But what should an engineer who wants to be a leader be focusing on? Through the eyes of an engineer who reinvented his career in one of the largest SRE organizations, we will examine what technical leadership looks like, and how an individual can help guide the strategic path of a team, department, or company without taking on the role of a people manager. You'll pick up tactical work that you can start immediately to set yourself up for success, and some pointers to be able to identify the opportunities when they show up.

Todd Palino, LinkedIn

Todd Palino is a Senior Staff Engineer in Site Reliability at LinkedIn on the Capacity Engineering team, where his team is creating a framework for application capacity measurement, analysis, and change intelligence. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and is the co-author of Kafka: The Definitive Guide, now available from O'Reilly Media.

Out of the office, you can find Todd at conferences like SREcon and LISA, sharing his experience from years in SRE technical leadership. Or maybe out on the trails, training for the next marathon.

Preventing Burnout with Boundaries

Tuesday, 5:10 pm–5:30 pm

James Luck, LinkedIn

Boundaries are a conceptual framework that's useful for protecting your mental health in stressful situations. This talk will explore the concept, why it's useful in the technology industry, and how to implement it in your own life. It will also give examples of stressful situations that could lead to burnout and how to address those stressful situations using boundaries as a mental guide.

James Luck, LinkedIn

Jamie is a Senior SRE at LinkedIn and a total unix nerd. He is fond of talking about the cultural aspects of the internet industry and trying to bring the conversation about mental health further into the forefront.

Track 2

Troubleshooting Microservices with eBPF

Tuesday, 4:00 pm–4:35 pm

Jonathan Perry, Flowmill

Core Principles

Over the past year, eBPF tracing has shifted from a niche technology to a powerful observability tool that ships with every Linux kernel. It enables incredibly powerful profiling of container-based applications at the operating system level, without worrying about the languages and frameworks used.

This talk will show you how to get beyond playing with some open source eBPF tools and examine where and when they should be used to tackle observability challenges. We’ll look at a number of specific use cases in networking and system performance that eBPF is particularly well suited to approach. Finally, we’ll talk about how to implement eBPF tracing continuously in microservice-based environments like Kubernetes.

Take Database Performance Troubleshooting to the Next Level with Datadog APM and Logs

Tuesday, 4:35 pm–5:10 pm

Anatoly Mikhaylov, Zendesk

Core Principles

Application performance troubleshooting is hard; comprehensive tracing around database performance is harder. Having application traces with meaningful context about what the database is doing (process list and statement analysis) at that exact moment is harder still. Why is a given SQL query fast in one case and slow in another? Is the query execution plan enough to understand and address performance issues? In this session, we will learn how to use APM and Logs to connect layers of the stack together in order to investigate performance issues thoroughly and expediently.

Anatoly Mikhaylov, Zendesk

Anatoly is a keen enthusiast in observability and performance troubleshooting. He works as a Staff SRE at Zendesk in Dublin, Ireland, where he is part of a global team that builds and maintains next-generation observability tools for dozens of high-traffic microservices and databases. He contributes to the Zendesk Engineering blog. He is also a runner, an avid hiker, and a nature photographer. Before Zendesk, Anatoly worked as a DBA, DevOps engineer, and software engineer for over ten years.

Next-Level Caching: Statelessness and Versioned Data

Tuesday, 5:10 pm–5:30 pm

Peter Sperl, Bloomberg LP

Core Principles

Caching is more than just an optimization. It's a powerful tool to increase system capacity and reliability by reducing back-end load.

Many caching strategies hit their limits when operations depend on factors like user settings and/or databases. After all, how can you cache something that’s dependent on a database entry that may change at any time? These realities often force us to forgo caching or accept less desirable cache semantics like TTL or event-driven cache invalidation strategies.

We'll discuss designing for immutability, where cache entries are valid in perpetuity and no invalidation strategy is required, and versioned data, a pattern that enables caching of database accesses and any operations that depend on them.
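One way to picture the versioned-data pattern mentioned here is a cache whose keys include the version of the underlying data, so that an entry is never wrong and never needs invalidating: a change simply produces a new key. The sketch below is a minimal, single-threaded illustration under that assumption. `VersionedStore`, `VersionedCache`, and their methods are hypothetical names, not the speaker's design or any library's API.

```python
from typing import Callable, Dict, Tuple


class VersionedStore:
    """Hypothetical stand-in for a database whose rows carry a version number."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, str]] = {}

    def write(self, key: str, value: str) -> int:
        """Store a new value and bump the row's version."""
        version = self._rows[key][0] + 1 if key in self._rows else 1
        self._rows[key] = (version, value)
        return version

    def current_version(self, key: str) -> int:
        """Cheap lookup of the row's current version."""
        return self._rows[key][0]

    def read(self, key: str) -> Tuple[int, str]:
        """Full read, returning (version, value)."""
        return self._rows[key]


class VersionedCache:
    """Caches derived results keyed by (key, version); entries are immutable."""

    def __init__(self, compute: Callable[[str], str]) -> None:
        self._compute = compute
        self._entries: Dict[Tuple[str, int], str] = {}
        self.misses = 0

    def get(self, store: VersionedStore, key: str) -> str:
        cache_key = (key, store.current_version(key))
        if cache_key not in self._entries:
            self.misses += 1
            version, value = store.read(key)
            # Key by the version actually read so the entry stays valid forever.
            cache_key = (key, version)
            self._entries[cache_key] = self._compute(value)
        return self._entries[cache_key]


store = VersionedStore()
store.write("user:1:theme", "dark")
cache = VersionedCache(compute=str.upper)  # placeholder for an expensive operation

first = cache.get(store, "user:1:theme")   # miss: computed from version 1
second = cache.get(store, "user:1:theme")  # hit: same version, same key
store.write("user:1:theme", "light")       # version becomes 2
third = cache.get(store, "user:1:theme")   # miss: new version, new key
print(first, second, third, cache.misses)
```

The trade-off in this sketch is that each lookup still needs the current version, and old entries accumulate until they are evicted by size or age.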

Peter Sperl, Bloomberg LP

Peter Sperl is an Engineering Manager at Bloomberg LP. He graduated from Carnegie Mellon University in 2004 with a B.S. in Electrical and Computer Engineering.

Track 3

(Continued from previous session)

Chaos Engineering Bootcamp

Tuesday, 2:00 pm–5:30 pm

Ana Margarita Medina and Jason Yee, Gremlin

Track 4

(Continued from previous session)

Incident Management for IT

Tuesday, 2:00 pm–5:30 pm

Chris Hawley, Rob Schnepp, and Ron Vidal, Blackrock 3

5:30 pm–6:30 pm

Happy Hour

Wednesday, March 25

8:00 am–9:30 am

Continental Breakfast

9:30 am–11:00 am

Track 1

SRE Work as Done

Wednesday, 9:30 am–10:20 am

Amy Tobey, Elastic

A common discussion anywhere SREs gather to talk about our craft is "what does SRE mean?" This is something we all have strong opinions about in the abstract and have to be flexible about in practice. In this session, Amy will combine her own experience with interviews to talk about how SRE work is done in the real world. She hopes we can all leave with a better understanding of what our peers in other organizations are doing and how our local variants are just as valid as anything in a book.

More details: I intend to take the varieties of human work model and look at various approaches to SRE in the talk. Having moved around a bit in my career gives me a wide perspective, but I intend to supplement that by interviewing folks from 10-20 additional companies to bring more depth.

Reliability Is Made Out of People

Wednesday, 10:20 am–11:00 am

Morgan Schryver, Netflix

SRE programs have proliferated throughout the industry, often with mixed results. Metrics, service agreements, nines, and error budgets are only part of the picture. People and relationships are the foundation of SRE work.

This talk will cover some of the common pitfalls encountered when creating and operating SRE teams and how to overcome them.

Morgan Schryver, Netflix

Morgan Schryver is a Site Reliability Engineer on the CORE team at Netflix, where she works on incident management as well as identifying human and technical factors of reliability. Before that, Morgan was an SRE on the Telemetry team at Blizzard Entertainment.

Track 2

Algorithms and Data Structures are Super Important (and Interesting!) For SREs

Wednesday, 9:30 am–10:20 am

Adam Mckaig, Google

Core Principles

Large tech companies often test SRE candidates' familiarity with fundamental algorithms and data structures, which can sometimes seem irrelevant to the reality of the role. However, I will show why they do it: These techniques can make your systems run faster, cheaper, and more reliably.

I will show examples of real problems my team has encountered and solved (with better algorithms and data structures) while working on a very large production system: unpredictable performance, performance hot-spotting, and tail latency spikes. You will learn how to spot these problems in your own systems, and how to solve them.

I may even convince you that this stuff is super interesting, too!

Adam Mckaig, Google

Adam Mckaig is an SRE at Google in New York, where he looks after a monitoring system. Previously he built things at the New York Times, Bloomberg, and UNICEF. His favorite language is C++, which probably says it all.

Refining Systems Data without Losing Fidelity

Wednesday, 10:20 am–11:00 am

Liz Fong-Jones, Honeycomb.io

It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then you'll naively be collecting roughly 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. The question is how to scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors.

Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. This talk advocates a three-R approach to data retention: Reducing junk data, statistically Reusing data points as samples, and Recycling data into counters. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
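As a rough illustration of the arithmetic above and of statistically reusing data points as samples, the following Python sketch keeps every failed request and about one in N successful ones, recording a sample rate so that totals can be re-estimated later. The function names, the event shape (a dict with an `ok` flag), and the one-in-N policy are assumptions made for illustration; they are not the speaker's method or any vendor's API.

```python
import random


def good_to_bad_ratio(slo):
    """Ratio of SLI-satisfying requests to budget-burning ones when exactly at the SLO."""
    return slo / (1.0 - slo)


def sample_events(events, keep_one_in, rng):
    """Keep every failed event; keep roughly one in `keep_one_in` successful events.

    Each kept event records how many original events it stands for.
    """
    kept = []
    for event in events:
        if not event["ok"]:
            kept.append({**event, "sample_rate": 1})
        elif rng.random() < 1.0 / keep_one_in:
            kept.append({**event, "sample_rate": keep_one_in})
    return kept


def estimated_total(kept):
    """Estimate the original event count by summing the recorded sample rates."""
    return sum(event["sample_rate"] for event in kept)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    # Hypothetical traffic: one request in 2,000 fails.
    events = [{"id": i, "ok": i % 2000 != 0} for i in range(100_000)]
    kept = sample_events(events, keep_one_in=100, rng=rng)
    print(f"good:bad ratio at a 99.95% SLO is about {good_to_bad_ratio(0.9995):.0f}:1")
    print(f"stored {len(kept)} of {len(events)} events; estimated total {estimated_total(kept)}")
```

Because each stored event carries its own sample rate, counts and rates can still be estimated from the reduced data set, while the rare failing requests are retained in full.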

Track 3

Experience the Power of Event Data Structures during Incident Triage

Wednesday, 9:30 am–1:00 pm

Abby Bangser, MOO Print; Charity Majors, Honeycomb.io

The observability community is currently arguing over some foundational tenets, such as whether observability has "three pillars" of data (metrics, logs, and traces) or a single data structure of arbitrarily-wide events, from which multiple visualizations (including metric graphs, log lines, and trace waterfalls) can be derived.

The paradigm shift between highly detailed and highly structured logs and emitting a single event per service per request is nuanced enough to confuse many people, yet powerful enough to justify the shift. This workshop will let us get past theory and ground these debates in hands-on experience with both formats.

You will have the chance to work with any of the tool stacks (Elasticsearch/Kibana, Loki/Grafana, Honeycomb) to triage two different Game Day scenarios. In one scenario you will be provided high-quality structured logs whereas in the other you will triage using arbitrarily-wide events.

Note: Participants will be most successful if they have experience triaging issues with logs, but this is not required. All activities will be done via websites and do not require local installs of anything.

Takeaways:

  • Experience running queries with event data structures to visualize the data in metric, log, and tracing views
  • Contextualised comparison of high-quality structured logs vs. events based on debugging a single application with different outputs
  • Ability to explain the unique selling point of events over highly contextualized and structured logs

Abby Bangser, MOO Print

Abby Bangser is a software tester with a keen interest in working on products where fellow engineers are the users. Abby brings the techniques of analyzing and testing customer-facing products to tools like delivery pipelines and logging so as to generate clearer feedback and greater value. Currently, Abby is a Test Engineer on the Platform Engineering team at MOO which supports the shared infrastructure and tooling needs of the organization.

Outside of work Abby is active in the community by co-leading Speak Easy which mentors new and diverse speakers, co-hosting the London free testing meetup Essentials which brings together mentors and new joiners to the software testing industry, and hosting #CoffeeOps London.

Charity Majors, Honeycomb.io

Charity Majors is the cofounder and CTO of Honeycomb.io, a provider of tools for engineering teams to debug production systems faster and smarter. Previously Charity ran infrastructure at Parse and was an engineering manager at Facebook, where she ran next-generation distributed systems at scale. Charity is the co-author of Database Reliability Engineering (O'Reilly) and is devoted to a world where every engineer is on call and nobody thinks on call sucks.

Track 4

Unconference: Topics in SRE

Wednesday, 9:30 am–11:00 am

Kurt Andersen, LinkedIn

This workshop time will be an opportunity for people to talk about topics that are top of mind as they have been attending the conference. It will be run as an unconference.

Kurt Andersen, LinkedIn

Kurt Andersen is a past co-chair for SREcon Americas and has been active in the anti-abuse community for over 20 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.

11:00 am–11:30 am

Break with Refreshments

11:30 am–1:00 pm

Track 1

Pragmatic Security for SRE

Wednesday, 11:30 am–12:20 pm

James Wickett, Verica.io

All organizations want to go faster and decrease friction in delivering software. The problem is that InfoSec has historically slowed this down or worse. But, with the rise of software delivery pipelines and new DevSecOps tooling, there is an opportunity to reverse this trend and move InfoSec from being a blocker to being an enabler.

This talk will discuss hallmarks of doing security in an SRE context with a particular focus on the software delivery pipeline. This is not a theory talk; the emphasis is on being pragmatic. At each phase of the software delivery pipeline, you will be armed with philosophy, questions, and tooling that will get security moving at the speed of your SRE organization.

James Wickett, Verica.io

James is a dynamic speaker on software engineering topics ranging from security to development practices. He spends a lot of time at the intersection of the DevOps and Security communities, and seeing the gap in software testing, James founded the open-source project, Gauntlt, to serve as a Rugged Testing Framework.

James works as a Sr. Security Engineer and Developer Advocate at Verica and is the author of several courses on DevOps and DevSecOps at LinkedIn Learning. His courses include DevOps Foundations, Infrastructure as Code, DevSecOps: Automated Security Testing, Continuous Delivery (CI/CD), Site Reliability Engineering, and more.

James is the creator and founder of the Lonestar Application Security Conference, which is the largest annual security conference in Austin, TX. He also runs DevOps Days Austin, DevSecOps Days Austin, and Serverless Days Austin.

How to Build On-Call Rotations that People Actually Want to Join

Wednesday, 12:20 pm–1:00 pm

Miles Bryant, Monzo

A healthy and functional on-call rotation is essential to maintaining resilient and adaptive systems. So why do organisations neglect it in favour of focusing on SLOs and the technical parts of the system? Why do people treat being on-call as a necessary evil they have to put up with? Why are we letting people get burnt out?

This conceptual talk is aimed at both people on-call and people who manage their on-call rotations who dream of a better way of doing things. I’ll argue that improving on-call process is as important as improving the resilience of your core systems. I’ll discuss what the outcomes of a good on-call rotation should be, from both an organisational and personal perspective. The main bulk of the talk will cover the general principles that people should be able to take away to improve their own organisation’s on-call rotation.

Track 2

Incident Response, Faster and Better with Traces

Wednesday, 11:30 am–12:20 pm

Ashutosh Raina, eBay; Kamala Ramasubramanian, University of California, Santa Cruz

Site reliability engineers (SREs) are constantly building newer and better tools in order to be able to respond faster. The existing use of metrics, logs, and events doesn't fully capture the mental model of an SRE. Distributed tracing has increasingly been adopted by industry, and systems now capture end-to-end traces, both successful and unsuccessful. We motivate the use of traces through insights on a series of incidents, for each of which we demonstrate how traces could have been used to enable faster and more accurate triage. We present ideas on integrating distributed tracing in incident response and building better tooling using aggregate reasoning over traces. We will also talk about the challenges and opportunities we faced as part of this work.

Ashutosh Raina, eBay

Ashutosh is a member of the Site Reliability team at eBay. He works on the observability, fault tolerance, and reliability of systems at eBay. He works at the intersection of academia and industry, trying his best to fuse them together.

Kamala Ramasubramanian, University of California, Santa Cruz

Kamala is a PhD student at UCSC. Her interests include reasoning about large scale distributed systems and applied machine learning, specifically how and when we might be able to apply machine learning effectively to understand complex systems better.

Weeks of Debugging Can Save You Hours of TLA+

Wednesday, 12:20 pm–1:00 pm

Markus A Kuppe, Microsoft Research

Designing and debugging a concurrent program or distributed system is hard and time-consuming. It is the proverbial search for the needle in the ever-changing haystack. Wouldn't it be cool if we could explore the actual algorithm hidden inside of the implementation and check its correctness even before we write a single line of the implementation code?

It turns out we can: The TLA+ specification language is an immensely expressive language that enables rapid prototyping of ideas. It has been described as "debuggable pseudo-code". And TLA+ comes with a set of tools with which users can verify correctness properties.

In this talk, we will study the powers of TLA+ by solving challenge Fourteen of the famous c2 wiki in 40 minutes.

Markus A Kuppe, Microsoft Research

Markus Kuppe is a Principal Research Software Development Engineer at Microsoft Research. He joined the TLA+ project in 2011. Being an engineer, his focus is on making spec-driven development (with TLA+) more popular among fellow engineers.

Track 3

(Continued from previous session)

Experience the Power of Event Data Structures during Incident Triage

Wednesday, 9:30 am–1:00 pm

Abby Bangser, MOO Print; Charity Majors, Honeycomb.io

Track 4

Site Reliability Engineering (SRE) and the Art of Service Level Objectives (SLOs)

Wednesday, 11:30 am–1:00 pm

Nathen Harvey and Stephanie Hippo, Google

SRE is a set of principles, practices, and organizational constructs that seek to balance the reliability of a service with the need to continually deliver new features. An error budget is the primary construct used to help balance these seemingly competing goals.

This workshop introduces error budgets and their components: service level indicators (SLIs) and service level objectives (SLOs). Participants will learn how to create and implement SLOs through a series of guided discussions and group exercises.

The workshop is appropriate for all levels of technical capability and non-technical participants from "the business" are encouraged to attend; we seek to build a common language across teams.

By the end of this workshop, participants will be able to:

  • Describe key concepts: Error Budget, SLIs, and SLOs
  • Create an error budget
  • Recommend actions to take when the error budget is consumed
  • Recommend actions to take when excess error budget remains
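For readers new to the terms, the relationship between an SLO and its error budget can be sketched in a few lines of Python. The function names, the 30-day window, and the request counts are illustrative assumptions, not material from the workshop.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability a time-based SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_fraction(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO, one million requests, 250 failures so far.
    print(f"99.9% over 30 days allows {error_budget_minutes(0.999):.1f} minutes of downtime")
    remaining = budget_remaining_fraction(0.999, total_requests=1_000_000, failed_requests=250)
    print(f"{remaining:.0%} of the error budget remains")
```

A negative remaining fraction means the budget is overspent, which is the situation in which a team would typically shift effort from features to reliability; a large positive fraction suggests there is room to take more risk.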

Stephanie Hippo, Google

Stephanie is a Senior Site Reliability Engineer at Google, where she leads a team of eight supporting Google's internal services. She enjoys exploring the data-driven nature of reliability work, teaching teams how to adopt reliability principles, and encouraging the career growth of others. Away from the keyboard, she enjoys baking desserts, playing soccer, and hanging with her lazy, tiny dog.

1:00 pm–2:30 pm

Luncheon

Sponsored by Blameless

2:30 pm–4:00 pm

Track 1

Crayon Drawing Is a Vital Engineering Skill

Wednesday, 2:30 pm–3:05 pm

Murali Suriar, Google

Software and technology systems are complex and growing more so as time progresses. This challenges the humans responsible for those systems. Complex systems are difficult to understand, and such understanding, once obtained, is difficult to communicate to others. System overview diagrams are one tool that can be very helpful in mitigating these challenges. This talk will walk through system overview diagrams of 2-3 systems from different problem domains, and discuss the benefits such diagrams can yield.

Murali Suriar, Google

Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running cluster filesystem and locking services.

When SLOs Aren't Enough: How to Get Things Done across Teams

Wednesday, 3:05 pm–3:40 pm

Matthew Terwilliger, Mailchimp

Ensuring the reliability of our product means working across teams and organizational boundaries. This isn't always easy. We often seek the elusive "owner" of a system, only to find the remnants of a team that was disbanded years ago. Now what?

With more competing priorities than working hours, combining our engineering mindsets with a dash of marketing can go a long way. A well-written story allows an individual to understand an issue in a matter of seconds; it's one of the most effective empathy-building tools we have. In this talk, we'll learn how to shape even the most technical issues into a cohesive narrative that everyone will rally behind.

Matthew Terwilliger, Mailchimp

Matt Terwilliger is a Senior Site Reliability Engineer at Mailchimp. There, he has built continuous deployment infrastructure, operationalized the data pipeline that powers Mailchimp's search, and led a performance engineering team. Matt finds joy in the intersection between human and technical systems.

Wheel of Misadventure: Mocking Site Incidents

Wednesday, 3:40 pm–4:00 pm

Brian Wilcox and Ben Goldsbury, LinkedIn

Wheel of Misadventure breaks down a real site incident into a replayable format where engineers can experience the thrills of being on-call without the pressure of the site actually being down. LinkedIn has been using this as a method to propagate best practices, disseminate information about new tools, and bring a little levity to a stressful part of the job.

Brian Wilcox, LinkedIn

Brian Wilcox is a simplicity enthusiast building the right systems to solve the hard problems. From structural engineering to video games to distributed systems, Brian likes to forage on various technologies and apply the best practices from various disciplines.

Ben Goldsbury, LinkedIn

Ben Goldsbury, Premier Oncall-ogist Coder, systems engineer, and mentor. Preference for Python. Passionate about solving technical problems. Critical thinking is my favorite activity.

Track 2

Reliability Is Performance

Wednesday, 2:30 pm–3:05 pm

Bradley Shively, Uber ATG

"Make it faster!" is a common refrain from managers and customers. As teams that build and run systems, we are regularly under a lot of pressure to provide better performance. This may mean cutting corners to deliver more cores, bandwidth, or other resources as quickly as possible. However, we're often making an unseen trade-off; we sacrifice considerable amounts of reliability in pursuit of marginal performance gains.

Stated simply, a "faster" process that fails more often may not actually be faster at all.

In this talk, I'll argue that reliability is one of the best investments you can make to improve system performance. We'll explore the way in which even small reductions in reliability can translate into significantly worse average performance and increased costs. We'll consider these implications through the lens of expected value.
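The expected-value argument can be sketched with a simple retry model. Assume, purely for illustration, that each attempt takes a fixed time, succeeds independently with a fixed probability, and is retried from scratch on failure; the expected time to the first success is then the attempt time divided by the success probability. The durations and probabilities below are made-up numbers, not figures from the talk.

```python
def expected_minutes_to_success(attempt_minutes, success_probability):
    """Expected wall-clock minutes until the first success.

    Assumes independent attempts that are retried from scratch on failure,
    so the number of attempts is geometric with mean 1 / success_probability.
    """
    if not 0.0 < success_probability <= 1.0:
        raise ValueError("success_probability must be in (0, 1]")
    return attempt_minutes / success_probability


if __name__ == "__main__":
    # Hypothetical processes: one is quicker per attempt but flakier.
    fast_but_flaky = expected_minutes_to_success(8.0, 0.70)
    slow_but_reliable = expected_minutes_to_success(10.0, 0.99)
    print(f"fast but flaky:    {fast_but_flaky:.1f} expected minutes")
    print(f"slow but reliable: {slow_but_reliable:.1f} expected minutes")
```

Under these assumptions the nominally faster process has the higher expected completion time: 8 / 0.70 is about 11.4 minutes, versus about 10.1 minutes for 10 / 0.99.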

Bradley Shively, Uber ATG

Brad Shively is an engineering manager at the Uber Advanced Technologies Group, where he leads a large part of the Developer Experience team. Prior to Uber, he worked in business operations at Google and spent time as a management consultant. He's passionate about building both engineering teams and services that are durable and high-performance.

Building for Disaster

Wednesday, 3:05 pm–3:40 pm

Mihnea Giurgea, Dropbox

Building infrastructure to survive large-scale disasters like data center failures is difficult at the best of times. It's especially challenging when you have hundreds of services and millions of lines of code that expect your entire stack to operate out of a single geographic region.

This talk is about two attempts to transform Dropbox's infrastructure to support cross-country disaster recovery, one that was unsuccessful and one that succeeded, and what we learned in the process.

Mihnea Giurgea, Dropbox

Mihnea is a Software Engineer at Dropbox, where he has focused on distributed databases and on leading the Infrastructure Disaster Recovery project. Still trying to wrap his head around concurrency.

Resilience Engineering Options: Unintended Consequences of Product Decisions

Wednesday, 3:40 pm–4:00 pm

Brian Wilcox, LinkedIn

Everyone wants their product "always available" and "super fast." But how important is that really? Should we compromise our time-to-market because our availability is less than seven nines? Should we stop developing features because data consistency is more important? Take a walk with me through some unintended consequences of product constraints, good ways to deal with them without breaking timelines and banks, and get some ideas to give your PMs context to make more informed decisions.

Brian Wilcox, LinkedIn

Brian Wilcox is a simplicity enthusiast building the right systems to solve the hard problems. From structural engineering to video games to distributed systems, Brian likes to forage on various technologies and apply the best practices from various disciplines.

Track 3

Observing and Understanding Distributed Systems with OpenTelemetry

Wednesday, 2:30 pm–6:00 pm

Liz Fong-Jones, Honeycomb.io; Austin Parker, LightStep

Workshop participants should be able to comfortably write code in Go or Python, preferably Go, and will need a laptop with internet access (no Docker images etc. will be fetched; this workshop is low-bandwidth and runs through glitch.com as an IDE).

Modern systems architecture often splits functionality into microservices for adaptability and velocity. The challenge of managing infrastructure for microservices has led to the cloud native ecosystem, including Kubernetes, Envoy, gRPC, and other projects. Observability, including application performance management (APM), is an essential component of a cloud-native stack. Without observability, application developers and operators cannot understand the behavior of their applications and ensure the reliability of those applications.

Track 4

BPF Performance Tools

Wednesday, 2:30 pm–6:00 pm

Brendan Gregg, Netflix

BPF (eBPF) tracing is the superpower that can analyze everything, helping you find performance wins, troubleshoot software, and more. This tutorial shows you how to use the open-source BCC and bpftrace tools to find performance wins across a variety of application and system targets, and how to create your own Linux observability tools with BPF/bpftrace. We will also discuss challenges and fixes for real-world analysis, including lessons learned from its production use at Netflix, so you can hit the ground running when you return to work.

Brendan Gregg, Netflix

Brendan Gregg is an industry expert in computing performance and cloud computing. He is a senior performance architect at Netflix, where he does performance design, evaluation, analysis, and tuning. He is the author of BPF Performance Tools (Addison Wesley) and Systems Performance (Prentice Hall), and received the USENIX LISA Award for Outstanding Achievement in System Administration. Brendan has created numerous performance analysis tools, visualizations, and methodologies for performance analysis, including flame graphs.

4:00 pm–4:30 pm

Break with Refreshments

4:30 pm–6:00 pm

Track 1

Unconscious Bias & El Niño—Acknowledging and Addressing Bias without Flying in the Face of Every Storm

Wednesday, 4:30 pm–5:20 pm

Jamie L. Henderson, LinkedIn

This talk will use examples of common kinds of unconscious biases to explore strategies for actively addressing microaggressions in a positive way, how individuals can get the credit they deserve, and the power of bringing coworkers on board as allies. While it uses examples of situations commonly experienced by women working in Tech, the tools provided are aimed at being useful to people of all genders in a variety of environments. We can change the world through our actions without getting into constant opposition with our colleagues.

Jamie Henderson, LinkedIn

Jamie Henderson started messing around with computers a very long time ago and has been working with them professionally for two decades. Her passion is for large-scale, Internet-facing production infrastructure and the processes needed to manage and secure it. When she’s not making infrastructure sing, she turns her energy to improving Tech culture through inclusivity and spreading the practice of growth mindset.

On Being Neurodiverse in the Tech Industry

Wednesday, 5:20 pm–6:00 pm

Thaddeus Aid, LinkedIn

Particular forms of "Mental Illness" have a genetic basis and those genetics can lead to some surprising natural skills that can be learned by those that are "Neurodiverse" (having a brain built under a different set of rules than the norm).

In this talk, I will discuss how I harnessed my genetic diversity and why others like me should be embraced and given the support they need to harness theirs. I will also discuss some of the background on the genetics and how, historically, Neurodiversity has been a benefit to the Technology Industry and beyond.

Track 2

How We Went from Being Astronauts to Being Mission Control: Managing Systems in an Age of Dynamic Complexity

Wednesday, 4:30 pm–5:20 pm

Laura Nolan

Why is it that a single server can often have better uptime than a public cloud service?

We used to manage systems. Instead, many of us now write and run dynamic control planes: the systems that run our user-facing systems. We find the dynamic control plane pattern in software-defined networking, in service meshes, in some load balancers, and in job orchestration systems.

This talk looks at the common architectural shapes of dynamic control planes, and some examples of how they fail spectacularly—many major cloud outages are caused by dynamic control plane issues. Why are dynamic control planes so hard to run, and what can we do about it?

Laura Nolan

Laura Nolan is a software engineer whose fascination with failure and fragility in systems drew her into the field of Site Reliability Engineering. She is a contributor to "Site Reliability Engineering: How Google Runs Production Systems" and "Seeking SRE", and writes a quarterly column on SRE for ;login magazine. Laura works for Slack (in Dublin, Ireland).

When Automation Attacks: Revisiting "Automate All The Things"

Wednesday, 5:20 pm–6:00 pm

J. Paul Reed, Netflix

Automation is a cornerstone of DevOps, SRE, and modern operations practices, the A in DevOps' venerable CAMS, and the subject of one of its oldest, most famous memes: "Automate ALL the things."

But are there processes we shouldn't automate? What if HOW we automate actively causes us (and the systems we're responsible for) harm? We'll take a look at what human factors have to do with automation, at some of the impacts and challenges pervasive automation has presented for systems administrators and SREs, at some important considerations when automating our complex, living socio-technical systems, and at some strategies to cope when those shell scripts strike back!

J. Paul Reed, Netflix

J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful consulting firm, he now spends his days as a Senior Applied Resilience Engineer on Netflix's CORE team, focusing on incident analysis, systemic risk identification and mitigation, Resilience Engineering, and human factors expressed in the streaming leader's various socio-technical systems.

Reed is an internationally recognized speaker on operational socio-technical complexity challenges and opportunities, Resilience Engineering, and DevOps, and holds a Master of Science in Human Factors & Systems Safety from Lund University.

Track 3

(Continued from previous session)

Observing and Understanding Distributed Systems with OpenTelemetry

Wednesday, 2:30 pm–6:00 pm

Liz Fong-Jones, Honeycomb.io; Austin Parker, LightStep

Track 4

(Continued from previous session)

BPF Performance Tools

Wednesday, 2:30 pm–6:00 pm

Brendan Gregg, Netflix

6:00 pm–7:00 pm

Reception

Sponsored by Packet

7:00 pm–8:00 pm

Lightning Talks

Thursday, March 26

8:00 am–9:30 am

Continental Breakfast

9:30 am–11:00 am

Track 1

Confessions of a Systems Engineer: Learning from My 20+ Years of Failure

Thursday, 9:30 am–10:20 am

David Argent, Amazon

There's no holy book of best practices for running large online services. We rely on what we've learned along the way, often taught to us by having things break. Failure is a great but expensive teacher, and it's usually better to learn from someone else's mistakes. I've had a long career of mistakes I've made or experienced first hand to draw on and built a list of conceptual lessons to be learned from them. This is a non-exhaustive list of things to think about when designing and running a large scale online service, rather than a prescriptive checklist.

David Argent, Amazon

With over 20 years of experience in the tech industry, and job titles including Technical Writer, Systems Engineer, Program Manager, and Lead Problem Engineer (my personal favorite), I've worn more than a few hats and been victimized by more than a few badly designed online services. 19 years at Microsoft had me working on various TV-division projects, Windows Phone, Cortana, and Bing until my new adventure with Amazon, where I help run what, depending on who you talk to, is the largest NoSQL installation in the world.

How to Speed Up an Old Service

Thursday, 10:20 am–11:00 am

Danilo Carvalho, Google

How does one go about speeding up a system that has been around for several years and is not well understood? In this talk I'll go over some of the reasons why improving the performance of these systems might be necessary and the high-level lessons learned while dealing with one such system: a strongly consistent distributed storage file system.

On the need to improve performance, I outline three reasons: cost, capacity, and complexity. While the first two are straightforward, the fact that improved performance often decreases the complexity of services (by removing caching, co-location restrictions, and request hedging) is often surprising. I'll illustrate this point with a case where improving latency on the core service allowed us to remove an entire caching layer, deleting thousands of lines of code.

Track 2

Hot Swap Your Datastore: A Practical Approach and Lessons Learned

Thursday, 9:30 am–10:20 am

Mehmet Can Kurt and Raj Shekhar, Quantcast

This talk is about how we migrated from an in-house legacy datastore that handles 1.5 million lookup requests per second to a more reliable, flexible, and cheaper system.

We will talk about how we achieved three major goals for the migration process:

  • deliver performance similar to or better than the legacy system
  • ensure the quality and correctness of the data served by the new system, and
  • do the first two without any downtime in production and without affecting the company’s main revenue-generating product.

The talk is aimed at software engineers and site reliability engineers who are thinking of replacing a critical part of their distributed system.

Mehmet Can Kurt, Quantcast

Mehmet is a Senior Software Engineer at Quantcast working on the large scale data processing frameworks in the company. He enjoys working on all aspects of distributed computing and has a special interest in reliability. Off work, you can find him kicking the soccer ball and going on long hikes with his dog Sherlock.

Raj Shekhar, Quantcast

Raj is a Staff System Engineer at Quantcast, working on maintaining the uptime and reliability of the servers tracking Quantcast pixels across the web. He enjoys poking sleeping dragons in large scale distributed systems. When not working, you can find him planning his next getaway, usually to a place accessible by motorcycle.

Resiliency Practices in Netflix Global Content Delivery Network (CDN)

Thursday, 10:20 am–11:00 am

Yeshwenth Jayaraman, Netflix

The Netflix Open Connect Content Delivery Network is our in-house custom network and server infrastructure responsible for streaming all video, images, and related assets (CSS, JavaScript bundles, etc.) for Netflix clients around the world. In this talk, we dive deep into the details of what resiliency means in managing a global CDN. We will also explore distinct failure domains and get into the details of some of the failure exercises that our team conducts and the important metrics we measure.

Yeshwenth Jayaraman, Netflix

Yeshwenth has been a Senior CDN Reliability Engineer on the Netflix Open Connect team since 2015. Before that, Yeshwenth was part of the Microsoft Azure Datacenter team, building and managing data centers.

11:00 am–11:30 am

Break with Refreshments

11:30 am–1:00 pm

Track 1

The On-Call Review: Building a Team Culture That Rejects Noise

Thursday, 11:30 am–12:05 pm

Dan Slimmon, HashiCorp

Alerts are only useful if you believe what they're saying. But over time, left unchecked, bogus alerts will make up more and more of a team's alert load. How can we prevent this proliferation of noise?

The alert review is a process I've been using successfully for over a decade, on lots of different teams. It's a way for a team to identify noisy alerts and, over time, develop healthier alerting habits. The process focuses on actionability (the ability of the recipient to act upon the problem an alert indicates) and investigability (the quality of requiring new insight rather than rote runbook-following to resolve the problem). Its benefits can be immense and long-lasting.

Dan Slimmon, HashiCorp

Dan Slimmon is an S.R.E. DevOpsAdmin™ at HashiCorp, a software company that makes lots of super useful tools for ops folks like you and me. He enjoys finding mathy and medicine-adjacent solutions to the problems of running busy web applications. He's got a lot of cat pics on his phone and would be delighted to show you some!

Prepping for the Worst—What Your On-Call Team Should Know

Thursday, 12:05 pm–12:40 pm

Amiya Adwitiya and Biju Chacko, Squadcast Inc

When systems break, it is always an urgent and stressful situation. And this only gets worse as your team scales, and your stack evolves. Given these facts, it is essential to have up-to-date documentation and a high-level onboarding plan to keep everyone aligned.

This talk will address the following subjects:

  1. How do we train and onboard new engineers effectively for being on call?
  2. How can we build an atmosphere of trust and confidence between various stakeholders?
  3. How do we pick the right folks to be on call for a specific service?
  4. What tools and processes are popular and which ones are advisable in specific contexts?

In the end, the audience should be able to:

  1. Create an onboarding procedure for their on-call team
  2. Apply best practices to their own processes

Amiya Adwitiya, Squadcast Inc

Amiya Adwitiya is the Founder and CEO of Squadcast, an end-to-end incident response platform built around SRE best practices for tech teams to avoid unplanned downtime. He previously worked with Freshworks, a customer engagement software company, and Accel Partners, a venture capital firm.

Biju Chacko, Squadcast Inc

Biju Chacko is a 20+ year veteran of Unix system operations. In the past, he has been a prominent open-source advocate, helped popularize Linux in India, and was a core committer of the Xfce Desktop. Most recently, he helped lead an operations team that managed a fleet of 55,000 physical servers.

How to Enable a Large Enterprise with SRE Capability

Thursday, 12:40 pm–1:00 pm

Murty Susarla and Carla Beichner, Fannie Mae

So, what if you aren’t Google or Netflix? How do you introduce and stand up SRE capabilities in a more traditional enterprise? It's time to break the stereotype that SRE is only needed, and can only be implemented, in certain environments. We come from an enterprise that makes homeownership a reality for millions of Americans. We would like to share our story of implementing SRE capabilities within that enterprise, including our successes and lessons learned, to give others a roadmap for a successful SRE implementation. We will walk you through the road we took: defining the problem statement, educating the leadership and the enterprise, understanding how SRE differs from other roles, and starting a pilot POC.

Murty Susarla, Fannie Mae

Murty Susarla has decades of experience with software development and delivery. He currently leads SRE strategy and adoption efforts at Fannie Mae, including performance engineering, testing, and chaos engineering. Murty is known for tackling complex and challenging initiatives in organizations. Prior to this, he built the practice and teams for the Architecture Review Board, Salesforce implementations, and a host of application and data warehouse initiatives. In his spare time, he teaches mathematics, chemistry, and geography to middle and high school students.

Carla Beichner, Fannie Mae

Carla Beichner is an experienced technology leader focused on enterprise agile and DevOps transformation, driving organizations to deliver faster without sacrificing quality and customer experience. Carla is an award-winning leader currently focused on driving operational excellence across Fannie Mae in order to achieve continuous delivery, agile maturity, and fast, frequent delivery at scale.

Track 2

3D Models of Distributed Systems

Thursday, 11:30 am–12:05 pm

Kim Schlesinger, Fairwinds Ops

The distributed systems we build and maintain are mind-boggling in their complexity, but our main tools for understanding the components and their relationships are diagrams with two-dimensional boxes and arrows. These models aren't sufficient, and we owe it to ourselves and our engineering teams to explore more accurate and engaging ways of representing our computing architecture. This talk will show you how to build 3D models of distributed systems using materials from craft stores, discuss why the process of building these models will increase your team's understanding of your systems, and give you tips on how to facilitate this kind of unorthodox exploration with your team.

Kim Schlesinger, Fairwinds Ops

Kim Schlesinger is a Site Reliability Engineer at Fairwinds Ops. Prior to being an SRE, Kim was an Instructional Designer at a codeschool, and before that an elementary school special education teacher. Kim loves working at the intersection of tech and adult education.

In her spare time, Kim is a CrossFit athlete and the Head of Education and Content for Develop Denver, a 2-day conference for developers, designers, strategists, and tech leaders.

Measuring End-User Availability via the Network Error Logging W3C API

Thursday, 12:05 pm–12:40 pm

Mohit Suley, Microsoft

Users of online services around the world experience issues with DNS resolution, TCP connections or SSL certificates on a regular basis. If you believe all your customers reached your online service, think again.

We will explain how an SRE team can leverage the new W3C API to get an availability telemetry feed that is automatically available from all (Chromium-based) endpoints. We will show what it takes to set up a pipeline to get this running and also walk the audience through actual examples we caught from real-user traffic that show the potential of this amazing telemetry system.

Our goal in this talk is to increase awareness in the SRE community for this new API and enable them to detect client issues reliably. If you were to ask us, this is the best thing since sliced bread.
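
For readers unfamiliar with Network Error Logging, the following is a minimal sketch of how a service might opt in by sending two response headers. It is not taken from the talk: the collector URL, the group name, and the sampling fractions are illustrative assumptions, and the helper `nel_headers` is hypothetical. The field names follow the W3C Network Error Logging and Reporting API drafts.

```python
import json


def nel_headers(report_url, group="network-errors", max_age=2592000,
                failure_fraction=1.0, success_fraction=0.0):
    """Build the Report-To and NEL response headers for a site.

    report_url, group, and the sampling fractions are illustrative
    assumptions; pick values that suit your own reporting pipeline.
    """
    # Report-To names a group of endpoints that the browser may POST reports to.
    report_to = {
        "group": group,
        "max_age": max_age,  # seconds the browser should remember this group
        "endpoints": [{"url": report_url}],
    }
    # NEL tells the browser to log network errors and send them to that group.
    nel = {
        "report_to": group,
        "max_age": max_age,
        "failure_fraction": failure_fraction,  # share of failed requests to report
        "success_fraction": success_fraction,  # share of successful requests to report
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


if __name__ == "__main__":
    # Hypothetical collector URL; replace with your own HTTPS endpoint.
    for name, value in nel_headers("https://reports.example.com/nel").items():
        print(f"{name}: {value}")
```

A supporting browser that receives these headers over HTTPS can then report DNS, TCP, and TLS failures for the origin to the collector, which is the kind of telemetry feed the abstract describes.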

Mohit Suley, Microsoft

Mohit is an engineer on Bing's Live Site Engineering team. Designing systems to proactively improve availability and make customers happy is a core mission for the team. In his spare time, he loves to go for long walks, tinker with hardware, and chase his unachievable goal of reading more books than Bill Gates.

SLIs, SLOs, and Error Budgets at Scale

Thursday, 12:40 pm–1:00 pm

Fred Moyer, Zendesk, Inc.

How can one democratize the implementation of SLIs, SLOs, and Error Budgets to put them in the hands of a thousand engineers at once?

At Zendesk we developed simple algorithms and practical approaches for implementing SLIs, SLOs, and Error Budgets at scale using a number of observability tools. This talk will show the approaches developed and how we were able to manage observability instrumentation across dozens of teams quickly in a complex ecosystem (CDN, UI, middleware, backend, queues, dbs, etc.).

This talk is for engineers and operations folks who are putting SLIs, SLOs, and Error Budgets into practice. Attendees will come away with concrete examples of how to communicate and implement Error Budgets across multiple teams and diverse service architectures.
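
As a rough illustration of the arithmetic behind an Error Budget, here is a minimal sketch assuming an event-based SLI (good requests divided by total requests) over a single window. It is not Zendesk's algorithm; the function name `error_budget` and the example numbers are hypothetical.

```python
def error_budget(slo_target, total_events, bad_events):
    """Summarize an event-based error budget for one SLO window.

    slo_target is the fraction of events that must be good (e.g. 0.999).
    All names and numbers here are illustrative assumptions.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    # The budget is the number of bad events the SLO tolerates in the window.
    allowed_bad = (1 - slo_target) * total_events
    remaining = allowed_bad - bad_events
    # Fraction of the budget already spent; infinite if there is no budget at all.
    consumed = bad_events / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "remaining": remaining,
        "consumed_fraction": consumed,
    }


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=250)
    print(f"allowed bad events: {budget['allowed_bad']:.0f}")
    print(f"remaining budget:   {budget['remaining']:.0f}")
    print(f"budget consumed:    {budget['consumed_fraction']:.0%}")
```

With a 99.9% target and 1,000,000 requests, about 1,000 failures are allowed, so 250 failures consume roughly a quarter of the budget.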

Fred Moyer, Zendesk, Inc.

Fred is an SRE and the resident SLOgician at Zendesk, where he works to use Error Budgets to deliver best in class reliability for Zendesk's services. Previously he wrangled large scale web systems at Turnitin.com, and earned his Monitoring and Observability wings at Circonus, dealing with large scale time series telemetry implementations. Fred is a Perl White Camel Award winner (2018), and received an award from Google for the first Istio community adapter (2018). He has two young kids, so he needs more sleep when he's not on the stationary bike in his garage. He still likes to hack C, Go, Perl, and Ruby.

1:00 pm–2:30 pm

Luncheon

Sponsored by LaunchDarkly

2:30 pm–4:00 pm

Track 1

Improving a Distributed System Post-Incident

Thursday, 2:30 pm–3:20 pm

Julius Zerwick, DigitalOcean

What happens when a team faces an extended, complex incident that forces them to reevaluate their distributed system's performance and reliability?

In this talk, we will take a deep dive into such an incident faced by the software-defined networking team at DigitalOcean. This incident led us to reevaluate our system's architecture and overhaul key areas in our codebase to improve our monitoring, testing, database interactions, reliability, and system performance.

We'll also explore how DigitalOcean's practices and processes cultivate a blameless culture that enables teams to rally together during high-pressure incidents. Join us as we explore a case study in how a team of engineers can improve a distributed system post-incident!

Julius Zerwick, DigitalOcean

Julius Zerwick is a software engineer at DigitalOcean where he works on software-defined networking and distributed systems. His areas of interest include distributed systems, computer networking, web development, and Go. He lives in New York City and can be found touring national parks when not coding.

How to Recover When You Break Everything in Production

Thursday, 3:20 pm–4:00 pm

Kelsey Fix, Dropbox

When working in production, you will eventually break something and cause an outage. To most seasoned veterans, it's not a matter of if, it's a matter of when. While this might be the case, that's only minimally reassuring to the individual who is going through this experience for the first time. Come hear the story of my first outage, when I ran the command that led to the largest outage in Dropbox history, and how I made it through this SRE rite of passage.

Kelsey Fix, Dropbox

Kelsey currently leads the Persistent Systems SRE team at Dropbox where her team works to create reliable, sustainable, and efficient storage systems.

Track 2

MySQL 8 Observability

Thursday, 2:30 pm–3:05 pm

Michael Coburn, Percona

We will cover the most important observability improvements available in MySQL 8. If you're a Developer or DBA passionate about Observability, or just want to be empowered to resolve MySQL problems quickly and efficiently, you should attend this talk.

Michael Coburn, Percona

Michael joined Percona as a Consultant in 2012 after having worked with high-volume stock photography websites and email service provider platforms. With a foundation in Systems Administration, Michael acted as the Product Manager responsible for the Percona Monitoring and Management (PMM) open-source database observability platform and is now in the role of Principal Architect at Percona.

Asynchronous Computing at Facebook Scale

Thursday, 3:05 pm–3:40 pm

Bo Huang and Carla Souza, Facebook

A lot of work happens behind the scenes that requires time and processing that should not block real-time actions in our products. Things like notifications, event invites, and video rendering may entail long waits or require a large amount of processing power. This talk details how Facebook handles asynchronous processing at large scale, and the challenges that come with maintaining the reliability of a large multi-tenant system that is constantly growing in demand.

Bo Huang, Facebook

Bo is a software engineer with over 10 years of experience in distributed systems and developer tools. At Facebook, he helps build asynchronous computing systems that support a scale of 2 billion users. Prior to Facebook, he worked on Visual Studio and Data Platform tools at Microsoft.

Improve Observability Using Resilience Engineering

Thursday, 3:40 pm–4:00 pm

Chinmay Parida, Nike

Do you worry about the Observability of your services? Have you recently had an incident that exposed gaps in the overall Observability of your services? Have you noticed that not every product team treats Observability the same? This talk walks you through how to validate and improve the Observability of your services continuously by applying Resilience engineering.

Chinmay Parida, Nike Inc.

Chinmay Parida's background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. He has worked in Software Engineering since 2005, first as a SWE and then as an SRE for the last 9 years. Currently he works as a Director of SRE at Nike, managing a team of SREs responsible for building Observability solutions, practicing Resilience engineering, and improving the overall reliability of services and infrastructure at Nike.

4:00 pm–4:30 pm

Break with Refreshments

4:30 pm–5:30 pm

Closing Plenary Session

Program Co-Chairs: Tammy Butow, Gremlin, and Emil Stolarsky, Incident Labs

Leaving the Ivory Tower: Research in the Real World

Thursday, 4:30 pm–5:10 pm

Armon Dadgar, HashiCorp

Academic research often has a reputation of being insular and seldom being used in the real world. At HashiCorp, we've had a long tradition of basing our tools and products on academic research. We look at research for the initial design of products, and for the ongoing development of new features. Our industrial research group, HashiCorp Research, has even published novel work. In this talk, we cover why we care, how we incorporate research, and what has been particularly useful for us.

Armon Dadgar, HashiCorp

Armon has a passion for security and distributed systems and their application to real-world problems. As a co-founder and CTO of HashiCorp, he brings both those interests into the world of DevOps tooling. As a former practitioner and proponent of open source software, he has designed and implemented the HashiCorp products to solve end-user problems, while applying the state of the art from academic research. He also leads the HashiCorp Research group, focused on industrial research in the security and large scale system management space.