SREcon19 Americas Conference Program

Monday, March 25

7:45 am–8:45 am

Continental Breakfast

Grand Ballroom Foyer
Sponsored by Twitter

8:45 am–9:00 am

Welcome and Opening Remarks

Grand Ballroom ABCD
Program Co-Chairs: Liz Fong-Jones, Honeycomb, and Mike Rembetsy, Bloomberg

9:00 am–10:00 am

Opening Plenary Session

Grand Ballroom ABCD

What Breaks Our Systems: A Taxonomy of Black Swans

Monday, 9:00 am–9:30 am

Laura Nolan, Slack

Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time.

By definition, you cannot predict true black swans. But black swans often fall into certain categories that we've seen before. This talk examines those categories and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, hidden system dependencies, and more.

Laura Nolan, Slack

Laura Nolan's background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'.

Complexity: The Crucial Ingredient in Your Kitchen

Monday, 9:30 am–10:00 am

Casey Rosenthal, Verica.io

Software engineering is basically rocket science, so it comes as no surprise that we can learn a lot from that industry. For example, the Challenger explosion in 1986 is a fascinating subject for study. The details of the incident are well documented from a variety of angles (engineering, political, sociotechnical, ethnographical, etc.), providing a rich dataset. Highlighting a few examples from this, we can empathize with the architecture considerations and organizational issues that engineers faced at NASA during that time. There are strong, informative parallels between the events that led up to that tragic incident and how software engineers think about reliability today. As Churchill allegedly quipped, "Never let a good crisis go to waste."

Casey Rosenthal, Verica.io

CEO/Founder of Verica.io. Previously an engineering manager for the Traffic Engineering Team and the Chaos Engineering Team at Netflix. As an executive manager and senior architect, Casey has managed teams to tackle Big Data, architect solutions to difficult problems, and train others to do the same. He finds opportunities to leverage his experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. For fun, Casey models human behavior using personality profiles in Ruby, Erlang, Elixir, Prolog, Scala, and other languages. He speaks frequently on the topics of Chaos Engineering and Complexity.

10:00 am–10:30 am

Break with Refreshments

Grand Ballroom Foyer and Northside Ballroom
Sponsored by PayPal

10:30 am–12:10 pm

Track 1

Grand Ballroom ABC

Case Study: Implementing SLOs for a New Service

Monday, 10:30 am–11:00 am

Arnaud Lawson, Squarespace

Implementing service level objectives (SLOs) effectively is a hard task, especially for a service which not only is new within your engineering and product organizations but also encompasses both a request-driven and a storage subsystem.

In this talk, I will discuss our experience defining and measuring service level indicators (SLIs) and objectives for our Ceph Object Storage service. I will describe our approach in specifying service level indicators plus the tradeoffs and implementation decisions we made when it came to measuring various types of SLIs, including availability, latency, and durability.

I will also share the lessons learned and benefits gained from our implementation. You will understand why SLOs are crucial for site reliability engineers and service users and will be given some tips on how to implement them for either a request-driven or a storage system.
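As a rough illustration of the kind of request-driven SLI this abstract mentions, availability is commonly expressed as the fraction of good requests over total requests and then compared against an SLO target. The sketch below is not the speaker's implementation: the function names, the request counts, and the 99.9% target are all assumptions chosen for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is violated)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 1,000,000 requests, 400 failures, 99.9% availability target.
sli = availability_sli(good_requests=999_600, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"availability={sli:.4%}, error budget remaining={budget:.0%}")
```

Latency and durability SLIs would need different measurements (for example, a latency threshold per request or periodic object integrity checks), which this sketch does not attempt to cover.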

Arnaud Lawson, Squarespace

Arnaud is a Senior Site Reliability Engineer at Squarespace in New York, where—among other things—he has led the productionization of Ceph as a storage backend used by many Squarespace services.

Fixing On-Call When Nobody Thinks It's (Too) Broken

Monday, 11:05 am–11:35 am

Tony Lykke, Hudson River Trading

What's a team to do when they receive more than 30 pages a day, every day, for almost a decade? Deny there's a problem, of course! Join me as we relive the data-informed journey from around 70,000 pages over 7 years (~200/week) to under 50/week in just a few short months. The approach shows those carrying the pager that improvement is possible and empowers them to keep questioning and improving the status quo. We'll look at not only the technical challenges but also non-technical ones, like getting buy-in when nobody thinks there's a problem and managing risk when the on-call team is concerned about silencing legitimate pages along with the noise.

Tony Lykke, Hudson River Trading

Tony is an SRE on the trade systems team at Hudson River Trading based in NYC, where he gets to tackle hard (often not just technically) automation problems and tech debt cleanup projects across a variety of environments. He is obsessively anti-toil, and regularly refuses to accept "that's just the way it is" as an answer.

Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value

Monday, 11:40 am–12:10 pm

Aaron Wieczorek, United States Digital Service

How do you monitor systems that don't want to be monitored or ones that you don't have internal access to? Why monitor these systems at all? The United States Digital Service finds the truth and tells the truth, and fights fires across government, even when those fires don't want to be found. We put together a system to black box monitor all 25,000 .GOV domains and then expanded to perform more robust monitoring of important citizen-facing, government-provided services so we can go where the work is and restore services. In the process, we're hoping to change the culture and prove the value of SRE teams across government. This is how we're doing it.

Aaron Wieczorek, United States Digital Service

Aaron is a Site Reliability Engineer on the United States Digital Service Headquarters team. He works on hard technical problems and hard bureaucratic problems, from infrastructure to CI/CD pipelines to network engineering.

Track 2

Grand Ballroom D

Keeping the Balance: Internet-Scale Loadbalancing Demystified

Monday, 10:30 am–11:00 am

Laura Nolan, Slack; Murali Suriar, Google

Can you explain the entire path that an IP packet takes from your users to your binary? What about a web request? Do you understand the tradeoffs that different kinds of load balancing techniques make? If not, this talk is for you.

Load balancing is hard, and it is made up of many disparate technologies. It cuts across network, transport, and application layers. We'll describe different flavours of load balancing (network, naming, application) and how they are composed together by cloud providers and other large Internet companies to provide fast, reliable, multi-region services.

Laura Nolan, Slack

Laura Nolan's background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book, as well as contributing to the more recent 'Seeking SRE'.

Murali Suriar, Google

Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running cluster filesystem and locking services. Left Google to get on a boat. Got bored and came back.

Aperture: A Non-Cooperative, Client-Side Load Balancing Algorithm

Monday, 11:05 am–11:35 am

Ruben Oanta, Twitter

Twitter's RPC framework, Finagle, employs non-cooperative, client-side load balancing. That is, clients make load balancing decisions independently. Although this architecture continues to serve Twitter well, it also comes with some unique trade-offs and challenges. In particular, it scales poorly as service clusters grow to thousands of instances. In this talk, we will dive deeper into the problem space and how we addressed it via an algorithm we call "Aperture."

Ruben Oanta, Twitter

Ruben has been working on Twitter’s RPC stack for the past five years. In that time, he has made substantial contributions to both the design and implementation of Finagle which have markedly improved the resiliency and operability of Twitter services.

Capacity Prediction in External Services

Monday, 11:40 am–12:10 pm

Jerome Kraus, Alaska Airlines

Applications are often limited by resources in third-party external systems. As an SRE, I want to be able to predict when, and under what conditions, those resources will be exhausted, so that we can take pre-emptive remedial action and plan appropriately. In this talk, I will describe how we use linear regression analysis to generate a predictive model that empowers us to properly plan and size external services, and to adapt to changes in our application.
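To make the idea concrete, here is a minimal sketch of how a linear regression could be used to estimate when an external resource runs out. It is not the model presented in the talk: the helper names, the sample observations, and the 500-connection limit are hypothetical, and it assumes the relationship between application load and external resource usage is roughly linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def load_at_limit(slope, intercept, limit):
    """Application load at which the fitted line reaches the external resource limit."""
    return (limit - intercept) / slope


# Hypothetical observations: requests/sec in our application vs. connections
# held open in a third-party system that allows at most 500 connections.
requests_per_sec = [100, 200, 300, 400, 500]
external_connections = [60, 110, 160, 210, 260]

slope, intercept = fit_line(requests_per_sec, external_connections)
max_load = load_at_limit(slope, intercept, limit=500)
print(f"connections ~= {slope:.2f} * rps + {intercept:.1f}; limit reached near {max_load:.0f} rps")
```

Refitting the line after each application change gives a simple way to notice when the slope (external resource cost per request) has shifted.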

Jerome Kraus, Alaska Airlines

Jerome Kraus is a Senior Software Development Engineer/SRE with Alaska Airlines. He has 20 years of software development engineering experience and 15 years with Alaska Airlines in Seattle, WA. He has been practicing Site Reliability Engineering for the past three years assisting Alaska Airlines' e-commerce application delivery teams to produce reliable and robust software solutions.

12:10 pm–1:40 pm

Luncheon

Grand Ballroom EFGHI
Sponsored by NS1

1:40 pm–3:20 pm

Track 1

Grand Ballroom ABC

How Did Things Go Right? Learning More from Incidents

Monday, 1:40 pm–2:10 pm

Ryan Kitchens, Netflix

Solely learning from failure isn't a fundamental—it's a limitation.

A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure but rather the presence of adaptive capacity.

Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention.

  • What's going on when it seems like nothing is happening?
  • When failure does occur, what's going to keep it from being worse?
  • How do teams adapt successfully when preventative techniques fail?
  • How should we prioritize the effort to develop systems that help us safely manage the consequences of failure?

These questions cannot be answered by trying to explain the causes of failure and fixing remediation items.

We will move the needle forward and increase our opportunity for learning from success with some fundamental and practical ways that get us from, "Why did things go wrong?" to "How did things go right?"

Ryan Kitchens, Netflix

Ryan Kitchens is a Site Reliability Engineer on the Core team at Netflix where he works on building capacity across the organization to ensure its availability and reliability. Before that, Ryan was a founding member of the SRE team at Blizzard Entertainment.

Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way

Monday, 2:15 pm–2:45 pm

Michael Kehoe and Todd Palino, LinkedIn

We will look at the process for Code Yellow, the term we use for this process of "righting the ship," and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.

Michael Kehoe, LinkedIn

Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automation.

Todd Palino, LinkedIn

Todd Palino is a Senior Staff Engineer in Site Reliability at LinkedIn on the Capacity Engineering team, where his team is creating a framework for application capacity measurement, analysis, and change intelligence. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and is the co-author of Kafka: The Definitive Guide, now available from O'Reilly Media. Out of the office, you can find Todd at conferences like SREcon and LISA, sharing his experience from years in SRE technical leadership, and at Kafka Summit or ApacheCon talking about how to feed and water Kafka infrastructures. Or maybe out on the trails, training for the next marathon.

Creating a Code Review Culture

Monday, 2:50 pm–3:20 pm

Johnathan Turner, Squarespace

Code review is one of the best ways to keep code quality high, and for engineering teams to communicate their best practices and patterns. But how do organizations build a sustainable code review culture? This talk explores best practices for introducing code review to teams and looks at how to improve the code review process from the perspective of organizations, code authors, and code reviewers.

Johnathan Turner, Squarespace

Johnathan Turner is a Site Reliability Engineer at Squarespace, where he works on tooling and processes that enable product engineers to care less about infrastructure. He spends his spare time playing guitar, reading comic books, listening to heavy metal, and thinking about how to make software more human-centric.

Track 2

Grand Ballroom D

Benefits of Taking the Less Traveled Road with Containers Infrastructure

Monday, 1:40 pm–1:55 pm

Eduard Iacoboaia, Booking.com

After almost a year of running OpenShift Origin, we decided to migrate to a vanilla Kubernetes setup, and during this phase we had to make some hard decisions. This talk will explain the reasons, the benefits, and the technical details of some of the decisions we made that were not mainstream at the time.

Eduard Iacoboaia, Booking.com

I'm a Senior Systems Administrator working for more than 5 years at Booking.com. During the first years, I worked on several teams, some of them managing infra for more than a hundred services. I saw the need for a change in process and I'm happy to see our containers infrastructure evolving from a hackathon project to an entire track in the core-infra department.

The Ops in Serverless

Monday, 1:55 pm–2:10 pm

Jennifer Davis, Microsoft

In this talk, we will examine the increased need for specialized Operations Engineering in the Age of Serverless. We'll use the serverless platform to explore three critical areas of operational readiness: testing, monitoring, and debugging.

Jennifer Davis, Microsoft

Jennifer Davis is a Senior Cloud Advocate at Microsoft. She is also the co-author of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Testing in Production at Scale

Monday, 2:15 pm–2:45 pm

Amit Gud, Uber

Once frowned upon, testing in production has started to become a viable solution, especially in the microservices architecture. We present a case study of implementing testing in production at one such large-scale organization. This talk provides insights into real-world testing in production architecture. This is the talk for you if large-scale integration and load testing are on your mind.

Amit Gud, Uber

Having worked for multiple companies in the storage and systems domain, from startups to multi-billion companies, Amit has a track record of tackling issues relating to performance and scalability. Amit has a master's degree from Kansas State University. He has worked on multiple research papers and has authored multiple (pending) patents.

Tackling Kafka, with a Small Team

Monday, 2:50 pm–3:20 pm

Jaren Glover, Robinhood

This is a story about what happens when a distributed system becomes a big part of a small team's infrastructure. This distributed system was Kafka and the team size was one engineer. I will discuss my failures along with my journey of deploying Kafka at scale with very little prior distributed systems experience. This presentation will be a tactical approach to conquering a complex system with an understaffed team while your business is growing fast.

Jaren Glover, Robinhood

Jaren Glover is an early engineer at Robinhood. He has spent the last 3 years scaling Robinhood's distributed systems to support its rapid customer growth. He also allocates a large percentage of his time to scaling Robinhood's human capital via new hire mentoring and on-boarding. Jaren lives in the intersection of software, economics, and culture.

3:20 pm–3:50 pm

Break with Refreshments

Grand Ballroom Foyer and Northside Ballroom
Sponsored by Bloomberg

3:50 pm–5:30 pm

Track 1

Grand Ballroom ABC

Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance

Monday, 3:50 pm–4:20 pm

Lynn Root, Spotify

Do you maintain a Rube Goldberg-like service? Perhaps it's highly distributed? Or you recently walked onto a team with an unfamiliar codebase? Have you noticed your service responds slower than molasses? This talk walks you through how to pinpoint bottlenecks, along with approaches and tools for making improvements, so that you seem like the hero! All in a day's work.

The talk will describe various ways of tracing a web service, including black and white box tracing and tracing distributed systems, as well as various tools and external services available to measure performance. I also present a few different rabbit holes to dive into when trying to improve your service's performance.

Lynn Root, Spotify

Lynn Root is an SRE at Spotify in NYC, with historical issues of using her last name as her username, and the resident FOSS evangelist. She is also a global leader of PyLadies and former Vice Chair of the Python Software Foundation Board of Directors. When her hands are not on a keyboard, they are usually holding a bass guitar or a paintbrush.

Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest

Monday, 4:25 pm–4:55 pm

Danny Chen, Bloomberg LP

Loggers and tracers have become crucial components of computing systems, providing invaluable visibility into the runtime behavior of our software. Ironically, these vital components are opaque when it comes to their own runtime behavior. We typically look at loggers only as suspects in performance-related incidents, as part of post-mortem analysis.

Why do we have such blind spots when it comes to components that are pervasively used in our systems? We explore possible explanations and present example solutions.

Danny Chen, Bloomberg LP

Danny Chen started his career almost 40 years ago as a UNIX performance engineer at Bell Laboratories, where he was a co-developer of one of the first general-purpose UNIX kernel tracing facilities (USENIX/1988: "CASPER the Friendly Daemon"). He also contributed to the SVR4 virtual memory implementation (USENIX/1990: "Insuring Improved VM Performance - Some No-Fault Policies"). Since then he has worked on a wide variety of systems ranging from low latency market data to distributed transaction managers—all while tailing logs.

Operating within Normal Parameters: Monitoring Kubernetes

Monday, 5:00 pm–5:30 pm

Elana Hashman, Two Sigma

After Kubernetes takes over your data centers, how can you be sure that it's operating within normal parameters? What does "normal" even mean? By formalizing your expected quality of service, you can measure and compare against known targets with open source tools like Prometheus. In this talk, we'll use Kubernetes as a case study for introducing service level objectives (SLOs) to guide monitoring efforts. Come learn the how and why of metric selection for monitoring Kubernetes quality of service, what gaps exist in the open source Kubernetes monitoring ecosystem, how to use Prometheus and its exporters to establish predictability and "normal" baselines, and how to use this telemetry to debug service degradations in a Kubernetes cluster.

Elana Hashman, Two Sigma

Elana Hashman currently works as a Reliability Engineer at Two Sigma, wrangling Kubernetes clusters and automating operations. She is currently a member of the Kubernetes Instrumentation SIG, where she focuses on benchmarking and metrics usability. In the wider FOSS community, she is a Debian Developer, maintaining the Clojure package ecosystem in Debian and Ubuntu, and a Python Packaging Authority committer, hacking on portable binary Python wheels for Linux.

Track 2

Grand Ballroom D

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager

Monday, 3:50 pm–4:20 pm

Jen Wohlner, Fastly

SRE and product management—do those even go together? Yes! In this talk, we'll go over small ways and big strategies to form sustainable, impactful relationships with your users and build products that they love whether or not your SRE team has an official product manager. SRE teams' users are other engineers, data scientists, designers, and anyone else who pushes code at your company. It's not enough to build perfectly engineered platforms and tooling. SRE teams must build scalable, opinionated, USABLE products and workflows. This talk will give you the framework to get there.

Jen Wohlner, Fastly

Jen Wohlner is a product manager for platform engineering at Fastly, an edge cloud platform that provides a content delivery network, Internet security products, load balancing, and video and streaming services for major companies across the globe. Previously, Jen worked as a senior technical program manager for site reliability engineering at LinkedIn where she had a special focus on resiliency projects across LinkedIn's stack and at BuzzFeed where she led the software infrastructure and tools infrastructure groups. In her spare time, Jen runs marathons, cooks feasts, draws, makes ceramics, and serves as vice chair on the board of directors for the Point Foundation, the nation's largest LGBT scholarship fund for higher education.

Shipping Software with an SRE Mindset

Monday, 4:25 pm–4:55 pm

Theo Schlossnagle, Circonus

Most SRE techniques revolve around resiliency and reliability of service delivery. Most "product" is the type of product that is deployed, not shipped. At Circonus, we deal with a lot of on-premise software shipment due to hybrid customer requirements. It turns out that many SRE techniques can apply directly to the construction, packaging, and shipment of installed software as well. In this talk, we'll learn all about it.

Theo Schlossnagle, Circonus

The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded four technology startups focusing on large systems scalability and distributed systems. He is a Distinguished Member of the ACM and sits on the ACM Practitioners Board and serves as co-chair for the ACM Queue.

Using PRDs and User Journeys to Design User-Friendly Tools

Monday, 5:00 pm–5:30 pm

Gwendolyn Stockman, Google

Implementing software is one core aspect of the SRE role. Often this software will be used by multiple teams. SREs need to make sure that what they build is easy to use and understandable by all users. Product Requirement Documents (PRDs) can help collect and prioritize requirements for tooling and other software. But how do you write a good PRD?

Gwendolyn Stockman, Google

Gwendolyn Stockman has worked at Google since 2008, first as an SWE and then, for the last 5 years, as an SRE. She is on the Customer Reliability Engineering team, which she joined after being in a similar group that works with teams within Google launching to production. Before helping services launch, she learned to be an SRE on a team working with the bandwidth of Google's internal network.

5:30 pm–6:30 pm

Happy Hour

Grand Ballroom EFGHI

Tuesday, March 26

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer and Northside Ballroom
Sponsored by Microsoft Azure

9:00 am–10:30 am

Track 1

Grand Ballroom ABC

SRE Classroom - How to Design a Distributed System in 3 Hours

Tuesday, 9:00 am–12:30 pm

Ryan Thomas, JC van Winkel, Phillip Tischler, and Jennifer Mace, Google

Participants in this workshop will learn principles of systems design, and work in small groups to apply the concepts to designing a distributed system. This workshop emphasizes design skills for the real world, including how to integrate third-party or Cloud-based software components into your own systems.

Ryan Thomas, Google

Ryan is a Site Reliability Engineering Manager at Google Australia, and currently manages the Accelerated Storage SRE team. Ryan is passionate about the design, implementation, and operation of large-scale distributed systems, and sharing his experiences with anyone interested in SRE. In his spare time, Ryan enjoys messing about with the Rust programming language.

JC van Winkel, Google

JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the SRE education team, SRE EDU.

Phillip Tischler, Google

Phillip Tischler is a Senior Software Engineer & Site Reliability Engineer at Google NYC. Phillip is currently SRE Tech Lead of ACL-d Search, which is search over data with permissions/sharing. Phillip also works on general indexing and search, aggregations, and low latency serving. Phillip is passionate about large-scale distributed systems and autonomous robotics. In his spare time, Phillip is a Director of the AUVSI SUAS Competition and mentors local FIRST robotics teams.

Jennifer Mace, Google

Macey is a Senior Site Reliability Engineer at Google Seattle, where she wrangles the world's largest fleet of Kubernetes clusters under the banner of GKE. Previously the tech lead of Display Ads SRE, she has contributed to the latest SRE Workbook on topics from Incident Management to the interplay between load balancing and autoscaling systems. Ask her about multi-single-tenancy, and why that phrase should give you nightmares.

Track 2

Grand Ballroom D

Migrating a Monolith to the Cloud

Tuesday, 9:00 am–9:30 am

Keyur Govande, Etsy

After over a decade of hosting itself in the data center, Etsy.com moved to the Google Cloud Platform (GCP) in 2018. In this talk, I'll go over:

  • why the company decided to make the transition
  • our architectural approaches to migrating a large monolith, and the difficulties we faced gaining confidence in them
  • the assumptions we never knew about and had to fix: in the application code, the infrastructure tooling, and our processes
  • cutting over to GCP, safely
  • things we learnt running there for the last 9+ months

Keyur Govande, Etsy

Keyur is the Chief Architect at Etsy. He has led multiple large architectural changes during his tenure, most recently the move to Google Cloud. Prior to this role, he was a key member of the Systems Engineering team helping scale the site and keeping PHP, Gearman, MySQL, Memcached, Redis, and the Linux kernel running smoothly.

An Introduction to GraphQL

Tuesday, 9:30 am–10:00 am

Nat Welch, Google

GraphQL is a data sharing schema from Facebook. This talk will introduce the schema, its common uses, and its pros and cons versus other data formats. Nat will also talk about some things to consider when using GraphQL in production, as well as common problems people encounter while running GraphQL deployments and how to combat them.

Nat Welch, Google

Nat Welch is an SRE based in Brooklyn, NY, and the author of "Real World SRE" from Packt Publishing. He currently works for Google on the Customer Reliability Engineering team. In the past, he has worked for First Look Media, Hillary for America, iFixit, and others.

Service Discovery Challenges at Scale

Tuesday, 10:00 am–10:30 am

Ruslan Nigmatullin, Dropbox, Inc.

We'll discuss the challenges one faces while building service discovery at a scale of millions of processes, tens of millions of clients, and tens of thousands of state changes per second.

Ruslan Nigmatullin, Dropbox, Inc.

Ruslan Nigmatullin is a Software Engineer on the Traffic team at Dropbox. Before that, he was a Software Engineer on the Internal Components Team at Yandex.

10:30 am–11:00 am

Break with Refreshments

Grand Ballroom Foyer and Northside Ballroom
Sponsored by Catchpoint

11:00 am–12:30 pm

Track 1 (continued)

Grand Ballroom ABC

SRE Classroom - How to Design a Distributed System in 3 Hours

Tuesday, 9:00 am–12:30 pm

Ryan Thomas, JC van Winkel, Phillip Tischler, and Jennifer Mace, Google

Track 2

Grand Ballroom D

Inside the Kube: A Guided Tour of Kubernetes Cluster Setup

Tuesday, 11:00 am–12:30 pm

Liz Frost, VMware

Available Media

A lot of SREs are (or will soon be) responsible for Kubernetes clusters. But what exactly makes up Kubernetes? This talk will dive into the services and systems that make a cluster work, how they interact, and what can go wrong. Kubernetes will no longer be a black box, but a system that can be debugged, reconfigured, and improved to suit every administrator's needs.

Liz Frost, VMware

Liz Frost is a Kubernetes contributor and engineer at VMware, née Heptio. She is also a dog mom, queer woman, and occasionally a colorful pony.

12:30 pm–2:00 pm

Luncheon

Grand Ballroom EFGHI
Sponsored by LaunchDarkly

2:00 pm–3:30 pm

Track 1
Track 2

Grand Ballroom D

What I Wish I Knew before Going On-call

Tuesday, 2:00 pm–3:30 pm

Chie Shu, Dorothy Jung, and Wenting Wang, Yelp

Available Media

Firefighting a broken system is time-sensitive and stressful but becomes even more challenging as teams and systems evolve. As an on-call engineer, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we'll share common myths among new on-call engineers and the Do's and Don'ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes.

Chie Shu, Yelp

Chie Shu is a backend Software Engineer at Yelp. She has worked on improving Yelp's revenue-critical Ads data pipeline to be more resilient to system failures, and designed heuristics used internally by executives and Product Managers to assess the financial impact of on-call incidents. Chie holds a bachelor's degree in Computational Biology from Cornell University.

Dorothy Jung, Yelp

Dorothy Jung is a Software Engineer with multiple years of on-call experience. At Yelp she served as a "pushmaster", managing and monitoring company-wide deployments to production; and as a release engineering deputy, helping to set up CI/CD pipelines within the Ads organization. She was previously at DreamWorks Animation R&D, where she worked on upgrading the studio's build management tools. Dorothy holds a bachelor's degree in Computer Science and French from the University of California, Berkeley.

Wenting Wang, Yelp

Wenting Wang is a Software Engineer with three years of industry experience. She has been on-call for different teams at Yelp: on the BizApp backend team, where she worked closely with mobile developers and monitored mobile user traffic; and on the Ads team, where she currently develops and maintains revenue-critical real-time processing systems. Wenting received her master's degree in Computer Science from Shanghai Jiao Tong University and was previously a doctoral candidate in Computer Science focusing on distributed systems at the University of Illinois at Urbana-Champaign.

3:30 pm–4:00 pm

Break with Refreshments

Grand Ballroom Foyer and Northside Ballroom
Sponsored by Circonus

4:00 pm–5:30 pm

Track 1 (continued)
Track 2

Grand Ballroom D

Running Excellent Retrospectives: Talking for Humans

Tuesday, 4:00 pm–5:30 pm

Courtney Eckhardt, Heroku, a Salesforce company; Lex Neva, Fastly

Available Media

How many awful meetings have you been to in your life, where people are talking forever and saying nothing, or where people are talking at cross purposes and not listening, or where they're saying things that make everyone feel bad? Have you been in retrospectives like that? (Did it make you never want to attend a retrospective again?)

Let's do better! Come learn practical techniques for facilitating pleasant, productive, welcoming retrospectives (which will improve any meeting you need to run). We will talk about the structure of welcoming language and discuss when it's necessary to interrupt someone. We'll examine what it means for language to include blame and how to reframe blaming conversations. We'll practice the mental work of understanding things that seem contrafactual but are actually just confusing (especially helpful for discussing complex systems). When you leave, you'll be ready to make any meeting or retrospective you're in more comfortable and effective, as a leader or an attendee.

Courtney Eckhardt, Heroku, a Salesforce company

Courtney Eckhardt first got into retrospectives when she signed up for comp.risks as an undergrad (and since then, not as much has changed as we'd like to think). Her perspectives on engineering process improvement are strongly informed by the work of Kathy Sierra and Don Norman (among others).

Lex Neva, Fastly

Lex has six years of experience keeping large services running, including Linden Lab's Second Life, Deviantart.com, Heroku, and his current position at Fastly. While originally trained in computer science, he's found that he most enjoys applying his software engineering skills to operations. A veteran of many large incidents, he has strong opinions on incident response, on-call sustainability, and the intersection of culture and SRE, and he currently runs the SRE Weekly newsletter.

5:30 pm–7:30 pm

Reception

Grand Ballroom EFGHI
Sponsored by Packet

7:30 pm–9:00 pm

Lightning Talks

Grand Ballroom D

Available Media
  • Livetweeting Tech Conferences
    Bridget Kromhout, Microsoft
  • 5 Insights from 200 SREs on How Incident Response Affects Them
    Jaime Woo, Dawn Parzych, Catchpoint
  • Distributed Systems Need Deadlines
    Paul Henry, Coinbase
  • Doughnut Dilemma: A Lesson in Resource Managers
    Ravi Lachhman, AppDynamics
  • Automating SRE Work: Focusing on High-Return Customer and Business Outcomes
    Aniket Kulkarni, PayPal
  • Durable Disorder
    Anthony Sandoval, GitLab Inc
  • The Operation Maturity Model
    Matthew Fornaciari, Gremlin, Inc.
  • "Monitoring and Alerting, Ain't Nobody Got Time for That": How USDS Bootstrapped Basic SRE Best Practices a Week before Launch at FEMA
    David Holmes, USDS

Wednesday, March 27

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:40 am

Track 1

Grand Ballroom ABC

Optimizing for Learning

Wednesday, 9:00 am–9:30 am

Logan McDonald, BuzzFeed

Available Media

The talk is about the most powerful observability system SREs have at their disposal: the human mind! I draw from cognitive science to discuss how we can improve how we learn and store information about our systems in our brains in order to respond better to incidents and anomalies. It's a talk broken into four parts: preparing to learn, gaining knowledge, building mental models, and enabling a team to learn well together.

Logan McDonald, BuzzFeed

Logan is a security-focused Site Reliability Engineer at BuzzFeed, based in New York City. She is a maintainer of BuzzFeed's open source centralized sign-on platform, sso, and has written for dev.to and Increment Magazine. She is obsessed with learning, but especially with the learning process that accompanies onboarding people new to security, operations tooling, and concepts. If Logan has a personal brand, she hopes it is "Friendly Neighborhood Operations Engineer."

Zero to SRE

Wednesday, 9:35 am–10:05 am

Kim Schlesinger, ReactiveOps & diversity

Available Media

Being able to transform a junior engineer into an excellent mid, then senior engineer is a competitive advantage for any company. Unfortunately, there aren't many entry-level SRE job postings, and if your company hasn't hired juniors before, you'll need to make changes in order to create an environment where they can thrive.

This talk is the story of a junior Web Developer turned SRE. I've been able to successfully transition into my role because my company has embraced junior engineers by creating a 'Culture of Error,' encouraging all engineers to be mentors, and ensuring that all employees take time during the day to learn new skills.

By the end of this talk, I'll have shared the specific details, and you will have a roadmap for how to support junior SREs during their first day, month, 90 days, and year.

Kim Schlesinger, ReactiveOps & diversity

Kim Schlesinger is a Site Reliability Engineer at ReactiveOps. Prior to being an SRE, Kim was an Instructor, Web Developer, and Curriculum Designer for the Full-Stack Immersive Program at Galvanize, a code school based in Denver, Colorado.

In her spare time, Kim is active in the Colorado Chapter of Leadership for Educational Equity, and she is the co-founder of diversity, a company that is striving to make the tech industry more equitable.

One on One SRE

Wednesday, 10:10 am–10:40 am

Amy Tobey, GitHub

Available Media

When Amy started at GitHub, support for SRE principles and technical solutions was well underway. What was missing was how to handle the human side: how can a group of individual contributors influence a company to prioritize reliability? To that end, she created the 1:1 SRE outreach and 1:1 incident debrief programs for the purpose of growing GitHub's culture of resilience by embracing the values of empathy and psychological safety. This talk will cover how the programs work, how they were launched, and real-world outcomes.

Amy Tobey, GitHub

Amy has worked in web operations for 20 years at companies of every size, touching everything from kernel code to user interfaces. When she's not working she can usually be found around her home in San Jose, caring for her family, practicing piano, or running slowly in the sun.

Track 2

Grand Ballroom D

Scaling SRE Organizations: The Journey from 1 to Many Teams

Wednesday, 9:00 am–9:30 am

Gustavo Franco, Google

Available Media

In this talk, the author will share their experience starting new teams, splitting and moving them from both technical and non-technical standpoints. This is ideal for new leaders in charge of SRE wondering when it's time to grow beyond a single team and how to do it. This is also very valuable for SREs who are interested in what happens behind the scenes, how to influence such changes, and how they can help while avoiding burnout.

Gustavo Franco, Google

Gustavo Franco is a Customer Reliability Engineer at Google, working on learning more about, helping to define, and expanding the reach of SRE. He's been at Google since 2007 and has started, moved, and managed several SRE teams, such as Google Plus Frontend, BreakFix, Horizon Web, Cluster Turnups, Apps Media, Apps Messaging, G Suite, and Cloud Identity.

The Curse of SRE Autonomy and How to Manage It

Wednesday, 9:35 am–10:05 am

Richard Bondi, Google

Available Media

Within an SRE organization, teams usually develop very different automation tools and processes for accomplishing similar tasks. Some of this can be explained by the software they support: different systems require different reliability solutions. But many SRE tasks are essentially the same across all software: compiling, building, deploying, canarying, load testing, managing traffic, monitoring, and so on.

There are two puzzles here: why does this diversity exist, and how can it be overcome so that SRE teams stop duplicating their development efforts?

This talk presents a solution to both puzzles using the ten-year history of a single SRE tool. The tool is used only internally at a large company. It is one of the rare tools there that has been adopted widely by our very large SRE organization.

Richard Bondi, Google

Richard Bondi has been an engineer at Google since 2011, specializing in the entire web stack and working on travel applications. In 2016 he converted to SRE, and then joined the SRE tech writer team. Before Google, and after leaving his political philosophy PhD program to join the first of many Internet startups, he published a book on cryptography with Wiley of which Bruce Schneier wrote: "This is essential reading for anyone who wants to understand the Microsoft CryptoAPI..."

Learning from Learnings: Anatomy of Three Incidents

Wednesday, 10:10 am–10:40 am

Randy Shoup, WeWork

Available Media

The best response to a system outage is not "What did you do?", but "What did we learn?" This session will walk through three system-wide outages at Google, at Stitch Fix, and at WeWork—their incidents, aftermaths, and recoveries. In all cases, many things went right and a few went wrong; also in all cases, because of blameless cultures, we buckled down, learned a lot, and made substantial improvements in the systems for the future. Looking back with the perspective of 20-20 hindsight, all of these incidents were seminal events that changed the focus and trajectory of engineering at each organization. You will leave with a set of actionable suggestions in dealing with customers, engineering teams, and upper management. You will also enjoy a few war stories from the trenches, none of which has been previously told fully in public.

Randy Shoup, WeWork

Over the past several decades, Randy Shoup has led high-performing engineering teams at eBay, Google, Stitch Fix, and WeWork. A long-time advocate of DevOps practices, Randy specializes in scaling engineering organizations, company cultures, and technology infrastructures. He is as excited to talk about empathy and learning cultures as he is to discuss distributed systems, microservices, or containers.

10:40 am–11:10 am

Break with Refreshments

Grand Ballroom Foyer
Sponsored by Two Sigma

11:10 am–12:50 pm

Track 1

Grand Ballroom ABC

Fault Tree Analysis Applied to Apache Kafka

Wednesday, 11:10 am–11:40 am

Andrey Falko, Lyft

Available Media

At last year's SREcon, we were inspired by talks that introduced fault tree analysis. We decided to apply the technique to bulletproof our Apache Kafka deployments. In this talk, learn about fault tree analysis and what you should focus on to make your Apache Kafka clusters resilient.

Andrey Falko, Lyft

Andrey Falko is one of the first Reliability Software Engineers hired at Lyft, where he has been for seven months. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years where he researched Kafka and Pulsar performance and reliability. While there, he also built an IaaS system, many CI/CD systems, a Zipkin service, and features for the Salesforce platform.

Strategies to Edit Production Data

Wednesday, 11:45 am–12:15 pm

Julie Qiu, Google

Available Media

At some point, we all find ourselves at a SQL prompt making edits to the production database. We know it’s a bad practice and we always intend to put in place safer infrastructure before we need to do it again—what does a better system actually look like?

This talk progresses through 5 strategies for teams using a Python stack to do SQL writes against a database, to achieve increasing safety and auditability:

  1. Develop a process for raw SQL edits
  2. Run scripts locally
  3. Run scripts on an existing server
  4. Use a task runner
  5. Build a Script Runner service

We’ll talk about the pros and cons of each strategy and help you determine which one is right for your specific needs.

By the end of this talk you’ll be ready to start upgrading your infrastructure for making changes to your production database safely!
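
As a rough illustration of what moving from a raw SQL prompt to a script (strategy 2) can buy, the sketch below wraps a single edit in a transaction, checks how many rows it touched, and rolls back unless explicitly told to commit. It is not taken from the talk: it uses Python's standard sqlite3 module with an in-memory database as a stand-in for production, and the users table, the email column, and the apply_edit helper are all hypothetical.

```python
import sqlite3


def apply_edit(conn, sql, params, expected_rows, dry_run=True):
    """Run one write inside a transaction and commit only when it is safe.

    Rolls back if the statement touches an unexpected number of rows,
    or if dry_run is True (the default).
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        affected = cur.rowcount
        if affected != expected_rows:
            conn.rollback()
            raise RuntimeError(
                f"expected {expected_rows} row(s), statement touched {affected}"
            )
        if dry_run:
            conn.rollback()
        else:
            conn.commit()
        return affected
    finally:
        cur.close()


def demo():
    """Apply the same edit as a dry run, then for real, on a throwaway database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES ('old@example.com')")
    conn.commit()

    sql = "UPDATE users SET email = ? WHERE id = ?"
    query = "SELECT email FROM users WHERE id = 1"

    apply_edit(conn, sql, ("new@example.com", 1), expected_rows=1)
    after_dry_run = conn.execute(query).fetchone()[0]

    apply_edit(conn, sql, ("new@example.com", 1), expected_rows=1, dry_run=False)
    after_commit = conn.execute(query).fetchone()[0]

    conn.close()
    return after_dry_run, after_commit
```

Defaulting to a dry run and requiring the caller to state the expected row count are two cheap guards against the classic missing-WHERE-clause mistake; a script like this can also be code reviewed and logged, which a session at a SQL prompt cannot.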

Madaari: Ordering for the Monkeys

Wednesday, 12:20 pm–12:50 pm

Ashutosh Raina and Ramprasad Ellupuru, eBay

Available Media

Lineage Driven Fault Injection (LDFI) is a state-of-the-art technique in chaos engineering experiment selection. As SREs, we would like to perform chaos experiments that reveal the bugs that customers are most likely to hit first. In this talk, we present new improvements to LDFI that order the experiment suggestions.

In the first half of the talk we will introduce LDFI as a technique that can be widely used within an enterprise. We also highlight how ordering is a general-purpose technique that we can use to encode the peculiarities of a heterogeneous microservices architecture. LDFI can work in an enterprise by harnessing the observability infrastructure to model the redundancy of the system.

Next, we present experiments conducted within eBay using ordered LDFI and some preliminary results. We show examples of services where we discovered bugs, and how carefully controlling the order of experiments allowed LDFI to avoid running unnecessary experiments.

We will discuss open problems and future direction of LDFI.

Key takeaways:

  1. Understand how LDFI can be integrated in the enterprise by harnessing the observability infrastructure
  2. Limitations of LDFI w.r.t unordered solutions and why ordering matters for chaos experiments
  3. Preliminary results of prioritized LDFI and a future direction for the community

No prior knowledge of LDFI is required.

Ashutosh Raina, eBay

Ashutosh is a member of the Site Reliability team at eBay focussed on bringing LDFI to the enterprise. He works at the intersection of academia and industry, trying his best to fuse them together. Previously, Ashutosh was a graduate student at UCSC working at Disorderly Labs making distributed systems safer using LDFI.

Ramprasad Ellupuru, eBay

Ramprasad is a member of the Site Reliability team at eBay working on making checkout highly reliable and available. He is an experienced developer and a new practitioner of chaos engineering at eBay.

Track 2

Grand Ballroom D

Sublinear Scaling in Practice: The 1k SRE Project

Wednesday, 11:10 am–11:40 am

Nikolaus Rath, Google

Available Media

At Google, one of the primary objectives of SRE teams is sublinear scaling: the size and number of SRE teams should grow more slowly than the number of supported services. This talk will describe how one team has implemented this principle. Over the last 3 years, we have increased our portfolio by more than 200% (from 187 to 431 supported services) without additional staffing, and we plan for continued growth up to 1000 services. We will review the extensive automation infrastructure that we have in place, describe ongoing projects (including automated incident handling), and discuss the changes we've made in how we approach SRE - moving away from service-specific production readiness reviews towards automated policy verification and service-agnostic consulting. Audience members will hear about a vision for the long-term role of SRE in large organizations, where sublinear scaling requires not just increasing automation but a cultural shift from providing service-specific expertise to mostly service-independent consulting.

Nikolaus Rath, Google

Dr. Nikolaus Rath is a site reliability engineer working on Google's advertising services. Before joining Google, he worked on feedback control systems for magnetically confined plasmas. He is a maintainer of a number of open-source projects, including libfuse and S3QL.

Pragmatic Automation

Wednesday, 11:45 am–12:15 pm

Max Luebbe, Google

Available Media

Automation is great, but how do you know when the right thing to do is to stop writing it? How do you take on complex automation projects of unknown scope and deliver impact incrementally?

This talk explores lessons learned in the automation space at a large public Cloud provider that are applicable to anyone looking for new ideas to reduce toil in their day-to-day work.

Max Luebbe, Google

Max has been an SRE at Google since 2009, having spent most of that time working in Storage Infrastructure. More recently he was on the teams that externalized Bigtable and Spanner as GCP Products and currently leads the effort to deploy new Google Cloud Regions all over the globe.

Differences in SRE Implementations across Companies

Wednesday, 12:20 pm–12:50 pm

Kurt Andersen, LinkedIn

Available Media

With the popularity of "SRE" as a job role, people have become aware that not all such roles are entirely equivalent. There's been a Slack channel on the USENIX-SREcon workspace (https://usenix.org/srecon/slack #sre_between_companies) where people have started to explore these distinctions.

This session will be an opportunity to crowd-source more information. It will be a moderated, audience driven session. Come and tell us what SRE means at your company!

Kurt Andersen, LinkedIn

Kurt Andersen has been one of the co-chairs for SREcon Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.

12:50 pm–2:20 pm

Luncheon

Grand Ballroom EFGHI
Sponsored by Google

2:20 pm–4:00 pm

Track 1

Grand Ballroom ABC

Latency SLOs Done Right

Wednesday, 2:20 pm–2:50 pm

Fred Moyer, Circonus

Available Media

Median, average, 90th, 99th percentile. We've all seen these metrics on our monitoring systems, both open source and from commercial vendors, but often they are used incorrectly when constructing Service Level Objectives. This session will show three different approaches to correctly calculating latency SLOs, and how histograms can be used to calculate mathematically correct quantiles and set SLOs based on those.
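
To illustrate why histograms matter here: percentiles computed separately per host cannot be averaged into a correct overall percentile, but histogram bucket counts can be summed and a quantile estimated from the merged result. The following is a minimal sketch of that idea, not the speaker's method; the bucket boundaries, the per-host counts, and the function names are invented for illustration.

```python
# Upper bounds (in ms) of the latency buckets; the last bucket is open-ended.
BOUNDS = [10, 50, 100, 500, float("inf")]


def merge(*histograms):
    """Sum bucket counts across hosts; valid because the buckets are identical."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile_upper_bound(hist, q):
    """Upper bound of the first bucket at which the cumulative count reaches q."""
    threshold = q * sum(hist)
    running = 0
    for bound, count in zip(BOUNDS, hist):
        running += count
        if running >= threshold:
            return bound
    return BOUNDS[-1]


def fraction_within(hist, limit_ms):
    """Share of requests in buckets whose upper bound is at most limit_ms."""
    good = sum(count for bound, count in zip(BOUNDS, hist) if bound <= limit_ms)
    return good / sum(hist)


# Hypothetical data: a busy, fast host and a quiet, slow one.
host_a = [900, 80, 15, 5, 0]   # 1000 requests
host_b = [10, 20, 30, 40, 0]   # 100 requests
merged = merge(host_a, host_b)

p99_a = quantile_upper_bound(host_a, 0.99)      # 100
p99_b = quantile_upper_bound(host_b, 0.99)      # 500
p99_all = quantile_upper_bound(merged, 0.99)    # 500
naive_average = (p99_a + p99_b) / 2             # 300, which matches neither view
```

With these made-up numbers, averaging the two per-host p99 values gives 300 ms, while the merged histogram puts the overall p99 in the 500 ms bucket. A call such as fraction_within(merged, 100) expresses the same data as an SLO-style statement of the form "this share of requests completed within 100 ms". The resolution of any such estimate is limited by the bucket boundaries chosen.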

Fred Moyer, Circonus

Fred Moyer is a Developer Evangelist for Circonus, where he likes to apply math to ridiculously large sets of data. Fred is a recovering Perl and C programmer, and these days likes to hack in Go and is learning Lua. He is a 2013 White Camel award winner, Apache Software Foundation member, and has worked in software engineering and reliability roles for the last 18 years.

Extending the Error Budget Model to Security and Feature Freshness

Wednesday, 2:55 pm–3:25 pm

Jim Thomson and David Laing, Pivotal

Available Media

Everyone knows about error budgets (most every SRE at this conference, anyway) and how to use them to manage availability.

But what about operations outcomes beyond availability, like _security_ and _feature freshness_? In this talk, we'll describe how to apply the error budget model to measure and improve security and feature value and mitigate the risk of change. And we'll give you the tools to brag about your success.
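
For readers less familiar with the availability case that this talk builds on, the basic arithmetic can be sketched as follows. The SLO target, the request counts, and the function names are illustrative assumptions; the extension to security and feature freshness described in the abstract is not shown.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the budget still unspent; negative means it is overspent."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget


# Hypothetical window: a 99.9% availability SLO over one million requests
# allows roughly 1000 failures; 250 failures so far leaves about 75% unspent.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
```

The remaining fraction is what teams typically use to decide whether to keep shipping changes or to slow down and spend time on reliability.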

Jim Thomson, Pivotal

Jim is a Product Lead at Pivotal Cloud R+D, and loves to bring product-thinking into an operations world. While he's more into dogs, he shares a love of dad-jokes with David.

David Laing, Pivotal

David is an Engineering Lead at Pivotal Cloud R+D. He previously ran CloudOpsEU - the team that keeps Pivotal Tracker's foundation available, secure, and feature-fresh. He is particularly fond of cats and dad-jokes.

You Don't Have to Love Your Job

Wednesday, 3:30 pm–3:45 pm

Leslie Carr, Quip

Available Media

"Do what you love, and you'll never work another day in your life." -- someone who's never had a job

We're often told that we need to love our jobs—which sounds great on paper. But if everyone did what they loved, the world would only have astronaut pilots and pony huggers. Feeling we need to love our jobs pushes an imposter syndrome myth and makes great employees feel like they're not doing the right thing.

You don't have to love your job, you just need to like it!

Leslie Carr, Quip

Leslie Carr is an Engineering Manager at Quip.

Leslie transformed from a productive engineer into a pointy-haired manager while at Clover Health. In her past life, Leslie worked at Cumulus Networks in DevOps, helping to push automation in the network world. Prior to that, she was on the production side of the world at many large websites, such as Google, Craigslist, and Wikimedia.

Leslie is a lover and user of open source and automation. She dreams of robots taking over all of our jobs one day.

Mindfulness in SRE: Monitoring and Alerting for One's Self

Wednesday, 3:45 pm–4:00 pm

Tommy Lutz, Google

Available Media

As SREs, we are all permanently on-call for our own well-being. Without proper monitoring and alerting about what's going on in our body, mind, and surroundings, we're likely to fall short of our own expectations regarding stress management, work-life balance, social interactions, and risk management. This talk provides an illustrative definition of mindfulness, provides practical examples of its usefulness in SRE, and builds the concept of self-monitoring that can improve performance both on the job and on the street.

Tommy Lutz, Google

Tommy Lutz is an SRE manager at Google and a former engineering manager at Bloomberg Tradebook. The SRE team he serves supports Google's archival storage systems. Tommy is known for commuting to Google NYC by folding boat and bicycle on the Hudson River. The long float down the river bucks the trend of "ever-faster" lifestyles in favor of gentle lapping water, jumping fish, and time for reflection and contemplation (along with an occasional visit from the NYPD).

Track 2

Grand Ballroom D

Automating the Management of the Operational Health of Cloud Accounts at Scale

Wednesday, 2:20 pm–2:50 pm

Jamie Walls, Capital One

Available Media

In a large scale environment where engineers are empowered to independently deliver an application from concept to working production system, and in public cloud providers that allow access to do almost anything, there is a unique challenge of implementing and maintaining controls that align with tight banking regulations. I will discuss how we've used a combination of open source tools and our custom automation to solve various challenges such as:

  • Limiting public access
  • Staying ahead of account resource limits
  • Enforcing resource ownership
  • Cost control
  • Security patching
  • Account-impacting mistakes

Jamie Walls, Capital One

Jamie has experience in operations and on feature delivery teams and brings an understanding of the balance between high operational quality and time to market. He understands the value in "Shift Left" operational testing and validation where a focus on simplifying and automating early stages of any process will lead to higher quality later. He currently works in a role establishing and enforcing cloud best practices within a large-scale public cloud environment.

Designing Resilient Data Pipelines

Wednesday, 2:55 pm–3:25 pm

Andrew Bolin, Two Sigma Investments, LP

Available Media

There are a number of questions that plague any operator of a complex data pipeline. How do I quickly recover from failures in my pipeline? How do I know that the data I generate is accurate? How do I minimize the risk associated with updating my pipeline? Designing your data pipeline with resiliency and observability in mind will help to answer these questions. In this talk, I will present several strategies that my team has adopted for reducing operational complexity, risk associated with updates, and concerns about accuracy of data pipelines.

Andrew Bolin, Two Sigma Investments, LP

Andrew Bolin is a Reliability Engineer at Two Sigma Investments where he is responsible for the design and operation of data pipelines critical to the firm's research environment. Before his current role, Andrew worked on the team responsible for the development of Two Sigma's open source fair-share scheduler, Cook. Andrew has an equal passion for spreading RE best practices at Two Sigma and exploring the diverse food offerings of NYC.

From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services

Wednesday, 3:30 pm–4:00 pm

Salim Virji and Carlos Villavieja, Google LLC

Available Media

Artificial intelligence is all around us, from the digital assistants in our microwaves to the apps we rely on every day. Many of these systems build on APIs and services that use machine learning to provide key features. This talk will describe techniques for building predictable, reliable ML-based services as well as ways to sustain these services through social and technical change. We discuss challenges unique to the reliability of these systems and relate our experiences with ML in our production systems to illustrate our techniques.

Salim Virji, Google LLC

Salim Virji is a Site Reliability Engineer at Google, where he has worked on distributed compute, consensus, and storage systems.

Carlos Villavieja, Google LLC

Carlos Villavieja is a Computer Architect/Researcher working as a Software/Site Reliability Engineer at Google. He works on Storage optimizations and his interests vary from micro-architecture to machine learning.

4:00 pm–4:30 pm

Break with Refreshments

Grand Ballroom Foyer

4:30 pm–5:30 pm

Closing Plenary Session

Grand Ballroom ABCD

Resilience Engineering Mythbusting

Wednesday, 4:30 pm–5:00 pm

Will Gallego, Fastly

Available Media

How confident are you in your prod servers staying up without your help? Too often in tech we mistakenly interchange three important concepts when describing our socio-technical systems: how resilient they are, the reliability they exhibit in day to day work, and how robust they are under duress. Though interrelated, they are not equivalent.

How can we successfully gain insights in post-incident reviews, execute chaos engineering experiments, and build scalable infrastructure if we're misinterpreting our approaches? By separating out these core concepts, we can isolate better approaches in adapting to unforeseen circumstances. We'll look at common misconceptions when describing our systems as resilient and focus on proven methods to help us improve our understanding of our systems.

Will Gallego, Fastly

Will Gallego is a systems engineer with 15+ years of experience in the web development field, currently as a Senior Software Engineer at Fastly. Comfortable with several parts of the stack, he focuses now on building scalable, distributed backend systems and tools to help engineers grow. He believes in a free and open internet, blame aware post mortems, and pronouncing gif with a soft "G".

Why Are Distributed Systems So Hard?

Wednesday, 5:00 pm–5:30 pm

Denise Yu, Pivotal

Available Media

Distributed systems are known for being notoriously difficult to wrangle. But why? This talk will cover a brief history of distributed databases, clear up some common myths about the CAP theorem, dig into why network partitions are inevitable, and close out by highlighting how a few popular consensus algorithms mitigate the risks of operating in a distributed fashion and the importance of considering human factors to fully understand the systems we build. Almost all slides will contain original illustrations featuring mischievous cats masquerading as sysadmins. By the end of this talk you will have a better understanding of the design trade-offs involved in architecting for distributed systems, and hopefully, be inspired to start doodling tech concepts!

Denise Yu, Pivotal

Denise is a software engineer who occasionally wears a product management hat at Pivotal R&D in Toronto. Denise has previously delivered conference talks on topics ranging from continuous delivery to functional programming to scaling company culture. She enjoys learning about distributed systems, release engineering, and low-level Linux kernel programming, and when she's not coding, she is often doodling sketch notes that break down technical concepts into digestible pieces at deniseyu.io/art.

5:30 pm–5:45 pm

Closing Remarks

Grand Ballroom ABCD
Program Co-Chairs: Liz Fong-Jones, Honeycomb, and Mike Rembetsy, Bloomberg