SREcon18 Americas Conference Program

All sessions will be held at the Hyatt Regency Santa Clara.

Conference Videos

Videos of the conference talks on Wednesday and Thursday are linked from each presentation below.

Tuesday, March 27, 2018

7:30 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer
Sponsored by Rundeck

9:00 am–10:30 am

Workshop Track 1

Grand Ballroom C

Containers from Scratch

Avishai Ish-Shalom, Aleph VC, and Nati Cohen, Here Technologies

Docker is very popular these days, but how many of us are really familiar with the basic building blocks of Linux containers and their implications? What's missing in the good ol’ chroot jails? What are the available Copy-on-Write options and what are their pros and cons? Which syscalls allow us to manipulate Linux namespaces and what are their limitations? How do resource limits actually work? What different behaviours do containers and VMs have?

In this hands-on workshop, we will build a small Docker-like tool from O/S level primitives in order to learn how Docker and containers actually work. Starting from a regular process, we will gradually isolate and constrain it until we have a (nearly) full container solution, pausing after each step to learn how our new constraints behave.
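
The program does not show the workshop's own tool. As a minimal sketch of the "constrain a regular process, then observe the constraint" step, the example below swaps in POSIX rlimits, which need no root privileges, in place of the namespaces and cgroups a real container runtime would use. The function name `run_constrained` and the limit of 32 open files are assumptions made for illustration.

```python
import resource
import subprocess
import sys


def run_constrained(max_open_files=32):
    """Start a child Python process with a lowered open-file limit.

    Returns the soft RLIMIT_NOFILE value the child reports. This uses
    rlimits only; it does not isolate the child in any namespace.
    """

    def apply_limit():
        # Runs in the child after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        soft = max_open_files
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)  # the soft limit may not exceed the hard limit
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    # The child simply prints the soft limit it sees.
    child_code = (
        "import resource; "
        "print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        preexec_fn=apply_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("child soft limit on open files:", run_constrained(32))
```

The parent process keeps its original limit; only the child is constrained, which mirrors the pattern of applying a constraint and then checking how the process behaves under it.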

Pre-Reading List:

Prerequisites, Skills, and Tools:

Basic knowledge of Python or C, good knowledge of Linux.

Avishai Ish-Shalom, Aleph VC

Avishai is a veteran operations and software engineer with years of high-scale production experience. After spending many years in startup and web companies, Avishai now serves as Engineer in Residence at the Aleph VC fund. In his spare time, Avishai is spreading weird ideas and conspiracy theories like DevOps and Operations Engineering.

Workshop Track 2

Grand Ballroom GH

SRE Classroom, or How to Build a Distributed System in 3 Hours

Salim Virji, Laura Nolan, and Phillip Tischler, Google LLC

Available Media

This workshop ties together academic and practical aspects of systems engineering, with an emphasis on applying principles of systems design to a production service. We will analyze the service to quantify its performance, and iteratively improve the design.

Participants will work together in small groups to sketch out the design and identify components and their relationships.

Participants will have a system design and bill of materials at the conclusion of this workshop.

Participants will not need laptops or specific coding experience; participants will need enthusiasm for collaborating in small groups, and for discussion-based problem-solving.

Pre-Reading List:

  • CAP Twelve Years Later
  • Raft
  • Distributed systems in production environments
    • The Google File System
    • The Chubby Lock Service for Loosely-Coupled Distributed Systems
  • Microservices
  • Latency Numbers That Are Good To Know
  • SRE Book Chapters
    • Service Level Objectives
    • Load Balancing at the Frontend
    • Load Balancing in the Datacenter
    • Managing Critical State: Distributed Consensus for Reliability
    • Data Integrity: What You Read Is What You Wrote

Prerequisites, Skills, and Tools:

  • Participants will need enthusiasm for collaborating in small groups, and for discussion-based problem-solving
  • Participants will work together, using pencil and paper
  • Participants ought to be familiar with order-of-magnitude comparisons

Salim Virji, Google LLC

Salim Virji is a Site Reliability Engineer at Google, where he has built distributed compute, consensus, and storage systems.

Workshop Track 3

Grand Ballroom F

Profiling JVM Applications in Production

Sasha Goldshtein, Sela Group

Available Media

Many profilers, such as JVisualVM and jstack, will simply lie to your face about which call stacks are hottest and where your bottlenecks lie. Profiling in production environments has even more challenges, because you have to carefully manage overhead, account for other processes running on the system, and choose non-invasive tools that don't require an application restart.

We will build a simple checklist for verifying JVM application performance, and finding the area to focus on in a closer investigation. Then, we will experiment with two approaches for CPU profiling on Linux: the perf multi-tool, combined with perf-map-agent, and the async-profiler project, an innovative tool that brings perf together with traditional JVM profiling techniques. We will visualize stack traces using flame graphs, and understand where the CPU bottlenecks lie, through a series of hands-on labs.

In the second half of this workshop, we will talk about more complicated scenarios: diagnosing errors when opening files, tracing database queries, monitoring system I/O load, understanding the reasons for excessive garbage collection, figuring out why threads are blocked off-CPU, and more. Some of these tasks can be approached using perf, but others require a bleeding-edge technology—BPF and BCC.

Pre-Reading List:

Prerequisites, Skills, and Tools:

  • Familiarity with shell scripting and basic commands
  • Understanding of OS concepts such as processes, threads, memory
  • Basic understanding of the JVM code execution model, GC, and interactions with native code

Sasha Goldshtein, Sela Group

Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP and Regional Director, Pluralsight and O'Reilly author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing—across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.

Workshop Track 4

Grand Ballroom AB

Incident Command for IT—What We've Learned from the Fire Department

Brent Chapman, Great Circle Associates, Inc.

Available Media

Leading companies such as Google, Heroku, and PagerDuty have developed successful incident management practices based on the public safety world's Incident Command System (ICS). This workshop will teach you these practices, and help you bring them to your own organization. It is based on my experience creating and refining Google's "IMAG" (Incident Management at Google) protocol, as well as on my experience with incident command in the public safety world as an air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, and a Community Emergency Response Team (CERT) member and instructor.

In this workshop, we will explore:

  • How do public safety agencies manage emergencies daily?
    • The basic principles of the Incident Command System (ICS)
    • Responsibilities of each ICS role
  • How to launch and manage an effective response
  • How to evolve your response on the fly, scaling it up and down, as both the situation and your resources evolve
  • How to communicate effectively among responders
  • How to communicate beyond the responders, to management, customers, investors, regulators, the public, and others
  • How to conclude a response and return to normal operations
  • How to follow up effectively with a blameless postmortem
  • How to deal with multiple incidents simultaneously

Pre-Reading List

Brent Chapman, Great Circle Associates, Inc.

Brent Chapman is an expert at emergency management and at helping organizations prepare for and learn from emergencies, working from a strong background in IT infrastructure and site reliability engineering (SRE).

As a leader in Google's legendary SRE organization, Brent convinced senior management of the need to strengthen and standardize the company’s incident management practices, and created the Incident Management at Google (IMAG) system that is now used throughout the company. Brent is also a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, and a Community Emergency Response Team (CERT) member and instructor.

10:30 am–11:00 am

Break with Refreshments

Grand Ballroom Foyer
Sponsored by Nutanix

Visit the Sponsor Showcase!

11:00 am–12:30 pm

Workshop Track 1 (continued)

Grand Ballroom C

Containers from Scratch

Avishai Ish-Shalom, Aleph VC, and Nati Cohen, Here Technologies

Workshop Track 2 (continued)

Grand Ballroom GH

SRE Classroom, or How to Build a Distributed System in 3 Hours

Salim Virji, Laura Nolan, and Phillip Tischler, Google LLC

Workshop Track 3 (continued)

Grand Ballroom F

Profiling JVM Applications in Production

Sasha Goldshtein, Sela Group

Workshop Track 4 (continued)

Grand Ballroom AB

Incident Command for IT—What We've Learned from the Fire Department

Brent Chapman, Great Circle Associates, Inc.

12:30 pm–2:00 pm

Luncheon

Santa Clara Ballroom
Sponsored by LinkedIn

2:00 pm–3:30 pm

Workshop Track 1

Grand Ballroom AB

Kubernetes 101

Bridget Kromhout, Microsoft

It is a truth universally acknowledged that a techie in possession of any production code whatsoever must be in want of a container orchestration platform. What's up for debate, according to noted thought leader Jane Austen, is how many pizzas the team is going to eat.

Let's explore how to create and operate a Kubernetes cluster in order to answer this crucial question. If you're into dev or ops or some portmanteau thereof, this is relevant to your interests. We’ll be following an Azure variant based on the open-source k8s training at http://container.training/, as well as trying out AKS (Azure Container Service); there are takeaways no matter which public or private cloud you use.

As our team grows, we're going to need to scale our k8s cluster, deploying and configuring our pizza delivery app. We'll deal with the consequences of state (you know, where your customers and money live) and carry out service discovery between our deliciously independent microservices. We'll level up on k8s (and pizza) together.

Pre-Reading List:

https://docs.microsoft.com/en-us/azure/aks/

Optional: https://github.com/ivanfioravanti/kubernetes-the-hard-way-on-azure

Prerequisites, Skills, and Tools:

  • A device with a web browser and an ssh client

Bridget Kromhout, Microsoft

Bridget Kromhout is a Principal Cloud Developer Advocate at Microsoft. Her CS degree emphasis was in theory, but she now deals with the concrete (if 'cloud' can be considered tangible). After 15 years as an operations engineer, she traded being on call for being on a plane. A frequent speaker and program committee member for tech conferences, she leads the devopsdays organization globally and the devops community at home in Minneapolis. She podcasts with Arrested DevOps, blogs at bridgetkromhout.com, and is active in a Twitterverse near you.

Workshop Track 2

Grand Ballroom GH

Chaos Engineering Bootcamp

Tammy Butow, Gremlin

Available Media

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:

1. Start by defining “steady state” as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
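
The four steps above can be sketched in code. This is a toy illustration only, not the workshop's tooling: the capacity model, the function names `success_rate` and `run_experiment`, and all numbers are assumptions chosen to keep the example deterministic.

```python
def success_rate(replicas, capacity_per_replica, load):
    """Step 1: steady state, defined here as the fraction of requests served.

    Toy model: each replica serves up to capacity_per_replica requests.
    """
    if load == 0:
        return 1.0
    return min(1.0, replicas * capacity_per_replica / load)


def run_experiment(replicas, capacity_per_replica, load, crashed=1, tolerance=0.01):
    """Compare a control group with a group that loses `crashed` replicas."""
    # Step 2: hypothesis -- both groups stay within `tolerance` of each other.
    control = success_rate(replicas, capacity_per_replica, load)

    # Step 3: introduce a real-world event, here a crash of some replicas.
    experimental = success_rate(replicas - crashed, capacity_per_replica, load)

    # Step 4: try to disprove the hypothesis by measuring the difference.
    difference = abs(control - experimental)
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": difference <= tolerance,
    }


if __name__ == "__main__":
    # Enough headroom: losing one of three replicas is absorbed.
    print(run_experiment(replicas=3, capacity_per_replica=500, load=1000))
    # Too little headroom: the same crash exposes a weakness.
    print(run_experiment(replicas=3, capacity_per_replica=400, load=1000))
```

In a real experiment the steady-state metric would come from production monitoring rather than a formula, and the failure would be injected into a limited slice of real traffic.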

This hands-on tutorial covers the tools and practices you need to implement chaos engineering in your organization. Even if you’re already using chaos engineering, you’ll learn to identify new ways to use it within your engineering organization and discover how other companies are using it, along with the positive results they have had using chaos to create reliable distributed systems.

Pre-Reading List

  1. Production-Ready Microservices by Susan Fowler—especially the section on chaos testing at Uber (p. 94)
  2. Principles of Chaos
  3. Chaos Engineering: Building confidence in system behavior through experiments

Prerequisites, Skills, and Tools

  1. A basic understanding of production environments and the infrastructure required to run systems
  2. Experience with Linux, cloud infrastructure, hardware, networking, and systems troubleshooting

Workshop Track 3

Grand Ballroom F

Ansible for SRE Teams

James Meickle, Quantopian

Ansible is a "batteries included" automation, configuration management, and orchestration tool that's fast to learn and flexible enough for any architecture. It's the perfect choice for automating repetitive tasks, increasing deployment frequency, and running infrastructure at scale. Companies are increasingly turning to Ansible to meet the challenges of transitioning from the data center to the cloud, from servers to containers, and from silos to devops.

In the first half of this workshop, you'll get a high-level overview of Ansible's key concepts and learn how to use these building blocks to develop a reusable and maintainable Ansible codebase. You'll also get practical exposure to running Ansible commands on an EC2 instance.

In the second half of this workshop, you'll learn how Ansible interacts with cloud providers and other automation tools. The focus will be on hard problems like secret management, scheduling, and variable inheritance. This half of the workshop also has a practical component: you'll log into Ansible Tower, configure a new task, and provision a cloud instance to run orchestration commands against.

By the end of the workshop you'll understand whether Ansible is the right fit for your team and how to plan a successful deployment.

Pre-Reading List:

A subset of the reading materials from the README.md

Prerequisites, Skills, and Tools:

  • Environment: All students will be provided AWS instances at the start of the course.
  • Tools: Pre-installed on the AWS instances. Students will need the ability to SSH into a public EC2 instance on port 22.
  • Skills: No Ansible knowledge is required, but students should have moderate or higher skills in configuration management, cloud architecture, and UNIX.

James Meickle, Quantopian

James is a site reliability engineer at Quantopian, a Boston startup making algorithmic trading accessible to everyone. Past roles have seen him responsible for processing MRI scans at the Center for Brain Science at Harvard University, sales engineering and developer evangelism at AppNeta, and release engineering during the Romney for President 2012 campaign. Between NYSE trading days, he organizes DevOpsDays Boston and conducts Ansible trainings on O'Reilly's Safari platform. What free time remains is dedicated to cooking, sci-fi, permadeath video games, and Satanism.

Workshop Track 4

Grand Ballroom C

Tech Writing 101 for SREs

Lisa Carey, Google

From post-mortems to operations manuals to code comments, writing things down for others is an unavoidable part of the life of an SRE.

In this workshop, you’ll learn principles for presenting technical information, both from two experienced Google technical writers and from each other! Through a series of pair-work exercises you’ll work through a variety of topics to improve the clarity, readability, and effectiveness of your writing, and possibly think about a toothbrush like you’ve never thought about one before. If you've never before taken any technical writing training, this workshop is perfect for you. If you've taken technical writing training, this class will serve as a great refresher.

There is a small amount of pre-reading for participants in this workshop (~30 minutes of reading about basic technical writing concepts).

The workshop runs for approximately two hours with a short break.

Pre-Reading List:

https://lisafc.github.io/tw101-reading/

Prerequisites, Skills, and Tools:

This class is suitable for all conference attendees.

Lisa Carey, Google

Lisa Carey is a Technical Writer for Google Cloud Platform in Dublin. She has written documentation for many technologies including Protocol Buffers, gRPC, Istio, and Cloud APIs, and regularly runs writing workshops for Google engineers. She holds degrees from Trinity College Dublin.

3:30 pm–4:00 pm

Break with Refreshments

Grand Ballroom Foyer
Sponsored by Dropbox

Visit the Sponsor Showcase!

4:00 pm–5:30 pm

Workshop Track 1 (continued)

Grand Ballroom AB

Kubernetes 101

Bridget Kromhout, Microsoft

Workshop Track 2 (continued)

Grand Ballroom GH

Chaos Engineering Bootcamp

Tammy Butow, Gremlin

Workshop Track 3 (continued)

Grand Ballroom F

Ansible for SRE Teams

James Meickle, Quantopian

Workshop Track 4 (continued)

Grand Ballroom C

Tech Writing 101 for SREs

Lisa Carey, Google

5:30 pm–6:30 pm

Happy Hour

Terra Courtyard
Sponsored by Google

Wednesday, March 28

7:30 am–8:30 am

Continental Breakfast

Grand Ballroom Foyer
Sponsored by Squarespace

8:30 am–8:45 am

Welcome and Opening Remarks

Grand Ballroom ABCFGH
Program Co-Chairs: Kurt Andersen, LinkedIn, and Betsy Beyer, Google

8:45 am–10:15 am

Opening Plenary Session

Grand Ballroom ABCFGH

If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There

Wednesday, 8:45 am–9:15 am

Nicole Forsgren and Jez Humble, DevOps Research and Assessment (DORA)

Available Media

The best-performing organizations have the highest quality, throughput, and reliability while also delivering value. They are able to achieve this by focusing on a few key measurement principles, which Nicole and Jez will outline in this talk. These include knowing your outcome and measuring it, capturing metrics in tension, and collecting complementary measures… along with a few others. Nicole and Jez explain the importance of knowing how (and what) to measure, ensuring you catch successes and failures when they first show up, not just when they’re epic, so you can course correct rapidly. Measuring progress lets you focus on what’s important, helps you communicate this progress to peers, leaders, and stakeholders, and arms you for important conversations around targets such as SLOs. Great outcomes don’t realize themselves, after all, and having the right metrics gives us the data we need to be great SREs and move performance in the right direction.

Nicole Forsgren, DevOps Research and Assessment (DORA)

Dr. Nicole Forsgren is the CEO and chief scientist at DevOps Research and Assessment (DORA). Nicole is an IT impacts expert who is best known for her work with tech professionals and as the lead investigator on the largest DevOps studies to date. In a previous life, she was a professor, sysadmin, and hardware performance analyst. Nicole has been awarded public and private research grants (funders include NASA and the NSF), and her work has been featured in various media outlets, peer-reviewed journals, and conferences.

Jez Humble, DevOps Research and Assessment (DORA)

Jez Humble is co-author of Accelerate, The DevOps Handbook, Lean Enterprise, and the Jolt Award-winning Continuous Delivery. He has spent his career tinkering with code, infrastructure, and product development in companies of varying sizes across three continents, most recently working for the US Federal Government at 18F. He is currently researching how to build high-performing teams at his startup, DevOps Research and Assessment LLC, and teaching at UC Berkeley.

Security and SRE: Natural Force Multipliers

Wednesday, 9:15 am–9:45 am

Cory Scott, LinkedIn

Available Media

A thorough understanding of how modern security principles impact SRE operations and the services they enable is key for any high-performing SRE organization. On the flip side, modern SRE practices can make the difference between an effective security organization and one that is left in the dust. SRE infrastructure and operations discipline are often more robust than anything being offered by outside security vendors.

During this talk, I’ll address how security teams use SRE principles to better run security practices while also sharing how SRE teams can better apply key security principles. Attendees will come away with an understanding of how these teams are different but also complement each other.

Cory Scott, LinkedIn

Cory Scott is the Chief Information Security Officer at LinkedIn. He is responsible for production and corporate information security, including assessment, monitoring, incident response and assurance activities. Prior to joining LinkedIn, Scott was at Matasano Security, where he led the consulting teams based in Chicago and Mountain View. He has also held technical positions at @stake, Symantec and ABN AMRO/Royal Bank of Scotland. Scott has presented at Black Hat, USENIX, OWASP and SANS.

What It Really Means to Be an Effective Engineer

Wednesday, 9:45 am–10:15 am

Edmond Lau, Co Leadership

Available Media

For two years, I embarked on a quest to answer: What mindsets and frameworks do the most effective engineers use? I interviewed engineering leaders at top software companies including Google, Facebook, Dropbox, Airbnb, Stripe, Instagram, and more.

When I published The Effective Engineer in 2015, I thought I had the question all figured out. It turns out there's so much more to what it means to work effectively on a team—a lesson I had to learn the hard way.

Edmond Lau, Co Leadership

Edmond Lau is the author of the book, The Effective Engineer—now the de facto onboarding guide for many engineering teams. He's spent the last decade leading engineering teams across Silicon Valley—at Quip (acquired by Salesforce), Quora, Ooyala (acquired by Telstra), and Google.

As a leadership coach, Edmond works directly with CTOs, directors, managers, and other emerging leaders to unlock what's possible for them. He's run workshops and seminars at places like Pinterest, Google, and Facebook to raise the bar on what it means to be an engineering leader.

He blogs at effectiveengineer.com.

10:15 am–10:55 am

Break with Refreshments

Grand Ballroom Foyer
Sponsored by PayPal

Visit the Sponsor Showcase!

10:55 am–12:20 pm

Talks Track 1

Grand Ballroom ABC

SparkPost: The Day the DNS Died

Wednesday, 10:55 am–11:35 am

Jeremy Blosser, SparkPost

Available Media

More than 25% of the world's non-spam email is sent using SparkPost's technology, and our cloud service sends nearly 15 billion messages per month. Running this service in the cloud has provided all the expected benefits of flexibility and scalability, but also unique challenges due to email's inherent nature as a highly stateful, push-oriented service. To support our use case, our service and network utilization models are similarly atypical.

Our DNS needs are particularly extreme. Our infrastructure currently has to support 8,000 DNS queries per second. Two major DNS-related events in early 2017 caused significant delays for our customers and sent us back to the drawing board once again. We recently completed a ground-up DNS tier redesign that includes dedicated VPCs with optimized security groups and ACLs, distribution across tiers and availability zones, resolver tuning and custom configurations, and multiple local caching resolvers per instance.

In this talk, we will discuss our history addressing this challenge, the limitations discovered in our previous approaches, and our current architecture's design and results. Attendees will gain an understanding of what it takes to host a robust DNS service in AWS at a scale beyond what is currently natively supported by AWS' resolver services.

Jeremy Blosser, SparkPost

Jeremy Blosser has worked in systems administration and engineering for over 20 years, and most of that time his focus has been on reliably delivering email and other traffic at scale. He is currently the Principal Operations Engineer at SparkPost, responsible for technical architecture oversight and keeping the cloud service operating and healthy. He lives in Texas with his wife and five kids.

Stable and Accurate Health-Checking of Horizontally-Scaled Services

Wednesday, 11:40 am–12:20 pm

Lorenzo Saino, Fastly

Available Media

This talk explains how Fastly built a distributed health-checking system capable of driving stable traffic allocation, while quickly and accurately identifying failures. The key intuition behind our design is that the common approach of estimating the operational readiness of a service instance based on its state alone leads to inaccurate decisions. Instead, the health of each instance should be evaluated in the context of the whole service: an instance should be classified unhealthy only if its behavior deviates significantly from other instances in a cluster. Our design borrows techniques from machine learning, signal processing and control theory to ensure overall system availability.

Attendees will learn:

  • About the challenges involved in health-checking complex services and how to make accurate, timely and stable decisions.
  • How health-checking can be abstracted into a tractable mathematical problem that can be effectively solved by applying known tools and techniques from machine learning, signal processing and control theory.
  • How to implement such a system practically, what the issues are, and the tradeoffs involved.
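The idea in the abstract above, judging an instance against its peers instead of on its own state alone, can be illustrated with a small sketch. This is not Fastly's design, which the program does not describe in detail. It assumes a single per-instance error rate and uses a median and median-absolute-deviation (MAD) outlier test as a stand-in technique. The function name, the threshold, and the minimum spread are all hypothetical.

```python
from statistics import median


def unhealthy_instances(error_rates, threshold=3.5, min_spread=0.01):
    """Return the instances whose error rate deviates strongly from the cluster.

    error_rates maps an instance name to its observed error rate (0.0 to 1.0).
    An instance is flagged only when its rate sits far above the cluster
    median, measured in units of the median absolute deviation (MAD).
    All parameter values here are illustrative assumptions.
    """
    rates = list(error_rates.values())
    center = median(rates)
    mad = median(abs(rate - center) for rate in rates)
    # Floor the spread so a perfectly uniform cluster does not divide by zero.
    spread = max(mad, min_spread)
    return sorted(
        name
        for name, rate in error_rates.items()
        if (rate - center) / spread > threshold
    )


# One instance misbehaves while its peers are fine: only that one is flagged.
one_bad = unhealthy_instances({"a": 0.01, "b": 0.02, "c": 0.015, "d": 0.40})

# Every instance is degraded to a similar degree (for example, a shared
# dependency is failing): none stands out, so none is flagged.
all_degraded = unhealthy_instances({"a": 0.30, "b": 0.31, "c": 0.29, "d": 0.32})
```

In the second case a fixed per-instance threshold would likely remove the whole cluster from rotation at once, which is the kind of unstable decision a peer-relative check is meant to avoid.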

Lorenzo Saino, Fastly

Lorenzo Saino is a software engineer at Fastly, where he works on network systems and load balancing problems. Prior to that, he was a PhD student at University College London. His thesis investigated design issues in networked caching systems and he was awarded the Fabrizio Lombardi prize in 2016 for his research.

Talks Track 2

Grand Ballroom FGH

Beyond Burnout: Mental Health and Neurodiversity in Engineering

Wednesday, 10:55 am–11:35 am

James Meickle, Quantopian

Available Media

In 2018, most of us understand what burnout is and why it's an occupational hazard for site reliability engineers. But while we're aware of how toil can sneak up on us, or how losing sleep to pages can destroy our productivity, we aren't always as cognizant that we're all starting from different baselines. Finding space in SRE can be challenging for people who struggle to read documentation quickly, or to speak up during meetings, or even have a face to face conversation at all. Even worse, well-meaning attempts at accommodation can stifle personal growth or become career-limiting.

But those of us with unusual deficits have gotten where we are by leveraging our strengths. The trauma we live with can also teach us coping skills, seeing the world a different way can lead to unique insights, and being anxious can lead to triple-checking configurations that everyone else only double-checks. The time is right to draw on lessons from the "mad pride" movement and learn to embrace difference on your team by providing tangible support without othering or being paternalistic, because one of the most ethical ways your organization can retain the best talent is by rejecting sanity as a requirement.

James Meickle, Quantopian

James is a site reliability engineer at Quantopian, a Boston startup making algorithmic trading accessible to everyone. Past roles have seen him responsible for processing MRI scans at the Center for Brain Science at Harvard University, sales engineering and developer evangelism at AppNeta, and release engineering during the Romney for President 2012 campaign. Between NYSE trading days, he organizes DevOpsDays Boston and conducts Ansible trainings on O'Reilly's Safari platform. What free time remains is dedicated to cooking, sci-fi, permadeath video games, and Satanism.

Bootstrapping an SRE Team: Effecting Culture Change and Leveraging Diverse Skill Sets

Wednesday, 11:40 am–12:20 pm

Aaron Wieczorek, U.S. Digital Service

Available Media

The U.S. Digital Service at VA’s team (DSVA), working on its most visible applications, Vets.gov and Caseflow, practices a modern DevOps culture with a full CI/CD pipeline and daily prod deploys. However, the team saw two major opportunities in putting together the foundations of an SRE team. The first opportunity was to make the DevOps for the existing two applications more streamlined. The second, more impactful opportunity was to form an SRE organization that could service software project services in the broader VA.

In this talk, I’ll focus on the cultural challenges entailed in bootstrapping an SRE culture in a larger organization, both from the perspective of combining single-application DevOps teams into one shared SRE team and, more importantly, from the perspective of bringing these services to a broader organization. This presentation will highlight the challenges of bringing together developers with different skill sets. It will also highlight the challenges of making cultural change and building momentum, so that the hundreds of other application owners within the VA hear about us through word of mouth and come to us to effect change, and of changing the nature of their development processes from a waterfall-based approach to lightweight, modern, agile SRE-based teams.

Aaron Wieczorek, U.S. Digital Service

Aaron Wieczorek is a Site Reliability Engineer at the United States Digital Service's Department of Veterans Affairs team. He works on easy technical problems and hard bureaucratic problems, from infrastructure to CI/CD pipelines to network engineering.

12:20 pm–1:35 pm

Luncheon

Santa Clara Ballroom
Sponsored by eBay

1:35 pm–3:25 pm

Talks Track 1

Grand Ballroom ABC

Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer?

Wednesday, 1:35 pm–2:15 pm

Rob Hirschfeld, RackN

Available Media

In the cloud and container era, we’ve moved from managing systems over time to the create-destroy-recreate pattern commonly known as immutable infrastructure. These concepts are not just for containers and clouds; they can be simpler, faster, and safer than traditional configuration in data center operations!

We’ll drill deep into these redeployment-based processes with examples, tooling highlights, and pragmatic pros and cons of this approach. We’ll also talk about taking immutable down to the metal.

This session is for experienced operators: we’re taking a system view and then going into the weeds.

Attendees of this session will learn the following:

  • Understanding of Immutable Infrastructure as part of a larger process
  • Requirements to create your own Immutable Infrastructure deployments
  • Benefits and challenges of an Immutable Infrastructure architecture
  • Applying immutable to physical infrastructure
  • Demonstration of Immutable Infrastructure using open source tools (time permitting)

Rob Hirschfeld, RackN

Rob has been in the cloud and infrastructure space for nearly 15 years, from working with early ESX betas to serving four terms on the OpenStack Foundation Board and becoming an executive at Dell. As a co-founder of the Digital Rebar project, Rob is creating a new generation of DevOps orchestration to leverage containers and service-oriented ops. He trained as an Industrial Engineer and carries a passion for applying Lean and Agile processes to software delivery.

Lessons Learned from Our Main Database Migrations at Facebook

Wednesday, 2:20 pm–3:00 pm

Yoshinori Matsunobu, Facebook

Available Media

At Facebook, we created a new MySQL storage engine called MyRocks. Our objective was to migrate one of our main databases (UDB) from compressed InnoDB to MyRocks and reduce the amount of storage and number of servers used by half. In August 2017, we finished converting from InnoDB to MyRocks in UDB. The migration was very carefully planned and executed, and it took nearly a year. But that was not the end of the migration. SREs needed to continue to operate MyRocks databases reliably. It was also important to find any production issue and to mitigate or fix it before it became critical. Since MyRocks was a new database, we encountered several issues after running it in production. In this session, I will introduce several interesting production issues that we faced and how we fixed them. Some of the issues were very hard to predict, which should make them interesting for attendees to learn about.

Attendees will learn the following:

  • What MyRocks is, and why it was beneficial for large services like Facebook
  • What should be considered for a production database migration
  • How a migration should be executed
  • 4-6 real production issues that we encountered

Yoshinori Matsunobu, Facebook

Yoshinori Matsunobu is a Production Engineer at Facebook, where he leads the MyRocks project and deployment. Yoshinori has been part of the MySQL community for over 10 years. He was a senior consultant at MySQL Inc. from 2006 to 2010. Yoshinori created a couple of useful open source tools, including MHA (an automated MySQL master failover tool) and quickstack.

Leveraging Multiple Regions to Improve Site Reliability: Lessons Learned from Jet.com

Wednesday, 3:05 pm–3:25 pm

Andrew Duch, Jet.com/Walmart Labs

Available Media

Running your systems across multiple regions allows you to tolerate a unique set of failure modes and can simplify disaster recovery processes. With this improved resiliency comes a daunting set of technical, management, and cost challenges.

At Jet, we’ve taken a multi-year journey to move from running in one region to multiple regions in the Microsoft Azure cloud. We'll share some of the multi-region architectural patterns we’ve introduced at Jet—both the successes and failures. We'll also discuss strategies to minimize costs and manage the complexity of moving thousands of services to run in multiple regions. Finally, we'll talk through how we manage failover automation and exercises.

Talks Track 2

Grand Ballroom FGH

Building Successful SRE in Large Enterprises—One Year Later

Wednesday, 1:35 pm–2:15 pm

Dave Rensin, Google

Available Media

At SRECon2017 I talked about the formation of a special group of Google SREs who go into the world and teach enterprise customers—via actual production systems—how to "do SRE" in their orgs. It was new when I presented it. It's one year later and we have a lot of interesting data about how it's going. Some things that we thought would be hard, weren't. Others were nigh on impossible. We've written many postmortems and learned a bunch of lessons you can only learn the hard way.

Things you can expect to learn:

  • Why it's easier to bootstrap SRE in a large traditional enterprise than in a cloud-native one!
  • Things enterprises assume are true, but aren't.
  • All the things we should have known better, but still learned the hard way—and how you can avoid them when bootstrapping SRE in your culture (or your customers' cultures)

Dave Rensin, Google

Dave Rensin is a Google SRE Director leading Customer Reliability Engineering (CRE)—a team of SREs pointed outward at customer production systems. Previously, he led Global Support for Google Cloud. As a longtime startup veteran he has lived through an improbable number of "success disasters" and pathologically weird failure modes. Ask him how to secure a handheld computer by accidentally writing software to make it catch fire, why a potato chip can is a terrible companion on a North Sea oil derrick, or about the time he told Steve Jobs that the iPhone was "destined to fail."

Working with Third Parties Shouldn't Suck

Wednesday, 2:20 pm–3:00 pm

Jonathan Mercereau, traffiq corp.

Available Media

As an SRE, you're not just responsible for building automation. Sometimes the work we do requires us to reach out to a third party to take advantage of services they offer, such as Site Monitoring (Catchpoint, Keynote, Pingdom, ThousandEyes, etc.), Log Aggregation (Splunk, Sumo Logic, Datadog, Loggly, etc.), and Chat (Slack, Discord, HipChat, etc.).

Many of us have had to weigh the benefit of building or buying, integrating, and managing third parties. Beyond the scope of features and functionality, there are a lot of things to consider related to working with a vendor, like customer support, pricing, scalability, flexibility, roadmap, track record, and real-world proving of their capabilities. There are things that may matter to you and some that don't. Let's dive into how a bad selection could be costly.

Jonathan Mercereau, traffiq corp.

Jonathan is co-founder of Traffiq, helping enterprises manage and orchestrate their web traffic infrastructure.

In his prior role as Staff SRE at LinkedIn, Jonathan helped build and lead the team at LinkedIn responsible for surviving one of the largest DDoS attacks. In his 4.5 years, the team scaled from a single engineer to 15 Site Reliability Engineers internationally, who were responsible for traffic engineering, third-party evaluation, planning, integration, tooling, automation, monitoring, M&A due diligence, interactions with legal and security, contract negotiation, and procurement. Prior to LinkedIn, Jonathan was the first dedicated CDN engineer at Netflix. While at Netflix, Jonathan improved streaming in Latin America with a 40% improvement in bitrate performance and a 90% decrease in rebuffer events, launched the European CDN cohort for UK+Nordics, participated in the architecture and deployment of Open Connect, and rewrote stream selection algorithms for various Netflix Clients. Prior to Netflix, Jonathan built Video Delivery platforms for Comcast and Time Warner, and was part of the inception, evaluation, design, and integration of Comcast's first nationwide MPEG2 Video CDN (Project Infinity, to become Xfinity).

When to NOT Set SLOs: Lots of Strangers Are Running My Software!

Wednesday, 3:05 pm–3:25 pm

Marie Cosgrove-Davies, Google

Available Media

Are you robbing your customers of their ability to think hard about their users and control their operational fate? The SLI/SLO model was created by and for Software-as-a-Service teams. It becomes challenging when the product you build is run by teams at companies around the world. For products like Cloud Foundry, where there is one development group shipping software to many operations teams at other companies, shipping the product with built-in Service Level Indicators makes complete sense. However, the idea of establishing SLOs for products gets tricky. In order to map a path forward, we need to reconsider the value of SLOs and what we’re trying to accomplish by setting them, and empower our customers to work through the process of establishing SLOs for standard SLIs that we enable them to measure.

Marie Cosgrove-Davies, Google

Marie Cosgrove-Davies is an ops product manager at Google in Pittsburgh, PA. Her checkered past includes stints in SRE, R&D, support, implementation, and the Peace Corps. She has previously worked at Pivotal and Opower. In her spare time she reads way too much, tends to her Victorian house, and cossets her three adorable cats.

3:25 pm–4:05 pm

Break with Refreshments

Grand Ballroom Foyer
Sponsored by Microsoft Azure

Visit the Sponsor Showcase!

4:05 pm–5:35 pm

Talks Track 1

Grand Ballroom ABC

Lessons Learned from Five Years of Multi-Cloud at PagerDuty

Wednesday, 4:05 pm–4:45 pm

Arup Chakrabarti, PagerDuty

Available Media

PagerDuty has been running a multi-cloud infrastructure over the past 5 years. In that time, we have tested multiple providers, learned about fun networking routes, seen what traffic filtering looks like, and encountered many other horrors.

In this talk, I will be going over the decisions and events that led up to PagerDuty's multi-cloud environment and how we managed it. I will go through the benefits and problems with our setup and the assumptions that we made that turned out to be completely wrong. By the end of this talk, you will be able to better answer the question of whether a multi-cloud setup is the right thing for your team or company.

Arup Chakrabarti, PagerDuty

Arup has been working in the space of software operations since 2007. He started out as an Operations Engineer at Amazon, helping to reduce customer defects with multiple teams for the Amazon Marketplace. Since then, he has managed and built operations teams at Amazon and Netflix to help improve availability and reliability. He currently works at PagerDuty, where he is part of the Infrastructure Engineering group.

Help Protect Your Data Centers with Safety Constraints

Wednesday, 4:50 pm–5:10 pm

Christina Schulman and Etienne Perot, Google

Available Media

Running a multi-tenant, multi-datacenter compute infrastructure requires automating machine management across their respective lifecycles. We look at how Google keeps its own infrastructure safe in the face of rogue automation and human error, as well as ever-changing machine management software.

We’ll discuss common failure patterns that we’ve encountered in Google’s automation systems, and ways to avoid and mitigate them. We’ll also cover principles of a good production safety constraint checking service: when to use it, what constraints it should have, and how to make that system safe from itself.

These principles apply at any scale, and it’s easier to apply them if you start early.

Christina Schulman, Google

Christina Schulman is an SRE at Google, where she works on datacenter machine management and system dependency control. Prior to joining Google in 2008, she wrote software for medical imaging systems, early Internet startups, and game companies. She has a B.A. in Computer Science from Princeton.

Etienne Perot, Google

Etienne Perot is an SRE at Google working on Borg, Google's cluster orchestration system. He works on systems to make the management of large scale systems at Google safe and reliable.

Real World SLOs and SLIs: A Deep Dive

Wednesday, 5:15 pm–5:35 pm

Matthew Flaming and Elisa Binette, New Relic

Available Media

If you've read almost anything about SRE best practices, you've probably come across the idea that clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity.

But in the real world, SLOs and SLIs can be challenging to define and implement. In this talk, we’ll dive into the nitty-gritty of how to define SLOs that support different reliability strategies and modalities of service failure. We’ll start by looking at key questions to consider when defining what “reliability” means for your organization and platform. Then we'll dig into how those choices translate into specific SLI/SLO measurement strategies in the context of different architectures (for example, hard-sharded vs. stateless random-workload systems) and availability goals.
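As a rough illustration of what "well-measured" can mean here, the sketch below computes a request-based SLI and compares it with an SLO target. It is not taken from the talk. It assumes the common convention that an SLI is the fraction of good events in a window and that the error budget is the share of events the SLO allows to be bad. The function names and sample numbers are hypothetical.

```python
def sli(good_events, total_events):
    """Service Level Indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # No traffic: treat the window as fully compliant.
    return good_events / total_events


def budget_remaining(good_events, total_events, slo_target):
    """Fraction of the window's error budget still unspent.

    Returns 1.0 when no budget has been used, 0.0 when it is exactly
    exhausted, and a negative value when the SLO has been missed.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 500 of them failed, 99.9% target.
measured = sli(999_500, 1_000_000)
remaining = budget_remaining(999_500, 1_000_000, slo_target=0.999)
```

With these sample numbers the target allows roughly 1,000 bad requests, so 500 failures leave about half of the budget unspent. A team could use that remaining fraction to decide how much risk to take on in the rest of the window.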

Matthew Flaming, New Relic

Matthew Flaming began his career in software engineering back when creating a web portal meant hacking together your own version of JSP and racking your own Solaris boxes. Since then he has led the development of complex, high-scale backend systems ranging from CDNs to IoT platforms with an equal emphasis on technical architecture and building organizations where innovation thrives. In his current role as VP of Site Reliability at New Relic, he focuses on the SRE practice and the technical, operational, and cultural aspects of scaling and reliability.

Elisa Binette, New Relic

Elisa Binette is a Senior Engineering Manager within New Relic’s Site Engineering Reliability Organization. The group focuses on helping teams measure and achieve their reliability goals, improving reliability for both the engineers within the company and for the end customers of New Relic products.

Talks Track 2

Grand Ballroom FGH

How SREs Found More than $100 Million Using Failed Customer Interactions

Wednesday, 4:05 pm–4:45 pm

Wes Hummel, PayPal

Available Media

This talk will go into PayPal SRE's journey of using data around customer failures (outward-looking) instead of payment data (inward-looking) to become a more customer-focused company. The results not only benefitted our millions of merchants and consumers, but also benefitted our company in a significant way. Topics covered will be:

  • The Principles used when starting the initiative
  • The technical implementation of using Failed Customer Interactions (FCIs)
  • The dashboards and visualizations of the data
  • The culture change that occurred in the company as a result of the initiative
  • The tactical efforts needed to get momentum behind this new way of measuring
  • The bad ideas and mistakes we made along the way
  • The results and where we're at today

Wes Hummel, PayPal

Wes Hummel is the Head of Site Reliability Engineering and Site Operations at PayPal. He and his team partner with Customer Support, Merchant Technical Support, Product Development, and other teams in PayPal to ensure customers have the best possible experiences. After spending over two decades in software development, technical operations, and leading teams, Wes knows how to deliver high quality software and solutions that delight customers - by understanding what problems they are trying to solve and how they run their businesses and lives.

Wes holds a Bachelor of Science in Electrical Engineering and a Masters in Business Administration.

Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data

Wednesday, 4:50 pm–5:10 pm

Tanner Lund, Microsoft

Available Media

An important part of site reliability is identifying and eliminating the causes of outages. Good problem management requires good problem definition and theme identification. Historically, this has been a largely inefficient human process, but problem management should never be driven solely by manual review of individual postmortems or a limited study of top-level metrics. If we want to scale, we must be systematic.

Machine Learning is a key component in this process. However, fitting models is only a small piece of the pie. Without good data sets you will learn precious little. We'll talk through some of the challenges we've identified when collecting and cleaning useful datasets for problem identification. How do you categorize? What is an outage theme? What is at risk for repeating and what problems have already been firmly left in the past?

On top of it all is the issue of success measurement. When we make reliability investments, how do we know that our actions are making a positive difference? We'll address some of the challenges we've encountered in measuring success (and reliability) in an environment that is ever-evolving. Join us as we discuss our vision for the future and share our journey so far.

Tanner Lund, Microsoft

Tanner Lund has been a part of Azure's SRE organization from the beginning. He has worked in a variety of roles, including Crisis Management, working on SREBot, building data pipelines, and leading services through SRE/DevOps transitions. Throughout it all his focus has been on learning at scale, identifying trends, and eliminating outages before they happen (or at least shortening them significantly). Azure is growing at a rapid pace and our methods of learning from outages must grow with it.

Outside of work Tanner spends the most time on his family, his faith, progressive music, fiction, and eSports.

How Not to Go Boom: Lessons for SREs from Oil Refineries

Wednesday, 5:15 pm–5:35 pm

Emil Stolarsky, Shopify

Available Media

Bad software doesn’t explode. You can describe it as exploding when it throws an exception, corrupts some data, or makes your computer unusable, but it doesn’t explode. When code doesn’t work, the solution is to figure out where the logic is incorrect and fix it. While SREs may be called engineers, we rarely face the consequences of engineers in other industries. 

In contrast, when a chemical engineer makes a mistake designing a refinery, the consequences are very different. We’ve all seen videos of the repercussions online. Big, loud explosions reducing massive facilities to chunks of twisted metal. The reality is working with unstable chemicals is a lot harder than keeping track of pointers in C.

Yet despite the differences, industrial process plants can be surprisingly similar to a complex software system. Where refineries will use pressure relief valves, web services will degrade gracefully. Regardless of whether you’re protecting against thermal runaway in a plant or a cascading failure in a data center, the fundamental ideas can be shared by both domains.

In this talk, I’ll explore the techniques and ideas used to build and operate refineries and how we can use them to make our software systems more resilient and reliable.

Emil Stolarsky, Shopify

Emil is a production engineer at Shopify where he works on scriptable load balancers, performance, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's fighting his fear of heights in a nearby rock climbing gym.

5:35 pm–7:30 pm

Reception

Terra Courtyard
Sponsored by Circonus

7:30 pm–9:00 pm

Lightning Talks

Grand Ballroom ABCFGH

Available Media
  • Track That Clone: Near-Realtime Data Audit for Distributed Data Replication—Janardh Bantupalli, LinkedIn
  • Auto-Cascading Security Updates Through Docker Images—Andrey Falko, Salesforce
  • How to Insulate Your Team from "Shoulder Taps"—Danny Gershman, SecurityScorecard
  • TechOps-How Stride Built a Culture of Reliability from Day One—David Giesberg, Atlassian
  • Auto Remediation in Diagnosing Network for SRE team—Sean Jiang, Huawei Technologies
  • Embedded SRE to Improve Time to Market at Scale—Hemant Kapoor, Wayfair
  • Turning It off and on Again—"Look Mom, No Hands"—Craig Knott, Atlassian
  • Who Are Your Alerts For?—Ren Lee, Arista Networks
  • Statistics for Dummies—Fred Moyer, Circonus
  • Package Masonry—Francisco Ruiz, Box Inc.
  • Delivering Technical Presentations the SRE Way—Peter Sahlstrom, MailChimp
  • Diversity: It’s Not about How or Who, but Why We Hire—John Schnipkoweit, Choozle
  • Why Netflix Built Titus with Reliability in Mind—Andrew Spyker, Netflix
  • An Unexpected Open Source Win—Amy Tobey, Tenable
  • SRE and Unicorn—Match Made in Heaven—Ritesh Vajariya, Bloomberg LP
  • Signal vs. Noise: How to Identify Projects That Will Survive—Chris Robertson, Scalyr

Thursday, March 29

7:30 am–8:30 am

Continental Breakfast

Grand Ballroom Foyer
Sponsored by VMware and Wavefront

8:30 am–10:20 am

Talks Track 1

Grand Ballroom ABC

Containerization War Stories

Thursday, 8:30 am–9:10 am

Ruth Grace Wong and Rodrigo Menezes, Pinterest

Available Media

Just like many other mid-sized companies, Pinterest runs tens of thousands of machines and hundreds of microservices. This talk describes why and how our engineers migrated our services to Docker, plus stories of how certain elements of the migration were more complicated than expected and caused outages.

We will go through the containerization plan and architecture at Pinterest, and how we used it to make things easier for developers. We also go into issues we had with performance: networking, Java versions, and operating system default packages.

Ruth Grace Wong, Pinterest

Ruth is a member of the Core Site Reliability Engineering team at Pinterest, where she experiences interesting outages, and works on systems and internal tools to help Pinterest stay up. She likes anything at scale—be it software at Pinterest, or visiting factories for fun. Ruth likes to pin desserts to bake, and DIY projects to laser cut and sew at her local makerspace.

Rodrigo Menezes, Pinterest

Rodrigo is a member of the Core Site Reliability Engineering team at Pinterest. While at Pinterest his main focus has been on Docker and creating applications to support running their stateless prod infrastructure in a container. Outside of work, Rodrigo loves rock climbing, surfing, and anything outdoors, as well as working on various DIY projects.

Previously, Rodrigo worked at LiveVox, LiveScribe, and other companies you may never have heard of.

Resolving Outages Faster with Better Debugging Strategies

Thursday, 9:15 am–9:55 am

Liz Fong-Jones and Adam Mckaig, Google

Available Media

Engineers spend a lot of time building dashboards to improve monitoring but still spend a lot of time trying to figure out what’s going on and how to fix it when they get paged. Building more dashboards isn’t the solution; using dynamic query evaluation and integrating tracing is.

Liz Fong-Jones, Google

Liz Fong-Jones is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Adam Mckaig, Google

Adam Mckaig is an SRE at Google in New York, where he looks after a monitoring system. Previously he built things at the New York Times, Bloomberg, and UNICEF. He enjoys C++, which probably says it all.

Monitoring DNS with Open-Source Solutions

Thursday, 10:00 am–10:20 am

Felipe Espinoza and Javier Bustos, NIC Labs

Available Media

NIC Chile is the DNS administrator of the ccTLD .cl, managing over 500,000 domain names in an infrastructure composed of more than 30 servers distributed around the globe (some of them belonging to one of the three Anycast clouds used in the name service), each answering around 3,000 queries/sec. In this scenario, we took on the challenge of building a real-time monitoring system for our DNS service using only open-source software.

We reviewed and benchmarked different alternatives: Packetbeat, Collectd, DSC, Fievel, and GoPassiveDNS for data collection; Prometheus, Druid, ClickHouse, InfluxDB, ElasticSearch, and OpenTSDB as DB engines; and Kibana, Grafana, and Graphite Web for visualization. The information we wanted was: the five top-queried domains, the mean length of DNS queries, and the number of queries per subnetwork, per operation code (OPCODE), per class (QCLASS), per type (QTYPE), per answer type, per transport protocol (UDP, TCP), and with active EDNS.

With that scenario, we measured:

  • CPU used by DB
  • RAM
  • Secondary memory
  • Time required for data aggregation

We present two compatibility matrices summarizing our findings and a ready-to-use open-source integrated monitoring system.

Felipe Espinoza, NIC Labs

Felipe Espinoza is a Software Engineer at NIC Labs, where he does research on maintaining and improving the availability of the DNS service. Before joining NIC Labs, he had an internship at Google and worked at several other analytics companies, which helped him realize the difficulties and challenges that distributed systems face. He is mainly interested in large-scale systems design and in looking for weak spots in different applications.

Javier Bustos, NIC Labs

Javier Bustos-Jiménez is the director of NIC Labs at the University of Chile. He has been working on several projects such as DNS real-time monitoring and Mobile Internet QoS platforms. He is mainly interested in complex networks, internet protocols, network privacy/security, and data science.

Talks Track 2

Grand Ballroom FGH

Antics, Drift, and Chaos

Thursday, 8:30 am–9:10 am

Lorin Hochstein, Netflix

Available Media

Large systems evolve from successful, smaller ones, an observation predicted by the branch of study known as systems theory. Systems theory also predicts that our systems will inevitably behave, and fail, in unforeseen ways. This talk will draw from the ideas of two very different systems theorists to demonstrate that neither quality architecture nor thorough testing can prevent our software from eventually exhibiting pathological behavior. The first is the safety researcher Sidney Dekker, who proposed a theory of "drift into failure" that describes how seemingly reliable safety-critical systems can still lead to accidents. The second is the late pediatrician John Gall, who coined the "Generalized Uncertainty Principle" about how all types of complex systems behave unexpectedly.

Even though failure is inevitable, there is still hope. Chaos Engineering is an approach that can be used to identify system vulnerabilities before they lead to outages. This talk will cover how to design and run Chaos Engineering experiments, drawing examples from our experiences at Netflix.

Lorin Hochstein, Netflix

Lorin Hochstein is a Sr. Software Engineer on the Chaos Team at Netflix, where he works on ensuring that Netflix remains available. He was previously Sr. Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.

Security as a Service

Thursday, 9:15 am–9:55 am

Wojciech Wojtyniak, Facebook

Available Media

In the game of security, defenders have to be lucky every single time, but just one time is enough for an attacker. Good news: by leveraging good practices one can tilt the scales significantly in their favor. Bad news: it's not always an easy and quick thing to do. PE and SRE teams have helped build reliable infrastructure and software across the stack. Can their skills and experience be leveraged to improve security posture within an organization?

Facebook's Production Engineering Security team has proved that it's indeed possible. In this talk I'm going to give you a brief overview of the security landscape today (aka the battlefield), give you a simple framework for thinking about security in the context of a production environment and show ways in which we engage with others both internally and externally.

Wojciech Wojtyniak, Facebook

Wojciech joined Facebook as a Production Engineer and member of the PE Security team in 2013. During his tenure there he has worked on a wide variety of security-related projects, including the SSH environment, the internal certificate authority, and public certificate management. Before that he worked in the software configuration management area for a telecom company.

Breaking in a New Job as an SRE

Thursday, 10:00 am–10:20 am

Amy Tobey, Tenable

Available Media

In theory, most companies have onboarding processes to assimilate you and prepare you for life in the collective. In reality, this is not common and folks are busy, so it’s often up to you to figure out where you can make an impact and start contributing. Even after you’ve been at a company for a while, it can be useful to re-orient after extended leave or a major organizational change. This talk presents a candid rendition of Amy's experience changing jobs over the summer of 2017, covering mistakes she made and strategies she used to find the right places to focus.

Amy Tobey, Tenable

Amy is a parent, technologist, musician, and SRE at Tenable. While attending university as a music major, she got into MUDs, C, and Linux, eventually ending up with a career as a sysadmin. Over the last 18 years, Amy has worked on everything from kernel hacking to full-stack applications, mostly from inside operations teams.

10:20 am–11:00 am

Break with Refreshments

Grand Ballroom Foyer
Sponsored by Catchpoint

Visit the Sponsor Showcase!

11:00 am–12:05 pm

Talks Track 1

Grand Ballroom ABC

"Capacity Prediction" instead of "Capacity Planning": How Uber Uses ML to Accurately Forecast Resource Utilization

Thursday, 11:00 am–11:20 am

Rick Boone, Uber

Available Media

At Uber, the majority of our services are in the critical path of customer-facing features (matching drivers and riders, handling ongoing trips, determining prices or ETAs, etc.). Each of these services consumes resources (CPU, MEM, NET, DISK) in a manner which is "driven" by the behavior of one or more of a few key business metrics ("Trips Occurring", "Drivers Online", "App Opens", etc.). For instance, a CPU-bound, "trips-driven" service will see its CPU utilization increase when trips demand increases. With this in mind, along with historical data and machine learning algorithms in hand, we can statistically model the relationship between these key business metrics and the resource utilization of each individual service. This allows us to predict, with stunning accuracy, how many hardware resources any service will need at any arbitrary point in the future. This talk will walk you through the method of gathering the right data and applying machine learning to it, to allow you to revolutionize how you approach and perform capacity planning.
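As a rough illustration of the modeling idea in this abstract (not Uber's actual method, which the abstract does not specify), the sketch below fits an ordinary least-squares line relating one business metric to a service's CPU usage and then extrapolates it. The function names, the sample numbers, and the choice of a single-variable linear model are all assumptions made for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_cpu(slope, intercept, trips):
    """Estimate CPU cores needed for a given level of the business metric."""
    return slope * trips + intercept


# Hypothetical history: concurrent trips vs. CPU cores used by one service.
trips = [1000, 2000, 3000, 4000]
cores = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(trips, cores)

# Extrapolate to a forecast demand level that has not been observed yet.
forecast = predict_cpu(slope, intercept, 6000)
print(f"cores needed at 6000 trips: {forecast:.1f}")
```

In practice the demand forecast itself (here the hard-coded 6000) would come from a separate model of the business metric, and a real service would likely need more than one input metric.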

Rick Boone, Uber

I've worked in reliability engineering for over 12 years, most recently at Uber as both an SRE (2 years) and a Capacity Engineer (1 year) and, prior to that, at Facebook as a Production Engineer (3 years). At both companies, I've ensured reliability for large-scale, complex applications and platforms with both high criticality and high-performance requirements.

Distributed Tracing, Lessons Learned

Thursday, 11:25 am–12:05 pm

Gina Maini, Jet.com

Available Media

Your engineering job might look something like this:

1. Understand dependencies.
2. Keep the site up.
3. Understand where you messed up.

Distributed Tracing helps you with all of these. Industry standards like OpenTracing provide some mindshare on tracing RPC-style communications, but leave much unsaid about asynchronous communications over message buses or Kafka. At Jet.com we use Kafka and Event Sourcing at nearly every tier of our architecture, so tracing these types of communications motivated us to create our own protocol and semantics around operations and channels. We learned a lot from creating our own libraries in F# and on-boarding engineering teams around our company to try tracing their communications, and we are here to share tips, tricks, and hard-earned lessons.

Gina Maini, Jet.com

Gina Maini is a professional functional programmer living in Jersey City, NJ. She works on internal infrastructure and distributed tracing at Jet.com. Her team at Jet is challenged with making "event sourcing scale" and providing generic solutions in F#. She is an avid renaissance woman and wants to "hallway talk" about music theatre history, linguistics, makeup techniques, physics, political philosophy, science fiction, and cats.

Talks Track 2

Grand Ballroom FGH

Junior Engineers Are Features, Not Bugs

Thursday, 11:00 am–11:20 am

Kate Taggart, HashiCorp

Available Media

There are many benefits to hiring junior engineers, but when it comes to teams responsible for production infrastructure, we default to thinking such risky environments are no place for newbies. However, “the things we do are too risky to have junior engineers working on them” often instead means “we haven’t invested properly in resiliency”. Hiring junior engineers onto production-critical teams can guide you to reduce risk to your production systems by highlighting needed improvements you should be making regardless. We’ll walk through three categories—architecture, process, and tooling—and discuss the specific ways that a junior engineer may help illustrate the need for improvement in each.

Kate Taggart, HashiCorp

Kate has been in software engineering for almost a decade, having worked in areas ranging from power grid resiliency to fintech to enterprise software. Kate has managed a variety of teams across the devops spectrum at New Relic and Simple, and is now at HashiCorp helping build tools for other companies and teams to manage their own infrastructure.

Approaching the Unacceptable Workload Boundary

Thursday, 11:25 am–12:05 pm

Baron Schwartz, VividCortex

Available Media

We've all stared in frustration at a system that degraded into nonresponsiveness, to the point that you couldn't even kill-dash-nine whatever was responsible for the problem. A key fact we all recognize, but may not recognize as significant, is that this isn't a sharp boundary. There's a gradient of deteriorating performance where the system becomes less predictable and stable. In this talk I'll explain:

  • what the unacceptable workload boundary is
  • how to recognize and predict the signs
  • how to measure and model this simply
  • what causes nonlinear performance degradation
  • how to use this to architect more scalable systems

Baron Schwartz, VividCortex

Baron is the founder and CEO of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software, and several books including High Performance MySQL. He’s focused his career on learning and teaching about performance and observability of systems generally (including the view that teams are systems and culture influences their performance), and databases specifically.

12:05 pm–1:20 pm

Luncheon

Santa Clara Ballroom

1:20 pm–3:10 pm

Talks Track 1

Grand Ballroom ABC

Building Shopify's PaaS on Kubernetes

Thursday, 1:20 pm–2:00 pm

Karan Thukral, Shopify

Available Media

Shopify has grown from fewer than 20 production services in 2011 to more than 400 in 2017. These services currently run on a wide variety of production environments, making it harder to share tools across applications. Moving to the cloud is a common occurrence, but Shopify decided to build a platform as a service (PaaS) to consolidate all production environments. This PaaS was built on a public cloud provider and Kubernetes (k8s). Despite consolidation, the PaaS added maintenance and support load on the team. SREs build tools that are used by developers with a wide array of experience, which adds challenges regarding user experience, education, support, etc.

This talk contains a brief overview of development at Shopify, why we decided to move to the cloud, what we built, and what was learned along the way. This talk is not meant to be a technical introduction to using Kubernetes, but instead a case study of the team’s experiences building the PaaS. Issues regarding onboarding, education, etc. are similar to issues faced in most SRE projects. The case study is meant to generalize the lessons the team learnt over the project and show how they can be applied more broadly at other organizations.

Karan Thukral, Shopify

Karan is a Systems Design Engineering graduate from the University of Waterloo and currently works as a Production Engineer at Shopify. Karan is part of the team responsible for building and maintaining a platform as a service for internal developers. Along with production engineering, Karan has experience working as an iOS and backend (rails) developer.

Know Thy Enemy: How to Prioritize and Communicate Risks

Thursday, 2:05 pm–2:45 pm

Matt Brown, Google

Available Media

Every SRE team attempting to manage, mitigate, or eliminate the risks facing their system will encounter two fundamental problems:

1. As humans our intuitive judgement about risk is unreliable.
2. The work required to address all potential risks far outstrips our available time and resources.

The CRE team (Customer Reliability Engineering—a group of Google SREs who partner with cloud customers to implement SRE practices in their application and across the cloud provider/customer relationship) battles these challenges every day in our interactions with customers. We have drawn on Google’s deep experience managing reliable systems, and the broader field of risk management techniques to develop a process that allows us to communicate an objective ranking of risks and their expected cost to a system. This ranking and the associated cost data can then be used as an input to team and business decision making.

This talk will cover the development of our process, explain how anyone can apply it to any system today and demonstrate how the resulting ranking and costs provide objective, consistent data which can take the tension and subjectivity out of often tense discussions around work priorities and focus (e.g. more features or more reliability?).

Matt Brown, Google

Matt began his SRE career with Google in Dublin in 2007, shifting to London in 2012, and since 2016 has worked remotely from Cambridge in New Zealand.

During this time Matt has worked on or led a range of diverse SRE teams with responsibilities ranging from Google's internal corporate infrastructure, through to the Internet facing load-balancing infrastructure responsible for keeping Google fast and always available.

His current role with the Customer Reliability Engineering team is pioneering how to apply SRE practices across organisations to address the challenges posed by today's world where the traditional boundaries between platforms and their customers are being blurred.

Automatic Metric Screening for Service Diagnosis

Thursday, 2:50 pm–3:10 pm

Yu Chen, Baidu

Available Media

When a service is experiencing an incident, the oncall engineers need to quickly identify the root cause in order to stop the loss as soon as possible. The procedure of diagnosis usually consists of examining a bunch of metrics, and heavily depends on the engineers’ knowledge and experience. As the scale and complexity of the services grow, there could be hundreds or even thousands of metrics to investigate and the procedure becomes more and more tedious and error-prone.

While hard for humans, such repetitive work is something algorithms perform efficiently and precisely. In this talk, we will introduce an automatic screening approach on service metrics, including system performance indicators and user-defined metrics. An anomaly detection algorithm filters out the normal metrics and creates the machine/instance level abnormal patterns. The abnormal patterns are then clustered into groups. Finally, the groups are ranked according to their abnormal level. The top groups, together with their abnormal metrics, are presented to the engineers as a recommendation. Our experience on real cases shows that the top 3 groups can cover most of the root cause modules and reveal important information for understanding the reason for the corresponding incident.
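The three stages described above (filter out normal metrics, cluster the abnormal patterns, rank the groups) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm presented in the talk: it uses a z-score against a baseline as the anomaly detector, treats instances with an identical set of abnormal metrics as one cluster, and ranks groups by summed peak z-score. The function names, the threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore(baseline, current):
    """Distance of the current value from the baseline, in standard deviations."""
    sd = pstdev(baseline)
    if sd == 0:
        return 0.0 if current == mean(baseline) else float("inf")
    return abs(current - mean(baseline)) / sd


def screen(data, threshold=3.0):
    """data: {instance: {metric: (baseline_values, current_value)}}.

    Returns groups as (abnormal_metrics, instances), most abnormal first.
    """
    groups = defaultdict(list)  # abnormal pattern -> [(instance, peak score)]
    for instance, metrics in data.items():
        scores = {m: zscore(b, c) for m, (b, c) in metrics.items()}
        # Stage 1: drop metrics that look normal for this instance.
        abnormal = {m: s for m, s in scores.items() if s >= threshold}
        if abnormal:
            # Stage 2 (simplified): identical patterns form one cluster.
            groups[frozenset(abnormal)].append((instance, max(abnormal.values())))
    # Stage 3: rank clusters by their total abnormality.
    ranked = sorted(
        groups.items(),
        key=lambda kv: sum(score for _, score in kv[1]),
        reverse=True,
    )
    return [
        (sorted(pattern), sorted(inst for inst, _ in members))
        for pattern, members in ranked
    ]


# Hypothetical sample: two web instances with a latency spike, one DB CPU bump.
data = {
    "web-1": {"latency_ms": ([100, 102, 98, 100], 180), "cpu": ([50, 52, 48, 50], 51)},
    "web-2": {"latency_ms": ([100, 102, 98, 100], 175), "cpu": ([50, 52, 48, 50], 49)},
    "db-1": {"latency_ms": ([10, 11, 9, 10], 10), "cpu": ([30, 32, 28, 30], 40)},
}

result = screen(data)
for pattern, instances in result:
    print(pattern, instances)
```

A production system would need a detector that handles seasonality and a clustering step that tolerates partially overlapping patterns; exact-match grouping is used here only to keep the sketch short.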

Yu Chen, Baidu

Yu Chen is the Data Architect in the Operation Department of Baidu. His work focuses on anomaly detection and problem diagnosis using statistical analysis and machine learning methods. Previously he worked at Microsoft Research Asia as a Researcher, with research interests in Distributed Systems and Machine Learning.

Talks Track 2

Grand Ballroom FGH

Whispers in Chaos: Searching for Weak Signals in Incidents

Thursday, 1:20 pm–2:00 pm

J. Paul Reed, Release Engineering Approaches

Available Media

The complexity of the socio-technical systems we engineer, operate, and exist within is staggering. Despite this, complexity remains a fact of life in software development and operations, a fact which can become easy to ignore, due to our daily interactions with and familiarity with those systems. (And, let's face it, often a strategy to cope with that complexity!) When those systems falter or fail, we often find in the postmortems and retrospectives afterward that there were "weak signals" that portended doom, but we didn't know they were there or how to sense them.

In this talk, we'll look at what research in the safety sciences and cognitive psychology has to say about humans interacting with and operating complex socio-technical systems, including what aircraft carriers have to do with Internet infrastructure operations, how resilience engineering can help us, and the use of heuristics in incident response. All of these provide insight into ways we can improve one of the most advanced, and most effective, monitoring tools we have available to keep those systems running: ourselves.

J. Paul Reed, Release Engineering Approaches

J. Paul Reed has over fifteen years of experience in the trenches as a build/release engineer, working with such storied companies as VMware, Mozilla, Postbox, Symantec, and Salesforce.

In 2012, he founded Release Engineering Approaches, a consultancy incorporating a host of tools and techniques to help organizations "Simply Ship. Every time." He's worked across a number of industries, from financial services to cloud-based infrastructure to health care.

He speaks internationally on release engineering, DevOps, operational complexity, and human factors and is currently a Master of Science candidate in Human Factors & Systems Safety at Lund University.

Architecting a Technical Post Mortem

Thursday, 2:05 pm–2:45 pm

Will Gallego, Etsy

Available Media

SREs are frequently tasked with being front and center in intense, highly demanding situations in the production environment that require clear lines of communication. Our systems fail not because of a lack of attention or laziness but due to cognitive dissonance between what we believe about our environments and the objective interactions both internal and external to them. In this talk, I’ll discuss how we can revisit our established beliefs surrounding failure scenarios with an emphasis not on the who in decision making but the why behind those decisions. With this mindset, we can encourage our teams to reject shallow explanations of human error for said failures, instead focusing on how we can gain greater understanding of these complexities. I’ll walk through the structure of post mortems used at large tech companies with real world examples of failure scenarios and debunk myths regularly attributed to failures. Through these discussions, you'll learn how to incorporate open dialogue within and between teams to bridge these gaps in understanding.

Will Gallego, Etsy

Will Gallego is a systems engineer with 15+ years of experience in the web development field, currently as a Staff Engineer at Etsy. Comfortable with several parts of the stack, he focuses now on building scalable, distributed backend systems and tools to help engineers grow. He believes in a free and open internet, blame aware post mortems, and pronouncing gif with a soft “G”.

Your System Has Recovered from an Incident, but Have Your Developers?

Thursday, 2:50 pm–3:10 pm

Jaime Woo

Available Media

Mistakes are inevitable, and happen to the best of us. Our industry adopts a blame-free culture, but that doesn't negate the sting that occurs when we're at the heart of a mess-up.

Developers continually raise the bar on how to prevent errors, mitigate damage for ones that arise, and wring out as many learnings as possible after the damage is done. But much of this work is focused on the products, and not the people. And given the high stakes in SRE, the psychological impact of a mistake can run the gamut from minor to near-traumatic.

Where are the game day exercises that simulate how to support a coworker who just caused 3 am pings and 20-hour work days? What resources should we share to help people understand the stages of emotions they'll feel after a major incident?

The concept of psychological safety is well understood as a key predictor for high-performing teams, but what does that entail? Drawing from our work at Shopify, and lessons from fields like sports, medicine, and even theatre, attendees will leave with a series of tangible actions and exercises to help restore team trust and rebuild a developer's confidence.

Jaime Woo

Jaime Woo's first job was growing bacteria in a lab to help build artificial membranes. A former journalist who led technology communications at Shopify, he worked closely with production engineering to help build team culture and collaboration through internal and external communications. He lives in Toronto and has a dog named Taco.

3:10 pm–3:50 pm

Break with Refreshments

Grand Ballroom Foyer
Sponsored by Datadog

3:50 pm–5:20 pm

Closing Plenary Session

Grand Ballroom ABCFGH

The History of Fire Escapes

Thursday, 3:50 pm–4:30 pm

Tanya Reilly, Squarespace

Available Media

When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug, we usually have a contingency plan. We reduce damage, redirect traffic, page someone, drop low-priority requests, follow documented procedures. But why do many failures still come as a surprise? In this talk, we look at some real life analogs to preventing and managing software failures. Fire partitions. Public safety campaigns. Smoke alarms. Sprinkler systems. Doors that say “This is not an exit”. And fire escapes. What can we learn from the real world about expecting failure and designing for it?

Leaping from Mainframes to AWS: Technology Time Travel in the Government

Thursday, 4:30 pm–5:00 pm

Andy Brody and James Punteney, U.S. Digital Service

Available Media

The year is 1999, judging from the technology on your servers. Your mission is to ensure the next .gov website launches successfully using modern SRE tools and practices. Where do you start?

In this talk, U.S. Digital Service engineers will discuss lessons learned from helping teams at multiple government agencies adopt SRE tools and practices, changing the culture to emphasize launching fast and learning from mistakes in an environment where healthcare.gov-style disasters are commonplace.

In October 2017, the USDS helped launch a new Global Entry enrollment site built with modern technology. It uses Login.gov, a new service that provides a single, secure, and usable way to log in to multiple government websites.

We describe what it was like on launch day when we moved Global Entry from “cloud infrastructure” where it takes 6 months to provision a VM to an AWS infrastructure where deployments happen daily, and how we scaled Login.gov to meet the load.

Andy Brody, U.S. Digital Service

Andy is an engineer with the U.S. Digital Service, where he currently leads the infrastructure team for Login.gov. He has also worked to improve the technology practices at U.S. Citizenship & Immigration Services, which is digitizing the process that millions of people go through every year to submit applications and requests.

Previously Andy worked on infrastructure and security at Stripe, where he was the first member of the systems team. He helped grow Stripe's platform through many orders of magnitude of traffic, and he also started Stripe's security team.

James Punteney, U.S. Digital Service

James is the Director of Emerging Projects for the Department of Homeland Security Digital Service. Since May 2017, he has led efforts to improve international travel, as well as modernize tools & techniques used to build software within the government. Most recently, he helped lead the relaunch of Customs & Border Protection’s Trusted Traveler Program, a program that enables its 6 million+ members to travel hassle-free in and out of the United States.

With almost two decades of experience, James came to government from Manzama where he served as VP of Product & Engineering. Prior to that, he held senior executive positions at Genius Rocket, Transcendigital and various mid-late stage startups. James holds a Master’s degree from University of Notre Dame – Mendoza College of Business, and a Bachelor’s in Information Systems Management from Washington State University.

Operational Excellence in April Fools’ Pranks: Being Funny Is Serious Work!

Thursday, 5:00 pm–5:20 pm

Thomas Limoncelli, Stack Overflow, Inc.

Available Media

One of the most “high stakes” launches you can do is the April Fools’ Prank (AFP). The usual best practices don’t always work and there is no second chance: you can’t just re-launch on April 2nd. Many silos must sign off on the plan, yet the plan must be kept secret. Beta testing is particularly difficult. Loads shift from zero to millions in one day with little room for mistakes.

In this talk I’ll discuss how to communicate across the company to get buy-in, how to load test using “dark launches”, how to roll back using feature flags, and other techniques for ensuring a high stakes launch happens correctly. I’ll include stories from many AFPs that I’ve both observed and been a part of.

Thomas Limoncelli, Stack Overflow, Inc.

Tom is the SRE Manager at StackOverflow.com and author of Time Management for System Administrators (O'Reilly). He is co-author of The Practice of System and Network Administration and The Practice of Cloud System Administration. He is an internationally recognized author, speaker, system administrator, and DevOps advocate. He's previously worked at small and large companies including Google, Bell Labs/Lucent, and AT&T. His blog is EverythingSysadmin.com and he tweets @YesThatTom.

5:20 pm–5:25 pm

Closing Remarks

Grand Ballroom ABCFGH
Program Co-Chairs: Kurt Andersen, LinkedIn, and Betsy Beyer, Google

5:30 pm–6:30 pm

Light Happy Hour

Terra Courtyard
Sponsored by Box