Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures

Friday, October 4, 2019 - 14:00-14:30

John Arthorne, Shopify


A core concept in SRE is that we learn from major system failures, using the experience gained to improve the resiliency of our systems. If we are successful at this, we avoid repeating the same customer impact the next time our systems fail in a similar way. This means that when the next big failure happens, it will often be a novel problem. This talk will focus on how to prepare for novel large-scale failures. I will start by summarizing common methods of incident training, including simulated disaster scenarios and live system exercises involving controlled but real production system failures. I will outline the benefits of each approach and our experience in employing them at Shopify as our team has grown. The talk will wrap up with a summary of a large-scale incident exercise we ran involving a hundred people, an office building, and 20,000 pieces of LEGO.

John Arthorne, Shopify

John leads a developer team within the Shopify Production Engineering group, with a focus on building tools to improve the quality of production systems, and on engineering incident response. John is a frequent speaker at technical conferences, including most recently SREcon, DevOps Days, and GitHub Universe. He has served on conference program committees and was voted a JavaOne Rock Star. His current interests are in tools and practices for infrastructure automation and incident response. Before joining Shopify, John led a team building cloud-based developer tooling for IBM Bluemix and was a prominent leader within the Eclipse open source community.

@conference {239508,
author = {John Arthorne},
title = {Expect the Unexpected: Preparing {SRE} Teams for Responding to Novel Failures},
year = {2019},
address = {Dublin},
publisher = {{USENIX} Association},
month = oct,