Breaking Things on Purpose

Monday, March 13, 2017 - 3:50pm–4:45pm

Kolton Andrus, Gremlin Inc.

Abstract: 

Failure testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By testing proactively, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service.

At Netflix and Amazon, we ran failure exercises on a regular basis to ensure we were prepared. These experiments helped us find problems and saved us from future incidents. Come and learn how to run an effective “Game Day” and safely test in production. Then sleep peacefully knowing you are ready!

Kolton Andrus is the founder and CEO of Gremlin Inc., which provides ‘Failure as a Service’ to help companies build more resilient systems. Previously, he was a Chaos Engineer at Netflix, improving streaming reliability and operating the Edge services. He designed and built F.I.T., Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website. At both companies he served as a ‘Call Leader’, managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, as it lets him break things for fun and profit.

BibTeX
@conference {201802,
author = {Kolton Andrus},
title = {Breaking Things on Purpose},
year = {2017},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
