Incident Management at Netflix Velocity

Dave Hahn

Monday, October 29, 2018 - 11:45 am–12:30 pm

Dave Hahn, Netflix

Abstract:

Netflix, as both a service and a system, goes through an enormous amount of change all the time. Our engineering teams make thousands of changes a day while our customers stream hundreds of millions of hours of entertainment every day. At that velocity, an outage lasting only seconds or minutes has a real and noticeable impact on our customers. Stir in some Chaos Engineering and things become even more unpredictable.

The talk begins with a story. Netflix had a healthy relationship with Chaos Monkey, our tool for ensuring that the loss of an instance didn’t affect a running service. We’d had such good luck that we extended our plans from just Chaos Monkey to more Monkeys that would do nasty things to our environment. A new entry, Latency Monkey, would help us improve the health of the interactions between our microservices by injecting latency and errors at our common IPC layer. What we thought was a safe little experiment went completely off the rails. The centralized SRE team, called CORE, realized that we’d have to think differently about outages and how we manage them if the company was going to be successful moving forward.

This is the story of how the centralized SRE team at Netflix changed and adapted to help the service and our engineering teams prepare for and handle problems—big and small—when they do occur.

Key Takeaways:

How Netflix prepares for failures
Incident handling at velocity requires special expertise
Preparation and training of everyone that runs services is key for quick recovery
You should be spending more time learning after an incident than you spend managing it while it is happening
Outages being unique is an excellent goal—it takes work to make it happen

Dave Hahn, Netflix

Dave Hahn is a Senior SRE in the Cloud Operations & Reliability Engineering organization at Netflix. He has designed tools and systems used by many teams in the organization to support the Netflix service. He has decades of experience in systems operations, networks, reliability, cron jobs, cable termination, grep, and not taking himself very seriously.

Connect:

@relix42

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@conference {221724,
author = {Dave Hahn},
title = {Incident Management at Netflix Velocity},
year = {2018},
address = {Nashville, TN},
publisher = {USENIX Association},
month = oct
}
