Availability, Latency, and Cost: Withstanding Regional Outages

Wednesday, 29 August, 2018 - 14:00-14:50

Aaron Blohowiak, Netflix

Abstract: 

Running in multiple regions is better for your users through increased availability and lower latencies, and it won't cost as much as you think. We've turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach, and our understanding, as we've matured.

Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it's a matter of routine that usually concludes with a brief "all is well" email.

This talk dives into the experiences of operating in multiple regions at scale and the algebraic models, code and incident management playbooks we've developed to tame, refine, and leverage our approach. Once you've decided to go multi-region, the three major questions that arise are: how many regions, how should we steer users to regions, and how do we actually perform the failover? In addition to the story of how we got to where we are, I'll present the design considerations and system models we used to make those decisions.

Aaron Blohowiak, Netflix

Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to serving over 100M users at Netflix. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations on the Traffic team at Netflix. Previously, Aaron co-authored Chaos Engineering (O'Reilly, 2017).

SREcon18 Europe/Middle East/Africa Open Access Videos
Sponsored by Indeed

BibTeX
@inproceedings {218921,
author = {Aaron Blohowiak},
title = {Availability, Latency, and Cost: Withstanding Regional Outages},
booktitle = {SREcon18 Europe/Middle East/Africa (SREcon18 Europe)},
year = {2018},
address = {Dusseldorf},
url = {https://www.usenix.org/node/218922},
publisher = {USENIX Association},
month = aug
}
