TrafficShift: Avoiding Disasters at Scale

Monday, March 13, 2017 - 9:55am–10:50am

Michael Kehoe and Anil Mallapur, LinkedIn


LinkedIn has evolved from serving live traffic out of one data center to four geographically distributed data centers. Serving live traffic from four data centers at the same time has taken us from a disaster recovery model to a disaster avoidance model, where we can take an unhealthy data center out of rotation and redistribute its traffic to the healthy data centers within minutes, with virtually no visible impact to users.
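
As a rough illustration of the redistribution described above, the following sketch takes one site out of rotation and splits its share of traffic among the remaining sites in proportion to their current shares. The function name, the data center names, and the equal 25% split are all hypothetical; the abstract does not say how LinkedIn actually computes or applies traffic weights.

```python
def redistribute(weights, failed):
    """Return new traffic shares after taking `failed` out of rotation.

    `weights` maps a data center name to its share of live traffic. The
    failed site's share is split among the remaining sites in proportion
    to their current shares. Names and numbers are hypothetical.
    """
    if failed not in weights:
        raise KeyError(f"unknown data center: {failed}")
    healthy = {dc: w for dc, w in weights.items() if dc != failed}
    total = sum(healthy.values())
    if total <= 0:
        raise ValueError("no healthy data center left to absorb traffic")
    # Renormalize so the remaining shares sum to 1 again.
    return {dc: w / total for dc, w in healthy.items()}


# Assumed starting point: four sites, each serving a quarter of the traffic.
shares = {"dc1": 0.25, "dc2": 0.25, "dc3": 0.25, "dc4": 0.25}
after = redistribute(shares, "dc1")
```

Under these assumed numbers, each remaining site goes from a quarter to a third of total traffic, so every healthy data center needs headroom for about one third more load than it normally serves.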

As we transitioned from a big monolithic application to micro-services, we found it difficult to determine whether individual services had the capacity to handle extra load during disaster scenarios. Stress testing individual services using artificial load in a complex micro-services architecture wasn’t sufficient to provide enough confidence in a data center’s capacity. To solve this problem, we at LinkedIn leverage live traffic to stress services site-wide by shifting traffic to simulate a disaster load.

This talk provides details on how LinkedIn uses Traffic Shifts to mitigate user impact by migrating live traffic between its data centers and to stress test site-wide services for improved capacity handling and member experience.

Michael Kehoe, LinkedIn

Michael Kehoe, Staff Site Reliability Engineer in the Production-SRE team, joined the LinkedIn operations team as a new college graduate in January 2014. Prior to that, Michael studied Engineering at the University of Queensland (Australia) where he majored in Electrical Engineering. During his time studying, he interned at NASA Ames Research Center working on the PhoneSat project.


@conference {201755,
author = {Michael Kehoe and Anil Mallapur},
title = {{TrafficShift}: Avoiding Disasters at Scale},
year = {2017},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
