Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way

Monday, October 29, 2018 - 3:00 pm–3:30 pm

Michael Kehoe and Todd Palino, LinkedIn

Abstract: 

All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success.

We will look at Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, the steps that were taken, and how the engineering organization as a whole supports the process.

Michael Kehoe, LinkedIn

Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automation.

Todd Palino, LinkedIn

Todd Palino is a Senior Staff Site Reliability Engineer at LinkedIn, tasked with keeping some of the largest Kafka and Zookeeper deployments fed and watered. Previously, he was a Systems Engineer at Verisign, developing service management automation for DNS, networking, hardware and operating systems. In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and can be found sharing his experience on both SRE and Apache Kafka at industry conferences and tech talks. He is also the co-author of Kafka: The Definitive Guide, available from O’Reilly Media.

BibTeX
@conference {221732,
author = {Michael Kehoe and Todd Palino},
title = {{Code-Yellow}: Helping Operations {Top-Heavy} Teams the Smart Way},
year = {2018},
address = {Nashville, TN},
publisher = {USENIX Association},
month = oct
}
