Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way

Monday, March 25, 2019 - 2:15 pm - 2:45 pm

Michael Kehoe and Todd Palino, LinkedIn

Abstract: 

We will look at Code Yellow, the term we use for the process of "righting the ship," and discuss how to identify teams that are struggling. Drawing on three separate experiences, we will examine the root causes, the steps that were taken, and how the engineering organization as a whole supports the process.

Michael Kehoe, LinkedIn

Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automation.

Todd Palino, LinkedIn

Todd Palino is a Senior Staff Engineer in Site Reliability at LinkedIn on the Capacity Engineering team, where his team is creating a framework for application capacity measurement, analysis, and change intelligence. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and is the co-author of Kafka: The Definitive Guide, now available from O'Reilly Media. Out of the office, you can find Todd at conferences like SREcon and LISA, sharing his experience from years in SRE technical leadership, and at Kafka Summit or ApacheCon talking about how to feed and water Kafka infrastructures. Or maybe out on the trails, training for the next marathon.

BibTeX
@conference {229561,
author = {Michael Kehoe and Todd Palino},
title = {Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way},
year = {2019},
address = {Brooklyn, NY},
publisher = {{USENIX} Association},
}