Wait for Us! Evolving On-Call as Your Company Grows

Friday, November 03, 2017 - 3:00 pm–3:30 pm

Christopher Hoey, Datadog

Abstract: 

The talk will start with a quick overview of Datadog's rapid growth and the challenges that resulted, to illustrate the point at which a simple primary and secondary on-call team starts to fall apart.

In hindsight the signs are obvious; in the thick of it, however, it is hard to step back and realize that the on-call team and processes are falling apart. It should be said that what was in place worked and met our needs for a long time. You have to start somewhere. I focus on the evolution and share the tricks that make that evolution easier.

The talk will then cover some of the patterns Datadog found useful, such as refining our incident management processes and roles, growing the depth of the on-call team, and eventually switching to per-team rotations, along with the challenges involved in this evolution.

We will highlight some of the useful tricks and tools Datadog has used, such as:

  • Structured service templates to help with on-call training
  • On-call training and shadow ops rotations
  • The use of GitHub Issues to track on-call tasks for handoff and to use as training examples
  • Scheduled on-call handoffs that include systematically reviewing the sources of alerts to kill noise
  • Providing a way to capture monitor feedback from every alert notification
  • Patterns of using GitHub projects to track where each on-call member stands in service training
  • Scripts that work with the service templates and on-call scheduling to show each on-call member a list of what changed since the last time they were on-call
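
As an illustration of the last item, a "what changed since my last shift" script could be sketched roughly as follows. This is a minimal sketch and not Datadog's actual tooling: the `ServiceChange` record, its fields, and the `changes_since_last_shift` function are hypothetical names invented here, and it assumes that service changes are already available as dated records and that the end date of the engineer's previous shift is known from the on-call schedule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ServiceChange:
    """Hypothetical record of one change to a service (not a real Datadog type)."""
    service: str
    changed_on: date
    summary: str


def changes_since_last_shift(changes, last_shift_end):
    """Return the changes made after the previous shift ended, oldest first."""
    recent = [c for c in changes if c.changed_on > last_shift_end]
    return sorted(recent, key=lambda c: c.changed_on)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    log = [
        ServiceChange("api", date(2017, 9, 20), "Raised request timeout"),
        ServiceChange("db", date(2017, 10, 15), "Added read replica"),
        ServiceChange("api", date(2017, 10, 5), "New rate-limit alert"),
    ]
    for change in changes_since_last_shift(log, date(2017, 10, 1)):
        print(f"{change.changed_on} {change.service}: {change.summary}")
```

In practice the change records would presumably be derived from the service templates and the cutoff date from the on-call schedule; both are stubbed out here.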

Christopher Hoey, Datadog

Christopher Hoey currently leads the SRE Team at Datadog. Previously, he was Director of Engineering, Operations at Mortar Data and Senior IT Manager at Amplify. Chris is a seasoned veteran who has ridden the growth roller coaster numerous times while leading the operations teams that keep things running smoothly.

Outside of work, Chris enjoys spending time with his family, riding downhill mountain bikes, and tinkering on projects like open source telemetry systems.

BibTeX
@conference {207197,
author = {Christopher Hoey},
title = {Wait for Us! Evolving On-Call as Your Company Grows},
year = {2017},
address = {San Francisco, CA},
publisher = {{USENIX} Association},
}