Incident Management and Chatops at Shopify

Thursday, August 31, 2017 - 16:30–17:00

Daniella Niyonkuru, Shopify

Abstract: 

SREs are expected to be incident management experts. Yet incident handling is hard, often messy, and exhausting. We encounter new incidents, look everywhere for possible explanations, sometimes get tunnel vision on symptoms, and, under pressure, forget some good practices.

At Shopify, we care not only about handling incidents quickly and efficiently, but also about SRE well-being. We have a special IMOC (Incident Manager On Call) rotation and an incident chatbot to assist IMOCs. In this talk, I’ll first explain the IMOC role and how training SREs for this duty is essential to handling incidents well.

Our chatbot assists the IMOC by reducing manual effort and context switching. We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders. It also binds the incident to a discussion channel where all communications happen, allows status page updates directly from the chat room, keeps notes and records event times, and generates service disruption content. To avoid burnout during long-running incidents, the chatbot also reaches out to other IMOCs.
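
As an illustration only, the bookkeeping described above (an incident bound to one channel, timestamped notes, and a prompt to bring in another IMOC on long-running incidents) might be sketched as follows. This is not Shopify's implementation: the `Incident` class, its field names, and the four-hour handoff threshold are all hypothetical, chosen just to make the idea concrete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record bound to a single chat channel."""
    title: str
    channel: str
    imoc: str
    started_at: datetime
    notes: list = field(default_factory=list)

    def add_note(self, text, at):
        # Store each note with its event time so a timeline can be rebuilt later.
        self.notes.append((at, text))

    def needs_handoff(self, now, max_shift=timedelta(hours=4)):
        # Assumed policy (not from the talk): suggest reaching out to another
        # IMOC once the incident has run for at least max_shift.
        return now - self.started_at >= max_shift

    def timeline(self):
        # Render notes in chronological order as plain text lines.
        return [f"{at:%H:%M} {text}" for at, text in sorted(self.notes)]
```

A bot built along these lines could call `needs_handoff` from a periodic reminder job and use `timeline` as the starting point for generated service disruption content.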

Our chatbot supports best practices and "streamlines" incident response. Attendees will leave with strategies for incorporating chatbots into their incident management and considerations for automating precisely and smartly.

BibTeX
@conference {205552,
author = {Daniella Niyonkuru},
title = {Incident Management and Chatops at Shopify},
year = {2017},
address = {Dublin},
publisher = {{USENIX} Association},
}