How to Have an Operational Incident (A Crash Course)

Wednesday, October 30, 2019 - 9:45 am–10:30 am

Courtney Eckhardt


What happens at your company when a service goes down? Hopefully an alarm fires somewhere and someone gets paged, but then what? Does the person who got paged fix it all themselves (and do they feel as isolated as that sounds)? What if they don’t know how? Is there a procedure for them to get help? Do you have a protocol for deciding when the incident is over?

More and more, most of us work at companies that provide a service. Even if you’re a game dev or you work at a retailer, the way you interface with your customers is a web service, and services have outages. Let’s talk about the basics of incident response: what it is, how it helps, and how to learn more. I can't fix all your problems in a 40-minute talk, but I can help get you going in the right direction!

Courtney Eckhardt

Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway’s Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability. You can find her knitting in the audience of conference talks, and she's always interested in cat pictures.

@conference {240896,
author = {Courtney Eckhardt},
title = {How to Have an Operational Incident (A Crash Course)},
year = {2019},
address = {Portland, OR},
publisher = {{USENIX} Association},
month = oct,