Measuring the Success of Incident Management at Atlassian

Tuesday, May 23, 2017 - 11:20am-11:45am

Gerry Millar, Atlassian


When an incident happens, it's the worst possible time to be bogged down with confusing systems and processes. A well-defined Incident Management Process that's lightweight and supported by good automation offers a way to get fast, easy, and predictable results during an incident. But if you don't implement the right things in the right way, you risk bad results, such as high time-to-recovery, at critical times.

Find out how Atlassian drives value out of the Incident Management process and what metrics we use to track it. We'll also cover how we created automation to remove the overhead of managing incidents, and dive deep into a case study to explain how it all ties together.

The target audience for this conceptual session is people who are involved in the management of incidents, such as SREs and delivery team members.

Gerry Millar, Atlassian

Gerry Millar is a member of the Reliability Process Group, the team responsible for Atlassian's incident response and recovery process. He trains staff, develops automation, measures incident performance, and drives post-incident reviews at Atlassian.

He was originally hired as a technical operations engineer for the company's cloud-facing products, and as this group morphed into SRE, Gerry's role morphed along with it. He now enjoys writing a bit of Python and ocean swimming around Sydney's beaches.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

@conference {202757,
author = {Gerry Millar},
title = {Measuring the Success of Incident Management at Atlassian},
year = {2017},
publisher = {USENIX Association},
month = may
}
