You Can't Stop Fires with an Ambulance

Wednesday, June 06, 2018 - 4:50 pm–5:15 pm

Piers Chamberlain, Xero

Abstract: 

SRE is often perceived as an emergency response function, dealing with incidents and restoring system health. While much useful work is done in this space, reactive processes improve only the mean time to repair (MTTR), not the mean time between failures (MTBF). Once the low-hanging fruit of detection and remediation improvements is gone, further improvement takes more and more investment. At this point, I'd argue, it is time to start preventative work.

We saw some reduction in incident rates after establishing a post-mortem process, but the post-mortems often involved only the poor souls who happened to have been called upon to help triage and fix the issue. It dawned on me that evangelising to engineers who have little influence on the balance of functional/non-functional development effort was going to have limited success.

By embarking on a campaign evangelising reliability (similar to the way forest fire prevention or health promotion campaigns might work) and targeting the right level within the organisation, we're seeing positive cultural and behavioural changes and better operational morale, and we believe it will result in fewer severe outages.

BibTeX
@conference {214989,
author = {Piers Chamberlain},
title = {You Can{\textquoteright}t Stop Fires with an Ambulance},
year = {2018},
publisher = {USENIX Association},
month = jun
}