Building Resilience: How to Learn More from Incidents

Friday, 4 October, 2019 - 09:00–09:45

Nick Stenning, Microsoft

Abstract: 

Learning from incidents: it's not as easy as it sounds! Research from numerous safety-critical industries (aviation! healthcare! firefighting!) is changing what we know about how to build resilient systems and organizations in a turbulent world. This talk shares some of that research in a direct and practically applicable way.

One major obstacle to building resilience in an engineering organization is the traditional approach to post-incident review, which focuses heavily on incident prevention. Come and learn:

  1. that there is and always will be more to incident response and review than prevention,
  2. how to recognize and avoid four common traps during incident investigations, and
  3. when to apply four concrete recommendations on how to learn more from incidents in your organization.

Nick Stenning, Microsoft

Nick Stenning is a Site Reliability Engineer on Azure, poking and prodding at the internals of "somebody else's computers." He previously worked at the UK's Government Digital Service and at open-source startup Travis CI. He's been talking his colleagues' ears off on the topic of post-incident review for close to a decade.

BibTeX
@conference {239504,
author = {Nick Stenning},
title = {Building Resilience: How to Learn More from Incidents},
year = {2019},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}