Building Resilience: How to Learn More from Incidents

Friday, October 4, 2019 - 09:00-09:45

Nick Stenning, Microsoft

Abstract: 

Learning from incidents: it's not as easy as it sounds! Research from numerous safety-critical industries (aviation! healthcare! firefighting!) is changing what we know about how to build resilient systems and organizations in a turbulent world. This talk will share some of that research with you in a direct and practically applicable way.

One major obstacle to building resilience in an engineering organization is the traditional approach to post-incident review, which focuses heavily on incident prevention. Come and learn:

  1. that there is and always will be more to incident response and review than prevention,
  2. how to recognize and avoid four common traps during incident investigations, and
  3. when to apply four concrete recommendations on how to learn more from incidents in your organization.

Nick Stenning, Microsoft

Nick Stenning is a Site Reliability Engineer on Azure, poking and prodding at the internals of "somebody else's computers." He previously worked at the UK's Government Digital Service and at open-source startup Travis CI. He's been talking his colleagues' ears off on the topic of post-incident review for close to a decade.

BibTeX
@conference {239504,
author = {Nick Stenning},
title = {Building Resilience: How to Learn More from Incidents},
year = {2019},
address = {Dublin},
publisher = {{USENIX} Association},
month = oct,
}