
In the 2016 O’Reilly book Site Reliability Engineering, we at Google described our culture of blameless postmortems and recommended that operationally focused teams and organizations adopt a similar culture in their approach to production incidents. A postmortem is a written record of an incident that details its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions taken to prevent the incident from recurring. The chapter “Postmortem Culture: Learning from Failure” describes criteria for deciding when to conduct postmortems, some best practices for writing them, and advice on how to cultivate a postmortem culture, based on the experience we’ve gained over the years.
Article Section: SYSADMIN
;login: issue: