Are We Getting Better Yet? Progress Toward Safer Operations

Monday, December 07, 2020 - 11:55 am–12:35 pm

Alex Elman, Indeed


In the aftermath of a serious outage, leadership aims to improve performance and avoid future incidents. With so much data to analyze, it's difficult to know where to direct attention. How do we know we're getting better? Focusing solely on shallow, one-dimensional measures of progress like MTTR, incident count, and severity obscures the deeper lessons. Holding teams to performance metrics based on things they can't control can be demoralizing. Incidents subtly influence organizations through system changes, new designs, budgets, policies, procedures, and hiring. Thorough incident analysis uncovers these unseen influences and their contributions to safety. It also produces artifacts such as interview transcripts, annotated timelines, contributing factors, and themes, which enable meta-analyses across incidents that uncover previously unseen opportunities. Providing leaders with this richer data unlocks insights into reliability, organizational learning, and opportunities for strategic investments. It also fosters deeper trust between leaders and practitioners and yields healthier, happier teams.

For the past nine years, Alex Elman has been helping Indeed cope with ever-increasing complexity and scale. He is a founding member of the Site Reliability Engineering team. Alex leads the Resilience Engineering team, which focuses on learning from incidents, chaos engineering, and fault-tolerant design patterns.

