Reliability Reviews in the Wild: Using Data to Drive Production Health

Friday, December 09, 2022 - 9:00 am–9:30 am AEDT

Karthik Nilakant, Canva

Abstract: 

"Reliability reviews" (also known as "production meetings" or "operational reviews") are regular opportunities for engineering teams (or groups of teams) that own production services to reflect on their operational health. Ideally, these reviews should be based on quantitative insights drawn from multiple sources (such as incidents, post-incident reviews, service levels and on-call alerts), to help teams objectively decide where to prioritize their efforts. In this talk, I aim to share my practical experiences in scaling out this practice across our organization. I'll also share some insights from our work in developing an automated reporting platform to support this practice, focusing on the challenges of collating health data from multiple sources and mapping it to service owners. This talk should help inform reliability practitioners or leaders who are considering increasing adoption of similar reviews in their own organizations.

Karthik Nilakant, Canva

Karthik is a coach in the Reliability Platform group at Canva and is currently based in Wellington, New Zealand. He's previously worked as a product manager, team lead and engineer within the SRE field.

BibTeX
@conference {284931,
author = {Karthik Nilakant},
title = {Reliability Reviews in the Wild: Using Data to Drive Production Health},
year = {2022},
address = {Sydney},
publisher = {USENIX Association},
month = dec
}
