Measuring Reliability: What Got Us Here Won't Get Us There

Tuesday, 25 October, 2022 - 11:45-12:30 CEST

Štěpán Davidovič, Google

Abstract: 

Do you use data to decide if your system has been recently unreliable to your users? What data, and how do you interpret it?

The measurement of reliability provided by our production systems has made great strides, and we frequently focus on metrics much closer to the business. But the appeal of now-common metrics like SLIs, or models like SLOs, can hide our inability to answer our questions with consistent application of data.

This talk will show some questions about reliability we might want answered, discuss the ways we might be falling short today, and explore how we can improve our ability to make data-inspired decisions going forward.

Štěpán Davidovič, Google

Štěpán is currently Senior Staff SRE at Google, working in the office of the technical advisor to the CFO. Prior to that, he worked, among other things, on building internal monitoring, reliability insights, and canarying systems. He obtained his BSc from Czech Technical University in Prague.

BibTeX
@conference {284587,
author = {{\v S}t{\v e}p{\'a}n Davidovi{\v c}},
title = {Measuring Reliability: What Got Us Here Won{\textquoteright}t Get Us There},
year = {2022},
address = {Amsterdam},
publisher = {USENIX Association},
month = oct
}
