Refining Systems Data without Losing Fidelity

Due to the evolving Coronavirus/COVID-19 situation, SREcon20 Americas West has been rescheduled to June 2–4, 2020.

Wednesday, March 25, 2020 - 10:20 am–11:00 am

Liz Fong-Jones


It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then you'll be naively collecting roughly 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. The question is: how do we scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors?
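The 2,000-to-1 figure follows directly from the SLO. As a quick check, here is a minimal sketch that assumes the service runs exactly at its SLO target; the function name is illustrative and not from the talk.

```python
def success_to_failure_ratio(slo_percent: float) -> float:
    """Ratio of requests that met the SLI to requests that burnt error budget,
    assuming the service performs exactly at its SLO target."""
    failure_percent = 100.0 - slo_percent
    if failure_percent <= 0:
        raise ValueError("an SLO of 100% leaves no failures to compare against")
    return slo_percent / failure_percent


ratio = success_to_failure_ratio(99.95)
print(f"about {ratio:.0f} good requests per failed request")
```

At 99.95%, only 0.05% of requests fail, so good requests outnumber failures by 99.95 / 0.05, which is just under 2,000.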

Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. This talk advocates a three-R approach to data retention: Reducing junk data, statistically Reusing data points as samples, and Recycling data into counters. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
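The abstract does not spell out the sampling mechanics, so the following is only a sketch of one common approach: keep every error, keep roughly 1 in N ordinary events, and record the sample rate on each kept event so totals can be re-estimated later. The `Event` fields, the `ok_rate` parameter, and the function names are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    status: str         # "ok" or "error" (hypothetical field)
    duration_ms: float  # hypothetical field


@dataclass
class SampledEvent:
    event: Event
    sample_rate: int    # this kept event stands in for `sample_rate` originals


def sample(events, ok_rate=100, rng=None):
    """Keep all errors; keep about 1 in `ok_rate` ordinary events."""
    rng = rng or random.Random()
    kept = []
    for e in events:
        rate = 1 if e.status == "error" else ok_rate
        if rate == 1 or rng.randrange(rate) == 0:
            kept.append(SampledEvent(e, rate))
    return kept


def estimate_counts(kept):
    """Estimate original per-status counts by weighting each kept event
    by its recorded sample rate."""
    counts = {}
    for s in kept:
        counts[s.event.status] = counts.get(s.event.status, 0) + s.sample_rate
    return counts


if __name__ == "__main__":
    traffic = [Event("ok", 12.0)] * 100_000 + [Event("error", 950.0)] * 50
    kept = sample(traffic, ok_rate=100, rng=random.Random(0))
    print(len(kept), "events stored out of", len(traffic))
    print(estimate_counts(kept))
```

Because errors are stored at a sample rate of 1, their count is exact, while the count of ordinary events is an estimate whose error shrinks as more events are kept. Storing the rate alongside each event is what lets the rare, anomalous cases keep their full context without the ordinary traffic drowning them out.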

@conference {247263,
author = {Liz Fong-Jones},
title = {Refining Systems Data without Losing Fidelity},
year = {2020},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = mar,