A Practical Guide to Monitoring and Alerting with Time Series at Scale

Monday, March 13, 2017 - 1:50pm–2:45pm

Jamie Wilkinson, Google

Abstract: 

Monitoring is the foundational bedrock of site reliability yet is the bane of most sysadmins’ lives. Why? Monitoring sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools like Riemann and Prometheus have emerged to address this problem by scaling out monitoring configurations sublinearly with the size of the system. 

In a talk complementing the Google SRE book chapter “Practical Alerting from Time Series Data,” Jamie Wilkinson explores the theory of alert design and time series-based alerting methods and offers practical examples in Prometheus that you can deploy in your environment today to reduce the amount of alert spam and help operators keep a healthy level of production hygiene.

Jamie Wilkinson, Google

Jamie Wilkinson works as a site reliability engineer in Google’s storage infrastructure group, on a Globally Replicated Eventually Consistent High Availability Low Latency Key Value Buzzword Store, but focuses primarily on automation, monitoring, and DevOps.

BibTeX
@conference {201786,
author = {Jamie Wilkinson},
title = {A Practical Guide to Monitoring and Alerting with Time Series at Scale},
year = {2017},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
