
sponsors

Gold Sponsor
[Amazon logo]
Gold Sponsor
Gold Sponsor
Silver Sponsor
Silver Sponsor
Bronze Sponsor
Bronze Sponsor
[Demonware logo]
General Sponsor


general information

Venue
DoubleTree by Hilton Dublin - Burlington Road
Leeson Street Upper
Dublin 4, Ireland


Alerting for Distributed Systems—A Tale of Symptoms and Causes, Signals and Noise

Björn Rabenstein, SoundCloud

Abstract: 

Noisy alerts are the deadly sin of monitoring. They obscure real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on call will start to skim or even ignore alerts, not to mention the damage to their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, local failure modes are usually well understood, while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes; after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged on real or imminent problems of a service, ideally aggregated into one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic "anomaly detection." Historical trends do need to be taken into account, though, to detect imminent problems. Those predictions should be simple rather than "magic." Alerting because "something seems weird" is almost never the right thing to do.

SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.
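
The abstract's call for simple, explicit predictions of imminent problems can be illustrated with a small sketch. The Python below is not taken from the talk; it is one possible reading, assuming a resource metric (for example disk usage in percent) sampled over time, a fixed limit, and a straight-line fit as the "simple" prediction. The names `linear_fit`, `seconds_until`, and `should_page` are hypothetical.

```python
from typing import List, Optional, Tuple

# One observation: (timestamp in seconds, observed value). Hypothetical type.
Sample = Tuple[float, float]


def linear_fit(samples: List[Sample]) -> Tuple[float, float]:
    """Least-squares line through the samples; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def seconds_until(samples: List[Sample], limit: float) -> Optional[float]:
    """Seconds after the last sample until the fitted line reaches `limit`.

    Returns 0.0 if the fitted value is already at or above the limit, and
    None if the trend is flat or moving away from the limit.
    """
    slope, intercept = linear_fit(samples)
    last_t = samples[-1][0]
    current = slope * last_t + intercept
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope


def should_page(samples: List[Sample], limit: float, horizon_seconds: float) -> bool:
    """Page only for a real or imminent problem: limit reached within the horizon."""
    eta = seconds_until(samples, limit)
    return eta is not None and eta <= horizon_seconds


# Assumed example data: disk usage in percent, sampled hourly.
usage = [(0.0, 50.0), (3600.0, 60.0), (7200.0, 70.0)]
print(should_page(usage, limit=100.0, horizon_seconds=4 * 3600))
```

The rule stays explicit: the limit and the horizon are stated up front, and the only use of history is a straight-line extrapolation, so whoever gets paged can see why the alert fired.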

Björn is a Production Engineer at SoundCloud and one of the main Prometheus developers. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.


Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@conference {208544,
author = {Bj{\"o}rn Rabenstein},
title = {Alerting for Distributed {Systems{\textemdash}A} Tale of Symptoms and Causes, Signals and Noise},
year = {2016},
address = {Dublin},
publisher = {USENIX Association},
month = jul,
}


© USENIX

SREcon is a registered trademark of the USENIX Association.
