Are We All on the Same Page? Let's Fix That

Thursday, 3 October, 2019 - 11:00–11:45

Luis Mineiro, Zalando

Abstract: 

The industry has settled on a good practice: keep alerts as few as possible by alerting on symptoms associated with end-user pain, rather than trying to catch every possible way that pain could be caused.

Organizations with complex distributed systems that span dozens of teams can have a hard time following this practice without burning out the teams that own the client-facing services. A typical solution is to put alerts on every layer of the distributed system. This approach almost always produces an excessive number of alerts and results in alert fatigue.

Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable cause, paging the respective team instead of the alert owner.
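
To make the idea concrete, here is a minimal Python sketch of one possible heuristic. It is not the implementation presented in the talk; the abstract does not specify the heuristics. The sketch assumes that spans carry the boolean `error` tag from OpenTracing's semantic conventions, walks from the root span down through errored child spans, and pages the owner of the deepest errored span. The `Span` class, the `owners` mapping, and all service and team names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Simplified span record; field names are illustrative, not a tracer API."""
    span_id: str
    parent_id: Optional[str]
    service: str  # hypothetical: name of the service that emitted the span
    tags: Dict[str, object] = field(default_factory=dict)


def is_error(span: Span) -> bool:
    # OpenTracing's semantic conventions define a boolean "error" tag.
    return span.tags.get("error") is True


def deepest_error_span(spans: List[Span]) -> Optional[Span]:
    """Follow errored child spans from the root; return the last one reached."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    errored_roots = [s for s in children.get(None, []) if is_error(s)]
    if not errored_roots:
        return None

    current = errored_roots[0]
    while True:
        errored = [c for c in children.get(current.span_id, []) if is_error(c)]
        if not errored:
            return current
        # Assumption: when several children failed, take the first one.
        current = errored[0]


def team_to_page(spans: List[Span], owners: Dict[str, str], alert_owner: str) -> str:
    """Pick the team to page, falling back to the owner of the alerting rule."""
    culprit = deepest_error_span(spans)
    if culprit is None:
        return alert_owner
    return owners.get(culprit.service, alert_owner)


# Hypothetical trace: checkout -> payment -> fraud-check all report errors.
trace = [
    Span("1", None, "checkout", {"error": True}),
    Span("2", "1", "payment", {"error": True}),
    Span("3", "2", "fraud-check", {"error": True}),
    Span("4", "1", "cart", {}),
]
owners = {
    "checkout": "team-checkout",
    "payment": "team-payment",
    "fraud-check": "team-risk",
}
print(team_to_page(trace, owners, alert_owner="team-checkout"))  # prints "team-risk"
```

In this sketch the alert fires on the client-facing `checkout` service, but the page goes to the team owning `fraud-check`, the deepest service in the error chain. When no span is marked as an error, or the culprit service has no known owner, the alert owner is paged as a fallback.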

Luis Mineiro, Zalando SE

Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. Luis has been with Zalando since 2013—shaving yaks and creating the most beautiful bike sheds in the Shop team, later joining Platform Infrastructure to support the company’s move to the Cloud and currently heading Site Reliability Engineering.

BibTeX
@conference {239490,
author = {Luis Mineiro},
title = {Are We All on the Same Page? Let{\textquoteright}s Fix That},
year = {2019},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}