Spike Detection in Alert Correlation at LinkedIn

Note: Presentation times are in Coordinated Universal Time (UTC).

Tuesday, 12 October, 2021 - 15:30–16:00

Nishant Singh, LinkedIn

Abstract: 

LinkedIn's stack consists of thousands of microservices and the complex dependencies among them. When a production outage is caused by a misbehaving service, finding the exact service responsible is challenging and time-consuming. Although each service in a distributed infrastructure has multiple alerts configured, finding the real root cause during an outage is like finding a needle in a haystack, even with all the right instrumentation, because every service in the critical path of a client request will have multiple active alerts. The lack of a proper mechanism for deriving meaningful information from these disjoint alerts often leads to false escalations, which increase issue resolution time. In this talk, we will showcase how we used spike (anomaly) detection in the alert correlation system at LinkedIn, which helps us separate real alerts from false positives and reduces toil for engineers.

Nishant Singh, LinkedIn

Nishant Singh is a Site Reliability Engineer at LinkedIn, where he works on improving the reliability of the site with a focus on reducing the MTTD and MTTR of incidents. Prior to joining LinkedIn, he worked at companies like PayTM and Gemalto as a DevOps Engineer, building custom solutions for clients and managing and maintaining services in the public cloud. Nishant loves building distributed systems and exploring a breadth of technologies to support business needs, with a focus on using modern, scalable solutions in SRE/DevOps environments.

SREcon21 Open Access Sponsored by Indeed

BibTeX
@conference {276649,
author = {Nishant Singh},
title = {Spike Detection in Alert Correlation at {LinkedIn}},
year = {2021},
publisher = {USENIX Association},
month = oct
}
