Renjith Rajan and Rajneesh, LinkedIn
LinkedIn has hundreds of microservices spread across multiple data centres, many of which depend on each other. Although a microservice architecture has its advantages, identifying the problematic service during an incident or outage is often difficult, which can contribute significantly to a high MTTR and unnecessary escalations. Event Correlation Engine is an attempt to algorithmically identify the responsible service quickly and escalate to the right team.
Event Correlation Engine examines the entire service stack, looking at critical downstream latencies, error metrics, and other monitoring events to recommend a responsible service. It understands caller-callee communication in LinkedIn’s complex microservices architecture. The engine uses dynamic thresholds, which it learns by processing the last 30 days of data, to provide an effective recommendation. We’ll discuss the approach we used in building it and how it is being used at LinkedIn to reduce MTTR and on-call escalations.
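The abstract does not say how the thresholds are learned or how the call graph is traversed, so the following is only a minimal sketch of the general idea, not the engine's actual method. It assumes a threshold of mean plus three standard deviations over a trailing 30-sample window, and a walk that follows anomalous callees downstream until none remain. The service names, metric values, and function names are all hypothetical.

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    # Assumed rule: mean + k * standard deviation of the trailing window.
    return mean(history) + k * stdev(history)

def is_anomalous(history, current, k=3.0):
    return current > dynamic_threshold(history, k)

def recommend_service(entry, call_graph, history, current, k=3.0):
    """Follow anomalous callees downstream from `entry`.

    Returns the deepest anomalous service reached, or None if the
    entry service itself looks healthy.
    """
    if not is_anomalous(history[entry], current[entry], k):
        return None
    suspect = entry
    visited = {entry}
    while True:
        bad = [
            callee
            for callee in call_graph.get(suspect, [])
            if callee not in visited
            and is_anomalous(history[callee], current[callee], k)
        ]
        if not bad:
            return suspect
        # Prefer the callee that exceeds its own threshold by the most.
        suspect = max(
            bad,
            key=lambda c: current[c] - dynamic_threshold(history[c], k),
        )
        visited.add(suspect)

# Hypothetical caller -> callees map and 30 days of latency samples (ms).
call_graph = {
    "frontend": ["profile-api", "search-api"],
    "profile-api": ["db-proxy"],
    "search-api": [],
}
history = {
    "frontend": [100 + (i % 5) for i in range(30)],
    "profile-api": [40 + (i % 3) for i in range(30)],
    "search-api": [60 + (i % 4) for i in range(30)],
    "db-proxy": [10 + (i % 2) for i in range(30)],
}
current = {"frontend": 180, "profile-api": 95, "search-api": 61, "db-proxy": 45}

print(recommend_service("frontend", call_graph, history, current))
```

With these made-up numbers, the walk passes through the frontend and profile-api (both above their learned thresholds), skips search-api (within its normal range), and stops at db-proxy, which has no anomalous callees of its own. A real system would likely combine several signals per service rather than a single latency series.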
Renjith Rajan is a Staff Site Reliability Engineer with LinkedIn's production SRE team. He joined LinkedIn in July 2016; before that, he worked on the Ad Serving team at Yahoo.
Rajneesh is a Site Reliability Engineering Manager leading the production SRE team at LinkedIn, Bangalore. He joined LinkedIn in April 2015; before that, he led the Global Platforms team at Yahoo.