Detecting Service Degradation and Failures at Scale through Distributed Log Processing

Yegya Narayanan; Veeramani Gandan

Thursday, June 13, 2019 - 2:00 pm–3:00 pm

Yegya Narayanan and Veeramani Gandan, PayPal

Detecting degradation and failures in distributed systems is a complex problem, especially with 2,000+ services running across multiple data centers. In this session, we will cover how the distributed log processing infrastructure at PayPal scales to process over 1 PB of logs per day and generate metrics for detecting performance degradation and failures in real time.

As an architect and lead engineer, Yegya has led diverse engineering teams within PayPal. He is currently an engineer on PayPal's monitoring platform team, where he is responsible for scaling the logging platform that provides near-real-time metrics for monitoring applications.

Connect:

@gynarayan

Veeramani Gandan is a senior manager at PayPal focusing on the monitoring platform. He is responsible for building and scaling the logging platform from gigabytes to a petabyte-scale system. When not at work, he enjoys playing table tennis and badminton. He is currently committed to building a strong SRE community in South India.

Connect:

@vgandanvs

BibTeX

@conference {233273,
author = {Yegya Narayanan and Veeramani Gandan},
title = {Detecting Service Degradation and Failures at Scale through Distributed Log Processing},
year = {2019},
address = {Singapore},
publisher = {USENIX Association},
month = jun
}
