Detecting Service Degradation and Failures at Scale through Distributed Log Processing

Thursday, June 13, 2019 - 2:00 pm–3:00 pm

Yegya Narayanan and Veeramani Gandan, PayPal

Abstract: 

Detecting degradation and failures in distributed systems is a complex problem, especially with 2000+ services running across multiple data centers. In this session, we will cover how the distributed log processing infrastructure at PayPal scales to process over 1 PB of logs per day and generate metrics to detect performance degradation and failures in real time.

Yegya Narayanan, PayPal India

As an architect and lead engineer, Yegya has led diverse engineering teams within PayPal. He is currently an engineer on the monitoring platform team at PayPal. In his current role, he is responsible for scaling the logging platform that provides near-real-time metrics for monitoring applications.

Veeramani Gandan, PayPal India

Veeramani Gandan is a senior manager at PayPal focusing on the monitoring platform. He is responsible for building the logging platform and scaling it from a gigabyte-scale to a petabyte-scale system. When not at work, he enjoys playing table tennis and badminton. He is currently committed to building a strong SRE community in South India.

BibTeX
@conference {233273,
author = {Yegya Narayanan and Veeramani Gandan},
title = {Detecting Service Degradation and Failures at Scale through Distributed Log Processing},
year = {2019},
address = {Singapore},
publisher = {{USENIX} Association},
month = jun,
}