A Principled Approach to Monitoring Streaming Data Infrastructure at Scale

Note: Presentation times are in Coordinated Universal Time (UTC).

Wednesday, 13 October, 2021 - 04:15–04:30

Eric Schow and Praveen Yedidi, CrowdStrike

Abstract: 

We ingest over a trillion events per day into our cloud platform, so it is critical that the platform be available, operational, reliable, and maintainable.

In creating a comprehensive monitoring strategy for our data processing platform, we found it strategic to model our platform's efficiency and resilience along two axes, complexity of implementation and engineer experience, which together define four quadrants: observability, operability, availability, and quality.

In this talk, we present how we've employed this four-quadrant model to establish key indicators and enforceable quality SLAs in order to improve the resilience of our cloud platform while reducing operational complexity.

Eric Schow, CrowdStrike

Computational Biophysicist turned Mobile Engineer turned Cloud Engineer turned Site Reliability aficionado. Currently on a mission to stop breaches at CrowdStrike, where I lead the Site Reliability team.

Praveen Yedidi, CrowdStrike

Distributed systems developer with a decade of experience in large-scale cloud-native application and tooling development, and in mentoring, facilitating, and leading teams. Strong analytical skills combined with solid knowledge of Go, JavaScript, Kubernetes, AWS, Terraform, Vault, Consul, service meshes, observability, and monitoring tools. Active open-source contributor to projects like Kubernetes, gVisor, Grafana, Terraform, and firecracker-containerd. I enjoy speaking and have spoken at conferences like Kafka Summit, JS Conf, ContainerCamp AU, DDD Sydney, and Go Days. Organizer of Serverless Days Melbourne.

SREcon21 Open Access Sponsored by Indeed

BibTeX
@conference {276651,
author = {Eric Schow and Praveen Yedidi},
title = {A Principled Approach to Monitoring Streaming Data Infrastructure at Scale},
year = {2021},
publisher = {USENIX Association},
month = oct
}
