How the Metrics Backend Works at Datadog

Tuesday, March 15, 2022 - 2:00 pm–2:45 pm

Adam Mckaig and Tahia Khan, Datadog

Abstract: 

Datadog is a popular cloud monitoring service that operates at scale in all three major cloud providers, ingesting tens of GB/s of points across many billions of timeseries into PiBs of hot and cold storage. Naturally, reliability is paramount.

In this talk, we'll show how our very large distributed system works today, and how it grew from a very small not-distributed system. We'll share the most interesting scaling and reliability challenges we faced along the way, how we solved them (for now), and some important lessons and strategies which emerged. We'll also share a couple of bonus problems which are still very much unsolved today, and what we're planning next.

Adam Mckaig, Datadog

Adam Mckaig is a Staff Engineer at Datadog in New York, where he runs Metrics Reliability. Previously, he built things at Google, the New York Times, Bloomberg, and UNICEF. His favorite sound is a pager not going off.

Tahia Khan, Datadog

Tahia Khan is a Toronto-based SRE at Datadog. Before settling on SRE, she worked on everything but frontend at a bunch of startups, Mozilla, and Amazon. Outside of work, Tahia draws bad art.

SREcon22 Americas Open Access Sponsored by Blameless

BibTeX
@conference {278128,
author = {Adam Mckaig and Tahia Khan},
title = {How the Metrics Backend Works at Datadog},
year = {2022},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
