Monitoring Cloudflare's Planet-Scale Edge Network

Friday, 1 September, 2017 - 11:30–12:00

Matt Bostock, Cloudflare

Cloudflare operates a global anycast edge network serving content for 6 million websites. This talk explains how we monitor our network, how we migrated from Nagios to Prometheus, and the architecture we chose to provide maximum reliability for monitoring. We'll also discuss the impact of alert fatigue and how we reduced alert noise by analysing data, making alerts more actionable, and alerting on symptoms rather than causes.

This talk will cover:

The challenges of monitoring a high-volume anycast edge network across 100+ locations
The architecture we chose to maximise the reliability of our monitoring
Why Prometheus excels as the new industry standard for modern monitoring
Approaches to reducing alert noise and alert fatigue
Triaging alerts into a ticket system
Analysing past alert data for continuous improvement
The pain points we endured
Effecting change across engineering teams

Matt is a Platform Operations engineer at Cloudflare, where he has spent the last year promoting a monitoring utopia. He was previously tech lead for the GOV.UK Infrastructure team and is a keen contributor to open source software. He also loves bacon, avocado, running, and the Oxford comma.

Connect:

mattbostock

BibTeX

@conference {205574,
author = {Matt Bostock},
title = {Monitoring Cloudflare{\textquoteright}s {Planet-Scale} Edge Network},
year = {2017},
address = {Dublin},
publisher = {USENIX Association},
month = aug
}
