Service Monitoring Manual—2018 Edition

Wednesday, June 06, 2018 - 11:30 am–11:55 am

Nikola Dipanov, Facebook

Abstract: 

Monitoring, a.k.a. figuring out what production code is doing, is extremely important for an SRE organization. Monitoring services the right way can have a profound impact on how we do SRE. Modern software systems can be incredibly complex: code runs on thousands of machines, depends on services we don't control, and also runs on user devices. Observing the behavior of such systems means we have to change how we think about monitoring.

This talk will go over what a modern monitoring infrastructure for running software at scale looks like:

  • Asking the right questions—how to decide what to monitor
  • Types of data we want to collect and what answers it can help us find
  • A look at how we build services at Facebook
  • Collecting, storing and querying monitoring data at scale
  • When things go wrong—what makes for a good alarm and what makes a bad one
  • Putting it all together—debugging an outage using data

As an attendee, you will come out of the talk with fresh ideas about logging and monitoring. You will hear how we tackle these problems at Facebook, and why we do things the way we do.

Nikola Dipanov, Facebook

Nikola has spent the last two years at Facebook as a Production Engineer, based in Dublin, Ireland. Prior to that, he worked as a Software Engineer in several industries, ranging from small internet startups to large telcos. Due to this prolonged exposure to production software, he's developed an acute need to measure and verify.

BibTeX
@conference {214981,
author = {Nikola Dipanov},
title = {Service Monitoring Manual{\textemdash}2018 Edition},
year = {2018},
publisher = {{USENIX} Association},
}