Service Monitoring Manual—2018 Edition

Wednesday, June 06, 2018 - 11:30 am to 11:55 am

Nikola Dipanov, Facebook


Monitoring, a.k.a. figuring out what production code is doing, is extremely important for an SRE organization. Monitoring services the right way can have a profound impact on how we do SRE. Modern software systems can be incredibly complex: code runs on thousands of machines, depends on services we don't control, and also runs on user devices. Observing the behavior of such systems means we have to change how we think about monitoring.

This talk will go over what a modern monitoring infrastructure for running software at scale looks like:

  • Asking the right questions—how to decide what to monitor
  • Types of data we want to collect and what answers it can help us find
  • A look at how we build services at Facebook
  • Collecting, storing and querying monitoring data at scale
  • When things go wrong—what makes for a good alarm and what makes a bad one
  • Putting it all together—debugging an outage using data

As an attendee, you will come out of the talk with fresh ideas about logging and monitoring. You will hear how we tackle these problems at Facebook, and why we do things the way we do.

Nikola has spent the last two years at Facebook as a Production Engineer, based in Dublin, Ireland. Prior to that, he worked as a Software Engineer in several industries, ranging from small internet startups to large telcos. Due to this prolonged exposure to production software, he has developed an acute need to measure and verify.

