Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data

Wednesday, March 28, 2018 - 4:50 pm–5:10 pm

Tanner Lund, Microsoft


An important part of site reliability is identifying and eliminating the causes of outages. Good problem management requires good problem definition and theme identification. Historically, this has been a largely inefficient human process, but problem management should never be driven solely by manual review of individual postmortems or a limited study of top-level metrics. If we want to scale, we must be systematic.

Machine Learning is a key component in this process. However, fitting models is only a small piece of the pie. Without good data sets you will learn precious little. We'll talk through some of the challenges we've identified when collecting and cleaning useful datasets for problem identification. How do you categorize? What is an outage theme? What is at risk of repeating and what problems have already been firmly left in the past?

On top of it all is the issue of success measurement. When we make reliability investments, how do we know that our actions are making a positive difference? We'll address some of the challenges we've encountered in measuring success (and reliability) in an environment that is ever-evolving. Join us as we discuss our vision for the future and share our journey so far.

Tanner Lund, Microsoft

Tanner Lund has been a part of Azure's SRE organization from the beginning. He has worked in a variety of roles, including Crisis Management, working on SREBot, building data pipelines, and leading services through SRE/DevOps transitions. Throughout it all his focus has been on learning at scale, identifying trends, and eliminating outages before they happen (or at least shortening them significantly). Azure is growing at a rapid pace and our methods of learning from outages must grow with it.

Outside of work Tanner spends the most time on his family, his faith, progressive music, fiction, and eSports.



@conference {213086,
author = {Tanner Lund},
title = {Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data},
year = {2018},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}
