Debugging at Scale—Going from Single Box to Production

Thursday, June 07, 2018 - 11:25 am–11:50 am

Kumar Srinivasamurthy, Microsoft Corp


It's very easy to launch a debugger on your dev box, attach to the right process and step through code. However, things are different when you need to debug an issue in production that's getting tens of thousands of requests per second. What if the issue reproduces only in production? How do you debug without affecting production traffic? What techniques can you use in your development to make it easier to debug issues? Does your application use tracing? What debug logs are written out to aid in analysis?

This talk will cover:

  1. Challenges with debugging in production
  2. Various approaches that are used in the industry
  3. Examples from Bing & Cortana incidents and steady state problems to illustrate the techniques
  4. How to design services so that they are easier to debug

Kumar Srinivasamurthy, Microsoft Corp

Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.


@conference {214933,
author = {Kumar Srinivasamurthy},
title = {Debugging at {Scale{\textemdash}Going} from Single Box to Production},
year = {2018},
publisher = {USENIX Association},
month = jun
}
