Incident Response, Faster and Better with Traces

Due to the evolving Coronavirus/COVID-19 situation, SREcon20 Americas West has been rescheduled to June 2–4, 2020.

Wednesday, March 25, 2020 - 11:30 am–12:20 pm

Ashutosh Raina, eBay; Kamala Ramasubramanian, University of California, Santa Cruz


Site reliability engineers (SREs) are constantly building newer and better tools so that they can respond to incidents faster. The existing use of metrics, logs, and events doesn't fully capture the mental model of an SRE. Distributed tracing has increasingly been adopted by industry, and systems now capture end-to-end traces, both successful and unsuccessful. We motivate the use of traces through insights from a series of incidents, demonstrating for each how traces could have enabled faster and more accurate triage. We present ideas on integrating distributed tracing into incident response and on building better tooling using aggregate reasoning over traces. We will also talk about the challenges and opportunities we encountered as part of this work.

Ashutosh is a member of the Site Reliability team at eBay. He works on the observability, fault tolerance, and reliability of systems at eBay. He works at the intersection of academia and industry, trying his best to fuse them together.

Kamala is a PhD student at UCSC. Her interests include reasoning about large-scale distributed systems and applied machine learning, specifically how and when machine learning can be applied effectively to better understand complex systems.

