Troubleshooting Transiently-Recurring Errors in Production Systems with Blame-Proportional Logging

Authors: 

Liang Luo, University of Washington; Suman Nath, Lenin Ravindranath Sivalingam, and Madan Musuvathi, Microsoft Research; Luis Ceze, University of Washington

Abstract: 

Many problems in production systems are transiently recurring: they occur rarely, but when they do, they recur for a short period of time. Troubleshooting these problems is hard because they are rare enough to be missed by sampling techniques, and traditional postmortem analysis of runtime logs suffers either from the low fidelity of logging too little or from the overhead of logging too much.

This paper proposes AUDIT, a system specifically designed for troubleshooting transiently-recurring problems in cloud-based production systems. The key idea is to use lightweight triggers to identify the first occurrence of a problem and then to use its recurrences to perform blame-proportional logging. When a problem occurs, AUDIT automatically assigns a blame rank to methods in the application based on their likelihood of being relevant to the root cause of the problem. AUDIT then enables heavy-weight logging on highly-ranked methods for a short period of time. Over time, the volume of logs generated by a method is proportional to how often it is blamed for various misbehaviors, allowing developers to quickly find the root cause of the problem.
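
The trigger, rank, and log loop described above can be illustrated with a minimal sketch. It is not AUDIT's implementation: the class name `BlameProportionalLogger`, its parameters `top_k` and `window_seconds`, and the idea that blame scores arrive as a plain dictionary are all assumptions made for illustration. The abstract does not specify how scores are computed, so the sketch takes them as given.

```python
import time
from collections import Counter


class BlameProportionalLogger:
    """Toy model (hypothetical API): when a trigger fires, the top-ranked
    methods get heavy-weight logging for a short time window."""

    def __init__(self, top_k=2, window_seconds=60.0, clock=time.monotonic):
        self.top_k = top_k                    # how many methods to log verbosely
        self.window_seconds = window_seconds  # how long verbose logging stays on
        self.clock = clock                    # injectable clock, useful for tests
        self.verbose_until = {}               # method name -> expiry timestamp
        self.log_counts = Counter()           # method name -> verbose records emitted

    def on_trigger(self, blame_scores):
        """Called when a lightweight trigger fires.

        blame_scores maps method name -> score (assumed to be supplied by
        some external ranking step). Returns the methods selected.
        """
        ranked = sorted(blame_scores, key=blame_scores.get, reverse=True)
        selected = ranked[: self.top_k]
        expiry = self.clock() + self.window_seconds
        for method in selected:
            self.verbose_until[method] = expiry
        return selected

    def is_verbose(self, method):
        """True while the method's verbose-logging window is still open."""
        return self.clock() < self.verbose_until.get(method, float("-inf"))

    def log(self, method, message):
        """Emit a record only if the method is currently blamed."""
        if not self.is_verbose(method):
            return None
        self.log_counts[method] += 1
        return f"[{method}] {message}"
```

Because a method only produces records while it is inside a window opened by a trigger, `log_counts` grows with how often that method is blamed, which is the proportionality the abstract describes. The window length and the cut-off `top_k` are tuning knobs in this sketch, not values taken from the paper.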

We have implemented AUDIT for cloud applications. We describe how to utilize system events to efficiently implement lightweight triggers and a blame-ranking algorithm, with common-case runtime overheads ranging from negligible to under 1% on real applications. We evaluate AUDIT with five mature open source and commercial applications, for which AUDIT identified previously unknown issues causing slow responses, inconsistent outputs, and application crashes. All the issues were reported to developers, who have acknowledged or fixed them.

BibTeX
@inproceedings {216023,
author = {Liang Luo and Suman Nath and Lenin Ravindranath Sivalingam and Madan Musuvathi and Luis Ceze},
title = {Troubleshooting {Transiently-Recurring} Errors in Production Systems with {Blame-Proportional} Logging},
booktitle = {2018 USENIX Annual Technical Conference (USENIX ATC 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Boston, MA},
pages = {321--334},
url = {https://www.usenix.org/conference/atc18/presentation/luo},
publisher = {USENIX Association},
month = jul
}
