Understanding, Detecting and Localizing Partial Failures in Large System Software

Chang Lou, Peng Huang, and Scott Smith, Johns Hopkins University
Awarded Best Paper!

Partial failures occur frequently in cloud systems and can cause serious damage including inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing effective detectors to systematically detect such failures is both time-consuming and error-prone. We thus propose OmegaGen, a static analysis tool that automatically generates customized watchdogs for a given program by using a novel program reduction technique. We have successfully applied OmegaGen to six large distributed systems. In evaluating 22 real-world partial failure cases in these systems, the generated watchdogs can detect 20 cases with a median detection time of 4.2 seconds, and pinpoint the failure scope for 18 cases. The generated watchdogs also expose an unknown, confirmed partial failure bug in the latest version of ZooKeeper.
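
To make the watchdog idea concrete, the following is a minimal hand-written Python sketch. It is not OmegaGen's generated output or the authors' implementation. It assumes that a watchdog periodically runs small checker functions, each exercising one operation, and that a hang or an exception in a checker is treated as a failure signal attributed to that checker's name. All names here (`Report`, `run_checker`, `watchdog_pass`) are hypothetical and chosen only for illustration.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Report:
    checker: str      # name of the checked operation (a rough failure scope)
    kind: str         # "ok", "error", or "timeout"
    detail: str = ""


def run_checker(name: str, op: Callable[[], None], timeout_s: float) -> Report:
    """Run one checker in a daemon thread so a hang cannot block the watchdog."""
    outcome: List[Optional[BaseException]] = []

    def target() -> None:
        try:
            op()
            outcome.append(None)
        except Exception as exc:  # explicit error raised by the operation
            outcome.append(exc)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if not outcome:  # the checker has not finished: treat as a stuck operation
        return Report(name, "timeout", f"no result within {timeout_s}s")
    if outcome[0] is not None:
        return Report(name, "error", repr(outcome[0]))
    return Report(name, "ok")


def watchdog_pass(checkers: Dict[str, Callable[[], None]],
                  timeout_s: float = 0.2) -> List[Report]:
    """Run every checker once and collect one report per checker."""
    return [run_checker(name, op, timeout_s) for name, op in checkers.items()]
```

In this sketch the report carries the checker's name, so a failing or stuck operation can be distinguished from the healthy ones running in the same process, which is the property that separates a partial failure from a full crash.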

BibTeX

@inproceedings {246326,
author = {Chang Lou and Peng Huang and Scott Smith},
title = {Understanding, Detecting and Localizing Partial Failures in Large System Software},
booktitle = {17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)},
year = {2020},
isbn = {978-1-939133-13-7},
address = {Santa Clara, CA},
pages = {559--574},
url = {https://www.usenix.org/conference/nsdi20/presentation/lou},
publisher = {USENIX Association},
month = feb
}
