Check before You Change: Preventing Correlated Failures in Service Updates

Authors: 

Ennan Zhai, Alibaba Group; Ang Chen, Rice University; Ruzica Piskac, Yale University; Mahesh Balakrishnan, Facebook; Bingchuan Tian, Nanjing University; Bo Song and Haoliang Zhang, Google

Abstract: 

The reliability of cloud services can be significantly undermined by correlated failures due to shared service dependencies, even when the services are already replicated across machines. State-of-the-art failure prevention systems can proactively audit a service before its deployment to detect risks of correlated failures, but they audit too slowly to keep pace with frequent service updates. This paper presents CloudCanary, a system that can perform real-time audits on service updates to identify the root causes of correlated failure risks, and generate improvement plans with increased reliability.

CloudCanary achieves this with two primitives, SnapAudit and DepBooster. SnapAudit leverages two insights to achieve high accuracy and efficiency: a) service updates typically affect only a small part of the service stack, allowing the majority of previous auditing results to be reused; and b) structural reliability auditing tasks can be reduced to a Boolean satisfiability problem, which can then be solved efficiently using modern SAT solvers. DepBooster, on the other hand, can generate improvement plans efficiently by reducing the required reasoning load, using novel techniques such as model counting. We demonstrate in our experiments that CloudCanary can perform audits over large deployments 200x faster than state-of-the-art systems, and that it consistently generates high-quality improvement plans within minutes. Moreover, CloudCanary can yield valuable insights over real-world traces collected from production environments.
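
To make the abstract's notion of structural reliability auditing concrete, here is a minimal sketch of the underlying question: which small sets of shared components, if they fail together, take down every replica of a service? This is not CloudCanary's implementation. The deployment, component names, and function names are all hypothetical, and brute-force enumeration stands in for the SAT solver and model counter that the paper describes, so it only works for toy inputs.

```python
from itertools import combinations, product

# Hypothetical deployment (an assumption for illustration): each replica
# lists the components it depends on. Both replicas share power_1 and libfoo.
REPLICAS = {
    "replica_a": {"switch_1", "power_1", "libfoo"},
    "replica_b": {"switch_2", "power_1", "libfoo"},
}

def service_fails(failed, replicas=REPLICAS):
    """The service fails only if every replica has at least one failed dependency."""
    return all(deps & failed for deps in replicas.values())

def minimal_risk_groups(replicas=REPLICAS):
    """Enumerate minimal component sets whose joint failure downs the service.

    Brute force over subsets in increasing size; a SAT solver would replace
    this loop on realistic inputs.
    """
    components = sorted(set().union(*replicas.values()))
    groups = []
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # Skip supersets of a smaller group that was already found.
            if any(g <= candidate for g in groups):
                continue
            if service_fails(candidate, replicas):
                groups.append(candidate)
    return groups

def count_failing_assignments(replicas=REPLICAS):
    """Naive model counting: how many up/down assignments down the service."""
    components = sorted(set().union(*replicas.values()))
    return sum(
        service_fails({c for c, down in zip(components, bits) if down}, replicas)
        for bits in product([False, True], repeat=len(components))
    )

if __name__ == "__main__":
    print(minimal_risk_groups())
    print(count_failing_assignments())
```

In this toy deployment the single shared components `libfoo` and `power_1` each form a risk group on their own, while the two switches only matter when they fail together. The count of failing assignments gives a rough way to compare candidate deployments: fewer failing assignments suggests fewer ways for a correlated failure to occur.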

NSDI '20 Open Access Sponsored by NetApp

BibTeX
@inproceedings {246366,
author = {Ennan Zhai and Ang Chen and Ruzica Piskac and Mahesh Balakrishnan and Bingchuan Tian and Bo Song and Haoliang Zhang},
title = {Check before You Change: Preventing Correlated Failures in Service Updates},
booktitle = {17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)},
year = {2020},
isbn = {978-1-939133-13-7},
address = {Santa Clara, CA},
pages = {575--589},
url = {https://www.usenix.org/conference/nsdi20/presentation/zhai},
publisher = {USENIX Association},
month = feb
}
