Automatic Reliability Testing For Cluster Management Controllers

Authors: 

Xudong Sun, Wenqing Luo, and Jiawei Tyler Gu, University of Illinois at Urbana-Champaign; Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, and Lalith Suresh, VMware; Tianyin Xu, University of Illinois at Urbana-Champaign

Abstract: 

Modern cluster managers like Borg, Omega and Kubernetes rely on the state-reconciliation principle to be highly resilient and extensible. In these systems, all cluster-management logic is embedded in a loosely coupled collection of microservices called controllers. Each controller independently observes the current cluster state and issues corrective actions to converge the cluster to a desired state. However, the complex distributed nature of the overall system makes it hard to build reliable and correct controllers – we find that controllers face myriad reliability issues that lead to severe consequences like data loss, security vulnerabilities, and resource leaks.
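The state-reconciliation pattern described above can be illustrated with a toy control loop. This is a minimal sketch, not code from Borg, Omega, Kubernetes, or the paper: the `Cluster` class, the pod naming, and the replica-count goal are all hypothetical and exist only to show the observe, compare, act cycle.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster state: a desired replica count and the pods that exist."""
    desired_replicas: int
    pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile(cluster: Cluster) -> bool:
    """Run one reconciliation step; return True if a corrective action was issued."""
    observed = len(cluster.pods)           # observe the current state
    if observed < cluster.desired_replicas:
        cluster.pods.append(f"pod-{cluster.next_id}")  # corrective action: create
        cluster.next_id += 1
        return True
    if observed > cluster.desired_replicas:
        cluster.pods.pop()                 # corrective action: delete
        return True
    return False                           # current state matches desired state


def run_until_converged(cluster: Cluster, max_steps: int = 100) -> int:
    """Repeat reconcile() until no action is needed; return the number of actions."""
    steps = 0
    while steps < max_steps and reconcile(cluster):
        steps += 1
    return steps


cluster = Cluster(desired_replicas=3)
steps = run_until_converged(cluster)
```

Because the loop always acts on the difference between observed and desired state, changing `desired_replicas` later and calling `run_until_converged` again moves the cluster to the new target without any special-case logic.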

We present Sieve, the first automatic reliability-testing tool for cluster-management controllers. Sieve drives controllers to their potentially buggy corners by systematically and extensively perturbing the controller’s view of the current cluster state in ways it is expected to tolerate. It then compares the cluster state’s evolution with and without perturbations to detect safety and liveness issues. Sieve’s design is powered by a fundamental opportunity in state-reconciliation systems – these systems are based on state-centric interfaces between the controllers and the cluster state; such interfaces are highly transparent and thereby enable fully-automated reliability testing. To date, Sieve has efficiently found 46 serious safety and liveness bugs (35 confirmed and 22 fixed) in ten popular controllers with a low false-positive rate of 3.5%.
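The idea of comparing cluster-state evolution with and without perturbations can be sketched in a few lines. This is a toy model under stated assumptions, not Sieve's implementation or API: the controller, the `staleness` parameter (a view that lags the real state by some number of rounds, used here as one example of a perturbation), and the `differs` check are all hypothetical.

```python
def buggy_controller(observed_pods, desired):
    """Hypothetical controller that only scales up and never removes extra pods."""
    return "create" if len(observed_pods) < desired else None


def run(desired, steps, staleness=0):
    """Drive the controller for `steps` rounds and return the final cluster state.

    `staleness` is an assumed perturbation: the controller sees the state as it
    was `staleness` rounds ago instead of the current state.
    """
    history = [[]]  # history[i] is the cluster state after i rounds
    for _ in range(steps):
        actual = list(history[-1])
        view = history[max(0, len(history) - 1 - staleness)]  # possibly stale view
        if buggy_controller(view, desired) == "create":
            actual.append(f"pod-{len(actual)}")
        history.append(actual)
    return history[-1]


def differs(reference, perturbed):
    """Flag a potential bug when the perturbed run ends in a different state."""
    return sorted(reference) != sorted(perturbed)


reference = run(desired=2, steps=10)               # unperturbed run
perturbed = run(desired=2, steps=10, staleness=2)  # run with a stale view
suspicious = differs(reference, perturbed)
```

In this toy, the unperturbed run settles at two pods while the stale view causes the controller to over-create and leak two extra pods, so the comparison flags the run. A real tool would need a far richer notion of state and of which differences are benign.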

OSDI '22 Open Access Sponsored by NetApp


BibTeX
@inproceedings {280892,
author = {Xudong Sun and Wenqing Luo and Jiawei Tyler Gu and Aishwarya Ganesan and Ramnatthan Alagappan and Michael Gasch and Lalith Suresh and Tianyin Xu},
title = {Automatic Reliability Testing For Cluster Management Controllers},
booktitle = {16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)},
year = {2022},
isbn = {978-1-939133-28-1},
address = {Carlsbad, CA},
pages = {143--159},
url = {https://www.usenix.org/conference/osdi22/presentation/sun},
publisher = {USENIX Association},
month = jul
}
