Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management

Jiawei Tyler Gu, Zhen Tang, Yiming Su, Bogdan Alexandru Stoica, Xudong Sun, and William X. Zheng, University of Illinois Urbana-Champaign; Yue Zhang and Akond Rahman, Auburn University; Chen Wang, IBM Research; Tianyin Xu, University of Illinois Urbana-Champaign

Modern cloud applications are increasingly managed by software programs, often called “operators,” which automate laborious, human-based operations. While operator programs largely prevent human mistakes, their own reliability has an unprecedented impact on the applications they manage. This paper discusses the emerging challenges of operator program reliability on cloud-native platforms like Kubernetes. Our work is grounded in a rigorous analysis of 412 real-world failures of thirteen Kubernetes operators. We find that the challenges of operator reliability come from the multifold complexity of an operator’s interactions with its managed applications, its environment, and its user interface. Among these, operators’ interactions with managed applications are the largest contributor to real-world operator failures, yet they are largely overlooked: these interactions are often ad hoc and lack well-defined interfaces. We advocate rethinking the management interface of cloud applications and demonstrate this urgent need by showing the prevalence of defects in existing operators. Specifically, we develop a simple testing tool that exercises interactions between operators and the managed cloud applications; it discovered 86 new bugs in six popular Kubernetes operators.

NSDI '26 Open Access sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {316060,
author = {Jiawei Tyler Gu and Zhen Tang and Yiming Su and Bogdan Alexandru Stoica and Xudong Sun and William X. Zheng and Yue Zhang and Akond Rahman and Chen Wang and Tianyin Xu},
title = {Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1171--1190},
url = {https://www.usenix.org/conference/nsdi26/presentation/gu},
publisher = {USENIX Association},
month = may
}
