Liqun Li and Xu Zhang, Microsoft Research; Xin Zhao, University of Chinese Academy of Sciences; Hongyu Zhang, The University of Newcastle; Yu Kang, Pu Zhao, Bo Qiao, and Shilin He, Microsoft Research; Pochian Lee, Jeffrey Sun, Feng Gao, and Li Yang, Microsoft Azure; Qingwei Lin, Microsoft Research; Saravanakumar Rajmohan, Microsoft 365; Zhangwei Xu, Microsoft Azure; Dongmei Zhang, Microsoft Research
Incidents and outages dramatically degrade the availability of large-scale cloud computing systems such as AWS, Azure, and GCP. In current incident response practice, each team has only a partial view of the entire system, which makes the detection of incidents like fighting in the "fog of war". As a result, prolonged mitigation time and more finance loss are incurred. In this work, we propose an automatic incident detection system, namely Warden, as a part of the Incident Management (IcM) platform. Warden collects alerts from different services and detects the occurrence of incidents from a global perspective. For each detected potential incident, Warden notifies relevant on-call engineers so that they could properly prioritize their tasks and initiate cross-team collaboration. We implemented and deployed Warden in the IcM platform of Azure. Our evaluation results based on data collected in an 18-month period from 26 major services show that Warden is effective and outperforms the baseline methods. For the majority of successfully detected incidents (∼68%), Warden is faster than human, and this is particularly the case for the incidents that take long time to detect manually.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.