Diagnosing Application-network Anomalies for Millions of IPs in Production Clouds

Authors: 

Zhe Wang, Shanghai Jiao Tong University; Huanwu Hu, Alibaba Cloud; Linghe Kong, Shanghai Jiao Tong University; Xinlei Kang and Teng Ma, Alibaba Cloud; Qiao Xiang, Xiamen University; Jingxuan Li and Yang Lu, Alibaba Cloud; Zhuo Song, Shanghai Jiao Tong University and Alibaba Cloud; Peihao Yang, Alibaba Cloud; Jiejian Wu, Shanghai Jiao Tong University; Yong Yang and Tao Ma, Alibaba Cloud; Zheng Liu, Alibaba Cloud and Zhejiang University; Xianlong Zeng and Dennis Cai, Alibaba Cloud; Guihai Chen, Shanghai Jiao Tong University

Abstract: 

Timely detection and diagnosis of application-network anomalies is a key challenge of operating large-scale production clouds. We reveal three practical issues in a cloud-native era. First, impact assessment of anomalies at a (micro)service level is absent in currently deployed monitoring systems. Ping systems are oblivious to the "actual weights'' of application traffic, e.g., traffic volume and the number of connections/instances. Failures of critical (micro)services with large weights can be easily overlooked by probing systems under prevalent network jitters. Second, the efficiency of anomaly routing (to a blamed application/network team) is still low with multiple attribution teams involved. Third, collecting fine-grained metrics at a (micro)service level incurs considerable computational/storage overheads, however, is indispensable for accurate impact assessment and anomaly routing.

We introduce the application-network diagnosing (AND) system in Alibaba cloud. AND exploits the single metric of TCP retransmission (retxs) to capture anomalies at (micro)service levels and correlates applications with networks end-to-end. To resolve deployment challenges, AND further proposes three core designs: (1) a collecting tool to perform filtering/statistics on massive retxs at the (micro)service level, (2) a real-time detection procedure to extract anomalies from ‘noisy’ retxs with millions of time series, (3) an anomaly routing model to delimit anomalies among multiple target teams/scenarios. AND has been deployed in Alibaba cloud for over three years and enables minute-level anomaly detection/routing and fast failure recovery.

USENIX ATC '24 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {298595,
author = {Zhe Wang and Huanwu Hu and Linghe Kong and Xinlei Kang and Qiao Xiang and Jingxuan Li and Yang Lu and Zhuo Song and Peihao Yang and Jiejian Wu and Yong Yang and Tao Ma and Zheng Liu and Xianlong Zeng and Dennis Cai and Guihai Chen},
title = {Diagnosing Application-network Anomalies for Millions of {IPs} in Production Clouds},
booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
year = {2024},
isbn = {978-1-939133-41-0},
address = {Santa Clara, CA},
pages = {885--899},
url = {https://www.usenix.org/conference/atc24/presentation/wang-zhe},
publisher = {USENIX Association},
month = jul
}

Presentation Video