Xiaoqing Sun, Alibaba Cloud; Xing Li, Zhejiang University and Alibaba Cloud; Xionglie Wei, Tian Pan, Ju Zhang, Bowen Yang, Yi Wang, Ye Yang, Yu Qi, Le Yu, Chenhao Jia, Zhanlong Zhang, Xinyu Chen, Xiaobo Xue, Jianyuan Lu, Shize Zhang, Enge Song, and Yang Song, Alibaba Cloud; Rong Wen, Fudan University and Alibaba Cloud; Biao Lyu, Alibaba Cloud and Hangzhou Alibaba Cloud Feitian Information Technology and Hangzhou Alibaba Feitian Information Technology; Yang Xu, Fudan University; Shunmin Zhu, Alibaba Cloud and Hangzhou Feitian Cloud
Failures are inevitable in production-scale cloud networks, making reliability a critical concern for both cloud service providers (CSPs) and their tenants. Existing network failure recovery solutions either fail to provide timely failover or require underlay upgrades, forcing tenants to deploy their own high-availability systems at additional CapEx and OpEx. However, most tenants lack the expertise or the willingness to invest in such systems, yet they are highly sensitive to service disruptions. This motivates CSPs to take responsibility for fast and deterministic failure recovery and to offer it as a cloud service.
In this work, we present ZooRoute, a tenant-transparent, underlay-agnostic network failure recovery service in Alibaba Cloud. ZooRoute operates at the overlay layer and bypasses failures by modifying the outer source port during VXLAN tunnel encapsulation. A set of source port candidates per destination IP is maintained through proactive probing between tunnel endpoints, which guarantees one-shot, deterministic rerouting of traffic onto healthy paths. However, scaling such a design to planet-scale cloud infrastructure brings challenges, including probing overhead at hypervisors, memory consumption at Tofino gateways, and service disruptions at stateful middleboxes, which we address with a range of novel techniques. Deployed in Alibaba Cloud for 26 months, ZooRoute has significantly improved network reliability, reducing cumulative outage time by 93.19% and masking 98.21% of failures from tenants.
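To illustrate why changing the outer source port can steer traffic around a failure, the following is a minimal sketch and not ZooRoute's implementation. It assumes that underlay switches pick an equal-cost path by hashing the outer 5-tuple, so a different outer UDP source port may land on a different path. The hash function, the path count, the names `ecmp_path` and `CandidateTable`, and the simulated probing are all hypothetical; real switches use vendor-specific hashes, and real probing runs between tunnel endpoints.

```python
import hashlib
from dataclasses import dataclass, field

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP destination port
UDP_PROTO = 17     # IP protocol number for UDP


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, n_paths: int) -> int:
    """Toy ECMP (assumption): hash the outer 5-tuple onto one of n_paths."""
    key = f"{src_ip}|{dst_ip}|{UDP_PROTO}|{src_port}|{VXLAN_PORT}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


@dataclass
class CandidateTable:
    """Hypothetical per-destination table of outer source ports whose
    paths were healthy at the last probe."""
    src_ip: str
    n_paths: int
    candidates: dict = field(default_factory=dict)  # dst_ip -> list of ports

    def probe(self, dst_ip, ports, failed_paths):
        """Simulated probing: keep ports whose path is not marked failed."""
        self.candidates[dst_ip] = [
            p for p in ports
            if ecmp_path(self.src_ip, dst_ip, p, self.n_paths) not in failed_paths
        ]

    def pick(self, dst_ip):
        """Return one pre-validated port, or None if no candidate is known."""
        ports = self.candidates.get(dst_ip, [])
        return ports[0] if ports else None


# Demo: the path used by port 49152 fails; choose a port on another path.
table = CandidateTable("10.0.0.1", n_paths=4)
failed = {ecmp_path("10.0.0.1", "10.0.0.2", 49152, 4)}
table.probe("10.0.0.2", range(49152, 49216), failed)
print("reroute via outer source port:", table.pick("10.0.0.2"))
```

Because candidates are validated before a failure occurs, the sender in this sketch switches ports once instead of trying ports at random, which is the intuition behind one-shot rerouting. The port range starts at 49152 because RFC 7348 recommends drawing the VXLAN source port from the dynamic range 49152-65535.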

author = {Xiaoqing Sun and Xing Li and Xionglie Wei and Tian Pan and Ju Zhang and Bowen Yang and Yi Wang and Ye Yang and Yu Qi and Le Yu and Chenhao Jia and Zhanlong Zhang and Xinyu Chen and Xiaobo Xue and Jianyuan Lu and Shize Zhang and Enge Song and Yang Song and Rong Wen and Biao Lyu and Yang Xu and Shunmin Zhu},
title = {{ZooRoute}: Enhancing {Cloud-Scale} Network Reliability via Candidate Path Provisioning and Overlay Proactive Rerouting},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {509--526},
url = {https://www.usenix.org/conference/nsdi26/presentation/sun},
publisher = {USENIX Association},
month = may
}
