Mayank Bansal, Milind Chabbi, Kenneth Bøgh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, and Subhav Pradhan, Uber Technologies; Ivan Beschastnikh, University of British Columbia; Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Albert Greenberg, and Uday Kiran Medisetty, Uber Technologies
Operating a global, real-time platform at Uber’s scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2× capacity model (each service provisioned to handle global traffic independently across two regions), which left half the fleet idle. We present Uber’s Failover Architecture (UFA), which replaces the uniform 2× model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use, during steady state, the failover buffer capacity reserved for critical services. During rare “full-peak” failovers, non-critical services are selectively preempted and then rapidly restored on on-demand capacity under differentiated Service-Level Agreements (SLAs). Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2× to 1.3×, raising utilization from 20% toward 30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated 575K CPU cores, and it is projected to eliminate over one million cores from a baseline of about 4 million cores.
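The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes that demand stays constant while provisioning drops from 2× to 1.3×, and that utilization is simply demand divided by provisioned capacity. The function names are hypothetical, and the paper's own accounting may differ.

```python
def utilization_after(util_before: float, prov_before: float, prov_after: float) -> float:
    """Utilization when the same demand is served by a smaller fleet.

    Assumption: utilization scales inversely with provisioned capacity.
    """
    return util_before * prov_before / prov_after


def fractional_reduction(before: float, after: float) -> float:
    """Fraction of the original quantity that is removed."""
    return (before - after) / before


# Figures quoted in the abstract.
new_util = utilization_after(0.20, prov_before=2.0, prov_after=1.3)
fleet_cut = fractional_reduction(2.0, 1.3)
realized_share = 575_000 / 4_000_000     # cores eliminated to date vs. baseline
projected_share = 1_000_000 / 4_000_000  # projected cores eliminated vs. baseline

print(f"utilization after: {new_util:.1%}")
print(f"provisioning cut:  {fleet_cut:.1%}")
print(f"realized share:    {realized_share:.1%}")
print(f"projected share:   {projected_share:.1%}")
```

Under these assumptions, moving from 2× to 1.3× lifts a 20% utilization to roughly 31%, which is consistent with the abstract's "toward 30%". The projected one-million-core saving corresponds to a quarter of the roughly 4-million-core baseline.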
NSDI '26 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

@inproceedings {nsdi26-bansal,
author = {Mayank Bansal and Milind Chabbi and Kenneth B{\o}gh and Srikanth Prodduturi and Kevin Xu and Amit Kumar and David Bell and Ranjib Dey and Yufei Ren and Sachin Sharma and Juan Marcano and Shriniket Kale and Subhav Pradhan and Ivan Beschastnikh and Miguel Covarrubias and Chien-Chih Liao and Sandeep Koushik Sheshadri and Wen Luo and Kai Song and Ashish Samant and Sahil Rihan and Nimish Sheth and Albert Greenberg and Uday Kiran Medisetty},
title = {Uber{\textquoteright}s Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {2011--2026},
url = {https://www.usenix.org/conference/nsdi26/presentation/bansal},
publisher = {USENIX Association},
month = may
}