Preventing Avalanche Failures in Large-Scale Distributed Systems

Zhen Zhen

Tuesday, 7 October, 2025 - 13:50–14:35

Zhen Zhen, Baidu Online Network Technology (Beijing) Co., Ltd.

In complex microservice architectures, a single frontend request can traverse dozens of service nodes. Under high concurrency, this creates tightly coupled scheduling chains in which minor performance jitter, combined with retries and queuing, can trigger avalanche failures. These occur when slow or failed services cause upstream retries and thread pool exhaustion, leading to blocked queues and cascading system-wide outages. Even after the initial fault is resolved, stuck requests may still clog the system, prolonging downtime.
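
To make the retry amplification concrete, the following is a minimal back-of-the-envelope sketch, not taken from the talk. It assumes every hop in a chain independently makes up to a fixed number of attempts against the next hop; the function name and the parameters attempts_per_hop and depth are hypothetical and chosen only for illustration.

```python
def worst_case_amplification(attempts_per_hop: int, depth: int) -> int:
    """Upper bound on requests reaching the deepest service for one
    frontend request, assuming each of `depth` hops independently makes
    up to `attempts_per_hop` attempts (first try plus retries)."""
    return attempts_per_hop ** depth


# Assumed example: 3 attempts per hop at chain depths 1, 3, and 5.
for depth in (1, 3, 5):
    print(depth, worst_case_amplification(3, depth))
```

Under these assumptions the bound grows geometrically with chain depth (3, 27, and 243 requests for the depths above), which is why per-hop retries that look harmless in isolation can overwhelm the deepest services.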

This work presents a systematic approach to preventing and mitigating such failures. Key strategies include fine-tuning timeout and retry settings, introducing full-link retry budgeting to reduce retry storms, and applying adaptive degradation and queue-shedding techniques during overload. By breaking the positive feedback loop between retries and queuing, systems can recover faster and maintain availability, even under fault conditions.
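
As an illustration of two of the strategies named above, the sketch below shows one common way to implement a retry budget and deadline-based queue shedding. It is not the speaker's implementation: the RetryBudget class, the shed_expired helper, and the parameters retry_percent and max_credits are all hypothetical, and the budget here is local to one client rather than propagated across the full link.

```python
from collections import deque


class RetryBudget:
    """Allow retries only as a bounded fraction of first-attempt traffic.

    Assumed scheme: every first attempt earns `retry_percent` credits and
    every retry costs 100 credits, so retries stay at or below roughly
    `retry_percent` percent of regular requests. Integer credits avoid
    floating-point drift.
    """

    RETRY_COST = 100

    def __init__(self, retry_percent: int = 10, max_credits: int = 1000) -> None:
        self.retry_percent = retry_percent
        self.max_credits = max_credits
        self.credits = 0

    def record_request(self) -> None:
        # Called once per first attempt; the cap limits burst retries.
        self.credits = min(self.max_credits, self.credits + self.retry_percent)

    def try_retry(self) -> bool:
        # Spend credits if available; otherwise fail fast without retrying.
        if self.credits >= self.RETRY_COST:
            self.credits -= self.RETRY_COST
            return True
        return False


def shed_expired(queue: deque, now: float) -> int:
    """Drop queued (deadline, request) pairs whose deadline has passed.

    Returns the number of entries shed. Work whose caller has already
    timed out is discarded instead of being processed.
    """
    kept = [(deadline, req) for deadline, req in queue if deadline > now]
    dropped = len(queue) - len(kept)
    queue.clear()
    queue.extend(kept)
    return dropped
```

In this sketch, a caller would invoke record_request() on every first attempt and retry only when try_retry() returns True, while a server would call shed_expired() before dequeuing work. Both pieces weaken the feedback loop: the budget caps how much extra load failures can generate, and shedding keeps stale requests from occupying the queue after their callers have given up.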

Zhen Zhen is a Senior R&D Engineer at Baidu, responsible for the availability of Baidu's search system. His work focuses on stability engineering, infrastructure technologies, and data engineering.

BibTeX

@conference {311824,
author = {Zhen Zhen},
title = {Preventing Avalanche Failures in {Large-Scale} Distributed Systems},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
