How to Make Releases Safer in Baidu

Wednesday, June 06, 2018 - 4:20 pm–4:45 pm

Pingping Xue and Yu Chen, Baidu


Changes and updates are a major source of service faults. At Baidu, around 54% of faults are introduced by changes. As a result, progressive rollout has become imperative for improving service stability. Progressive rollout divides the deployment process into several stages, each of which deploys the change to only a subset of the instances. Checks are applied between consecutive stages to detect faults. If a fault is detected, the deployment is terminated and rolled back.
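The staged process described above can be illustrated with a minimal sketch. It is not Baidu's rollout system: the function name `progressive_rollout`, the cumulative `stage_fractions` parameter, and the `deploy`, `rollback`, and `check` callbacks are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    instances: Sequence[str],
    stage_fractions: Sequence[float],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    check: Callable[[List[str]], bool],
) -> bool:
    """Deploy a change stage by stage (hypothetical sketch).

    stage_fractions holds cumulative fractions of the fleet, e.g.
    [0.1, 0.5, 1.0]. After each stage, check() inspects the instances
    deployed so far; if it returns False, every deployed instance is
    rolled back and the rollout stops.
    """
    deployed: List[str] = []
    for fraction in stage_fractions:
        # Number of instances that should carry the change after this stage.
        target = int(len(instances) * fraction)
        for inst in instances[len(deployed):target]:
            deploy(inst)
            deployed.append(inst)
        # Check between consecutive stages to detect faults.
        if not check(deployed):
            for inst in reversed(deployed):
                rollback(inst)
            return False
    return True
```

In this sketch the check sees the whole set of instances deployed so far, so a fault that only becomes visible at a larger scale can still stop the rollout at a later stage.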

Intuitively, we could build a rollout system that lets development engineers specify the checking rules for each stage. Surprisingly, however, developers are not good at this, even though they created the modules. Reliability engineers are therefore forced to add rules on stability indicators, but these rules produce numerous false alarms that frequently stall the release procedure. As a result, we turned to machine-learning-based methods. To produce satisfactory results, the algorithm must learn the “normal” changes of each indicator and quantitatively measure the current changes to decide whether a fault is present.
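To make the idea of learning "normal" changes and scoring the current change concrete, here is a deliberately simple statistical stand-in. It is not the machine learning algorithm presented in the talk, whose details the abstract does not give. The sketch assumes that, for each indicator, the change observed during past healthy releases is available as a list of numbers, and it scores a new change by its z-score against that history. The names `learn_baseline`, `score_change`, and `is_faulty`, as well as the threshold of 3.0, are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Baseline = Dict[str, Tuple[float, float]]


def learn_baseline(history: Dict[str, List[float]]) -> Baseline:
    """Summarize each indicator's past changes as (mean, standard deviation).

    Each indicator needs at least two historical values, because
    statistics.stdev requires two or more data points.
    """
    return {name: (mean(deltas), stdev(deltas)) for name, deltas in history.items()}


def score_change(baseline: Baseline, name: str, delta: float) -> float:
    """Return how many standard deviations delta lies from the normal change."""
    mu, sigma = baseline[name]
    if sigma == 0:
        # No variation in the history: any deviation at all is unusual.
        return 0.0 if delta == mu else float("inf")
    return abs(delta - mu) / sigma


def is_faulty(baseline: Baseline, current: Dict[str, float], threshold: float = 3.0) -> bool:
    """Flag the stage if any indicator changes far more than it normally does."""
    return any(score_change(baseline, n, d) > threshold for n, d in current.items())
```

Because each indicator is compared with its own history, an indicator that normally shifts a lot during releases tolerates larger changes than a normally stable one, which is the property fixed hand-written rules tend to lack.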

In this talk, we will present several real cases to demonstrate the dilemma we face in rollout checking and to show how the machine learning algorithm works.

Pingping Xue, Baidu

Pingping Xue is a Senior SRE in the SRE Department of Baidu. She has worked on release efficiency and stability for four years. She helped build the progressive rollout mechanisms, which have prevented many release faults. Her work improved Baidu's release efficiency and significantly accelerated product iteration.

Yu Chen, Baidu

Yu Chen is a Data Architect at the IOP group of Baidu's Cloud Unit. His work focuses on service stability issues, including alerting and diagnosis. Previously, he worked at Microsoft Research Asia. His research interests are distributed systems, consensus protocols, search ranking, and query recommendation.

@conference {214951,
author = {Pingping Xue and Yu Chen},
title = {How to Make Releases Safer in Baidu},
year = {2018},
publisher = {USENIX Association},
month = jun
}
