Accept Partial Failures, Minimize Service Loss

Tuesday, May 23, 2017 - 3:55pm-4:20pm

Daxin Wang, Baidu Inc.

Abstract: 

Large Internet products are too complex to recover completely from a failure quickly, because root cause localization and large-scale operations are very time-consuming. If a failure can be isolated to a small part of the system, we can redirect user queries to the parts that still work, or cut off the failed minor subsystem, which is much faster than a complete system recovery. This talk presents some practical experiences with failure isolation at Baidu.

First, we should have at least N+1 data center redundancy and eliminate unnecessary global “single-point” modules. Every automated operation should be limited to executing in one data center first. When the system in one data center is damaged by a network or software failure, we can transfer user requests to other data centers rapidly, even automatically.
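
As an illustration only, the N+1 idea and the traffic shift can be sketched in a few lines of Python. This is not Baidu's implementation; the `DataCenter` type, the capacity figures, and the function names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    # Hypothetical model: a name, a serving capacity, and a health flag.
    name: str
    capacity: int  # requests per second this data center can serve
    healthy: bool = True


def survives_single_failure(dcs, peak_load):
    """N+1 check: after losing any one data center, the rest still cover peak load."""
    total = sum(d.capacity for d in dcs)
    return all(total - d.capacity >= peak_load for d in dcs)


def route_weights(dcs):
    """Split traffic across healthy data centers in proportion to capacity.

    Unhealthy data centers get a weight of 0.0, so their share of user
    requests moves to the remaining ones.
    """
    healthy = [d for d in dcs if d.healthy]
    if not healthy:
        raise RuntimeError("no healthy data center available")
    total = sum(d.capacity for d in healthy)
    return {d.name: (d.capacity / total if d.healthy else 0.0) for d in dcs}
```

With three equally sized data centers, marking one as unhealthy moves its share evenly onto the other two, and the N+1 check passes only while the peak load fits within any two of them.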

Second, we make the non-essential components of the system detachable. When one of them fails, it can be detached immediately so that the major functions keep working for users.
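
One way to picture a detachable component is a wrapper that returns a fallback value when the component is switched off or raises an error. The sketch below is a minimal illustration under that assumption; `DetachableComponent`, `render_page`, and the component names are hypothetical and do not come from the talk.

```python
class DetachableComponent:
    """Wraps a non-essential feature so it can fail or be switched off safely."""

    def __init__(self, name, func, fallback):
        self.name = name
        self.func = func          # the real (non-essential) call
        self.fallback = fallback  # value served when the component is unavailable
        self.detached = False     # an operator or automation can flip this flag

    def call(self, *args):
        if self.detached:
            # Detached: skip the component entirely.
            return self.fallback
        try:
            return self.func(*args)
        except Exception:
            # Degrade this one feature instead of failing the whole request.
            return self.fallback


def render_page(query, core_search, extras):
    """Build a response: the core result is required, the extras are optional."""
    page = {"results": core_search(query)}
    for component in extras:
        page[component.name] = component.call(query)
    return page
```

In this sketch, a failure in the core search still fails the request, while a failing or detached extra only removes its own section of the response.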

Daxin Wang, Baidu Inc.

Daxin Wang has worked on the Baidu SRE team for more than 7 years, focusing on the principles of building and operating highly available products, including monitoring, HA architecture, and safe automation.

BibTeX
@conference {202771,
author = {Daxin Wang},
title = {Accept Partial Failures, Minimize Service Loss},
year = {2017},
publisher = {USENIX Association},
month = may
}