Xianping Qu and Jingjing Ha, Baidu
Baidu runs thousands of applications on hundreds of thousands of servers. To keep services highly available and reliable, our SREs have developed many operations tools and systems. However, these tools are difficult to reuse and scale because they are built on differing operations concepts, runtime environments, and operations strategies. We therefore built the AIOps platform (here, "AI" stands for automation and intelligence) to help SREs develop operations tools more quickly and efficiently. The platform provides a unified operations abstract layer, operations strategies, and automated scheduling and execution, so SREs can focus on building their own custom and advanced features.
In this talk, we demonstrate the core workflow of the AIOps platform through real cases from the production environment of Baidu's core products. We will cover the platform architecture, the OKB (operations knowledge base), OPAL (operations abstract layer), and practices in failover, auto-scaling, and related areas.
Xianping Qu is a manager of the DevOps team at Baidu, the largest search engine in China, and has built Baidu’s monitoring platform and data warehouse. He now leads the DevOps team on challenging projects such as anomaly detection, root cause analysis (RCA), and auto-scaling. He is also interested in data analysis and machine learning.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.