AIOps: Challenges and Experiences in Azure

Ze Li and Yingnong Dang, Microsoft Azure


While AI and machine learning are transforming the whole industry, AIOps is transforming how cloud services are built and operated. The value of AIOps is broad and includes, at a minimum, service quality assurance, customer experience at scale, continuous COGS reduction, and higher engineering productivity. Building AIOps solutions poses unique challenges compared to applying AI and ML in other domains. In this talk, we will share the challenges we met in Azure in building AIOps solutions and our experiences in solving them. We will also share a few case studies, including (1) our disk failure prediction service, which predicts disk health and proactively live-migrates workloads to a healthy disk; (2) an end-to-end analytics service for safe deployment in large-scale system infrastructure, based on ensemble ranking and spatial/temporal algorithms using a lambda architecture; and (3) an anomaly detection and auto-diagnosis service that monitors essential telemetry in the cloud system.

Ze Li, Microsoft Azure

Dr. Ze Li is a data scientist at Microsoft Azure. Currently, he focuses on using data-driven and AI technologies to build and operate cloud services efficiently and effectively, in areas such as safe deployment in large-scale systems, intelligent anomaly detection, and pattern mining in cloud services. Previously, he worked as a data scientist/engineer at Capital One and MicroStrategy, where he provided data-driven solutions to improve efficiency in financial services and business intelligence services. He has published more than 40 peer-reviewed papers in the fields of data mining, distributed networks/systems, and mobile computing. He holds a Ph.D. in computer engineering from Clemson University.

Yingnong Dang, Microsoft Azure

Yingnong Dang is a Principal Data Scientist Manager at Microsoft Azure. Yingnong's focus is on building analytics and ML solutions for improving Azure infrastructure availability and capacity, boosting engineering productivity, and increasing customer satisfaction. Yingnong and the team have a close partnership with Microsoft Research and academia. Before joining Azure in December 2013, Yingnong was a researcher at the Microsoft Research Asia lab. His research areas include software analytics, data visualization, data mining, and human-computer interaction. As a researcher, he transferred various technologies to Microsoft product teams, including code clone analysis, crash dump analysis, and performance trace analysis. He holds 45+ U.S. patents and has published papers in top conferences including ICSE, FSE, VLDB, USENIX ATC, and NSDI.

@conference {232959,
author = {Ze Li and Yingnong Dang},
title = {AIOps: Challenges and Experiences in Azure},
year = {2019},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},