Improving Service Availability of Cloud Systems by Predicting Disk Error

Authors: 

Yong Xu and Kaixin Sui, Microsoft Research, China; Randolph Yao, Microsoft Azure, USA; Hongyu Zhang, The University of Newcastle, Australia; Qingwei Lin, Microsoft Research, China; Yingnong Dang, Microsoft Azure, USA; Peng Li, Nankai University, China; Keceng Jiang, Wenchi Zhang, and Jian-Guang Lou, Microsoft Research, China; Murali Chintalapati, Microsoft Azure, USA; Dongmei Zhang, Microsoft Research, China

Abstract: 

High service availability is crucial for cloud systems. A typical cloud system uses a large number of physical hard disk drives, and disk errors are among the most important causes of service unavailability. Disk errors (such as sector errors and latency errors) can be seen as a form of gray failure: fairly subtle failures that are hard to detect, even when applications are afflicted by them. In this paper, we propose to predict disk errors proactively, before they cause more severe damage to the cloud system. The ability to predict faulty disks enables the live migration of existing virtual machines and the allocation of new virtual machines to healthy disks, thereby improving service availability. To build an accurate online prediction model, we utilize both disk-level sensor (SMART) data and system-level signals. We develop a cost-sensitive, ranking-based machine learning model that learns the characteristics of disks that were faulty in the past and ranks disks by their error-proneness in the near future. We evaluate our approach using real-world data collected from a production cloud system. The results confirm that the proposed approach is effective and outperforms related methods. Furthermore, we have successfully applied the proposed approach to improve service availability of Microsoft Azure.
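
As a rough illustration of the "cost-sensitive ranking" idea, the sketch below ranks disks by a predicted error-proneness score and then picks how many top-ranked disks to flag so that a weighted misclassification cost is minimized. This is not the authors' implementation: the function names, the score values, and the per-error costs `fp_cost` and `fn_cost` are all hypothetical, and the abstract does not say how the model produces its scores or how the cutoff is chosen.

```python
from typing import Dict, List, Set, Tuple


def rank_disks(scores: Dict[str, float]) -> List[str]:
    """Return disk IDs ordered from most to least error-prone.

    `scores` maps a disk ID to a predicted error-proneness score
    (hypothetical values; any ranking model could supply them).
    """
    return sorted(scores, key=lambda disk: scores[disk], reverse=True)


def choose_cutoff(
    ranked: List[str],
    faulty: Set[str],
    fp_cost: float,
    fn_cost: float,
) -> Tuple[int, float]:
    """Pick how many top-ranked disks to flag as faulty.

    Tries every cutoff k and returns the (k, cost) pair with the lowest
    total cost, where cost = fp_cost * false positives
                           + fn_cost * false negatives.
    `faulty` is the set of disks known to be faulty in labeled history.
    Ties are resolved in favor of the smaller k.
    """
    best_k, best_cost = 0, float("inf")
    for k in range(len(ranked) + 1):
        flagged = set(ranked[:k])
        false_pos = len(flagged - faulty)   # healthy disks flagged
        false_neg = len(faulty - flagged)   # faulty disks missed
        cost = fp_cost * false_pos + fn_cost * false_neg
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


if __name__ == "__main__":
    # Hypothetical scores and labels, for illustration only.
    scores = {"d1": 0.9, "d2": 0.2, "d3": 0.7, "d4": 0.4}
    ranked = rank_disks(scores)
    k, cost = choose_cutoff(ranked, faulty={"d1", "d4"}, fp_cost=3, fn_cost=1)
    print(ranked[:k], cost)
```

Making a false positive more expensive than a false negative pushes the cutoff toward flagging fewer disks; reversing the costs pushes it toward flagging more.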

BibTeX
@inproceedings {216071,
author = {Yong Xu and Kaixin Sui and Randolph Yao and Hongyu Zhang and Qingwei Lin and Yingnong Dang and Peng Li and Keceng Jiang and Wenchi Zhang and Jian-Guang Lou and Murali Chintalapati and Dongmei Zhang},
title = {Improving Service Availability of Cloud Systems by Predicting Disk Error},
booktitle = {2018 USENIX Annual Technical Conference (USENIX ATC 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Boston, MA},
pages = {481--494},
url = {https://www.usenix.org/conference/atc18/presentation/xu-yong},
publisher = {USENIX Association},
month = jul
}
