Proactive error prediction to improve storage system reliability

Website Maintenance Alert

Due to scheduled maintenance on Wednesday, October 16, from 10:30 am to 4:30 pm Pacific Daylight Time (UTC -7), parts of the USENIX website (e.g., conference registration, user account changes) may not be available. We apologize for the inconvenience.

If you are trying to register for LISA19, please complete your registration before or after this time period.

Authors: 

Farzaneh Mahdisoltani, University of Toronto; Ioan Stefanovici, Microsoft Research; Bianca Schroeder, University of Toronto

Abstract: 

This paper proposes the use of machine learning techniques to make storage systems more reliable in the face of sector errors. Sector errors are partial drive failures, where individual sectors on a drive become unavailable, and occur at a high rate in both hard disk drives and solid state drives. The data in the affected sectors can only be recovered through redundancy in the system (e.g. another drive in the same RAID) and is lost if the error is encountered while the system operates in degraded mode, e.g. during RAID reconstruction.

In this paper, we explore a range of different machine learning techniques and show that sector errors can be predicted ahead of time with high accuracy. Prediction is robust, even when only little training data or only training data for a different drive model is available. We also discuss a number of possible use cases for improving storage system reliability through the use of sector error predictors. We evaluate one such use case in detail: We show that the mean time to detecting errors (and hence the window of vulnerability to data loss) can be greatly reduced by adapting the speed of a scrubber based on error predictions.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {203217,
author = {Farzaneh Mahdisoltani and Ioan Stefanovici and Bianca Schroeder},
title = {Proactive error prediction to improve storage system reliability},
booktitle = {2017 {USENIX} Annual Technical Conference ({USENIX} {ATC} 17)},
year = {2017},
isbn = {978-1-931971-38-6},
address = {Santa Clara, CA},
pages = {391--402},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/mahdisoltani},
publisher = {{USENIX} Association},
month = jul,
}

Presentation Audio