Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures

Authors: 

Erci Xu, Ohio State University; Mai Zheng, Iowa State University; Feng Qin, Ohio State University; Yikang Xu and Jiesheng Wu, Alibaba Group

Abstract: 

Modern datacenters increasingly use flash-based solid state drives (SSDs) for high performance and low energy cost. However, SSDs introduce more complex failure modes than traditional hard disks. While great efforts have been made to understand the reliability of SSDs themselves, it remains unclear what types of system-level failures are related to SSDs, what their root causes are, and how the rest of the system interacts with SSDs and contributes to failures. Answering these questions can help practitioners build and maintain highly reliable SSD-based storage systems.

In this paper, we study the reliability of SSD-based storage systems deployed in Alibaba Cloud, which cover nearly half a million SSDs and span over three years of usage under representative cloud services. We take a holistic view, analyzing both device errors and system failures to better understand the potential causal relations. In particular, we focus on failures that are Reported As "SSD-Related" (RASR) by system status monitoring daemons. Through log analysis, field studies, and validation experiments, we identify the characteristics of RASR failures in terms of their distribution, symptoms, and correlations. Moreover, we derive a number of major lessons and a set of effective methods to address the issues observed. We believe that our study and experience will benefit the community and could facilitate building highly reliable SSD-based storage systems.

BibTeX
@inproceedings {234992,
author = {Erci Xu and Mai Zheng and Feng Qin and Yikang Xu and Jiesheng Wu},
title = {Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures},
booktitle = {2019 {USENIX} Annual Technical Conference ({USENIX} {ATC} 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {961--976},
url = {https://www.usenix.org/conference/atc19/presentation/xu},
publisher = {{USENIX} Association},
month = jul,
}
