NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow

Authors: 

Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, PDL; Yiming Zhang, Xiamen University; Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Minglu Li, Shanghai Jiao Tong University & Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.

Abstract: 

NVMe SSDs have become a staple in modern datacenters thanks to their high throughput and ultra-low latency. Despite their popularity, the reliability of NVMe SSDs under mass deployment remains unknown. In this paper, we collect logs from over one million NVMe SSDs deployed at Alibaba and conduct an extensive analysis. From the study, we identify a series of major reliability changes in NVMe SSDs. On the good side, NVMe SSDs have become more resilient to early failures and to variance in access patterns. On the bad side, NVMe SSDs have become more vulnerable to complicated correlated failures. More importantly, we discover that their ultra-low latency makes NVMe SSDs much more likely to be impacted by fail-slow failures.

BibTeX
@inproceedings {280680,
author = {Ruiming Lu and Erci Xu and Yiming Zhang and Zhaosheng Zhu and Mengtian Wang and Zongpeng Zhu and Guangtao Xue and Minglu Li and Jiesheng Wu},
title = {{NVMe} {SSD} Failures in the Field: the {Fail-Stop} and the {Fail-Slow}},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
isbn = {978-1-939133-29-9},
address = {Carlsbad, CA},
pages = {1005--1020},
url = {https://www.usenix.org/conference/atc22/presentation/lu},
publisher = {USENIX Association},
month = jul,
}
