Understanding and Finding Crash-Consistency Bugs in Parallel File Systems

Authors: 

Jinghan Sun, Chen Wang, Jian Huang, and Marc Snir, University of Illinois at Urbana-Champaign

Abstract: 

Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash-consistency bugs have not been extensively studied, and the corresponding bug-finding or testing tools are lacking. In this paper, we first conduct a thorough bug study on popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and their interactions with local file systems. The study results drive our design of a scalable testing framework, named PFSCheck. PFSCheck is easy to use and lightweight: it automatically generates test cases for triggering potential crash-consistency bugs and traces essential file operations with low overhead. PFSCheck also scales to large HPC clusters, as it exploits parallelism to speed up the verification of persistent storage states.

BibTeX
@inproceedings {254300,
author = {Jinghan Sun and Chen Wang and Jian Huang and Marc Snir},
title = {Understanding and Finding Crash-Consistency Bugs in Parallel File Systems},
booktitle = {12th {USENIX} Workshop on Hot Topics in Storage and File Systems (HotStorage 20)},
year = {2020},
url = {https://www.usenix.org/conference/hotstorage20/presentation/sun},
publisher = {{USENIX} Association},
month = jul,
}
