Experiences in Measuring the Reliability of a Cache-Based Storage System

Dan Lambright

We present our experiences in benchmarking the reliability of the cache component of a storage system in a development environment. The reliability metrics we measured are availability from the standpoint of the host and maintainability from the standpoint of the system operator. We created errors using software fault injection, and measured their impact using a combination of performance measurement techniques and the rehearsal of maintenance procedures. This paper gives three case studies. The first two describe experiments that recreate very specific breakdowns in the software logic, and the third describes an experiment simulating a memory hardware failure that creates unpredictable effects. We found that, taken together, these various techniques gave us a useful picture of how well our cache management software tolerated faults.

Dan Lambright, EMC

BibTeX

@inproceedings {271189,
author = {Dan Lambright},
title = {Experiences in Measuring the Reliability of a {Cache-Based} Storage System},
booktitle = {First Workshop on Industrial Experiences with Systems Software (WIESS 2000)},
year = {2000},
address = {San Diego, CA},
url = {https://www.usenix.org/conference/wiess-2000/experiences-measuring-reliability-cache-based-storage-system},
publisher = {USENIX Association},
month = oct
}

Links

Paper:

http://www.usenix.org/events/osdi2000/wiess2000/full_papers/lambright/lambright.pdf

Paper (HTML):

http://www.usenix.org/events/osdi2000/wiess2000/full_papers/lambright/lambright_html/index.html
