USENIX Conference Policies
Exploring Failure Transparency and the Limits of Generic Recovery
We explore the abstraction of failure transparency in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer or unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible for these cases.
author = {David E. Lowell and Subhachandra Chandra and Peter Chen},
title = {Exploring Failure Transparency and the Limits of Generic Recovery},
booktitle = {Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000)},
year = {2000},
address = {San Diego, CA },
url = {https://www.usenix.org/conference/osdi-2000/exploring-failure-transparency-and-limits-generic-recovery},
publisher = {USENIX Association},
month = oct
}