Failure Recovery: When the Cure Is Worse Than the Disease
Zhenyu Guo, Sean McDirmid, Mao Yang, and Li Zhuang, Microsoft Research Asia; Pu Zhang, Microsoft Research Asia and Peking University; Yingwei Luo, Peking University; Tom Bergan, Microsoft Research and University of Washington; Madan Musuvathi, Zheng Zhang, and Lidong Zhou, Microsoft Research Asia
Cloud services inevitably fail: machines lose power, networks become disconnected, pesky software bugs cause sporadic crashes, and so on. Unfortunately, failure recovery itself is often faulty; e.g., recovery can accidentally and recursively replicate small failures to other machines until the entire cloud service fails in a catastrophic outage, amplifying a small cold into a contagious deadly plague! We propose that failure recovery should be engineered foremost according to the maxim of primum non nocere: it should “do no harm.” Accordingly, we must consider the system holistically when failure occurs and recover only when observed activity safely allows for it.
author = {Zhenyu Guo and Sean McDirmid and Mao Yang and Li Zhuang and Pu Zhang and Yingwei Luo and Tom Bergan and Madan Musuvathi and Zheng Zhang and Lidong Zhou},
title = {Failure Recovery: When the Cure Is Worse Than the Disease},
booktitle = {14th Workshop on Hot Topics in Operating Systems (HotOS XIV)},
year = {2013},
address = {Santa Ana Pueblo, NM},
url = {https://www.usenix.org/conference/hotos13/session/guo},
publisher = {USENIX Association},
month = may
}