Remus: High Availability via Asynchronous Virtual Machine Replication

Brendan Cully; Geoffrey Lefebvre; Dutch Meyer; Mike Feeley; Norm Hutchinson; Andrew Warfield

Allowing applications to survive hardware failure is an expensive undertaking, which generally involves re-engineering software to include complicated recovery logic as well as deploying special-purpose hardware; this represents a severe barrier to improving the dependability of large or legacy applications. We describe the construction of a general and transparent high availability service that allows existing, unmodified software to be protected from the failure of the physical machine on which it runs. Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections. Our approach encapsulates protected software in a virtual machine, asynchronously propagates changed state to a backup host at frequencies as high as forty times a second, and uses speculative execution to concurrently run the active VM slightly ahead of the replicated system state.
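The checkpointing scheme described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' implementation: the `Primary` and `Backup` classes, their method names, and the page dictionary are hypothetical, and holding outgoing packets until the backup acknowledges a checkpoint is an assumption of this sketch about how speculative execution stays externally consistent. The one figure taken from the abstract is the checkpoint frequency: forty checkpoints a second corresponds to a 25 ms epoch.

```python
from dataclasses import dataclass, field

# Forty checkpoints per second (from the abstract) gives a 25 ms epoch.
EPOCH_MS = 1000 / 40


@dataclass
class Backup:
    """Hypothetical backup host: holds the last checkpoint it fully received."""
    pages: dict = field(default_factory=dict)
    epoch: int = 0

    def apply(self, epoch, dirty):
        # Apply the whole checkpoint, then acknowledge it by returning its epoch.
        self.pages.update(dirty)
        self.epoch = epoch
        return epoch


@dataclass
class Primary:
    """Hypothetical active VM: runs ahead of the backup's copy of its state."""
    pages: dict = field(default_factory=dict)
    dirty: dict = field(default_factory=dict)
    pending_output: list = field(default_factory=list)
    released_output: list = field(default_factory=list)
    epoch: int = 0

    def write(self, page, value):
        self.pages[page] = value
        self.dirty[page] = value  # remember changed state for the next checkpoint

    def send(self, packet):
        # Assumption: output produced speculatively is held back for now.
        self.pending_output.append(packet)

    def checkpoint(self, backup):
        self.epoch += 1
        snapshot, self.dirty = self.dirty, {}
        outputs, self.pending_output = self.pending_output, []
        # A real system would overlap this transfer with further execution;
        # the sketch uses a plain synchronous call to stay short.
        acked = backup.apply(self.epoch, snapshot)
        if acked == self.epoch:
            self.released_output.extend(outputs)


primary, backup = Primary(), Backup()
primary.write(0, "a")
primary.send("reply-1")
state_before = (dict(backup.pages), list(primary.released_output))
primary.checkpoint(backup)
state_after = (dict(backup.pages), list(primary.released_output))
```

In this model the backup sees nothing and no output is released until `checkpoint` completes, so if the primary failed partway through an epoch, the backup's state would still match everything the outside world had observed.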

Brendan Cully, University of British Columbia

Geoffrey Lefebvre, University of British Columbia

Dutch Meyer, University of British Columbia

Mike Feeley, University of British Columbia

Norm Hutchinson, University of British Columbia

Andrew Warfield, University of British Columbia and Citrix Systems, Inc.

BibTeX

@inproceedings {213651,
author = {Brendan Cully and Geoffrey Lefebvre and Dutch Meyer and Mike Feeley and Norm Hutchinson and Andrew Warfield},
title = {Remus: High Availability via Asynchronous Virtual Machine Replication},
booktitle = {5th USENIX Symposium on Networked Systems Design and Implementation (NSDI 08)},
year = {2008},
address = {San Francisco, CA},
url = {https://www.usenix.org/conference/nsdi-08/remus-high-availability-asynchronous-virtual-machine-replication},
publisher = {USENIX Association},
month = apr
}
