PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance

Authors: 

Cheng Wang, Xusheng Chen, Weiwei Jia, Boxuan Li, Haoran Qiu, Shixiong Zhao, and Heming Cui, The University of Hong Kong

Abstract: 

Cloud computing enables a vast deployment of online services in virtualized infrastructures, making it crucial to provide fast fault-tolerance for virtual machines (VM). Unfortunately, despite much effort, achieving fast and multi-core scalable VM fault-tolerance is still an open problem. A main reason is that the dominant primarybackup approach (e.g., REMUS) transfers an excessive amount of memory pages, all of them, updated by a service replicated on the primary VM and the backup VM. This approach makes the two VMs identical but greatly degrades the performance of services.

State machine replication (SMR) enforces the same total order of inputs for a service replicated across physical hosts. This makes most updated memory pages across hosts the same and they do not need to be transferred. We present Virtualized SMR (VSMR), a new approach to tackle this open problem. VSMR enforces the same order of inputs for a VM replicated across hosts. It uses commodity hardware to efficiently compute updated page hashes and to compare them across replicas. Therefore, VSMR can efficiently enforce identical VMs by transferring only divergent pages. An extensive evaluation on PLOVER, the first VSMR system, shows that PLOVER’s throughput on multi-core is 2.2X to 3.8X higher than three popular primary-backup systems. Meanwhile, PLOVER consumed 9.2X less network bandwidth than both of them. PLOVER’s source code and raw results are released on github.com/ hku-systems/plover.

NSDI '18 Open Access Videos Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {211241,
author = {Cheng Wang and Xusheng Chen and Weiwei Jia and Boxuan Li and Haoran Qiu and Shixiong Zhao and Heming Cui},
title = {{PLOVER}: Fast, Multi-core Scalable Virtual Machine Fault-tolerance},
booktitle = {15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Renton, WA},
pages = {483--489},
url = {https://www.usenix.org/conference/nsdi18/presentation/wang},
publisher = {USENIX Association},
month = apr
}

Presentation Video