RRC: Responsive Replicated Containers

Authors: 

Diyu Zhou, UCLA and EPFL; Yuval Tamir, UCLA

Abstract: 

Replication is the basic mechanism for providing application-transparent reliability through fault tolerance. The design and implementation of replication mechanisms is particularly challenging for general multithreaded services, where high latency overhead is not acceptable. Most of the existing replication mechanisms fail to meet this challenge.

RRC is a fully-operational fault tolerance mechanism for multiprocessor workloads, based on container replication. It minimizes the latency overhead during normal operation by addressing two key sources of this overhead: (1) it decouples the latency overhead from checkpointing frequency using a hybrid of checkpointing and replay, and (2) it minimizes the pause time for checkpointing by forking a clone of the container to be checkpointed, thus allowing execution to proceed in parallel with checkpointing. The fact that RRC is based on checkpointing makes it inherently less vulnerable to data races than active replication. In addition, RRC includes mechanisms that further reduce the vulnerability to data races, resulting in high recovery rates, as long as the rate of manifested data races is low. The evaluation includes measurement of the recovery rate and recovery latency based on thousands of fault injections. On average, RRC delays responses to clients by less than 400mu's and recovers in less than 1s. The average pause latency is less than 3.3ms. For a set of eight real-world benchmarks, if data races are eliminated, the performance overhead of RRC is under 48%.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {280784,
author = {Diyu Zhou and Yuval Tamir},
title = {{RRC}: Responsive Replicated Containers},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
isbn = {978-1-939133-29-61},
address = {Carlsbad, CA},
pages = {85--100},
url = {https://www.usenix.org/conference/atc22/presentation/zhou-diyu},
publisher = {USENIX Association},
month = jul
}

Presentation Video