Engineering Record and Replay for Deployability

Authors: 

Robert O’Callahan and Chris Jones, unaffiliated; Nathan Froyd, Mozilla Corporation; Kyle Huey, unaffiliated; Albert Noll, Swisscom AG; Nimrod Partush, Technion

Abstract: 

The ability to record and replay program executions with low overhead enables many applications, such as reverse-execution debugging, debugging of hard-to-reproduce test failures, and “black box” forensic analysis of failures in deployed systems. Existing record-and-replay approaches limit deployability by recording an entire virtual machine (heavyweight), modifying the OS kernel (adding deployment and maintenance costs), requiring pervasive code instrumentation (imposing significant performance and complexity overhead), or modifying compilers and runtime systems (limiting generality). We investigated whether it is possible to build a practical record-and-replay system avoiding all these issues. The answer turns out to be yes, if the CPU and operating system meet certain non-obvious constraints. Fortunately, modern Intel CPUs, Linux kernels and user-space frameworks do meet these constraints, although this has only become true recently. With some novel optimizations, our system RR records and replays real-world low-parallelism workloads with low overhead, with an entirely user-space implementation, using stock hardware, compilers, runtimes and operating systems. RR forms the basis of an open-source reverse-execution debugger seeing significant use in practice. We present the design and implementation of RR, describe its performance on a variety of workloads, and identify constraints on hardware and operating system design required to support our approach.

BibTeX
@inproceedings {203227,
author = {Robert O{\textquoteright}Callahan and Chris Jones and Nathan Froyd and Kyle Huey and Albert Noll and Nimrod Partush},
title = {Engineering Record and Replay for Deployability},
booktitle = {2017 USENIX Annual Technical Conference (USENIX ATC 17)},
year = {2017},
isbn = {978-1-931971-38-6},
address = {Santa Clara, CA},
pages = {377--389},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/ocallahan},
publisher = {USENIX Association},
month = jul
}
