USENIX Conference Policies
User-Level Checkpointing for LinuxThreads Programs
Multiple threads running in a single, shared address space is a simple model for writing parallel programs for symmetric multiprocessor (SMP) machines and for overlapping I/O and computation in programs run on either SMP or single processor machines. Often a long running program's user would like the program to save its state periodically in a checkpoint from which it can recover in case of a failure. This paper introduces the first system to provide checkpointing support for multithreaded programs that use LinuxThreads, the POSIX based threads library for Linux.
The checkpointing library is simple to use, flexible, and efficient. Virtually all of the overhead of the checkpointing system comes from saving the checkpoint to disk. The checkpointing library added no measurable overhead to tested application programs when they took no checkpoints. Checkpoint file size is approximately the same size as the checkpointed process's address space. On the current implementation WATER-SPATIAL from the SPLASH2 benchmark suite saved a 2.8 MB checkpoint in about 0.18 seconds for local disk or about 21.55 seconds for an NFS mounted disk. The overhead of saving state to disk can be minimized through various techniques including varying the checkpoint interval and excluding regions of the address space from checkpoints.
author = {William R. Dieter and James E. Lumpp, Jr.},
title = {{User-Level} Checkpointing for {LinuxThreads} Programs},
booktitle = {2001 USENIX Annual Technical Conference (USENIX ATC 01)},
year = {2001},
address = {Boston, MA},
url = {https://www.usenix.org/conference/2001-usenix-annual-technical-conference/user-level-checkpointing-linuxthreads-programs},
publisher = {USENIX Association},
month = jun
}