Libckpt: Transparent Checkpointing under Unix James S. Plank, Micah Beck, Gerry Kingsley University of Tennessee Kai Li Princeton University Abstract Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This ``user-directed'' checkpointing is an innovation which is unique to our work. 1. Introduction Consider a programmer who has developed an application which will take a long time to execute, say five days. Two days into the computation, the processor on which the application is running fails. If the programmer has not planned for this event, his only choice is to restart the program and lose two days of work. Upon restarting the program, he still needs five days of continuous failure-free processor time to complete the job. Libckpt is a checkpointing library designed for such a programmer. To use libckpt, all he must do is change one line of his source code and recompile with the library libckpt.a No other modifications need to be made. Upon execution, the program will periodically save its execution state to disk (at the default interval: every 10 minutes). Upon a processor failure, the programmer need only restart the program with the command line flag ``=recover,'' and the program will roll back to the most recently checkpointed state. In the example above, at most ten minutes of work will be lost, and three more days of non-continuous failure-free processor time will be needed to complete the job. Libckpt is a tool for transparent checkpointing on uniprocessors running Unix. It implements incremental and copy-on-write checkpointing, two optimizations well-known in the literature [2, 3, 9, 10]. Libckpt is a user-level library and uses only facilities which are commonly available under Unix. Libckpt has been ported to and tested on a variety of architectures and operating systems with no kernel modifications. Source code for libckpt can be obtained at no cost by anonymous FTP from cs.utk.edu:pub/plank/libckpt. In this paper, we show the performance gains available in libckpt through transparent incremental and copy-on-write checkpointing. In addition, we introduce a new optimization technique, implemented in libckpt, called ``user-directed checkpointing''. User-directed checkpointing works under the assumption that a little information from the user can yield large improvements in the performance of checkpointing. We demonstrate that this assumption is often valid. 2. Transparent Checkpointing The goal of checkpointing is to establish a recovery point in the execution of a program, and to save enough state to restore the program to this recovery point in the event of a failure. The most straightforward method for establishing a recovery point under Unix is to suspend execution of the application while the entire contents of a process's memory and registers are written to a file. This is called sequential checkpointing because disk transfers are not interleaved with program execution. Recovery is effected by reloading the executable from its original file, and then reconstructing the memory and register state from the checkpoint file. This is akin to creating a core file, from which a user may recover using the undump utility and execve(). We say that checkpointing is transparent when no changes need to be made to the application program. While transparency is easy to obtain at the kernel level, it is harder to achieve in a user level checkpointing library. All current implementations of checkpointing share this limitation: They operate transparently and correctly so long as the application is well-behaved in a sense we will define in Section 4. Libckpt and all other user-level checkpointers can cause a correct but ill-behaved application to fail or to produce incorrect output upon recovery. Checkpointing with libckpt is not completely transparent. The name of the initial procedure in C must be changed from main() to ckpt_target(). This enables libckpt to gain control of the program as it starts, check the command line for the =recover flag, read a file called .ckptrc to set checkpointing parameters, and begin checkpointing. In FORTRAN, libckpt is enabled by changing the main PROGRAM module to SUBROUTINE ckpt_target(). No other program modifications are needed. By default, once libckpt gets control of a program, it generates a timer interrupt every ten minutes, and takes a sequential checkpoint at each interrupt. This and other defaults can be changed by placing appropriate lines in the .ckptrc file. In this section, we describe all options, where appropriate, as they would appear in the .ckptrc file. Placing the line ``checkpointing '' in the .ckptrc file turns checkpointing on or off. If off, libckpt will take no checkpoints and will not affect the execution of the application. The default is on. ``dir '' specifies the directory in which checkpoint files are created and found. The default is the current directory. ``maxtime '' defines the interval between checkpoints. At the beginning of the program, and after each checkpoint, libckpt calls alarm(seconds) and takes a checkpoint upon catching each ALRM signal. Setting the timer interval to zero turns off all timer-based checkpointing. The default value of maxtime is 600 (10 minutes). Many optimizations to simple sequential checkpointing have been described in the literature. Libckpt implements all published optimizations that are applicable to general-purpose uniprocessor checkpointing, as well as the new user-directed optimization. In the remainder of this section, we consider each of them in turn. 2.1 Incremental Checkpointing When a checkpoint is taken, only the portion of the checkpoint that has changed since the previous checkpoint need to be saved. The unchanged portion can be restored from previous checkpoints. Incremental checkpointing [2, 3, 18] uses page protection hardware to identify the unchanged portion of the checkpoint. Saving only the changed portion reduces the size of each checkpoint, and thus the overhead of checkpointing. ``incremental '' turns incremental checkpointing on or off. The default is off. In general, the size of a non-incremental checkpoint grows very slowly over time if at all. Moreover, only the most recent checkpoint file needs to be retained for recovery --- older ones may be deleted. In contrast, old checkpoint files cannot be deleted when incremental checkpointing is employed, because the program's data state is spread out over many checkpoint files. The cumulative size of incremental checkpoint files will increase at a steady rate over time, since many updated values may be saved for the same page. In order to place an upper bound on the cumulative size of incremental checkpoint files, it is necessary to coalesce all old checkpoint files into one new file, and then discard the old files. For this purpose, libckpt includes a utility program ckpt_coa, which coalesces a collection of incremental checkpoint files into a single checkpoint file. ``maxfiles '' sets the maximum number of incremental checkpoint files to n. After n checkpoint files have been created, libckpt invokes ckpt_coa to coalesce them into one file. If n=1, then no incremental checkpointing can occur. Values of n greater than one allow the user to strike a balance between the time and space overhead of incremental checkpointing. The default is n=1. Libckpt uses page protection to identify which pages should be included in incremental checkpoints. Specifically, after initialization and after each checkpoint, the mprotect() system call is invoked to set the protection of all pages in the data space to read-only. When a write occurs to a memory location in a protected page, the SEGV signal is caught by a handler in libckpt. The faulting page has its access protection set to read-write, and the page is marked as dirty. When libckpt takes the next checkpoint, only the dirty pages are included. 2.2 Forked Checkpointing Incremental checkpointing as described in the previous section is still sequential: Execution of the application program is suspended while the checkpoint file is written out. An alternative is to make a copy of the program's data space and to use an asynchronously executing thread of control to write the checkpoint file. This is called ``main-memory checkpointing''[10], and improves checkpoint overhead if there is enough physical memory to hold the checkpoint, as the saving of the checkpoint to disk is overlapped with the execution of the application. The Unix fork() primitive provides exactly the mechanism needed to implement main-memory checkpointing [7, 12]. When forked checkpointing is specified, libckpt forks a child process, which creates and writes the checkpoint file while the parent process returns to executing the application. The fork() system call provides the child with a fixed snapshot of the parent's data space and a separate thread of control. ``fork in the .ckptrc file turns main forked checkpointing on or off. The default is off. An important improvement to main-memory checkpointing is copy-on-write checkpointing [2, 9, 10]. Here the copy of main memory is taken using copy-on-write [4, 16]. Many implementations of fork() use a copy-on-write mechanism to optimize the copying of the parent's address space [15]. Thus, forked checkpointing corresponds to either main-memory checkpointing or copy-on-write checkpointing, depending on the operating system's implementation of fork(). 2.3 Checkpoint Compression With checkpoint compression, a standard compression algorithm like LZW [17] is used to shrink the size of the checkpoint [8, 13]. While this may be successful at reducing checkpoint size, it only improves the overhead of checkpointing if the speed of compression is faster than the speed of disk writes, and if the checkpoint is significantly compressed. For uniprocessor checkpointing this is not the case. Compression has only been shown to be effective in parallel systems with disk contention [13]. For this reason, checkpoint compression is not implemented in libckpt. 3. User-Directed Checkpointing All the optimizations presented so far maintain the transparency of checkpointing through techniques that are not visible to the typical application program: signal handlers, page protection, and the creation of child processes. In this section, we consider a different approach that can improve on the performance of these transparent techniques and can also substitute for them when automatic mechanisms are not available. We call this approach ``user-directed checkpointing.'' We consider two ways in which user-supplied directives can improve the performance of checkpointing: memory exclusion and synchronous checkpointing. 3.1 Memory Exclusion There are two situations where the values of memory locations can be excluded from a checkpoint file: when the locations are dead and when they are clean. In the case of dead locations, the values in memory will never be read or written, and thus do not need to be saved. In the case of clean locations, the values in memory exist in a previous checkpoint and have not been changed. Thus they need not be saved in the current checkpoint. While the identification of excludable areas of memory can sometimes be automated (as in incremental checkpointing), libckpt also allows the programmer to declare them explicitly. For example, suppose the user allocates a large temporary array T to make a calculation. When the lifetime of the data in array T is over, it will never be referenced again --- the next use of array T will overwrite the old values. If a checkpoint is taken outside of the lifetime of array T, then it can be safely excluded from the checkpoint. Any computation proceeding from this point will not need to use the current values stored in array T. The stack is a run-time mechanism that helps the checkpointer to determine the lifetime of local variables. This is one form of memory exclusion: Only the live portion of the stack is saved. Unfortunately, this does not work for heap variables or for variables which reside in the statically allocated data segment. The basis of incremental checkpointing is that clean data need not be repeatedly written to disk. In order to implement automatic incremental checkpointing, libckpt monitors page modifications using the mprotect() system call and a handler for the SEGV signal. This approach has a few weaknesses: It can only operate at the page granularity; system calls can fail rather than generating a SEGV signal when asked to write to a protected page; and on some systems mprotect() is not reliable. In those cases where automatic mechanisms cannot determine all possible memory exclusions, the performance of checkpointing can suffer. For this reason, libckpt allows the programmer to manage memory exclusion explicitly through two procedure calls: exclude_bytes(char *addr, int size, int usage) include_bytes(char *addr, int size) Exclude_bytes() tells libckpt to exclude the region of memory specified from subsequent checkpoints. It may be called when the user knows that these bytes are not necessary for the correct recovery from the program. Usage is an argument which currently may have one of two values: CKPT_READONLY or CKPT_DEAD. If the former, then exclude_bytes() has been called because the specified memory will not be written to until the user calls include_bytes() on it. Consequently, libckpt includes this memory in the next checkpoint, but excludes it from subsequent checkpoints until the memory is included with include_bytes(). If CKPT_DEAD is specified, then the memory is dead --- it will not be read before it is next written. Thus, libckpt excludes this from the next and subsequent checkpoints, until it is is explicitly included with include_bytes(). Include_bytes tells libckpt to include the specified region of memory in the next and subsequent checkpoints. Thus, include_bytes() cancels the effect of calls to exclude_bytes(), although calls to include_bytes() do not have to match calls to exclude_bytes(). By default, libckpt includes all bytes in a process's active stack and data segments that have not been explicitly excluded. User-directed memory exclusion can dramatically reduce the size of sequential and incremental checkpoint files, but it must be used very carefully. If a live region of memory is mistakenly excluded from a checkpoint, then a subsequent failure and recovery can cause an otherwise correct application to fail or to generate incorrect results. 3.2 Synchronous Checkpointing} In the previous section we discuss a mechanism for optimizing asynchronous checkpointing by excluding certain areas of memory. This allows the checkpointer to make use of data lifetime information which would not otherwise be available to it. However, the amount of data which can be excluded from the checkpoint is determined by the program's state when the checkpoint is taken. If the stack is large, or the size of excluded memory is small, then memory exclusion will have little effect. Synchronous checkpointing is a user directive that allows the programmer to specify points in the program where it is most advantageous for checkpointing to occur. These are called ``synchronous'' checkpoints because they are not initiated by timer interrupts. Synchronous checkpoints should be inserted by the programmer at points where memory exclusion can have the greatest effect. checkpoint_here() is a procedure call specifying where a synchronous checkpoint can be taken. Synchronous checkpoints may be placed in program locations that are reached often. Checkpointing too often, however, can lead to poor performance, and in order to avoid this libckpt allows a minimum interval between checkpoints to be specified. ``mintime '' specifies the minimum period of time that must pass between checkpoints. The default is zero. If mintime seconds have not passed since the previous checkpoint, then checkpoint_here() calls are ignored. Synchronous and asynchronous checkpointing techniques can complement one another. If maxtime seconds have passed and no synchronous checkpoint has been taken since the last checkpoint, then an asynchronous checkpoint is still taken. However, the effect of memory exclusion is likely not to be as beneficial as in a synchronous checkpoint. If both the mintime and maxtime parameters are set, then the former specifies the minimum interval between synchronous checkpoints, and the latter specifies an interval after which an asynchronous checkpoint will be taken. Whenever a checkpoint is taken, both the minimum and maximum interval timers are reset. If maxtime is zero, then asynchronous checkpoints are disabled. In this case the specification of memory exclusion can be optimized for synchronous checkpoints, because there is no danger of asynchronous checkpoints being taken. 3.3 An Example There are many examples where user-directed, synchronous checkpointing can yield large performance gains. Consider the program in Figure 1. This is a typical driver program for many kinds of programs that repeat calculations over numerous points in a data set. Figure 2 shows how one can checkpoint this program with synchronous, user-directed checkpointing in libckpt. ---------------------------------------------------------------------- main() { struct data *D; FILE *fi, *fo; D = allocate_data_set(); fi = fopen("input", "r"); fo = fopen("output", "w"); while(read_data(fi, D) != -1) { perform_calculation(D); output_results(fo, D); } } ---------------------------------------------------------------------- Figure 1: A typical scientific driver program, no checkpointing ---------------------------------------------------------------------- ckpt_target() { struct data *D; FILE *fi, *fo; D = allocate_data_set(); fi = fopen("input", "r"); fo = fopen("output", "w"); while(read_data(fi, D) != -1) { perform_calculation(D); output_results(fo, D); exclude_bytes(D, sizeof(struct data), CKPT_DEAD); checkpoint_here(); include_bytes(D, sizeof(struct data)); } } ---------------------------------------------------------------------- Figure 2: A typical scientific driver program with checkpointing} By specifying that the checkpoint must be taken at the checkpoint_here() call, we are able to omit all of the variable D from the synchronous checkpoint. This is because D is initialized anew at each iteration of the program. If D is large, then user-directed checkpointing will be responsible for a significant savings in checkpoint overhead. Note that this will be a vast improvement over incremental checkpointing because the memory locations in D will be dirty at the time of the checkpoint. Section 6 shows other successful examples of user-directed checkpointing. 4. The Mechanics of Checkpointing and Recovery The motivation for checkpointing is to reconstruct the recovery point. We therefore begin with an overview of the recovery process before describing the details of checkpointing. Recovery has four parts: process creation, data state restoration, system state restoration, and processor state restoration. 1. Process creation is implemented by invoking the checkpointed program with a special command line argument for recovery. This automatically restores the text portion of the process's state and begins execution. Libckpt parses the command line, detects the argument for recovery, and calls the recovery routine. 2. The recovery routine performs the rest of the recovery. Data state restoration means reading the checkpoint file to recreate the contents of data memory: This consists of the process's stack and data segments. 3. System state restoration means restoring as much of the operating system state as possible to its state at the time of the checkpoint. Much of the operating system state, such as the process ID and parent process ID, is unrestorable. However, most applications that need checkpointing are what we call ``well-behaved,'' and do not rely on such state. Libckpt determines the state of the open file table at each checkpoint, and saves it as part of each checkpoint. Upon recovery, libckpt restores the system so that the state of open files is the same as it was at the time of the checkpoint. No other system state is either saved or restored by libckpt. 4. Processor state restoration requires that processor registers, including the program counter and stack pointers be restored to their values when the checkpoint was taken. In libckpt, we use setjmp() to store the processor state in memory. The processor state is restored using longjmp(). Thus the recovery routine never returns, and execution continues as an apparent ``second return'' from the setjmp() of the checkpointing routine. Thus the mechanics of checkpointing are straightforward: When taking a checkpoint libckpt saves the processor state using setjmp() and records the state of the open file table. Then the data state, consisting of the program's stack and data segments, is written to disk. 5. Experiments In this section, we present the results of checkpointing five application programs using libckpt. The applications are long-running FORTRAN and C programs written by scientists to run under Unix. All are typical of programs that can benefit from checkpointing for fault-tolerance. |------------------------|-------|----------|-------|------|----------| | Application | Abbrv | Language | Run | Max | Ckp | | | | | Time | Ckp | Interval | | | | | mm:ss | Size | (min) | | | | | | (MB) | | |------------------------|-------|----------|-------|------|----------| | Matrix Multiplication | MAT | C | 15:20 | 4.6 | 2 | | Linear Equation Solver | SOLVE | FORTRAN | 13:42 | 4.6 | 2 | | Cellular Automata | CELL | C | 17:39 | 8.4 | 2 | | Shallow Water Model | WATER | FORTRAN | 25:54 | 13.1 | 3 | | Multicommodity Flow | MCNF | FORTRAN | 18:38 | 24.3 | 6 | |------------------------|-------|----------|-------|------|----------| Table 1. Description of application instances The experiments were performed on a dedicated Sparcstation 2 running SunOS 4.1.3, and writing to a Hewlett Packard HP6000 disk via NFS. The specific instances of the applications are described in Table 2. We describe the applications below: - Matrix Multiplication (MAT): This is a straightforward matrix multiplication. Two 615 X 615 matrices are read from disk and multiplied, and the product matrix is written to an output file. - Linear Equation Solver (SOLVE): This is a testing program from LAPACK, a high-performance package of linear-algebra subroutines [1]. This program generates a system of 750 equations with 750 unknowns, uses LU decomposition to solve the system, and then writes the solution to disk. It repeats this process for seven separate systems of equations. - Cellular Automata (CELL): This program executes a 2048 X 2048 grid of cellular automata for fifteen generations. - Shallow Water Model (WATER): This is the program STSWM from the National Center for Atmospheric Research. The program is a shallow water model based on the spectral transform method [5]. The instance used here is ``Zonal Flow over a Mountain'' from their test suite, modeled at 15-minute intervals for six hours. - Multicommodity Flow (MCNF): This program solves the multicommodity network flow problem using the simplex method [6]. The instance used here runs on a network of 100 vertices and 50 commodities. Note that for the purposes of these experiments, input values have been chosen to give running times between thirteen and thirty minutes. Typically, the programs would be set up to run for much longer, thus making them ideal candidates for libckpt. We present results pertaining to the three important metrics of checkpointing performance: - Checkpoint time: This is the average duration of a checkpoint, from start to finish. - Checkpoint overhead: This is amount of time added to the running time of the application as a result of checkpointing. Note that in sequential checkpointing, overhead is equal to the total checkpoint time. In main-memory and copy-on-write checkpointing, the overhead is smaller than the total checkpoint time because the disk writes are performed in parallel with the execution of the application. - Checkpoint size: This is the average size of the checkpoint file. The prime goal of checkpoint optimization is to minimize all three of these metrics, while still providing adequate fault-tolerance. Minimizing checkpoint overhead is the most important, because users would rather take the risk of failure than use a checkpointer that increases their applications' running time significantly. Keeping the overhead of checkpointing under 10% of the program's total running time is a reasonable goal [9, 13]. Minimizing checkpoint size is also important, as disk space rarely comes for free. Checkpoint time is the least important of the three metrics: When checkpointing for fault-tolerance, the only concern is that the current checkpoint complete before the user desires the next checkpoint to begin. 6. Results All of the experimental results are contained in Table 2 in the appendix. All of the graphs and data in this section are drawn directly from Table 2. (as this version of the paper is plain text, I have omitted the postscript figures. In their place, I have included small tables which themselves are derived from Table 2). 6.1 Sequential Checkpointing With no optimizations, checkpoint time and overhead should be the same, and should be directly proportional to the checkpoint size. Figure 3 confirms this prediction, showing checkpoint overhead and time vs. size for the sequential checkpointing runs described in Table 2. |-----------------|------------------------------| | Checkpoint Size | Checkpoint Time and Overhead | | (Mbytes) | (seconds) | |-----------------|------------------------------| | 0.11 | 1.3 | | 0.59 | 3.6 | | 4.26 | 19.3 | | 4.59 | 21.2 | | 4.62 | 19.6 | | 8.46 | 45.4 | | 14.17 | 74.6 | | 24.31 | 159.3 | |-----------------|------------------------------| Figure 3. Checkpoint Time vs. Size for Sequential Checkpointing 6.2 Checkpointing with fork() When checkpointing with fork(), the application writes its checkpoints to disk asynchronously. This enables it to run concurrently with the saving of the checkpoint, thereby reducing the overhead of checkpointing dramatically, as shown in Figure 4. This figure displays the percentage reduction in the overhead of checkpointing by using fork() (Ninety percent reduction in checkpoint overhead means that the overhead of checkpointing using fork() is ten percent of the overhead of sequential checkpointing). Because SunOS 4.1.3 implements fork() with copy-on-write, Figure 4 shows that copy-on-write improves the overhead of checkpointing by over 70 percent in almost all cases. |-----------------|------------------------| | Checkpoint Size | Percent Reduction | | (Mbytes) | In Checkpoint Overhead | |-----------------|------------------------| | 0.04 | 37.74 | | 0.11 | 46.32 | | 0.56 | 79.32 | | 0.57 | 74.69 | | 0.59 | 77.10 | | 1.75 | 92.44 | | 3.63 | 65.96 | | 4.26 | 70.86 | | 4.55 | 76.35 | | 4.59 | 79.61 | | 4.62 | 75.73 | | 5.23 | 76.30 | | 8.44 | 78.76 | | 8.46 | 78.63 | | 14.17 | 80.18 | | 24.31 | 89.89 | |-----------------|------------------------| Figure 4. Percentage Reduction in Checkpointing Overhead by Using fork() 6.3 Incremental Checkpointing Figure 5 is a graph showing the percentage reduction of checkpoint size and checkpoint overhead when using incremental checkpointing instead of simple sequential checkpointing. In three of the applications (MAT, WATER, and MCNF), only a fraction of the applications' address spaces are modified between checkpoints, resulting in a significant reduction in the average checkpoint size. Correspondingly, the overhead of checkpointing is significantly reduced. In the other two programs, the entire address spaces of the programs are modified between checkpoints, yielding little to no reduction in the size of checkpoints. Therefore, in SOLVE and CELL, the overhead of checkpointing is increased due to the fact that the cost of handling page faults is not offset by a savings in the time to write the checkpoint to disk. |-------------|----------------------|------------------------| | Application | Percentage Reduction | Percentage Reduction | | | in Checkpoint Size | in Checkpoint Overhead | |-------------|----------------------|------------------------| | MAT | 87.59 | 82.99 | | SOLVE | 1.35 | -23.78 | | CELL | 0.17 | -20.07 | | WATER | 63.11 | 60.03 | | MCNF | 92.11 | 82.62 | |-------------|----------------------|------------------------| Figure 5. Reduction in Checkpoint Size and Overhead Through Incremental Checkpointing 6.4 User-Directed Checkpointing In the previous three sections, the results corroborate published research concerning checkpointing optimizations [2, 3, 9, 10]. In this section, we evaluate the the new technique: user-directed checkpointing. In three of the applications, we analyzed the application programs and inserted directives in the code. In each case, we were able to add under ten lines of code, making checkpoints synchronous, and excluding memory from these checkpoints. We describe the details of each application below. SOLVE: Adding directives to the Linear Equation Solver was straightforward. At the end of each iteration, all of the program's arrays are dead: The matrix of equations will be initialized anew for the next iteration, and the solution vector will be recalculated. Therefore at the end of each iteration, we insert exclude_bytes() calls for the equation matrix and solution vector, then a checkpoint_here() call, and finally include_bytes() calls to re-include the matrix and vector in case of an asynchronous checkpoint. The results can be seen in Figure 6: The calls to exclude_bytes() and checkpoint_here() produce checkpoint files that are almost, reducing the checkpoint size and overhead by over 90 percent. This is significant, because it is an application where incremental checkpointing fails to improve the performance of checkpointing. Figure 6: Just see the results comparing "trans" and "user" in Table 2. CELL: At the end of each generation of the cellular automaton application, the previous value of the automaton grid becomes dead --- its values are not used for the calculation of the subsequent generations of the computation. Therefore we added user directives to checkpoint at the end of each generation, excluding the dead half of the grid from each checkpoint. In order to checkpoint at roughly the same interval as before, we also set mintime to 100, so that every second generation is checkpointed. The results are in Figure 7. With our user directives, the checkpoint size is halved. Accordingly, the overhead of checkpointing is also halved. Thus, as in SOLVE, the calls to exclude_bytes() and checkpoint_here() succeed in improving the overhead of checkpointing in an application where incremental checkpointing fails. Figure 7: Just see the results comparing "trans" and "user" in Table 2. MAT: In the matrix multiplication the two input matrices are read-only data. Moreover, once a product element is calculated it too is read-only. This is why incremental checkpointing works so well. In this application, we inserted exclude_bytes() calls (with flag=CKPT_READONLY) after reading the input matrices and also after calculating a product row to mark the memory as read-only. Thus, once a checkpoint contains these values, subsequent checkpoints omit them. The behavior of the application with these calls should approximate standard incremental checkpointing --- after data becomes read-only, it is omitted from subsequent checkpoints. The results of this experiment are shown in Figure 8. The important bars are the solid ones, showing that the checkpoints obtained with the user directives are approximately the same size as those obtained with incremental checkpointing. Moreover, they show slightly lower overhead, because they spend no extra time catching page faults. Figure 8: Just see the results comparing "trans" and "user" in Table 2. 7. Related Work There has been much computer science research devoted to checkpointing. Checkpointing has been implemented on uniprocessors [8, 11], multiprocessors [9, 10], transputers [14], multicomputers [13], and and distributed systems [2, 7]. Of these implementations, only two (Condor [11] and Fail-Safe PVM [7]) are publicly available code for Unix environments. Both implement sequential checkpointing with forking, and neither is designed for simple uniprocessor checkpointing: Condor is a system for batch programming using process migration, and Fail-Safe PVM requires the programmer to have access to the PVM infrastructure. Neither package implements any optimizations beyond calling fork(). User-directed checkpointing bears some similarity to checkpointers by Li and Fuchs [8], and Silva et al [14]. The former describes ``static checkpointing'', which is similar to our synchronous checkpointing. The user places potential_checkpoint_here() calls into his program, and the compiler and/or runtime system decides which of those calls would be best for checkpointing. They call for no user assistance in determining the memory to exclude (they only exclude the stack and unallocated heap memory), and do not show the dramatic performance improvements gained by user-directed checkpointing in the SOLVE, CELL, and MAT applications. Silva et al implement a checkpointing package for transputers in which the user specifies exactly what and where to checkpoint, but the process state is not included in checkpoints. Thus the user is responsible for rebuilding the call stack, although not the data, on recovery. Our approach differs because the checkpointer is responsible for the entire process state, and not just for the integrity of the data. 8 Conclusion We have written a general-purpose checkpointing library, libckpt, that provides fault-tolerance for long-running programs under Unix. The strengths of this library are its ease of use and low overhead. Libckpt is currently available via anonymous FTP to cs.utk.edu in the directory pub/plank/libckpt. Our experiments with libckpt show first and foremost that it is general-purpose and easy to use. We were able to checkpoint all five applications by changing one line of the applications' source code, and relinking with libckpt. Once enabled, these programs could save their state to disk periodically for fault-tolerance, using the fork() and incremental checkpointing optimizations if so desired. For all five applications, we were able to dramatically lower the overhead of checkpointing with copy-on-write, as implemented by libckpt's fork() optimization. Moreover, in three of the five applications, checkpoint size and overhead were reduced by over 60 percent using incremental checkpointing. Thus, libckpt is able to take efficient checkpoints using standard techniques from the checkpointing literature. Libckpt also implements user-directed checkpointing, a new technique for improving the performance of checkpointing based on the assumption that a little user input to the checkpointer can result in a large performance payoff. Memory exclusion and synchronous checkpointing are the two ways in which a user can direct the checkpointer to checkpoint more efficiently. In our experiments, directives added to three of the applications yielded performance improvements in all three cases. One avenue of future research is to employ compiler analysis to assist user-directed checkpointing. If the user places the checkpoint_here() calls, the compiler can use data dependence analysis to make calls to exclude_bytes() and include_bytes(). The benefits may be twofold. First, the compiler may discover dead variables to exclude that the user may omit. Second, the compiler can guarantee that its memory exclusion will yield correct checkpoints. In other words, whereas the user might err in excluding too much memory from a checkpoint, resulting in a faulty recovery state, the compiler can guarantee correctness. It is the authors' opinion that checkpointing primitives such as those provided by libckpt should be implemented in the operating system. This will improve both the performance and the generality of checkpointing. Until such a time, users can make use of a tool such as libckpt to render their programs resilient to failure. Acknowledgements The authors thank Jian Xu, Jack Dongarra, Christian Halloy, and the National Center for Atmospheric Research for help in obtaining test programs. We also thank Nitin Vaidya, Mootaz Elnozahy, Heather Booth, and the referees for their valuable comments. James Plank is supported by NSF grant CCR-9409496. Kai Li is supported by ARPA and ONR under contracts N00014-91-J-4039, and Intel Supercomputer Systems Division. References [1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK User's Guide. SIAM, Philadelphia, PA, 1992. [2] E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In 11th Symposium on Reliable Distributed Systems, pages 39--47, October 1992. [3] S. I. Feldman and C. B. Brown. Igor: A system for program debugging via reversible execution. ACM SIGPLAN Notices, Workshop on Parallel and Distributed Debugging, 24(1):112--123, Jan 1989. [4] R. Fitzgerald and R.F. Rashid. The integration of virtual memory management and interprocess communication in accent. ACM Transactions on Computer Systems, 4(2):147--177, May 1986. [5] J. J. Hack, R. Jakob, and D. L. Williamson. Solutions to the shallow water test set using the spectral transform method. Technical Report TN-388-STR, National Center for Atmospheric Research, Boulder, CO, 1993. [6] J. Kennington. A primal partitioning code for solving multicommodity flow problems (version 1). Technical Report IEOR-79009, Southern Methodist University, 1979. [7] J. Leon, A. L. Fisher, and P. Steenkiste. Fail-safe PVM: A portable package for distributed programming with transparent recovery. Technical Report CMU-CS-93-124, Carnegie Mellon University, February 1993. [8] C-C. J. Li and W. K. Fuchs. CATCH -- Compiler-assisted techniques for checkpointing. In 20th International Symposium on Fault Tolerant Computing, pages 74--81, 1990. [9] K. Li, J. F. Naughton, and J. S. Plank. Real-time, concurrent checkpoint for parallel programs. In Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 79--88, March 1990. [10] K. Li, J. F. Naughton, and J. S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874--879, August 1994. [11] M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In Conference Proceedings, Usenix Winter 1992 Technical Conference, pages 283--290, San Francisco, CA, January 1992. [12] D. Z. Pan and M. A. Linton. Supporting reverse execution of parallel programs. ACM SIGPLAN Notices, Workshop on Parallel and Distributed Debugging, 24(1):124--129, January 1989. [13] J. S. Plank and K. Li. Ickp --- a consistent checkpointer for multicomputers. IEEE Parallel & Distributed Technology, 2(2):62--67, Summer 1994. [14] L. M. Silva, B. Veer, and J. G. Silva. Checkpointing SPMD applications on transputer networks. In Scalable High Performance Computing Conference, pages 694--701, Knoxville, TN, May 1994. [15] W. Richard Stevens. Advanced Programming in the UNIX Environment. Addison-Wesley, Reading, Mass., 1992. [16] M. M. Theimer, K. A. Lantz, and D. R. Cheriton. Preemptable remote execution facilities for the V-system. In Tenth ACM Symposium on Operating System Principles, pages 2--11, Orchas Island Washington, December 1985. [17] T. A. Welch. A technique for high-performance data compression. IEEE Computer, 17:8--19, June 1984. [18] P. R. Wilson and T. G Moher. Demonic memory for process histories. SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 330--343, June 1989. Author Information Jim Plank is an assistant professor in the Department of Computer Science at the University of Tennessee. He received his Ph.D. degree from Princeton University in 1993. His areas of interest are checkpointing, fault-tolerance, operating systems, and architecture. Micah Beck is an assistant professor in the Department of Computer Science at the University of Tennessee. He received his Ph.D. degree from Cornell University in 1992. His areas of interest are program analysis and compilation for parallelism. Gerry Kingsley is a Ph.D. student in the Department of Computer Science at the University of Tennessee. His research interests include compiler analysis techniques, user level checkpointing, and general fault tolerance. Kai Li received his Ph.D. degree from Yale University in 1986 and is currently an associate professor of the Department of Computer Science, Princeton University. His research interests are in operating systems, computer architecture, fault tolerance, and parallel computing. He is an editor of the IEEE Transactions on Parallel and Distributed Systems, and a member of the editorial board of International Journal of Parallel Programming. The first three authors' address is : Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN 37996. Their email addresses are [plank,beck,kingsley]@cs.utk.edu. Kai Li's address is: Princeton University, 35 Olden Street, Princeton, NJ 08544-2087. He may be reached electronically at li@cs.princeton.edu. Appendix: A U I F R O % A A N p s n o u v v v u p e c r n e O g g m l r r k n r v i e i h e C T o c D m n e r k o f a i e g a h p t t r n d e a C i e t T a T l k o c a i d i p n t l m m C t i e e k s v (sec) p e t s S z (MB) MAT No Checkpointing 920.7 MAT no no no 1160.0 239.3 26.0 4.591079 21.2 11 MAT no yes no 961.3 40.7 4.4 0.569529 3.6 11 MAT no no yes 969.5 48.8 5.3 4.591079 22.4 11 MAT no yes yes 931.0 10.3 1.1 0.569529 3.4 11 MAT yes no no 963.5 42.8 4.7 0.592265 3.6 11 MAT yes yes no 970.5 49.8 5.4 0.564494 3.5 11 MAT yes no yes 930.5 9.8 1.1 0.592265 5.0 11 MAT yes yes yes 931.0 10.3 1.1 0.564494 3.3 11 SOLVE No Checkpointing 822.7 SOLVE no no no 966.5 143.8 17.5 4.617315 19.6 7 SOLVE no yes no 1000.7 178.0 21.6 4.554788 24.7 7 SOLVE no no yes 857.6 34.9 4.2 4.617315 19.6 7 SOLVE no yes yes 864.8 42.1 5.1 4.554788 24.7 7 SOLVE yes no no 836.3 13.6 1.7 0.108339 1.3 7 SOLVE yes yes no 833.3 10.6 1.3 0.040529 1.0 7 SOLVE yes no yes 830.0 7.3 0.9 0.108339 1.4 7 SOLVE yes yes yes 829.3 6.6 0.8 0.040529 1.1 7 CELL No Checkpointing 1059.9 CELL no no no 1389.3 329.4 31.1 8.457519 45.4 7 CELL no yes no 1455.4 395.5 37.3 8.443536 53.0 7 CELL no no yes 1130.3 70.4 6.6 8.457519 43.7 7 CELL no yes yes 1143.9 84.0 7.9 8.443536 50.0 7 CELL yes no no 1202.3 142.4 13.4 4.255027 19.3 7 CELL yes yes no 1210.3 150.4 14.2 3.633065 20.3 7 CELL yes no yes 1101.4 41.5 3.9 4.255027 19.3 7 CELL yes yes yes 1111.1 51.2 4.8 3.633065 20.4 7 WATER No Checkpointing 1553.9 WATER no no no 2170.4 616.5 39.7 14.170641 74.6 8 WATER no yes no 1800.3 246.4 15.9 5.227099 29.9 8 WATER no no yes 1676.1 122.2 7.9 14.170730 74.6 8 WATER no yes yes 1612.3 58.4 3.8 5.150530 28.5 8 MCNF No Checkpointing 1118.2 MCNF no no no 1681.8 563.6 50.4 24.312483 159.3 3 MCNF no yes no 1216.1 97.9 8.8 1.745483 10.7 3 MCNF no no yes 1175.2 57.0 5.1 24.312427 131.3 3 MCNF no yes yes 1125.6 7.4 0.7 1.748317 10.3 3 Table 2: Results of all checkpointing experiments