The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

Transparent Fault Tolerance for Parallel Applications on Networks of Workstations

Daniel J. Scales and Monica S. Lam
Computer Systems Laboratory
Stanford University
Stanford, CA 94305

Abstract

This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system which provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of the high communication overheads in distributed memory environments and is implemented on a variety of distributed memory platforms. Our fundamental approach to providing fault tolerance is to ensure the replication of all data on more than one workstation using the dynamic caching already provided by SAM. The replicated data is accessible to the local processor like other cached data, making access to shared data faster and potentially offsetting some of the fault tolerance overhead. In addition, our method uses information available in SAM applications on how processes access shared data to enable several optimizations which reduce the fault-tolerance overhead. We have built an implementation of our fault tolerance method in SAM for heterogeneous networks of workstations running PVM3. In this paper, we present our fault-tolerance method and describe its implementation in detail. We give performance results and overhead numbers for several large SAM applications running on a cluster of Alpha workstations connected by an ATM network. Our method is successful in providing transparent fault tolerance for parallel applications running on a network of workstations and is unique in requiring no global synchronizations and no disk operations to a reliable file server.

Author's current address: Digital Equipment Corporation Western Research Laboratory, 250 University Ave., Palo Alto, CA 94301. This research was supported in part by DARPA contract DABT6391-K-0003.

1 Introduction

Networks of workstations linked by high-speed interconnects such as ATM networks look quite attractive as platforms for running parallel applications, especially since many institutions already have numerous workstations on employees' desks or in common areas that are often idle. To date, networks of workstations have been used for only a limited range of parallel applications, because of issues of programmability and efficiency. The basic message-passing primitives provided by systems such as PVM [18] support only a very low-level programming interface and are difficult to use for programming applications with complex communication patterns.
Systems that support a global shared-memory model in software greatly ease programming of parallel applications, but can encounter efficiency problems because of the high cost of communication in workstation environments. In addition, programmers must deal with the possibility that individual processes or workstations participating in a parallel computation may fail. Workstations are often administered by their individual owners or reside in public areas. It is therefore not uncommon for workstations to be rebooted without notice. Because workstations are a shared resource, processes on a workstation may fail because they are explicitly killed by another user or exceed the resource limits of the workstation. An applications programmer using a cluster of workstations must either accept the loss in performance that results when an application run must be restarted after a failure, or attempt to deal explicitly with the possibility of failure in the application code.

SAM is a run-time system for distributed memory machines and networks of workstations that eases programming by providing a global name space and dynamic caching of shared data in software at the level of user-defined types (or objects) [15, 16]. SAM incorporates a novel design that retains many of the efficient properties of message passing and minimizes communication. The basic approach in SAM is to require the programmer to designate the way in which data will be accessed, thus allowing communication for synchronization and data access to be combined. Producer/consumer relationships are expressed by accesses to single-assignment values, and mutual exclusion constraints are represented by accesses to data items called accumulators. SAM currently runs on the CM-5, Intel Paragon, IBM SP2, and heterogeneous networks of workstations. Experience with SAM has shown that it significantly eases the programming of complex applications and allows the programmer to achieve good performance for these applications on a variety of distributed memory platforms.

This paper describes a method of providing fault tolerance in software for long-running parallel applications written using SAM on networks of workstations. Our method recovers from the common case of a small number of processors that fail by halting; it does not handle a global system failure or failures in which processors operate incorrectly. It is desirable that a fault tolerance mechanism have transparency, low overhead, portability, and scalability. By transparency, we mean that the fault-tolerance mechanism can automatically be used for applications without any additional programming by the user. The fault-tolerance mechanism should have small overhead, so that the performance benefits of running an application across many workstations are not significantly reduced by the cost of providing fault tolerance. By portability, we mean that the fault-tolerance mechanism can be used on a variety of kinds of workstations without changes to the basic system software. Finally, the mechanism should be inherently scalable and must therefore minimize the need for global communication and synchronization.

Dynamic caching of shared data is an important part of the SAM design which makes it possible to exploit the data locality in parallel applications. Our fundamental method of providing fault tolerance is to use the dynamic caching functionality already provided by SAM to ensure, at checkpoints, that all shared data is replicated on more than one host (workstation).
Our method thus provides fault tolerance without doing expensive writes to disk and does not require a common file server. Our integration of the fault tolerance method with the SAM system provides advantages of simplicity and efficiency. The implementation of fault tolerance is simplified by using the existing caching mechanism. Because the replicated data is accessible to the local host like all other cached data, the apparent overhead of the fault tolerance method may be reduced by faster application access to shared data. In addition, though our method is applicable to any shared object system with caching, it uses information available about data accesses in SAM applications to reduce the fault tolerance overhead.

Our method runs on an arbitrary collection of workstations with standard operating systems and transparently recovers from the common case of a single host failure (or a small number of them) without any user intervention. Only the process on the failed host must be restarted and redo some computation; other processes continue executing without any rollback. It also avoids global synchronizations among all the workstations. The main disadvantage of our method is that the amount of fault-tolerance overhead depends on the communication pattern of the application and can potentially be large. In addition, because of its technique of replicating data in the memory of other hosts, our method may increase the memory requirements of an application on each host.

We have implemented our method in the SAM system for heterogeneous collections of workstations, using PVM3 as the underlying message-passing layer. PVM3 provides portability to a wide variety of processor architectures and interconnects, and supports the basic functionality for detecting when process and host failures occur. Our implementation automatically provides transparent fault tolerance for all SAM applications in any such environment.

In this paper, we describe our method in detail and provide performance results for several large parallel applications running on a network of workstations and using our fault tolerance method. In the next section, we give a brief overview of the SAM system. We then describe the basic software approaches that have been taken for making parallel applications fault-tolerant. We describe the key ideas of our approach and its implementation in detail. We then provide performance results for several SAM applications running in parallel on workstations connected by an ATM network and analyze the overhead of our method for fault tolerance. Finally, we compare with other software distributed shared memory systems that have provided fault tolerance and conclude.

2 SAM Overview

In this section, we present a brief overview of SAM, mainly via several examples. SAM is implemented as a preprocessor and run-time system for C programs, but could be modified to work with FORTRAN-90 and C++ applications as well. A more detailed description is provided in [15, 16].

SAM is a software distributed shared memory system that provides a global name space for accessing shared objects on distributed memory machines. In SAM, all shared data is communicated at the level of user-defined types (objects) and is represented by either a value or an accumulator. (SAM deals only with the management and communication of shared data; data that are completely local to a process can be managed by any appropriate method.)
In Figure 1, we show several common idioms, as they would be expressed using semaphores on a shared address space machine and using SAM primitives. While the SAM primitives differ from the typical primitives on a shared memory machine, the complexity of expressing the examples using the two models is quite similar.

In the first example, mutual exclusion is required to protect updates to shared data. In SAM, an accumulator is used to represent a piece of data that is to be updated a number of times, and whose final value is independent of the order in which the updates occur. SAM automatically migrates the accumulator between processes as necessary and ensures that a process does not access the accumulator until mutual exclusion is obtained. Updates to an accumulator must be encapsulated by the SAM primitives begin_update_accum and end_update_accum. The call to begin_update_accum returns a pointer by which the accumulator can be accessed. SAM supports the idiom of chaotic computation via primitives which provide read access to a ``recent'' value of the accumulator, which is not guaranteed to be the most current value of the accumulator. SAM maintains a cache in each process of versions of accumulators that have been recently accessed and may be able to satisfy the chaotic request locally without communication.

In the second example, a consumer (right column) accesses data created by a producer (left column). In SAM, a value provides producer/consumer synchronization. Values have single-assignment semantics: a value is atomically created once its initial contents are set and is henceforth immutable. The code to create a value, which may include arbitrary updates to different components of the value, is encapsulated by the pair of primitives begin_create_value and end_create_value. Similarly, code accessing a value is encapsulated by the primitives begin_use_value and end_use_value. A process that attempts to access a value will automatically be suspended until the value is created and has been brought to the local process. Conversely, an access will succeed immediately if the value is already available in the local process, returning a pointer to the local copy.

SAM maintains copies of values fetched from remote processes in local main memory in the form of a cache. Because all values have distinct names and are immutable, there is no consistency problem associated with maintaining this cache. SAM automatically frees local copies that are not in use in an LRU fashion when the cache memory becomes filled. SAM must ensure that at least one copy of a value is maintained in the system until it can determine that no other processes will need to access the value. The SAM programmer provides this information by specifying the number of accesses to the value that will occur or by explicitly indicating when all accesses to the value have occurred.

In the third example, a consumer accesses a sequence of values created by a producer through a limited-size buffer. To avoid the memory usage problems that are associated with single-assignment values, SAM allows one value to reuse the storage of another value via the begin_rename_value primitive. This primitive provides the necessary synchronization to ensure that the producer does not reuse the storage of a value before the consumer has accessed it. In this way, imperative data objects are easily represented in SAM via a sequence of values which can all share the same storage.
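The following fragment sketches how the first two idioms might look in a SAM program. It is a minimal illustration rather than code from the paper: the begin_/end_ primitive names follow the text, but the exact signatures (integer object names, the cast on the returned pointer) are assumptions.

    #include <stdio.h>

    /* Assumed signatures for the SAM primitives named in the text. */
    extern void *begin_update_accum(int name);
    extern void  end_update_accum(int name);
    extern void *begin_create_value(int name);
    extern void  end_create_value(int name);
    extern void *begin_use_value(int name);
    extern void  end_use_value(int name);

    typedef struct { double total; int count; } Stats;

    /* Mutual exclusion: SAM migrates the accumulator to this process
     * if necessary and guarantees exclusive access between begin/end. */
    void add_sample(int stats_accum, double sample)
    {
        Stats *s = (Stats *) begin_update_accum(stats_accum);
        s->total += sample;
        s->count += 1;
        end_update_accum(stats_accum);
    }

    /* Producer: the value may be initialized piecemeal between begin
     * and end, after which it is immutable. */
    void produce(int val)
    {
        Stats *s = (Stats *) begin_create_value(val);
        s->total = 0.0;
        s->count = 0;
        end_create_value(val);
    }

    /* Consumer: blocks until the value has been created and fetched
     * into the local cache, then reads through the returned pointer. */
    void consume(int val)
    {
        Stats *s = (Stats *) begin_use_value(val);
        printf("count = %d\n", s->count);
        end_use_value(val);
    }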
SAM also allows a value to be converted to an accumulator and vice versa. The creator of a value or an accumulator must specify the data type of the new object. With the help of a preprocessor, SAM uses this type information to allocate space for, pack (for sending in a message), unpack, and free the storage of the data. The preprocessor can handle complex C data types, including structures, arrays, and types that contain pointers and therefore are not necessarily stored in one contiguous block in memory. In heterogeneous environments, SAM also handles any necessary data conversion between dissimilar machines. Data is always transmitted in units of whole objects.

SAM provides two additional mechanisms for optimizing communication, illustrated in the sketch below. First, it provides an operation for producers to push data to consumers. Second, it provides operations to do asynchronous fetches of objects, thus allowing optimizations such as prefetching in which communication is overlapped with computation or other communication. Each of these optimizations is well integrated into SAM's shared memory model.
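The push and asynchronous-fetch operations might be used as in the following sketch. The names sam_push_value and sam_fetch_value_async are placeholders; the text names the operations but not their interface.

    /* Placeholder declarations: the text specifies a push operation and
     * an asynchronous fetch, but not their actual names or signatures. */
    extern void  sam_push_value(int name, int dest_process);
    extern void  sam_fetch_value_async(int name);
    extern void *begin_use_value(int name);
    extern void  end_use_value(int name);

    void consume_two(int a, int b)
    {
        sam_fetch_value_async(b);       /* start fetching b ... */

        void *pa = begin_use_value(a);  /* ... while working on a */
        /* ... compute with pa ... */
        end_use_value(a);

        void *pb = begin_use_value(b);  /* b is likely local by now */
        /* ... compute with pb ... */
        end_use_value(b);
    }

A producer that knows its consumer could likewise call sam_push_value after finishing a create, saving the consumer a request/reply round trip.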
3 Software Approaches for Tolerating Failures

In this section, we describe some of the common software methods for tolerating failures. We consider parallel applications in which a number of processes cooperate in doing a computation. To achieve parallelism, each process typically runs on a different host (workstation), so we will assume that there is one process on each host. Each process may be described as going through a series of states during its lifetime, which is determined by its initial input parameters, the program code it is executing, and its communications with other processes.

Most software fault tolerance mechanisms operate by maintaining information about the state of each process in a parallel execution. When a process fails, the mechanism restarts the failed process (and potentially some of the other processes) in a previous state so that the parallel application can continue executing. The fault tolerance mechanism must ensure that the states of the restarted processes are consistent with each other and with any remaining processes, so that the processes execute the remainder of the parallel application correctly (i.e., in the same way as some failure-free execution of the application).

The most common mechanisms for allowing restoration of the state of a process are checkpointing and logging. A checkpointing mechanism saves essentially the entire state of the process at one point in time. A logging mechanism saves information about changes to the state of the process and may be viewed as an incremental form of checkpointing. The checkpointing or logging information is most commonly saved to disk.

Checkpointing methods for providing fault tolerance in software for parallel applications may be divided broadly into two classes: consistent checkpointing methods and independent checkpointing methods. We illustrate these methods in Figure 2, where the arrows indicate communication between processes P1, P2, and P3, and the heavy horizontal bars indicate checkpoints. Figure 2(a) shows consistent checkpointing [6, 11], in which all processes periodically synchronize and checkpoint their state simultaneously. This method requires a global synchronization that ensures that all processes are in a consistent state (in which all messages sent by one process have been received by the destination process). Then each process saves its entire state (processor registers and its entire virtual address space), typically to disk. If a failure occurs, then the computation is resumed by restoring every process to its state at the last checkpoint (as indicated by the dotted line) and restarting execution.

The advantage of this approach is that it can recover from any number of host failures, if the checkpoint information is saved to reliable, nonvolatile storage. Also, when failures are very infrequent, the fault-tolerance overhead can be made very low by making the interval between checkpoints very large. The disadvantages are that global communication between all the hosts is required during a checkpoint, and the states of all processes are rolled back to the last checkpoint during a recovery.

Independent checkpointing methods avoid global synchronizations and allow the state of each process to be checkpointed or logged independently. To ensure that a process can be restarted from its last checkpoint without affecting other processes, independent checkpointing methods generally require that each process save its state whenever it communicates with other processes. The independent checkpointing methods divide into two classes, optimistic and pessimistic methods.

In the pessimistic method [4], when a process communicates with another process, it atomically checkpoints or logs its incremental state change with the communication. It is then guaranteed that the process can be restarted from its last checkpoint or last set of logging records without requiring any retransmission of communications from other processes and without sending out a duplicate of any communication. The process can therefore be recovered without affecting any other processes. Figure 2(b) illustrates the use of pessimistic independent checkpointing to recover from the crash of a host. The advantage of this approach is that no global synchronization is required, and only the work of the failed process is rolled back. The disadvantage is that the fault-tolerance overhead depends on the pattern of communication and can be large.

In the optimistic method [9, 17], a process also checkpoints or logs information when it communicates with another process, but the checkpointing or logging is not required to occur atomically with the communication. Because of this fact, it is not guaranteed that a process can be recovered without affecting the state of other processes. Instead, it may be necessary to roll back the state of other processes which have processed information sent by the process that failed. Dependence information is maintained by all processes to determine which processes need to be rolled back during a recovery.

4 Our Approach

Our approach applies the concepts of pessimistic, independent checkpointing to the SAM system. That is, when a process fails, the recovery procedure involves only restarting that process at a previous state (on the same host or a different host). All other processes proceed normally without rollback. The failed process appears to the other processes during the failure and recovery as if it has become very slow, but otherwise does not affect their execution. However, unlike a naive independent checkpointing approach, our method does not require a checkpoint every time a process is involved in a communication; based on information available about data accesses in SAM applications, it requires checkpoints for only a subset of the communications.
In addition, we take the novel approach of checkpointing the state of a process by ensuring that its data is replicated to other processes using the caching mechanism of SAM, instead of saving its data to disk. Using information maintained by SAM, we reduce the amount of state that must be saved by replicating only the objects that have changed since the last checkpoint.

In the following sections, we describe our approach in detail. We first describe when a process must do a checkpoint and what state information must be preserved at each checkpoint. We next describe the memory management of data that has been replicated for fault-tolerance purposes. Then we give step-by-step descriptions of the checkpointing procedure and the procedure for recovering from a failure.

4.1 Determining When To Checkpoint

During a parallel execution, each application process communicates with and affects the execution of the other processes. A process restarted after a failure must interact with the other processes in the same way as if the process had never failed; otherwise, inappropriate messages sent by the restarted process will cause other processes to execute incorrectly. In a system such as SAM with a shared-memory model, one process can affect the execution of another process only by creating or modifying shared data which is accessed by the other process. A restarted process will therefore interact with other processes as if it never failed if it satisfies the following condition: any shared data produced by the original process that was communicated to other processes (and is still accessible) has the same contents in the restarted process. This condition is sufficient for ensuring a proper execution and is in general necessary, since processes may potentially access any data created or modified by the original process. Our fault tolerance method must checkpoint often enough that, at any point in the execution of a process, it can restart the process from a checkpointed state that satisfies the above condition.

The concept of a reexecutable operation is important in deciding when checkpoints are necessary. Suppose an operation is executed when a process is in a particular state and then the process fails. The operation is reexecutable if it is guaranteed to produce identical effects (on the current process and other processes) when executed from the same state in a restarted process. Any operation which involves only the local process is inherently reexecutable. However, operations that interact with sources external to the process, such as other processes or the operating system, may not be reexecutable. In parallel applications, most operations that are not reexecutable result from communication with other processes. For example, message sends in a message-passing program are not reexecutable, since the restarted process will send out a duplicate message. Similarly, message receives are not reexecutable, since the restarted process will wait for a message that the original process already received.

In the SAM system, a process interacts with other processes only by creating, modifying, or accessing shared data. The creation of a single-assignment value is reexecutable, because a value is not changed once created, and the restarted process will recreate the value with the same contents as in the original process.
Similarly, accessing a value is a reexecutable operation, because any value that was accessed by the original process is guaranteed to have the same contents when the restarted process accesses the value again. Conversely, creating or updating an accumulator is not a reexecutable operation, because the accumulator may be modified by another process between the time it is created or updated by the original process and the time when the restarted process repeats the operation.

If a process uses the results of an operation that is not reexecutable in creating or modifying a shared data object, and there is no checkpoint between that operation and the creation of the data, then we classify the current contents of the shared object as nonreproducible. If the process is restarted from a checkpoint preceding the operation that is not reexecutable, then the resumed process may produce the shared object again with different contents than during the first execution. If the original contents of the object were communicated to another process, then the resumed process may produce data which is not consistent with the data that has already been sent to another process. Hence, the condition described above requires that we cause a checkpoint to occur whenever nonreproducible data is sent to another process. With this requirement, we know that the nonreproducible data will be recreated exactly from the checkpoint information if the process crashes. Conversely, if we communicate reproducible data to another process, we do not need to do a checkpoint. If the process crashes, the data will either be restored from the last checkpoint or be recreated with exactly the same contents as before when the process reexecutes from the last checkpoint state. (The sketch below restates this rule as the send-time check it reduces to.)

An important consequence of the above discussion is that an application that operates mainly with values will have few operations that are not reexecutable. The application will therefore mainly create reproducible data, and processes will not usually have to checkpoint when sending data. In other distributed shared memory systems, any access to shared data is not a reexecutable operation, since the shared data may be modified by another process between the time of the access by the original process and the access by the restarted process. Any data that is modified is therefore not reproducible, and processes must checkpoint any time they transmit data that they modified to another process.
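In implementation terms, the rule amounts to a taint bit per object and a check on every send. The sketch below is one way this could look; the field and function names are invented for illustration and are not SAM's internal interface.

    /* Each object carries a "nonreproducible" mark, set when a
     * non-reexecutable operation (e.g., an accumulator update)
     * contributes to its contents, and cleared once a checkpoint has
     * preserved those contents. Names are invented for illustration. */
    typedef struct SamObject {
        int name;
        int nonreproducible;
        /* ... contents, type descriptor, etc. ... */
    } SamObject;

    extern void do_checkpoint(void);               /* Section 4.4 */
    extern void transmit(SamObject *obj, int dest);

    void send_object(SamObject *obj, int dest)
    {
        if (obj->nonreproducible) {
            /* The receiver may act on these contents, so a restart of
             * this process must reproduce them exactly as sent. */
            do_checkpoint();   /* clears the marks on all owned objects */
        }
        transmit(obj, dest);   /* reproducible data needs no checkpoint */
    }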
4.2 Information Preserved By a Checkpoint

The basic state of any process on a host may be divided into the shared data managed by the SAM system and the processor's private state. We can minimize the amount of state that must be preserved at each checkpoint by recognizing which data on a processor will be available from another process or can be reconstructed during a recovery.

An important observation about the shared data is that only one copy of each data object need be preserved by checkpointing, even though several hosts may be caching copies of the object. For simplicity, we therefore designate a main copy of each data object and only checkpoint the main copy of each object. If a process using a cached copy (other than the main copy) fails, then the process holding the main copy can send a copy of the object to the recovering process. (Section 4.3 describes our method of ensuring that the main copy is not freed while it might still be needed for a recovery.) In our implementation, the main copy of a value is the copy on the process that created the value. The main copy of an accumulator is the actual current contents of the accumulator, which migrates between processes. We define the owner of an accumulator or value as the process that currently holds the main copy.

The private state of a processor consists of the following:

o processor state (mainly registers)
o stacks of all active tasks in the process
o values of statically allocated global variables
o shared data objects in the process of being created or modified
o pending requests by the current process for data owned by other processes
o pending requests by other processes for data owned by the local process
o directory information about the location of shared objects owned by other processes

The first four items are the local state of the parallel application, and the remaining items are essentially the private state maintained by the SAM system on each processor. Each processor must save its private state at every checkpoint. However, pending requests for data from other processes need not be saved during a checkpoint, since the other processes can reissue the requests during the recovery process. Also, directory information on the location of shared objects owned by other processes does not have to be saved, since the owners of those objects can transmit that directory information during a recovery.

Instead of using checkpointing or logging to disk to preserve the state of a process, we achieve fault tolerance by ensuring the state is replicated in two or more processes. We accomplish the replication of the shared data objects using the caching mechanism already provided by SAM. The replicated data is therefore available for use by the local process, potentially offsetting some of the cost of the fault tolerance scheme. We will refer to a cached copy of an object used for checkpointing purposes as a checkpoint copy. We replicate the private state of a process to another process as well. We therefore entirely avoid expensive disk operations during checkpointing. We also avoid depending on the local disk of a host for holding checkpoint information and can restart a process on a new host when the old host has a permanent failure.

Much of the shared data owned by a process is likely not to change between checkpoints. We therefore keep track of which accumulators or values have been created and which accumulators have been modified since the last checkpoint, and replicate only those objects at the next checkpoint. Objects that have not changed have already been replicated in a previous checkpoint. We currently preserve the entire contents of the private state at every checkpoint; we could instead attempt to save only the incremental changes to the private state since the last checkpoint.

To guarantee that we can recover from the simultaneous or near-simultaneous failure of n hosts, we must ensure that the private state and shared data owned by a host are replicated at a minimum of n other hosts. In practice, because simultaneous failures are rare, we choose n to be 1, thereby minimizing the fault-tolerance overhead, but guaranteeing recovery only when one host failure occurs at a time. However, all of our techniques generalize immediately to the case where n is greater than one. To simplify the recovery procedure, we always replicate a particular object to a specific process which is determined directly from the name of the object. Similarly, we always replicate a process's private state to a specific process.
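A fixed, name-derived placement might look like the following. The modulus mapping is an assumption (the paper does not give the actual function); the properties that matter are that the replica's location is computable from the object's name alone and differs from the owner.

    /* Hypothetical deterministic placement of checkpoint copies. The
     * paper specifies only that the destination is computed from the
     * object's name; the particular mapping here is an assumption. */
    int checkpoint_home(int obj_name, int owner, int nhosts)
    {
        int home = obj_name % nhosts;
        if (home == owner)              /* replica must be on another host */
            home = (home + 1) % nhosts;
        return home;
    }

    /* Private state likewise goes to a fixed partner process. */
    int private_state_home(int proc, int nhosts)
    {
        return (proc + 1) % nhosts;
    }

Because the mapping is deterministic, a recovering process knows where to look for every checkpoint copy without consulting a directory.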
4.3 Memory Management of Replicated Data

Because the main copy and checkpoint copy of a data object are potentially needed to aid in a recovery, memory management of these copies must be handled specially. Normally, the SAM implementation can free the main copy when it determines (via user-provided information) that all accesses to the data have occurred. However, to allow for recovery from a failure, the SAM implementation must maintain the main copy somewhat longer, until every other process has done a checkpoint since its last access to that object. The reason is that some process which has accessed the object since its last checkpoint may crash. When the process is restarted from that checkpoint, the process will attempt to access the object. If the main copy has been freed, then that access (and hence the recovery procedure) will fail. Because the checkpoint copy is essentially a backup for the main copy (in case the process with the main copy is the one that fails), it can be freed only when the main copy is finally freed. These conditions on freeing the main and checkpoint copies are analogous to the conditions in other fault-tolerance methods on when checkpoint files and log records can be reclaimed.

We wish to avoid having to send any extra messages in order to determine when the main copy can be freed. Our approach is to mark the main copy as ``freeable'' at the point at which all accesses to it have occurred, but leave it in the shared object cache. At some point later, it will be replaced in the cache, because it is no longer being referenced. At that time, before actually freeing it, we must ensure that all processes have done a checkpoint since the copy was made freeable. The straightforward way of ensuring this fact is to send a message to each process and wait for a reply indicating that it has done a checkpoint before freeing the copy. However, we have designed a technique which avoids these extra messages in almost all cases by piggybacking timestamp information on other fault-tolerance messages (messages sent during checkpointing). The basic idea is to maintain information, in all processes, on the last known time at which each process checkpointed, and to keep this information updated by transmitting with each fault-tolerance message the time when the sender last checkpointed. However, because of the lack of a globally consistent time among the hosts, the actual method is more complex.

Our technique is as follows. Each process i maintains a ``virtual time'' counter which is incremented each time there is a checkpoint or a freeing of an object that it owns. Process i also maintains a time vector T_i of the last-known virtual times on each process. The i-th component of T_i is always the current virtual time on process i. The rest of the vector is kept relatively current by transmitting the sender's current time vector with each fault-tolerance message and using it to update the receiver's time vector. Each process i also maintains vectors C_i = (c_i1, c_i2, ..., c_i,n-1) and D_i = (d_i1, d_i2, ..., d_i,n-1). C_i is the value of the time vector T_i when the process last did a checkpoint. d_ij is the last known value of c_ji from process j and is maintained by transmitting the current value of c_ji with each fault-tolerance message from process j to process i. If the current value of c_ji is c, then process j has done a checkpoint since the time on process i was c. Therefore, if the current value of d_ij on process i is d, then process j has done a checkpoint since the time on process i was d. We can thus use the D_i vector to determine whether each process has checkpointed since a particular virtual time on process i.

Suppose that we are about to remove the main copy of an object from the cache, and it was first marked freeable at time f on process i. If all elements of D_i are greater than or equal to f, we can free the copy immediately. If not, we must delay freeing the copy and send a ``force-checkpoint'' message to each process j for which d_ij < f. If process j receives a force-checkpoint message and its value of c_ji is less than f, then it does a checkpoint, which ensures that c_ji becomes greater than or equal to f. If its value of c_ji is already greater than or equal to f, then process j does not need to checkpoint. In either case, process j replies to process i with its current values of T_j and c_ji. Process i can free the main copy when it receives replies from all the processes to which it sent force-checkpoint messages. It also informs the process holding the checkpoint copy that the checkpoint copy can be freed. We will present statistics in Section 5 showing that these force-checkpoint messages and forced checkpoints are required very infrequently in our applications.
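The eviction-time check then reduces to a scan of D_i, as in the sketch below. Data structures and messaging calls are placeholders, and retrying the free after the replies arrive is left out.

    /* Sketch of the lazy freeing check, on process `self`. D[j] holds
     * d_ij, the latest value of c_ji received from process j. All
     * declarations are placeholders for illustration. */
    extern int  D[];                            /* indexed by process id */
    extern void send_force_checkpoint(int j);   /* j replies with T_j, c_ji */

    /* Called when a main copy marked freeable at virtual time f is
     * about to be evicted. Returns 1 if it can be freed now; otherwise
     * force-checkpoint messages go out and the free is retried once
     * all replies have arrived. */
    int try_free_main_copy(int f, int self, int nhosts)
    {
        int pending = 0;
        for (int j = 0; j < nhosts; j++) {
            if (j == self)
                continue;
            if (D[j] < f) {       /* j may not have checkpointed since f */
                send_force_checkpoint(j);
                pending++;
            }
        }
        return pending == 0;
    }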
4.4 Checkpointing Procedure

We can now describe the process of checkpointing in detail. Checkpointing is a transaction in which we atomically save the private state of the process, replicate all the necessary data, and execute the operation that originally started the checkpoint. The transaction is accomplished by sending checkpoint copies to other processes initially in an ``inactive'' state, which makes the data inaccessible until the checkpoint commits. The steps of a checkpoint are as follows:

o Send a copy of the private state to a designated process.
o For each nonreproducible data object owned by the current process which has not yet been checkpointed, send a checkpoint copy to a designated process in an ``inactive'' state.
o For each reproducible object owned by the current process which has not yet been checkpointed, send a checkpoint copy to a designated process (in an active state).
o Send the requested object (that caused the checkpoint) to the requesting process in an inactive state, if it has not yet been sent to that process as a checkpoint copy.
o Wait for acknowledgements from the processes that received the private state or any inactive objects. Then ``activate'' all of the inactive objects by sending out a message to all of the processes with inactive objects.

While the local process is doing a checkpoint, it continues serving incoming requests for reproducible data, receiving data it has requested from other processes, and receiving replicated data sent by another process that is checkpointing. However, to ensure that each checkpoint is consistent, a process delays serving requests for nonreproducible data and responding to ``activate'' messages from other processes that have just completed a checkpoint. After the checkpoint has been committed, all of these delayed operations are executed, possibly causing another checkpoint in the process.
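Put together, the transaction might look like the following, continuing the placeholder names from the earlier sketches (the iterator and messaging calls are likewise invented):

    /* Placeholders continuing the earlier sketches. */
    extern int  self, nhosts;
    extern int  checkpoint_home(int obj_name, int owner, int nhosts);
    extern int  private_state_home(int proc, int nhosts);
    extern void send_private_state(int dest);
    extern SamObject *next_unreplicated(void);   /* objects changed since
                                                    the last checkpoint */
    extern void send_checkpoint_copy(SamObject *obj, int dest, int inactive);
    extern void wait_for_acks(void);
    extern void activate_inactive_copies(void);

    void do_checkpoint(void)
    {
        SamObject *obj;

        /* Replicate the private state (registers, stacks, globals, ...). */
        send_private_state(private_state_home(self, nhosts));

        /* Replicate owned objects changed since the last checkpoint.
         * Nonreproducible data goes out inactive so it cannot be used
         * before the checkpoint commits. */
        while ((obj = next_unreplicated()) != NULL) {
            int dest = checkpoint_home(obj->name, self, nhosts);
            send_checkpoint_copy(obj, dest, obj->nonreproducible);
            obj->nonreproducible = 0;      /* now preserved */
        }

        /* Commit: once the private state and all inactive copies are
         * acknowledged, activate the inactive copies. */
        wait_for_acks();
        activate_inactive_copies();
    }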
4.5 Recovering from a Failure

The goal of the recovery procedure is to restore the necessary state of the failed process for the SAM application to continue normally, and to restore the necessary checkpoint information for the system to tolerate further failures. The state that must be restored includes all of the private state of the process, and all shared objects that were in use at the last checkpoint or whose main copy was on the failed process. The checkpoint information that must be restored consists of all the checkpoint copies that were maintained on the failed process.

Our system depends on the underlying message-passing layer to detect the failure of a process or host. Our current implementation runs on PVM3, which provides a notification message to surviving processes when another process fails. The recovery procedure when process p fails is as follows:

o One or more processes get a notification that process p failed. They send a message to a distinguished process d (process 0, or process 1 if process 0 is the process that failed) to coordinate the recovery.
o Process d receives the failure message and starts the recovery process, ignoring further messages indicating that process p has failed.
o Process d chooses a host on which to restart the failed process. This can be the same host, if it is still running or has been restarted, or a completely different host. Process d starts up a process in recovery mode on the new host.
o Process d sends a message to all other processes indicating that a recovery is in progress and including the PVM id of the new process. The new process will not receive any messages sent to the old process, since it has a different PVM id.
o The process holding the copy of the private state of the failed process sends that state to the new process.
o Each process besides the failed process aborts and restarts any checkpoint it has started that involves process p (since process p will have lost all information about that partial checkpoint).
o Each process besides the failed process also sends copies of data to the new process as follows:
  -- If a process has a checkpoint copy of a data object whose main copy was at process p, it sends a copy of the data to process p, which will again hold the main copy.
  -- If a process holds the main copy of a data object that was being accessed at the time of the last checkpoint by process p, it sends a copy of the data to process p.
  -- If a process holds the main copy of an object whose checkpoint copy was on process p, it sends a checkpoint copy to process p.
  -- If a process holds the main copy of a data item and the directory information for that data was on process p, the process informs process p that it holds the main copy.
o Each process reissues any requests for data that have been made to the failed process and have not yet been fulfilled.
o When the recovering process has received all of the data from other processes, it resumes its computation with its restored registers, stacks, etc.
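The coordinator's side of this protocol is small; a sketch follows, with all names invented, and with the duplicate-notification filtering shown corresponding to the behavior described above.

    /* Coordinator (process d) reaction to a failure notification.
     * All function names are placeholders for illustration. */
    extern int  choose_recovery_host(int failed);    /* same or new host */
    extern int  spawn_in_recovery_mode(int host);    /* returns new PVM id */
    extern void announce_recovery(int failed, int new_id);

    void on_failure_notification(int failed)
    {
        static int recovering = 0;
        if (recovering)          /* ignore duplicate notifications for p */
            return;
        recovering = 1;

        int host   = choose_recovery_host(failed);
        int new_id = spawn_in_recovery_mode(host);

        /* Survivors react as listed above: abort partial checkpoints
         * involving p, resend main/checkpoint copies and directory
         * information, and reissue unfulfilled requests. */
        announce_recovery(failed, new_id);
    }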
5 Performance Results

We have implemented our fault tolerance method for SAM applications running on clusters of workstations using PVM3. When linked with our fault-tolerant SAM library, a parallel SAM application recovers transparently when we kill one of the processes involved in the computation. In this section we give performance results for three long-running applications--GPS, Water, and Barnes-Hut--executing on a cluster of eight 225 MHz Alpha workstations connected by an ATM network. The workstations are connected by a 155 Mbit/sec AN2 ATM network [1, 19] developed at DEC SRC, and use a version of PVM3 that is highly tuned for the AN2 network. The maximum achievable bandwidth under PVM3 is about 14.6 Mbytes/sec, and the latency for sending a message from one host to another is about 90 microseconds.

The timings for different runs of the applications are very consistent; the numbers below are from averaging the results for several runs, excluding the initialization phases, when there were no other users of the cluster. For each of the applications below, recovery from a failed process only takes on the order of a few seconds.

GPS is an application which attempts to determine a useful formula that predicts the degree of exposure to solvent of amino acids, via a technique called genetic programming [7]. Genetic programming applications attempt to evolve useful formulas by emulating evolution and typically require large amounts of computer time. They build an initial population of individuals, which are candidate formulas, and then evolve the population over many generations using techniques analogous to genetic recombination, mutation, and survival of the fittest.

Figure 3 gives the speedup curves for the evolution of a population of 1000 individuals, both when not using fault tolerance (labeled GPS) and when using fault tolerance (labeled GPS-FT). The first three rows in the table to the right give the percent change in performance when fault tolerance is used in the runs, the average number of checkpoints executed on each processor per second, and the percentage of sends of shared objects that cause checkpoints (equivalently, the percentage of sends of nonreproducible data). The low percentages indicate that many checkpoints are avoided and that the fault-tolerance overhead is much lower than it would be for systems which must assume that all data accesses are not reexecutable. The next two rows give the average number of ``force-checkpoint'' messages sent out on each processor per second and the average number of forced checkpoints on each processor per second. These operations, which are explained in Section 4.3, are never required for GPS runs. The final row gives the average miss rate on shared data both without and with fault tolerance. The miss rate is actually lower with fault tolerance because the replication of data during checkpoints can eliminate later misses to that data.

GPS is a relatively coarse-grained application, since there is much computation per individual in determining its fitness. In addition, the individuals of the population are evenly distributed across the processes, so the overhead of checkpointing the individuals is distributed well across processes. The fault-tolerance overhead is therefore quite low. GPS actually runs slightly faster on two hosts when using fault tolerance, because all the individuals created by one host are replicated to the other host and hence are immediately available when that host wants them.

The Water application evaluates forces and potentials in a system of water molecules in the liquid state. Water is derived from the Perfect Club benchmark MDG [3] and performs the same computation. Water is implemented using Jade [14], which is a parallel language implemented entirely in SAM. Since we have added fault tolerance to the SAM system, all Jade applications are also automatically fault-tolerant. Figure 4 gives speedup curves for the simulation of 1728 particles, both when not using fault tolerance and when using fault tolerance, with the same statistics as before. In Water, the main process collects all the data at each time step, so it must checkpoint most of the shared state of the system, which is quite large.
The main process can therefore become a bottleneck with increasing numbers of processors, but the actual overheads remain quite small. Because of Jade's load-balancing functionality, the distribution of tasks to processors involves an operation which is not reexecutable on the receiving process. Since tasks cause checkpoints only upon completion, when they communicate their results, all data produced by these tasks is considered nonreproducible. Some force-checkpoint messages and forced checkpoints occur in Water, but the number is very small. Again, the miss rate is actually lower with fault tolerance because the replication of objects during checkpoints eliminates some misses to these objects.

Barnes-Hut [2] is an application that simulates the evolution of an n-body system, using a tree data structure to compute the forces between bodies in O(n log n) time. The SAM version of the code is based on the original serial version of the code and is written in a shared-memory style. All of the processes participate in the building and processing of the distributed tree data structure, and communication between processes is much more fine-grained than in GPS and Water. The force computation is partitioned across processes so that each process has extensive locality in accessing the distributed tree, which is exploited by the dynamic caching provided by SAM.

Figure 5 gives speedup curves for a simulation of 8000 bodies. There is significant software overhead in implementing the Barnes-Hut algorithm in a shared-memory style on distributed memory machines, but this overhead only reduces the overall performance by a fixed percentage, and performance scales well. Because of the fine-grained nature of the application and the large tree data structure that is built, the fault-tolerance overhead is quite high. The communication in building and modifying the tree causes many checkpoints, and the checkpointing process adds significantly to the latency of communication. If the Barnes-Hut application did not have strong locality, which reduces the amount of fetching of remote data, the number of checkpoints would be even higher. The number of checkpoints is also greatly reduced by the high percentage of sends of reproducible data.

Two conflicting factors affect the shared data miss rate when fault tolerance is used. As described above, the miss rate can be reduced if processors which have received checkpoint copies of an object then access that object. However, because of the extra checkpoint copies, the shared data cache may be utilized less efficiently for other objects. Because of these conflicting effects, the shared data miss rate decreases for some runs when fault tolerance is used, but increases for others.

Our performance numbers show that our fault tolerance method has low overhead and is quite suitable for the coarse-grained applications that would typically be run on networks of workstations. The overhead of our method is larger for the finer-grained Barnes-Hut application, but the parallel performance still scales well despite the overhead.

6 Related Work

In this section, we compare with recent work that has provided fault tolerance in the context of software distributed shared memory. Like our work, Kaashoek et al. [10] implement fault tolerance for a shared object system, called Orca. However, the methods for achieving fault tolerance are completely different. Kaashoek et al. use global checkpoints at periodic intervals to ensure that the system can be restarted from a previous state if there is a host failure.
Checkpointing the state of the entire system consistently is fairly simple, because their implementation of Orca is based on reliable ordered broadcasting of all messages. Like our system, they provide fault tolerance transparently, without any extra effort by the Orca programmer. Because they use global checkpointing, they can recover from failures of many hosts simultaneously (assuming that none of the disks used for checkpointing have permanent failures). However, their fault tolerance method depends completely on their use of reliable ordered broadcast, whose performance does not scale well for large numbers of hosts or networks with variable delays.

MOM [5] is a fault-tolerant implementation of a Linda-like programming model. The programmer uses the basic operations of Linda, but must sometimes specify additional information that is necessary only to support fault tolerance. In addition, the programmer is constrained to a model of parallel computation in which each task reads some input tuples and produces some output tuples, but otherwise does not interact with other tasks. Xu and Liskov [21] describe a design for a distributed implementation of Linda in which the tuple space is made highly available by replicating it across several processors. However, they do not consider the problem of restarting Linda tasks that were running on hosts that have failed.

Wu and Fuchs [20] describe techniques for making a page-based shared virtual memory system fault-tolerant. Like our work, Wu and Fuchs implement asynchronous pessimistic checkpointing by forcing a checkpoint whenever a host sends a page to another host in response to a request. They propose using two copies of each virtual memory page to allow checkpointing of pages to disk to proceed quickly. Janssens and Fuchs [8] apply this method to a DSM system with a relaxed memory consistency model. The expected checkpointing overhead is greatly reduced because a host need only checkpoint when it holds a synchronization variable that another host attempts to acquire. Richard and Singhal [13] describe a related approach in which a processor logs to disk the pages that it has accessed via read operations whenever it sends a modified page to another processor, instead of doing a full checkpoint. A processor must periodically checkpoint (independently of other processors) in order to reduce the size of the log.

Neves et al. [12] describe a method of providing fault tolerance for an entry-consistent DSM system called DiSOM. Each process logs the contents of an object in its own local volatile memory whenever it releases a write lock on the object. If another process acquires the object and later fails, then the data can be recovered from the process that holds the log entry. This technique is similar to our approach of not freeing data objects until we can guarantee that all other processes have checkpointed since their last access to the data. Each process must checkpoint to disk frequently to reclaim the memory used by the log entries.

Of the methods described above, only those associated with Orca and MOM have been implemented.

7 Conclusion

We have described a method of providing transparent fault tolerance for long-running parallel applications on networks of workstations, in the context of the SAM shared object system. Our fundamental approach is to use the dynamic caching functionality provided by SAM to ensure the replication of all data on more than one host at checkpoints.
Our method thus avoids expensive writes to disk and does not require a common file server. Each process checkpoints independently, and only when sending data which is not reproducible to another process. The SAM design, which provides extra information on how data is accessed, is quite useful in reducing the number of checkpoints. Our method works on heterogeneous clusters of workstations (which have a common message-passing layer such as PVM3), without changes to the system software and without doing any global synchronizations among the hosts. When there is a failure, only the work of the failed process must be redone from the last checkpoint; other processes continue without rollback. Interestingly, in a task-based application, checkpoints naturally occur at task boundaries when the results of tasks are transferred between processes, thus ensuring that only the work of the currently executing task must be redone during a recovery.

We are investigating further optimizations to reduce the checkpoint overhead, such as using a more sophisticated scheme for determining where to replicate data or using available information to decrease the amount of state saved during a checkpoint. It is interesting to note that our method is potentially useful for fast process migration. We can move a process in a parallel application off a particular host very quickly by killing it and restarting it on another host from its last checkpoint. Such fast process migration is useful when a parallel application is running on idle workstations that are on individuals' desks and an owner begins to use one of the workstations again.

We have implemented our method in SAM for workstation clusters using PVM3. Our method successfully recovers from process and host failures during long-running SAM applications. The amount of overhead is dependent on the communication patterns of an application and may be significant for applications with finer-grain communication, such as Barnes-Hut. For the coarse-grain applications that are more likely to be run on networks of workstations, the overhead is quite low.

Acknowledgments

We thank Chandu Thekkath, Mike Burrows, and DEC SRC for allowing us to use the AN2 cluster and for providing a fast version of PVM3 for the cluster. We thank Simon Handley for the use of his GPS code and Martin Rinard for providing the Water code. We also thank the referees for comments that helped improve the paper.

Availability: The SAM system is available via anonymous FTP at suif.stanford.edu in /pub/sam and via the World Wide Web at https://suif.stanford.edu.

References

[1] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker. High-Speed Switch Scheduling for Local-Area Networks. ACM Transactions on Computer Systems, 11(4):319--352, Nov. 1993.

[2] J. E. Barnes and P. Hut. A Hierarchical O(N log N) Force Calculation Algorithm. Nature, 324(6096):446--449, Dec. 1986.

[3] M. Berry et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers. International Journal of Supercomputer Applications, 3(3):5--40, 1989.

[4] A. Borg, J. Baumbach, and S. Glazer. A Message Passing System Supporting Fault Tolerance. In Proceedings of the ACM SIGOPS Symposium on Operating System Principles, pages 90--99, Oct. 1983.

[5] S. R. Cannon. Adding Fault-Tolerant Transaction Processing to Linda. Software -- Practice and Experience, 24(5):449--466, May 1994.

[6] K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63--75, Feb. 1985.
[7] S. Handley. The Prediction of the Degree of Exposure to Solvent of Amino Acid Residues via Genetic Programming. In Second International Conference on Intelligent Systems for Molecular Biology, pages 156--160, Stanford, CA, 1994.

[8] B. Janssens and W. K. Fuchs. Relaxing Consistency in Recoverable Distributed Shared Memory. In Proceedings of the Twenty-Third International Symposium on Fault-Tolerant Computing, pages 155--163, July 1993.

[9] D. B. Johnson and W. Zwaenepoel. Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, pages 171--181, Aug. 1988.

[10] M. F. Kaashoek, R. Michiels, H. E. Bal, and A. S. Tanenbaum. Transparent Fault-Tolerance in Parallel Orca Programs. In Symposium on Experiences with Distributed and Multiprocessor Systems, pages 297--311, Mar. 1992.

[11] K. Li, J. F. Naughton, and J. S. Plank. Checkpointing Multicomputer Applications. In Proceedings of the Tenth Symposium on Reliable Distributed Systems, pages 2--11, Sept. 1991.

[12] N. Neves, M. Castro, and P. Guedes. A Checkpoint Protocol for an Entry Consistent Shared Memory System. In Proceedings of the Thirteenth ACM Symposium on Principles of Distributed Computing, pages 121--129, Aug. 1994.

[13] G. G. Richard III and M. Singhal. Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory. In Proceedings of the Twelfth Symposium on Reliable Distributed Systems, pages 58--67, 1993.

[14] M. C. Rinard, D. J. Scales, and M. S. Lam. Heterogeneous Parallel Programming in Jade. In Proceedings of Supercomputing '92, pages 245--256, Nov. 1992.

[15] D. J. Scales and M. S. Lam. An Efficient Shared Memory System for Distributed Memory Machines. Technical Report CSL-TR-94-627, Computer Systems Laboratory, Stanford University, July 1994.

[16] D. J. Scales and M. S. Lam. The Design and Evaluation of a Shared Object System for Distributed Memory Machines. In Proceedings of the First Symposium on Operating System Design and Implementation, pages 101--114, Nov. 1994.

[17] R. E. Strom and S. Yemini. Optimistic Recovery in Distributed Systems. ACM Transactions on Computer Systems, 3(3):204--226, Aug. 1985.

[18] V. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315--339, Dec. 1990.

[19] C. P. Thacker and M. D. Schroeder. AN2 Switch Overview. In preparation.

[20] K.-L. Wu and W. K. Fuchs. Recoverable Distributed Shared Virtual Memory. IEEE Transactions on Computers, 39(4):460--469, Apr. 1990.

[21] A. S. Xu and B. Liskov. A Design for a Fault-Tolerant, Distributed Implementation of Linda. In The Nineteenth International Symposium on Fault-Tolerant Computing, pages 199--206, June 1989.

Author Information

Dan Scales is a researcher at Digital Equipment Corporation's Western Research Laboratory. His research interests include parallel languages and runtime systems, distributed systems, and software development environments. He received a BS in electrical engineering and computer science from Princeton University and a PhD in computer science from Stanford University in 1995. He may be contacted at scales@pa.dec.com.

Monica Lam is an associate professor in the Computer Science Department at Stanford University. Her research interests are in compilers, computer architectures, and parallel computing, and she currently leads the SUIF (Stanford University Intermediate Format) parallelizing compiler research project at Stanford.
She received a BS degree from the University of British Columbia and a PhD in computer science from Carnegie Mellon University in 1987. She is the recipient of a 1992 National Science Foundation Young Investigator Award. She may be contacted at lam@cs.stanford.edu.