TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems Pete Keleher, Alan L. Cox, Sandhya Dwarkadas and Willy Zwaenepoel Department of Computer Science Rice University Houston, TX 77251-1892 TreadMarks is a distributed shared memory (DSM) system for standard Unix systems such as SunOS and Ultrix. This paper presents a performance evaluation of TreadMarks running on Ultrix using DECstation-5000/240's that are connected by a 100-Mbps switch-based ATM LAN and a 10-Mbps Ethernet. Our objective is to determine the efficiency of a user-level DSM implementation on commercially available workstations and operating systems. We achieved good speedups on the 8-processor ATM network for Jacobi (7.4), TSP (7.2), Quicksort (6.3), and ILINK (5.7). For a slightly modified version of Water from the SPLASH benchmark suite, we achieved only moderate speedups (4.0) due to the high communication and synchronization rate. Speedups decline on the 10-Mbps Ethernet (5.5 for Jacobi, 6.5 for TSP, 4.2 for Quicksort, 5.1 for ILINK, and 2.1 for Water), reflecting the bandwidth limitations of the Ethernet. These results support the contention that, with suitable networking technology, DSM is a viable technique for parallel computation on clusters of workstations. To achieve these speedups, TreadMarks goes to great lengths to reduce the amount of communication performed to maintain memory consistency. It uses a lazy implementation of release consistency, and it allows multiple concurrent writers to modify a page, reducing the impact of false sharing. Great care was taken to minimize communication overhead. In particular, on the ATM network, we used a standard low-level protocol, AAL3/4, bypassing the TCP/IP protocol stack. Unix communication overhead, however, remains the main obstacle in the way of better performance for programs like Water. Compared to the Unix communication overhead, memory management cost (both kernel and user level) is small and wire time is negligible. This research was supported in part by the National Science Foundation under Grants CCR-9116343, CCR-9211004, CDA-9222911, and CDA-9310073, by the Texas Advanced Technology Program under Grant 003604014, and by a NASA Graduate Fellowship. Introduction With increasing frequency, networks of workstations are being used as parallel computers. High-speed general-purpose networks and very powerful workstation processors have narrowed the performance gap between workstation clusters and supercomputers. Furthermore, the workstation approach provides a relatively low-cost, low-risk entry into the parallel computing arena. Many organizations already have an installed workstation base, no special hardware is required to use this facility as a parallel computer, and the resulting system can be easily maintained, extended and upgraded. We expect that the workstation cluster approach to parallel computing will gain further popularity, as advances in networking continue to improve its cost/performance ratio. Various software systems have been proposed and built to support parallel computation on workstation networks, e.g., tuple spaces, distributed shared memory, and message passing. TreadMarks is a distributed shared memory (DSM) system. DSM enables processes on different machines to share memory, even though the machines physically do not share memory (see Figure X). This approach is attractive since most programmers find it easier to use than a message passing paradigm, which requires them to explicitly partition data and manage communication. With a global address space, the programmer can focus on algorithmic development rather than on managing partitioned data sets and communicating values. Many DSM implementations have been reported in the literature (see for an overview). Unfortunately, none of these implementations are widely available. Many run on in-house research platforms, rather than on generally available operating systems, or require kernel modifications that make them unappealing. Early DSM systems also suffered from performance problems. These early designs implemented the shared memory abstraction by imitating consistency protocols used by hardware shared memory multiprocessors. Given the large consistency units in DSM (virtual memory pages), false sharing was a serious problem for many applications. TreadMarks overcomes most of these problems: it is an efficient DSM system that runs on commonly available Unix systems. This paper reports on an implementation on Ultrix using 8 DECStation-5000/240s, connected both by a 100-Mbps point-to-point ATM LAN and by a 10-Mbps Ethernet. The system has also been implemented on SunOS using SPARCstation-1's and -2's connected by a 10-Mbps Ethernet. The implementation is done at the user level, without modification to the operating system kernel. Furthermore, we do not rely on any particular compiler. Instead, our implementation relies on (user-level) memory management techniques to detect accesses and updates to shared data. In order to address the performance problems with earlier DSM systems, the TreadMarks implementation focuses on reducing the amount of communication necessary to keep the distributed memories consistent. It uses a lazy implementation of release consistency and multiple-writer protocols to reduce the impact of false sharing. On the 100-Mbps ATM LAN, good speedups were achieved for Jacobi, TSP, Quicksort, and ILINK (a program from the genetic LINKAGE package). TreadMarks achieved only a moderate speedup for a slightly modified version of the Water program from the SPLASH benchmark suite, because of the high synchronization and communication rates. We present a detailed decomposition of the overheads. For the applications measured, the software communication overhead is the bottleneck in achieving high performance for finer grained applications like Water. This is the case even when using a low-level adaptation layer protocol (AAL3/4) on the ATM network, bypassing the TCP/IP protocol stack. The communication overhead dominates the memory management and consistency overhead. On a 100-Mbps ATM LAN, the ``wire'' time is all but negligible. The outline of the rest of this paper is as follows. Section X focuses on the principal design decisions: release consistency, lazy release consistency, multiple-writer protocols, and lazy diff creation. Section X describes the implementation of these concepts, and also includes a discussion of the Unix aspects of the implementation. The resulting performance is discussed in Section X, and compared against earlier work using eager release consistency in Section X. We discuss related work in Section X, and conclude in Section X. Design TreadMarks' design focuses on reducing the amount of communication necessary to maintain memory consistency. To this end, it presents a release consistent memory model to the user. Release consistency requires less communication than conventional, sequentially consistent shared memory, but provides a very similar programming interface. The lazy implementation of release consistency in TreadMarks further reduces the number of messages and the amount of data compared to earlier, eager implementations. False sharing is another source of frequent communication in DSM systems. TreadMarks uses multiple-writer protocols to address this problem. Multiple-writer protocols require the creation of diffs, data structures that record updates to parts of a page. With lazy release consistency, diff creation can often be postponed or avoided, a technique we refer to as lazy diff creation. Release Consistency Release consistency (RC is a relaxed memory consistency model that permits a processor to delay making its changes to shared data visible to other processors until certain synchronization accesses occur. Shared memory accesses are categorized either as ordinary or as synchronization accesses, with the latter category further divided into acquire and release accesses. Acquires and releases roughly correspond to synchronization operations on a lock, but other synchronization mechanisms can be implemented on top of this model as well. For instance, arrival at a barrier can be modeled as a release, and departure from a barrier as an acquire. Essentially, RC requires ordinary shared memory updates by a processor $p$ to become visible at another processor $q$, only when a subsequent release by $p$ becomes visible at $q$. In contrast, in sequentially consistent (SC) memory, the conventional model implemented by most snoopy-cache, bus-based multiprocessors, modifications to shared memory must become visible to other processors immediately. Programs written for SC memory produce the same results on an RC memory, provided that (i) all synchronization operations use system-supplied primitives, and (ii) there is a release-acquire pair between conflicting ordinary accesses to the same memory location on different processors. In practice, most shared memory programs require little or no modifications to meet these requirements. Although execution on an RC memory produces the same results as on a SC memory for the overwhelming majority of the programs, RC can be implemented more efficiently than SC. In the latter, the requirement that shared memory updates become visible immediately implies communication on each write to a shared data item for which other cached copies exist. No such requirement exists under RC. The propagation of the modifications can be postponed until the next synchronization operation takes effect. Lazy Release Consistency In lazy release consistency (LRC, the propagation of modifications is postponed until the time of the acquire. At this time, the acquiring processor determines which modifications it needs to see according to the definition of RC. To do so, LRC divides the execution of each process into intervals, each denoted by an interval index. Every time a process executes a release or an acquire, a new interval begins and the interval index is incremented. Intervals of different processes are partially ordered: (i) intervals on a single processor are totally ordered by program order, and (ii) an interval on processor $p$ precedes an interval on processor $q$ if the interval of $q$ begins with the acquire corresponding to the release that concluded the interval of $p$. This partial order can be represented concisely by assigning a vector timestamp to each interval. A vector timestamp contains an entry for each processor. The entry for processor $p$ in the vector timestamp of interval $i$ of processor $p$ is equal to $i$. The entry for processor $q \neq p$ denotes the most recent interval of processor $q$ that precedes the current interval of processor $p$ according to the partial order. A processor computes a new vector timestamp at an acquire according to the pair-wise maximum of its previous vector timestamp and the releaser's vector timestamp. RC requires that before a processor $p$ may continue past an acquire, the updates of all intervals with a smaller vector timestamp than $p$'s current vector timestamp must be visible at $p$. Therefore, at an acquire, $p$ sends its current vector timestamp to the previous releaser, $q$. Processor $q$ then piggybacks on the release-acquire message to $p$, write notices for all intervals named in $q$'s current vector timestamp but not in the vector timestamp it received from $p$. A write notice is an indication that a page has been modified in a particular interval, but it does not contain the actual modifications. The timing of the actual data movement depends on whether an invalidate, an update, or a hybrid protocol is used (see). TreadMarks currently uses an invalidate protocol: the arrival of a write notice for a page causes the processor to invalidate its copy of that page. A subsequent access to that page causes an access miss, at which time the modifications are propagated to the local copy. Alternative implementations of RC generally cause more communication than LRC. For example, the DASH shared-memory multiprocessor implements RC in hardware, buffering writes to avoid blocking the processor until the write has been performed with respect to main memory and remote caches. A subsequent release is not allowed to perform (i.e., the corresponding lock cannot be granted to another processor) until all outstanding shared writes are acknowledged. While this strategy masks latency, LRC sends far fewer messages, an important consideration in a software implementation on a general-purpose network because of the high per message cost. In an eager software implementation of RC, a processor propagates its modifications of shared data when it executes a release. This approach also leads to more communication, because it requires a message to be sent to all processors that cache the modified data, while LRC propagates the data only to the next acquirer. Multiple-Writer Protocols False sharing was a serious problem for early DSM systems. It occurs when two or more processors access different variables within a page, with at least one of the accesses being a write. Under the common single-writer protocols, false sharing leads to unnecessary communication. A write to any variable of a page causes the entire page to become invalid on all other processors that cache the page. A subsequent access on any of these processors incurs an access miss and causes the modified copy to be brought in over the network, although the original copy of the page would have sufficed, since the write was to a variable different from the one that was accessed locally. This problem occurs in snoopy-cache multiprocessors as well, but it is more prevalent in software DSM because the consistency protocol operates on pages rather than smaller cache blocks. To address this problem, Munin introduced a multiple-writer protocol. With multiple-writer protocols two or more processors can simultaneously modify their local copy of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effect of false sharing. Lazy Diff Creation In order to capture the modifications to a shared page, it is initially write-protected. At the first write, a protection violation occurs. The DSM software makes a copy of the page (a twin), and removes the write protection so that further writes to the page can occur without any DSM intervention. The twin and the current copy can later be compared to create a diff, a runlength encoded record of the modifications to the page. In TreadMarks, diffs are only created when a processor requests the modifications to a page or a write notice from another processor arrives for that page. In the latter case, it is essential to make a diff in order to distinguish the modifications made by the different processors. This lazy diff creation is distinct from Munin's implementation of multiple-writer protocols, where at each release a diff is created for each modified page and propagated to all other copies of the page. The lazy implementation of RC used by TreadMarks allows diff creation to be postponed until the modifications are requested. Lazy diff creation results in a decrease in the number of diffs created (see Section X) and an attendant improvement in performance. Implementation Data Structures Figure X gives an overview of the data structures used. The principal data structures are the PageArray, with one entry for each shared page, the ProcArray, with one entry for each processor, a set of interval records (containing mainly the vector timestamp for that interval), a set of write notice records, and a diff pool. Each entry in the PageArray contains: The current state: no access, read-only access, or read-write access. An approximate copyset specifying the set of processors that are believed to currently cache this page. For each page, an array indexed by processor of head and tail pointers to a linked list of write notice records corresponding to write notices received from that processor for this page. If the diff corresponding to the write notice has been received, then a pointer to this diff is present in the write notice record. This list is maintained in order of decreasing interval indices. Each entry in ProcArray contains a pointer to the head and the tail of a doubly linked list of interval records, representing the intervals of that processor that the local processor knows about. This list is also maintained in order of decreasing interval indices. Each of these interval records contains a pointer to a list of write notice records for that interval, and each write notice record contains a pointer to its interval record. Interval and Diff Creation Logically, a new interval begins at each release and acquire. In practice, interval creation can be postponed until we communicate with another process, avoiding overhead if a lock is reacquired by the same processor. When a lock is released to another processor, or at arrival at a barrier, a new interval is created containing a write notice for each page that was twinned since the last remote synchronization operation. With lazy diff creation these pages remain writable until a diff request or a write notice arrives for that page. At that time, the actual diff is created, the page is read protected, and the twin is discarded. A subsequent write results in a write notice for the next interval. Locks All locks have a statically assigned manager. Lock management is assigned in a round-robin fashion among the processors. The manager records which processor has most recently requested the lock. All lock acquire requests are directed to the manager, and, if necessary, forwarded to the processor that last requested the lock. The lock acquire request contains the current vector timestamp of the acquiring processor. The lock request arrives at the processor that either holds the lock or did the last release on it, possibly after forwarding by the lock manager. When the lock is released, the releaser ``informs'' the acquirer of all intervals between the vector timestamp in the acquirer's lock request message, and the releaser's current vector timestamp. The message contains the following information for each of these intervals: The processor id. The vector timestamp. All write-notices. The write notice in the message is a fixed 16-bit entry containing the page number. All of this information can easily be derived by following the pointers from the ProcArray to the appropriate interval records and from there to the appropriate write notice records. After receiving this message, the acquirer ``incorporates'' this information into its data structures. For each interval in the message, the acquirer appends an interval record to the interval record list for that processor, and for each write notice it prepends a write notice record to the page's write notice record list, and adds pointers from the write notice record to the interval record, and vice versa. Incorporating this information invalidates the pages for which write notices were received. Barriers Barriers have a centralized manager. At barrier arrival, each client ``informs'' the barrier manager of its vector timestamp and all of the client's intervals between the last vector timestamp of the manager that the client is aware of (found at the head of the interval record list for the ProcArray entry for the manager) and the client's current vector timestamp. When the manager arrives at the barrier, it ``incorporates'' these intervals into its data structures. When all barrier arrival messages have been received, the manager then ``informs'' all clients of all intervals between their vector timestamp, as received in their barrier arrival message, and the manager's current vector timestamp. The clients then ``incorporate'' this information as before. As for locks, incorporating this information invalidates the pages for which write notices were received. Access Misses If the faulting processor does not have a copy of the page, it requests a copy from a member of the page's approximate copyset. The approximate copyset for each page is initialized to contain processor $0$. If write notices are present for the page, the faulting processor obtains the missing diffs and applies them to the page. The missing diffs can be found easily following the linked list of write notices starting from the entry for this page in the PageArray. The following optimization minimizes the number of messages necessary to get the diffs. If processor $p$ has modified a page during interval $i$, then $p$ must have all the diffs of all intervals (including those from processors other than $p$) that have a smaller vector timestamp than $i$. It therefore suffices to look at the largest interval of each processor for which we have a write notice but no diff. Of that subset of the processors, a message needs to be sent only to those processors for which the vector timestamp of their most recent interval is not dominated by the vector timestamp of another processor's most recent interval. After the set of necessary diffs and the set of processors to query have been determined, the faulting processor sends out requests for the diffs in parallel, including the processor id, the page number and the interval index of the requested diffs. When all necessary diffs have been received, they are applied in increasing vector timestamp order. Garbage Collection Garbage collection is necessary to reclaim the space used by write notice records, interval records, and diffs. During garbage collection, each processor validates its copy of every page that it has modified. All other pages, all interval records, all write notice records and all diffs are discarded. In addition, each processor updates the copyset for every page. If, after garbage collection, a processor accesses a page for which it does not have a copy, it requests a copy from a processor in the copyset. The processors execute a barrier-like protocol, in which processors request and apply all diffs created by other processors for the pages they have modified themselves. Garbage collection is triggered when the amount of free space for consistency information drops below a threshold. An attempt is made to make garbage collection coincide with a barrier, since many of the operations are similar. Unix Aspects TreadMarks relies on Unix and its standard libraries to accomplish remote process creation, interprocessor communication, and memory management. In this section, we briefly describe the implementation of each of these services. TreadMarks interprocessor communication can be accomplished either through UDP/IP on an Ethernet or an ATM LAN, or through the AAL3/4 protocol on the ATM LAN. AAL3/4 is a connection-oriented, unreliable message protocol specified by the ATM standard. Since neither protocol guarantees reliable delivery, TreadMarks uses operation-specific, user-level protocols on top of UDP/IP and AAL3/4 to insure delivery. To minimize latency in handling incoming asynchronous requests, TreadMarks uses a SIGIO signal handler. Message arrival at any socket used to receive request messages generates a SIGIO signal. Since AAL3/4 is a connection-oriented protocol, there is a socket corresponding to each of the other processors. To determine which socket holds the incoming request, the handler for AAL3/4 performs a select system call. The handler for UDP/IP avoids the select system call by multiplexing all of the other processors over a single receive socket. After the handler receives the message, it performs the request and returns. To implement the consistency protocol, TreadMarks uses the mprotect system call to control access to shared pages. Any attempt to perform a restricted access on a shared page generates a SIGSEGV signal. The SIGSEGV signal handler examines the local PageArray to determine the page's state. If the local copy is read-only, the handler allocates a page from the pool of free pages and performs a bcopy to create a twin. Finally, the handler upgrades the access rights to the original page and returns. If the local page is invalid, the handler executes the access miss procedure. Performance Experimental Environment Our experimental environment consists of 8 DECstation-5000/240's running Ultrix V4.3. Each machine has a Fore ATM interface that is connected to a Fore ATM switch. The connection between the interface boards and the switch operates at 100-Mbps; the switch has an aggregate throughput of 1.2-Gbps. The interface board does programmed I/O into transmit and receive FIFOs, and requires fragmentation and reassembly of ATM cells by software. Interrupts are raised at the end of a message or a (nearly) full receive FIFO. All of the machines are also connected by a 10-Mbps Ethernet. Unless otherwise noted, the performance numbers describe 8-processor executions on the ATM LAN using the low-level adaptation layer protocol AAL3/4. Basic Operation Costs The minimum roundtrip time using send and receive for the smallest possible message is 500 $\mu$seconds. The minimum time to send the smallest possible message through a socket is 80 $\mu$seconds, and the minimum time to receive this message is $80 \mu$seconds. The remaining 180 $\mu$seconds are divided between wire time, interrupt processing and resuming the processor that blocked in receive. Using a signal handler to receive the message at both processors, the roundtrip time increases to 670 $\mu$seconds. The minimum time to remotely acquire a free lock is 827 $\mu$seconds if the manager was the last processor to hold the lock, and 1149 $\mu$seconds otherwise. In both cases, the reply message from the last processor to hold the lock does not contain any write notices (or diffs). The time to acquire a lock increases in proportion to the number of write notices that must be included in the reply message. The minimum time to perform an 8 processor barrier is 2186 $\mu$seconds. A remote page fault, to obtain a 4096 byte page from another processor takes 2792 $\mu$seconds. Applications We used five programs in this study: Water, Jacobi, TSP, Quicksort, and ILINK. Water, obtained from SPLASH, is a molecular dynamics simulation. We made one simple modification to the original program to reduce the number of lock accesses. We simulated 343 molecules for 5 steps. Jacobi implements a form of Successive Over-Relaxation (SOR) with a grid of 2000 by 1000 elements. TSP uses a branch-and-bound algorithm to solve the traveling salesman problem for a 19-city tour. Quicksort sorts an array of 256K integers, using a bubblesort to sort subarrays of less than 1K elements. ILINK, from the LINKAGE package, performs genetic linkage analysis (see for more details). ILINK's input consists of data on 12 families with autosomal dominant nonsyndromic cleft lip and palate (CLP). Results Figure X presents speedups for the five applications. The speedups were calculated using uniprocessor times obtained by running the applications without TreadMarks. Figure X provides execution statistics for each of the five applications when using 8 processors. The speedup for Water is limited by the high communication (798 Kbytes/second and 2238 messages/second) and synchronization rate (582 lock accesses/second). There are many short messages (the average message size is 356 bytes), resulting in a large communication overhead. Each molecule is protected by a lock that is accessed frequently by a majority of the processors. In addition, the program uses barriers for synchronization. Jacobi exclusively uses barriers for synchronization. Jacobi's computation to communication ratio is an order of magnitude larger than that of Water. In addition, most communication occurs at the barriers and between neighbors. On the ATM network, this communication can occur in parallel. The above two effects compound, resulting in near-linear speedup for Jacobi. TSP is an application that exclusively uses locks for synchronization. Like Jacobi, TSP has a very high computation to communication ratio, resulting in near-linear speedup. While the number of messages per second is slightly larger than for Jacobi, TSP transmits only a quarter of the amount of data transmitted by Jacobi. Quicksort also uses locks for synchronization. Quicksort's synchronization rate is close to that of Jacobi's. It, however, sends over twice as many messages and data per second, resulting in slightly lower, although good, speedups. The number of kilobytes per second transmitted by Quicksort is similar to that transmitted by Water, but it sends 3 times fewer messages and the number of synchronization operations is an order of magnitude lower than for Water. As a result, speedup for Quicksort is higher than for Water. ILINK achieves less than linear speedup on TreadMarks because of a load balancing problem inherent to the nature of the algorithm. It is not possible to predict in advance whether the set of iterations distributed to the processors will result in the same amount of work on each processor, without significant computation and communication. Consequently, speedups are somewhat lower than one would expect based on the communication and synchronization rates. Execution Time Breakdown Figure X shows a percentage breakdown of the execution times for 8-processor versions of all 5 applications. The ``Computation'' category is the time spent executing application code; ``Unix'' is the time spent executing Unix kernel and library code; and ``TreadMarks'' is the time spent executing TreadMarks code. ``Idle Time'' refers to the time that the processor is idle. Idle time results from waiting for locks and barriers, as well as from remote communication latency. The largest overhead components are the Unix and idle times. The idle time reflects to some extent the amount of time spent waiting for Unix and TreadMarks operations on other nodes. The TreadMarks overhead is much smaller than the Unix overhead. The largest percentage TreadMarks overhead is for Water (2.9% of overall execution time). The Unix overhead is at least three times as large as the TreadMarks overhead for all the applications, and is 9 times larger for ILINK. Figure X shows a breakdown of the Unix overhead. We divide Unix overhead into two categories: communication and memory management. Communication overhead is the time spent executing kernel operations to support communication. Memory management overhead is the time spent executing kernel operations to support the user-level memory management, primarily page protection changes. In all cases, at least 80% of the kernel execution time is spent in the communication routines, suggesting that cheap communication is the primary service a software DSM needs from the operating system. Figure X shows a breakdown of TreadMarks overhead. We have divided the overhead into three categories: memory management, consistency, and ``other''. ``Memory management'' overhead is the time spent at the user-level detecting and capturing changes to shared pages. This includes twin and diff creation and diff application. ``Consistency'' is the time spent propagating and handling consistency information. ``Other'' consists primarily of time spend handling communication and synchronization. TreadMarks overhead is dominated by the memory management operations. Maintaining the rather complex partial ordering between intervals adds only a small amount to the execution time. Effect of Network and Communication Protocol We ran Water, the application the highest communication overhead on two other communication substrates: UDP over the ATM network, and UDP over an Ethernet. Figure X shows the total 8-processor execution times for all three different communication substrates and a breakdown into computation, Unix overhead, TreadMarks overhead, and idle time. Overall execution time increases from 15.0 seconds on ATM-AAL3/4 to 17.5 seconds on ATM-UDP and to 27.5 seconds on Ethernet-UDP. Computation time and TreadMarks overhead remain constant, Unix overhead increases slightly, but the idle time increases from 3.9 seconds on AAL3/4 to 5.0 seconds on ATM/UDP, and to 14.4 seconds over the Ethernet. The increase from ATM-AAL3/4 to ATM-UDP is due to increased protocol overhead in processing network packets. For the Ethernet, however, it is largely due to network saturation. Summary TreadMarks achieves good speedups for Jacobi, TSP, Quicksort, and ILINK on the 100 Mbit/sec ATM LAN. For a slightly modified version of the Water program from the Splash benchmark suite, TreadMarks achieved only a moderate speedup, because of the large number of small messages. The overhead of the DSM is dominated by the communication primitives. Since wire time is negligible on the ATM LAN for our applications, the greatest potential to improve overall performance is reducing the software communication overhead. Although the use of the lightweight AAL3/4 protocol reduces the total send and receive time, these are only a part of the overall communication overhead. Lower-overhead user-level communications interfaces or a kernel-level implementation would improve performance. A kernel implementation of the memory management would have little effect on overall performance. In the worst case (Water), TreadMarks spent less than 2.2% of its time detecting and capturing changes to shared pages. Most of this time is spent copying the page and constructing the diff. Less than 0.8% of the time is spent in the kernel generating the signal or performing the mprotect. Lazy vs. Eager Release Consistency Eager Release Consistency: Design and Implementation We implemented an eager version of RC (ERC) to assess the performance differences between ERC and LRC. At the time of a release, ERC creates diffs of modified pages, and distributes each diff to all processors that cache the corresponding page. Our implementation of ERC uses an update protocol. Eager invalidate protocols have been shown to result in inferior performance for DSM systems. We are thus comparing LRC against the best protocol available for ERC. With an eager invalidate protocol, the diffs cause a large number of invalidations, which trigger a large number of access misses. In order to satisfy these access misses, a copy of the entire page must be sent over the network. In contrast, lazy invalidate protocols only move the diffs, because they maintain enough consistency information to reconstruct valid pages from the local (out-of-date) copy and the diffs. Performance Figures X to X compare the speedups, the message and data rates, and the rate of diff creation between the eager and lazy version of the five applications. In order to arrive at a fair comparison of the message and the data rate, we normalize these quantities by the average execution time of ERC and LRC. LRC performs better than ERC for Water and Quicksort, because the LRC sends fewer messages and a smaller amount of data. In Water, in particular, ERC sends a large number of updates at each release, because all processors have copies of most of the shared data. Jacobi performs slightly better under LRC than under ERC. Although communication requirements are similar in both cases, Figure X shows that the lazy diff creation of LRC generates 25% fewer diffs than ERC, thereby decreasing the overhead. For ILINK, performance is comparable under both schemes. For TSP, ERC results in better performance than LRC. TSP is implemented using a branch-and-bound algorithm that uses a current minimum to prune searching. The performance on LRC suffers from the fact that TSP is not a properly labeled program. Although updates to the current minimum tour length are synchronized, read accesses are not. Since LRC updates cached values only on an acquire, a processor may read an old value of the current minimum. The execution remains correct, but the work performed by the processor may be redundant since a better tour has already been found elsewhere. With ERC, this is less likely to occur since ERC updates cached copies of the minimum when the lock protecting the minimum is released. By propagating the bound earlier, ERC reduces the amount of redundant work performed, leading to a better speedup. Adding synchronization around the read accesses would deteriorate performance, given the very large number of such accesses. Related Work Among the many proposed relaxed memory consistency models, we have chosen release consistency, because it requires little or no change to existing shared memory programs. An interesting alternative is entry consistency (EC. EC differs from RC in that it requires all shared data to be explicitly associated with some synchronization variable. On a lock acquisition EC only propagates the shared data associated with that lock. EC, however, requires the programmer to insert additional synchronization in shared memory programs to execute correctly on an EC memory. Typically, RC does not require additional synchronization. In terms of comparisons with other systems, we restrict ourselves to implementations on comparable processor and networking technology. Differences in processor and network speed and their ratio lead to different tradeoffs, and makes comparisons with older systems difficult. We have however borrowed from Munin the concept of multiple-writer protocols. Munin also implements eager release consistency, which moves more messages and data than lazy release consistency. Bryant et al implemented SSVM (Structured Shared Virtual Memory) on a star network of IBM RS-6000s running Mach 2.5. % no mention of link speed? Two different implementation strategies were followed: one using the Mach external pager interface, and one using the Mach exception interface. They report that the latter implementation---which is very similar to ours---is more efficient, because of the inability of Mach's external pager interface to asynchronously update a page in the user's address space. Also, the time to update a page in a user's address space is higher for the external pager interface than for the exception interface (1.2 vs. 0.7 milliseconds) because the need for a data_request - data_provided message transaction when using the external pager interface. The overhead of a page fault (without the actual page transfer) is approximately 1 milliseconds, half of which is attributed to process switching overhead in the exception-based implementation. The time to transfer a page (11 milliseconds) dominates all other overheads in the remote page fault time. Bershad et al use a different strategy to implement EC in the Midway DSM system, running on DECStation-500/200s connected by an ATM LAN and running Mach 3.0. Instead of relying on the VM system to detect shared memory updates, they modify the compiler to update a software dirty bit. Our results show that, at least in Ultrix and we suspect in Mach as well, the software communication overhead dominates the memory management overhead. DSVM6K is a sequentially consistent DSM system running on IBM RS/6000s connected by 220-Mbps fiber optic links and a nonblocking crossbar switch. The system is implemented inside the AIX v3 kernel and uses a low-overhead protocol for communication over the fiber optic links (IMCS). A remote page fault takes 1.75 milliseconds when using IMCS, and is estimated to take 3.25 milliseconds when using TCP/IP. The breakdown of the 1.75 milliseconds page fault time is: 1.05 milliseconds for DSVM6K overhead, 0.47 milliseconds for IMCS overhead and 0.23 milliseconds of wire time. Shiva is an implementation of sequentially consistent DSM on an Intel IPSC/2. Shiva is implemented outside the kernel. A remote page fault takes 3.82 milliseconds, and the authors estimate that time could be reduced by 23 percent by a kernel implementation. In comparison, our page fault times are 2.8 milliseconds using AAL3/4. While these numbers are hard to compare because of differences in processor and networking hardware, our results highlight the cost of the software communication overhead. Either an in-kernel implementation or fast out-of-kernel communication interfaces need to be provided in order to build an efficient DSM system. Conclusions Good performance has been achieved for DSM systems built on various research operating systems. However, in order to use DSM as a platform for parallel computation on clusters of workstations, efficient user-level implementations must be available on commercial operating systems. It is with this goal in mind that we set out to conduct the experiment described in this paper. We implemented a DSM system at the user level on DECstation-5000/240's connected to a 100-Mbps ATM LAN and a 10-Mbps Ethernet. We focused our implementation efforts on reducing the cost of communication, using techniques such as lazy release consistency, multiple-writer protocols, and lazy diff creation. On the ATM network, we avoided the overhead of UDP/IP by using the low-level AAL3/4 protocol. On the ATM network, we achieved good speedups for Jacobi, TSP, Quicksort, and ILINK, and moderate speedups for a slightly modified version of Water. Latency and bandwidth limitations reduced the speedups by varying amounts on the Ethernet. We conclude that user-level DSM is a viable technique for parallel computation on clusters of workstations connected by suitable networking technology. In order to achieve better DSM performance for more fine-grained programs like Water, the software communication overhead needs to be reduced through lower-overhead communication interfaces and implementations.