Integrating Coherency and Recoverability in Distributed Systems Michael J. Feeley, Jeffrey S. Chase, Vivek R. Narasayya, and Henry M. Levy Department of Computer Science and Engineering, FR-35 University of Washington Seattle, WA 98195 {feeley,chase,nara,levy}@cs.washington.edu Abstract 94], and Midway [Zekauskas et al. 94]. These DSM We propose a technique for maintaining coherency systems maintain the illusion of a single shared mem- of a transactional distributed shared memory, used by ory by synchronizing data access and moving data be- applications accessing a shared persistent store. Our tween nodes when required, transparently to the appli- goal is to improve support for fine-grained distributed cation. DSM is useful in this context, because it sim- data sharing in collaborative design applications, such plifies programming of these distributed-parallel pro- as CAD systems and software development environ- grams. ments. In contrast, traditional research in distributed Parallel programs are not the only applications that shared memory has focused on supporting parallel pro- can benefit from the concept of distributed shared grams; in this paper, we show how distributed pro- memory; DSM can be applied to other application do- grams can benefit from this shared-memory abstrac- mains as well. Our goal is to support coherent virtual tion as well. memory for programs that perform transactional up- Our approach, called log-based coherency, integrates dates to their shared memory space. While this style coherency support with a standard mechanism for en- of programming is not ordinarily seen in parallel pro- suring recoverability of persistent data. In our sys- grams, it is standard for applications using a persistent tem, transaction logs are the basis of both recoverabil- store, a system that supports storage and retrieval of ity and coherency. We have prototyped log-based co- virtual memory data structures in disk files. herency as a set of extensions to RVM [Satyanarayanan Our work explores the interactions between co- et al. 94], a runtime package supporting recoverable herency and transactions. We describe log-based co- virtual memory. Our prototype adds coherency sup- herency, a simple technique for maintaining consis- port to RVM in a simple way that does not require tency of a transactional distributed virtual memory. changes to existing RVM applications. We report on The key idea behind log-based coherency is that log our prototype and its performance, and discuss its re- records used to support atomic transactions are also lationship to other DSM systems. used as the basic mechanism for updating cached copies of the data on peer nodes. This unification of mechanisms allows us to add DSM support to sys- 1 Introduction tems that support persistence, without modifying ex- isting persistent programs and without adding signifi- Existing distributed shared memory (DSM) systems cant software overhead. support parallel programming on distributed-memory multicomputers and workstation networks. Exam- 1.1 Persistent Stores ples of such systems include IVY [Li & Hudak 89], Munin [Carter et al. 91], TreadMarks [Keleher et al. Persistent storage systems have evolved to meet the _____________________________ data management needs of design applications, in- This work is supported in part by the National Science cluding electronic and mechanical CAD systems and Foundation (Grants No. CDA-9123308 and CCR-9200832), the Washington Technology Center, Digital Equipment Corpora- software development environments. These applica- tion, Boeing Computer Services, Intel Corporation, Hewlett- tion environments consist of collections of programs Packard Corporation, and Apple Computer. Both Feeley that operate on persistent data structures represent- and Chase have been supported by Intel Foundation Grad- ing design artifacts and derived information. Client uate Fellowships. Chase's present address is: Department of Computer Science, Duke University, Durham, NC 27706 programs navigate through the stored data by follow- (chase@cs.duke.edu). ing pointers directly in virtual memory, with the sys- tem moving data between memory and the persistent tributed file systems, the caches should be viewed as store as needed. A number of persistent stores have a distributed shared memory, since each client process been built; some are research systems [Cockshott et al. accesses the cache directly in its virtual address space. 84, Moss 90, Carey et al. 94] and others are commer- There are other important similarities with DSM sys- cial object-oriented database (OODB) products (e.g., tems. Fine-grained sharing is common in collaborative [Butterworth et al. 92, Lamb et al. 91, O. Deux 92]), design environments, where changes made by one en- augmented with database features such as query pro- gineer must be quickly integrated into structures that cessors, schema languages, and indexing facilities. are readable by the others. The caches of different To prevent failures from corrupting persistent struc- clients will overlap considerably when clients access tures, updates to a persistent store are grouped to- the same artifacts, more so as physical memories grow gether and applied "all or nothing" at commit points and data remains in client memory for longer periods. designated by the program during its execution. For For these reasons, DSM techniques such as fine-grained our purposes, a transaction is a period of execution client-client transfers are appropriate for maintaining ending in a commit point. The key property of trans- coherency for cached persistent stores. In the past, dis- actions is that each commit point atomically enters tributed persistent systems have supported coherency a set of updates that together transform the store in a manner similar to most distributed file systems: from one durable consistent state to another. Fail- by reading and writing shared data through the server, ures may cause some uncommitted updates to be lost, usually in fixed-size blocks, and invalidating cached but the last committed state can always be recov- blocks when another client acquires the file token or ered. Database systems typically combine these re- lock. quirements of atomicity and durability with additional In addition, there is commonality between the key assumptions about how concurrent transactions are implementation aspects for coherency and recoverabil- synchronized, but this is not essential to our notion ity. To implement either property, the system must of a transaction. capture the updates made by the application, and Our work explores techniques for extending a per- propagate those updates _ either to durable mem- sistent store to allow network access. For example, ory (e.g., disk) or to other memories in the network _ a persistent store that supports design applications in such a way that all clients see a consistent view of can be extended to allow a group of collaborating de- the data at all times. The key performance factors are signers to run CAD tools at their workstations, ac- the same: (1) the cost of capturing updates, and (2) cessing the shared store through the network. In the the precision with which those updates are captured, simplest configuration, updates are written atomically which determines the amount of traffic to the disk or to a centralized server that maintains the authorita- network. This synergy can be exploited when both tive copy of the data in the store. Clients fetch data properties are supported together, and optimizations from the server in bulk, cache it locally, and oper- that improve the performance of one may also improve ate directly on the cached image in virtual memory. the performance of the other. In the database community this architecture is some- This paper develops a combined approach that in- times called a client/server database. We refer to it tegrates coherency support with a mechanism for en- as a cached persistent store since the ideas generalize suring recoverability of persistent data. It differs from to persistent virtual memory systems that do not pro- other DSM systems in that it accommodates a trans- vide full database features. Note that cached persis- actional programming model and exploits the captur- tent stores are distinct from transactional systems for ing of updates for a transaction log. Section 2 out- reliable distributed programming (e.g., Argus [Liskov lines our approach and its variations, discusses some 88] and Camelot [Eppinger et al. 91]) in that the of the design issues, and sets our work in context with database itself is not distributed; each transaction up- other systems that support distributed database access dates a cached image of a centralized database in vir- and distributed shared memory. Section 3 describes tual memory on a single node. In particular, there is our prototype implementation, and Section 4 presents no need for two-phase commit, since each transaction some performance results. In Section 5 we discuss ad- commits or aborts within a single process. ditional related work, and we conclude in Section 6. 1.2 Combining Coherency and Recov- 2 Log-Based Coherency for a Cached erability Persistent Store A key problem for cached persistent stores is main- Log-based coherency is an extension to write-ahead taining consistency of the data cached in memory on redo logging, a common mechanism for implement- each client. While these systems are similar to dis- ing atomic and durable transactions. This mechanism works by recording new values of data items modi- o how updates are synchronized with client threads fied by a transaction, and writing them to a log on executing locally, and, in a recipient's cache. durable memory when the transaction commits. The system ensures that commits are atomic by writing In our prototype, described in Section 3, we have the log before writing the updated objects back to the made choices for ease of implementation and for the permanent database file. In the event of failure, a re- workloads we expect. However, our approach is flex- covery procedure restores the database to a consistent ible enough to accommodate variations. The follow- state by replaying the committed log records into the ing subsections discuss underlying concepts, the design permanent database file. Many recoverable systems choices we made, and some alternative possibilities. use write-ahead redo logging; for example, it is funda- mental to the pin/update/log commit protocol used by 2.1 The Role of Synchronization Camelot [Eppinger et al. 91]. When multiple clients are accessing the store In DSM systems, the method and timing of update through a network, the log records are written to a propagation is closely tied to synchronization events. logically centralized storage service that also holds the The first DSM systems (Monads [Rosenberg & Abram- permanent database file. A client failure aborts all son 85] and IVY [Li & Hudak 89]) used virtual page uncommitted transactions executing in that client. A protections and reference traps to capture updates and server failure may abort uncommitted transactions in synchronize access to shared pages. In these systems, all clients and initiate the recovery procedure to bring page-grain locking and page-grain coherency led to the permanent database file to a consistent state, re- performance problems caused by false sharing. Newer flecting the committed updates made by all clients. high-performance DSM systems based on release con- (Note that the storage service could be transparently sistency [Gharachorloo et al. 90] (e.g., Munin), lazy replicated to reduce the probability of a server failure.) release consistency [Keleher et al. 92] (TreadMarks) The key to log-based coherency is that the redo log and entry consistency (Midway) have reduced this generated by each client holds exactly the information problem by supplying synchronization primitives that needed to maintain consistency of distributed mem- function independently of the coherency protocol, and ory. The system need only transmit the committed log drive the propagation of updates. These systems re- tails to peer nodes that are sharing the modified ob- duce false sharing by allowing concurrent accesses to jects; the recipients apply the log records they receive a shared page. However, they assume a fully synchro- to update their cached data images. There is no ex- nized (properly labeled) application program; that is, tra runtime cost for collecting the information needed the application must acquire and release the correct to maintain coherency, since it is already collected to locks at the correct times. For all of these systems, the support recoverability. Thus, a common implementa- coherency algorithm will fail for programs that make tion technique for a recoverable store is extended to synchronization errors. support coherency in a straightforward way. In fact, Log-based coherency uses a similar approach resting our approach can be viewed as transaction logging to on similar assumptions. Our prototype supplies stan- remote memory instead of (or in addition to) the disk. dard mutex primitives that are acquired as a trans- As an example, suppose that several nodes are shar- action executes and released at transaction commit ing a persistent object, X; each has a copy of X cached (we currently assume that transactions use strict two- in its primary memory. When one of the nodes exe- phase locking). The system propagates updates only cutes a transaction modifying X, the updates are per- after a writer has released all relevant locks at commit, formed and logged locally; other nodes are neither up- and it ensures that all relevant updates are applied lo- dated nor invalidated at that time. When the trans- cally before a reader is permitted to acquire a lock. action is complete, the generated log record contains Our system differs from release consistency, and is sim- exactly the information needed to transition to the cur- ilar to entry consistency (Midway), in that coherency rent state of X from its initial state. Thus, to bring operations initiated by a lock operation are restricted the other nodes up to date, it is sufficient to propagate to the data under the scope of that lock. The store this portion of the log to the peer nodes. is partitioned into segments, each under the control of Within this broad framework, systems that use log- a separate lock. Segments can be large or small, pre- based coherency could vary in a number of details, senting an obvious tradeoff between synchronization including: overhead and false lock conflicts. We expect locks to be relatively coarse-grained; most commercial persis- o when log records are transmitted to peer nodes, tent stores support coarse-grained locking as natural o which peers receive the log records, for collaborative design environments. Regardless of the locking grain, the updates made while a lock is o how updates are applied to a recipient's cache, held often involve only a small number of the bytes controlled by that lock. In fact, we expect that most larly on networks with point-to-point links that do not updates will be fine-grained (e.g., "change this and to support broadcast or multicast. an or"), but that read operations will consume large We believe that alternative policies could be imple- amounts of data for input to functions such as de- mented transparently to the application by replacing sign analysis or graphical display. For this reason, log- the synchronization primitives and their embedded co- based coherency separates the coherency grain from herency code. For example, the system could propa- the synchronization grain: the updates sent to peer gate segment updates lazily, using an embedded call nodes are determined not by the locks acquired, but in the acquire primitive to fetch and apply pending by the values logged at transaction commit. log records. Segment updates could be fetched from Transaction log records include information about the server, where all log records are cached in mem- the locks acquired during the transaction. Each lock ory for a time, or passed with the lock by the last has a unique lock number and a sequence number writer. In Midway, for example, the acquire primi- (timestamp) that is incremented on each acquire. tive retrieves current versions of any stale objects in When a transaction acquires a lock, its log record is the requester's cache from the current lock holder. The tagged with the sequence number and lock number of acquire request includes a timestamp for the last time that lock. The synchronization primitives contain em- the lock was held by the requester; the lock holder de- bedded calls to logging routines to generate these tags. termines which modified objects to return by compar- The tags are used to determine which nodes must re- ing the timestamp against a modification timestamp ceive the log records and when they must be applied. for each object under the lock's scope. For example, updates for a segment need only be sent to nodes that have previously acquired the lock for the For log-based coherency, lazy update propagation segment. The sequence numbers are also used to en- raises the question of how to determine when pending sure that all relevant data has been rendered coherent log records are no longer needed by peer nodes and before an acquire can succeed, and to preserve the can be discarded. One solution is to pass information global ordering of updates from multiple nodes during about how many log records to hold for each segment recovery, as described in Section 3.4. along with the lock token, as each node acquires the We believe that alternative synchronization models lock in turn. Each node holds all log records up to can be implemented with a modest effort, by replacing and including the oldest records needed by the most the synchronization primitives and the calls they make out-of-date peer. Acquiring a lock brings new records, to underlying logging and coherency routines. Several as well as an opportunity to discard records held lo- relaxed models have been proposed by researchers ar- cally. Note that discarding pending update records is guing that strict serializability is inappropriate for de- not a concern with Midway's lazy propagation scheme, sign transactions (e.g., [Hornick & Zdonik 87]). We because no log records or captured updates are held; are exploring a read/write model that permits readers nodes do not save intermediate versions of objects that to operate on a previous consistent version of the data are modified repeatedly. In our case, these log records while an update is in progress elsewhere; readers use must be held in order to support the nonserializable an accept primitive to explicitly signal their willing- read/write model described above. That is, a reader ness to move forward to a newer consistent version. In may acquire a previous version even if uncommitted this scheme, pending log records must be buffered in writes are present in the writer's cache; the reader's the recipient until they can be applied. We have used cache must be updated to reflect the previous com- a similar version-based consistency model in the past mitted version. for a range of parallel applications [Feeley & Levy 92]. 2.2 Propagating Log Records Our prototype uses a simple eager policy for propa- 2.3 Summary gating updates. At each commit point, after the log records have been written to the storage server, the In this section we presented an overview of log-based in-memory copies of the log records for each segment coherency. Built-in synchronization primitives are fun- are eagerly propagated to all clients that have recently damental to our approach. Alternative coherency pro- acquired the locks for the modified segments. We use tocols _ rules for propagating, applying, and releas- eager updates because they are simple (i.e., no buffer- ing log records _ can be realized by embedding calls ing of log records), they are tolerant of client failures, to logging routines in the synchronization primitives. and they reduce the latency of data access on a client. The locking routines can collect and maintain infor- However, eager updates may increase network traffic mation about which peers must receive a given set of and cannot scale to large numbers of clients, particu- updates, and when. 3 Prototype Implementation 3.1 Capturing Updates Both DSM and recoverable systems must determine To experiment with our approach, we prototyped which bytes are modified by the application using ei- a simple implementation of log-based coherency by ther VM write faults, a write barrier inserted by the adding distribution support to CMU's Recoverable compiler, or explicit calls from the application. Like Virtual Memory (RVM) package [Satyanarayanan the RVM system, our prototype assumes explicit run- et al. 94]. RVM is a logging facility that supports time calls to a set_range procedure in the runtime transactional update and recovery of virtual-memory package. A call to this function indicates an intent to resident data structures. RVM is designed to be a update a particular range of bytes. We expect that lightweight and portable package for use with small this range corresponds to an object, and that the call databases that easily fit in physical memory: an RVM is made by code generated explicitly by the language client copies the entire database into virtual memory compiler (such as the ML compiler that has been used when it starts up. This avoids the need to pin modified with RVM [O'Toole et al. 93]). In contrast, most pages, but for large databases it causes double paging DSM systems use virtual page access faults to captue and unnecessary pageouts of clean pages by the vir- updates, though software-based write detection is also tual memory system. This limits RVM's usefulness for used in Midway [Zekauskas et al. 94]. This issue is the collaborative design environments that interest us; discussed in more detail in Section 4. however, it is an expedient vehicle for experimenting RVM coalesces modified ranges that are adjacent with log-based coherency. or overlapping in order to avoid writing redundant In keeping with its minimalist philosophy, RVM does bytes to the disk log. Overlapping ranges, however, not support or rely upon any particular synchroniza- are unlikely when calls to set_range are generated by tion scheme. Though updates are transactional, multi- a compiler, as is anticipated. To improve common- threaded updates may or may not be serializable. In a case performance, we modified set_range to coalesce similar spirit, our RVM-based prototype separates the only when there is an exact match with a previously synchronization aspects of coherency and recoverabil- added range. Thus, objects that are modified multi- ity from the mechanisms for collecting, propagating, ple times during a transaction are still coalesced but and applying redo log records. Our intent is to ac- with a simpler and more efficient implementation; this commodate various policies for propagating updates reduces set_range overhead by a factor of five. The re- as part of a synchronization scheme "plugged in" to maining per-update overhead is dominated by search- RVM. ing the binary tree that stores modified ranges (in or- der by their address). As a second optimization, we We use RVM as a client/server distributed database avoid this search in the special case where a sequence by placing the transaction log files and the database of set_range calls is ordered by address. file on a central NFS server. Clients maintain caches of the central database in their local virtual memories by reading the entire database into memory at startup (as 3.2 Propagating Log Records in centralized RVM), in this case using the NFS pro- In response to set_range calls, the RVM runtime li- tocol. As write transactions execute, RVM produces brary builds a data structure that describes the modi- redo log records that are written to the NFS server. fications made by a transaction. When the transaction Log-based coherency is implemented by sending the commits, this data structure is used to build I/O vec- log tails to other nodes that have the region mapped, tors for the Unix writev system call. The writev where they are applied to each recipient's cache. Each causes the new values of all modified objects to be node recording such a transaction produces a separate copied from virtual memory to a system buffer for writ- log. We added an RVM utility that merges these into ing to disk. RVM thus avoids building an object log a single log for recovery (see Section 3.4). in virtual memory that contains copies of the modified Our application interface is summarized in Table 1. objects. The left-hand column describes the procedures the ap- We modified the RVM commit procedure to broad- plication uses to initialize, begin, and commit a trans- cast the same new-value information that is written to action; to acquire a segment lock; and to describe the disk. Coherency data is broadcast using TCP/IP by data that is modified by the transaction. The right- issuing a writev system call for each node that has the hand column shows the RVM calls made by each of current region mapped. We use the OSF/1 PThreads these procedures. We added a new procedure, called facility [Mueller 93] to create receiver threads for each rvm_setlockid_transaction, to the standard RVM communication channel that connects a node to its interface. This procedure is called by the acquire peers. These threads block in readv system calls wait- primitive, as described in Section 3.3. ing for coherency messages and applying the updates _______________________________________________________________________________________________________________________________ |__Log-Based_Coherency_Operation__|_______________RVM_Routine_Called_______________________________________________|___________ | Trans.Init() | tid=rvm_malloc_tid() | | Trans.Begin(rvm_mode) | rvm_begin_transaction(tid, rvm_mode) | | Trans.Commit(rvm_mode) | rvm_end_transaction(tid, rvm_mode) | | Trans.Acquire(lock) | rvm_setlockid_transaction(tid, lock.lockId, lock.sequenceNum) | |__Trans.SetRange(addr,_size)______________|______rvm_set_range(addr,_size)______________________________________________|_____ Table 1: Log-based coherency interface. when they arrive. rent sequence number is passed along with the lock to- The format of the coherency records data differs ken when ownership of the lock changes. The acquire from the data sent to disk in two respects. First, some primitive calls RVM to associate the lock with the cur- records that are needed only for recovery and log trim- rent transaction. We added a new procedure to the ming are not included in the broadcast data; i.e., only RVM interface, rvm_setlockid_transaction(tid, new-value range records are needed for coherency. Sec- lockId, sequenceNum), for this purpose. Because ond, the header information for each range record is locks follow a strict two-phase protocol, each lock is compressed from 104 bytes to between 4 and 24 bytes. acquired at most once during a transaction. The standard RVM header contains fields that are not needed for coherency; our header contains only the range's type, address, and size. The header is further 3.4 Preserving Ordering compressed when ranges are small (less than 4 Kbytes) While synchronization is mostly independent of co- or close together (fewer than 256 Kbytes apart) by us- herency and recovery, there is an important interac- ing smaller fields and by replacing the range's address tion. When a transaction commits, the lock informa- with its offset from the preceding range; our modified tion provided to RVM is used to generate lock records set_range orders modified ranges by their address. that are inserted in the log entry for the transaction. Lock records contain the lock identifier and sequence 3.3 Synchronization number. These lock records are used by both the co- herency and recovery code to preserve the ordering of We added distributed locks that provide mutually- updates generated by different nodes. exclusive access to non-overlapping portions of an For coherency, the lock records ensure that log RVM region. Locks are acquired inside of transactions records applied by a receiver thread are properly in- in a two-phase manner; all locks are automatically re- terleaved with those sent by other nodes. The lock leased when a transaction commits. Applications must records included with a received update are used to acquire a lock before reading or writing the data pro- set the receiver's local sequence number for the lock. tected by that lock. If necessary, receiver threads hold log records until The lock implementation is token based with a cen- the updates for the immediately preceding sequence tralized lock manager and a distributed waiter queue, number have been applied. Also, the sequence num- an approach used in many distributed systems includ- ber must match the sequence number passed with the ing TreadMarks. At all times, exactly one node owns lock token before the lock can be acquired on that the lock token. The lock can be acquired on that node node. An application that tries to acquire the lock without remote communication; nodes hold onto the prematurely will wait on a condition variable until sig- token until they are requested to pass it to another naled by the receiver thread that applies the latest up- node. Acquire operations at other nodes send a mes- date. This interlock between coherency and synchro- sage to the lock manager to request the lock. The nization is necessary because updates are broadcast node number of the lock manager is determined from asynchronously; the lock token could arrive at a node the lock identifier number. The manager maintains a before all necessary coherency updates have been re- distributed queue of nodes waiting to acquire the lock. ceived and applied. For example, if the token passes It adds the requester to the tail of the queue and for- in order through nodes A, B, and C, C might receive wards its request to the previous tail which responds the token from B before A's updates arrive at C. The either by sending the lock token to the requester, if protocol ensures that, for updates made to data pro- the lock is available, or by queueing the request until tected by the same lock, (1) none of B's updates are the lock is released. applied at C until C has applied A's updates and (2) Each lock has a unique identifier and a sequence C is not allowed to acquire the lock until B's updates number that is incremented on each acquire. The cur- have been applied. For recovery, lock records are used to merge the in- different traversals, updates, and queries of a syn- dividual RVM redo logs produced by each node. When thetic object-oriented database. The database and the applications share a segment, their logs may record in- traversal tests are intended to be suggestive of typical terleaving updates to the same data. Thus, before any engineering database applications. of these logs can be used by the standard RVM recov- We modified OO7 to run with RVM in standard vir- ery procedure, they must be merged into a single log. tual memory (i.e., no OODB) and integrated it with We built a new RVM utility to do this. Our merge our log-based coherency prototype. We report timings utility reads input logs from head to tail; transaction and related overheads for several OO7 traversals us- records from different logs are compared by comparing ing our prototype. These experiments were conducted their lock records. Our current merge utility exploits using two Digital 3000-400 Alpha APX (133 Mhz, 74 the fact that our current synchronization model guar- SPECints) workstations with 8-Kbyte pages and sep- antees strictly serializable transactions. When merg- arate 512-Kbyte direct-mapped instruction and data ing records from different logs, it is sufficient to order caches [Dig 92]. The machines are connected by a transactions so that if two transactions acquired the 100 Mbit/s AN1 network, an experimental switch- same lock, the transaction with the earlier sequence based network capable of sending message packets of number for that lock is ordered first. up to 64 Kbytes [Rodeheffer & Schroeder 91]. Elapsed time measurements were taken using the Alpha cy- 3.5 Distributed Log Trimming cle counter. For these experiments, we disabled RVM disk logging so that we could isolate the costs associ- The current RVM log-trimming algorithm has an un- ated with coherency from the synchronous disk writes fortunate interaction with our distribution implemen- needed to support recovery. This is important in part tation. In the current version of RVM, log records are because optimizations using non-volatile RAM can be trimmed from the head of the log using the standard used to eliminate synchronous disk writes from the recovery procedure to compute a new checkpoint. This commit critical path [Hagmann 86]. operation, which runs asynchronously with normal The figures also depict estimated lower bounds for transactions, is triggered when the number of records alternative DSM implementations that use page ac- in the log reaches a high-water point. In our system, it cess faults to capture updates. For log-based co- is no longer possible to use the log generated by a single herency, updates are captured by calls to the RVM node to compute a new checkpoint; instead, log records set_range procedure. These calls are coded explic- from all nodes must first be merged. Our current pro- itly in our OO7 benchmarks, although other RVM totype performs log trimming offline using the recovery applications have used compiler-generated set_range procedure described above. Online trimming could be calls [O'Toole et al. 93]. In contrast, most DSM sys- implemented using the merging procedure by coordi- tems use page access faults to capture updates without nating checkpointing; one node would checkpoint at a involvement from the compiler or application. Early time, broadcasting to other nodes when done to inform page-locking DSM systems, such as Monads and IVY, them of their new log head. use page access faults to grant a writer exclusive access An improved log-trimming scheme for RVM is de- to a page while updates are in progress, then trans- scribed in [Satyanarayanan et al. 94]. In this scheme, mit the entire contents of each modified page to other nodes checkpoint a page at a time by writing the cur- nodes accessing the page. In newer multiple-writer rent version of a page to the checkpoint file. Log "copy/compare" DSM systems, such as Munin and records for updates made to a page before it was check- TreadMarks, updates are also detected a page at a pointed can be discarded. This checkpointing scheme time, but multiple nodes are permitted to make con- could be more easily incorporated into our prototype, current non-conflicting updates to any given page. In because it does not require logs to be merged. these systems, the first store to a unmodified page on each node delivers a write-access fault to the co- 4 Performance herency software, which makes a copy of the page be- fore enabling write access. Updates are collected by This section presents performance measurements of later comparing the modified page with its copy. The our prototype. While most DSM systems have used copy/compare technique could improve performance parallel programs for evaluation, our application do- for some OODBs that use page-grained locking and main _ collaborative design applications accessing updates today. a distributed persistent store _ requires a different Our lower-bound estimates for page-locking (labeled benchmark. For this reason, we have chosen to use Page) and multiple-writer (labeled Cpy/Cmp) DSM OO7 [Carey et al. 93], a standard object-oriented systems are computed from the measurements listed database benchmark. OO7 consists of a number of in Table 2. We measured the cost of using the OSF/1 ________________________________________________________________________________________ Operation | Cost | Throughput | ||_______________________________________________|(~sec/page)__|_____(MBytes/s)_____|___ | page copy (cold cache) | 171.9 | 43 | | page copy (warm cache) | 57.8 | 135 | | page compare (cold cache) | 281.0 | 28 | | page compare (warm cache) | 147.3 | 53 | | page send (TCP/IP) | 677.0 | 12 (96.8 Mbit/s) | |__handle_signal_and_change_protection__|____________360.1______|_______________________| Table 2: Operation costs (per page) on Alpha/AN1. mprotect system call to change page protection, by ite parts are visited. When a composite part is vis- storing to a read-only page to generate a protection ited, its atomic-part graph is traversed and updates fault, delivering the signal to a user-level procedure, are performed. There are three variants of the traver- calling mprotect again to enable writing, and return- sals (A, B, and C) that update different numbers of ing from the signal handler. We also measured the atomic parts. In A, one atomic part per composite time to copy and compare pages (we use the cold cache part is modified; in B, every atomic part is modified; times in the figures). For page-grain DSM (Page), up- and in C, every atomic part is modified four times. dated pages are not copied or scanned, so we assume no An atomic part is updated by changing an eight-byte collection overhead. However, network I/O overhead field. The difference between T2 and T3 is that T3 for Page is greater because entire pages are transmit- updates the atomic part's index field. Each time this ted instead of just the modified bytes. This time is field is changed, the part index is updated by deleting estimated from the measured TCP throughput given the index entry for the old value and adding an entry in Table 2. Communication overhead for Cpy/Cmp is for the new value. This results in an average of seven assumed to be the same as the measured times for log- index updates for each atomic-part update. based coherency (labeled Log), since both send only We added a third update traversal, called T12. T12 the modified bytes. Both Log and Cpy/Cmp also in- differs from the other update traversals in that it per- cur overhead at the receiver to apply the updates to forms a sparse traversal of the database. It is similar to the cache; however, this cost is too small to be clearly read-only traversal T6; for each composite part it vis- distinguished in any of the graphs below. its only one atomic part. In T12, a higher percentage of overall running time is related to updating objects. 4.1 The OO7 Benchmark Thischighlightsothehperformanceecostsrassociatedewithncy over* *head. The OO7 database is composed of a design library In our RVM-based OO7 benchmark, the database and an assembly hierarchy. The design library, which elements are heap-allocated C++ objects and a makes up the bulk of the database, is a set of 500 com- threaded AVL-balanced tree is used for the part in- posite parts. A composite part corresponds to a design dex. The atomic parts associated with a particular primitive, such as a register cell in a VLSI CAD ap- composite part tend to be clustered on the same page plication or a procedure in a CASE application. Each while atomic parts from different composite parts are composite part is itself made up of a graph of 20 atomic usually on different pages. We ran each of the OO7 parts. Composite and atomic part objects are each update traversals under our prototype. Each test con- roughly 200 bytes long. Each atomic part contains an sists of a single transaction (and a single segment lock) index field; a part index is maintained using a self- in which one node performs the traversal and another balancing tree. receives the log tail and installs the updates, bringing Objects in the assembly hierarchy correspond to its copy of the database up to date. higher-level design constructs in the application. The assembly hierarchy is a tree of assembly objects with 4.2 Results 729 leaf nodes. Each leaf node, called a base assem- bly, points to three composite parts that are chosen at Figures 1, 2, and 3 show the results of running the random from the design library when the database is various OO7 traversals, along with the associated co- constructed. herency overhead. Write overhead consists of detect- There are two traversals in OO7 that update the ing and collecting updates and the network I/O costs database, T2 and T3. Both traverse the assembly hi- of transmitting those updates to the other node. Ta- erarchy to visit all of the composite parts pointed to ble 3 summarizes the characteristics of these traversals, by each base assembly; thus, a total of 2187 compos- listing the number of updates performed by each, the _____________________________________________________________________________ Traversal | Updates | Bytes | Message | Pages | ||________________|____________|_Updated__|_______Bytes____|___Updated__|____ | T12-A | 2,187 | 4,000 | 6,000 | 500 | | T12-C | 8,748 | 4,000 | 6,000 | 500 | | T2-A | 2,187 | 4,000 | 6,000 | 500 | | T2-B | 43,740 | 80,000 | 120,000 | 618 | | T2-C | 174,960 | 80,000 | 120,000 | 618 | | T3-A | 16,924 | 31,300 | 39,000 | 552 | | T3-B | 248,632 | 114,650 | 163,300 | 667 | |__T3-C________|__1,502,708__|______115,100__|_____163,800__|__________670__|_ Table 3: Summary of OO7 update-traversal characteristics number of unique bytes updated, the number of bytes sent over the network, and the number of pages up- dated. (The difference between the number of bytes updated and the number transmitted is due to range- message overhead; each range is preceded by a header that describes the address of the range and its size.) Based on the Alpha-OSF/1 measurements, our analysis shows that for the anticipated application workload, in which updates affect a large number of pages, software write detection (e.g., compiler- generated set_range calls) performs better than any form of hardware-based write detection. Our ap- Figure 1: OO7 Sparse-update traversals T12-A and proach performs significantly better for traversals T12- T12-C. A, T12-C, T2-A, and T3-A because they perform rela- tively few updates per page. Traversals T2-B and T2- C perform 71 and 283 update per page respectively; for these traversals, our approach performs about as well as Cpy/Cmp. However, the index-update traversals T3-B and T3-C perform significantly more updates per page, 373 and 2,243; as a result our approach performs poorly. This shows that when there are many updates per page, a page-based systems such as TreadMarks is preferred to a software-based approach such as log- based coherency. Figure 2: OO7 Full-update traversals T2-A, T2-B, and 4.3 Analysis T2-C, and index-update traversal T3-A. There is a performance tradeoff between the three ap- proaches evaluated above that is a function of (1) the number of modified bytes per page and (2) the num- ber of individual updates per page. The overhead of Log and Cpy/Cmp depends on the number of modified bytes, Cpy/Cmp and Page depend on the number of modified pages, and Log alone depends on the num- ber of updates per page. Which approach works best depends on the workload. As we have seen, log-based coherency is preferred when there are few updates per page; similarly, Page performs best when most of a page is modified. Figures 4-7 below show where the breakeven points occur. Figure 3: OO7 Index-update traversals T3-B and Figure 4 shows coherency overhead as a function of T3-C . the number of modified bytes per page. This compares the per-byte overhead of log-based coherency with the total overhead of copy/compare and page approaches. When more than 1037 bytes are modified per page, Page outperforms Cpy/Cmp. When there are few up- dates per page, Log outperforms the alternatives no matter how many bytes are modified. However, this graph does not include the per-update overhead asso- ciated with Log; this is shown in Figures 5-7. Figures 5 and 6 show the log-based coherency over- head associated with detecting and collecting a single update as the number of updates per transaction in- Figure 4: Comparison of overhead as the number of creases. This is a measurement of the performance modified bytes per page increases. For log-based co- of RVM operations set_range and rvm_commit. The herency, per-update overhead is not included. middleselinequisenthececostofforseant_updaterainngane ordered* *calls (taking advantage @ optimization described in Section 3.1). The lower line is the cost of detecting an update to a range that was modified previously in the same transaction. Figure 7 shows the breakeven point at which log- based coherency and Cpy/Cmp have equivalent per- formance. For a given average per-update cost on the x-axis, the y-axis shows the maximum number of up- dates per page possible before Cpy/Cmp outperforms log-based coherency. For example, using Figures 5 and 7, we can determine that if there are 1000 updates per transaction, log-based coherency performs better when there are 45 or fewer updates per page (55 if the up- Figure 5: The overhead associated with a single update dates are ordered). for log-based coherency as the number of updates per Recent work [Thekkath & Levy 94] has shown an transaction increases. order-of-magnitude reduction in exception-handling cost, which would make hardware-based write detec- tion more attractive. Figure 7 shows how the per- formance tradeoff would be affected if signal handling overhead were 10 ~sec instead of the 340 ~sec mea- sured for Alpha-OSF/1. 4.3.1 Increasing the Number of Nodes Another important performance concern for log-based coherency is the effect of increasing the number of Figure 6: Log-based coherency update overhead up to nodes using a segment to be kept coherent. In our 300,000 updates per transaction. prototype the network I/O overhead of the writer in- creases linearly with the number of peer nodes, be- cause the writer node issues separate writev system calls for each peer. Since this overhead is relatively small, our approach will scale to a moderate number of nodes. Systems with a very large number of clients will perform better with multicast hardware or lazy coherency. 4.3.2 Impact of Coherency on Recoverability Figure 7: Breakeven point for log-based coherency. For different update overheads, the number of updates per Finally, we wanted to determine the impact of our page at which log-based coherency performs better than modifications on the performance of standard RVM. Cpy/Cmp. Figure 8 shows four measurements of the coherency and recoverability overheads for the T12-A bench- a main-memory database [Li & Naughton 88]. The standby keeps a complete copy of the database in its memory and receives updates from the primary in the form of log records sent at commit points. Their pur- pose is to allow checkpointing to take place in the standby, off-line, without interfering with clients exe- cuting on the primary copy of the database. Delis and Roussopoulos conducted a simulation study of client- server relational databases using a log-based approach for updating client caches [Delis & Roussopoulos 92]. Updates are centralized at the server; the server main- Figure 8: Comparison of log-based coherency, disk log- tains a recovery log and a separate update log. Before ging, and standard RVM. Optimized RVM is standard a client accesses a data item in its cache, it first con- RVM with our optimizations to set_range. tacts the server to retrieve log records generated since the cached copy of the item was last updated. While several of these systems send log records from clients mark. The first column is the overhead for log-based to one or more servers, log-based coherency sends log coherency presented in the previous section. The next records from client to client. column measures the overhead when disk logging is enabled; from this we see that the only additional Several groups have integrated coherency and re- overhead is due to writing the log tail to disk. The coverability by adding checkpointing to page-based third and fourth columns were taken using standard DSM, without using transactions. This allows non- RVM, without our log-based coherency modifications; transactional applications such as parallel programs the third column is standard RVM with our optimized to be made recoverable. The main issue is to coordi- set_range modification. This shows that the over- nate individual node checkpoints to attain a consistent head added to RVM by log-based coherency is directly global checkpoint, using a combination of dependency related to sending the modified bytes to peer nodes. tracking, message logging, and replication. The first This validates our assertion that there is a high degree such system is due to Wu and Fuchs [Wu & Fuchs 90]. of overlap between the mechanisms for recoverability Stumm and Zhou describe a system that tolerates the and coherency. failure of a single node by ensuring that every page resides in the caches of at least two nodes [Stumm & Zhou 90]. Richard and Singhal use page logging 5 Related Work [Richard & Singhal 93]; a copy of a page is written to a local volatile log each time it is acquire for reading In the context of RVM, transaction logs have been or writing. A node writes its volatile log to disk before used in a similar fashion to propagate updates between transferring a modified page to another node. data spaces in a system with concurrent replicating Janssens and Fuchs added checkpointing to relaxed- garbage collection [O'Toole et al. 93]. Our contri- consistency DSM [Janssens & Fuchs 93]. Instead of bution is to use this idea for maintaining coherency of requiring checkpointing or other recoverability actions client database caches. Log propagation has been used each time an application gains access to a page, their to maintain the consistency of replicas in replicated system checkpoints only when a node releases or ac- database systems. Replicated systems differ from log- quires a lock. based coherency in that the replicas are used to ensure availability of the data. In replicated systems, log Neves et al. added checkpointing to an entry- records flow in one direction: from clients to servers consistent DSM system similar to Midway, using (or from a client to a server, and then on to other object-grain locking [Neves et al. 94]. Their system servers). Servers cooperate to ensure that log records tolerates single node failures by keeping old versions of are applied in a consistent order at all locations. In our modified objects in the volatile memory of the nodes system, the replicas are caches that keep only enough that modify them. Each node checkpoints indepen- data to meet the local client's needs; update propa- dently; a failed node recovers by replaying its execu- gation may be delayed until a client requests the new tion starting with its most recent checkpoint. Infor- data. mation recorded at other nodes is sufficient for them Harp [Liskov et al. 92] file servers log received up- to supply the recovering node with the same version of dates to peer servers in order to remove stable storage objects as it saw during normal execution. The impact writes from the commit path. Similarly, Naughton and on failure-free execution is minimized by piggy-backing Li have used log propagation to keep a hot standby of recoverability information on coherency messages. 6 Conclusion Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium The key points of this paper are: (1) DSM techniques on Operating Systems and Principles, pages such as fine-grained client-client transfers are appro- 152-164, October 1991. priate for maintaining cache coherency for distributed [Cockshott et al. 84] Cockshott, W., Atkinson, M., persistent stores, (2) there is a commonality between Chisholm, K., Bailey, P., and Morrison, R. the implementation techniques for recoverability and Persistent object management system. Soft- coherency, and (3) this synergy can be exploited when ware Practice and Experience, 14(1), Jan- both properties are supported together. uary 1984. We have presented a new DSM approach called log- [Delis & Roussopoulos 92] Delis, A. and Roussopou- based coherency that uses recoverability mechanisms los, N. Performance and scalability of from persistent object systems as the basis for main- client-server database architectures. In Pro- taining coherency of distributed objects. Our work ex- ceedings of the 18th International Confer- tends previous work on DSM to exploit the notion of ence on Very Large Databases, pages 610- commit points in which a group of related updates be- 623, August 1992. come visible atomically. We extend work on persistent stores to support the fine-grained sharing made possi- [Dig 92] DigitalMEquipmentACorporation,.Maynard,Alpha Arc* *hitecture Handbook, 1992@ ble for parallel applications by DSM systems. In par- ticular, log-based coherency separates the coherency [Eppinger et al. 91] Eppinger, J., Mummert, L., and grain from the synchronization grain. This is impor- Spector, A. Camelot and Avalon. Morgan tant for collaborative-design applications, where large Kaufmann, 1991. data regions are shared among engineers at different [Feeley & Levy 92] Feeley, M. J. and Levy, H. M. nodes, but where updates are sparse and infrequent. Distributed shared memory with versioned With log-based coherency, locking overhead can be re- objects. In Proceedings of the Confer- duced by using coarse-grain locks without increasing ence on Object-Oriented Programming Sys- coherency overhead; i.e., coarse-grain locks can sup- tems, Languages, and Applications, Octo- port fine-grain sharing. ber 1992. While log-based coherency provides an alternative [Gharachorloo et al. 90] Gharachorloo, K., Lenoski, to other DSM systems, the approaches are not mu- D., Laudon, J., Gibbons, P., Gupta, A., and tually exclusive. Our measurements show that appli- Hennessy, J. Memory consistency and event cation behavior determines the best approach; e.g., if ordering in scalable shared-memory multi- updates are highly clustered within a page, standard processors. In Proc. 17th Annual Sympo- DSM techniques will perform better, while for sparse sium on Computer Architecture, Computer updates, the log-based approach will perform better. Architecture News, pages 15-26. ACM, Therefore, adaptive hybrid approaches maybe be pos- June 1990. sible where application behavior can be predicted. [Hagmann 86] Hagmann, R. B. A crash recovery scheme for a memory-resident database sys- tem. IEEE Transactions on Computers, C- References 35(9):839-843, September 1986. [Butterworth et al. 92] Butterworth, P., Otis, A., and [Hornick & Zdonik 87] Hornick, M. F. and Zdonik, Stein, J. The GemStone object database S. B. A shared, segmented memory system management system. Communications of for an object-oriented database. A ACM the ACM, 34(10):64-77, October 1992. Transactions on Office Informations Sys- tems, 5(1), January 1987. [Carey et al. 93] Carey, M. J., Dewitt, D. J., and Naughton, J. F. The OO7 benchmark. 1993 [Hosking & Moss 93] Hosking, A. L. and Moss, J. ACM SIGMOD. International Conference E. B. Protection traps and alternatives for on Management of Data, 22(2):12-21, May memory managemnt of an object-oriented 1993. language. In Proceedings of the 14th ACM Symposium on Operating Systems Princi- [Carey et al. 94] Carey, M. J., Dewitt, D. J., and ples, December 1993. Franklin, M. J. Shoring up persistent ap- plications. 1994 ACM SIGMOD. Interna- [Janssens & Fuchs 93] Janssens, B. and Fuchs, W. K. tional Conference on Management of Data, Relaxing consistency in recoverable dis- May 1994. tributed shared memory. In Proceedings of the Twenty-Third Annual International [Carter et al. 91] Symposium on Fault-Tolerant Computing: Carter, J., Bennet, J., and Zwaenepoel, W. Digest of Papers, pages 155-163, June 1993. [Keleher et al. 92] [O'Toole et al. 93] O'Toole, J., Nettles, S., and Gi* *f- Keleher, P., Cox, A., and Zwaenepoel, W. ford, D. Concurrent compacting garbage Lazy release consistency for software dis- collection of a persistent heap. In Proceed- tributed shared memory. In Proceedings of ings of the Fourteenth ACM Symposium on the 19th Annual Symposium on Computer Operating Systems Principles, pages 161- Architecture, pages 13-21, May 1992. 174, December 1993. [Keleher et al. 94] Keleher, P., [Richard & Singhal 93] Richard, III, G. G. and Sing- Cox, A. L., Dwarkadas, S., and Zwaenepoel, hal, M. Using logging and asynchronous W. TreadMarks: Distributed shared mem- checkpointing to implement recoverable dis- ory on standard workstations and operat- tributed shared memory. In Proceedings of ing systems. In Proceedings of the Winder the 12th Symposium on Reliable Distributed 1994 USENIX Conference, pages 115-132, Systems, pages 58-67, October 1993. January 1994. [Rodeheffer & Schroeder 91] Rodeheffer, [Lamb et al. 91] Lamb, C., Landis, G., Orenstein, J., T. and Schroeder, M. D. Automatic recon- and Weinreb, D. The ObjectStore database figuration in autonet. In Proceedings of the system. Communications of the ACM, Thirteenth ACM Symposium on Operating 34(10):50-63, October 1991. Systems Principles, pages 183-197, October 1991. [Li & Hudak 89] Li, K. and Hudak, P. Memory co- [Rosenberg & Abramson 85] Rosenberg, herence in shared virtual memory systems. J. and Abramson, D. A. MONADS-PC: ACM Transactions on Computer Systems, A capability-based workstation to support 7(4):321-359, November 1989. software engineering. In Proceedings of the [Li & Naughton 88] Li, L. and Naughton, J. F. Mul- 18th Hawaii International Conference on tiprocessor main memory transaction pro- System Sciences, 1985. cessing. In Proceedings of the International [Satyanarayanan et al. 94] Satyanarayanan, Symposium on Databases in Parallel and M., Mashburn, H. H., Kumar, P., Steere, Distributed Systems, pages 177-187, De- D. C., and J.Kistler., J. Lightweight recov- cember 1988. erable virtual memory. ACM Transactions [Liskov 88] Liskov, B. Distributed programming in onaComputerrSystems,y12(4):33-57,1Febru-994. Argus. Communications of the ACM, 31(3):300-312, March 1988. [Stumm & Zhou 90] Stumm, M. and Zhou, S. Fault tolerant distributed shared memory algo- [Liskov et al. 92] Liskov, B., Ghemawat, S., Gruber, rithms. In Proceedings of the Second IEEE R., Johnson, P., and Shrira, L. Replica- Symposium on Parallel and Distributed tion in the Harp file system. In Proceedings Processing, pages 719-724, December 1990. of the Thirteenth ACM Symposium on Op- erating Systems Principles, pages 226-238, [Thekkath & Levy 94] Thekkath, C. and Levy, H. October 1992. Hardware and software support for efficient exception handling. In Proceedings of the [Moss 90] Moss, J. E. B. Design of the Mneme per- 6th International Conference on Architec- sistent object store. ACM Transactions on tural Support for Programming Languages Information Systems, 8(2):103-139, April and Operating Systems, October 1994. 1990. [Wu & Fuchs 90] Wu, K.-L. and Fuchs, W. K. Re- [Mueller 93] Mueller, F. A library implementation of coverable disributed shared virtual mem- POSIX threads under Unix. In Proceedings ory. IEEE Transactions on Computers, of the Winter 1993 USENIX Conference, 39(4):460-469, April 1990. pages 29-41, January 1993. [Zekauskas et al. 94] Zekauskas, M. J., [Neves et al. 94] Neves, N., Castro, M., and Guedes, Sawdon, W. A., and Bershad, B. N. Soft- P. A checkpoint protocol for an entry con- ware write detection for distributed shared sistent shared memory system. In Pro- memory. In Proceedings of the First Sym- ceedings of the 13th ACM Symposium on posium on Operating Systems Design and Principles of Distributed Computing, Au- Implementation, November 1994. gust 1994. [O. Deux 92] O. Deux. The O2 system. Communi- cations of the ACM, 34(10):34-48, October 1992.