Integrating  Coherency  and  Recoverability  in  Distributed  Systems

 Michael J. Feeley, Jeffrey S. Chase, Vivek R. Narasayya, and Henry M. Levy

            Department of Computer Science and Engineering, FR-35
                       University of Washington
                         Seattle, WA 98195
                {feeley,chase,nara,levy}@cs.washington.edu

                          Abstract     94], and Midway [Zekauskas et al.  94].  These DSM

    We propose a technique for maintaining coherency                  systems maintain the illusion of a single shared mem-
of a transactional distributed shared memory, used by                 ory by synchronizing data access and moving data be-
applications accessing a shared persistent store.  Our                tween nodes when required, transparently to the appli-
goal is to improve support for fine-grained distributed               cation.  DSM is useful in this context, because it sim-
data sharing in collaborative design applications, such               plifies programming of these distributed-parallel pro-
as  CAD  systems  and  software  development  environ-                grams.
ments. In contrast, traditional research in distributed                  Parallel programs are not the only applications that
shared memory has focused on supporting parallel pro-                 can  benefit  from  the  concept  of  distributed  shared
grams;  in  this  paper,  we  show  how  distributed  pro-            memory; DSM can be applied to other application do-
grams can benefit from this shared-memory abstrac-                    mains as well. Our goal is to support coherent virtual
tion as well.                                                         memory for programs that perform transactional  up-
    Our approach, called log-based coherency, integrates              dates to their shared memory space.  While this style
coherency support with a standard mechanism for en-                   of programming is not ordinarily seen in parallel pro-
suring  recoverability  of  persistent  data.   In  our  sys-         grams, it is standard for applications using a persistent
tem, transaction logs are the basis of both recoverabil-              store, a system that supports storage and retrieval of
ity and coherency.  We have prototyped log-based co-                  virtual memory data structures in disk files.
herency as a set of extensions to RVM [Satyanarayanan                    Our  work  explores  the  interactions  between  co-
et al.  94], a runtime package supporting recoverable                 herency and transactions.  We describe log-based co-
virtual memory.  Our prototype adds coherency sup-                    herency,  a  simple  technique  for  maintaining  consis-
port to RVM in a simple way that does not require                     tency of a transactional distributed virtual memory.
changes to existing RVM applications.  We report on                   The key idea behind log-based coherency is that log
our prototype and its performance, and discuss its re-                records used to support atomic transactions are also
lationship to other DSM systems.                                      used  as  the  basic  mechanism  for  updating  cached
                                                                      copies  of  the  data  on  peer  nodes.   This  unification
                                                                      of mechanisms allows us to add DSM support to sys-

1     Introduction                                                    tems that support persistence, without modifying ex-
                                                                      isting persistent programs and without adding signifi-
Existing  distributed  shared  memory  (DSM)  systems                 cant software overhead.
support parallel programming  on distributed-memory
multicomputers  and  workstation  networks.    Exam-                  1.1      Persistent Stores
ples  of  such  systems  include  IVY  [Li  &  Hudak  89],
Munin [Carter et al.  91], TreadMarks [Keleher et al.                 Persistent storage systems have evolved to meet the
   _____________________________                                      data  management  needs  of  design  applications,  in-
    This work is supported in part by the National Science            cluding electronic and mechanical CAD systems and
Foundation (Grants No. CDA-9123308 and CCR-9200832), the
Washington Technology Center,  Digital Equipment Corpora-             software  development  environments.   These  applica-
tion,  Boeing Computer Services,  Intel Corporation,  Hewlett-        tion  environments  consist  of  collections  of  programs
Packard  Corporation,  and  Apple  Computer.    Both  Feeley          that operate on persistent data structures represent-
and  Chase  have  been  supported  by  Intel  Foundation  Grad-       ing design artifacts and derived information.  Client
uate  Fellowships.   Chase's  present  address  is:   Department
of  Computer  Science,  Duke  University,  Durham,  NC  27706         programs navigate through the stored data by follow-
(chase@cs.duke.edu).                                                  ing pointers directly in virtual memory, with the sys-
tem moving data between memory and the persistent                     tributed file systems, the caches should be viewed as
store as needed.  A number of persistent stores have                  a distributed shared memory, since each client process

been built; some are research systems [Cockshott et al.               accesses the cache directly in its virtual address space.
84, Moss 90, Carey et al.  94] and others are commer-                 There are other important similarities with DSM sys-
cial object-oriented database (OODB) products (e.g.,                  tems. Fine-grained sharing is common in collaborative
[Butterworth et al.  92, Lamb et al.  91, O. Deux 92]),               design environments, where changes made by one en-
augmented with database features such as query pro-                   gineer must be quickly integrated into structures that
cessors, schema languages, and indexing facilities.                   are  readable  by  the  others.   The  caches  of  different
    To prevent failures from corrupting persistent struc-             clients  will  overlap  considerably  when  clients  access
tures,  updates  to  a  persistent  store  are  grouped  to-          the same artifacts, more so as physical memories grow
gether and applied "all or nothing" at commit points                  and data remains in client memory for longer periods.
designated by the program during its execution.  For                  For these reasons, DSM techniques such as fine-grained
our purposes,  a transaction  is a period of execution                client-client transfers are appropriate for maintaining
ending in a commit point. The key property of trans-                  coherency for cached persistent stores. In the past, dis-
actions  is  that  each  commit  point  atomically  enters            tributed persistent systems have supported coherency
a  set  of  updates  that  together  transform  the  store            in a manner similar to most distributed file systems:
from  one  durable  consistent  state  to  another.   Fail-           by reading and writing shared data through the server,
ures may cause some uncommitted updates to be lost,                   usually  in  fixed-size  blocks,  and  invalidating  cached
but  the  last  committed  state  can  always  be  recov-             blocks when another client acquires the file token or
ered.   Database  systems  typically  combine  these  re-             lock.
quirements of atomicity and durability with additional                   In addition, there is commonality between the key
assumptions  about  how  concurrent  transactions  are                implementation aspects for coherency and recoverabil-
synchronized,  but this is not essential to our notion                ity.  To implement either property,  the system must
of a transaction.                                                     capture  the  updates  made  by  the  application,  and
    Our work explores techniques for extending a per-                 propagate  those  updates  _  either  to  durable  mem-
sistent  store  to  allow  network  access.   For  example,           ory (e.g., disk) or to other memories in the network _
a  persistent  store  that  supports  design  applications            in such a way that all clients see a consistent view of
can be extended to allow a group of collaborating de-                 the data at all times. The key performance factors are
signers  to  run  CAD  tools  at  their  workstations,  ac-           the same:  (1) the cost of capturing updates, and (2)
cessing the shared store through the network.  In the                 the precision with which those updates are captured,
simplest configuration, updates are written atomically                which determines the amount of traffic to the disk or
to a centralized server that maintains the authorita-                 network.   This  synergy  can  be  exploited  when  both
tive copy of the data in the store.  Clients fetch data               properties are supported together, and optimizations
from  the  server  in  bulk,  cache  it  locally,  and  oper-         that improve the performance of one may also improve
ate directly on the cached image in virtual memory.                   the performance of the other.
In the database community this architecture is some-                     This paper develops a combined approach that in-
times called a client/server database.  We refer to it                tegrates coherency support with a mechanism for en-
as a cached persistent store since the ideas generalize               suring recoverability of persistent data. It differs from
to persistent virtual memory systems that do not pro-                 other DSM systems in that it accommodates a trans-
vide full database features.  Note that cached persis-                actional programming model and exploits the captur-
tent stores are distinct from transactional systems for               ing of updates for a transaction log.  Section 2 out-
reliable distributed programming (e.g., Argus [Liskov                 lines our approach and its variations, discusses some
88]  and  Camelot  [Eppinger  et  al.   91])  in  that  the           of the design issues, and sets our work in context with
database itself is not distributed; each transaction up-              other systems that support distributed database access
dates a cached image of a centralized database in vir-                and distributed shared memory.  Section 3 describes
tual memory on a single node.  In particular, there is                our prototype implementation, and Section 4 presents
no need for two-phase commit, since each transaction                  some performance results. In Section 5 we discuss ad-
commits or aborts within a single process.                            ditional related work, and we conclude in Section 6.


1.2      Combining  Coherency  and  Recov-                            2     Log-Based  Coherency  for  a  Cached

         erability                                                          Persistent Store

A  key  problem  for  cached  persistent  stores  is  main-           Log-based  coherency  is  an  extension  to  write-ahead
taining consistency of the data cached in memory on                   redo  logging,  a  common  mechanism  for  implement-
each  client.  While  these  systems  are  similar  to  dis-          ing atomic and durable transactions. This mechanism
works  by  recording  new  values  of  data  items  modi-                o  how updates are synchronized with client threads
fied by a transaction,  and writing them to a log on                        executing locally, and, in a recipient's cache.

durable memory when the transaction commits.  The
system  ensures  that  commits  are  atomic  by  writing                 In our prototype,  described in Section 3,  we have
the log before writing the updated objects back to the                made choices for ease of implementation and for the
permanent database file.  In the event of failure, a re-              workloads we expect.  However, our approach is flex-
covery procedure restores the database to a consistent                ible enough to accommodate variations.  The follow-
state by replaying the committed log records into the                 ing subsections discuss underlying concepts, the design
permanent  database  file.   Many  recoverable  systems               choices we made, and some alternative possibilities.
use write-ahead redo logging; for example, it is funda-
mental to the pin/update/log commit protocol used by                  2.1      The Role of Synchronization
Camelot [Eppinger et al. 91].
    When  multiple  clients  are  accessing  the  store               In  DSM  systems,  the  method  and  timing  of  update
through a network,  the log records are written to a                  propagation is closely tied to synchronization events.
logically centralized storage service that also holds the             The first DSM systems (Monads [Rosenberg & Abram-
permanent  database  file.   A  client  failure  aborts  all          son 85] and IVY [Li & Hudak 89]) used virtual page
uncommitted transactions executing in that client.  A                 protections and reference traps to capture updates and
server failure may abort uncommitted transactions in                  synchronize access to shared pages.  In these systems,
all clients and initiate the recovery procedure to bring              page-grain  locking  and  page-grain  coherency  led  to
the permanent database file to a consistent state, re-                performance problems caused by false sharing. Newer
flecting  the  committed  updates  made  by  all  clients.            high-performance DSM systems based on release con-
(Note that the storage service could be transparently                 sistency [Gharachorloo et al.  90] (e.g., Munin), lazy
replicated to reduce the probability of a server failure.)            release consistency [Keleher et al.  92] (TreadMarks)
    The key to log-based coherency is that the redo log               and  entry  consistency  (Midway)  have  reduced  this
generated by each client holds exactly the information                problem by supplying synchronization primitives that
needed  to  maintain  consistency  of  distributed  mem-              function independently of the coherency protocol, and
ory. The system need only transmit the committed log                  drive the propagation of updates.  These systems re-
tails to peer nodes that are sharing the modified ob-                 duce false sharing by allowing concurrent accesses to
jects; the recipients apply the log records they receive              a shared page. However, they assume a fully synchro-
to update their cached data images.  There is no ex-                  nized (properly labeled) application program;  that is,
tra runtime cost for collecting the information needed                the application must acquire and release the correct
to maintain coherency, since it is already collected to               locks at the correct times. For all of these systems, the
support recoverability.  Thus, a common implementa-                   coherency algorithm will fail for programs that make
tion technique for a recoverable store is extended to                 synchronization errors.
support coherency in a straightforward way.  In fact,                    Log-based coherency uses a similar approach resting
our approach can be viewed as transaction logging to                  on similar assumptions.  Our prototype supplies stan-
remote memory instead of (or in addition to) the disk.                dard mutex primitives that are acquired as a trans-
    As an example, suppose that several nodes are shar-               action  executes  and  released  at  transaction  commit
ing a persistent object, X; each has a copy of X cached               (we currently assume that transactions use strict two-
in its primary memory.  When one of the nodes exe-                    phase locking).  The system propagates updates only
cutes a transaction modifying X, the updates are per-                 after a writer has released all relevant locks at commit,
formed and logged locally; other nodes are neither up-                and it ensures that all relevant updates are applied lo-
dated nor invalidated at that time.  When the trans-                  cally before a reader is permitted to acquire a lock.
action is complete, the generated log record contains                 Our system differs from release consistency, and is sim-
exactly the information needed to transition to the cur-              ilar to entry consistency (Midway), in that coherency
rent state of X  from its initial state.  Thus, to bring              operations initiated by a lock operation are restricted
the other nodes up to date, it is sufficient to propagate             to the data under the scope of that lock.  The store
this portion of the log to the peer nodes.                            is partitioned into segments, each under the control of
    Within this broad framework, systems that use log-                a separate lock.  Segments can be large or small, pre-
based  coherency  could  vary  in  a  number  of  details,            senting  an  obvious  tradeoff  between  synchronization
including:                                                            overhead and false lock conflicts.  We expect locks to
                                                                      be relatively coarse-grained; most commercial persis-
    o when log records are transmitted to peer nodes,                 tent stores support coarse-grained locking as natural

    o which peers receive the log records,                            for collaborative design environments.  Regardless of
                                                                      the  locking  grain,  the  updates  made  while  a  lock  is
    o how updates are applied to a recipient's cache,                 held often involve only a small number of the bytes
controlled by that lock.  In fact, we expect that most                larly on networks with point-to-point links that do not
updates will be fine-grained (e.g., "change this and to               support broadcast or multicast.

an or"), but that read operations will consume large                     We believe that alternative policies could be imple-
amounts  of  data  for  input  to  functions  such  as  de-           mented transparently to the application by replacing
sign analysis or graphical display. For this reason, log-             the synchronization primitives and their embedded co-
based  coherency  separates  the  coherency  grain  from              herency code.  For example, the system could propa-
the  synchronization  grain:  the  updates  sent  to  peer            gate segment updates lazily, using an embedded call
nodes are determined not by the locks acquired, but                   in the acquire primitive to fetch and apply pending
by the values logged at transaction commit.                           log records.  Segment updates could be fetched from
    Transaction log records include information about                 the server, where all log records are cached in mem-
the locks acquired during the transaction.  Each lock                 ory  for  a  time,  or  passed  with  the  lock  by  the  last
has  a  unique  lock  number  and  a  sequence  number                writer.  In Midway, for example, the acquire primi-
(timestamp)  that  is  incremented  on  each  acquire.                tive retrieves current versions of any stale objects in
When a transaction acquires a lock, its log record is                 the requester's cache from the current lock holder. The
tagged with the sequence number and lock number of                    acquire request includes a timestamp for the last time
that lock. The synchronization primitives contain em-                 the lock was held by the requester; the lock holder de-
bedded calls to logging routines to generate these tags.              termines which modified objects to return by compar-
The tags are used to determine which nodes must re-                   ing the timestamp against a modification timestamp
ceive the log records and when they must be applied.                  for each object under the lock's scope.
For example, updates for a segment need only be sent
to nodes that have previously acquired the lock for the                  For  log-based  coherency,  lazy  update  propagation
segment.  The sequence numbers are also used to en-                   raises the question of how to determine when pending
sure that all relevant data has been rendered coherent                log records are no longer needed by peer nodes and
before an acquire can succeed,  and to preserve the                   can be discarded.  One solution is to pass information
global ordering of updates from multiple nodes during                 about how many log records to hold for each segment
recovery, as described in Section 3.4.                                along with the lock token, as each node acquires the
    We believe that alternative synchronization models                lock in turn.  Each node holds all log records up to
can be implemented with a modest effort, by replacing                 and including the oldest records needed by the most
the synchronization primitives and the calls they make                out-of-date peer. Acquiring a lock brings new records,
to underlying logging and coherency routines. Several                 as well as an opportunity to discard records held lo-
relaxed models have been proposed by researchers ar-                  cally.  Note that discarding pending update records is
guing that strict serializability is inappropriate for de-            not a concern with Midway's lazy propagation scheme,
sign transactions (e.g.,  [Hornick & Zdonik 87]).  We                 because no log records or captured updates are held;
are exploring a read/write model that permits readers                 nodes do not save intermediate versions of objects that
to operate on a previous consistent version of the data               are modified repeatedly. In our case, these log records
while an update is in progress elsewhere; readers use                 must be held in order to support the nonserializable
an accept primitive to explicitly signal their willing-               read/write model described above.  That is, a reader
ness to move forward to a newer consistent version. In                may acquire a previous version even if uncommitted
this scheme, pending log records must be buffered in                  writes are present in the writer's cache;  the reader's
the recipient until they can be applied. We have used                 cache  must  be  updated  to  reflect  the  previous  com-
a similar version-based consistency model in the past                 mitted version.
for a range of parallel applications [Feeley & Levy 92].


2.2      Propagating Log Records

Our prototype uses a simple eager policy for propa-                   2.3      Summary
gating updates.  At each commit point, after the log
records have been written to the storage server,  the                 In this section we presented an overview of log-based
in-memory copies of the log records for each segment                  coherency. Built-in synchronization primitives are fun-
are eagerly propagated to all clients that have recently              damental to our approach. Alternative coherency pro-
acquired the locks for the modified segments.  We use                 tocols _ rules for propagating, applying, and releas-
eager updates because they are simple (i.e., no buffer-               ing log records _ can be realized by embedding calls
ing of log records), they are tolerant of client failures,            to logging routines in the synchronization primitives.
and they reduce the latency of data access on a client.               The locking routines can collect and maintain infor-
However, eager updates may increase network traffic                   mation about which peers must receive a given set of
and cannot scale to large numbers of clients, particu-                updates, and when.
3     Prototype Implementation                                        3.1      Capturing Updates

                                                                      Both  DSM  and  recoverable  systems  must  determine
To  experiment  with  our  approach,  we  prototyped                  which bytes are modified by the application using ei-
a  simple  implementation  of  log-based  coherency  by               ther VM write faults, a write barrier inserted by the
adding  distribution  support  to  CMU's  Recoverable                 compiler, or explicit calls from the application.  Like
Virtual  Memory  (RVM)  package  [Satyanarayanan                      the RVM system, our prototype assumes explicit run-
et  al.   94].   RVM  is  a  logging  facility  that  supports        time  calls  to  a  set_range  procedure  in  the  runtime
transactional update and recovery of virtual-memory                   package. A call to this function indicates an intent to
resident  data  structures.   RVM  is  designed  to  be  a            update a particular range of bytes.  We expect that
lightweight  and  portable  package  for  use  with  small            this range corresponds to an object, and that the call
databases that easily fit in physical memory: an RVM                  is made by code generated explicitly by the language
client copies the entire database into virtual memory                 compiler (such as the ML compiler that has been used
when it starts up. This avoids the need to pin modified               with  RVM  [O'Toole  et  al.   93]).   In  contrast,  most
pages, but for large databases it causes double paging                DSM systems use virtual page access faults to captue
and unnecessary pageouts of clean pages by the vir-                   updates, though software-based write detection is also
tual memory system. This limits RVM's usefulness for                  used in Midway [Zekauskas et al.  94].  This issue is
the collaborative design environments that interest us;               discussed in more detail in Section 4.
however, it is an expedient vehicle for experimenting                    RVM  coalesces  modified  ranges  that  are  adjacent
with log-based coherency.                                             or  overlapping  in  order  to  avoid  writing  redundant

    In keeping with its minimalist philosophy, RVM does               bytes to the disk log.  Overlapping ranges,  however,
not support or rely upon any particular synchroniza-                  are unlikely when calls to set_range are generated by
tion scheme. Though updates are transactional, multi-                 a  compiler,  as  is  anticipated.   To  improve  common-
threaded updates may or may not be serializable. In a                 case performance, we modified set_range to coalesce
similar spirit, our RVM-based prototype separates the                 only when there is an exact match with a previously
synchronization aspects of coherency and recoverabil-                 added range.  Thus, objects that are modified multi-
ity from the mechanisms for collecting,  propagating,                 ple times during a transaction are still coalesced but
and  applying  redo  log  records.  Our  intent  is  to  ac-          with a simpler and more efficient implementation; this
commodate  various  policies  for  propagating  updates               reduces set_range overhead by a factor of five. The re-
as part of a synchronization scheme "plugged in" to                   maining per-update overhead is dominated by search-
RVM.                                                                  ing the binary tree that stores modified ranges (in or-
                                                                      der by their address).  As a second optimization, we
    We use RVM as a client/server distributed database                avoid this search in the special case where a sequence
by placing the transaction log files and the database                 of set_range calls is ordered by address.
file on a central NFS server. Clients maintain caches of
the central database in their local virtual memories by
reading the entire database into memory at startup (as                3.2      Propagating Log Records
in centralized RVM), in this case using the NFS pro-                  In response to set_range calls, the RVM runtime li-
tocol.  As write transactions execute, RVM produces                   brary builds a data structure that describes the modi-
redo log records that are written to the NFS server.                  fications made by a transaction. When the transaction
Log-based  coherency  is  implemented  by  sending  the               commits, this data structure is used to build I/O vec-
log tails to other nodes that have the region mapped,                 tors for the Unix   writev system call.  The writev
where they are applied to each recipient's cache. Each                causes  the  new  values  of  all  modified  objects  to  be
node recording such a transaction produces a separate                 copied from virtual memory to a system buffer for writ-
log.  We added an RVM utility that merges these into                  ing to disk.  RVM thus avoids building an object log
a single log for recovery (see Section 3.4).                          in virtual memory that contains copies of the modified

    Our application interface is summarized in Table 1.               objects.
The left-hand column describes the procedures the ap-                    We modified the RVM commit procedure to broad-
plication uses to initialize, begin, and commit a trans-              cast the same new-value information that is written to
action; to acquire a segment lock; and to describe the                disk.  Coherency data is broadcast using TCP/IP by
data that is modified by the transaction.  The right-                 issuing a writev system call for each node that has the
hand column shows the RVM calls made by each of                       current region mapped.  We use the OSF/1 PThreads
these procedures.  We added a new procedure, called                   facility [Mueller 93] to create receiver threads for each
rvm_setlockid_transaction,  to  the  standard  RVM                    communication  channel  that  connects  a  node  to  its
interface.   This  procedure  is  called  by  the  acquire            peers. These threads block in readv system calls wait-
primitive, as described in Section 3.3.                               ing for coherency messages and applying the updates
     _______________________________________________________________________________________________________________________________
     |__Log-Based_Coherency_Operation__|_______________RVM_Routine_Called_______________________________________________|___________

     |  Trans.Init()                                 | tid=rvm_malloc_tid()                                                   |
     |  Trans.Begin(rvm_mode)                  |       rvm_begin_transaction(tid, rvm_mode)                              |
     |  Trans.Commit(rvm_mode)               |         rvm_end_transaction(tid, rvm_mode)                                |
     |  Trans.Acquire(lock)                       |    rvm_setlockid_transaction(tid, lock.lockId, lock.sequenceNum)  |
     |__Trans.SetRange(addr,_size)______________|______rvm_set_range(addr,_size)______________________________________________|_____

                                            Table 1: Log-based coherency interface.


when they arrive.                                                     rent sequence number is passed along with the lock to-
    The  format  of  the  coherency  records  data  differs           ken when ownership of the lock changes. The acquire
from the data sent to disk in two respects. First, some               primitive calls RVM to associate the lock with the cur-
records that are needed only for recovery and log trim-               rent transaction.  We added a new procedure to the
ming are not included in the broadcast data; i.e., only               RVM  interface,   rvm_setlockid_transaction(tid,
new-value range records are needed for coherency. Sec-                lockId,  sequenceNum),  for  this  purpose.   Because
ond, the header information for each range record is                  locks follow a strict two-phase protocol,  each lock is
compressed from 104 bytes to between 4 and 24 bytes.                  acquired at most once during a transaction.
The standard RVM header contains fields that are not
needed  for  coherency;  our  header  contains  only  the
range's type, address, and size.  The header is further               3.4      Preserving Ordering
compressed when ranges are small (less than 4 Kbytes)                 While  synchronization  is  mostly  independent  of  co-
or close together (fewer than 256 Kbytes apart) by us-                herency and recovery,  there is an important interac-
ing smaller fields and by replacing the range's address               tion.  When a transaction commits, the lock informa-
with its offset from the preceding range; our modified                tion provided to RVM is used to generate lock records
set_range orders modified ranges by their address.                    that are inserted in the log entry for the transaction.

                                                                      Lock records contain the lock identifier and sequence

3.3      Synchronization                                              number.  These lock records are used by both the co-
                                                                      herency and recovery code to preserve the ordering of
We  added  distributed  locks  that  provide  mutually-               updates generated by different nodes.
exclusive  access  to  non-overlapping  portions  of  an                 For  coherency,  the  lock  records  ensure  that  log
RVM region. Locks are acquired inside of transactions                 records applied by a receiver thread are properly in-
in a two-phase manner; all locks are automatically re-                terleaved  with  those  sent  by  other  nodes.   The  lock
leased when a transaction commits. Applications must                  records included with a received update are used to
acquire a lock before reading or writing the data pro-                set the receiver's local sequence number for the lock.
tected by that lock.                                                  If  necessary,  receiver  threads  hold  log  records  until
    The lock implementation is token based with a cen-                the  updates  for  the  immediately  preceding  sequence
tralized lock manager and a distributed waiter queue,                 number have been applied.  Also, the sequence num-
an approach used in many distributed systems includ-                  ber must match the sequence number passed with the
ing TreadMarks.  At all times, exactly one node owns                  lock  token  before  the  lock  can  be  acquired  on  that
the lock token. The lock can be acquired on that node                 node.   An  application  that  tries  to  acquire  the  lock
without remote communication; nodes hold onto the                     prematurely will wait on a condition variable until sig-
token until they are requested to pass it to another                  naled by the receiver thread that applies the latest up-
node.  Acquire operations at other nodes send a mes-                  date.  This interlock between coherency and synchro-
sage  to  the  lock  manager  to  request  the  lock.   The           nization  is  necessary  because  updates  are  broadcast
node number of the lock manager is determined from                    asynchronously; the lock token could arrive at a node
the lock identifier number.  The manager maintains a                  before all necessary coherency updates have been re-
distributed queue of nodes waiting to acquire the lock.               ceived and applied.  For example, if the token passes
It adds the requester to the tail of the queue and for-               in order through nodes A, B, and C, C might receive
wards its request to the previous tail which responds                 the token from B before A's updates arrive at C. The
either by sending the lock token to the requester,  if                protocol ensures that, for updates made to data pro-
the lock is available, or by queueing the request until               tected by the same lock, (1) none of B's updates are
the lock is released.                                                 applied at C until C has applied A's updates and (2)
    Each  lock  has  a  unique  identifier  and  a  sequence          C is not allowed to acquire the lock until B's updates
number that is incremented on each acquire. The cur-                  have been applied.
    For recovery, lock records are used to merge the in-              different  traversals,  updates,  and  queries  of  a  syn-
dividual RVM redo logs produced by each node. When                    thetic object-oriented database. The database and the

applications share a segment, their logs may record in-               traversal tests are intended to be suggestive of typical
terleaving updates to the same data. Thus, before any                 engineering database applications.
of these logs can be used by the standard RVM recov-                     We modified OO7 to run with RVM in standard vir-
ery procedure, they must be merged into a single log.                 tual memory (i.e., no OODB) and integrated it with
We built a new RVM utility to do this.  Our merge                     our log-based coherency prototype. We report timings
utility reads input logs from head to tail; transaction               and related overheads for several OO7 traversals us-
records from different logs are compared by comparing                 ing our prototype. These experiments were conducted
their lock records.  Our current merge utility exploits               using two Digital 3000-400 Alpha APX (133 Mhz, 74
the fact that our current synchronization model guar-                 SPECints) workstations with 8-Kbyte pages and sep-
antees strictly serializable transactions.  When merg-                arate 512-Kbyte direct-mapped instruction and data
ing records from different logs, it is sufficient to order            caches  [Dig  92].   The  machines  are  connected  by  a
transactions so that if two transactions acquired the                 100  Mbit/s  AN1  network,  an  experimental  switch-
same lock,  the transaction with the earlier sequence                 based network capable of sending message packets of
number for that lock is ordered first.                                up to 64 Kbytes [Rodeheffer & Schroeder 91]. Elapsed

                                                                      time  measurements  were  taken  using  the  Alpha  cy-

3.5      Distributed Log Trimming                                     cle counter.  For these experiments, we disabled RVM
                                                                      disk logging so that we could isolate the costs associ-
The current RVM log-trimming algorithm has an un-                     ated with coherency from the synchronous disk writes
fortunate interaction with our distribution implemen-                 needed to support recovery. This is important in part
tation. In the current version of RVM, log records are                because optimizations using non-volatile RAM can be
trimmed from the head of the log using the standard                   used  to  eliminate  synchronous  disk  writes  from  the
recovery procedure to compute a new checkpoint. This                  commit critical path [Hagmann 86].
operation,  which  runs  asynchronously  with  normal                    The figures also depict estimated lower bounds for
transactions, is triggered when the number of records                 alternative  DSM  implementations  that  use  page  ac-
in the log reaches a high-water point. In our system, it              cess  faults  to  capture  updates.    For  log-based  co-
is no longer possible to use the log generated by a single            herency,  updates  are  captured  by  calls  to  the  RVM
node to compute a new checkpoint; instead, log records                set_range  procedure.   These  calls  are  coded  explic-
from all nodes must first be merged. Our current pro-                 itly  in  our  OO7  benchmarks,  although  other  RVM
totype performs log trimming offline using the recovery               applications have used compiler-generated set_range
procedure described above. Online trimming could be                   calls [O'Toole et al.  93].  In contrast, most DSM sys-
implemented using the merging procedure by coordi-                    tems use page access faults to capture updates without
nating checkpointing; one node would checkpoint at a                  involvement from the compiler or application.  Early
time, broadcasting to other nodes when done to inform                 page-locking DSM systems, such as Monads and IVY,
them of their new log head.                                           use page access faults to grant a writer exclusive access
    An improved log-trimming scheme for RVM is de-                    to a page while updates are in progress, then trans-
scribed in [Satyanarayanan et al.  94].  In this scheme,              mit the entire contents of each modified page to other
nodes checkpoint a page at a time by writing the cur-                 nodes  accessing  the  page.   In  newer  multiple-writer
rent  version  of  a  page  to  the  checkpoint  file.   Log          "copy/compare"  DSM  systems,  such  as  Munin  and
records for updates made to a page before it was check-               TreadMarks,  updates  are  also  detected  a  page  at  a
pointed can be discarded.  This checkpointing scheme                  time, but multiple nodes are permitted to make con-
could be more easily incorporated into our prototype,                 current non-conflicting updates to any given page.  In
because it does not require logs to be merged.                        these  systems,  the  first  store  to  a  unmodified  page

                                                                      on each node delivers a write-access fault to the co-

4     Performance                                                     herency software, which makes a copy of the page be-
                                                                      fore enabling write access.  Updates are collected by

    This section presents performance measurements of                 later comparing the modified page with its copy.  The
our prototype.  While most DSM systems have used                      copy/compare  technique  could  improve  performance
parallel programs for evaluation, our application do-                 for  some  OODBs  that  use  page-grained  locking  and
main  _  collaborative  design  applications  accessing               updates today.
a  distributed  persistent  store  _  requires  a  different             Our lower-bound estimates for page-locking (labeled
benchmark.  For this reason,  we have chosen to use                   Page)  and  multiple-writer  (labeled  Cpy/Cmp)  DSM
OO7  [Carey  et  al.   93],  a  standard  object-oriented             systems are computed from the measurements listed
database  benchmark.   OO7  consists  of  a  number  of               in Table 2. We measured the cost of using the OSF/1
                        ________________________________________________________________________________________
                                          Operation                 |        Cost      |    Throughput   |
                        ||_______________________________________________|(~sec/page)__|_____(MBytes/s)_____|___

                        |  page copy (cold cache)                  |         171.9      |          43          |
                        |  page copy (warm cache)                |           57.8       |         135          |
                        |  page compare (cold cache)             |           281.0      |          28          |
                        |  page compare (warm cache)            |            147.3      |          53          |
                        |  page send (TCP/IP)                    |           677.0      | 12 (96.8 Mbit/s)  |
                        |__handle_signal_and_change_protection__|____________360.1______|_______________________|

                                     Table 2: Operation costs (per page) on Alpha/AN1.


mprotect system call to change page protection,  by                   ite parts are visited.  When a composite part is vis-
storing to a read-only page to generate a protection                  ited,  its atomic-part graph is traversed and updates
fault, delivering the signal to a user-level procedure,               are performed. There are three variants of the traver-
calling mprotect again to enable writing, and return-                 sals (A, B, and C) that update different numbers of
ing  from  the  signal  handler.   We  also  measured  the            atomic  parts.   In  A,  one  atomic  part  per  composite
time to copy and compare pages (we use the cold cache                 part is modified; in B, every atomic part is modified;
times in the figures). For page-grain DSM (Page), up-                 and  in  C,  every  atomic  part  is  modified  four  times.
dated pages are not copied or scanned, so we assume no                An atomic part is updated by changing an eight-byte
collection overhead.  However, network I/O overhead                   field.  The difference between T2 and T3 is that T3
for Page is greater because entire pages are transmit-                updates the atomic part's index field.  Each time this
ted instead of just the modified bytes.  This time is                 field is changed, the part index is updated by deleting
estimated from the measured TCP throughput given                      the index entry for the old value and adding an entry
in Table 2. Communication overhead for Cpy/Cmp is                     for the new value.  This results in an average of seven
assumed to be the same as the measured times for log-                 index updates for each atomic-part update.
based coherency (labeled Log),  since both send only                     We added a third update traversal, called T12. T12
the modified bytes.  Both Log and Cpy/Cmp also in-                    differs from the other update traversals in that it per-
cur overhead at the receiver to apply  the updates to                 forms a sparse traversal of the database. It is similar to
the cache; however, this cost is too small to be clearly              read-only traversal T6; for each composite part it vis-
distinguished in any of the graphs below.                             its only one atomic part.  In T12, a higher percentage
                                                                      of overall running time is related to updating objects.

4.1      The OO7 Benchmark                                            Thischighlightsothehperformanceecostsrassociatedewithncy over*
 *head.

The  OO7  database  is  composed  of  a  design  library                 In our RVM-based OO7 benchmark,  the database
and an assembly hierarchy.  The design library, which                 elements  are  heap-allocated  C++  objects  and  a
makes up the bulk of the database, is a set of 500 com-               threaded AVL-balanced tree is used for the part in-
posite parts. A composite part corresponds to a design                dex.   The  atomic  parts  associated  with  a  particular
primitive, such as a register cell in a VLSI CAD ap-                  composite part tend to be clustered on the same page
plication or a procedure in a CASE application. Each                  while atomic parts from different composite parts are
composite part is itself made up of a graph of 20 atomic              usually on different pages.  We ran each of the OO7
parts.   Composite  and  atomic  part  objects  are  each             update traversals under our prototype. Each test con-
roughly 200 bytes long. Each atomic part contains an                  sists of a single transaction (and a single segment lock)
index  field;  a  part  index  is  maintained  using  a  self-        in which one node performs the traversal and another
balancing tree.                                                       receives the log tail and installs the updates, bringing
    Objects  in  the  assembly  hierarchy  correspond  to             its copy of the database up to date.
higher-level design constructs in the application.  The
assembly hierarchy is a tree of assembly objects with                 4.2      Results
729 leaf nodes.  Each leaf node, called a base assem-
bly, points to three composite parts that are chosen at                  Figures 1, 2, and 3 show the results of running the
random from the design library when the database is                   various OO7 traversals, along with the associated co-
constructed.                                                          herency overhead.  Write overhead consists of detect-
    There  are  two  traversals  in  OO7  that  update  the           ing and collecting updates and the network I/O costs
database, T2 and T3.  Both traverse the assembly hi-                  of transmitting those updates to the other node.  Ta-
erarchy to visit all of the composite parts pointed to                ble 3 summarizes the characteristics of these traversals,
by each base assembly; thus, a total of 2187 compos-                  listing the number of updates performed by each, the
                              _____________________________________________________________________________
                                 Traversal  |   Updates  |       Bytes    |   Message  |       Pages    |
                              ||________________|____________|_Updated__|_______Bytes____|___Updated__|____

                              |  T12-A       |       2,187  |        4,000  |      6,000  |          500  |
                              |  T12-C       |       8,748  |        4,000  |      6,000  |          500  |
                              |  T2-A        |       2,187  |        4,000  |      6,000  |          500  |
                              |  T2-B        |      43,740  |      80,000  |     120,000  |          618  |
                              |  T2-C        |    174,960  |       80,000  |     120,000  |          618  |
                              |  T3-A        |      16,924  |      31,300  |      39,000  |          552  |
                              |  T3-B        |    248,632  |      114,650  |     163,300  |          667  |
                              |__T3-C________|__1,502,708__|______115,100__|_____163,800__|__________670__|_

                                 Table 3: Summary of OO7 update-traversal characteristics


                                                                      number of unique bytes updated, the number of bytes
                                                                      sent over the network, and the number of pages up-
                                                                      dated.  (The difference between the number of bytes
                                                                      updated and the number transmitted is due to range-
                                                                      message overhead; each range is preceded by a header
                                                                      that describes the address of the range and its size.)
                                                                         Based  on  the  Alpha-OSF/1  measurements,  our
                                                                      analysis  shows  that  for  the  anticipated  application
                                                                      workload,  in  which  updates  affect  a  large  number
                                                                      of  pages,  software  write  detection  (e.g.,  compiler-
                                                                      generated set_range calls) performs better than any
                                                                      form  of  hardware-based  write  detection.    Our  ap-
Figure  1:  OO7  Sparse-update  traversals  T12-A  and                proach performs significantly better for traversals T12-
T12-C.                                                                A, T12-C, T2-A, and T3-A because they perform rela-

                                                                      tively few updates per page. Traversals T2-B and T2-
                                                                      C perform 71 and 283 update per page respectively; for
                                                                      these traversals, our approach performs about as well
                                                                      as  Cpy/Cmp.   However,  the  index-update  traversals
                                                                      T3-B and T3-C perform significantly more updates per
                                                                      page, 373 and 2,243; as a result our approach performs
                                                                      poorly. This shows that when there are many updates
                                                                      per page,  a page-based systems such as TreadMarks
                                                                      is preferred to a software-based approach such as log-
                                                                      based coherency.


Figure 2: OO7 Full-update traversals T2-A, T2-B, and                  4.3      Analysis
T2-C, and index-update traversal T3-A.                                There is a performance tradeoff between the three ap-

                                                                      proaches evaluated above that is a function of (1) the
                                                                      number of modified bytes per page and (2) the num-
                                                                      ber of individual updates per page.  The overhead of
                                                                      Log and Cpy/Cmp depends on the number of modified
                                                                      bytes, Cpy/Cmp and Page depend on the number of
                                                                      modified pages, and Log alone depends on the num-
                                                                      ber of updates per page.  Which approach works best
                                                                      depends on the workload.  As we have seen, log-based
                                                                      coherency is preferred when there are few updates per
                                                                      page;  similarly,  Page  performs  best  when  most  of  a
                                                                      page is modified.  Figures 4-7 below show where the
                                                                      breakeven points occur.
Figure  3:   OO7  Index-update  traversals  T3-B  and                    Figure 4 shows coherency overhead as a function of
T3-C   .                                                              the number of modified bytes per page. This compares
                                                                      the per-byte overhead of log-based coherency with the
                                                                      total overhead of copy/compare and page approaches.

                                                                      When  more  than  1037  bytes  are  modified  per  page,
                                                                      Page outperforms Cpy/Cmp. When there are few up-
                                                                      dates per page,  Log outperforms the alternatives no
                                                                      matter how many bytes are modified.  However, this
                                                                      graph does not include the per-update overhead asso-
                                                                      ciated with Log; this is shown in Figures 5-7.
                                                                         Figures 5 and 6 show the log-based coherency over-
                                                                      head associated with detecting and collecting a single
                                                                      update as the number of updates per transaction in-

Figure 4:  Comparison of overhead as the number of                    creases.   This  is  a  measurement  of  the  performance
modified  bytes  per  page  increases.   For  log-based  co-          of RVM operations set_range and rvm_commit.  The
herency, per-update overhead is not included.                         middleselinequisenthececostofforseant_updaterainngane ordered*
 *calls (taking advantage @

                                                                      optimization described in Section 3.1). The lower line
                                                                      is the cost of detecting an update to a range that was
                                                                      modified previously in the same transaction.
                                                                         Figure 7 shows the breakeven point at which log-
                                                                      based coherency and Cpy/Cmp have equivalent per-
                                                                      formance. For a given average per-update cost on the
                                                                      x-axis, the y-axis shows the maximum number of up-
                                                                      dates per page possible before Cpy/Cmp outperforms
                                                                      log-based coherency. For example, using Figures 5 and
                                                                      7, we can determine that if there are 1000 updates per
                                                                      transaction, log-based coherency performs better when
                                                                      there are 45 or fewer updates per page (55 if the up-

Figure 5: The overhead associated with a single update                dates are ordered).
for log-based coherency as the number of updates per                     Recent work [Thekkath & Levy 94] has shown an
transaction increases.                                                order-of-magnitude  reduction  in  exception-handling
                                                                      cost, which would make hardware-based write detec-
                                                                      tion  more  attractive.   Figure  7  shows  how  the  per-
                                                                      formance tradeoff would be affected if signal handling
                                                                      overhead were 10 ~sec instead of the 340 ~sec mea-
                                                                      sured for Alpha-OSF/1.


                                                                      4.3.1     Increasing the Number of Nodes

                                                                      Another important performance concern for log-based
                                                                      coherency  is  the  effect  of  increasing  the  number  of
Figure 6:  Log-based coherency update overhead up to                  nodes  using  a  segment  to  be  kept  coherent.   In  our
300,000 updates per transaction.                                      prototype the network I/O overhead of the writer in-

                                                                      creases  linearly  with  the  number  of  peer  nodes,  be-
                                                                      cause the writer node issues separate writev system
                                                                      calls for each peer.  Since this overhead is relatively
                                                                      small, our approach will scale to a moderate number
                                                                      of nodes. Systems with a very large number of clients
                                                                      will  perform  better  with  multicast  hardware  or  lazy
                                                                      coherency.


                                                                      4.3.2     Impact of Coherency on Recoverability
Figure 7: Breakeven point for log-based coherency. For
different update overheads, the number of updates per                 Finally,  we  wanted  to  determine  the  impact  of  our
page at which log-based coherency performs better than                modifications on the performance of standard RVM.
Cpy/Cmp.                                                              Figure  8  shows  four  measurements  of  the  coherency
                                                                      and  recoverability  overheads  for  the  T12-A  bench-
                                                                      a  main-memory  database  [Li  &  Naughton  88].   The
                                                                      standby keeps a complete copy of the database in its

                                                                      memory and receives updates from the primary in the
                                                                      form of log records sent at commit points.  Their pur-
                                                                      pose  is  to  allow  checkpointing  to  take  place  in  the
                                                                      standby, off-line, without interfering with clients exe-
                                                                      cuting on the primary copy of the database. Delis and
                                                                      Roussopoulos conducted a simulation study of client-
                                                                      server relational databases using a log-based approach
                                                                      for updating client caches [Delis & Roussopoulos 92].
                                                                      Updates are centralized at the server; the server main-
Figure 8: Comparison of log-based coherency, disk log-                tains a recovery log and a separate update log. Before
ging, and standard RVM. Optimized RVM is standard                     a client accesses a data item in its cache, it first con-
RVM with our optimizations to set_range.                              tacts the server to retrieve log records generated since
                                                                      the cached copy of the item was last updated.  While
                                                                      several of these systems send log records from clients
mark.  The first column is the overhead for log-based                 to one or more servers, log-based coherency sends log
coherency presented in the previous section. The next                 records from client to client.
column  measures  the  overhead  when  disk  logging  is
enabled;  from  this  we  see  that  the  only  additional               Several  groups  have  integrated  coherency  and  re-
overhead is due to writing the log tail to disk.  The                 coverability  by  adding  checkpointing  to  page-based
third and fourth columns were taken using standard                    DSM,  without  using  transactions.   This  allows  non-
RVM, without our log-based coherency modifications;                   transactional  applications  such  as  parallel  programs
the third column is standard RVM with our optimized                   to be made recoverable.  The main issue is to coordi-
set_range  modification.   This  shows  that  the  over-              nate individual node checkpoints to attain a consistent
head added to RVM by log-based coherency is directly                  global checkpoint, using a combination of dependency
related to sending the modified bytes to peer nodes.                  tracking, message logging, and replication.  The first
This validates our assertion that there is a high degree              such system is due to Wu and Fuchs [Wu & Fuchs 90].
of overlap between the mechanisms for recoverability                  Stumm and Zhou describe a system that tolerates the
and coherency.                                                        failure of a single node by ensuring that every page
                                                                      resides  in  the  caches  of  at  least  two  nodes  [Stumm
                                                                      &  Zhou  90].   Richard  and  Singhal  use  page  logging

5     Related Work                                                    [Richard & Singhal 93]; a copy of a page is written to
                                                                      a local volatile log each time it is acquire for reading

In  the  context  of  RVM,  transaction  logs  have  been             or writing. A node writes its volatile log to disk before
used in a similar fashion to propagate updates between                transferring a modified page to another node.
data  spaces  in  a  system  with  concurrent  replicating               Janssens and Fuchs added checkpointing to relaxed-
garbage  collection  [O'Toole  et  al.   93].   Our  contri-          consistency DSM [Janssens & Fuchs 93].  Instead of
bution is to use this idea for maintaining coherency of               requiring checkpointing or other recoverability actions
client database caches. Log propagation has been used                 each time an application gains access to a page, their
to  maintain  the  consistency  of  replicas  in  replicated          system checkpoints only when a node releases or ac-
database systems. Replicated systems differ from log-                 quires a lock.
based coherency in that the replicas are used to ensure
availability  of  the  data.   In  replicated  systems,  log             Neves  et  al.  added  checkpointing  to  an  entry-
records flow in one direction:  from clients to servers               consistent  DSM  system  similar  to  Midway,  using
(or  from  a  client  to  a  server,  and  then  on  to  other        object-grain locking [Neves et al.  94].  Their system
servers).  Servers cooperate to ensure that log records               tolerates single node failures by keeping old versions of
are applied in a consistent order at all locations. In our            modified objects in the volatile memory of the nodes
system, the replicas are caches that keep only enough                 that  modify  them.   Each  node  checkpoints  indepen-
data to meet the local client's needs;  update propa-                 dently; a failed node recovers by replaying its execu-
gation may be delayed until a client requests the new                 tion starting with its most recent checkpoint.  Infor-
data.                                                                 mation recorded at other nodes is sufficient for them
    Harp [Liskov et al.  92] file servers log received up-            to supply the recovering node with the same version of
dates to peer servers in order to remove stable storage               objects as it saw during normal execution. The impact
writes from the commit path. Similarly, Naughton and                  on failure-free execution is minimized by piggy-backing
Li have used log propagation to keep a hot standby of                 recoverability information on coherency messages.
6     Conclusion                                                                   Implementation and performance of Munin.
                                                                                   In Proceedings of the 13th ACM Symposium

The key points of this paper are: (1) DSM techniques                               on Operating Systems and Principles, pages
such as fine-grained client-client transfers are appro-                            152-164, October 1991.
priate for maintaining cache coherency for distributed                [Cockshott et al. 84]     Cockshott,  W.,  Atkinson,  M.,
persistent stores, (2) there is a commonality between                              Chisholm, K., Bailey, P., and Morrison, R.
the implementation techniques for recoverability and                               Persistent object management system. Soft-
coherency, and (3) this synergy can be exploited when                              ware Practice and Experience,  14(1),  Jan-
both properties are supported together.                                            uary 1984.
    We have presented a new DSM approach called log-                  [Delis & Roussopoulos 92]        Delis,  A.  and  Roussopou-
based coherency that uses recoverability mechanisms                                los,  N.    Performance  and  scalability  of
from persistent object systems as the basis for main-                              client-server database architectures. In Pro-
taining coherency of distributed objects. Our work ex-                             ceedings  of  the  18th  International  Confer-
tends previous work on DSM to exploit the notion of                                ence on Very Large Databases, pages 610-
commit points in which a group of related updates be-                              623, August 1992.
come visible atomically. We extend work on persistent
stores to support the fine-grained sharing made possi-                [Dig 92]     DigitalMEquipmentACorporation,.Maynard,Alpha Arc*
 *hitecture Handbook, 1992@
ble for parallel applications by DSM systems.  In par-
ticular,  log-based coherency separates the coherency                 [Eppinger et al. 91]     Eppinger,  J.,  Mummert,  L.,  and
grain from the synchronization grain.  This is impor-                              Spector, A.  Camelot and Avalon.  Morgan
tant for collaborative-design applications, where large                            Kaufmann, 1991.
data regions are shared among engineers at different                  [Feeley & Levy 92]      Feeley,  M.  J.  and  Levy,  H.  M.
nodes, but where updates are sparse and infrequent.                                Distributed shared memory with versioned
With log-based coherency, locking overhead can be re-                              objects.    In  Proceedings  of  the  Confer-
duced by using coarse-grain locks without increasing                               ence on Object-Oriented Programming Sys-
coherency overhead;  i.e.,  coarse-grain locks can sup-                            tems,  Languages,  and  Applications,  Octo-
port fine-grain sharing.                                                           ber 1992.
    While log-based coherency provides an alternative                 [Gharachorloo et al. 90]      Gharachorloo,  K.,  Lenoski,
to  other  DSM  systems,  the  approaches  are  not  mu-                           D., Laudon, J., Gibbons, P., Gupta, A., and
tually exclusive.  Our measurements show that appli-                               Hennessy, J. Memory consistency and event
cation behavior determines the best approach; e.g., if                             ordering in scalable shared-memory multi-
updates are highly clustered within a page, standard                               processors.   In  Proc.  17th  Annual  Sympo-
DSM techniques will perform better, while for sparse                               sium on Computer Architecture, Computer
updates, the log-based approach will perform better.                               Architecture  News,   pages  15-26.  ACM,
Therefore, adaptive hybrid approaches maybe be pos-                                June 1990.
sible where application behavior can be predicted.                    [Hagmann 86]       Hagmann,  R.  B.    A  crash  recovery
                                                                                   scheme for a memory-resident database sys-
                                                                                   tem. IEEE Transactions on Computers, C-
References                                                                         35(9):839-843, September 1986.


[Butterworth et al. 92]      Butterworth, P., Otis, A., and           [Hornick & Zdonik 87]        Hornick,  M.  F.  and  Zdonik,
              Stein,  J.   The  GemStone  object  database                         S. B. A shared, segmented memory system
              management  system.   Communications  of                             for  an  object-oriented  database.   A  ACM
              the ACM, 34(10):64-77, October 1992.                                 Transactions  on  Office  Informations  Sys-
                                                                                   tems, 5(1), January 1987.
[Carey et al. 93]     Carey,  M.  J.,  Dewitt,  D.  J.,  and
              Naughton, J. F. The OO7 benchmark. 1993                 [Hosking & Moss 93]        Hosking,  A.  L.  and  Moss,  J.
              ACM  SIGMOD.  International  Conference                              E. B.  Protection traps and alternatives for
              on Management of Data, 22(2):12-21, May                              memory managemnt of an object-oriented
              1993.                                                                language.  In Proceedings of the 14th ACM
                                                                                   Symposium  on  Operating  Systems  Princi-
[Carey et al. 94]     Carey,  M.  J.,  Dewitt,  D.  J.,  and                       ples, December 1993.
              Franklin, M. J.  Shoring up persistent ap-
              plications.  1994 ACM SIGMOD. Interna-                  [Janssens & Fuchs 93]       Janssens, B. and Fuchs, W. K.
              tional Conference on Management of Data,                             Relaxing  consistency  in  recoverable  dis-
              May 1994.                                                            tributed  shared  memory.   In  Proceedings
                                                                                   of  the  Twenty-Third  Annual  International
[Carter et al. 91]                                                                 Symposium  on  Fault-Tolerant  Computing:
              Carter, J., Bennet, J., and Zwaenepoel, W.                           Digest of Papers, pages 155-163, June 1993.
[Keleher et al. 92]                                                   [O'Toole et al. 93]     O'Toole,  J.,  Nettles,  S.,  and  Gi*
 *f-
              Keleher, P., Cox, A., and Zwaenepoel, W.                             ford,  D.   Concurrent  compacting  garbage
              Lazy  release  consistency  for  software  dis-                      collection of a persistent heap.  In Proceed-
              tributed shared memory.  In Proceedings of                           ings of the Fourteenth ACM Symposium on
              the 19th Annual Symposium on Computer                                Operating  Systems  Principles,  pages  161-
              Architecture, pages 13-21, May 1992.                                 174, December 1993.


[Keleher et al. 94]     Keleher,                             P.,      [Richard & Singhal 93]       Richard, III, G. G. and Sing-
              Cox, A. L., Dwarkadas, S., and Zwaenepoel,                           hal,  M.   Using  logging  and  asynchronous
              W.  TreadMarks:  Distributed shared mem-                             checkpointing to implement recoverable dis-
              ory on standard workstations and operat-                             tributed shared memory.  In Proceedings of
              ing systems.  In Proceedings of the Winder                           the 12th Symposium on Reliable Distributed
              1994 USENIX Conference, pages 115-132,                               Systems, pages 58-67, October 1993.
              January 1994.                                           [Rodeheffer & Schroeder 91]         Rodeheffer,


[Lamb et al. 91]     Lamb, C., Landis, G., Orenstein, J.,                          T. and Schroeder, M. D. Automatic recon-
              and Weinreb, D. The ObjectStore database                             figuration in autonet. In Proceedings of the
              system.    Communications  of  the  ACM,                             Thirteenth ACM Symposium on Operating
              34(10):50-63, October 1991.                                          Systems Principles, pages 183-197, October
                                                                                   1991.
[Li & Hudak 89]       Li,  K. and Hudak,  P.  Memory co-              [Rosenberg & Abramson 85]           Rosenberg,
              herence in shared virtual memory systems.                            J.  and  Abramson,  D.  A.   MONADS-PC:
              ACM Transactions on Computer Systems,                                A capability-based workstation to support
              7(4):321-359, November 1989.                                         software engineering.  In Proceedings of the


[Li & Naughton 88]        Li, L. and Naughton, J. F.  Mul-                         18th  Hawaii  International  Conference  on
              tiprocessor main memory transaction pro-                             System Sciences, 1985.
              cessing. In Proceedings of the International            [Satyanarayanan et al. 94]       Satyanarayanan,
              Symposium  on  Databases  in  Parallel  and                          M.,  Mashburn,  H.  H.,  Kumar,  P.,  Steere,
              Distributed  Systems,  pages  177-187,  De-                          D. C., and J.Kistler., J. Lightweight recov-
              cember 1988.                                                         erable virtual memory.  ACM Transactions


[Liskov 88]    Liskov,  B.   Distributed  programming  in                          onaComputerrSystems,y12(4):33-57,1Febru-994.
              Argus.     Communications  of  the  ACM,
              31(3):300-312, March 1988.                              [Stumm & Zhou 90]         Stumm, M. and Zhou, S.  Fault
                                                                                   tolerant  distributed  shared  memory  algo-
[Liskov et al. 92]    Liskov,  B.,  Ghemawat,  S.,  Gruber,                        rithms.  In Proceedings of the Second IEEE
              R.,  Johnson,  P.,  and  Shrira,  L.   Replica-                      Symposium  on  Parallel  and  Distributed
              tion in the Harp file system. In Proceedings                         Processing, pages 719-724, December 1990.
              of the Thirteenth ACM Symposium on Op-
              erating Systems Principles, pages 226-238,              [Thekkath & Levy 94]        Thekkath,  C.  and  Levy,  H.
              October 1992.                                                        Hardware and software support for efficient
                                                                                   exception handling.  In Proceedings  of  the
[Moss 90]     Moss, J. E. B.  Design of the Mneme per-                             6th  International  Conference  on  Architec-
              sistent object store.  ACM Transactions on                           tural  Support  for  Programming  Languages
              Information  Systems,  8(2):103-139,  April                          and Operating Systems, October 1994.
              1990.
                                                                      [Wu & Fuchs 90]       Wu,  K.-L.  and  Fuchs,  W.  K.   Re-
[Mueller 93]    Mueller, F.  A library implementation of                           coverable  disributed  shared  virtual  mem-
              POSIX threads under Unix. In Proceedings                             ory.    IEEE  Transactions  on  Computers,
              of  the  Winter  1993  USENIX  Conference,                           39(4):460-469, April 1990.
              pages 29-41, January 1993.                              [Zekauskas et al. 94]     Zekauskas,           M.         J.,


[Neves et al. 94]    Neves, N., Castro, M., and Guedes,                            Sawdon, W. A., and Bershad, B. N.  Soft-
              P.  A checkpoint protocol for an entry con-                          ware write detection for distributed shared
              sistent  shared  memory  system.    In  Pro-                         memory.  In Proceedings of the First Sym-
              ceedings  of  the  13th  ACM  Symposium  on                          posium  on  Operating  Systems  Design  and
              Principles  of  Distributed  Computing,  Au-                         Implementation, November 1994.
              gust 1994.


[O. Deux 92]     O. Deux.  The O2  system.  Communi-
              cations of the ACM, 34(10):34-48, October
              1992.