Experiences  with  Locking
                          in  a  NUMA  Multiprocessor
                            Operating  System  Kernel

       Ronald C. Unrau*   Orran Krieger   Benjamin Gamsa   Michael Stumm

                Department of Electrical and Computer Engineering
                         Department of Computer Science
                               University of Toronto
                          Email:  unrau@eecg.toronto.edu

                                  Abstract

We describe the locking architecture of a new operating system, Hurricane,
designed for large scale shared-memory multiprocessors.  Many papers al-
ready describe kernel locking techniques, and some of the techniques we use
have been previously described by others. However, our work is novel in the
particular combination of techniques used, as well as several of the individual
techniques themselves. Moreover, it is the way the techniques work together
that is the source of our performance advantages and scalability.  Briefly, we
use:

 o  a hybrid coarse-grain/fine-grain locking strategy that has the low latency
    and space overhead of a coarse-grain locking strategy while having the
    high concurrency of a fine-grain locking strategy;

 o  replication of data structures to increase access bandwidth and improve
    concurrency;

 o  a clustered kernel that bounds the number of processors that can com-
    pete for a lock so as to reduce second order effects such as memory and
    interconnect contention;

 o  Distributed Locks to further reduce second order effects, with modifi-
    cations that reduce the uncontended latency of these locks to close to
    that of spin locks.
______________________________________________*
     currently at IBM Canada

1      Introduction


The question of how to structure locks within an operating system is important, be-
cause it directly affects both the available concurrency and the latency of operating
system services.  The correct choice of locking strategy for a particular data struc-
ture or subsystem depends on the expected access pattern and the overall system
workload. In a shared-memory multiprocessor environment, we need to efficiently
support a workload consisting of either parallel applications or multiple sequential
applications  or  both.   These  workloads  result  in  four  types  of  access  behaviors
for  operating  system  data  structures:  1)  non-concurrent  accesses,  2)  concurrent
accesses to independent data structures, 3) concurrent, read-shared accesses, and
4) concurrent, write-shared accesses.
      A Unix application workload consisting of many sequential applications will
primarily  induce  the  first  two  types  of  access  behaviors.   Much  of  the  existing
work on operating system locking issues has focused on these types of workloads.
Parallel  applications,  on  the  other  hand,  primarily  induce  the  second  and  third
type of behavior.   The third and fourth types of behaviors are in some ways the
most important, however, as these can induce worst-case behavior in the operating
system.
      In this paper, we describe a locking architecture that addresses all four types
of  access  behaviors.   It  uses  a  hybrid  approach,  which  combines  properties  of
both  coarse-grained  and  fine-grained  locks.   The  coarse-grained  locks  minimize
the  number  of  atomic  operations  needed  in  the  critical  path  of  non-concurrent
operations.   Minimizing  the  latency  of  uncontended  locks  in  the  critical  path  is
important,  because it can constitute a significant portion of the overall response
time of an operation.  In our system, for example, the measured time for a simple
page fault is 160 sec, of which 40 sec is attributable to lock overhead.  Fine-
grained locks, on the other hand, provide the high degree of concurrency needed
for  concurrent,  independent  operations.   Further,  we  employ  a  technique  called
hierarchical clustering, which replicates data that is primarily read-shared so as to
increase overall lock bandwidth, and bounds the contention on shared structures
by constraining the number of processors that can access the structure.   Finally,
we  make  extensive  use  of  Distributed  Locks  proposed  by  Mellor-Crummey  and
Scott [19  ], in order to reduce second-order effects for those cases where contention
cannot  be  otherwise  avoided.    We  have  improved  on  the  basic  algorithm  and
optimized Distributed Locks for use in a kernel environment.
      The design of a locking architecture is heavily dependent on the parameters
of  the  system  environment  for  which  it  is  targeted.   In  our  case,  the  design  is
influenced by the fact that 1) atomic operations are expensive relative to normal


                                                         2


memory accesses, 2) swap is the only atomic operation supported, 3) our operating
system is an exception-based micro-kernel (as opposed to process-based [6 ]), and
4)  many  of  the  kernel  data  structures  are  left  uncached  because  our  hardware
does not support cache-coherence.  Nevertheless, we believe that elements of our
architecture are relevant to a wide variety of system architectures. It should also be
noted that many of the techniques we use have been proposed previously. However,
the strength of our approach lies in the particular combination of techniques used
to efficiently support the four access patterns described above.
      Section 2 of this paper describes our general locking architecture. Section 3 de-
scribes our improvements to, and experiences with, Distributed Locks. In Section 4
we present performance results from our system. This is followed in Section 5 by a
discussion of how our approach to locking might generalize to other systems, and a
discussion of our ongoing work. Finally, we conclude in Section 6 with a summary
of this paper.


2      Locking Architecture


This section describes the locking architecture of the Hurricane [8 , 13  , 25  , 26  ]
operating system.  Three key features distinguish this architecture:  a mix of coarse
and fine grained locks are used to achieve low latency while still supporting high
concurrency  for  independent  operations;  hierarchical  clustering  is  used  to  limit
contention by replicating data structures and constraining the number of processors
that  can  directly  access  a  particular  data  structure;  and  an  optimistic  deadlock
avoidance protocol is used to reduce common case latency.


2.1      A Hybrid Approach


Our Hybrid approach uses coarse-grained locks, where a single lock may be used
to protect several data structures but may only be held for short periods of time, and
it uses #light-weight# fine-grained locks to protect data for longer periods of time
but at a much finer granularity.
      Consider the chained hash table of Figure 1.  In a system using fine-grained
locks  (Figure  1a),  each  bin  would  have  its  own  lock  to  serialize  updates  to  the
hash  chains,  and  each  hash  entry  would  have  one  or  more  locks  to  protect  the
data therein.  With our hybrid approach, the entire hash table might be protected
by a single coarse-grained lock.  Using coarse-grained locks in this way has both
advantages and disadvantages with respect to space and time.  Clearly the number
of locks required to access a data structure is minimized, but that alone does not


                                                         3


Figure 1:  A chained hash table in a) a system with fine-grained locks, and b) the hybrid
locking of Hurricane. The boxes marked with the dark shading indicate locks that are set
with atomic operations. The boxes marked with light shading are reserve bits.


minimize the locking time, except in the case of no contention.  The challenge is
thus to keep locking time low as concurrency is increased, while still minimizing
the number of locks held.  The remainder of this section considers the trade-offs
involved.
      One disadvantage of a coarse-grained lock is that concurrent accesses to dif-
ferent elements are protected by the same lock, causing unnecessary contention for
independent operations.  Our first step to resolve this issue was to use Distributed
Locks. Distributed Locks allow processors to spin locally while waiting for a lock,
thereby removing the second order contention effects caused by spinning over the
inter-connection network. The additional traffic on the network and memory caused
by remote spinning not only slows down other non-contending processors, but also
slows  the  processor  that  is  holding  the  lock,  extending  the  length  of  its  critical
section  and  exacerbating  the  contention  problem.   Although  a  Distributed  Lock
requires more space than a spin lock (an additional two words per actively spinning
processor), a fine-grained approach would require one spin lock per hash element,
a much higher total cost.
      Figure 1b) shows how we allow increased concurrency by holding the main


                                                         4


Figure 2:  A chained hash table in a system with 2 clusters.  Each cluster has a separate
instantiation of the hash table and the locks that protect it. Deadlock must be avoided when
a cluster requires a lock in two clusters simultaneously, for example, to instantiate a new
copy of element Z on cluster 1.


lock only long enough to search the hash table and set a reserve bit in the required
element, after which the coarse-grained lock is released.1  Other processors waiting
for the reserved element spin on the reserve bit (with exponential back-off).  When
the bit is released, the waiting processors re-acquire the coarse-grained lock and
search again.2
      The reserve bit is in essence a fine-grained spin lock, except that 1) it requires
only one bit of storage instead of a whole word and is typically co-located with
other needed status information, 2) multiple reserve bits can be acquired while the
main lock is held, and atomic operations are not required to do so, and 3) it plays a
special role in our deadlock avoidance strategy (described in Section 2.3). The fine-
grained spin locks, however, can be subject to the contention effects that result from
the bursty accesses we expect in a multiprocessor environment. We counteract this
problem by controlling the number of processors that can simultaneously access
the data structure and by replicating the data structure, as described next.
______________________________________________1
     Our hybrid locking approach is in some ways similar to the locking strategy used by Peacock, et
al, for locking cache elements in a multiprocessor version of System V Release 4 [12 , 21 ].
    2Currently in our kernel, memory used for an object is always reused for objects of the same

type. Hence, there is no danger that a process could spin indefinitely because the memory has been
reallocated for another type of object that happens to have the same bit permanently set. If memory
was arbitrarily recycled, then it would be necessary to either (i) periodically check, while spinning,
that the data structure is still the one being sought, or (ii) use reference counts [1 ] to ensure that
objects are not deallocated while a process is spinning on them.


                                                         5


2.2      Hierarchical Clustering


To control the concurrency demands on the kernel data structures, we use a tech-
nique called hierarchical clustering. Briefly, hierarchical clustering is a framework
for managing locality in a scalable shared-memory multiprocessor [25  ]. Instead of
having one set of system data structures shared by all processors in the system, the
processors are grouped into clusters and a complete set of system data structures
are  instantiated  within  each  cluster.   With  this  framework,  the  read-mostly  data
is replicated onto each cluster,  and write-shared data exists on only one cluster,
possibly being migrated from one cluster to another.
      Figure 2 shows how the hash table of Figure 1 might appear in a two cluster
system.   A separate instance of the hash table,  each with its own lock,  exists in
each cluster.  The hash tables are used to hold both replicated entries (if they are
primarily  read-only)  and  non-replicated  entries.   In  Figure  2,  X  is  replicated  to
cluster 1 and cluster 2, while there is only a single copy of Y  and Z in the system.
Only processors in the cluster may access the cluster-local hash table.  To access
entries of a hash table in another cluster, a remote procedure call (RPC) must be
made to that cluster.3
      Replicas  of  primarily  read-shared  data  are  typically  made  on  demand.   Hi-
erarchical clustering supports efficient replication management by organizing the
system into a tree.  In the simplest case, the tree has three levels, with processors
at the leaves, clusters forming intermediary nodes, and a logical top of tree for the
entire system. The act of replication is made more efficient by combining multiple
simultaneous requests from the processors in one cluster into a single request to the
target cluster.  This is accomplished by always first creating a local (reserved) in-
stance before performing the replication. This local instance acts as a place-holder
until the real data is obtained, preventing redundant replication requests from being
issued. Combining is important because it is common that the many processes of a
large SPMD program make concurrent demands that require the same (read shared)
data.
      The hierarchical clustering tree is also used to efficiently support the broadcast
of modifications or invalidations when a replicated data structure is changed.  For
most data structures, the change is simply sent in parallel to each cluster.  For data
structures that are replicated per-processor, such as page tables in our system, the
change is further broadcast within each cluster to each processor affected.
      From the point of view of locking, hierarchical clustering provides two major
______________________________________________3
     In theory, any processor of the target cluster may be used to execute the RPC. In our implemen-
tation, RPCs from the ith processor in the source cluster are always directed to the ith processor in
the target cluster so as to roughly balance the RPC load.


                                                         6


benefits.  First, it bounds contention for both coarse-grained locks and reserve bits,
since RPCs are used instead of shared memory to access remote entries (Chaves
et al discuss some of the trade-offs between using RPC and shared-memory [3 ]).
Second,  it  increases  lock  bandwidth  by  i)  instantiating  per-cluster  system  data
structures (such as the hash table in Figure 2), each with its own coarse-grained
lock, and by ii) replicating read-shared objects (such as X in Figure 2), each with
its own reserve bit.
      Although  hierarchical  clustering  bounds  lock  contention  and  increases  lock
bandwidth, it does complicate locking protocols [26  ].  For example, a needed data
structure may not be present in the local cluster, requiring a remote operation to
get it.  If the data is replicated, then it must be kept consistent, which also requires
remote operations. If the local kernel is holding any locks when initiating a remote
operation, then some protocol must be established so that deadlock does not occur.


2.3      Deadlock Management Protocols


This section describes the protocol used to prevent deadlock across the clusters of a
hierarchically clustered system. The protocol applies only to locks within the same
class (e.g., the class of process descriptor locks);  we use a lock hierarchy across
classes.
      Consider again the clustered hash table of Figure 2.  Assume a local copy of
element Z  is required by cluster 1.  A search of the local hash table reveals that
it is not present, at which point a data specific location resolution technique [25  ]
indicates that the item resides in cluster 2.  The kernel on cluster 1 then allocates a
local instance of the element, Z0, and initiates an RPC call to cluster 2 to retrieve
the data.  The local instance is created before the remote access is started so that
other processors within the cluster that try to access Z while the remote operation
is in progress will not also initiate (redundant) remote requests.
      Because of the symmetric relationship between clusters, deadlock can result at
this point if the RPC to retrieve Z is started without releasing the locally held locks.
Our initial pessimistic deadlock prevention protocol required the initiator to release
all locks before initiating the RPC, and then to re-acquire them once the remote
operation had completed.  This added considerable overhead to the code.  Because
the data structures were unprotected and could be modified or even removed while
the local locks were released, the kernel had to search the hash table again to re-
establish the continued existence of element Z  after the RPC had completed.  In
addition, the kernel had to be prepared to handle the case where the data was no
longer present. In the common case this re-establishment of state was unnecessary,
since the probability of the data having been migrated or destroyed was very small.


                                                         7


      To avoid the overhead of re-establishing state for the common case, we have
implemented  an  optimistic  algorithm  that  avoids  deadlock  and  is  similar  to  that
described by Paciorek [20  ].  Before releasing the local locks, a reserve bit is set in
any structure that might be needed after completing the call. The reserve bit may act
as an exclusive lock or a reader-writer lock, depending on the data it protects.  The
local locks are then released and the RPC is initiated. If a reserve bit is encountered
during processing in the remote cluster, the RPC operation fails and returns with
an indication that a potential deadlock situation exists.  The local reserve bits set
earlier are then released and the remote operation is retried until it succeeds.  Note
that the local state must be re-established only if a retry is necessary.
      Our optimistic approach saves us from having to re-establish state in the com-
mon  case,  although  not  for  every  case.   In  practice,  we  have  found  that  retries
are seldom needed, and the need to restart an operation because of modified state
occurs even less frequently.  Our approach does have two disadvantages, however.
First, we have found that the deadlock avoidance algorithm required us to have two
versions of code in many cases: one version has release/retry handling (used when
executing as the target of a RPC); and one allows spinning without release/retry
handling.   Second,  by requiring a remote request to continuously retry the RPC
until it succeeds,  the probability of being able to acquire a remote reserve bit is
lower than for requesters that are in the target cluster and hence can spin directly
on the reserve bit.  Remote processors therefore have a greater potential of being
starved for resources that are over-committed.
      Remote operations must be retried even when the local locks and reserve bits
have all been released, because the processor itself is effectively a locked resource
(that could participate in a deadlock cycle) in an exception-based kernel.  This is
particularly apparent in our use of RPCs.  Consider once again the hash table in
Figure 2, and suppose processor P1  in cluster 1 would like to modify X  globally.
Consider the situation where P1 has dropped its local locks and reserve bits in order
to retry the request, but the processor it directs the RPC to, say processor P2 , has
the reserve bit for X set and is currently performing an RPC to another cluster.  If
the RPC from processor P1  now executing on processor P2  were to spin on X 's
reserve bit on processor P2 , deadlock would result.  The deadlock cycle is caused
by processor P1 's RPC holding the processor as it spins waiting for the reserve
bit, while processor P2  holds the reserve bit while waiting for the processor to be
released.
      Although our algorithm has been presented in the context of a clustered system,
it is important to note that the same protocol could be applied to any system that
requires multiple locks simultaneously. Also, we chose not to use the more common
global ordering approach [22  ] within a particular class of locks, because the only


                                                         8


ordering that makes sense in a clustered system is by cluster number. Since remote
cluster operations are in general uniformly distributed across clusters, we would
still have to release locks to preserve the correct ordering, and we would still require
the ability to roll-back and restart.
      We have observed two situations where modifications occur often to data re-
quired for the completion of the operation while their locks are released:  copy-
on-write page faults and program destruction.  Both situations occur infrequently
relative to other kernel operations, and of course are only a concern with large appli-
cations that span multiple clusters.  Nevertheless, they will perform less efficiently
on average because of the overhead of the retries.


2.4      Advantages of our Approach


The  four  classes  of  workload  an  operating  system  must  handle  are:   1)  non-
concurrent requests, 2) concurrent independent requests, 3) concurrent requests to
read-shared operating system resources, and 4) concurrent requests to write-shared
operating system resources.  We have found that our hybrid locking approach, in
conjunction with hierarchical clustering, allows us to effectively address all four
workload classes.


Non-concurrent requests:                The only important goal for this workload is to mini-
         mize latency. With our hybrid locking strategy, many kernel requests require
         only a single atomic operation. Hence, Hurricane is able to achieve uncon-
         tended response times comparable to uniprocessor systems [14  ].


Concurrent independent requests:                     The  important  goal  for  this  workload  is  to
         maximize concurrency, so an optimal strategy would be to use fine grain locks
         for these requests.  With our hybrid locking strategy, the reserve bits serve
         as fine grain locks to maximize concurrency and minimize the time that the
         coarse-grained locks are held.  By using hierarchical clustering, the number
         of processes concurrently contending for a coarse-grained lock is bounded,
         so, for this workload, Hurricane is able to achieve performance comparable
         to systems structured using only fine grain locks. This is demonstrated in the
         results section.


Concurrent requests to read-shared resources:                          Hierarchical clustering allows us
         to instantiate multiple instances of read-mostly data structures to increase the
         access  bandwidth  to  the  data.   Since  requests  from  SPMD applications  to
         shared resources can be bursty, it is important that the replication be done


                                                         9


         efficiently; hierarchical clustering creates a combining tree to reduce the de-
         mand  on  the  source  data  structure,  should  many  processors  wish  to  make
         copies of the data structure simultaneously.


Concurrent requests to write-shared resources:                          Although kernel resources are
         seldom actively write-shared in our system, it is still important to minimize
         second order effects for those cases where write-sharing does occur.  Since
         accesses to shared data in remote clusters typically occur via RPC calls, the
         number of processors competing for a write-shared data structure is bounded
         by the number of processors in the cluster.


2.5      Experiences using our approach


In  the  previous  sections,  we  discussed  our  general  locking  methodology  from
an architectural perspective.  Naturally, when applying a general methodology to a
particular situation, one often makes adaptations and optimizations to accommodate
particular uses. In this section we describe the more interesting lessons we learned
from applying our general techniques to a full operating system implementation.


Pessimistic vs.  Optimistic


In a number of cases we found it advantageous to use a pessimistic (i.e., release
all  locks,  including  reserve  bits,  prior  to  making  a  remote  request)  rather  than
an optimistic locking strategy,  primarily for reasons of simplicity.  For example,
although we use an optimistic strategy for many data structures when creating local
replicas, we typically use a pessimistic strategy for global updates.  The optimistic
approach is preferable for the former case, since it allows us to use the combining
tree  approach  discussed  in  Section  2.2.   The  pessimistic  approach  is  generally
preferable for updates that may be broadcast to many clusters: if a processor in one
cluster asks a processor in another cluster to broadcast modifications to data for
which it also has a copy, then it is obviously better to have the local copy unlocked
from the start.


Hybrid Compromises


Although we generally use the hybrid coarse-grain/fine-grain locking strategy as
described,  we  do  not  follow  it  religiously.   Our  kernel  has  some  data  structures
that are protected by coarse-grained locks and have no fine-grained locks, and in
some cases, we have found it advantageous to split coarse-grained locks to achieve
somewhat greater concurrency.


                                                        10


Retries


The rationale for using an optimistic approach is to trade off performance under
contention (possibly requiring retries) for performance under light load (allowing
locks to be safely held during remote operations).  We found not only that retries
are rarely required, but in those cases where they are required they would still have
been required using a pessimistic approach.
      For example, with SPMD programs, simultaneous faults to copy-on-write pages
raise a number of potential deadlock situations that require retries with the optimistic
approach.   However,  because  a  copy-on-write  fault  requires  instantiating  a  new
private  page  to  replace  the  current  shared  page,  the  pessimistic  approach  would
likely  find  that  its  copy  of  the  page  had  disappeared  by  the  time  it  completed
its remote operation,  requiring it to re-search its data structures and re-issue the
request.
      A second example can be found in the destruction of parallel programs contain-
ing many processes.  Hurricane maintains a family tree of processes in the system,
where the links of the tree run through the process descriptors.  When a process
in the application is to be destroyed, multiple process descriptors in different clus-
ters must be updated to remove the process from the tree.  Since all processes of
an application are destroyed at approximately the same time, retries are common,
independent of the strategy chosen.


Data structure design


One lesson we learned from the case of program destruction was that combining
two structures with different locking characteristics into a single entity can lead
to many concurrency control problems.  In this particular case, the problem came
from the fact that program destruction can involve up to three process descriptors
and has a natural lock ordering that follows the structure of the tree, while process
descriptors are also used to implement message passing which always involves two
arbitrarily related processes, with no natural ordering.  Had the family tree been
implemented as a separate data structure, it would have been possible to exploit the
hierarchy of the tree to enforce a lock ordering that would have allowed us to avoid
the RPC retries described above.


3      Using Distributed Locks


Distributed  Locks  [19  ]  are  used  in  our  system  primarily  for  per-cluster  coarse-
grained  locks,  since  cross-cluster  interactions  most  often  occur  through  RPCs.


                                                        11


                         Figure 3:  Locking algorithms used by Hurricane


Distributed Locks are particularly well-suited for NUMA shared-memory multi-
processors and can substantially reduce the second-order effects stemming from
the memory and inter-connection network contention that occurs when processors
spin on remote memory.  Distributed Locks build a queue of processors waiting
to  acquire  a  lock.   Second-order  contention  effects  are  reduced  because  waiting
processors spin on their local queue elements, instead of across the interconnection
network.  The queue also has the benefit that accesses to the lock are distributed
fairly, since processors are queued in order of arrival. The remainder of this section
describes several interesting lessons we have learned from using Distributed Locks.


3.1      Latency in the uncontended case


The  high  uncontended  latency  of  Distributed  Locks  relative  to  spin  locks  was
originally a concern to us,  since other researchers had found that it could be as
much  as  twice  as  high  as  that  of  simple  spin  locks  [15  ].   One  way  to  address
this problem is to use an adaptive technique, where the locks switch between spin
and  distributed  locks,  depending  on  the  amount  of  contention  observed  [2 ,  15 ].
We instead found that two simple modifications to the original distributed locking
algorithm could improve the uncontended latency to make it competitive with that
of simple spin locks (on our system), while preserving the advantages of distributed
locks in the contended case.


                                                        12


      The  original  and  modified  distributed  locking  algorithms  are  shown  in  Fig-
ures 3a and 3b, respectively. The first modification removes the code that initializes
the  per-processor  local  structure  from  the  critical  path  of  the  uncontended  case
(i.e.,  the  first  dashed  box  in  Figure  3a).   This  was  done  by  requiring  the  per-
processor queue structure to be initialized prior to the first request to the lock, and
by re-initializing the structure when it is modified, which occurs only when there
is contention for the lock.  The code added is highlighted in Figure 3b.
      The second modification to the Distributed Lock algorithm removes the con-
dition in release__lock,  which determines whether another processor has since
added  itself  to  the  queue,  and  which  is  executed  just  prior  the  execution  of  the
compare__and__swap to release the lock in the uncontended case (i.e., the second
dashed box in Figure 3a). The check was there as an optimization for the contended
case,  assuming  local  memory  accesses  are  much  cheaper  than  remote  accesses.
However, this check degrades the performance of the common case where the lock
is uncontended. Removing the check does not affect the scalability of the algorithm,
since it adds only a constant overhead to the case where there is contention.
      With these two modifications the uncontended latency on Hector improved
from 5.40 sec to 3.69 sec # an improvement of 32%.  The optimized time now
compares favorably to the uncontended spin lock time of 3.65 sec, the algorithm
of  which  is  shown  in  Figure  3c.   These  results  are  described  in  more  detail  in
Section 4.1.1.


3.2      TryLock


As described by Mellor-Crummey and Scott, Distributed Locks do not support a
TryLock operation.  TryLock makes a single attempt to obtain a lock, and returns
either with the lock held, or with a failure code if the lock is not free.  In operating
system kernels, TryLocks are typically used by the interrupt handlers, which cannot
wait for a lock in case it is held by the pre-empted process. In our system, interrupts
are used not only for devices, but also for invoking RPCs.  In the case of an RPC,
if a TryLock fails then the invoking processor is returned an error and retries the
operation.
      Our first attempt to extend the basic Distributed Locking algorithm to support
TryLock took advantage of the fact that the local queue structures could be pre-
allocated on a per-processor basis, one for each coarse-grained lock.  The interrupt
handler checks whether the pre-allocated local queue element is in use before it
enqueues itself; if the queue element is free, then it is certain it did not interrupt a
current holder of the lock and can therefore safely wait for the lock to be released.
While this does not implement a true TryLock (because the interrupt handler will


                                                        13


enqueue itself and wait rather than returning immediately if the lock is held),  it
does prevent deadlock and has the advantage of allowing the interrupt handler to
acquire the lock under all conditions except when it clearly cannot, namely when
it has interrupted the lock holder.  Unfortunately, this implementation of TryLock
required  a  flag  in  the  local  queue  structure  that  had  to  be  modified  both  when
acquiring  and  releasing  the  lock,  and  hence  had  a  negative  impact  on  the  base
performance of our distributed locks in the uncontended case.
      We developed a second variant of the Distributed Locking algorithm,  which
also  supported  TryLock  (this  time  a  true  TryLock)  but  which  only  added  over-
head  to  release__lock  in  the  contended  case.   The  new  algorithm  is  similar  to
acquire__lock of Figure 3b, except that it uses a separate local queue structure just
for interrupt handlers. If an interrupt handler discovers that the lock is already held
after having enqueued itself, then it returns with an error code (rather than spin-
ning), leaving its local queue structure still in the queue. The queue structures from
failed TryLock requests are garbage collected by Release__lock operations.  This
implementation of TryLock is similar to the timeout mechanism for the queueing
lock, developed independently by Craig [5 ].
      Unfortunately,  we  found  that  this  second  variant  of  TryLock  discriminated
against RPC operations and favored local operations. In hindsight, we realized that
this use of TryLock was fundamentally incompatible with Distributed Locks, since
Distributed Locks are inherently fair, while retry-based locking is only probabilis-
tically fair. That is, if a lock is saturated, then a Distributed Lock's release__lock
operation will always hand-off the lock to some local processor that is waiting in
the queue, keeping the lock permanently held; remote requests using TryLock will
never see the lock free.
      An  alternative  to  using  TryLock  for  RPCs  is  to  disable  interrupts  while  the
lock  is  held,  thus  preventing  RPCs  from  getting  through.   This  way,  the  RPC
interrupt handler can be sure it cannot deadlock with the processor it interrupted.
Unfortunately,  our  hardware  only  provides  the  ability  to  enable  and  disable  all
interrupts,  and for a number of reasons the Hurricane kernel always runs with
interrupts  on.   We  therefore  adapted  a  strategy  first  suggested  by  Stodolsky  et
al [23  ].
      Inter-processor interrupts are treated as a separate interrupt class that can be
logically masked. A per-processor flag is set whenever a lock is about to be acquired
that could cause deadlock with an interrupt handler.  An interrupt handler always
first checks the flag, and if clear, can safely queue for the lock. If, on the other hand,
the flag is set, then the interrupt handler enqueues a record of the work to be done
on a per-processor work queue. When the flag is cleared, the queue is checked and
any pending work is immediately completed.  Because the flag and the queue are


                                                        14


accessed strictly locally, they can be cached effectively.
      The per-processor flag acts as a lock for the processor, placed at the top of the
lock hierarchy:  it must be acquired before any other lock can be acquired.   For
RPCs,  it allows fair access to the processor,  because work is enqueued for later
execution whenever the interrupt handler finds the processor locked in a manner
similar to the way processors enqueue themselves on Distributed Locks.
      In  retrospect,  it  may  have  been  better  to  combine  the  work  queue  with  our
second TryLock variant, rather than adding the additional per-processor flag to the
top of the lock hierarchy.


4      Experimental Results


In this section, we use synthetic stress tests to demonstrate the performance of our
locking architecture.  The experiments were run on a fully configured version of
Hurricane with all servers active, but with no other applications running at the
time. The operating system was running on a 16 processor Hector prototype with
16 MHz MC88100 processors [27  ].  The particular hardware configuration used in
our experiments consists of 4 processor-memory modules per station (a shared bus)
and 4 stations connected by a ring. This causes access times to vary from 10 cycles
for a local (on-board) access, to 19 cycles for an on-station access, and 23 cycles
for a cross-ring access.


4.1      Basic locking performance


We first present performance results for the three locking algorithms of Figure 3
in  the  absence  of  contention,  and  then  show  their  performance  as  the  locks  be-
come  contended.   Our  processors  only  support  fetch__and__store  instructions
(and  not  compare__and__swap).   Therefore,  we  use  Mellor-Crummey  and  Scott's
fetch__and__store  variant  of  their  Distributed  Lock  algorithm  in  these  experi-
ments.   Using this variant of the algorithm only impacts the performance of the
contended case, as described in Section 4.1.2.


4.1.1      Uncontended performance


We measured the performance of the three locking algorithms by measuring the
average time to acquire and release a lock 106  times.  The uncontended latency of
exponential backoff spin locks (Figure 3c) varies between 3.65 sec and 4.63 sec,
depending on the distance between the process requesting the lock and the lock
variable.   The  latency  of  the  unmodified  Distributed  Locks  (Figure  3a)  varies


                                                        15


                           ______________________________________________________________
                           |_________________|Atomic__|_____Mem.__|_____Reg.__|___Br.__|_

                           |__MCS________|_____________2__|________2__|______3__|____5__|_

                           |__H1-MCS__|________________2__|________1__|______3__|____5__|_

                           |__H2-MCS__|________________2__|________0__|______3__|____4__|_

                           |__Spin________|____________2__|________0__|______1__|____3__|_

Figure 4: Instruction counts required to execute a lock/unlock pair for the various routines
in the absence of contention.   MCS is the unmodified Mellor-Crummey and Scott Dis-
tributed Lock algorithm; H1-MCS is the MCS algorithm with our first modification, that
removes the initialization code; H2-MCS is the H1-MCS algorithm with the conditional
test in the unlock removed; Spin is the exponential backoff spin lock algorithm. Atomic are
atomic read__modify__write instructions (swap instructions in our case); Mem are loads
or stores to memory; Reg are single-cycle register-to-register instructions; Br are branch
instructions (including return).


between 5.40 sec and 6.02 sec.  With our first modification that eliminates the
initialization code, latency improves to between 4.56 sec and 5.33 sec, and with
our  second  modification  that  also  removes  the  condition  code,  the  uncontended
latency further improves to between 3.69 sec and 4.63 sec.
      The instruction counts for the three locking algorithms, obtained by inspecting
the assembly code, are shown in Figure 4.4  While the modified Distributed Lock
algorithm  (H2-MCS)  has  the  same  number  of  atomic  operations  and  memory
accesses as the spin lock algorithm, it should have five additional cycles of latency
due  to  branch  instructions  and  register  to  register  instructions.   This  expected
latency is not reflected in the measured performance results, because the execution
of these instructions is overlapped with the store part of the fetch__and__store
instructions (the MC88100 processor can proceed as soon as the fetch portion
of the fetch__and__store has completed).  Hence, our modified Distributed Lock
algorithm performs almost as well as the spin lock algorithm on our system.


4.1.2      Performance under contention


Figure  5  compares  the  response  times  of  the  different  locking  algorithms  under
contention,  when  p  processors  continuously  acquire  and  release  the  same  lock.
Figure  5a  and  5b  show  the  performance  for  the  case  where  the  lock  is  held  for
0 sec and 25 sec, respectively.
______________________________________________4
     On our system, all stores to a variable that might be modified with a fetch_and_store instruction
must also occur using a fetch_and_store instruction.  For this reason, the unlock operation for a
spin lock releases the lock using a fetch_and_store rather than a store instruction.


                                                        16


                    a) zero lock hold time                                            b)  25 usec lock hold time

            400        o     o  Spin (max backoff  35 usec)                   o                                                    *
 *              o       8@

                       o     o  Spin (max backoff  2 msec)             o  o                                                        *
 *       o  o

            300        2    2   H2-MCS                             o                                                               *
 *   o         2        6@
                                                                                                                                  o
Response                        H1-MCS                      o  o                                                               o   *
 *      2   2   .o

    Time    200        .     .  MCS                     o                                                              o   o  2   2*
 *o 2.o  .o .o          4@
   (usec)                                           o                     2   2                                     o  2   2o  .o .
                                                 o                 2o 2o  o   o                                 2o 2.o .o  .
            100                           o  o  2   2o  2o 2o  2o                                       2oo 2.oo.o                 *
 *                      2@
                                  2o  2o 2o  2o  o          .  .   .   .  .   .                  2.oo2.oo.
                           2.o2ooo.o  .o  .  .   .  .   .                                 2.o2.ooo
                0      2.oo    .                                                                                                   *
 *                      0


                       1   2      4       6      8     10      12     14     16           1   2      4      6       8     10     12*
 *      14     16

                                    Number of Processors                                               Number of Processors


                   Figure 5:  The average response times of a lock/unlock pair when p processes repeatedly
                   access  a  critical  section.   The  curves  show  the  original  Distributed  Locking  algorithm
                   (MCS), the original algorithm with the initialization code removed (H1-MCS), the second
                   algorithm with the conditional test also removed (H2-MCS), the exponential backoff spin
                   lock with a maximum backoff of 35 sec, and the exponential backoff spin lock with a
                   maximum backoff of 2 msec.


                                                                           17


      Because we use the fetch__and__store variant of Distributed Locks, it is pos-
sible that a nil will be stored to the lock variable in release__lock, even if there is
some successor waiting for the lock. In this case, a performance penalty is incurred
to repair the queue. From Figure 5a and 5b, we can see that the first modification we
made to Distributed Locks does not degrade performance in the case of contention,
while the second modification adds a constant overhead to release__lock, which is
shown by that fact that the latency increases linearly with the number of contending
processors.   The  extra  latency  for  the  second  variant  is  a  result  of  not  checking
for successors in the unlock operation, requiring the queue to always be repaired
if there is a successor.  If the lock is held for zero time, then this degradation has
a significant effect on performance (Figure 5a), but if the lock is held for as little
as 25 sec, then the extra latency is much less significant (Figure 5b).  Note that
if compare__and__swap were available, then the performance differential would be
significantly lower, although it would not be eliminated.
      The Distributed Locks are compared against two variants of the exponential
backoff spin locks in Figure 5,  one where the maximum backoff is 35 sec and
the other where the maximum backoff is 2 msec.  The former value is intended for
lightly contended locks to reduce the latency in the case where the lock could not
be acquired immediately, and is the value used internal to our operating system (for
a cluster size of 4). The latter value was chosen because it yields optimal results for
the experiments presented. With a maximum backoff of 2 msec, the performance of
the spin locks is competitive with that of the Distributed Locks, since the memory
contention becomes negligible.  However, using this value makes the lock highly
susceptible to starvation:  with 16 processors contending for the lock and a lock
hold time of 25 sec, it took over 2 msec to acquire the lock in over 13% of the
acquisition attempts.


4.2      General locking results


We  use  two  synthetic  page  fault  tests  to  demonstrate  the  effects  of  our  locking
architecture. In particular, the tests use soft page faults (i.e., faults to pages already
in core), since such faults are fairly common in our system, both for mapping in
cached files, and to support page-level cache coherence, page migration, and page
replication.   The tests model particular phases of real applications,  stressing the
boundary cases. Using synthetic tests instead of real applications has the advantage
that it allows us to focus our attention on results of interest to this paper. Application
results on our system are presented in [25  ].
      The two synthetic tests used are:

Independent faults (Figure 6a):                  p  processes  repeatedly  fault  on  a  per-process


                                                        18


Figure  6:  A schematic depiction of the programs that stress the memory management
sub-system.  (a) p processes repeatedly fault on a region of local memory.  (b) p processes
simultaneously write to the same small number of shared pages.


         private region of local memory.  Because the faults are to different physical
         resources (i.e., different pages) the only lock contention in this experiment is
         due to unnecessary locking conflicts in the kernel.


Shared faults (Figure 6b):               p processes repeatedly 1) write to the same small num-
         ber of shared pages, 2) barrier, and 3) unmap the pages from the processes'
         page tables.   Because the faults from the different processes are all to the
         same shared pages, lock contention is implicit in the application demands.


      Figures 7a and 7b show the response time of a page fault for the two tests on a
single cluster of 16 processors, as p is varied from 1 to 16 processors. The different
curves represent performance when either Distributed Locks or exponential backoff
spin locks are used.
      For  the  independent  fault  test  (Figure  7a),  there  is  little  difference  between
the performance of Distributed and exponential backoff spin locks if the number
of contending processors is between 1 and 4.  However, if p is increased beyond
four, then the use of spin locks degrades performance substantially, indicating that
the  coarse  grain  locks  are  a  source  of  contention.   With  16  processors  faulting
concurrently,  the latency to handle a page fault is over twice as high when spin
locks are used instead of Distributed Locks. These results demonstrate the dramatic
impact that second order effects can have on the performance of kernel operations,
since the latency increases are due almost entirely to contention at the memory and
interconnection network.
      For  the  shared  fault  test  (Figure  7b),  the  difference  in  latency  between  Dis-
tributed Locks and spin locks is much smaller.  This is because processes contend
more for reserve bits, and less for the coarse grain locks.  However, it is apparent


                                                        19


                   a) Faults to Independent Pages                                 b) Faults to Shared Pages
                                                                                                                                   *
 *            o
             1000      o - Spin Locks                                         o                                                    *
 *    o            1100
                       . - Distributed Locks                                                                                       *
 *            .
Response      800                                                      o                                                        o  *
 *    .            900
                                                                                                                         o      .
    Time      600                                              o                                                 o       .         *
 *                 700
   (usec)                                                                                                 .o     .
              400                                       o              .      .                    .o                              *
 *                 500
                                                 o      .      .                        .o .o  .o
              200      .o  .o  .o .o      .o     .                                                                                 *
 *                 300

                       1   2      4       6      8     10      12     14     16         1  2       4      6      8      10     12  *
 *   14      16

                                    Number of processors                                             Number of processors


                   c) Faults to Independent Pages                                 d) Faults to Shared Pages
           1200                                                                                                                    *
 *                  220@
                                                                                          .
           1000                                                             o             o                                        *
 *                  200@

Response     800                                                                                                                   *
 *                  180@

    Time     600                                                                                                                   *
 *                  160@
                                                                                                       o.                          *
 *                  140@
   (usec)    400                                                            .                                                      *
 *         o        120@
                                                               o
             200          o.          o.           o.          .                                                   o.           o. *
 *         .        100@

                 0                                                                                                                 *
 *                  800
                          1           2            4           8           16             1            2           4            8  *
 *        16

                                           Cluster Size                                                     Cluster Size


                 Figure 7:  Page fault response times for the two synthetic page fault tests. Figures (a) and
                 (b) demonstrate the effect of different amounts of contention in a fixed cluster size of 16
                 processors.  Figures (c) and (d) demonstrate the impact of different cluster sizes given a
                 fixed degree of contention caused by 16 processors.


                 that there is at least some contention for the coarse grain locks as the number of
                 contending processors increases.  This stems from the fact that multiple processors
                 simultaneously attempt to acquire the coarse grain lock that protects the reserve bit
                 they are waiting on whenever a reserve bit is cleared, so there are bursts of heavy
                 demand on the coarse grain locks.
                       Figures 7c and 7d show the response time with 16 processors as a function of
                 the cluster size, which is varied between 1 and 16.  For the independent page fault
                 test, we expect small cluster sizes to result in the best performance.  For the shared
                 fault test, however, the situation is not so clear.  We expect the sharing in this test
                 to give larger cluster sizes an advantage.  On the other hand, having a cluster size


                                                                         20


smaller than the system size means that the page descriptors are replicated to each
cluster, increasing the lock bandwidth and bounding the contention on each page
descriptor to the number of processors in the cluster.
      Figure 7c shows that smaller cluster sizes do indeed lead to better performance
for independent requests. For this experiment, performance does not degrade under
contention if the cluster size is 4 or less.5  Hence, it is clear that the coarse-grained
locks that protect the reserve bits do not constrain concurrency; i.e., they do as well
as a fine grained locking strategy.  Since this is a stress test that exerts demands on
the kernel that are more extreme than that of any application, one can expect cluster
sizes larger than 4 to perform as well with our hybrid locking strategy as with a fine
grained locking strategy, assuming the requests are independent.
      By comparing the results of Figure 7c to those of Figure 7a, we can see that the
performance of 16 processes, faulting independently in 4 clusters of 4 processors
each, is as good as the performance of 4 processes faulting in a single 16 processor
cluster.   From  this  we  can  conclude  that  hierarchical  clustering  is  effective  in
localizing requests, allowing independent requests on different clusters to proceed
concurrently without interference.
      Figure 7d shows that moderate cluster sizes yield the best performance in the
case of the shared fault test.  For very small cluster sizes,  the overhead of inter-
cluster operations dominates performance.6  As the cluster size increases, the cost
of obtaining a local copy of the page descriptor with an RPC is amortized over more
processors, since only one processor in each cluster must do so; the other processors
in the same cluster can then service the fault using the local replica of the page
descriptor. This is an example of increasing the lock bandwidth through replication,
and merging requests using a combining tree. For very large cluster sizes, the page
descriptors become shared by too many processors, and lock contention within the
cluster becomes a problem.
      A  workload  with  a  mix  of  real  applications  can  be  expected  to  have  both
independent and non-independent faults.  The performance behavior can therefore
be  expected  to  be  a  combination  of  the  behavior  of  both  tests.   From  this,  we
conclude that a cluster size somewhere in the range of 4 to 16 processors could be
optimal for our system.  It is also interesting to note that results similar to those
presented above were obtained for stress tests that exercised other portions of the
kernel, such as the message passing subsystem.
______________________________________________5
     The difference between the performance of the 2 processor cluster and 4 processor cluster is due
to NUMA affects rather than contention.
    6A null Remote Procedure Call (RPC) requires 27 sec, while the cost to perform a cluster-wide

page lookup and replicate a page descriptor is approximately 88 sec.


                                                        21


5      Generalization of Experiences


Our  experiences  relate  to  one  particular  system.   In  this  section  we  attempt  to
generalize our results to other hardware and software system architectures.


5.1      Other Operating System Designs


Process-based operating systems


Because our operating system is exception-based, as opposed to process-based [6 ],
non-blocking  locks  are  the  most  natural  approach  to  locking,  since  there  is  no
process  in  the  kernel  to  block.   While  our  approach  can  be  applied  directly  to
process-based  kernels  in  those  places  where  they  use  spin  locks,  process-based
kernels can also use blocking locks which open up new opportunities to improve on
our techniques.  For example, by creating a process to handle interrupts and letting
them block on locks, it is possible to remove the processor from the lock cycle and
to provide greater fairness for remote requests, eliminating the problem described
in Section 3.2.


Monolithic operating systems


Another  question  concerns  how  our  locking  strategy  might  apply  to  monolithic
operating  systems.   We  have  applied  the  techniques  described  in  this  paper  to
several of our system servers,  in particular the file system [13  ],  and have found
the benefits of reduced latency and increased concurrency that stem from the use
of  both  hierarchical  clustering  and  our  hybrid  locks  apply.   This  gives  us  some
confidence that our approach would be just as effective when applied to monolithic
kernels.


5.2      Other Multiprocessor Designs


The system we used to test our ideas has only 16 relatively slow processors, does
not support hardware cache coherence, and provides locking support that is rela-
tively slow (requiring two remote memory accesses for a single read-modify-write
sequence) and inflexible (supporting only atomic-swap).   It is natural,  therefore,
to question how our results might apply to more modern systems, with faster pro-
cessors  and  interconnection  networks,  cache-coherence  or  COMA  (Cache-Only
Memory Access [7 , 9 ]) support, and more powerful cache-based atomic primitives.
Although such architectural changes will shift the tradeoff points, we believe that


                                                        22


many of the techniques described in this paper would still apply to these systems.
We address each of these issues in turn.


Larger and faster systems


Recent progress in processor design and manufacturing have allowed CPU perfor-
mance to advance at a significantly greater rate than other components in multipro-
cessor systems.  Although higher bandwidth can be achieved in other components
by widening the data paths, the latency to access memory or transfer data between
processor caches is a difficult problem to overcome.  If one additionally considers
the effect of larger systems, it becomes clear that contention for shared resources
and the locks that protect them will only get worse.  Therefore, techniques such
as hierarchical clustering to bound the number of contending processors and lock
replication to increase lock bandwidth, RPCs to increase locality, and distributed
locks to reduce second-order effects, will become all the more important for such
systems.


Cache-coherent systems


One of the more significant differences between our system and many others is our
lack of hardware cache-coherence. As a general effect, the lack of cache-coherence
makes physical locality more important in our kernel than in others, since we run
with most of the kernel data uncached and hence cannot take advantage of temporal
locality.
      Cache-coherent NUMA and COMA systems have higher data access bandwidth
and lower access latency when data is read shared, which suggests that our software
techniques for increasing data bandwidth and reducing access latency would be less
important in such systems.  However, for many kernel data structures we do not
believe this to be the case. For example, the Hurricane memory manager modifies
the reference count on page descriptors when a page is mapped into an application
address space,  and hence page descriptors cannot be replicated efficiently using
hardware cache coherence; with hierarchical clustering, a separate local reference
count is maintained for each instance of the page descriptor replicated by software.
Also, it is still important to limit the number of processors that can contend for a
lock, and distribute kernel data structures so that they are near requesting processors.
Hence  we  believe  that  hierarchical  clustering  can  help,  even  for  systems  with
hardware cache coherence.
      (As  an  aside,  we  believe  that  one  might  want  to  run  with  some  portions  of
the  kernel  data  uncached,  even  in  a  hardware  cache-coherent  system.   Running


                                                        23


uncached  eliminates  cache-line-based  false  sharing,  and  with  it  the  cache-line
ping-pong effects that often occur when data with different access patterns share
cache  lines.   Also,  temporal  and  spatial  locality  in  current  operating  systems  is
often quite poor [4 , 24  ], leading to very low cache hit rates, reducing the benefits
of cache-coherence.)


Advanced atomic primitives


The atomic-swap operation supported by our processor requires two main memory
accesses and is hence relatively slow compared to cache-based atomic operations
which permit a lock to be acquired without going to memory (provided the cache-
line is currently held in the exclusive state). Newer systems also provide more pow-
erful atomic primitives, such as compare-and-swap (CAS) and load-linked/store-
conditional (LL/SC), which allow a number of additional optimizations that were
not available to us in our system.
      The benefits of our hybrid locking strategy come primarily from the reduced
number  of  atomic  primitives  required,  their  reduced  space  overhead,  and  their
natural support for performing multiple simple atomic operations under a single
lock.    However,  cache-based  atomic  primitives  can  reduce  the  cost  of  atomic
operations  to  close  to  that  of  regular  memory  accesses,  bringing  into  question
our focus on reducing the number of atomic operations required for locking.  The
trade-off between atomic and regular memory accesses depends, however, on subtle
implementation issues that can change from year to year. We believe that reducing
the number of atomic operations will likely remain beneficial, although not at the
expense of larger numbers of regular loads and stores,  and hence this benefit of
hybrid locking should still apply.
      Both CAS and LL/SC instructions allow single bit locks to be implemented that
can share a word with other data, thus eliminating another advantage of the hybrid
locks. However, the hybrid-lock technique of using coarse-grained locks to protect
large data structures has the advantage that multiple operations (such as dequeuing
and locking an element) can be performed atomically under a single lock.  Hence
hybrid  locks,  with  Distributed  Locks  as  the  coarse-grain  locks,  would  remain  a
good choice.
      Distributed  Locks  are  affected  by  cache-based  locks  in  a  number  of  ways.
The trade-off between regular spin locks, our version of MCS Distributed Locks,
and  newer  cache-based  queueing  locks  which  are  optimized  for  the  contended
case [5 , 17  ] depends on three primary factors: 1) the degree of sharing of the locks
(and thus its hit rate);  2) the amount of steady-state contention expected;  and 3)
the probability of bursts of very high contention. For low sharing, low steady-state


                                                        24


contention, and low burstiness, spin locks would be the better choice, since they
have the lowest latency.   With higher degrees of sharing the savings from using
spin locks are likely to be minimal, since the cost of the cache misses will swamp
any savings. In addition, if occasional burstiness is a problem, spin locks must also
be ruled out because of the second-order effects from cache-coherence traffic.  If
the steady-state contention is expected to be low, our modified MCS locks have the
advantage, since they have lower latency than other queueing locks. Finally, if high
contention is common, the cache-based queueing locks would be the better choice,
since their contended-case performance is better than the MCS locks.
      Finally,  LL/SC  or  CAS  instructions,  whether  cached  or  not,  can  be  used  to
implement lock-free operations, which can remove the need for locks entirely [18  ].
Lock-free data structures have a number of benefits, both in terms of performance
(by removing the extra space and time cost of locks) and in terms of functionality
(they eliminate deadlock), but also have a number of disadvantages.  Because only
a  single  word  (or  double  word)  can  be  updated  atomically,  modifications  often
become more complex:  either an entire data structure is copied, changes made to
the copy, and a pointer to the copy atomically swapped in (provided the previous
pointer still points to the original copy [11  ]); or the changes can be performed as
a series of atomic operations on single words, but only if each change leaves the
full data structures in a valid, consistent state [18  ].  The first approach can be very
expensive if the data structure is large, while the second approach requires finding
safe states for each atomic change, which can be difficult and error prone.  Even
when atomic modifications can be done with a small number of atomic primitives,
it may still be more efficient to use regular locks depending on the true relative cost
of the atomic primitives compared to regular loads and stores. Finally one must be
careful about the possibility of starvation using the lock-free approach.7


5.3      Current Directions


We are currently in the process of redesigning our locking strategy for our next
operating system, Tornado, targeting a new, T5-based multiprocessor called NU-
MAchine.  This multiprocessor will have an order of magnitude faster processors,
cache-coherence  support  in  hardware,  cache-based  LL/SC  instructions,  and  net-
work caches.  Our initial design considerations include:


      o  Operating  systems  have  traditionally  had  poor  caching  behavior  [4 ,  24 ].
         However,  we believe this is primarily because caching and multiprocessor
         cache-coherence effects have been largely ignored in the design of operating
______________________________________________7
     An alternative is to use a wait-free approach, but this is generally much more expensive [10 ].


                                                        25


         systems.  Today's processor speeds relative to memory speeds make it im-
         perative to seriously consider the caching effects. We believe it is possible to
         design the data structures and the locking architecture of an operating system
         to be cache friendly.  Since 10 to 20 lock operations can be performed in the
         processor's primary cache in the time it takes to service a single cache miss,
         improving  locality  and  reducing  the  sharing  of  locks  is  likely  to  be  more
         important than reducing the number of locks.


      o  We  are  considering  using  lock-free  data  structures  for  simple  leaf  locks,
         particularly for data structures that are required by interrupt handlers and if
         the data to be modified is contained in a single word.


      o  Clustering to bound contention and increase lock bandwidth is a clear neces-
         sity and should prove to be even more beneficial in our new, larger and faster
         system.


      o  Although some of the benefits of our hybrid locks observed in our current sys-
         tem will no longer apply to our new system, their ability to reduce the number
         of critical sections and to simplify atomic operations involving multiple data
         structures is still valuable.


      o  We  are  currently  investigating  alternative  deadlock  management  schemes
         such as the timestamp based approach used in the OSF/1 UFS implementa-
         tion [16  ], to be used in conjunction with hierarchical clustering.  We hope
         to  be  able  to  preserve  the  simplicity  of  the  pessimistic  approach,  and  the
         performance of the optimistic approach.


      o  Finally, we are starting with a more process-oriented kernel, in part to remove
         some of the complications of clustering and deadlock, and in part because we
         believe dynamic process creation can be made to be very fast [8 ]. We will be
         reducing our reliance on spin locks, choosing instead to use either lock-free
         data structures or spin-then-block locks, depending on the situation. As such,
         the benefits of distributed spin locks will likely be reduced, although it should
         be possible to support process blocking under distributed locks by building
         on some of the techniques described in Section 3.2 for handling TryLock.


6      Concluding Remarks


In  this  paper,  we  have  described  a  new  locking  architecture  designed  for  large-
scale shared-memory multiprocessors.  This architecture consists of a number of


                                                        26


components that together provide high performance and scalability. First, a hybrid
coarse-grain/fine-grain locking strategy is used that has the low latency and space
overhead of a coarse-grained locking strategy, while having the high concurrency
of a fine-grained locking strategy.  The coarse-grained locks protect large amounts
of data,  but may only be held for short periods of time.   The fine-grained locks
must be set under the protection of coarse-grained locks, but can protect individual
objects, require only a single bit of storage, and may be held for longer periods of
time.
      Second, Hierarchical Clustering extends the effectiveness of the hybrid locking
strategy to large systems.  It organizes the processors into clusters, with separate
instances of data structures and the locks that protect them on each cluster.  Pri-
marily  read-shared  data  is  replicated  as  needed  to  accessing  clusters,  increasing
concurrency.  Because only processors local to the cluster may access the data in
the  cluster  (requiring  an  RPC  to  access  remote  data),  the  number  of  processors
contending for a lock is bounded, limiting second-order contention effects on the
fine-grained locks.
      Finally, Distributed Locks are used to further reduce the second-order effects of
lock contention.  Our modifications to Distributed Locks bring their uncontended
cost close to that of spin locks.
      The results of our performance experiments clearly demonstrate the effective-
ness of our strategy, at least for our current hardware base.  The independent fault
test showed little contention for the coarse-grained locks up to 4 processors, sug-
gesting that this aspect of the hybrid locking strategy is appropriate for clusters with
up to 8 or even 16 processors under more realistic workloads.  However, for non-
independent faults, which require greater cross-processor interactions, cluster sizes
larger than 4 provided the best performance.  Taken together these results suggest
that with a mix of real applications having both independent and non-independent
demands,  a cluster size somewhere in the range of 4 to 16 processors would be
optimal for our system.
      Overall, we have found that the design of a locking architecture is largely an
exercise in global optimization, as one tries to balance the strengths and weaknesses
of both the techniques and the underlying hardware. However, we believe that many
of the techniques presented in this paper will also apply to other systems.


References


  [1]   David  L.  Black,  Avadis  Tevanian  Jr.  ,  David  B.  Golub,  and  Michael  W.
        Young.  Locking and reference counting in the mach kernel.  In Proc. 1991


                                                        27


        ICPP, volume II, Software,  pages II#167#II#173,  Boca Raton,  FL, August
        1991. CRC Press.


  [2]   H.H.Y. Chang and B. Rosenburg. Experience porting mach to the RP3 large-
        scale shared-memory multiprocessor.  Future Generation Computer Systems,
        7(2#3):259#267, April 1992.


  [3]   E. Chaves, P.C Das, T. J. LeBlanc, B. D. Marsh, and M. L. Scott. Kernel-kernel
        communication in a shared-memory multiprocessor.  Concurrency:  Practice
        and Experience, 5(3):171#191, May 1993.


  [4]   J.  Bradley  Chen  and  Brian  N.  Bershad.   The  impact  of  operating  system
        structure on memory system performance.  In Proc. 14th ACM SOSP, pages
        120#133, 1993.


  [5]   Travis S. Craig.  Building FIFO and priority-queuing spin locks from atomic
        swap.  Technical Report TR 93-02-02,  University of Washington,  02 1993.
        (ftp tr/1993/02/UW-CSE-93-02-02.PS.Z from cs.washington.edu).


  [6]   R. P. Draves, B. N. Bershad, R. F. Rashid, and R. W. Dean. Using continuations
        to implement thread management and communication in operating systems.
        In Proc. 13th ACM SOSP, page 122, Paci#c Grove, CA, October 1991.


  [7]   S. Frank, J. Rothnie, and H. Burkhardt. The KSR1: Bridging the gap between
        shared memory and MPPs.  In IEEE Compcon 1993 Digest of Papers, pages
        285#294, 1993.


  [8]   B.  Gamsa,  O.  Krieger,  and  M.  Stumm.   Optimizing  IPC  performance  for
        shared-memory multiprocessors.  In Proc. 1994 ICPP, pages 208#211, Boca
        Raton, FL, August 1994. CRC Press.


  [9]   Erik  Hagersten,  Anders  Landin,  and  Seif  Haridi.   #DDM  #  A  Cache-Only
        Memory Architecture#.  IEEE Computer, pages 44#54, September 1992.


[10]    Maurice Herlihy. Wait-free synchronization. ACM TOPLAS, 13(1):124#149,
        January 1991.


[11]    Maurice Herlihy. A methodology for implementing highly concurrent objects.
        ACM TOPLAS, 15(5):745#770, November 1993.


[12]    J. Kent Peacock, S. Saxena, D. Thomas, F. Yang, and W. Yu.  Experiences
        from multithreading system V release 4. In SEDMS III, pages 77#91. Usenix
        Assoc, March 1992.


                                                        28


[13]    Orran Krieger. HFS: A #exible #le system for shared memory multiprocessors.
        PhD thesis, Department of Electrical and Computer Engineering, University
        of Toronto, Toronto, Canada, 1994.


[14]    Orran Krieger,  Michael Stumm,  and Ronald Unrau.  The Alloc Stream Fa-
        cility:  A redesign of application-level stream I/O.  Technical Report CSRI-
        275, Computer Systems Research Institute, University of Toronto, Toronto,
        Canada, M5S 1A1, October 1992.


[15]    Beng-Hong Lim and Anant Agarwal.  Reactive synchronization algorithms
        for multiprocessors.  In ASPLOS-VI, 1994.  To appear.


[16]    Susan LoVerso, Noemi Paciorek, Alan Langerman, and George Feinberg. The
        OSF/1 UNIX #lesystem (UFS).  In USENIX Conference Proceedings, pages
        207#218, Dallas, TX, January 21-25 1991. USENIX.


[17]    Peter Magnussen, Anders Landin, and Erik Hagersten. Queue locks on cache
        coherent multiprocessors.  In 8th IPPS, pages 26#29, 1994.


[18]    H. Massalin and C. Pu. A lock-free multiprocessor OS kernel. Technical Re-
        port CUCS-005-91, Department of Computer Science, Columbia University,
        February 1991.


[19]    J.M. Mellor-Crummey and M.L. Scott. #Algorithms for Scalable Synchroniza-
        tion on Shared-Memory Multiprocessors#.  ACM Transactions on Computer
        Systems, 9(1):21#65, February 1991.


[20]    Noemi Paciorek, Susan Lo Verso, and Alan Langerman.  Debugging multi-
        processor operating system kernels. In SEDMS II, pages 185#202. USENIX,
        Atlanta GA, March 21 - 22 1991.


[21]    J. Kent Peacock.  File system multithreading in system V release 4 MP.  In
        USENIX Conference Proceedings, pages 19#30, San Antonio, TX, Summer
        1992. USENIX.


[22]    Abraham Silberschatz, James L. Peterson, and Peter Galvin.  Operating Sys-
        tems Concepts.  Addison-Wesley, third edition edition, 1991.


[23]    Daniel  Stodolsky,  J.  Bradley  Chen,  and  Brian  N.  Bershad.   Fast  interrupt
        priority management in operating system kernels.  In USENIX Microkernels
        Workshop. USENIX, 1993.


                                                        29


[24]    Josep  Torrellas,  Anoop  Gupta,  and  John  L.  Hennessy.   Characterizing  the
        caching and synchronization performance of a multiprocessor operating sys-
        tem.   In  ASPLOS-IV  Proceedings,  pages  162#174,  Boston,  Massachusetts,
        1992.


[25]    R.  Unrau,  M.  Stumm,  O.  Krieger,  and  B.  Gamsa.   Hierarchical  clustering:
        A structure for scalable multiprocessor operating system design.  Journal of
        Supercomputing. To appear. Also available as technical report CSRI-268 from
        ftp.csri.toronto.edu.


[26]    Ronald C. Unrau. Scalable Memory Management through Hierarchical Sym-
        metric Multiprocessing.  PhD thesis, Department of Electrical and Computer
        Engineering, University of Toronto, Toronto, Canada, January 1993.


[27]    Zvonko G. Vranesic, Michael Stumm, Ron White, and David Lewis.  #The
        Hector Multiprocessor#.  Computer, 24(1), January 1991.


                                                        30