The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

(This is an ASCII version of the paper presented at the Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, California, September, 1993. Figures were omitted from this version.)

****************

Distributed Shared Abstractions (DSA) on Large-Scale Multiprocessors

by Christian Clemencon, Bodhisattwa Mukherjee, Karsten Schwan
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
e-mail: clmenco@lse.epfl.ch, {bodhi,schwan}@cc.gatech.edu

ABSTRACT: Any parallel program has abstractions that are shared by the program's multiple processes, including data structures containing shared data, code implementing operations, type instances used for synchronization or communication, etc. Such shared abstractions can considerably affect parallel program performance, on distributed as well as on shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is the encapsulation of shared abstractions as objects that may be internally distributed across different nodes of the parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that program portability can be maintained across parallel architectures ranging from small-scale multiprocessors, to large-scale shared and distributed memory machines, to networks of computer workstations. This paper demonstrates an implementation of the DSA library on shared memory multiprocessors. The library is evaluated using a parallel implementation of a branch-and-bound algorithm for solving the Traveling Salesperson Problem (TSP). This evaluation is performed on a 32-node GP1000 BBN Butterfly multiprocessor, and the experimental results are compared to measurements attained on a 32-node Kendall Square Supercomputer.

1. Introduction: A parallel program can be viewed as a set of independent processes interacting via shared abstractions. Such abstractions include shared data, shared types like work queues and locks, and globally executed operations like global sums, merging scan-lines into coherent bitmaps, etc. Since the shared abstractions used in a parallel program represent the program's global information, their efficient implementation can be crucial to the program's performance, its scalability to different size machines, and its portability to different target architectures. For example, the differences between local and remote memory access costs in most large-scale parallel machines (i.e., the NUMA properties of such machines) require substantive changes in the implementation of synchronization constructs when moving from small-scale to large-scale parallel machines.
The contribution of our work toward increased scalability and portability of parallel programs is the provision of the DSA library for the efficient implementation of shared abstractions. Portability is attained by encapsulation of shared abstractions as objects with well-defined operational interfaces. Scalability is achieved by implementation of such objects as sets of object fragments linked by a user-defined communication structure, which we term a topology. Such distributed shared abstractions (DSA) permit programmers (1) to take advantage of localities of reference to locally vs. remotely stored object state and (2) to encode their application-level knowledge about the communication patterns among object fragments. Furthermore, contention for access to an object can be reduced by distribution of its representation across multiple nodes, since many operations on the object will access only locally stored copies of its distributed state. Distributed representations of shared abstractions can result in significant performance benefits, as demonstrated for many implementations of higher level operating system services in distributed systems (e.g., file systems) and for application-specific services on distributed memory machines. For shared memory multiprocessors, similar results have been attained for RPC implementations on NUMA machines like the BBN Butterfly multiprocessor, and are demonstrated in this paper for a program-specific abstraction (i.e., a shared queue) in a parallel branch-and-bound application executed on a 32-node GP1000 BBN Butterfly and a 32-node Kendall Square Research Supercomputer. In this shared queue, alternative fragmentations of the object make use of application-level information about both the specific pattern and the rates of communication between multiple queue fragments. Sample application-level knowledge includes desirable or acceptable global or local orderings among queue elements, tolerable delays regarding the propagation of information among queue fragments, etc.

The functionality of the DSA library (1) permits programmers to define and create encapsulated, fragmented objects, and (2) offers low-level mechanisms for implementing efficient, abstraction-specific communications among object fragments. Since the library is built as an extension of a Mach-compatible Cthreads package developed by our group, a parallel program written with the DSA library consists of a set of independent threads interacting via DSA objects. Portability of the DSA library to different shared memory multiprocessors is due to the portability of the underlying Cthreads library. Portability of DSA-based programs from shared to distributed memory machines (including workstation networks) is due to the use of an easily ported remote invocation mechanism for communication among object fragments.

The remainder of this paper first compares our work with related research (Section 1.1). Section 2 presents a sample parallel application and the abstractions shared by its concurrent processes; the performance effects of alternative shared memory implementations of such shared abstractions are then evaluated experimentally on a 32-node BBN Butterfly multiprocessor. Section 3 describes the DSA library. The library is evaluated in detail in Section 4, Section 5 discusses its portability, and finally, Section 6 describes our conclusions and future research.

1.1 Related Research: Our research differs from current work on distributed shared abstractions in several ways.
First, in contrast to recent research in cache architectures for parallel machines and in weakly consistent distributed shared memory, we do not assume some fixed model (or limited number of models) of consistency between object fragments. Instead, programmers can implement object-specific protocols for state consistency among object fragments, using the low-level remote invocation mechanism offered by the DSA library. Second, since communications among object fragments are explicitly programmed, shared abstractions implemented with the library are not subject to some of the performance penalties in distributed shared memory systems arising from sharing multiple, small abstractions allocated on a single shared page (i.e., false sharing leading to additional and/or excessively large communications). Conversely, by adding calls like ``invalidate(page)'' and ``get(page)'', etc. to our current low-level communication calls, distributed shared memory (DSM) abstractions may be implemented and compared with alternative representations within the existing DSA library. Third, in contrast to the research of Shapiro on fragmented objects, we explicitly consider the communication structure linking object fragments in order to exploit application-specific knowledge of the object's communication patterns. Fourth, in contrast to our previous implementation of the DSA library on hypercube machines, the layering of DSA objects on a basic remote invocation mechanism has resulted in library portability to various target platforms, including the aforementioned shared memory platforms and a recently completed implementation using Cthreads and PVM. Last, we note that shared abstractions are easily instrumented, evaluated, and even dynamically adjusted without exposing such instrumentation to application programs.

2. Shared Abstractions in Parallel Programs: Programmers use a variety of methods for decomposition of programs into concurrently executable processes, including the static or dynamic decomposition of a program's data domains, divide and conquer strategies, functional decompositions, and pipelining. Many parallel programs resulting from such decompositions exhibit coordinator/server structures, where coordinator processes generate work units processed by workers or at least supervise a number of workers. Sample applications structured in this fashion range from (1) domain-decomposed scientific applications to (2) MultiLisp implementations on parallel machines, where ``futures'' are entered into queues and removed and processed by available processors, to (3) parallel optimization codes, and (4) even operating system services like file or I/O servers. The sample parallel program used in our research is a client/server structured application, a parallel branch-and-bound algorithm solving the Traveling Salesperson Problem (TSP). We employ the algorithm of Little, Murty, Sweeney and Karel (LMSK algorithm), and we use a parallelization similar to the one by Mohan on the Cm* multiprocessor. The resulting parallel LMSK algorithm is implemented as a collection of asynchronous, cooperating searcher threads, each of which independently conducts a search in a dynamically constructed search space until the best tour is found.
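A minimal sketch of one searcher thread's main loop appears below (plain C; all names are illustrative assumptions rather than the paper's actual code, and the work queue and best-tour abstractions it relies on are introduced in the next paragraph):

    struct node;                                    /* search-tree node; its fields are
                                                       discussed below                     */
    extern struct node *queue_get(void);            /* assumed work-queue operations       */
    extern void         queue_put(struct node *);
    extern long         best_tour_read(void);       /* assumed best-tour operations        */
    extern void         best_tour_update(long);
    extern long         node_bound(struct node *);  /* lower bound of a node's subproblem  */
    extern int          node_is_tour(struct node *);
    extern void         expand(struct node *, struct node **, struct node **);
    extern void         node_free(struct node *);

    /* Hypothetical skeleton of one searcher thread: repeatedly take the most
     * promising node from the shared queue, prune it against the current best
     * tour, and either record a completed tour or enqueue the node's children.
     * (The termination protocol is simplified to `stop when the queue is empty'.) */
    void searcher(void)
    {
        struct node *n;

        while ((n = queue_get()) != NULL) {
            if (node_bound(n) >= best_tour_read()) {   /* cannot improve: prune         */
                node_free(n);
                continue;
            }
            if (node_is_tour(n)) {
                best_tour_update(node_bound(n));       /* a completed, cheaper tour      */
            } else {
                struct node *a, *b;
                expand(n, &a, &b);                     /* LMSK branching: two children   */
                queue_put(a);
                queue_put(b);
            }
            node_free(n);
        }
    }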
The searchers cooperate using two shared abstractions: (1) a work sharing queue (``work queue'') storing the leaf nodes of the tree representing the search space and permitting the dynamic distribution of work among searchers, and (2) a shared integer value (``best tour'') representing the current best tour found so far and used by searchers to prune the search space. A TSP computation is initiated by a single thread, by first enqueuing a representation of the initial problem (the root node) in the work sharing queue, and then forking some predefined number of searcher threads. A complete description of the LMSK algorithm and its parallel implementation can be found in and , respectively. The parallel LMSK algorithm is of interest for two reasons. First, branch-and-bound algorithms are commonly used in the solution of optimization problems and have, therefore, been frequently studied and evaluated on parallel machines. Second, experimental evaluations of the algorithm's implementation on distributed memory platforms and on workstation networks have already demonstrated the importance of the work sharing and tour abstractions to parallel program performance, where different implementations of the queue itself and of load balancing among queue fragments significantly affect speedup and scalability.

Shared Memory Implementation of TSP Abstractions

When using shared memory to implement the TSP ``tour'' and ``work queue'' abstractions, two important factors affect the resulting parallel program's performance: (1) contention due to concurrent abstraction access (synchronization overhead) and (2) remote memory access costs (communication overhead). Specifically, in the BBN Butterfly, the ratio of access costs to local vs. remote memory is approximately 1:12, which implies that the cost of executing an operation on a shared abstraction strongly depends on the number of remote references performed by the operation. As a result, and in order to reduce contention and take advantage of locality, implementations of shared abstractions in NUMA machines like the BBN Butterfly often explicitly distribute their state and code to participating processors' memory units, and then use abstraction-specific communication structures to maintain consistency among such distributed information. Alternative implementations of the shared work queue abstraction are described and evaluated next. We continue to use shared memory for implementation of the ``tour'' abstraction, since the performance impacts of alternative implementations of this abstraction are small in NUMA multiprocessors (this may not be the case for distributed memory machines, as described in ).

The work sharing queue stores the current leaf nodes of the search tree. In a parallel implementation, this abstraction implements: (1) a node selection heuristic, (2) a work distribution strategy, and (3) a protocol for program termination. The node selection heuristic is implemented as an ordering of queue elements. Queue elements (nodes) are ordered (a) by their lower bound on the problem's solution and (b) by their sub-problem sizes. When using a double-priority queue ordered by (a) and (b), retrieval and processing of the first node on the queue implements a best first heuristic for node selection. In other words, best first node selection always chooses for expansion the node with the least subproblem size from the set of nodes that have the lowest lower bound value. This strategy favors nodes that are likely to lead to good solutions fast.
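Expressed in terms of the hypothetical node helpers from the sketch above, this double-priority ordering amounts to a single comparison function (again an illustrative sketch, not the paper's code):

    struct node;
    extern long node_bound(struct node *);    /* from the sketch above                  */
    extern int  node_size(struct node *);     /* assumed: size of the node's subproblem */

    /* Order nodes first by their lower bound (a) and then by their subproblem
     * size (b); a queue kept sorted by this predicate yields the `best first'
     * node at its head.                                                         */
    static int node_before(struct node *x, struct node *y)
    {
        if (node_bound(x) != node_bound(y))
            return node_bound(x) < node_bound(y);
        return node_size(x) < node_size(y);
    }

A priority queue maintained with `node_before' therefore dequeues exactly the node prescribed by the best first heuristic.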
This best first strategy has been shown useful in many parallel LMSK implementations, since it tends to minimize the total number of nodes in the final search tree. We term a queue implementation consistent if a global priority ordering is maintained among queue elements. A consistent queue faithfully implements the `best first' node selection heuristic, whereas queue implementations that do not maintain a total queue ordering -- termed inconsistent -- decrease the effectiveness of the node selection heuristic. Such decreased effectiveness is undesirable since it leads to substantial additional computations in the parallel algorithm due to the expansion of nodes that would not be expanded by the sequential algorithm -- termed additional nodes. Since the TSP's search space is constructed dynamically, another important role of the work sharing abstraction is to ensure the equal distribution of work (i.e., nodes) among searchers. This is trivially ensured when using a global queue. In fragmented queue implementations, however, load-balancing must be performed among queue fragments. Since such load balancing must take into account both the sizes of queue fragments (number of nodes per fragment) and the ordering among nodes, it is henceforth termed quality balancing. An effective quality balancing strategy, then, ensures both a global ordering of nodes and an equal distribution of nodes among queue fragments. Tradeoffs in effectiveness vs. efficiency of queue implementation and quality balancing are apparent in the three alternative queue implementations and their measurements described next.

a) Global queue representation: A first implementation using a single queue copy takes advantage of the BBN Butterfly's shared memory. Each searcher thread allocates new nodes in its processor's local memory. However, all nodes are linked into a single queue that spans all processors. A predetermined, single processor maintains the queue's head as well as a spinlock for mutual exclusion in queue access. In this implementation, no work distribution strategy is needed, and the termination protocol is implicit: searchers terminate when the queue is empty.

b) Distributed representation without quality-balancing: A second implementation attempts to maximize locality of access to queue elements, while performing minimal load balancing. Specifically, the global priority queue is split into several subqueues, which are interconnected with a unidirectional ring. Each searcher thread owns a local queue fragment, which is implemented as a priority queue and protected by a local spinlock. Each searcher thread enters nodes into and removes nodes from its local subqueue, and allocates new nodes in local memory. The work distribution strategy performs load-balancing as follows: if a searcher performs a `get' operation on an empty local queue fragment, it simply removes the `best' node from the next non-empty remote queue fragment along the ring. This results in the sharing of `good' nodes among searchers only when searchers have exhausted their own parts of the search space. This queue representation also requires an explicit termination protocol. In this case, a searcher terminates when all of the queue fragments along the ring are empty and at least one tour has been found.

c) Distributed representation with quality-balancing: A third implementation is like the previous one, but also performs continuous quality-balancing.
Specifically, similar to the strategy used in , every two `get' operations by a searcher thread on its local queue trigger a move of the second best node from the local queue to the next subqueue along the ring (a minimal code sketch of this strategy is given below). As a result, `good' nodes are frequently shared among different searcher threads. This increases the overall quality of nodes used by searcher threads, but it also increases the total number of accesses made by threads to non-local node representations. (Sharing nodes more frequently -- on every `get' operation -- or less frequently -- on every four `get' operations -- results in performance degradation. Similarly, whether node sharing is associated with `get' or with `put' operations appears to have no visible performance effect.)

The Scalability of Parallel Programs: A Case Study of Shared Queues

All measurements given in this section are performed on a 32-node GP1000 BBN Butterfly. The measurements shown are the averages of the executions of 100 different, randomly generated TSP problems. Each TSP problem has 32 cities and is described by an initial random cost matrix, with costs in the range of 1 to 50. Each TSP problem is executed for each of the three work sharing abstractions, with the same initial cost matrix. Each searcher thread executes on its own dedicated processor with local copies of its code, stack, local data, and with a local copy of the cost matrix. The first experimental results shown below demonstrate that achieving good scalability in parallel programs requires representations of shared abstractions that take into account program semantics as well as program implementation details. Specifically, Figure shows the speedups of the TSP application when it is executed with 1 to 25 processors using each of the three different queue implementations. Variant a is the global queue implementation, in which searcher threads share all subproblems, ranked using knowledge about program semantics (subproblem size and quality). In contrast, variant b is the distributed queue without quality balancing, where searcher threads share no knowledge concerning such program semantics. Variant c is the distributed queue with quality balancing. Speedup is computed as the ratio between the sequential execution time (18484 milliseconds for the sequential implementation of TSP) and the parallel execution time. The results depicted in Figure demonstrate that significant execution speedups are possible with the distributed queue implementations. Similar speedups should be achievable on larger parallel machines as long as the problem size is increased beyond the 32 cities used in our measurements. It is apparent from Figure that variant c -- the distributed queue with quality balancing -- behaves best. While improvements in locality of access to queue elements exist in variant b compared to variant a, the disadvantages incurred by additional work performed by searcher threads outweigh the accrued performance gains. In other words, while the distributed implementations (variants b and c) are superior to variant a regarding locality of access, the complete loss of the total ordering maintained by the global queue in variant a is not acceptable. Therefore, it is not an effective strategy to implement shared abstractions without using information about program semantics.
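The following sketch outlines the quality-balancing `get' operation of variant c referred to above (plain C; names, helper functions, and the fragment layout are illustrative assumptions, and locking as well as the termination protocol are omitted):

    struct node;

    /* One queue fragment per searcher; fragments form a unidirectional ring. */
    typedef struct fragment {
        struct node     *queue;      /* local double-priority queue              */
        int              gets;       /* `get' counter used for quality balancing */
        struct fragment *ring_next;  /* successor fragment along the ring        */
        /* a per-fragment spinlock protects these fields in the real code        */
    } fragment_t;

    extern struct node *pq_remove_best(fragment_t *);         /* assumed helpers */
    extern struct node *pq_remove_second_best(fragment_t *);
    extern void         pq_insert(fragment_t *, struct node *);

    struct node *queue_get_balanced(fragment_t *local)
    {
        struct node *n = pq_remove_best(local);

        if (n == NULL) {
            /* Empty local fragment: take the best node from the next non-empty
             * fragment along the ring, as in variant b.                          */
            fragment_t *f = local->ring_next;
            while (f != local && (n = pq_remove_best(f)) == NULL)
                f = f->ring_next;
            return n;
        }

        /* Variant c only: every second local `get' forwards the second best
         * node to the ring successor, so that `good' nodes circulate.           */
        if (++local->gets % 2 == 0) {
            struct node *share = pq_remove_second_best(local);
            if (share != NULL)
                pq_insert(local->ring_next, share);
        }
        return n;
    }

Variant b corresponds to the same code with the final balancing step removed.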
The measurements above and similar results reported for distributed memory machines are our main motivation for rejecting conceptually simpler approaches like distributed shared memory for the implementation of shared abstractions in parallel programs. Figure provides additional explanation of the results depicted in Figure , by depicting the total number of nodes expanded in order to arrive at a solution. As stated in the previous paragraph, the total number of expanded nodes is highest when load sharing ignores semantic information in variant b (i.e., no quality balancing), whereas the number of expanded nodes with quality balancing (variant c) closely approximates the number attained with the globally ordered priority queue (variant a). The elapsed time in the queue abstraction consists of time spent in the `get' and `put' operations, which can be decomposed into: (1) the time spent for managing the queue, (2) the time spent for explicit communication between queue fragments (generating messages, processing requests, etc.), (3) the time spent waiting for locks protecting the queue from concurrent accesses, and (4) the wait time experienced during termination detection. Of these times, (1) is insignificant, since the time spent managing the double priority queue is only about 1.75% of total program execution time. In variant a, the time spent in the work sharing abstraction is almost entirely due to (3) -- queue access contention. This time increases significantly with the number of processors; beyond 15 processors it exceeds the time spent doing useful work (i.e., expanding nodes). It is the main cause for the degradation of speedup in variant a as shown in Figure . In variant b, contention is insignificant, because searchers almost always access local queue fragments; therefore, the time searchers spend in the queue is primarily due to queue management and termination detection, neither of which is very time-consuming. As expected, the quality balancing performed in variant c increases the time spent in the queue abstraction, but this increase is outweighed by the significant reduction in the total number of node expansions performed during problem solution. An issue not discussed above is the storage of node data, which results in performance differences regarding the expansion of locally vs. remotely stored nodes. In this implementation of TSP on the BBN Butterfly machine, such differences are not as significant as in distributed memory implementations (expansion of a locally vs. a remotely stored node takes 25 ms vs. 27 ms on the BBN Butterfly).

To summarize, we have used the shared queue abstraction in a parallel branch-and-bound program to demonstrate that the TSP program's speedup is limited by the performance of the abstractions shared by its processes. Interestingly, this paper shows that the most complex implementation of the major shared abstraction in TSP -- the fragmented, quality balanced queue -- is superior to its alternatives. Unfortunately, this implementation is not simple, which reduces the much-heralded ease of implementation offered to application programmers by the multiprocessor's shared memory model. Ease of programming is the topic of the remainder of this paper.
3. The DSA Library: Implementing Distributed Objects: The primary objectives of the DSA library are to (1) facilitate the implementation of shared, distributed abstractions in parallel programs, and (2) provide a unified interface for implementing and using such shared objects in shared memory vs. non-shared memory environments. The library's structure, implementation, and performance on multiple shared memory machines are described next, including its performance on a 32-node GP1000 BBN Butterfly multiprocessor and a 32-node Kendall Square Supercomputer. Additional examples regarding DSA's use and an alternative implementation of DSA as part of the operating system kernel on a distributed memory machine are described in .

A DSA object defines a communication structure (a topology) and protocol among a number of communicating user threads. When viewed by other program components (i.e., the end user's view), such an object appears as a single abstraction shared among threads that potentially execute on different processors. The object exports well-defined operations that may be invoked by any thread able to access it. In contrast, the implementor of a DSA object understands the object's internal structure to consist of a set of cooperating object fragments (state and code), where (1) object fragments may be stored in different memory units, and (2) such fragments must explicitly communicate in order to execute some (or all) of the operations performed on the object. Three distinct views of objects are offered by the interfaces provided by the DSA library (Figure ). At user level, the library offers routines for binding a user thread to an already defined object and for invoking the object's operations. At representation level, the library offers routines for object creation and data structures defining object fragments and the communication structure. At implementation level, the library offers routines for implementing individual object fragments, including their operations, their communications with other fragments, etc.

DSA Objects -- User-Level Interface

Basic definition. A DSA object consists of a set of identical fragments that are potentially located on different processors. Such distributed fragments are connected with a statically defined logical communication structure. As a simple example, consider the tour value shared by all searcher threads in the TSP application. For simplicity in implementation, we represent this object as identical fragments (or vertices) linked by a ring communication structure. (The use of different versions of object fragments in a single DSA object is not supported in the current implementation of the DSA library; such a generalization of DSA objects can be useful and is discussed further in .) Even though an object may have a complex internal communication structure, an end user only knows about the locations of specific object fragments, each of which exports all of the object's operations and therefore, much like a `proxy', locally emulates the object's complete functionality. For example, the shared tour object's operations are `read tour' and `new tour', and the private data of the object is the current `best tour' value. Communications among fragments are not visible to object invokers. In the case of the tour object, such communications concern updates to the local copies of `best tour' values stored in object fragments.
Specifically, while `read tour' simply reads the copy of the tour value stored in the local fragment, `new tour' initiates the propagation of a new tour value around the ring to other object fragments. This propagation can be performed asynchronously to the execution of additional operations on the local or remote object fragments, so that the desired consistency of the multiple copies of tour values around the ring can be controlled by the tour object's implementation. Object binding. Before using an instantiated DSA object, an application's threads must be bound to the object's vertices. Each vertex may be bound to zero or multiple threads, and each thread may be bound to multiple vertices of the same or of different DSA object instances. However, the current version of the library requires that a binding is performed only between a thread and locally stored vertices. Bindings are established, and broken using library routines listed in Figure . The `top open' routine returns a handle for further accesses to the specified object's vertex. The `top close' routine breaks the specified `obj handle' binding. However, `top close' does not clean up object state for future use. Such cleanup operations have to be implemented as additional services called explicitly by application programs. Object invocation. The effect of `top send' is the invocation of the service identified by `srv id' in the vertex identified by handle `obj handle'. The required invocation parameters are assumed to be packaged in a parameter block called `param', and `param size' indicates the size of this block. Each invocation may be tagged with an arbitrary, user-provided `tag' value, which may be used for communication of sequencing information, etc. If user programs require synchronization with output generation at the local fragment, they may invoke the vertex operation `top send w'. This operation will block the invoker until the invoked service and fragment have generated all of the required outputs. A user thread obtains the result of a service executed by an invoked vertex by calling `top receive'. This routine copies the parameters returned by service `srv id' into the buffer pointed to by `param'. The `tag' parameter permits a wild card value. The `top receive w' routine blocks the caller thread until the requested return value is available at the local fragment. Since such threads resume execution in the `top receive w' routine, they will complete the receive upon being dispatched. Given the library operations defined on object fragments, the procedural interface of the tour object can be easily written to isolate end users from the implementation details of objects implemented with the library . DSA Objects -- Creation and Internal Representation Object creation. The creation of DSA object instances is typically performed at the time of program initialization. Once created, an object instance cannot be removed until program termination. A DSA object is described by (1) the size of the private data buffer in the object's vertices, (2) the operations it implements, (3) the object's logical communication structure and (4) the mapping of the logical structure to the physical nodes of the underlying machine. 
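As a concrete illustration of such a description (and anticipating the services table and `top create' parameters detailed next), the following sketch sets up a unidirectional ring of eight vertices for a tour-like object. All identifiers, the layout of a services-table entry, and the parameter order of the creation call are assumptions made for this example, not the library's actual definitions:

    #define N_VERTICES 8                     /* one vertex (fragment) per processor    */

    enum { NEW_TOUR = 1, READ_TOUR = 2 };    /* service identifiers (illustrative)     */
    enum { TOP_ADT, TOP_TADT };              /* representation specifiers (assumed)    */
    #define TOUR_OBJ 1                       /* object identifier (illustrative)       */

    /* Hypothetical layout of one services-table entry: identifier, the three
     * procedure addresses (precondition, service routine, postcondition), and
     * a representation specifier.                                                     */
    typedef struct {
        int    srv_id;
        void (*precond)(void *ib);
        void (*service)(void *ib);
        void (*postcond)(void *ib);
        int    representation;
    } service_desc_t;

    extern void new_tour_srv(void *ib);      /* service routines (assumed names)       */
    extern void read_tour_srv(void *ib);
    extern void top_create();                /* DSA library routine; prototype assumed */

    static service_desc_t services[] = {
        { NEW_TOUR,  0, new_tour_srv,  0, TOP_ADT },
        { READ_TOUR, 0, read_tour_srv, 0, TOP_ADT },
    };

    static int connection[N_VERTICES][N_VERTICES];  /* NxN from/to connection matrix   */
    static int mapping[N_VERTICES];                 /* vertex -> physical node         */

    void create_tour_object(void)
    {
        int i;
        for (i = 0; i < N_VERTICES; i++) {
            connection[i][(i + 1) % N_VERTICES] = 1;     /* unidirectional ring        */
            mapping[i] = i;                              /* vertex i on node i         */
        }
        /* Parameter order paraphrased from the description of `top create' below:
         * object id, per-fragment state size, number of services, services table,
         * number of vertices, connection matrix, vertex-to-node mapping, and the
         * two invocation-block pool sizes.                                            */
        top_create(TOUR_OBJ, sizeof(int), 2, services,
                   N_VERTICES, connection, mapping, 8, 8);
    }

Here both services are simple ADTs and use neither preconditions nor postconditions.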
Each service implementing an object operation is described in a table element by (1) a unique identifier, (2) three procedure addresses, including (a) a procedure performing precondition evaluation (explained further below), (b) a procedure implementing the actual operation, called a service routine, and (c) a procedure performing postcondition evaluation (also explained below), and (3) a representation specifier. A service executed as a procedure is represented as an ADT (Abstract Data Type), whereas a service executed asynchronously to the invoker by its own thread is represented as a TADT (Thread Abstract Data Type). The logical communication structure of an object is described as an NxN from/to connection matrix, where N is the number of vertices. (Routines able to generate such a matrix at the time of program initialization may be used in place of the simple statically defined structure shown in this example.) Similarly, the mapping of vertices to physical nodes is described by a table with `N' entries. Given the matrix and table structures as above, an application creates a mapped object instance by calling the ``top create'' routine (Figure ). The first seven parameters of this routine describe the new object's id, the space required for each fragment's state, the number of service routines whose addresses appear in the `services table', the number of vertices and the connection matrix, and the mapping of vertices to physical processor nodes. The last two parameters determine the size of the pool of pre-allocated invocation blocks associated with each object fragment.

The implementor of an object can control the activation of a service by associating a precondition routine with that service. The role of such a routine is to define a service scheduling policy, based on the availability of inputs for that service in a particular vertex. For instance, a precondition may require that inputs from all input edges must be present in order to activate a service, as shown useful in synchronization objects implemented as combining trees or in certain implementations of objects implementing global sums or minima. Similarly, postconditions associated with service routines can control the propagation of values across the object's communication structure by controlling output generation at vertices. An output may be generated after each service execution, or after some delay required or desired by the application. Furthermore, the result of a service's execution may be sent to one, some, or all output edges of a vertex, or to a user thread requiring it. For example, in a tree-structured global sum object, while each vertex can incrementally perform its addition operations upon the arrival of each input, a single output cannot be generated until all inputs have been received and added. This requires the use of a postcondition. By default, a DSA service routine is activated incrementally as each input for that service arrives. In summary, the purpose of the precondition and postcondition procedures specified as part of object creation and executed with each object invocation (if such routines have been specified) is to determine (1) when a service routine is activated in response to an invocation (service scheduling), (2) when control is returned to the user thread (invocation control), and (3) what, if any, other object fragments must be accessed for execution of the desired service (fragment management).
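To illustrate how service routines and postconditions cooperate, consider a minimal sketch of one vertex of the tree-structured global sum mentioned above. The private-data layout, the underscored routine names, and the exact signatures are assumptions made for this example; the support routines used here are DSA library facilities described below, under `Library Support for Service Implementation':

    /* top_data_p, top_param_p, top_postcond, top_output_edges, and
     * top_output_user come from the DSA library (exact prototypes assumed). */

    /* Private data kept in each vertex of a hypothetical tree-structured
     * global-sum object (layout is an assumption for this sketch).          */
    typedef struct {
        long sum;        /* running sum of the inputs received so far        */
        int  arrived;    /* number of inputs received so far                 */
        int  expected;   /* number of input edges (children) of this vertex,
                            initialized when the object is created           */
    } sum_state_t;

    /* Service routine: activated incrementally for each arriving input
     * (the default activation policy); it simply adds the input value.      */
    void sum_srv(void *ib)
    {
        sum_state_t *st  = (sum_state_t *) top_data_p(ib);
        long        *val = (long *) top_param_p(ib);

        st->sum += *val;
        st->arrived++;
        top_postcond(ib);            /* let the postcondition decide on output */
    }

    /* Postcondition: defines the output policy -- no output until every
     * child has contributed, then forward the partial sum up the tree.      */
    void sum_post(void *ib)
    {
        sum_state_t *st  = (sum_state_t *) top_data_p(ib);
        long        *val = (long *) top_param_p(ib);

        if (st->arrived < st->expected)
            return;                  /* still waiting for some input edges     */

        *val = st->sum;              /* reuse the ib's parameter block          */
        top_output_edges(ib);        /* send the sum across the output edge(s);
                                        at the tree root, top_output_user(ib)
                                        would return the result instead        */
    }

Here the addition itself happens incrementally (the default), and the postcondition alone delays output generation until all inputs have arrived, exactly the policy described above.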
DSA Objects -- Implementation of Object Fragments

Input and output queues. It is apparent from the discussions above that each object fragment is constructed such that its operations (services, pre- and postconditions) can be executed asynchronously with the invoking program. Furthermore, a fragment's services may be executed in response to invocations from other object fragments or from a locally bound user-level thread. As a result, each object fragment's implementation contains addressing information about bound threads and connected fragments, and it contains several queueing structures in addition to the aforementioned object state and its user-specified services and pre- and postconditions. These queues include: (1) an input queue shared by all threads bound to the vertex, (2) an edge queue for each edge providing input to the vertex, and (3) output queues for each vertex output.

Service routines. Service routines perform the computations implementing an object's operations. A service is executed either as a result of invocation of the operation on the local fragment, or when a message arrives at a fragment's input edge. If executed as a result of a message receipt from another fragment, the information is contained in an invocation block `ib' queued in the fragment's input queue. Each invocation block contains routing information (source and destination vertices), an identifier of the invoked service, a buffer into which the parameters required by this service have been packed, and a tag value. The invocation block only contains a pointer to the actual parameters, so that unnecessary copy operations are avoided. The services defined in a topology object are uniquely identified by an integer `Tag'. For instance, in a ring topology, the application code or pre-/postconditions can use the tag value to identify a previously sent ib that has fully traversed the ring. A few examples of service routines, the use of postconditions, and application-dependent memory consistency are documented in .

Remote invocations. The edges connecting object fragments are uni-directional, logical communication links. While the physical representation of such an edge is the appropriate edge queue of the target vertex, all communications across edges use a remote fragment invocation mechanism. As an example, consider a link from vertex v1 to v2. Whenever a service routine in v1 outputs a new tour value across this edge (i.e., enters data into the appropriate edge queue of v2), it also initiates the execution of the target vertex' service routine `new tour srv'. The resulting remote queue access coupled with remote service routine execution comprises the remote invocation protocol used by the library for fragment communications. Such remote invocations can be immediate, which means that the control flow on the target vertex' processor is interrupted (using Unix `signal' operations), or they can be delayed, which means that the remote service will be executed only when the user thread bound to the remote vertex executes one of the vertex' operations. Both alternatives have been implemented, and measurements of both will be shown in Section 4.

Service representation. Since services may range from simple, low-latency message switching to complex computations, the DSA library offers two different execution modes for service routines -- (1) Small grain computations can be performed by service routines implemented as procedures called in response to an invocation. Execution of such a service is atomic (non-preemptible) and multiple invocations of it are thus implicitly serialized.
(2) Larger grain computations can be performed by service routines represented as preemptible threads. A new thread is created for each invocation of such a service. Threads executing service routines are scheduled in a round robin fashion and have priority over user-level threads.

Library Support for Service Implementation

This section reviews the DSA library's low-level support routines used for service implementation. We elide details of the implementation of invocation blocks and of the addressing information maintained in those blocks. Instead, we assume that such blocks are the atomic units manipulated at this level of the DSA library.

Preconditions. A local fragment invocation or an invocation from a remote fragment initiates the execution of the appropriate service routine directly when no precondition exists; otherwise, it calls the precondition procedure. In either case, an invocation block (ib) is provided. The precondition is executed non-preemptively, and it must explicitly activate the actual service using the support routine `top service'. Activation of a service either involves calling the procedure defining the service, or creating a new thread that will execute this procedure, depending on the service's representation. The aforementioned queueing structures inside each vertex are required because services can be implemented to execute asynchronously with the user threads requesting them. Several routines are available for precondition procedures to check and manipulate those queues (Figure ). `Top dequeue input' scans the vertex's input queue and dequeues the first ib that matches the given service identifier and tag value. The tag parameter admits a wild-card value. `Top check input' checks the vertex's input queue for ib's that match the given parameters. `Condition' may be a combination of different flags for specifying complex conditions like: `ib's must be available from all input edges', `from at least one input edge', `from a user thread', etc.

Service routines. A service routine implements the actual functionality of the operation performed by a service. The address of the vertex' private data and the address of the parameter block referenced in the ib are generated using the macros `top data p' and `top param p'. Once completed, a service that wishes to send output parameters to other vertices, or to a user thread, can activate its postcondition procedure with the `top postcond' routine.

Postconditions. As with preconditions, all postconditions are executed non-preemptively. A postcondition defines a service's output policy. Specifically, each vertex contains an output queue for temporary storage of output ib's, and the DSA library offers access routines for queue manipulation (Figure ). A postcondition procedure can use these routines to define an output propagation policy for its vertex. The most important action taken by postconditions is to generate vertex output. `Top output edges' sends a copy of the specified ib across all of the vertex' output edges. For exception handling, or when a vertex' output edges cannot be defined as part of the object's creation, the postcondition procedure can alternatively use the routine `top output vertex', which sends a copy of the ib only to the single specified vertex. This routine is particularly useful when an object's communication structure is constructed dynamically, such as in dynamic broadcast trees, or for message routing in distributed systems. Finally, `top output user' is used for transmission of results to a user thread.
Such transmissions are performed via the vertex's output queue. Namely, the routine first enqueues the specified invocation block on the output queue and then checks if a user thread is waiting for it. If a thread is waiting, the routine puts the thread back in the processor's ready queue. 4. Evaluation of the DSA Library The DSA library has been implemented on a 32-node GP1000 BBN Butterfly multiprocessor, on a KSR supercomputer, and on a smaller-scale SGI multiprocessor. In this section, detailed performance results are presented for the BBN machine, followed by a few comparative measurements on the KSR. For reference, a procedure call without parameters costs approximately 3 microseconds on the BBN Butterfly, a call to a local abstract data type (an ADT) costs about 18 microseconds, and a thread context switch in our lightweight threads library costs about 215 microseconds. DSA Object Creation and Access Object representation. The performance of DSA objects depends in part on their internal representation. In addition to the queueing structures used for fragment inputs and outputs, each fragment is referenced via lists maintained on their processors. Specifically, all vertices located on a processor are linked via a local vertex queue. Furthermore, each vertex is internally described by a vertex control block (abbreviated vcb ). A vcb contains (1) an object instance identifier, (2) the current vertex identifier, (3) a private data buffer, (4) an input queue for the vertex input(s), (5) an output queue to the bound thread(s), (6) a waiting queue, (7) a pool of free ib's, (8) a table describing the object's services, (9) a table describing the output edges (vertex id and node number of each linked vertex), and (10) an array of pointers to all of the object's vcb 's. The latter array permits a vertex to access any remote vcb of the object by direct reference. Object creation. Object creation has not been optimized in the current implementation, in part because objects are typically created at the time of program initialization and are deleted only when the program terminates. However, it is instructive to consider the steps necessary for object creation and undertaken by the library routine `top create'. This routine first allocates each of the object's vcb 's on the appropriate nodes according to the given mapping table. It then initializes these vcb 's as per the object's description. Finally, `top create' sends a creation event with the appropriate vcb to each target node. Upon reception of a creation event, the event dispatcher calls a setup procedure, which enqueues the transmitted vcb in the local vertex queue. Object binding. A user thread binds itself to an object's vertex using the `top open' routine. Specifically, this routine first performs a linear search for the vcb of the specified vertex on the local vertex queue. It then allocates a user control block and binds the calling thread to the specified vertex by storing the thread identifier and the vcb address in the user control block. The latter also contains a single dedicated invocation block for use in subsequent `top send' and `top receive' calls by the bound thread. Finally, the `top open' routine returns a pointer to the user control block as an object handle. Object access. Since object access is critical regarding the performance of DSA objects, the steps taken during `normal' object operations must be highly efficient. 
The current implementation of `top send' performs the following four steps: (1) it disables events, thereby preventing other operations on the local vertex while it is operating on it, (2) it acquires and initializes the single invocation block `owned' by the bound thread, which includes noting the service id and the sizes and addresses of the call's parameters, (3) it performs a local invocation of the requested service, and (4) it enables events. The parameters are directly accessed from within the service; they need not be copied out of the parameter block unless otherwise indicated. The cost of a `top send' operation for services represented as `procedures' (ADTs) is 109 microseconds, whereas a `top send' followed by a `top output user' costs 232 microseconds. These measurements are attained on a single processor node, using the average latency derived from 1000 consecutive calls. When `top send' is performed for services represented as threads (TADTs), additional context switch overheads arise, since the invoking thread has to release the processor, followed by the processor's acquisition by the thread executing the service. Two alternative implementations of such context switching on the BBN Butterfly are available: (1) an un-optimized operating system call that saves signal masks, and (2) an optimized context switch for lightweight threads. The cost of (1) is 2.5 milliseconds on the BBN Butterfly, whereas (2) only requires 215 microseconds. Unfortunately, (1) must be used when remote services are invoked in immediate mode on the BBN Butterfly (using Unix signals rather than the low-level interrupts available at kernel level). This results in a 10 millisecond latency for invocation of threaded, immediate services, vs. a 912 microsecond cost of service invocation for simple threaded services. Similarly, the `top receive' routine costs 123 microseconds for both ADT and TADT service types. The `top receive w' routine executes the same steps as the `top receive' routine; in addition, it blocks the caller thread if the invocation block is not present in the output queue. These timings are made when the required invocation block is available and when no other threads are waiting to receive from the vertex. The cost of allocating a single remote invocation block is 43 microseconds.

Remote Service Invocation

The DSA library uses remote invocation as an inter-vertex communication primitive. While low cost of remote invocation is critical to the performance of fragmented objects, the use of remote invocation vs. remote access provides several performance advantages on NUMA multiprocessors: (1) it tends to improve the locality of reference of programs by removing remote references, and (2) it provides implicit synchronization for cooperating threads. Results reported in show that the overheads associated with explicit synchronization and remote references increase with the `size' of remotely accessed data and code, whereas the overheads associated with remote invocation are fixed. Therefore, remote references outperform remote invocation only on `simple' operations. Such results are part of our motivation for implementation of the low-level remote invocation construct for inter-vertex communication. In DSA, a remote invocation is initiated by a call to the `top output edges' or `top output vertex' routines.
A remote service invocation consists of: (1) extracting a free invocation block from the target vertex' pool using remote references, (2) copying invocation data into this block, including parameters, and (3) sending an event carrying the invocation block to the target processor. An event defines an asynchronous action that is to be performed on a remote processor, and such events are associated with messages that pertain to the desired actions. When processing such an event, the target event dispatcher (1) performs a local invocation of the appropriate service, and upon its completion, (2) places the invocation block back into the pool of its home vertex. The performance of inter-vertex communication depends on (a) the costs of event transmission via an event transmission facility and (b) the costs of event activation at the target in either immediate or delayed mode. In order to reduce the costs of event generation, each processor locally maintains an event queue and a pool of pre-allocated event descriptors. These two data structures are protected by a spinlock. The generation of immediate events requires the use of Unix signals or of kernel-level interrupts on the BBN Butterfly. For portability, we use Unix signals. The performance of such signal operations is not satisfactory. Specifically, the cost of generating an inter-processor signal is 750 microseconds on the GP1000 BBN Butterfly. In comparison, the additional overheads due to the implementation of events in the DSA library are small: the total time for sending a single event is 937 microseconds when a UNIX signal is generated, and 187 microseconds otherwise. The time for a handler to dispatch an event is 153 microseconds. However, the Unix implementation on the BBN Butterfly results in a signal delivery time that varies from 1 to 110 milliseconds depending on the involved processors' current activities. Consider an asynchronous DSA object resembling the tour object in the TSP application: the object connects one thread to itself via an 8-vertex ring spanning 8 nodes, and the evaluated service performs only routing of incoming invocation blocks. Due to the extreme variability of UNIX signal delivery times, we have measured total round trip times ranging from 18 to 400 milliseconds.

Due to the high costs of event generation and delivery, the DSA library's BBN Butterfly implementation employs several optimizations of event transmission and servicing. First, since event generation is expensive, several simultaneous events can be grouped into a single event, by simply generating a single event descriptor for multiple invocation blocks entered into the target vertex' input edges. Upon receipt of the event, the target vertex' service routine processes all invocation blocks found in the appropriate input edges. Second, the remote event queue is checked prior to signal generation. An empty queue implies that the remote vertex is currently running the event handler, so that a signal need not be generated. The third optimization concerns event masking. Specifically, since disabling and enabling events is quite expensive on Unix systems (Unix signal masking/unmasking system calls cost 800 microseconds on the BBN Butterfly), our implementation maintains `events enabled' and `events disabled' flags on each processor. These flags are set by the local event handlers and inspected at the time of event generation.
The event generation routines do not issue signals to the target vertex when its events are currently disabled, since that implies that the event handler is currently running on the target processor and will receive and process the invocation blocks that have already been generated and added to the appropriate input edge queues. Given the costs of event generation described above, the total cost of a single remote invocation is 1.15 milliseconds when a UNIX signal is generated and 406 microseconds otherwise. These measurements are attained with an invocation block containing a two-byte parameter. For comparison, a pair of lock/unlock calls in the Cthreads library costs 53 microseconds. A kernel-level implementation of remote invocations like the one described in (using hardware interrupts instead of UNIX signals) would reduce remote invocation costs to roughly 500 microseconds. To summarize, the DSA library's implementation of event generation and delivery favors the use of simultaneous events; therefore, total event generation and processing overheads are reduced for increasing numbers of total events generated in the DSA abstraction's execution.

Performance of TSP with DSA Objects

This section demonstrates the utility of the DSA library by evaluating its use with the TSP application. The first set of measurements reported below compares the performance of TSP's variant c when using the shared memory implementation of the distributed queue vs. using the DSA library for implementation of the same queue variant (see Figure ). Despite the significant overheads of event generation and handling experienced with the signalling implementation of DSA, performance results indicate that the DSA library is suitable even for larger-scale parallel systems: good speedup is maintained when using the DSA library. The speedup results depicted in Figure are explained with additional measurements shown in Figure . These measurements evaluate the ratio of time spent in the work sharing abstraction vs. the application's total execution time. It is apparent from these numbers that the cost of DSA object use is roughly three times higher than the cost of using the direct shared memory implementation of queue variant c (due to the high cost of signalling in the BBN Butterfly's Mach implementation). However, some compensation for those additional costs arises from increases in program locality. Specifically, searcher threads interact only with the locally stored vertices, and all operations on remote vertices are performed by event handlers on remote processors. Similar results are observed when event activation on remote processors is performed in delayed mode. This means that no signals are generated when an event is entered in a remote event queue. Instead, the event queue is checked (polled) each time a local thread accesses the fragment (i.e., performs an operation on the fragment), and at that time all events found in the queue are processed in arrival order. This polling approach works well for frequently accessed abstractions; it does not work for abstractions with vertices that are not bound to local threads (intermediate vertices used for communication only) or for abstractions that exhibit widely varying access frequencies to different fragments.

5. Portability of the DSA Library

The DSA library is easily ported to any machine offering Cthreads support. We have ported the DSA library to several other machines, including Sparcstations, SGI multiprocessors, and the Kendall Square supercomputer.
Preliminary measurements of DSA performance on the KSR platform indicate results similar to those attained on the BBN Butterfly. Specifically, using delayed event generation (i.e., event polling rather than signalling), parallel programs written with the DSA library can deliver performance improvements for large-scale parallel applications. This is demonstrated by the measurements of actual executions of the TSP application with the DSA library on a 32-node KSR machine, shown in Figure . Actual execution times are smaller than those on the BBN Butterfly due to the faster processors of the KSR machine. These results are attained with an initial implementation of the DSA library in which no KSR-specific optimizations have been performed.

6. Conclusions and Future Research

This paper presents the DSA runtime library for the efficient implementation of distributed shared abstractions in multiprocessor systems, notably NUMA multiprocessors. Measurements of the library's primitives and its evaluation with a sample parallel program on a 32-node BBN Butterfly demonstrate that the DSA library supports the implementation of shared abstractions such that they are efficiently executable on large-scale parallel machines (scalability). The implementation of the DSA library assumes the availability of an efficient remote invocation mechanism used for communication among object fragments. The DSA library offers two implementations of this mechanism: one results in immediate fragment execution, asynchronous to the execution of other threads on the same processor; the other delays a fragment's execution until the fragment is accessed by a local thread. Our future research concerns the continued use and optimization of the DSA library on parallel machine platforms. In addition, we are now developing a shared framework for implementation of distributed shared abstractions and of distributed shared memory on parallel and distributed machines.