The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

(This is an ASCII version of the paper presented at the Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, California, September, 1993. Figures were omitted from this version.)

****************

Distributed Shared Abstractions (DSA) on Large-Scale Multiprocessors

by Christian Clemencon, Bodhisattwa Mukherjee, Karsten Schwan
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
e-mail: clmenco@lse.epfl.ch, {bodhi,schwan}@cc.gatech.edu

ABSTRACT: Any parallel program has abstractions that are shared by the program's multiple processes, including data structures containing shared data, code implementing operations, type instances used for synchronization or communication, etc. Such shared abstractions can considerably affect parallel program performance, on distributed as well as on shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is the encapsulation of shared abstractions as objects that may be internally distributed across different nodes of the parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that program portability can be maintained across parallel architectures ranging from small-scale multiprocessors, to large-scale shared and distributed memory machines, to networks of computer workstations. This paper demonstrates an implementation of the DSA library on shared memory multiprocessors. The library is evaluated using a parallel implementation of a branch-and-bound algorithm for solving the Traveling Salesperson Problem (TSP). This evaluation is performed on a 32-node GP1000 BBN Butterfly multiprocessor, and the experimental results are compared to measurements attained on a 32-node Kendall Square Supercomputer.

1. Introduction: A parallel program can be viewed as a set of independent processes interacting via shared abstractions. Such abstractions include shared data, shared types like work queues and locks, and globally executed operations like global sums, merging scan-lines into coherent bitmaps, etc. Since the shared abstractions used in a parallel program represent the program's global information, their efficient implementation can be crucial to the program's performance, its scalability to different size machines, and its portability to different target architectures. For example, the differences between local and remote memory access costs in most large-scale parallel machines (i.e., the NUMA properties of such machines) require substantive changes in the implementation of synchronization constructs when moving from small-scale to large-scale parallel machines.
The contribution of our work toward increased scalability and portability of parallel programs is the provision of the DSA library for the efficient implementation of shared abstractions. Portability is attained by encapsulation of shared abstractions as objects with well-defined operational interfaces. Scalability is achieved by implementation of such objects as sets of object fragments linked by a user-defined communication structure, which we term a topology. Such distributed shared abstractions (DSA) permit programmers (1) to take advantage of localities of reference to locally vs. remotely stored object state and (2) to encode their application-level knowledge about the communication patterns among object fragments. Furthermore, contention for access to an object can be reduced by distribution of its representation across multiple nodes, since many operations on the object will access only locally stored copies of its distributed state. Distributed representations of shared abstractions can result in significant performance benefits, as demonstrated for many implementations of higher level operating system services in distributed systems (e.g., file systems) and for application-specific services on distributed memory machines. For shared memory multiprocessors, similar results have been attained for RPC implementations on NUMA machines like the BBN Butterfly multiprocessor, and are demonstrated in this paper for a program-specific abstraction (i.e., a shared queue) in a parallel branch-and-bound application executed on a 32-node GP1000 BBN Butterfly and a 32-node Kendall Square Research Supercomputer. In this shared queue, alternative fragmentations of the object make use of application-level information about both the specific pattern and the rates of communication between multiple queue fragments. Sample application-level knowledge includes desirable or acceptable global or local orderings among queue elements, tolerable delays regarding the propagation of information among queue fragments, etc.

The functionality of the DSA library (1) permits programmers to define and create encapsulated, fragmented objects, and (2) offers low-level mechanisms for implementing efficient, abstraction-specific communications among object fragments. Since the library is built as an extension of a Mach-compatible Cthreads package developed by our group, a parallel program written with the DSA library consists of a set of independent threads interacting via DSA objects. Portability of the DSA library to different shared memory multiprocessors is due to the portability of the underlying Cthreads library. Portability of DSA-based programs from shared to distributed memory machines (including workstation networks) is due to the use of an easily ported remote invocation mechanism for communication among object fragments.

The remainder of this paper first compares our work with related research (Section 1.1). Section 2 presents a sample parallel application and the abstractions shared by its concurrent processes; the performance effects of alternative shared memory implementations of such shared abstractions are then evaluated experimentally on a 32-node BBN Butterfly multiprocessor. Section 3 describes the DSA library. The library is evaluated in detail in Section 4, Section 5 discusses its portability, and finally, Section 6 describes our conclusions and future research.

1.1 Related Research: Our research differs from current work on distributed shared abstractions in several ways.
First, in contrast to recent research in cache architectures for parallel machines and in weakly consistent distributed shared memory, we do not assume some fixed model (or limited number of models) of consistency between object fragments. Instead, programmers can implement object-specific protocols for state consistency among object fragments, using the low-level remote invocation mechanism offered by the DSA library. Second, since communications among object fragments are explicitly programmed, shared abstractions implemented with the library are not subject to some of the performance penalties in distributed shared memory systems arising from sharing multiple, small abstractions allocated on a single shared page (i.e., false sharing leading to additional and/or excessively large communications). Conversely, by adding calls like ``invalidate(page)'' and ``get(page)'', etc. to our current low-level communication calls, distributed shared memory (DSM) abstractions may be implemented and compared with alternative representations within the existing DSA library. Third, in contrast to the research of Shapiro on fragmented objects, we explicitly consider the communication structure linking object fragments in order to exploit application-specific knowledge of the object's communication patterns. Fourth, in contrast to our previous implementation of the DSA library on hypercube machines, the layering of DSA objects on a basic remote invocation mechanism has resulted in library portability to various target platforms, including the aforementioned shared memory platforms and a recently completed implementation using Cthreads and PVM. Last, we note that shared abstractions are easily instrumented, evaluated, and even dynamically adjusted without exposing such instrumentation to application programs.

2. Shared Abstractions in Parallel Programs: Programmers use a variety of methods for decomposition of programs into concurrently executable processes, including the static or dynamic decomposition of a program's data domains, divide and conquer strategies, functional decompositions, and pipelining. Many parallel programs resulting from such decompositions exhibit coordinator/server structures, where coordinator processes generate work units processed by workers or at least supervise a number of workers. Sample applications structured in this fashion range from (1) domain-decomposed scientific applications to (2) MultiLisp implementations on parallel machines, where ``futures'' are entered into queues and removed and processed by available processors, to (3) parallel optimization codes, and (4) even operating system services like file or I/O servers. The sample parallel program used in our research is a client/server structured application, a parallel branch-and-bound algorithm solving the Traveling Salesperson Problem (TSP). We employ the algorithm of Little, Murty, Sweeney and Karel (LMSK algorithm), and we use a parallelization similar to the one by Mohan on the Cm* multiprocessor. The resulting parallel LMSK algorithm is implemented as a collection of asynchronous, cooperating searcher threads, each of which independently conducts a search in a dynamically constructed search space until the best tour is found.
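A minimal sketch of one searcher thread's main loop appears below (plain C; all names are illustrative assumptions rather than the paper's actual code, and the work queue and best-tour abstractions it relies on are introduced in the next paragraph):

    struct node;                                    /* search-tree node; its fields are
                                                       discussed below                     */
    extern struct node *queue_get(void);            /* assumed work-queue operations       */
    extern void         queue_put(struct node *);
    extern long         best_tour_read(void);       /* assumed best-tour operations        */
    extern void         best_tour_update(long);
    extern long         node_bound(struct node *);  /* lower bound of a node's subproblem  */
    extern int          node_is_tour(struct node *);
    extern void         expand(struct node *, struct node **, struct node **);
    extern void         node_free(struct node *);

    /* Hypothetical skeleton of one searcher thread: repeatedly take the most
     * promising node from the shared queue, prune it against the current best
     * tour, and either record a completed tour or enqueue the node's children.
     * (The termination protocol is simplified to `stop when the queue is empty'.) */
    void searcher(void)
    {
        struct node *n;

        while ((n = queue_get()) != NULL) {
            if (node_bound(n) >= best_tour_read()) {   /* cannot improve: prune         */
                node_free(n);
                continue;
            }
            if (node_is_tour(n)) {
                best_tour_update(node_bound(n));       /* a completed, cheaper tour      */
            } else {
                struct node *a, *b;
                expand(n, &a, &b);                     /* LMSK branching: two children   */
                queue_put(a);
                queue_put(b);
            }
            node_free(n);
        }
    }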
The searchers cooperate using two shared abstractions: (1) a work sharing queue (``work queue'') storing the leaf nodes of the tree representing the search space and permitting the dynamic distribution of work among searchers, and (2) a shared integer value (``best tour'') representing the current best tour found so far and used by searchers to prune the search space. A TSP computation is initiated by a single thread, by first enqueuing a representation of the initial problem (the root node) in the work sharing queue, and then forking some predefined number of searcher threads. A complete description of the LMSK algorithm and its parallel implementation can be found in and , respectively. The parallel LMSK algorithm is of interest for two reasons. First, branch-and-bound algorithms are commonly used in the solution of optimization problems and have, therefore, been frequently studied and evaluated on parallel machines. Second, experimental evaluations of the algorithm's implementation on distributed memory platforms and on workstation networks have already demonstrated the importance of the work sharing and tour abstractions to parallel program performance, where different implementations of the queue itself and of load balancing among queue fragments significantly affect speedup and scalability.

Shared Memory Implementation of TSP Abstractions

When using shared memory to implement the TSP ``tour'' and ``work queue'' abstractions, two important factors affect the resulting parallel program's performance: (1) contention due to concurrent abstraction access (synchronization overhead) and (2) remote memory access costs (communication overhead). Specifically, in the BBN Butterfly, the ratio of access costs to local vs. remote memory is approximately 1:12, which implies that the cost of executing an operation on a shared abstraction strongly depends on the number of remote references performed by the operation. As a result, and in order to reduce contention and take advantage of locality, implementations of shared abstractions in NUMA machines like the BBN Butterfly often explicitly distribute their state and code to participating processors' memory units, and then use abstraction-specific communication structures to maintain consistency among such distributed information. Alternative implementations of the shared work queue abstraction are described and evaluated next. We continue to use shared memory for implementation of the ``tour'' abstraction, since the performance impacts of alternative implementations of this abstraction are small in NUMA multiprocessors (this may not be the case for distributed memory machines, as described in ).

The work sharing queue stores the current leaf nodes of the search tree. In a parallel implementation, this abstraction implements: (1) a node selection heuristic, (2) a work distribution strategy, and (3) a protocol for program termination. The node selection heuristic is implemented as an ordering of queue elements. Queue elements (nodes) are ordered (a) by their lower bound on the problem's solution and (b) by their sub-problem sizes. When using a double-priority queue ordered by (a) and (b), retrieval and processing of the first node on the queue implements a best first heuristic for node selection. In other words, best first node selection always chooses for expansion the node with the least subproblem size from the set of nodes that have the lowest lower bound value. This strategy favors nodes that are likely to lead to good solutions fast.
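Expressed in terms of the hypothetical node helpers from the sketch above, this double-priority ordering amounts to a single comparison function (again an illustrative sketch, not the paper's code):

    struct node;
    extern long node_bound(struct node *);    /* from the sketch above                  */
    extern int  node_size(struct node *);     /* assumed: size of the node's subproblem */

    /* Order nodes first by their lower bound (a) and then by their subproblem
     * size (b); a queue kept sorted by this predicate yields the `best first'
     * node at its head.                                                         */
    static int node_before(struct node *x, struct node *y)
    {
        if (node_bound(x) != node_bound(y))
            return node_bound(x) < node_bound(y);
        return node_size(x) < node_size(y);
    }

A priority queue maintained with `node_before' therefore dequeues exactly the node prescribed by the best first heuristic.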
This best first strategy has been shown useful in many parallel LMSK implementations, since it tends to minimize the total number of nodes in the final search tree. We term a queue implementation consistent if a global priority ordering is maintained among queue elements. A consistent queue faithfully implements the `best first' node selection heuristic, whereas queue implementations that do not maintain a total queue ordering -- termed inconsistent -- decrease the effectiveness of the node selection heuristic. Such decreased effectiveness is undesirable since it leads to substantial additional computations in the parallel algorithm due to the expansion of nodes that would not be expanded by the sequential algorithm -- termed additional nodes. Since the TSP's search space is constructed dynamically, another important role of the work sharing abstraction is to ensure the equal distribution of work (i.e., nodes) among searchers. This is trivially ensured when using a global queue. In fragmented queue implementations, however, load-balancing must be performed among queue fragments. Since such load balancing must take into account both the sizes of queue fragments (number of nodes per fragment) and the ordering among nodes, it is henceforth termed quality balancing. An effective quality balancing strategy, then, ensures both a global ordering of nodes and an equal distribution of nodes among queue fragments. Tradeoffs in effectiveness vs. efficiency of queue implementation and quality balancing are apparent in the three alternative queue implementations and their measurements described next.

a) Global queue representation: A first implementation using a single queue copy takes advantage of the BBN Butterfly's shared memory. Each searcher thread allocates new nodes in its processor's local memory. However, all nodes are linked into a single queue that spans all processors. A predetermined, single processor maintains the queue's head as well as a spinlock for mutual exclusion in queue access. In this implementation, no work distribution strategy is needed, and the termination protocol is implicit: searchers terminate when the queue is empty.

b) Distributed representation without quality-balancing: A second implementation attempts to maximize locality of access to queue elements, while performing minimal load balancing. Specifically, the global priority queue is split into several subqueues, which are interconnected with a unidirectional ring. Each searcher thread owns a local queue fragment, which is implemented as a priority queue and protected by a local spinlock. Each searcher thread enters nodes into and removes nodes from its local subqueue, and allocates new nodes in local memory. The work distribution strategy performs load-balancing as follows: if a searcher performs a `get' operation on an empty local queue fragment, it simply removes the `best' node from the next non-empty remote queue fragment along the ring. This results in the sharing of `good' nodes among searchers only when searchers have exhausted their own parts of the search space. This queue representation also requires an explicit termination protocol. In this case, a searcher terminates when all of the queue fragments along the ring are empty and at least one tour has been found.

c) Distributed representation with quality-balancing: A third implementation is like the previous one, but also performs continuous quality-balancing.
Specifically, similar to the strategy used in , every two `get' operations by a searcher thread on its local queue trigger a move of the second best node from the local queue to the next subqueue along the ring (a minimal code sketch of this strategy is given below). As a result, `good' nodes are frequently shared among different searcher threads. This increases the overall quality of nodes used by searcher threads, but it also increases the total number of accesses made by threads to non-local node representations. (Sharing nodes more frequently -- on every `get' operation -- or less frequently -- on every four `get' operations -- results in performance degradation. Similarly, whether node sharing is associated with `get' or with `put' operations appears to have no visible performance effect.)

The Scalability of Parallel Programs: A Case Study of Shared Queues

All measurements given in this section are performed on a 32-node GP1000 BBN Butterfly. The measurements shown are the averages of the executions of 100 different, randomly generated TSP problems. Each TSP problem has 32 cities and is described by an initial random cost matrix, with costs in the range of 1 to 50. Each TSP problem is executed for each of the three work sharing abstractions, with the same initial cost matrix. Each searcher thread executes on its own dedicated processor with local copies of its code, stack, local data, and with a local copy of the cost matrix. The first experimental results shown below demonstrate that achieving good scalability in parallel programs requires representations of shared abstractions that take into account program semantics as well as program implementation details. Specifically, Figure shows the speedups of the TSP application when it is executed with 1 to 25 processors using each of the three different queue implementations. Variant a is the global queue implementation, in which searcher threads share all subproblems, ranked using knowledge about program semantics (subproblem size and quality). In contrast, variant b is the distributed queue without quality balancing, where searcher threads share no knowledge concerning such program semantics. Variant c is the distributed queue with quality balancing. Speedup is computed as the ratio between the sequential execution time (18484 milliseconds for the sequential implementation of TSP) and the parallel execution time. The results depicted in Figure demonstrate that significant execution speedups are possible with the distributed queue implementations. Similar speedups should be achievable on larger parallel machines as long as the problem size is increased beyond the 32 cities used in our measurements. It is apparent from Figure that variant c -- the distributed queue with quality balancing -- behaves best. While improvements in locality of access to queue elements exist in variant b compared to variant a, the disadvantages incurred by additional work performed by searcher threads outweigh the accrued performance gains. In other words, while the distributed implementations (variants b and c) are superior to variant a regarding locality of access, the complete loss of the total ordering maintained by the global queue in variant a is not acceptable. Therefore, it is not an effective strategy to implement shared abstractions without using information about program semantics.
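The following sketch outlines the quality-balancing `get' operation of variant c referred to above (plain C; names, helper functions, and the fragment layout are illustrative assumptions, and locking as well as the termination protocol are omitted):

    struct node;

    /* One queue fragment per searcher; fragments form a unidirectional ring. */
    typedef struct fragment {
        struct node     *queue;      /* local double-priority queue              */
        int              gets;       /* `get' counter used for quality balancing */
        struct fragment *ring_next;  /* successor fragment along the ring        */
        /* a per-fragment spinlock protects these fields in the real code        */
    } fragment_t;

    extern struct node *pq_remove_best(fragment_t *);         /* assumed helpers */
    extern struct node *pq_remove_second_best(fragment_t *);
    extern void         pq_insert(fragment_t *, struct node *);

    struct node *queue_get_balanced(fragment_t *local)
    {
        struct node *n = pq_remove_best(local);

        if (n == NULL) {
            /* Empty local fragment: take the best node from the next non-empty
             * fragment along the ring, as in variant b.                          */
            fragment_t *f = local->ring_next;
            while (f != local && (n = pq_remove_best(f)) == NULL)
                f = f->ring_next;
            return n;
        }

        /* Variant c only: every second local `get' forwards the second best
         * node to the ring successor, so that `good' nodes circulate.           */
        if (++local->gets % 2 == 0) {
            struct node *share = pq_remove_second_best(local);
            if (share != NULL)
                pq_insert(local->ring_next, share);
        }
        return n;
    }

Variant b corresponds to the same code with the final balancing step removed.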
The measurements above and similar results reported for distributed memory machines are our main motivation for rejecting conceptually simpler approaches like distributed shared memory for the implementation of shared abstractions in parallel programs. Figure provides additional explanation of the results depicted in Figure , by depicting the total number of nodes expanded in order to arrive at a solution. As stated in the previous paragraph, the total number of expanded nodes is highest when load sharing ignores semantic information in variant b (i.e., no quality balancing), whereas the number of expanded nodes with quality balancing (variant c) closely approximates the number attained with the globally ordered priority queue (variant a). The elapsed time in the queue abstraction consists of time spent in the `get' and `put' operations, which can be decomposed into: (1) the time spent for managing the queue, (2) the time spent for explicit communication between queue fragments (generating messages, processing requests, etc.), (3) the time spent waiting for locks protecting the queue from concurrent accesses, and (4) the wait time experienced during termination detection. Of these times, (1) is insignificant, since the time spent managing the double priority queue is only about 1.75% of total program execution time. In variant a, the time spent in the work sharing abstraction is almost entirely due to (3) -- queue access contention. This time increases significantly with the number of processors; beyond 15 processors it exceeds the time spent doing useful work (i.e., expanding nodes). It is the main cause for the degradation of speedup in variant a as shown in Figure . In variant b, contention is insignificant, because searchers almost always access local queue fragments; therefore, the time searchers spend in the queue is primarily due to queue management and termination detection, neither of which is very time-consuming. As expected, the quality balancing performed in variant c increases the time spent in the queue abstraction, but this increase is outweighed by the significant reduction in the total number of node expansions performed during problem solution. An issue not discussed above is the storage of node data, which results in performance differences regarding the expansion of locally vs. remotely stored nodes. In this implementation of TSP on the BBN Butterfly machine, such differences are not as significant as in distributed memory implementations (expansion of a locally vs. a remotely stored node takes 25 ms vs. 27 ms on the BBN Butterfly).

To summarize, we have used the shared queue abstraction in a parallel branch-and-bound program to demonstrate that the TSP program's speedup is limited by the performance of the abstractions shared by its processes. Interestingly, this paper shows that the most complex implementation of the major shared abstraction in TSP -- the fragmented, quality balanced queue -- is superior to its alternatives. Unfortunately, this implementation is not simple, which reduces the much-heralded ease of implementation offered to application programmers by the multiprocessor's shared memory model. Ease of programming is the topic of the remainder of this paper.
3. The DSA Library: Implementing Distributed Objects: The primary objectives of the DSA library are to (1) facilitate the implementation of shared, distributed abstractions in parallel programs, and (2) provide a unified interface for implementing and using such shared objects in shared memory vs. non-shared memory environments. The library's structure, implementation, and performance on multiple shared memory machines are described next, including its performance on a 32-node GP1000 BBN Butterfly multiprocessor and a 32-node Kendall Square Supercomputer. Additional examples regarding DSA's use and an alternative implementation of DSA as part of the operating system kernel on a distributed memory machine are described in .

A DSA object defines a communication structure (a topology) and protocol among a number of communicating user threads. When viewed by other program components (i.e., the end user's view), such an object appears as a single abstraction shared among threads that potentially execute on different processors. The object exports well-defined operations that may be invoked by any thread able to access it. In contrast, the implementor of a DSA object understands the object's internal structure to consist of a set of cooperating object fragments (state and code), where (1) object fragments may be stored in different memory units, and (2) such fragments must explicitly communicate in order to execute some (or all) of the operations performed on the object. Three distinct views of objects are offered by the interfaces provided by the DSA library (Figure ). At user level, the library offers routines for binding a user thread to an already defined object and for invoking the object's operations. At representation level, the library offers routines for object creation and data structures defining object fragments and the communication structure. At implementation level, the library offers routines for implementing individual object fragments, including their operations, their communications with other fragments, etc.

DSA Objects -- User-Level Interface

Basic definition. A DSA object consists of a set of identical fragments that are potentially located on different processors. Such distributed fragments are connected with a statically defined logical communication structure. As a simple example, consider the tour value shared by all searcher threads in the TSP application. For simplicity in implementation, we represent this object as identical fragments (or vertices) linked by a ring communication structure. (The use of different versions of object fragments in a single DSA object is not supported in the current implementation of the DSA library; such a generalization of DSA objects can be useful and is discussed further in .) Even though an object may have a complex internal communication structure, an end user only knows about the locations of specific object fragments, each of which exports all of the object's operations and therefore, much like a `proxy', locally emulates the object's complete functionality. For example, the shared tour object's operations are `read tour' and `new tour', and the private data of the object is the current `best tour' value. Communications among fragments are not visible to object invokers. In the case of the tour object, such communications concern updates to the local copies of `best tour' values stored in object fragments.
Specifically, while `read tour' simply reads the copy of the tour value stored in the local fragment, `new tour' initiates the propagation of a new tour value around the ring to other object fragments. This propagation can be performed asynchronously to the execution of additional operations on the local or remote object fragments, so that the desired consistency of the multiple copies of tour values around the ring can be controlled by the tour object's implementation. Object binding. Before using an instantiated DSA object, an application's threads must be bound to the object's vertices. Each vertex may be bound to zero or multiple threads, and each thread may be bound to multiple vertices of the same or of different DSA object instances. However, the current version of the library requires that a binding is performed only between a thread and locally stored vertices. Bindings are established, and broken using library routines listed in Figure . The `top open' routine returns a handle for further accesses to the specified object's vertex. The `top close' routine breaks the specified `obj handle' binding. However, `top close' does not clean up object state for future use. Such cleanup operations have to be implemented as additional services called explicitly by application programs. Object invocation. The effect of `top send' is the invocation of the service identified by `srv id' in the vertex identified by handle `obj handle'. The required invocation parameters are assumed to be packaged in a parameter block called `param', and `param size' indicates the size of this block. Each invocation may be tagged with an arbitrary, user-provided `tag' value, which may be used for communication of sequencing information, etc. If user programs require synchronization with output generation at the local fragment, they may invoke the vertex operation `top send w'. This operation will block the invoker until the invoked service and fragment have generated all of the required outputs. A user thread obtains the result of a service executed by an invoked vertex by calling `top receive'. This routine copies the parameters returned by service `srv id' into the buffer pointed to by `param'. The `tag' parameter permits a wild card value. The `top receive w' routine blocks the caller thread until the requested return value is available at the local fragment. Since such threads resume execution in the `top receive w' routine, they will complete the receive upon being dispatched. Given the library operations defined on object fragments, the procedural interface of the tour object can be easily written to isolate end users from the implementation details of objects implemented with the library . DSA Objects -- Creation and Internal Representation Object creation. The creation of DSA object instances is typically performed at the time of program initialization. Once created, an object instance cannot be removed until program termination. A DSA object is described by (1) the size of the private data buffer in the object's vertices, (2) the operations it implements, (3) the object's logical communication structure and (4) the mapping of the logical structure to the physical nodes of the underlying machine. 
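As a concrete illustration of such a description (and anticipating the services table and `top create' parameters detailed next), the following sketch sets up a unidirectional ring of eight vertices for a tour-like object. All identifiers, the layout of a services-table entry, and the parameter order of the creation call are assumptions made for this example, not the library's actual definitions:

    #define N_VERTICES 8                     /* one vertex (fragment) per processor    */

    enum { NEW_TOUR = 1, READ_TOUR = 2 };    /* service identifiers (illustrative)     */
    enum { TOP_ADT, TOP_TADT };              /* representation specifiers (assumed)    */
    #define TOUR_OBJ 1                       /* object identifier (illustrative)       */

    /* Hypothetical layout of one services-table entry: identifier, the three
     * procedure addresses (precondition, service routine, postcondition), and
     * a representation specifier.                                                     */
    typedef struct {
        int    srv_id;
        void (*precond)(void *ib);
        void (*service)(void *ib);
        void (*postcond)(void *ib);
        int    representation;
    } service_desc_t;

    extern void new_tour_srv(void *ib);      /* service routines (assumed names)       */
    extern void read_tour_srv(void *ib);
    extern void top_create();                /* DSA library routine; prototype assumed */

    static service_desc_t services[] = {
        { NEW_TOUR,  0, new_tour_srv,  0, TOP_ADT },
        { READ_TOUR, 0, read_tour_srv, 0, TOP_ADT },
    };

    static int connection[N_VERTICES][N_VERTICES];  /* NxN from/to connection matrix   */
    static int mapping[N_VERTICES];                 /* vertex -> physical node         */

    void create_tour_object(void)
    {
        int i;
        for (i = 0; i < N_VERTICES; i++) {
            connection[i][(i + 1) % N_VERTICES] = 1;     /* unidirectional ring        */
            mapping[i] = i;                              /* vertex i on node i         */
        }
        /* Parameter order paraphrased from the description of `top create' below:
         * object id, per-fragment state size, number of services, services table,
         * number of vertices, connection matrix, vertex-to-node mapping, and the
         * two invocation-block pool sizes.                                            */
        top_create(TOUR_OBJ, sizeof(int), 2, services,
                   N_VERTICES, connection, mapping, 8, 8);
    }

Here both services are simple ADTs and use neither preconditions nor postconditions.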
Each service implementing an object operation is described in a table element by (1) a unique identifier, (2) three procedure addresses, including (a) a procedure performing precondition evaluation (explained further below), (b) a procedure implementing the actual operation, called a service routine, and (c) a procedure performing postcondition evaluation (also explained below), and (3) a representation specifier. A service executed as a procedure is represented as an ADT (Abstract Data Type), whereas a service executed asynchronously to the invoker by its own thread is represented as a TADT (Thread Abstract Data Type). The logical communication structure of an object is described as an NxN from/to connection matrix, where N is the number of vertices. (Routines able to generate such a matrix at the time of program initialization may be used in place of the simple statically defined structure shown in this example.) Similarly, the mapping of vertices to physical nodes is described by a table with `N' entries. Given the matrix and table structures as above, an application creates a mapped object instance by calling the ``top create'' routine (Figure ). The first seven parameters of this routine describe the new object's id, the space required for each fragment's state, the number of service routines whose addresses appear in the `services table', the number of vertices and the connection matrix, and the mapping of vertices to physical processor nodes. The last two parameters determine the size of the pool of pre-allocated invocation blocks associated with each object fragment.

The implementor of an object can control the activation of a service by associating a precondition routine with that service. The role of such a routine is to define a service scheduling policy, based on the availability of inputs for that service in a particular vertex. For instance, a precondition may require that inputs from all input edges must be present in order to activate a service, as shown useful in synchronization objects implemented as combining trees or in certain implementations of objects implementing global sums or minima. Similarly, postconditions associated with service routines can control the propagation of values across the object's communication structure by controlling output generation at vertices. An output may be generated after each service execution, or after some delay required or desired by the application. Furthermore, the result of a service's execution may be sent to one, some, or all output edges of a vertex, or to a user thread requiring it. For example, in a tree-structured global sum object, while each vertex can incrementally perform its addition operations upon the arrival of each input, a single output cannot be generated until all inputs have been received and added. This requires the use of a postcondition. By default, a DSA service routine is activated incrementally as each input for that service arrives. In summary, the purpose of the precondition and postcondition procedures specified as part of object creation and executed with each object invocation (if such routines have been specified) is to determine (1) when a service routine is activated in response to an invocation (service scheduling), (2) when control is returned to the user thread (invocation control), and (3) what, if any, other object fragments must be accessed for execution of the desired service (fragment management).
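To illustrate how service routines and postconditions cooperate, consider a minimal sketch of one vertex of the tree-structured global sum mentioned above. The private-data layout, the underscored routine names, and the exact signatures are assumptions made for this example; the support routines used here are DSA library facilities described below, under `Library Support for Service Implementation':

    /* top_data_p, top_param_p, top_postcond, top_output_edges, and
     * top_output_user come from the DSA library (exact prototypes assumed). */

    /* Private data kept in each vertex of a hypothetical tree-structured
     * global-sum object (layout is an assumption for this sketch).          */
    typedef struct {
        long sum;        /* running sum of the inputs received so far        */
        int  arrived;    /* number of inputs received so far                 */
        int  expected;   /* number of input edges (children) of this vertex,
                            initialized when the object is created           */
    } sum_state_t;

    /* Service routine: activated incrementally for each arriving input
     * (the default activation policy); it simply adds the input value.      */
    void sum_srv(void *ib)
    {
        sum_state_t *st  = (sum_state_t *) top_data_p(ib);
        long        *val = (long *) top_param_p(ib);

        st->sum += *val;
        st->arrived++;
        top_postcond(ib);            /* let the postcondition decide on output */
    }

    /* Postcondition: defines the output policy -- no output until every
     * child has contributed, then forward the partial sum up the tree.      */
    void sum_post(void *ib)
    {
        sum_state_t *st  = (sum_state_t *) top_data_p(ib);
        long        *val = (long *) top_param_p(ib);

        if (st->arrived < st->expected)
            return;                  /* still waiting for some input edges     */

        *val = st->sum;              /* reuse the ib's parameter block          */
        top_output_edges(ib);        /* send the sum across the output edge(s);
                                        at the tree root, top_output_user(ib)
                                        would return the result instead        */
    }

Here the addition itself happens incrementally (the default), and the postcondition alone delays output generation until all inputs have arrived, exactly the policy described above.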
DSA Objects -- Implementation of Object Fragments

Input and output queues. It is apparent from the discussions above that each object fragment is constructed such that its operations (services, pre- and postconditions) can be executed asynchronously with the invoking program. Furthermore, a fragment's services may be executed in response to invocations from other object fragments or from a locally bound user-level thread. As a result, each object fragment's implementation contains addressing information about bound threads and connected fragments, and it contains several queueing structures in addition to the aforementioned object state and its user-specified services and pre- and postconditions. These queues include: (1) an input queue shared by all threads bound to the vertex, (2) an edge queue for each edge providing input to the vertex, and (3) output queues for each vertex output.

Service routines. Service routines perform the computations implementing an object's operations. A service is executed either as a result of invocation of the operation on the local fragment, or when a message arrives at a fragment's input edge. If executed as a result of a message receipt from another fragment, the information is contained in an invocation block `ib' queued in the fragment's input queue. Each invocation block contains routing information (source and destination vertices), an identifier of the invoked service, a buffer into which the parameters required by this service have been packed, and a tag value. The invocation block only contains a pointer to the actual parameters, so that unnecessary copy operations are avoided. The services defined in a topology object are uniquely identified by an integer `Tag'. For instance, in a ring topology, the application code or pre-/postconditions can use the tag value to identify a previously sent ib that has fully traversed the ring. A few examples of service routines, the use of postconditions, and application-dependent memory consistency are documented in .

Remote invocations. The edges connecting object fragments are uni-directional, logical communication links. While the physical representation of such an edge is the appropriate edge queue of the target vertex, all communications across edges use a remote fragment invocation mechanism. As an example, consider a link from vertex v1 to v2. Whenever a service routine in v1 outputs a new tour value across this edge (i.e., enters data into the appropriate edge queue of v2), it also initiates the execution of the target vertex' service routine `new tour srv'. The resulting remote queue access coupled with remote service routine execution comprises the remote invocation protocol used by the library for fragment communications. Such remote invocations can be immediate, which means that the control flow on the target vertex' processor is interrupted (using Unix `signal' operations), or they can be delayed, which means that the remote service will be executed only when the user thread bound to the remote vertex executes one of the vertex' operations. Both alternatives have been implemented, and measurements of both will be shown in Section 4.

Service representation. Since services may range from simple, low-latency message switching to complex computations, the DSA library offers two different execution modes for service routines -- (1) Small grain computations can be performed by service routines implemented as procedures called in response to an invocation. Execution of such a service is atomic (non-preemptible) and multiple invocations of it are thus implicitly serialized.
(2) Larger grain computations can be performed by service routines represented as preemptible threads. A new thread is created for each invocation of such a service. Threads executing service routines are scheduled in a round robin fashion and have priority over user-level threads.

Library Support for Service Implementation

This section reviews the DSA library's low-level support routines used for service implementation. We elide details of the implementation of invocation blocks and of the addressing information maintained in those blocks. Instead, we assume that such blocks are the atomic units manipulated at this level of the DSA library.

Preconditions. A local fragment invocation or an invocation from a remote fragment initiates the execution of the appropriate service routine directly when no precondition exists; otherwise, it calls the precondition procedure. In either case, an invocation block (ib) is provided. The precondition is executed non-preemptively, and it must explicitly activate the actual service using the support routine `top service'. Activation of a service either involves calling the procedure defining the service, or creating a new thread that will execute this procedure, depending on the service's representation. The aforementioned queueing structures inside each vertex are required because services can be implemented to execute asynchronously with the user threads requesting them. Several routines are available for precondition procedures to check and manipulate those queues (Figure ). `Top dequeue input' scans the vertex's input queue and dequeues the first ib that matches the given service identifier and tag value. The tag parameter admits a wild-card value. `Top check input' checks the vertex's input queue for ib's that match the given parameters. `Condition' may be a combination of different flags for specifying complex conditions like: `ib's must be available from all input edges', `from at least one input edge', `from a user thread', etc.

Service routines. A service routine implements the actual functionality of the operation performed by a service. The address of the vertex' private data and the address of the parameter block referenced in the ib are generated using the macros `top data p' and `top param p'. Once completed, a service that wishes to send output parameters to other vertices, or to a user thread, can activate its postcondition procedure with the `top postcond' routine.

Postconditions. As with preconditions, all postconditions are executed non-preemptively. A postcondition defines a service's output policy. Specifically, each vertex contains an output queue for temporary storage of output ib's, and the DSA library offers access routines for queue manipulation (Figure ). A postcondition procedure can use these routines to define an output propagation policy for its vertex. The most important action taken by postconditions is to generate vertex output. `Top output edges' sends a copy of the specified ib across all of the vertex' output edges. For exception handling, or when a vertex' output edges cannot be defined as part of the object's creation, the postcondition procedure can alternatively use the routine `top output vertex', which sends a copy of the ib only to the single specified vertex. This routine is particularly useful when an object's communication structure is constructed dynamically, such as in dynamic broadcast trees, or for message routing in distributed systems. Finally, `top output user' is used for transmission of results to a user thread.
Such transmissions are performed via the vertex's output queue. Namely, the routine first enqueues the specified invocation block on the output queue and then checks if a user thread is waiting for it. If a thread is waiting, the routine puts the thread back in the processor's ready queue. 4. Evaluation of the DSA Library The DSA library has been implemented on a 32-node GP1000 BBN Butterfly multiprocessor, on a KSR supercomputer, and on a smaller-scale SGI multiprocessor. In this section, detailed performance results are presented for the BBN machine, followed by a few comparative measurements on the KSR. For reference, a procedure call without parameters costs approximately 3 microseconds on the BBN Butterfly, a call to a local abstract data type (an ADT) costs about 18 microseconds, and a thread context switch in our lightweight threads library costs about 215 microseconds. DSA Object Creation and Access Object representation. The performance of DSA objects depends in part on their internal representation. In addition to the queueing structures used for fragment inputs and outputs, each fragment is referenced via lists maintained on their processors. Specifically, all vertices located on a processor are linked via a local vertex queue. Furthermore, each vertex is internally described by a vertex control block (abbreviated vcb ). A vcb contains (1) an object instance identifier, (2) the current vertex identifier, (3) a private data buffer, (4) an input queue for the vertex input(s), (5) an output queue to the bound thread(s), (6) a waiting queue, (7) a pool of free ib's, (8) a table describing the object's services, (9) a table describing the output edges (vertex id and node number of each linked vertex), and (10) an array of pointers to all of the object's vcb 's. The latter array permits a vertex to access any remote vcb of the object by direct reference. Object creation. Object creation has not been optimized in the current implementation, in part because objects are typically created at the time of program initialization and are deleted only when the program terminates. However, it is instructive to consider the steps necessary for object creation and undertaken by the library routine `top create'. This routine first allocates each of the object's vcb 's on the appropriate nodes according to the given mapping table. It then initializes these vcb 's as per the object's description. Finally, `top create' sends a creation event with the appropriate vcb to each target node. Upon reception of a creation event, the event dispatcher calls a setup procedure, which enqueues the transmitted vcb in the local vertex queue. Object binding. A user thread binds itself to an object's vertex using the `top open' routine. Specifically, this routine first performs a linear search for the vcb of the specified vertex on the local vertex queue. It then allocates a user control block and binds the calling thread to the specified vertex by storing the thread identifier and the vcb address in the user control block. The latter also contains a single dedicated invocation block for use in subsequent `top send' and `top receive' calls by the bound thread. Finally, the `top open' routine returns a pointer to the user control block as an object handle. Object access. Since object access is critical regarding the performance of DSA objects, the steps taken during `normal' object operations must be highly efficient. 
The current implementation of `top send' performs the following four steps: (1) it disables events, thereby preventing other operations on the local vertex while it is operating on it, (2) it acquires and initializes the single invocation block `owned' by the bound thread, which includes noting the service id and the sizes and addresses of the call's parameters, (3) it performs a local invocation of the requested service, and (4) it enables events. The parameters are directly accessed from within the service; they need not be copied out of the parameter block unless otherwise indicated. The cost of a `top send' operation for services represented as `procedures' (ADTs) is 109 microseconds, whereas a `top send' followed by a `top output user' costs 232 microseconds. These measurements are attained on a single processor node, using the average latency derived from 1000 consecutive calls. When `top send' is performed for services represented as threads (TADTs), additional context switch overheads arise, since the invoking thread has to release the processor, followed by the processor's acquisition by the thread executing the service. Two alternative implementations of such context switching on the BBN Butterfly are available: (1) an un-optimized operating system call that saves signal masks, and (2) an optimized context switch for lightweight threads. The cost of (1) is 2.5 milliseconds on the BBN Butterfly, whereas (2) only requires 215 microseconds. Unfortunately, (1) must be used when remote services are invoked in immediate mode on the BBN Butterfly (using Unix signals rather than the low-level interrupts available at kernel level). This results in a 10 millisecond latency for invocation of threaded, immediate services, vs. a 912 microsecond cost of service invocation for simple threaded services. Similarly, the `top receive' routine costs 123 microseconds for both ADT and TADT service types. The `top receive w' routine executes the same steps as the `top receive' routine; in addition, it blocks the caller thread if the invocation block is not present in the output queue. These timings are made when the required invocation block is available and when no other threads are waiting to receive from the vertex. The cost of allocating a single remote invocation block is 43 microseconds.

Remote Service Invocation

The DSA library uses remote invocation as an inter-vertex communication primitive. While low cost of remote invocation is critical to the performance of fragmented objects, the use of remote invocation vs. remote access provides several performance advantages on NUMA multiprocessors: (1) it tends to improve the locality of reference of programs by removing remote references, and (2) it provides implicit synchronization for cooperating threads. Results reported in show that the overheads associated with explicit synchronization and remote references increase with the `size' of remotely accessed data and code, whereas the overheads associated with remote invocation are fixed. Therefore, remote references outperform remote invocation only on `simple' operations. Such results are part of our motivation for implementation of the low-level remote invocation construct for inter-vertex communication. In DSA, a remote invocation is initiated by a call to the `top output edges' or `top output vertex' routines.
A remote service invocation consists of: (1) extracting a free invocation block from the target vertex' pool using remote references, (2) copying invocation data into this block, including parameters, and (3) sending an event carrying the invocation block to the target processor. An event defines an asynchronous action that is to be performed on a remote processor, and such events are associated with messages that pertain to the desired actions. When processing such an event, the target event dispatcher (1) performs a local invocation of the appropriate service, and upon its completion, (2) places the invocation block back into the pool of its home vertex. The performance of inter-vertex communication depends on (a) the costs of event transmission via an event transmission facility and (b) the costs of event activation at the target in either immediate or delayed mode. In order to reduce the costs of event generation, each processor locally maintains an event queue and a pool of pre-allocated event descriptors. These two data structures are protected by a spinlock. The generation of immediate events requires the use of Unix signals or of kernel-level interrupts on the BBN Butterfly. For portability, we use Unix signals. The performance of such signal operations is not satisfactory. Specifically, the cost of generating an inter-processor signal is 750 microseconds on the GP1000 BBN Butterfly. In comparison, the additional overheads due to the implementation of events in the DSA library are small: the total time for sending a single event is 937 microseconds when a UNIX signal is generated, and 187 microseconds otherwise. The time for a handler to dispatch an event is 153 microseconds. However, the Unix implementation on the BBN Butterfly results in a signal delivery time that varies from 1 to 110 milliseconds depending on the involved processors' current activities. Consider an asynchronous DSA object resembling the tour object in the TSP application: the object connects one thread to itself via an 8-vertex ring spanning 8 nodes, and the evaluated service performs only routing of incoming invocation blocks. Due to the extreme variability of UNIX signal delivery times, we have measured total round trip times ranging from 18 to 400 milliseconds.

Due to the high costs of event generation and delivery, the DSA library's BBN Butterfly implementation employs several optimizations of event transmission and servicing. First, since event generation is expensive, several simultaneous events can be grouped into a single event, by simply generating a single event descriptor for multiple invocation blocks entered into the target vertex' input edges. Upon receipt of the event, the target vertex' service routine processes all invocation blocks found in the appropriate input edges. Second, the remote event queue is checked prior to signal generation. An empty queue implies that the remote vertex is currently running the event handler, so that a signal need not be generated. The third optimization concerns event masking. Specifically, since disabling and enabling events is quite expensive on Unix systems (Unix signal masking/unmasking system calls cost 800 microseconds on the BBN Butterfly), our implementation maintains `events enabled' and `events disabled' flags on each processor. These flags are set by the local event handlers and inspected at the time of event generation.
The event generation routines do not issue signals to the target vertex when its events are currently disabled, since that implies that the event handler is currently running on the target processor and will receive and process the invocation blocks that have already been generated and added to the appropriate input edge queues. Given the costs of event generation described above, the total cost of a single remote invocation is 1.15 milliseconds when a UNIX signal is generated and 406 microseconds otherwise. These measurements are attained with an invocation block containing a two-byte parameter. For comparison, a pair of lock/unlock calls in the Cthreads library costs 53 microseconds. A kernel-level implementation of remote invocations like the one described in (using hardware interrupts instead of UNIX signals) would reduce remote invocation costs to roughly 500 microseconds. To summarize, the DSA library's implementation of event generation and delivery favors the use of simultaneous events; therefore, total event generation and processing overheads are reduced for increasing numbers of total events generated in the DSA abstraction's execution.

Performance of TSP with DSA Objects

This section demonstrates the utility of the DSA library by evaluating its use with the TSP application. The first set of measurements reported below compares the performance of TSP's variant c when using the shared memory implementation of the distributed queue vs. using the DSA library for implementation of the same queue variant (see Figure ). Despite the significant overheads of event generation and handling experienced with the signalling implementation of DSA, performance results indicate that the DSA library is suitable even for larger-scale parallel systems: good speedup is maintained when using the DSA library. The speedup results depicted in Figure are explained with additional measurements shown in Figure . These measurements evaluate the ratio of time spent in the work sharing abstraction vs. the application's total execution time. It is apparent from these numbers that the cost of DSA object use is roughly three times higher than the cost of using the direct shared memory implementation of queue variant c (due to the high cost of signalling in the BBN Butterfly's Mach implementation). However, some compensation for those additional costs arises from increases in program locality. Specifically, searcher threads interact only with the locally stored vertices, and all operations on remote vertices are performed by event handlers on remote processors. Similar results are observed when event activation on remote processors is performed in delayed mode. This means that no signals are generated when an event is entered in a remote event queue. Instead, the event queue is checked (polled) each time a local thread accesses the fragment (i.e., performs an operation on the fragment), and at that time all events found in the queue are processed in arrival order. This polling approach works well for frequently accessed abstractions; it does not work for abstractions with vertices that are not bound to local threads (intermediate vertices used for communication only) or for abstractions that exhibit widely varying access frequencies to different fragments.

5. Portability of the DSA Library

The DSA library is easily ported to any machine offering Cthreads support. We have ported the DSA library to several other machines, including Sparcstations, SGI multiprocessors, and the Kendall Square supercomputer.
Preliminary measurements of DSA performance on the KSR platform indicate results similar to those attained on the BBN Butterfly. Specifically, using delayed event generation (i.e., event polling rather than signalling), parallel programs written with the DSA library can deliver performance improvements for large-scale parallel applications. This is demonstrated by the measurements of actual executions of the TSP application with the DSA library on a 32-node KSR machine, shown in Figure . Actual execution times are smaller than those on the BBN Butterfly due to the faster processors of the KSR machine. These results are attained with an initial implementation of the DSA library in which no KSR-specific optimizations have been performed.

6. Conclusions and Future Research

This paper presents the DSA runtime library for the efficient implementation of distributed shared abstractions in multiprocessor systems, notably NUMA multiprocessors. Measurements of the library's primitives and its evaluation with a sample parallel program on a 32-node BBN Butterfly demonstrate that the DSA library supports the implementation of shared abstractions such that they are efficiently executable on large-scale parallel machines (scalability). The implementation of the DSA library assumes the availability of an efficient remote invocation mechanism used for communication among object fragments. The DSA library offers two implementations of this mechanism: one results in immediate fragment execution, asynchronous to the execution of other threads on the same processor; the other delays a fragment's execution until the fragment is accessed by a local thread. Our future research concerns the continued use and optimization of the DSA library on parallel machine platforms. In addition, we are now developing a shared framework for implementation of distributed shared abstractions and of distributed shared memory on parallel and distributed machines.