The Design and Evaluation of a Shared Object System
                      for Distributed Memory Machines

                     Daniel J. Scales and Monica S. Lam

                        Computer Systems Laboratory
                       Stanford University, CA 94305
                        {scales,lam}@cs.stanford.edu

                                  Abstract

       This paper describes the design and evaluation of SAM, a shared
    object system for distributed memory machines.  SAM is a portable
    run-time system that provides a global name space and automatic
    caching of shared data.  SAM incorporates mechanisms to address the
    problem of high communication overheads on distributed memory
    machines; these mechanisms include tying synchronization to data
    access, chaotic access to data, prefetching of data, and pushing of
    data to remote processors.  SAM has been implemented on the CM-5,
    Intel iPSC/860 and Paragon, IBM SP1, and networks of workstations
    running PVM. SAM applications run on all these platforms without
    modification.

       This paper provides an extensive analysis on several complex
    scientific algorithms written in SAM on a variety of hardware
    platforms.  We find that the performance of these SAM applications
    depends fundamentally on the scalability of the underlying parallel
    algorithm, and whether the algorithm's communication requirements
    can be satisfied by the hardware.  Our experience suggests that SAM
    is successful in allowing programmers to use distributed memory
    machines effectively with much less programming effort than required
    today.
_______________________________
This research was supported in part by DARPA contract DABT63-91-K-0003.


1  Introduction

Distributed memory systems, especially in the form of networks of
workstations, are an important computational resource.  However, programming
distributed memory machines using commonly available message-passing
libraries is a difficult process.  The necessary communication and
synchronization in an application must be reduced to low-level
message-passing calls that require coordination between the senders and
receivers.  The difficulties become even greater for applications that have
highly irregular parallelism and communication.  The single address space
provided by shared memory machines significantly eases the process of
programming these kinds of applications.  Shared data are easily accessed
via the shared address space without regard to where data are currently
located in the system.  In addition, cache-coherent shared memory
multiprocessors reduce the cost of accesses to shared data by automatically
caching recently accessed data locally.

Our thesis is that it is possible to build a portable software layer that
provides the shared memory programming model in a distributed memory
environment, while providing mechanisms to help tolerate the relatively high
cost of communication on these machines.  In this paper we describe the
design and evaluation of a shared object system called SAM which we have
implemented that supports this thesis.  Shared data items in SAM are
accessed via a global name space and are automatically cached in the main
memory of each processor to exploit the locality of reference in
applications.  For efficiency reasons, shared data are accessed and managed
only at the level of user-defined data types.  By requiring that programmers
signify how data will be used when it is accessed, SAM can combine
synchronization and data access and reduce communication.  In addition, SAM
provides operations that allow the programmer to explicitly optimize
communication when desired.  SAM is intended to be used in exploiting
task-level parallelism; we believe that the parallelism inherent in
operations on regular data structures, such as arrays, is most effectively
exploited by parallelizing compilers.

SAM has been implemented as a C library on a variety of platforms:  on the
CM-5, Intel iPSC/860, Intel Paragon, and IBM SP1, and on heterogeneous
networks of workstations using PVM [26].  To evaluate the effectiveness of
SAM, we have chosen a number of complex scientific applications that have
previously been developed by other researchers either for machines with
hardware shared memory support or for distributed memory machines using the
message-passing style.  We have implemented these applications on
distributed memory machines using SAM; each application runs unchanged
across the various platforms.  We then attempt to evaluate the usefulness of
the SAM system by analyzing the performance, programming difficulty, and SAM
overheads of these applications, and comparing with the original
implementations on shared memory multiprocessors or using message passing.

In the following section, we describe the SAM design rationale and discuss
some related work.  Next, we describe the basic SAM primitives.  Then we

                                     2


describe the applications, explain some of the aspects of programming these
applications using SAM, and give performance numbers on all the machines.
We also give figures on the various costs of parallelizing these
applications, including the SAM software overhead.  We then describe the
effects of various aspects of SAM's design on the performance of the
applications, and conclude.


2  Design Rationale

In this section, we give background on some existing software distributed
shared memory systems and describe the basic SAM design principles.


2.1  Background

There have been a number of software systems for distributed memory machines
that give the user the illusion of a shared memory and provides automatic
caching of shared data.  Shared virtual memory (SVM) systems such as
Ivy [18] apply the concepts of hardware caching to virtual memory pages.
Such systems transparently support the view of a single address space across
processors.  However, the use of pages as the unit of coherence can result
in extensive false sharing; unnecessary communication occurs because of the
placement of unrelated shared data on the same page, which may be difficult
to avoid in applications with complex data structures.  In addition, the
protocols for maintaining cache consistency may produce excess
communication, since the system treats the entire memory uniformly.
Processes use lock and unlock operations to synchronize with each others'
data accesses.  The system must guarantee that all memory updates preceding
a synchronization operation are observable by all processors when the
synchronization completes, though most updates may be relevant to only a few
processors.

Munin [15] and Treadmarks [16] address the problem of false sharing by
allowing multiple processors to write to a page and merging changes at the
next synchronization point.  In addition, Treadmarks attempts to alleviate
the problems of excess consistency messages by implementing lazy release
consistency, in which the modifications or invalidations to data are not
sent until a processor acquires a lock that requires that it see those
modifications.  Cox et al. [12] compare the performance of several
applications on a network of eight workstations running Treadmarks and on a
hardware shared-memory multiprocessor.  Some of the applications perform
comparably under the two systems, but several do not scale well for certain
inputs and one application must be restructured to get speedups under
Treadmarks.  It remains undemonstrated as to how these approaches perform
for applications with more demanding communication patterns.

To provide the user with control over the granularity of communication and
avoid false sharing, systems such as Amber [11], Prelude [28], Orca [1],
Midway [3], and Linda [5, 7] communicate data at the level of user-defined

                                     3


data types (or objects), rather than virtual-memory pages.  Amber, Prelude,
and Orca provide access to shared data in the context of object-oriented
languages.  Amber and Prelude primarily provide access to shared objects by
moving tasks to the processor containing the data.  Orca and Midway allow
for replication of shared objects across processors.  One Orca
implementation replicates objects on all the processors and uses an update
protocol to maintain the consistency of shared data.  Such a protocol works
well for some applications, but can cause performance problems in many other
applications.  Midway associates arbitrary regions of shared data with
locks, and ensures that shared data are consistent when the lock associated
with the data is acquired.  In this way, it successfully hides
synchronization messages by combining them with data access messages.  The
reader-writer locks protecting the shared data are managed via an
invalidation protocol.  The Linda system provides operations that insert,
read, and remove ``tuples'' from a shared tuple space.  Sophisticated
compiler analysis is typically required to analyze and optimize the
communication in a Linda program.  We do not know of a distributed-memory
Linda implementation that provides dynamic caching of tuples (except for one
which broadcasts all tuples to all processors [7]).


2.2  Design of SAM

SAM takes one step further than Orca and Midway in providing user control
over communication on a distributed memory machine.  SAM also communicates
data at the level of user-defined data types.  While most shared object
systems differ primarily in the consistency schemes and implementations
used, SAM is based on a significantly different set of primitives that are
motivated by optimizations commonly used on distributed memory machines.
Message-passing programs seldom incur extra communication just for the sake
of synchronization, since synchronization is tied with the arrival of
messages that also contain data.  Furthermore, it is possible to relax
synchronization by taking advantage of the explicit copies made on
distributed address space machines.  For example, a write to a local copy of
shared data need not wait for all reads of the older version of the data to
complete as long as a copy of the original version exists.  Finally,
message-passing programs can hide the latency of communication by sending
data to remote processors before they require the data.  SAM preserves these
advantages of message-passing systems while providing the ease of
programming associated with shared memory machines.

The basic principle of SAM is to require the programmer to designate the way
in which data will be accessed.  With this access information, SAM can
combine synchronization and communication of data.  There are two kinds of
data in SAM, which correspond to the two kinds of data relationships (hence
synchronization) in parallel programs:

- values with a single-assignment semantics.  Values make it simple to
    express producer-consumer relationships or precedence constraints; any
    read of a value must wait for the creation of the value.

                                     4


- accumulators [21], whose data accesses are mutually exclusive.  Data are
    migrated automatically in turn to processors which request mutually
    exclusive access.

The explicit naming of all values makes it possible to eliminate
anti-dependence constraints between processors; a processor can continue to
read an older version of a data item while another processor generates
another version (using a different name).  The concept of single-assignment
values have been used in a variety of parallel languages [10, 9, 13, 14, 20]
as a way of exposing parallelism and synchronizing independent tasks.  Such
systems often must deal with problems of excessive memory usage and data
copying as shared data items are created and modified.  SAM avoids these
problems by requiring that the programmer explicitly supply information on
when a piece of data is no longer needed, and allowing the programmer to
specify that one value should reuse the storage of another value.  We have
found that managing the names and memory of these values to be
straightforward using the primitives provided by SAM.

SAM also provides access to stale, local copies of accumulator data, which
can be used in chaotic algorithms which do not always need to use the most
up-to-date data.  Using a local, possibly older copy is sometimes sufficient
and can reduce the total execution time of these algorithms.  A number of
systems have supported the notion of chaotic access.  Agora [4] supports a
memory model in which all accesses are chaotic, since all modifications to
shared data structures are allowed to complete before holders of cached
copies have been notified.  Mether [19] and Clouds [22] support operations
for accessing a read-only copy of a page (or segment) that will not be kept
coherent even if the contents of the page (or segment) are changed by
another processor.

Finally, besides minimizing communication, SAM also provides primitives for
optimizing communication.  SAM provides mechanisms for producers to push
data to consumers, and for processes to prefetch a data item.  All these
optimizations are well integrated into SAM's shared memory model.


3  SAM Overview

In this section, we present a brief overview of SAM mainly via several
examples.  A more detailed description is provided in [24].


3.1  Basic Primitives

In SAM, all shared data are represented by either a value or an accumulator.
(SAM deals only with the management and communication of shared data; data
that are completely local to a processor can be managed by any appropriate
method.)  In Figure 1, we show several common idioms, as they would be
expressed using semaphores on a shared address space machine and using SAM
primitives.

                                     5


            Figure 1:  Example of Creating and Accessing Values


In the first example, mutual exclusion is required to protect updates to
shared data.  In SAM, an accumulator is used to represent a piece of data
that is to be updated a number of times, and whose final value is
independent of the order in which the updates occurs.  SAM automatically
migrates the accumulator between processors as necessary and ensures that a
process does not access the accumulator until mutual exclusion is obtained.
Updates to an accumulator must be encapsulated by the SAM primitives
begin_update_accum and end_update_accum.  The call to begin_update_accum
returns a pointer by which the accumulator can be accessed.  SAM supports
the idiom of chaotic computation via primitives which provide read access to
a ``recent'' value of the accumulator, which is not guaranteed to be the
most current value of the accumulator.  SAM maintains a cache on each
processor of versions of accumulators that have been recently accessed and
therefore may be able to satisfy the chaotic request locally without
communication.

In the second example, a consumer (right column) accesses data created by a
producer (left column).  In SAM, a value provides producer/consumer
synchronization.  Values have a single-assignment semantics:  a value is
atomically created once its initial contents are set and is henceforth
immutable.  The code to create a value, which may include arbitrary updates
to different components of the value, is encapsulated by a pair of
primitives begin_create_value and end_create_value.  Similarly, code
accessing a value is encapsulated by the primitives begin_use_value and
end_use_value.  A process that attempts to access a value will automatically
be suspended until the value is created and has been brought to the local

                                     6


processor.  Conversely, an access will succeed immediately if the value is
already cached on the local processor, returning a pointer to the local
copy.

In the third example, a consumer accesses a sequence of values created by a
producer through a limited-sized buffer.  To avoid memory usage problems
that are associated with single-assignment values, SAM allows one value to
reuse the storage of another value via the begin_rename_value primitive.
This primitive provides the necessary synchronization to ensure that the
producer does not reuse the storage of a value before the consumer has
accessed it.  In a similar fashion, imperative data objects are easily
represented in SAM via a sequence of values which can all share the same
storage.  SAM also allows a value to be converted to an accumulator and vice
versa.


3.2  Memory Management

The creator of a value or an accumulator must specify the type of the new
data.  With the help of a preprocessor, SAM uses this type information to
allocate space for, pack (for sending in a message), unpack, and free the
storage of the data.  The preprocessor can handle complex C data types,
including types that contain pointers and therefore are not necessarily
stored in one contiguous block in memory.(1)  In heterogeneous environments,
SAM also handles any necessary data conversion between dissimilar machines.

SAM maintains local copies of values fetched from remote processors in the
form of a cache.  Because all values have distinct names and are immutable,
there is no consistency problem associated with maintaining this cache.  SAM
automatically frees up local copies that are not in use when the cache
memory becomes filled.  SAM must ensure that at least one copy of a value is
maintained in the system, until it can determine that there will not be any
other processes that will need to access the value.  The SAM programmer
provides this information by specifying the number of accesses to the value
that will occur or explicitly indicating when all accesses to the value have
occurred.


3.3  Communication Optimizations

An important mechanism for tolerating the communication latency is support
for asynchronous access.  SAM provides the capability to fetch values and
accumulators asynchronously.  An asynchronous fetch succeeds immediately if
a copy of the value is available on the local processor.  However, if the
value is not immediately available, the fetch operation returns an
indication that the value is not available.  The requesting process can
_______________________________
(1)The preprocessor only handles simple hierarchical data types; it does
not handle general data structures that contain multiple pointers to the
same data.

                                     7


proceed with other accesses or computation.  When the value becomes
available on the local processor, the requesting process is notified by
calling a function specified when the request was made.  For asynchronous
access to an accumulator, the process is notified when the accumulator has
been fetched to the local processor and mutual exclusion has been obtained.

Another method for optimizing communication is to send data directly from
one processor to another processor which will need it.  A copy of any
specified value available on a processor can be explicitly sent (``pushed'')
to a remote processor via the push_value primitive.  SAM's basic mechanisms
combine smoothly to provide the buffering necessary to support a
message-passing style.  If a process attempts to access a piece of data
before it has arrived, then the process suspends until the named value
arrives.  Conversely, if a value arrives at a processor before it is needed,
it is automatically buffered by caching it as a local copy.  Note, however,
that the push operation is only an optimization and does not change the
correctness of a SAM program.


4  Applications

In this section, we describe the applications we have used in evaluating
SAM, assess the ease of programming these applications using SAM, and give
performance results on a variety of hardware platforms.  For comparison
purposes, we have chosen several complex applications that have been
implemented previously either on shared-memory multiprocessors or
distributed memory machines using message passing.  The applications are:


  o Block Cholesky - application that does a parallel Cholesky factorization
    of a sparse matrix by doing block updates in parallel.

  o Barnes-Hut - application that simulates the evolution of an n-body
    system using a tree data structure to speed up the force calculations.

  o Grobner Basis - application that computes the Grobner basis for a
    polynomial set by repeatedly processing pairs of polynomials and
    potentially adding new polynomials to the basis until no further
    polynomials need to be added.
              ________________________________________________
              |                ||serial |SAM  |DASH |msg-pass.||
              |________________||code___|code_|code_|_code____||
              |_Block_Cholesky_||NA_____|6713_|6813_|___NA____||
              |_Barnes-Hut_____||1959___|2896_|2232_|_3973____||
              |_Grobner_Basis__||3757___|4082_|NA___|__5747___||

                     Figure 2:  Application Line Counts


                                     8


      	processor clock	 peak  	I-Cache	D-Cache	network	   measured	send		round-trip
Machine 	  rate	 MFLOPS	size	size	topology   bandwidth	time		time
CM-5	Sparc	33 MHz	 8	64KB	64KB	fat tree   8MB/s	11microsec.	57microsec.
iPSC/860 i860	40 MHz	 60	4KB	8KB	hypercube  2.8MB/s	47microsec.	154microsec.
Paragon	i860	50 MHz	 75	16KB	16KB	mesh	   61MB/s	50microsec.	125microsec.
SP1	RS6000	62.5 MHz 125	32KB	64KB	multistage 7MB/s	240microsec.	415microsec.
DASH	R3000	33 MHz	 10	64KB	64KB	bus/mesh   NA		NA		NA

                      Figure 3:  Machine Characteristics

 Figure 2 gives approximate line counts for different versions of each
 application.  The SAM version of the block Cholesky application is derived
 from the original code written for the DASH shared-memory
 multiprocessor [17].  Both the SAM and DASH versions of the Barnes-Hut
 application are derived from the original serial code.  Warren and Salmon's
 message-passing Barnes-Hut code [27] is a completely separate code not based
 at all on the serial Barnes-Hut code.  Chakrabarti's message-passing version
 of the Grobner basis algorithm for the CM-5 [8] is based on the original
 serial code, as is the SAM version.  The line counts indicate the size of
 the applications and give a rough comparison of the difficulty in
 programming the different versions, which we will address further below.

 Each of the SAM applications runs without modification on the CM-5,
 iPSC/860, Paragon, SP1, and networks of workstations.  Below, we present
 performance results for a 64-processor CM-5 (running CMOST 7.3 and CMMD
 3.2), a 32-processor iPSC/860, a 56-processor Paragon (running OSF 1.0.4 and
 NX 1.2.1), a 16-processor SP1, and a 48-processor DASH.(2) In Figure 3, we
 give a summary of the important characteristics of each machine.  The first
 five columns describe the processor at each node of the machine.  (The third
 column reports peak double-precision megaflops.)  The last three columns
 give the values that we have measured for bandwidth, one-way message send
 time, and round-trip message time between two nodes.  By analyzing the
 performance of these applications on a variety of machines with different
 characteristics, we get a better understanding of the hardware and software
 factors that affect their performance.


 4.1  Block Sparse Cholesky Factorization

 The block Cholesky application [23] performs a Cholesky factorization of a
 sparse, symmetric matrix in parallel.  It decomposes the sparse matrix into
 blocks and assigns work to processors at the granularity of updates to
 blocks.  Such updates typically involve using two source blocks to update
 _______________________________
 (2)We were not able to get full numbers on the SP1 because of limited
 access to the machine.  We were also not able to get 48-processor numbers
 for DASH for the block Cholesky application because of changes in the
 machine configuration.


                                      9


one destination block.  The parallel algorithm involves executing updates to
blocks in parallel while respecting the necessary data dependences.  The
block Cholesky algorithm benefits from dynamic caching, since each sparse
block may be used many times by a processor to update other blocks.

The SAM implementation of the block Cholesky application is derived directly
from a version for the DASH multiprocessor.  Each individual block of the
matrix is a SAM data item, and the matrix data structures remained largely
unchanged.  SAM's ability to deal with complex, non-contiguous data types as
a single item is important, since each block actually contains a number of
dynamically allocated index and data arrays.  Each block in the matrix goes
through three phases, which are readily apparent from the basic
factorization algorithm.  The first phase consists of a series of
commutative updates (updates that can occur in any order).  In the second
phase (after the last update is done), the contents of the block are
finalized by a matrix division.  In the third and final phase, the block is
not modified any further and is only used to update other blocks.  Thus each
block is represented as an accumulator in the first phase, and the second
phase creates a value (using the same storage as the accumulator) that is
used in the third phase.

Each processor is responsible for all the updates to a statically assigned
set of blocks in the matrix.  A task is created when one of the source
blocks becomes available and is assigned to the processor that ``owns'' the
destination block.  The processor then accesses the second source block
asynchronously.  If the block is not immediately available, the processor
continues computing with other data while the system fetches the block in
the background.

Figure 4 gives performance results for a sparse matrix BCSSTK15 (from the
Harwell Boeing sparse matrix test set) and a dense matrix D1000,
respectively.(3)  The left graph gives parallel speedups with reference to
an efficient left-looking, column-based serial factorization algorithm on
the same machine; the right graph gives the corresponding absolute
performance in double-precision megaflops.  For these results, we use blocks
of 32 by 32 double-precision numbers.  For the sparse matrix, the average
size of data transfer over a 32-processor run is 4233 bytes; for the dense
matrix, it is 8384 bytes.  (Because of the static task and data placement,
these figures depend only on the input matrix and number of processors
used.)  Figure 5 gives information on the average time between shared data
references in block Cholesky factorizations of the sparse matrix.  To get
these numbers, we have divided the serial run times (which represent useful
work) by the total number of shared data accesses and the number of these
accesses that must request their data remotely (i.e.  that incur a cache
miss), respectively.
_______________________________
(3)On the CM-5, we do not use the four vector units at each node to enhance
floating-point performance, because significant additional programming is
required to utilize them.


                                     10


                                      (a)


                                      (b)
Figure 4:  Block Cholesky Speedup (left) and Performance (right) for (a)
BCSSTK15 and (b) D1000

The speedup trends are similar on all machines, indicating the success of
SAM in providing portability across a range of distributed memory machines.
As we discuss in Section 4.4, the low speedups for the sparse matrix are
largely due to limited parallelism and poor load balancing in the parallel
algorithm rather than SAM overheads; continued improvement in the parallel
algorithm will result in better speedups on all machines.

The differences in the speedup curves reflect the different characteristics
of the various machines.  In particular, parallel performance can be limited
by the available bandwidth between nodes on each machine.  The Paragon and
the DASH multiprocessor have the best speedups because of their high network
bandwidth.  The similar speedups of the Paragon and DASH suggest that

       _____________________________________________________________
       |          ||number of  |avg.  useful work|avg.  useful work ||
       |          ||processors |between accesses |between accesses  ||
       |__________||__________|_to_shared_data___|__to_remote_data__||
       |_CM-5_____||___32_____|_____438microsec__|______1910microsec||
       |_iPSC/860_||___32_____|_____364microsec__|______1588microsec||
       |_Paragon__||___32_____|_____292microsec__|______1274microsec||
       |_SP1______||___12_____|______76microsec__|_______409microsec||

Figure 5:  Frequency of Shared Data Access in Block Cholesky Factorization
of BCSSTK15

                                     11


performance is not significantly affected by the software shared memory
implementation, because of the coarse granularity of shared data access.
The SP1 reaches bandwidth limits for larger number of nodes, but has very
high uniprocessor performance and achieves impressive parallel performance
for a small number of nodes.  The CM-5 and iPSC/860 also reach bandwidth
limits, but at much larger numbers of processors.


4.2  Barnes-Hut Algorithm

The Barnes-Hut algorithm [2] is a fast algorithm for simulating the
evolution of a system of astronomical bodies as they interact with each
other via the gravitational force (the ``n-body problem'').  At each time
step, the algorithm computes the gravitational forces between the n bodies,
and determines the new position and velocity of each of the bodies.  The
Barnes-Hut application builds a tree data structure, called an oct-tree, at
each time step to summarize the gravitational effects of nearby groups of
bodies, so that the force calculation can be done more quickly.  The
structure of the oct-tree is complex and dependent on the input data, so the
memory to hold the oct-tree structure cannot be statically allocated and
partitioned across processors.  However, the work of the force calculation
phase can be partitioned so that there is extensive locality in each
processor's access to the tree nodes [25].  The programmer only needs to
worry about doing this partitioning correctly; SAM automatically exploits
this locality by caching recently accessed tree nodes on each processor.

To reduce address translation and message-passing overhead for this
application, we have experimented with blocking the nodes of the tree
together.  That is, as the oct-tree is built in each time step, we combine
several nodes of the tree into one SAM data item.  We have hidden the
complexity of the blocking in an oct-tree library, and the user can specify
the option of blocking or not blocking.  When blocking, the tree library
automatically brings over a whole block when the ``top'' node in the block
is accessed.  Such blocking increases the granularity and reduces the
frequency of communication.  It also does a form of prefetching, since it
fetches a whole collection of nodes that are likely to be accessed in the
near future when one of them is accessed.  The disadvantage is that extra
bandwidth and memory are used in bringing over nodes that are never
accessed.  In addition, for large blocks, the parallelism in some of the
tree traversal phases is decreased, because only one processor can modify a
block at a time.

Figure 6 shows speedup (measured against the efficient serial algorithm) and
absolute performance (bodies processed per second) of our parallel version
running on all machines for a simulation of a highly irregular distribution
of 25000 bodies.  We have also included performance numbers (labeled in the
graph as MP-iPSC) for the message-passing Barnes-Hut code by Warren and
Salmon [27].  We have ported this message-passing code to the iPSC/860 and


                                     12


run it on an identical problem.(4)


Figure 6:  Barnes-Hut Speedup (left) and Absolute Performance (right) for
25000-body Simulation

For these performance figures, tree blocking is used for the runs on the
iPSC/860, Paragon, and SP1.  Because of the low cost of sending and
receiving messages on the CM-5, tree blocking is unnecessary and not used.
The average granularity at which shared data are accessed depends on whether
blocking is used or not.  Each tree node has a size of 152 bytes.  For
typical 32-processor runs, the average size of messages that communicate
data objects is about 220 bytes when blocking is not used, and 1340 bytes
when blocking is used.  Figure 7 gives information on the frequency of
shared data references for the 25000-body simulation.  These figures show
that the granularity of access to shared data is much more fine-grain than
in the block Cholesky application, but the locality of reference is also
much higher.
       _____________________________________________________________
       |          |number of  |avg.  useful work |avg.  useful work |
       |          |processors |between accesses  |between accesses  |
       |__________|___________|_to_shared_data___|_to_remote_data___|
       |_CM-5_____|___32______|______27microsec__|______3170microsec|
       |_iPSC/860_|___32______|______39microsec__|______8603microsec|
       |_Paragon__|___32______|______32microsec__|______7069microsec|
       |_SP1______|___16______|______13microsec__|______8848microsec|
Figure 7:  Frequency of Shared Data Access in 25000-body Barnes-Hut
Simulation

The speedup curves scale for all versions of the application up to the
maximum number of processors, though with different slopes.  The
message-passing version achieves the best speedups overall.  However, as
_______________________________
(4)Because the error criterion in the message-passing version is different
from the original serial version (and the SAM version), it does not do
exactly the same computations as these versions, even for identical initial
conditions.  However, we have set the error parameters so that the work done
in all runs is comparable.


                                     13


illustrated by the line counts in Figure 2, this version is much more
complex and difficult to program than the original serial algorithm.  It
minimizes communication overhead by distributing tree nodes to all of the
processors that will reference them in a single communication phase, and is
highly dependent on the way that the Barnes-Hut algorithm uses the oct-tree.
The SAM version is somewhat more complex than the serial version, but a
large fraction of the increased number of lines is due to the tree blocking
implementation, which is isolated in the tree library and reusable for other
applications.  By using SAM to program the Barnes-Hut algorithm, we have
chosen to trade off some overall performance in return for considerably less
programming effort and greater portability to newer machines.

The DASH shared-memory multiprocessor also achieves better speedups than the
distributed memory machines using SAM. The performance of DASH benefits from
its additional hardware that does address translation, caching, and
communication without any software overheads.  As we will see in Section
4.4, because of the finer granularity of shared data access in this
application, software address translation and cache management take up a
significant portion of each processor's time in the SAM versions.
Nevertheless, the SAM versions scale well and achieve high absolute
performance.  In addition, the use of machines with higher uniprocessor
performance can compensate for the overheads of providing a shared name
space in software.  As with the block Cholesky algorithm, the SP1 achieves
very good absolute performance with only a small number of processors.  The
low overhead of sending messages on the CM-5 allows us to get good speedups
without blocking (and a much finer granularity of access to shared data).
The SAM versions on the iPSC/860 and SP1 have the lowest speedups, because
of the high cost of sending and receiving messages on these machines.


4.3  Grobner Basis Algorithm

We have used SAM to parallelize an important algorithm from symbolic algebra
systems.  The algorithm computes the Grobner basis [6] of a set of
polynomials, which is used to solve systems of non-linear equations and
determine implicit forms for parametric equations.  This algorithm has been
previously implemented on the CM-5 by Chakrabarti [8].  The basic structure
of the algorithm is to start with the initial basis equal to the input set
of polynomials.  Then, each possible pair of polynomials from the basis is
examined; potentially a new polynomial is produced that is added to the
basis, and a new set of pairs between the new polynomial and the current
members of the basis is generated.  The algorithm continues until there are
no more pairs left to be examined.  All calculations of polynomial
coefficients are done using an arbitrary-precision arithmetic package.

In the Grobner basis algorithm, each polynomial remains unchanged once added
to the basis.  In our parallel implementation of the Grobner basis algorithm
using SAM, each polynomial is represented by a SAM value.  The dynamic
caching of the polynomials in the basis set that is automatically provided
by SAM is crucial to good performance, since each processor repeatedly

                                     14


accesses these polynomials.  The basis set is a monotonically growing set of
polynomials.  We represent the set by a linked list of polynomials and an
accumulator which points to the polynomials at the head and tail of the
list.  Using SAM, we have built a distributed set abstraction, which allows
polynomials to be added to the set and provides an operation to iterate over
the current elements in the set.  The set abstraction uses chaotic access to
the head and tail pointers of the list when appropriate to reduce contention
for these pointers.

Figure 8 shows the speedups and absolute performance of our Grobner basis
program for a representative sample of input polynomial sets.  The absolute
performance is calculated as the number of polynomials tested in the serial
execution divided by the parallel run times.  The parallel algorithm is
inherently non-deterministic due to different task orderings and can give
superlinear speedup.  In addition, the run times for comparable serial
implementations can vary widely, depending on the heuristic used for
ordering tasks.  Speedups are determined by running the algorithm five times
for a particular input set and number of processors, and dividing the
average run time by the run time for an efficient serial algorithm with the
same task ordering heuristic.  For 32-processor runs, the average size of
messages communicating data (mainly transmitting polynomials) varies from
200 to 1000 bytes depending on the input set.  Figure 9 gives statistics on
the frequency of shared data access in 32-processor runs using the Lazard
input set.  Because the parallel execution can do much more work than the
serial execution, as we describe next, we give these figures in terms of the
amount of time between accesses in the parallel execution.

The actual speedups are limited for increasing numbers of processors in
large part because the parallel algorithm almost always performs more work
than the sequential counterpart.  In the serial algorithm, as each
polynomial is added to the basis set, it increases the fraction of the
remaining candidate polynomials that can be eliminated quickly.  However,
when the algorithm runs in parallel, each processor executes independently
without the knowledge of the polynomials that are about to be added to the
basis set by other processors.  Therefore, the processors typically do more
work and the basis set grows larger than it would in the serial execution.
This effect increases with larger numbers of processors and with longer
communication latencies.

Our implementation of the algorithm appears to have better heuristics for
setting the priority of tasks than the Chakrabarti's message-passing
implementation for the CM-5 [8].  In consequence, we have better absolute
serial and parallel run times on the majority of the available polynomial
benchmarks.  Our uniprocessor and 10-processor times are better than those
of the message-passing implementation for eight of the nine input sets whose
timings are reported in [8].  For one input set, our 10-processor time is
substantially slower than our 1-processor time, because our task-ordering
heuristic happens not to work well for that input set.  Our speedups are
comparable to those in  [8] for several of the benchmarks, but smaller for
the other benchmarks because of our faster uniprocessor times.

                                     15


As shown in Figure 2, the line count for the SAM version is not much larger
than the line count of the serial version.  The implementation of the
distributed set abstraction accounts for most of the extra lines.  On the
other hand, the message-passing implementation has many more lines than the
serial version, mainly because it implements an application-specific form of
caching and consistency based on invalidation.


4.4  Parallelization and Communication Costs

In Figures 10 and 11, we give statistics on the parallelization and
communication costs for 32-processor runs of our applications.  Figure 10
displays the average values of the overheads, while each entry in Figure 11
has both an average value and a range over all the processors.  In Figure
10, we have also included a segment (labeled Application time) that
indicates the amount of work which would be done by each processor if there
was perfect speedup (i.e.  if each processor did 1/32 of the work done by
the serial algorithm).  The overheads divide naturally into those associated
with the parallel algorithm (idle time), the communication hardware (message
and stall time), and the software shared memory layer (address translation
and pack/unpack time):

  o Idle time indicates the percentage of time that a processor was idle
    because of lack of work.

  o Message time is the total time spent sending messages and responding to
    incoming messages (e.g.  requests for a data item).

  o Stall time is the time spent waiting for data from a remote processor,
    excluding time that is spent serving incoming messages.

  o Address translation time is the percentage of time spent in the SAM
    system ensuring that a copy of a data item is available and determining
    the address of the local copy.  This time includes the cost of a hash
    table lookup and LRU management of the cache of data items.

  o Pack/unpack time is the total time spent packing and unpacking messages
    and has been separated out from the ``message time''.  Packing and
    unpacking is necessary because SAM allows each data item to be an
    arbitrary non-contiguous data structure (connected by pointers).

In Figure 10, we have also shown the time that is not accounted for in any
of the above categories.  The biggest component of this unaccounted time is
the extra computing done by the parallel application that is not done by the
serial application.  Some examples are the extra work done in the parallel
Barnes-Hut algorithm to determine the appropriate partitioning of bodies
across processors, and the extra work done in the parallel Grobner basis
algorithm because the basis grows larger in parallel runs.  The unaccounted
time also covers some uncertainties in measuring the overheads.  All of the
statistics are only approximate, since the timer calls used to measure the

                                     16


overheads may perturb the execution of the applications.(5)  In addition,
the message times given on the iPSC/860 and Paragon are only a lower bound,
since much of the process of transmitting and receiving messages is handled
via interrupts in system routines and are not instrumented.

We can make a number of general observations from the data.  First, for
applications with complex data structures and irregular parallelism like the
ones we have used, the limited parallelism inherent in the underlying
parallel algorithms can lead to large amounts of idle time that dominate
other hardware and software overheads.  In the block Cholesky application,
the limited parallelism results from long critical paths and the static
placement of tasks; in the Grobner basis application, the limited
parallelism is due to difficulties in achieving good load balance as
variable-sized tasks are dynamically created and distributed across
processors.

For applications that must communicate a lot of data between processors,
message sending and receiving can constitute a large fraction of the run
time, as illustrated by the block Cholesky application.  These overheads are
largely governed by machine characteristics such as network bandwidth and
can also limit performance regardless of other overheads.  Note the high
percentage of message time in the block Cholesky application for the
iPSC/860, which has the least bandwidth of all the machines.  The time for
packing and unpacking complex objects (the blocks) is also significant in
the block Cholesky application, because of the large amount of data
communication.  Much of this message handling time may be eliminated (or
potentially offloaded to a message co-processor, as in the Paragon) with
more efficient support for handling messages on newer machines.

The stall time is a function of both the parallel algorithm and machine
characteristics.  There are significant stall times in the Barnes-Hut
algorithm, because of contention for nodes as the tree is built and frequent
accesses to remote nodes.  Similarly, the Grobner basis algorithm has
significant stall times because of contention for the shared basis set.  The
percentage of stall time is larger in the iPSC/860 than in the other
machines likely because of the higher cost of sending messages and the lower
network bandwidth of the iPSC/860.

Finally, for applications with finer-grain access to shared data, the
software address translation overhead can be significant.  The SAM software
overhead in the Barnes-Hut algorithm is substantial, especially for the
CM-5, where we have not used tree blocking.  However, because it depends
only on the number of shared accesses and not on the amount of
communication, software address translation is a fixed percentage overhead
on each processor and does not affect the scalability of the parallel
application.
_______________________________
(5)The instrumentation adds at most 11% to the execution time of all runs.


                                     17


In summary, the scalability of parallel applications is principally
determined by the parallel algorithm used.  The speedups of the block
Cholesky and Grobner basis applications are limited by the characteristics
of their underlying parallel algorithms.  The Barnes-Hut application has
good scalability because of its inherent parallelism and extensive locality
of reference.  SAM applications with fine-grain access to shared data, such
as the Barnes-Hut application, may have significant software overhead in the
form of address translation.  The speedups of parallel applications can also
be limited by machine characteristics such as network bandwidth.


5  Evaluation of Communication Optimizations

In this section, we evaluate the usefulness of several aspects of SAM that
allow for optimizing communication by measuring their effects on some of the
applications we have described.


5.1  Caching

We are able to evaluate the usefulness of caching in the SAM system by
measuring the performance of our applications when there is no caching (i.e.
each object must be fetched from the processor on which it was created each
time it is accessed).  Figure 12 shows the improvements in 32-processor run
times when caching is added.  For reference, it also includes the serial
times for the same applications.  These figures do not include the pushing
and chaotic optimizations of sections 5.3 and 5.4.

The figures show that caching is important for all three applications,
especially the Barnes-Hut and Grobner basis applications, which show a large
amount of locality between the references of tasks executing on the same
processor.  The block Cholesky application does not show as much inter-task
locality and benefits less from the caching provided by SAM, since any
particular remote block may be used only a small number of times to update
local blocks.  However, the block Cholesky application has significant
intra-task locality in block update operations themselves, which is already
exploited well by the hardware caches of the nodes.

Many recent scientific algorithms use complex data structures and dynamic
task placement and can benefit substantially from dynamic caching of remote
data.  Often, such caching functionality is implemented in an
application-specific way in message-passing programs (for example, in [8]),
at the cost of much programmer effort.  SAM provides caching as a common
functionality that is reusable in many applications and automatically
exploits the locality of reference in these programs.

                                     18


5.2  Synchronization

In a typical shared object system based on an imperative shared memory
model, programmers express synchronization separately from data access.
These explicit synchronization operations result in extra messages in
addition to the messages used to transmit actual data.  In contrast, SAM
data access primitives directly express the mutual exclusion and
producer/consumer data relationships in parallel program.  No
synchronization other than the synchronization implicit in shared data
access is necessary to ensure correctness.  For instance, during the
parallel, post-order modification of the oct-tree in the Barnes-Hut
algorithm, all synchronization occurs appropriately as nodes are accessed.
In an imperative shared memory system, additional locks and flags would be
necessary at each node to ensure that nodes are accessed in the proper
order.

Figure 13 attempts to quantify the synchronizations that would be necessary
for our applications in an imperative shared object system.  In the first
column, we show the number of barriers used in shared-memory implementations
of our applications, and in the second column we show the total number of
accesses to shared data objects.  In the last two columns, we give the
number of accesses (and percent of total shared accesses) that, in an
imperative shared object system, would require extra synchronization not
handled by the barriers.  These figures are determined by classifying the
types of accesses in runs of the SAM versions of the applications.  The
numbers show that there are a large number of synchronizations in our
applications that would incur extra communication in imperative shared
object systems.

Our applications include several common synchronization patterns which
cannot be handled via barriers.  For example, the producer/consumer
synchronization necessary for the post-order tree modification in the
Barnes-Hut algorithm is similar to the synchronization of a tree-based
reduction, which occurs in many parallel programs.  By combining
synchronization with data access, SAM makes these important types of
communication patterns more efficient.


5.3  Pushing Data

SAM provides a mechanism for sending values to remote processors that may
require them.  This ``pushing'' mechanism has the advantage of eliminating
the latency for access to data if the pushed data item arrives at a
processor before the processor requires the data.  We were able to use this
pushing mechanism in both the Barnes-Hut algorithm and in the block Cholesky
application to improve performance.  In the Barnes-Hut algorithm, we push
the first few levels of the oct-tree to all processors after these levels
have been modified, since all processors will likely access the top part of
the tree.  In the block Cholesky algorithm, we push completed blocks of the
matrix to (exactly) the processors that will access them, because a major

                                     19


limiting factor in the run time is the critical path of block dependencies.
Anything that reduces this critical path by reducing latencies for accessing
blocks improves the run time.  Figure 14 shows the improvements in
32-processor run times when the push and chaotic (see Section 5.4)
optimizations are used (with caching).  In both applications, the push
optimizations are simple additions to the code, and the pushes produce a
substantial improvement in the run time of the computations where they are
used.  However, since the push optimizations are only used in certain phases
of each application, the effects on the overall run times are lower.


5.4  Chaotic Access

In a variety of scientific applications, it is possible to use chaotic
accesses to shared data to improve parallel performance, while still
producing correct results.  In the Barnes-Hut algorithm, we use chaotic
accesses when traversing the tree to determine where an item should be
inserted; the properties of the oct-tree allow this optimization as long as
an exclusive access is used when a potential insertion point has been
reached.  In the Grobner basis algorithm, we use chaotic accesses in some
references to the head and tail pointers of the shared polynomial basis.  In
Figure 14, we show the improvement in run times for 32-processors runs when
chaotic accesses are used.  (When not using chaotic access, we allow copies
of data items to be cached, but force these copies to be invalidated when
the items are modified.)  The Barnes-Hut improvements are small but
significant, because the chaotic optimization applies only to the tree
building phase of the algorithm, not the main force calculation phase.  The
results show that significant performance improvement is possible by using
application-specific knowledge to determine when accesses with relaxed
consistency can be employed.


6  Conclusion

In this paper we have presented the design and evaluation of a shared object
system for distributed memory machines called SAM. SAM is a portable
run-time system that provides a global name space and automatic caching of
shared data, but also provides the programmer with the ability to optimize
communication when necessary.  To evaluate SAM, we have used it to
parallelize several complex scientific algorithms, each of which consists of
thousands of lines of code.  Programming these applications in SAM is
significantly easier than writing message-passing code and provides
portability across machines.

Our results show that the various communication optimizations provided by
SAM are important for obtaining good performance results on these complex
applications.  SAM provides generic caching functionality that can be reused
across applications and therefore simplifies programming effort.  This
caching is essential in a variety of complex applications for getting good
performance by exploiting the dynamic locality of reference.  By tying

                                     20


synchronization to data access, SAM improves communication efficiency in
applications with complex synchronization patterns that cannot be managed
via barriers.  Our results also show that the use of ``pushes'' and chaotic
accesses, when appropriate to the application, can provide significant
performance improvements with minimal effort.  We believe that software
distributed shared memory system must address the problem of high
communication overhead by providing flexible mechanisms for optimizing
communication.

We find that the performance results achieved using the SAM system depend on
a variety of factors.  For the complex scientific algorithms we used, the
largest factor can be the limited parallelism available in the parallel
algorithms.  Performance is also significantly affected by the limited
communication capabilities of current distributed memory machines, which
will improve in future machines.  The amount of software overhead is a
function of the granularity of access to shared data and is acceptable for
moderately-grained applications.  Overall, we found that SAM significantly
eased programming of our applications and allowed us to achieve good
performance for these applications on a variety of platforms.


Availability

The SAM system is available via anonymous FTP at suif.stanford.edu in
/pub/sam and via the World Wide Web at https://suif.stanford.edu.


Acknowledgments

We thank Ed Rothberg for his block Cholesky code, J.P. Singh for his DASH
Barnes-Hut code, and Soumen Chakrabarti for his Grobner basis code.  Evan
Torrie ported the Barnes-Hut message-passing code to run on the iPSC/860.
Some of the performance numbers were obtained using the CM-5 at Berkeley,
which is supported by National Science Foundation Infrastructure Grant
number CDA-8722788.  The IBM SP1 was donated by IBM Corporation.  We also
thank the referees for comments that helped improve the paper.


References

 [1] H. E. Bal, M. F. Kaashoek, and A. S. Tanenbaum. Orca:  A Language for
     Parallel Programming of Distributed Systems. IEEE Transactions on
     Software Engineering, 18(3), Mar. 1992.

 [2] J. E. Barnes and P. Hut. A Hierarchical O(NlogN) Force-Calculation
     Algorithm. Nature, 324(6096):446--449, Dec. 1986.


                                     21


 [3] B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The Midway
     Distributed Shared Memory System. In COMPCON 1993, Mar. 1993.
 [4] R. Bisiani and A. Forin. Multilanguage Parallel Programming of
     Heterogeneous Machines. IEEE Transactions on Computers,
     37(8):930--945, Aug. 1988.

 [5] R. D. Bjornson. Linda on Distributed Memory Multiprocessors. PhD
     thesis, Yale University, Department of Computer Science, Nov. 1992.
 [6] B. Buchberger. Grobner Basis:  an Algorithmic Method in Polynomial
     Ideal Theory. In N. K. Bose, editor, Multidimensional Systems Theory,
     chapter 6, pages 184--232. D. Reidel Publishing Company, 1985.

 [7] N. Carriero. Implementation of Tuple Space Machines. PhD thesis, Yale
     University, Department of Computer Science, 1987.

 [8] S. Chakrabarti and K. Yelick. Implementing an Irregular Application on
     a Distributed Memory Multiprocessor. In Proceedings of the Fourth
     ACM/SIGPLAN Symposium on Principles and Practices and Parallel
     Programming, pages 169--179, May 1993.
 [9] K. M. Chandy and C. Kesselman. Composition C++:  Compositional
     Parallel Programming. In Fifth Workshop on Languages and Compilers for
     Parallel Computing, pages 124--144, Aug. 1992.

[10] K. M. Chandy and S. Taylor. The Composition of Concurrent Programs. In
     Proceedings of Supercomputing '89, Reno, Nevada, Nov. 1989. ACM.

[11] J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J.
     Littlefield. The Amber System:  Parallel Programming on a Network of
     Multiprocessors. In Proceedings of the Twelfth ACM Symposium on
     Operating Systems, pages 147--158, Dec. 1989.
[12] A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and
     W. Zwaenepoel. Software Versus Hardware Shared-Memory Implementation:
     A Case Study. In Proceedings of the 21st Annual International
     Symposium on Computer Architecture, pages 106--117, April 1994.

[13] M. J. Feeley and H. M. Levy. Distributed Shared Memory with Versioned
     Objects. In 1992 ACM SIGPLAN Conference on Object-Oriented Programming
     Systems, Languages, and Applications, Oct. 1992.

[14] J. T. Feo, D. C. Cann, and R. R. Oldeheoft. A Report on the SISAL
     Language Project. Journal of Parallel and Distributed Computing,
     10(4):349--366, Dec. 1990.
[15] J.B. Carter and J.K. Bennett and W. Zwaenepoel. Implementation and
     Performance of Munin. In Proceedings of the 13th ACM Symposium on
     Operating Systems Principles, pages 152--164, Oct. 1991.

[16] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks:
     Distributed Shared Memory on Standard Workstations and Operating
     Systems. In Proceedings of the 1994 Winter Usenix Conference, pages
     115--132, January 1994.

                                     22


[17] D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy,
     M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. Computer,
     25(3):63--79, Mar. 1992.
[18] K. Li. IVY: A Shared Virtual Memory System for Parallel Computing. In
     Proceedings of the 1988 International Conference on Parallel
     Processing, pages II 94--101, Aug. 1988.

[19] R. G. Minnich and D. J. Farber. Reducing Host Load, Network Load, and
     Latency in a Distributed Shared Memory. In Tenth International
     Conference on Distributed Computing Systems, pages 468--475, June 1990.

[20] R. S. Nikhil. The Parallel Programming Language Id and Its Compilation
     for Parallel Machines. International Journal of High Speed Computing,
     5(2):171--223, June 1993.
[21] K. Pingali and K. Ekanadham. Accumulators:  New Logic Variable
     Abstractions for Functional Languages. Theoretical Computer Science,
     81(2):201--221, Apr. 1991.

[22] U. Ramachandran, M. Yousef, and A. Khalidi. An Implementation of
     Distributed Shared Memory. Software - Practice and Experience,
     21(5):443--464, May 1991.

[23] E. Rothberg and A. Gupta. An Efficient Block-Oriented Approach to
     Parallel Sparse Cholesky Factorization. In Proceedings of
     Supercomputing '93, pages 503--512, Nov. 1993.
[24] D. J. Scales and M. S. Lam. An Efficient Shared Memory System for
     Distributed Memory Machines. Technical Report CSL-TR-94-627, Computer
     Systems Laboratory, Stanford University, July 1994.

[25] J. P. Singh. Parallel Hierarchical N-body Methods and Their
     Implications for Multiprocessors. Technical Report CSL-TR-93-565,
     Stanford University, Mar. 1993.

[26] V. Sunderam. PVM: a Framework for Parallel Distributed Computing.
     Concurrency:  Practice and Experience, 2(4):315--339, Dec. 1990.
[27] M. Warren and J. Salmon. An O(NlogN) Hypercube N-body Integrator. In
     Third Conference on Hypercube Concurrent Computers and Applications,
     pages II 971--975, Jan. 1988.

[28] W. Weihl et al. Prelude:  A System for Portable Parallel Software.
     Technical Report MIT/LCS/TR-519, MIT, Oct. 1991.


                                     23


                                      (a)


                                      (b)


                                      (c)

Figure 8:  Grobner Speedups (left) and Performance (right) for Inputs (a)
Lazard, (b) katsura4, and (c) trinks1

     _________________________________________________________________
     |          ||number of |avg. parallel work|avg.  parallel work ||
     |          ||processors|between accesses  |  between accesses  ||
     |__________||__________|__to_shared_data__|___to_remote_data___||
     |_CM-5_____||___32_____|_______55microsec_|_______3188microsec_||
     |_iPSC/860_||___32_____|_______75microsec_|_______4315microsec_||
     |_Paragon__||___32_____|_______51microsec_|_______2947microsec_||
     |_SP1______||___8______|_______30microsec_|_______7100microsec_||

 Figure 9:  Frequency of Shared Data Access in Grobner Basis Runs on Lazard


                                     24


Figure 10:  Parallelization and Communication Costs for 32-processor Runs on
the CM-5 (CM), iPSC/860 (IP) and Paragon (PR)


Block Cholesky	idle		message		stall		addr. trans.	pack/unpack
(BCSSTK15)	time (%)	time (%)	time (%)   	time (%)	time (\%)
CM-5		31.8 (1.9-57.0)	13.2 (9.1-22.3)	3.0 (0.0-6.8)	1.7 (1.1-2.1)	11.5 (8.6-14.4)
iPSC/860	36.6 (15.4-55.4) 27.2 (17.0-38.9) 5.5 (0.0-12.6) 0.8 (0.6-1.0)	6.9 (5.1-8.5)
Paragon		28.2 (4.8 - 53.5) 12.4 (7.4-19.2) 0.7 (0.0-4.1) 1.3 (1.0-1.9)	16.2 (8.5-21.4)

Barnes-Hut
(25000 part.)
CM-5		4.6 (0.6 - 8.8)	2.5 (1.9-5.8)	9.0 (5.2-11.0)	22.7 (21.8-26.2)2.0 (1.5-6.8) 
iPSC/860	4.6 (0.2-10.6) 	7.9 (6.1-13.8)	16.2 (4.7-18.6)	9.2 (7.9-10.2)	2.0 (1.3-2.4) 
Paragon		3.6 (0.0-10.8) 	2.9 (2.4-3.4)	12.5 (5.5-14.9)	11.2 (9.7-12.4)	2.3 (1.5-2.8) 

Grobner Basis
(Lazard)
CM-5		23.1 (4.6 - 40.4) 9.3 (7.1 - 16.1) 10.8 (5.7 - 17.9) 3.8 (1.7 - 5.8) 3.2 (2.5 - 5.1)
iPSC/860	24.0 (0.0 - 54.9) 4.1 ( 2.0 - 13.0) 12.9 (6.6 - 19.4) 3.3 (2.8 - 4.4) 2.0 (1.6 - 4.6)
Paragon		30.3 (0 - 54.9)	4.6 (2.5 - 24.3) 10.5 (6.6 - 17.6) 1.7 (1.3 - 2.2) 1.5 (1.0 - 2.6)

 Figure 11:  Parallelization and Communication Costs for 32-processor Runs


                                     25


           ______________________________________________________
           |                 ||       |____32-processor_time_____||
           | Block Cholesky  ||serial |without|with    | factor  ||
           |__(BCSSTK15)_____||time___|caching|caching_|improve. ||
           | CM-5            ||27.6s  |4.13s  | 3.39s  |  1.22   ||
           | iPSC/860        ||23.0s  |5.14s  | 3.94s  |  1.30   ||
           |_Paragon_________||18.8s__|2.62s__|_2.18s__|__1.20___||
           | Barnes-Hut      ||       |       |        |         ||
           |_(25000_bodies)__||_______|_______|________|_________||
           | CM-5            ||794.8s |618.5s |42.3s   |  14.6   ||
           | iPSC/860        ||490.2s2|643.5s |42.4s   |  62.3   ||
           |_Paragon_________||402.8s_|743.9s_|27.2s___|__27.3___||
           | Grobner Basis   ||       |       |        |         ||
           |_(Lazard)________||______ |_______|________|_________||
           | CM-5            ||29.7s  |50.60s | 3.36s  |  15.1   ||
           | iPSC/860        ||23.3s  |87.25s | 3.94s  |  22.1   ||
           |_Paragon_________||17.4s__|45.56s_|_3.08s__|__14.8___||

           Figure 12:  Caching Performance for 32-processor Runs


				barrs.	total shared	est. non-barrier synchronizations
					data accesses	prod/cons	mutual excl 
Barnes-Hut (25000 bodies)	7	14649035	11210 (.08\%)	27463 (.19\%)
Block Cholesky (BCSSTK15)	2	93093		13197 (14\%)	0
Grobner (Lazard)		2	1147680		0		17301 (1.5\%)

        Figure 13:  Number of Synchronizations in 32-processor Runs


___________________________________________________________________________________________________
|                         | with caching ||with   |percent||with   |percent||with pushes |percent ||
|_________________________|_____only_____||pushes_|speedup||chaotic|speedup||and_chaotic_|speedup_||
|Barnes-Hut     |CM-5     | 42.3s        || 42.0s |  1%   || 41.5s |  2%   ||    40.7s   |  4%    ||
|(25000 bodies) |iPSC/860 | 42.4s        || 36.2s | 17%   || 38.1s | 11%   ||    32.9s   | 29%    ||
|_______________|Paragon__|_27.2s________||_25.6s_|__6%___||_24.5s_|_11%___||____21.4s___|_27%____||
|Block Cholesky |CM-5     | 3.39s        || 3.20s |  6%   ||       |       ||            |        ||
|(BCSSTK15)     |iPSC/860 | 3.94s        || 3.52s | 12%   ||  NA   | NA    ||    NA      |  NA    ||
|_______________|Paragon__|_2.18s________||_1.66s_|_31%___||_______|_______||____________|________||
|Grobner        |CM-5     | 3.36s        ||       |       || 1.98s |  70%  ||            |        ||
|(Lazard)       |iPSC/860 | 3.99s        ||  NA   |  NA   || 2.68s |  49%  ||   NA       |  NA    ||
|_______________|Paragon__|_2.54s________||_______|_______||_1.83s_|__39%__||____________|________||

         Figure 14:  Effects of Optimizations on 32-Processor Runs


                                     26