The following paper was originally published in the Proceedings of the
USENIX SEDMS IV Conference (Experiences with Distributed and
Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association, contact:

    1. Phone:    510 528-8649
    2. FAX:      510 548-5738
    3. Email:    office@usenix.org
    4. WWW URL:  https://www.usenix.org


      NUMACROS: Data Parallel Programming on NUMA Multiprocessors

                      Hui Li and Kenneth C. Sevcik
                  Computer Systems Research Institute
                         University of Toronto
                       Toronto ON M5S 1A4 CANADA
                   Email: {hui|kcs}@csri.toronto.edu


                               Abstract

Data parallel programming has been widely used in developing
scientific applications on various types of parallel machines: SIMD
machines, MIMD distributed memory machines, and UMA shared memory
machines.  On NUMA shared memory machines, data locality is the key to
good performance of parallel applications.  In this paper, we propose
a set of macros (NUMACROS) for data parallel programming on NUMA
machines.  NUMACROS attempts to achieve both ease of programming and
the data locality needed for performance.  Programs written using
NUMACROS are nearly as short and easily readable as sequential
versions of the programs.  To obtain data locality, data and loops are
distributed and partitioned in a coordinated fashion among the
processors.  Although global address spaces facilitate data
distribution on NUMA systems, a naive implementation of an application
will suffer from high costs.  To reduce the cost, a number of
approaches have been proposed and evaluated.  These include index
precomputing, index checking, loop transformation, and others.  Our
experimental results with the Hector multiprocessor show that these
approaches are effective.  While such facilities will be provided by
compilers in the long run, NUMACROS is a helpful interim step.

_______________________________
This research has been supported by research grants from the Natural
Sciences and Engineering Research Council of Canada and from the
Information Technology Research Centre of Ontario.

1 Introduction

The data parallel programming model has been widely used in developing
scientific applications.  Writing a parallel program in this model for
distributed memory multiprocessors involves two major steps: selecting
data distributions and then using them to derive node programs with
explicit communications to access nonlocal data.  Manually specifying
communications is a tedious, non-portable, and error-prone step.  To
overcome this problem, many parallel programming languages have been
proposed, including C* [16], Dataparallel C [14, 9], Kali [12], DINO
[17], Fortran D [7, 11], the High Performance Fortran Forum (HPFF)
proposal [1], Superb [21], and NESL [4].  These languages provide
global name-spaces for ease of programming, but require the
programmers to carefully determine data distributions for good
performance.  On distributed memory multiprocessors, the compiler
translates references to global arrays into references to smaller
local arrays stored in processors' local memory modules and generates
communications for nonlocal accesses.
The performance of these programs thus depends on the effectiveness of
optimizing the communications and the global-to-local index mapping.

Non-Uniform Memory Access (NUMA) shared memory multiprocessor systems
support a global memory space in hardware.  However, the cost of a
remote memory access is significantly higher than that of a local
memory access, so memory locality is essential for good performance.
In order to achieve high locality, data distributions must match loop
partitioning.

Data parallel programming on NUMA machines raises issues different
from those on UMA and distributed memory MIMD machines.  On UMA
machines, memory locality is not an issue, so an implementation
designed for a UMA system is not well suited to NUMA machines.  On
distributed memory MIMD systems, data must be distributed among
processors, and accesses to remote data require explicit
inter-processor communications; the owner-computes rule is usually
used to generate those communications.  On NUMA machines, neither
explicit communications nor the owner-computes rule is needed.

In this paper, we describe the implementation of a set of C macros
(NUMACROS) for developing parallel applications using the data
parallel model.  NUMACROS allows programmers to annotate sequential
programs with parallel loops and data distributions, so the parallel
programs are readable and usually quite similar to the sequential
versions.  The key to achieving good performance is to match parallel
loops with data distributions.  Ideally, as software for parallel
systems matures, the facilities provided by NUMACROS will be provided
directly by compilers themselves [8].  In the meantime, however,
NUMACROS provides a convenient way to produce concise and easily
readable parallel programs that attain good speedup across a variety
of applications.

The next section describes data parallel programming using NUMACROS.
Section 3 discusses implementation issues and alternatives for data
distributions on NUMA systems.  Section 4 illustrates the importance
of data locality and evaluates the performance of some implementation
alternatives.  Section 5 discusses related work, and the last section
presents brief conclusions.

2 NUMACROS

NUMACROS (NUma MACROS) is a set of C macros for Single Program
Multiple Data (SPMD) parallel programming on NUMA multiprocessors.  It
supports parallel loops, whose iterations it schedules among the
threads, and it provides data distribution constructs for partitioning
arrays among processors.

Ideally, data distribution and loop parallelization would be generated
by parallelizing compilers using data dependence analysis and global
optimization.  We use NUMACROS to illustrate how to annotate
sequential programs with data distributions and parallel loops, and
how to generate efficient code for NUMA multiprocessors when such data
distribution and parallel loop information is available.

A parallel program in NUMACROS starts as a single thread, which then
creates a number of threads that begin executing the Main function.
To minimize scheduling overhead, the number of threads is set equal to
the number of allocated processors.  Global variables are shared by
all threads, but local variables are private.

2.1 Data Distributions

NUMACROS currently supports data distributions on both a
one-dimensional (1-D) grid and a two-dimensional (2-D) grid of
processors, through the macros dist_1D and dist_2D respectively.  The
number of processors is chosen at run time: P for a 1-D grid, and
P1 x P2 for a 2-D grid.
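To make the mapping concrete, the sketch below shows the
processor-assignment rules that the block, cyclic, and block-cyclic
distribution types described in the rest of this section imply for one
distributed dimension on a 1-D grid.  It is an illustration only: the
helper names and the example values of N and P are ours, not part of
NUMACROS.

    /* Sketch only: ownership rules implied by the dist_1D
     * distribution types for one dimension of size N on a 1-D grid
     * of P processors.  The helper names are illustrative and are
     * not part of NUMACROS.                                        */

    #define N 400      /* size of the distributed dimension (example) */
    #define P 16       /* number of processors in the 1-D grid        */

    /* BLOCK: the dimension is cut into P blocks of size N/P, and
     * the ith block is placed in the local memory of processor i.  */
    static int block_owner(int i)
    {
        return i / (N / P);            /* assumes P divides N evenly */
    }

    /* CYCLIC: element i is placed on processor (i mod P).          */
    static int cyclic_owner(int i)
    {
        return i % P;
    }

    /* BLOCK_CYCLIC: blocks of blksize elements are dealt out to the
     * processors in round-robin order.                             */
    static int block_cyclic_owner(int i, int blksize)
    {
        return (i / blksize) % P;
    }

These rules only determine which processor's local memory holds each
element; since the global memory space is supported in hardware, every
processor can still address every element directly.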
The parameters of dist_1D and dist_2D are given in Figure 1.  The
macro dist_1D maps dimensions of an array onto a 1-D grid of
processors in a block, cyclic, or block-cyclic fashion [7, 12].  For
example, dist_1D(A, 1, BLOCK, idx, N) distributes the N rows of the
two-dimensional array A in a block fashion over the 1-D grid of P
processors, where the block size is N/P and the ith block is mapped
onto processor i.  The dummy parameter idx is used for indexing, and
the second parameter indicates how many dimensions of the array are to
be distributed.  Similarly, dist_1D(A, 2, BLOCK, idxI, N, CYCLIC,
idxJ, N) distributes both the rows and the columns of array A, where
the rows are mapped in block fashion and the columns are mapped in
cyclic fashion, e.g. column j is on processor p if and only if
(j mod P = p).

The macro dist_2D provides block, cyclic, and block-cyclic data
distributions for a 2-D grid of processors.  For example,
dist_2D(A, BLOCK, i, N, CYCLIC, j, N) specifies that the rows of array
A are distributed in blocks along the first dimension of the processor
grid and the columns are distributed in cyclic fashion along the
second dimension of the processor grid.

_________________________________________________________________

dist_1D(A, 1, distType, i, sizeI)

  o dist_1D indicates that the distribution is performed on the 1-D
    grid of processors.
  o The first parameter gives the array name.
  o The second parameter, equal to 1, indicates that only the first
    dimension of the array is distributed among processors, using
    the distType method.
  o distType can be BLOCK, CYCLIC, or BLOCK_CYCLIC.
  o i is a dummy parameter for indexing.
  o sizeI is the size of the distributed dimension.

dist_1D(A, 2, distTypeI, i, sizeI, distTypeJ, j, sizeJ)

  o The second parameter, equal to 2, indicates that the first two
    dimensions of the array are distributed.  The parameters
    distTypeI, i, and sizeI are used for the distribution of the
    first dimension, and the parameters distTypeJ, j, and sizeJ are
    used for the second dimension.

dist_2D(A, distTypeI, i, sizeI, distTypeJ, j, sizeJ)

  o dist_2D indicates that the distribution is performed on the 2-D
    grid of processors.  Its parameters are used in the same way as
    above.

_________________________________________________________________

              Figure 1: Data Distributions in NUMACROS

2.2 Parallel Loops

In NUMACROS, a number of parallel loop constructs are defined to match
the data distributions so that good locality can be achieved.  For a
1-D grid of processors, NUMACROS provides the parallel loops do_blk,
do_cyc, and do_blk_cyc, which schedule the iterations of loops in
block, cyclic, and block-cyclic fashion respectively [12].  (The
parameters of these macros are given in Figure 2.)

_________________________________________________________________

  o do_cyc(i, l, u) schedules the iterations (l <= i < u) of the
    loop in cyclic fashion, which means that iteration i will be
    executed by thread (i mod P).  Parameter i is a private
    variable, and l and u are the lower and upper bounds of the
    loop.

  o do_blk(i, l, u) partitions the iterations (l <= i < u) of the
    loop into P chunks, and each thread executes one chunk.

  o do_blk_cyc(i, l, u, blksize) partitions the loop into chunks of
    size blksize, and schedules the chunks in a cyclic fashion.

_________________________________________________________________

                Figure 2: Parallel Loops in NUMACROS

To reduce the communication cost in applications based on
high-dimensional grids of data, NUMACROS allows data and loop nests to
be mapped onto a 2-D grid of processors.  The approach can easily be
extended to accommodate higher dimensions.  On a 2-D grid of
processors, the parallel loop constructs do1_blk, do1_cyc, and
do1_blk_cyc partition the iterations of loops along the first
dimension of the grid, and the parallel loop constructs do2_blk,
do2_cyc, and do2_blk_cyc partition them along the second dimension of
the grid.  For example, do1_blk(i, l, u) indicates that each processor
in row p1 of the 2-D processor grid (P1 x P2) executes iterations
l + p1*(u-l)/P1 through l + (p1+1)*(u-l)/P1 - 1.
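To show roughly what such loop macros amount to, the sketch below
gives one possible SPMD expansion of do_cyc and do_blk from Figure 2.
It is a sketch only: P and myid stand for the thread count and the
calling thread's id, which NUMACROS obtains from its start-up code,
and the actual macro definitions may differ.

    #include <stdio.h>

    /* Sketch only: plausible SPMD expansions of the do_cyc and
     * do_blk macros of Figure 2.  Each of the P threads is assumed
     * to know its own id, myid (0 <= myid < P); here both are plain
     * globals so the sketch can be run serially.                   */

    static int P    = 4;   /* number of threads (one per processor) */
    static int myid = 0;   /* this thread's id                      */

    /* do_cyc(i, l, u): iteration i is executed by thread (i mod P);
     * the loop starts at the smallest i >= l with (i mod P) == myid. */
    #define do_cyc(i, l, u) \
        for ((i) = (l) + ((myid - (l)) % P + P) % P; (i) < (u); (i) += P)

    /* do_blk(i, l, u): the range [l, u) is cut into P contiguous
     * chunks, and thread myid executes the myid-th chunk.          */
    #define do_blk(i, l, u) \
        for ((i) = (l) + myid * ((u) - (l)) / P; \
             (i) < (l) + (myid + 1) * ((u) - (l)) / P; (i)++)

    int main(void)
    {
        int i;

        /* Simulate the P threads one after another to show which
         * iterations of the range [0, 10) each one would execute.  */
        for (myid = 0; myid < P; myid++) {
            printf("thread %d, do_cyc:", myid);
            do_cyc(i, 0, 10) printf(" %d", i);
            printf("   do_blk:");
            do_blk(i, 0, 10) printf(" %d", i);
            printf("\n");
        }
        return 0;
    }

The 2-D constructs such as do1_blk and do2_blk would follow the same
pattern, using the thread's row or column coordinate and P1 or P2 in
place of myid and P.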
2.3 An Example: LU Decomposition

  [Figure 3, which shows the sequential LU decomposition code (main)
  side by side with the NUMACROS version (Main) that declares
  dist_1D(A, 1, CYCLIC, i, N), and the remaining text of Sections 2.3
  and 3 are missing from this copy.]

_________________________________________________________________

  [listing missing: (a) the original loop, and (b) the loop
  transformed by loop partitioning]

_________________________________________________________________

                     Figure 6: Loop Partitioning

Loop partitioning for cyclic distribution on a 1-D grid.  Assume that
a loop has no loop-carried data dependences [2] and that the number of
iterations in the loop is greater than the number of processors (P).
To optimize the index computation in a reference, the loop can be
partitioned into a two-level loop nest in a cyclic fashion such that
the reference in the inner loop only accesses data in local memory.
If the loop contains multiple references with different index
expressions, we choose one of them.  Figure 6 illustrates the loop
partitioning transformation based on the reference A[i][j], but the
method can be extended to any linear index expression (e.g. f(i) =
a*i + C).  Loop i in the original code (Figure 6 (a)) is partitioned
into an outer loop (on p) and an inner loop (on i) so that only one
add is needed for indexing (Figure 6 (b)).  Moreover, since the inner
loop accesses consecutive data, the loop partitioning may yield better
spatial locality.

Loop partitioning for block distribution on a 2-D grid.  As with loop
partitioning on a 1-D grid, a loop is partitioned into a two-level
nest so that some references in the inner loop always access a single
block of the array.  An example of loop partitioning for index
optimization is shown in Figure 7 (b).  Loop j in Figure 7 (a) is
partitioned into an outer loop (bb) and an inner loop (j) in Figure 7
(b).  The reference A[bb][i][jj] (i.e. the original A[i][j]) in the
inner loop therefore accesses block bb, requiring one operation (an
add) for indexing.  However, references of the form
A[blk(j-1)][i][off(j-1)] (i.e. the original A[i][j-1]) have not been
optimized, since they access more than one block.

Loop splitting for block distribution on a 2-D grid.  For index
expressions that differ by a constant from the one optimized by loop
partitioning (e.g. A[i][j-1] and A[i][j+1] in the example), loop
splitting can be applied.  The first and last few iterations are split
off from the inner loop so that all of these index expressions can be
computed in one cycle in the inner loop (Figure 7 (c)).

_________________________________________________________________

  double A[N][N];
  ... ...
  for (j = 1; j < N-1; j++)
      A[i][j] = (A[i][j-1] + A[i][j] + A[i][j+1])/3;

  (a) original loop

  double A[N][N];
  dist_2D(A, BLOCK, i, N, BLOCK, j, N);
  ... ...
  jj = off(1); bb1 = blk(1); bb2 = blk(N-1);
  for (bb = bb1; bb <= bb2; bb++) {
      for (j = max(bb*S2, A); j < min((bb+1)*S2, B); jj++, j++)
          A[bb][i][jj] = (A[blk(j-1)][i][off(j-1)] + A[bb][i][jj]
                          + A[blk(j+1)][i][off(j+1)])/3;
      jj = 0;
  }

  (b) after loop partitioning

  double A[N][N];
  dist_2D(A, BLOCK, i, N, BLOCK, j, N);
  ... ...
  jj = off(1); bb1 = blk(1); bb2 = blk(N-1);
  for (bb = bb1; bb <= bb2; bb++) {
      if (jj == 0)
          A[bb][i][jj] = (A[bb-1][i][S2-1] + A[bb][i][jj]
                          + A[bb][i][jj+1])/3;
      for (j = max(bb*S2+1, A, 2); j < min((bb+1)*S2-1, B, N-2);
           jj++, j++)
          A[bb][i][jj] = (A[bb][i][jj-1] + A[bb][i][jj]
                          + A[bb][i][jj+1])/3;
      if (jj == S2)
          A[bb][i][jj] = (A[bb][i][jj-1] + A[bb][i][jj]
                          + A[bb+1][i][0])/3;
      jj = 0;
  }

  (c) after loop partitioning and splitting

_________________________________________________________________

    Figure 7: Loop Transformations for Block Data Distribution
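The transformed loops in Figure 7 rely on the index helpers blk() and
off(), whose definitions do not appear above.  For a block
distribution with block size S2 along the distributed dimension, they
presumably reduce to a division and a remainder; the sketch below is
written under that assumption, with an example value for S2.

    /* Sketch only: plausible definitions of the blk() and off()
     * index helpers used in Figure 7, assuming a block distribution
     * whose block size along the second dimension is S2.  The
     * actual NUMACROS definitions may differ.                      */

    #define S2 100                  /* example block size           */

    #define blk(j)  ((j) / S2)      /* block containing global index j */
    #define off(j)  ((j) % S2)      /* offset of j within that block   */

    /* Example: with S2 = 100, global column j = 237 falls in block
     * blk(237) = 2 at offset off(237) = 37.  With the array laid
     * out by blocks, the global reference A[i][j] of Figure 7(a)
     * becomes A[blk(j)][i][off(j)]; loop partitioning and splitting
     * then let the inner loop use the cheaper form A[bb][i][jj].   */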
_________________________________________________________________

  [table body missing from this copy]

_________________________________________________________________

            Table 1: Mapped Array Reference Approaches

Table 1 summarizes the approaches to translating references to mapped
arrays under the row-cyclic distribution on a 1-D grid of processors
and under the block distribution on a 2-D grid of processors.

4 Experiments

We experimented with three basic programs on the 16-processor Hector
system: Matrix Multiplication (MM), LU Decomposition (LU), and
Successive Over-Relaxation (SOR).

  o Matrix Multiplication: the regular matrix multiplication
    algorithm was parallelized over the outer loop (shown in Figure
    8 (a)).  The matrices to be multiplied each contained 300 x 300
    double precision numbers.

  o LU Decomposition: a matrix of 400 x 400 double precision numbers
    was chosen.  The middle loop was parallelized and scheduled in a
    cyclic fashion.  (See Figure 3 (b).)

  o Successive Over-Relaxation: SOR was implemented with a serial
    outer loop and a parallel inner loop (shown in Figure 8 (b)).
    Because every processor has to access elements held by its
    neighbors, locality plays a major role in obtaining good
    performance.  The matrix contained 400 x 400 double precision
    numbers.

_________________________________________________________________

  double A[N][N], B[N][N], C[N][N];
  dist_1D(A, 1, BLOCK, i, N)
  dist_1D(B, 1, BLOCK, i, N)
  dist_1D(C, 1, BLOCK, i, N)

  Main()
  { int i, j, k;
    ... ...
    do_blk(i, 0, N)
      for (j = 0; ...

  (a) Matrix Multiplication

  double A[N][N], B[N][N];
  dist_2D(A, BLOCK, i, N, BLOCK, j, N)
  dist_2D(B, BLOCK, i, N, BLOCK, j, N)

  Main()
  { int i, j, t;
    ... ...
    for (t = 0; t < 100; t++) {
      ...

  (b) SOR

  [remainder of Figure 8 missing from this copy]

_________________________________________________________________
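For concreteness, the sketch below shows how the matrix multiplication
kernel described above might be written in this style.  It is an
illustration only, not the paper's Figure 8(a): it assumes the
NUMACROS macros and the SPMD start-up of Section 2 are available, and
the zeroing of C inside the loop is our addition.

    /* Sketch only: a NUMACROS-style matrix multiplication, with the
     * rows of A, B, and C distributed in block fashion and the
     * outer loop parallelized with do_blk, as described in Section
     * 4.  This is an illustration, not the paper's Figure 8(a).    */

    #define N 300          /* matrix order used in the experiments  */

    double A[N][N], B[N][N], C[N][N];
    dist_1D(A, 1, BLOCK, i, N)
    dist_1D(B, 1, BLOCK, i, N)
    dist_1D(C, 1, BLOCK, i, N)

    Main()
    {
        int i, j, k;

        /* Each thread computes the rows of C in its own block; rows
         * of B owned by other processors are simply read through
         * the global address space, since no explicit communication
         * is needed on a NUMA machine.                             */
        do_blk(i, 0, N)
            for (j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }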