The following paper was originally published in the Proceedings of the
USENIX SEDMS IV Conference (Experiences with Distributed and
Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association, contact:

    1. Phone:    510 528-8649
    2. FAX:      510 548-5738
    3. Email:    office@usenix.org
    4. WWW URL:  https://www.usenix.org


      NUMACROS: Data Parallel Programming on NUMA Multiprocessors

                      Hui Li and Kenneth C. Sevcik
                  Computer Systems Research Institute
                         University of Toronto
                       Toronto ON M5S 1A4 CANADA
                   Email: {hui|kcs}@csri.toronto.edu


                               Abstract

Data parallel programming has been widely used in developing
scientific applications on various types of parallel machines: SIMD
machines, MIMD distributed memory machines, and UMA shared memory
machines.  On NUMA shared memory machines, data locality is the key to
good performance of parallel applications.  In this paper, we propose
a set of macros (NUMACROS) for data parallel programming on NUMA
machines.  NUMACROS attempts to achieve both ease of programming and
the data locality needed for performance.  Programs written using
NUMACROS are nearly as short and easily readable as sequential
versions of the programs.  To obtain data locality, data and loops are
distributed and partitioned in a coordinated fashion among the
processors.  Although global address spaces facilitate data
distribution on NUMA systems, a naive implementation of an application
will suffer from high costs.  To reduce the cost, a number of
approaches have been proposed and evaluated.  These include index
precomputing, index checking, loop transformation, and others.  Our
experimental results with the Hector multiprocessor show that these
approaches are effective.  While such facilities will be provided by
compilers in the long run, NUMACROS is a helpful interim step.

_______________________________
This research has been supported by research grants from the Natural
Sciences and Engineering Research Council of Canada and from the
Information Technology Research Centre of Ontario.

1 Introduction

The data parallel programming model has been widely used in developing
scientific applications.  Writing a parallel program in this model for
distributed memory multiprocessors involves two major steps: selecting
data distributions and then using them to derive node programs with
explicit communications to access nonlocal data.  Manually specifying
communications is a tedious, non-portable, and error-prone step.  To
overcome this problem, many parallel programming languages have been
proposed, including C* [16], Dataparallel C [14, 9], Kali [12], DINO
[17], Fortran D [7, 11], the High Performance Fortran Forum (HPFF)
proposal [1], Superb [21], and NESL [4].  These languages provide
global name-spaces for ease of programming, but require the
programmers to carefully determine data distributions for good
performance.  On distributed memory multiprocessors, the compiler
translates references to global arrays into references to smaller
local arrays stored in processors' local memory modules and generates
communications for nonlocal accesses.
The performance of these programs thus depends on the effectiveness of
optimizing the communications and the global-to-local index mapping.

Non-Uniform Memory Access (NUMA) shared memory multiprocessor systems
support a global memory space in hardware.  However, the cost of a
remote memory access is significantly higher than that of a local
memory access, so memory locality is essential for good performance.
In order to achieve high locality, data distributions must match loop
partitioning.

Data parallel programming on NUMA machines raises issues different
from those on UMA and distributed memory MIMD machines.  On UMA
machines, memory locality is not an issue, so an implementation
designed for a UMA system is not well suited to NUMA machines.  On
distributed memory MIMD systems, data must be distributed among
processors, and accesses to remote data require explicit
inter-processor communications; the owner-computes rule is usually
used to generate those communications.  On NUMA machines, neither
explicit communications nor the owner-computes rule is needed.

In this paper, we describe the implementation of a set of C macros
(NUMACROS) for developing parallel applications using the data
parallel model.  NUMACROS allows programmers to annotate sequential
programs with parallel loops and data distributions, so the parallel
programs are readable and usually quite similar to the sequential
versions.  The key to achieving good performance is to match parallel
loops with data distributions.  Ideally, as software for parallel
systems matures, the facilities provided by NUMACROS will be provided
directly by compilers themselves [8].  In the meantime, however,
NUMACROS provides a convenient way to produce concise and easily
readable parallel programs that attain good speedup across a variety
of applications.

The next section describes data parallel programming using NUMACROS.
Section 3 discusses implementation issues and alternatives for data
distributions on NUMA systems.  Section 4 illustrates the importance
of data locality and evaluates the performance of some implementation
alternatives.  Section 5 discusses related work, and the last section
presents brief conclusions.

2 NUMACROS

NUMACROS (NUma MACROS) is a set of C macros for Single Program
Multiple Data (SPMD) parallel programming on NUMA multiprocessors.  It
supports parallel loops, whose iterations it schedules among the
threads, and it provides data distribution constructs for partitioning
arrays among processors.

Ideally, data distribution and loop parallelization would be generated
by parallelizing compilers using data dependence analysis and global
optimization.  We use NUMACROS to illustrate how to annotate
sequential programs with data distributions and parallel loops, and
how to generate efficient code for NUMA multiprocessors when such data
distribution and parallel loop information is available.

A parallel program in NUMACROS starts as a single thread, which then
creates a number of threads that begin executing the Main function.
To minimize scheduling overhead, the number of threads is set equal to
the number of allocated processors.  Global variables are shared by
all threads, but local variables are private.

2.1 Data Distributions

NUMACROS currently supports data distributions on both a
one-dimensional (1-D) grid and a two-dimensional (2-D) grid of
processors, through the macros dist_1D and dist_2D respectively.  The
number of processors is chosen at run time: P for a 1-D grid, and
P1 x P2 for a 2-D grid.
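To make the mapping concrete, the sketch below shows the
processor-assignment rules that the block, cyclic, and block-cyclic
distribution types described in the rest of this section imply for one
distributed dimension on a 1-D grid.  It is an illustration only: the
helper names and the example values of N and P are ours, not part of
NUMACROS.

    /* Sketch only: ownership rules implied by the dist_1D
     * distribution types for one dimension of size N on a 1-D grid
     * of P processors.  The helper names are illustrative and are
     * not part of NUMACROS.                                        */

    #define N 400      /* size of the distributed dimension (example) */
    #define P 16       /* number of processors in the 1-D grid        */

    /* BLOCK: the dimension is cut into P blocks of size N/P, and
     * the ith block is placed in the local memory of processor i.  */
    static int block_owner(int i)
    {
        return i / (N / P);            /* assumes P divides N evenly */
    }

    /* CYCLIC: element i is placed on processor (i mod P).          */
    static int cyclic_owner(int i)
    {
        return i % P;
    }

    /* BLOCK_CYCLIC: blocks of blksize elements are dealt out to the
     * processors in round-robin order.                             */
    static int block_cyclic_owner(int i, int blksize)
    {
        return (i / blksize) % P;
    }

These rules only determine which processor's local memory holds each
element; since the global memory space is supported in hardware, every
processor can still address every element directly.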
The parameters of dist_1D and dist_2D are given in Figure 1.  The
macro dist_1D maps dimensions of an array onto a 1-D grid of
processors in a block, cyclic, or block-cyclic fashion [7, 12].  For
example, dist_1D(A, 1, BLOCK, idx, N) distributes the N rows of the
two-dimensional array A in a block fashion over the 1-D grid of P
processors, where the block size is N/P and the ith block is mapped
onto processor i.  The dummy parameter idx is used for indexing, and
the second parameter indicates how many dimensions of the array are to
be distributed.  Similarly, dist_1D(A, 2, BLOCK, idxI, N, CYCLIC,
idxJ, N) distributes both the rows and the columns of array A, where
the rows are mapped in block fashion and the columns are mapped in
cyclic fashion, e.g. column j is on processor p if and only if
(j mod P = p).

The macro dist_2D provides block, cyclic, and block-cyclic data
distributions for a 2-D grid of processors.  For example,
dist_2D(A, BLOCK, i, N, CYCLIC, j, N) specifies that the rows of array
A are distributed in blocks along the first dimension of the processor
grid and the columns are distributed in cyclic fashion along the
second dimension of the processor grid.

_________________________________________________________________

dist_1D(A, 1, distType, i, sizeI)

  o dist_1D indicates that the distribution is performed on the 1-D
    grid of processors.
  o The first parameter gives the array name.
  o The second parameter, equal to 1, indicates that only the first
    dimension of the array is distributed among processors, using
    the distType method.
  o distType can be BLOCK, CYCLIC, or BLOCK_CYCLIC.
  o i is a dummy parameter for indexing.
  o sizeI is the size of the distributed dimension.

dist_1D(A, 2, distTypeI, i, sizeI, distTypeJ, j, sizeJ)

  o The second parameter, equal to 2, indicates that the first two
    dimensions of the array are distributed.  The parameters
    distTypeI, i, and sizeI are used for the distribution of the
    first dimension, and the parameters distTypeJ, j, and sizeJ are
    used for the second dimension.

dist_2D(A, distTypeI, i, sizeI, distTypeJ, j, sizeJ)

  o dist_2D indicates that the distribution is performed on the 2-D
    grid of processors.  Its parameters are used in the same way as
    above.

_________________________________________________________________

              Figure 1: Data Distributions in NUMACROS

2.2 Parallel Loops

In NUMACROS, a number of parallel loop constructs are defined to match
the data distributions so that good locality can be achieved.  For a
1-D grid of processors, NUMACROS provides the parallel loops do_blk,
do_cyc, and do_blk_cyc, which schedule the iterations of loops in
block, cyclic, and block-cyclic fashion respectively [12].  (The
parameters of these macros are given in Figure 2.)

_________________________________________________________________

  o do_cyc(i, l, u) schedules the iterations (l <= i < u) of the
    loop in cyclic fashion, which means that iteration i will be
    executed by thread (i mod P).  Parameter i is a private
    variable, and l and u are the lower and upper bounds of the
    loop.

  o do_blk(i, l, u) partitions the iterations (l <= i < u) of the
    loop into P chunks, and each thread executes one chunk.

  o do_blk_cyc(i, l, u, blksize) partitions the loop into chunks of
    size blksize, and schedules the chunks in a cyclic fashion.

_________________________________________________________________

                Figure 2: Parallel Loops in NUMACROS

To reduce the communication cost in applications based on
high-dimensional grids of data, NUMACROS allows data and loop nests to
be mapped onto a 2-D grid of processors.  The approach can easily be
extended to accommodate higher dimensions.  On a 2-D grid of
processors, the parallel loop constructs do1_blk, do1_cyc, and
do1_blk_cyc partition the iterations of loops along the first
dimension of the grid, and the parallel loop constructs do2_blk,
do2_cyc, and do2_blk_cyc partition them along the second dimension of
the grid.  For example, do1_blk(i, l, u) indicates that each processor
in row p1 of the 2-D processor grid (P1 x P2) executes iterations
l + p1*(u-l)/P1 through l + (p1+1)*(u-l)/P1 - 1.
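To show roughly what such loop macros amount to, the sketch below
gives one possible SPMD expansion of do_cyc and do_blk from Figure 2.
It is a sketch only: P and myid stand for the thread count and the
calling thread's id, which NUMACROS obtains from its start-up code,
and the actual macro definitions may differ.

    #include <stdio.h>

    /* Sketch only: plausible SPMD expansions of the do_cyc and
     * do_blk macros of Figure 2.  Each of the P threads is assumed
     * to know its own id, myid (0 <= myid < P); here both are plain
     * globals so the sketch can be run serially.                   */

    static int P    = 4;   /* number of threads (one per processor) */
    static int myid = 0;   /* this thread's id                      */

    /* do_cyc(i, l, u): iteration i is executed by thread (i mod P);
     * the loop starts at the smallest i >= l with (i mod P) == myid. */
    #define do_cyc(i, l, u) \
        for ((i) = (l) + ((myid - (l)) % P + P) % P; (i) < (u); (i) += P)

    /* do_blk(i, l, u): the range [l, u) is cut into P contiguous
     * chunks, and thread myid executes the myid-th chunk.          */
    #define do_blk(i, l, u) \
        for ((i) = (l) + myid * ((u) - (l)) / P; \
             (i) < (l) + (myid + 1) * ((u) - (l)) / P; (i)++)

    int main(void)
    {
        int i;

        /* Simulate the P threads one after another to show which
         * iterations of the range [0, 10) each one would execute.  */
        for (myid = 0; myid < P; myid++) {
            printf("thread %d, do_cyc:", myid);
            do_cyc(i, 0, 10) printf(" %d", i);
            printf("   do_blk:");
            do_blk(i, 0, 10) printf(" %d", i);
            printf("\n");
        }
        return 0;
    }

The 2-D constructs such as do1_blk and do2_blk would follow the same
pattern, using the thread's row or column coordinate and P1 or P2 in
place of myid and P.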
2.3 An Example: LU Decomposition

  [Figure 3, which shows the sequential LU decomposition code (main)
  side by side with the NUMACROS version (Main) that declares
  dist_1D(A, 1, CYCLIC, i, N), and the remaining text of Sections 2.3
  and 3 are missing from this copy.]

_________________________________________________________________

  [listing missing: (a) the original loop, and (b) the loop
  transformed by loop partitioning]

_________________________________________________________________

                     Figure 6: Loop Partitioning

Loop partitioning for cyclic distribution on a 1-D grid.  Assume that
a loop has no loop-carried data dependences [2] and that the number of
iterations in the loop is greater than the number of processors (P).
To optimize the index computation in a reference, the loop can be
partitioned into a two-level loop nest in a cyclic fashion such that
the reference in the inner loop only accesses data in local memory.
If the loop contains multiple references with different index
expressions, we choose one of them.  Figure 6 illustrates the loop
partitioning transformation based on the reference A[i][j], but the
method can be extended to any linear index expression (e.g. f(i) =
a*i + C).  Loop i in the original code (Figure 6 (a)) is partitioned
into an outer loop (on p) and an inner loop (on i) so that only one
add is needed for indexing (Figure 6 (b)).  Moreover, since the inner
loop accesses consecutive data, the loop partitioning may yield better
spatial locality.

Loop partitioning for block distribution on a 2-D grid.  As with loop
partitioning on a 1-D grid, a loop is partitioned into a two-level
nest so that some references in the inner loop always access a single
block of the array.  An example of loop partitioning for index
optimization is shown in Figure 7 (b).  Loop j in Figure 7 (a) is
partitioned into an outer loop (bb) and an inner loop (j) in Figure 7
(b).  The reference A[bb][i][jj] (i.e. the original A[i][j]) in the
inner loop therefore accesses block bb, requiring one operation (an
add) for indexing.  However, references of the form
A[blk(j-1)][i][off(j-1)] (i.e. the original A[i][j-1]) have not been
optimized, since they access more than one block.

Loop splitting for block distribution on a 2-D grid.  For index
expressions that differ by a constant from the one optimized by loop
partitioning (e.g. A[i][j-1] and A[i][j+1] in the example), loop
splitting can be applied.  The first and last few iterations are split
off from the inner loop so that all of these index expressions can be
computed in one cycle in the inner loop (Figure 7 (c)).

_________________________________________________________________

  double A[N][N];
  ... ...
  for (j = 1; j < N-1; j++)
      A[i][j] = (A[i][j-1] + A[i][j] + A[i][j+1])/3;

  (a) original loop

  double A[N][N];
  dist_2D(A, BLOCK, i, N, BLOCK, j, N);
  ... ...
  jj = off(1); bb1 = blk(1); bb2 = blk(N-1);
  for (bb = bb1; bb <= bb2; bb++) {
      for (j = max(bb*S2, A); j < min((bb+1)*S2, B); jj++, j++)
          A[bb][i][jj] = (A[blk(j-1)][i][off(j-1)] + A[bb][i][jj]
                          + A[blk(j+1)][i][off(j+1)])/3;
      jj = 0;
  }

  (b) after loop partitioning

  double A[N][N];
  dist_2D(A, BLOCK, i, N, BLOCK, j, N);
  ... ...
  jj = off(1); bb1 = blk(1); bb2 = blk(N-1);
  for (bb = bb1; bb <= bb2; bb++) {
      if (jj == 0)
          A[bb][i][jj] = (A[bb-1][i][S2-1] + A[bb][i][jj]
                          + A[bb][i][jj+1])/3;
      for (j = max(bb*S2+1, A, 2); j < min((bb+1)*S2-1, B, N-2);
           jj++, j++)
          A[bb][i][jj] = (A[bb][i][jj-1] + A[bb][i][jj]
                          + A[bb][i][jj+1])/3;
      if (jj == S2)
          A[bb][i][jj] = (A[bb][i][jj-1] + A[bb][i][jj]
                          + A[bb+1][i][0])/3;
      jj = 0;
  }

  (c) after loop partitioning and splitting

_________________________________________________________________

    Figure 7: Loop Transformations for Block Data Distribution
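The transformed loops in Figure 7 rely on the index helpers blk() and
off(), whose definitions do not appear above.  For a block
distribution with block size S2 along the distributed dimension, they
presumably reduce to a division and a remainder; the sketch below is
written under that assumption, with an example value for S2.

    /* Sketch only: plausible definitions of the blk() and off()
     * index helpers used in Figure 7, assuming a block distribution
     * whose block size along the second dimension is S2.  The
     * actual NUMACROS definitions may differ.                      */

    #define S2 100                  /* example block size           */

    #define blk(j)  ((j) / S2)      /* block containing global index j */
    #define off(j)  ((j) % S2)      /* offset of j within that block   */

    /* Example: with S2 = 100, global column j = 237 falls in block
     * blk(237) = 2 at offset off(237) = 37.  With the array laid
     * out by blocks, the global reference A[i][j] of Figure 7(a)
     * becomes A[blk(j)][i][off(j)]; loop partitioning and splitting
     * then let the inner loop use the cheaper form A[bb][i][jj].   */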
_________________________________________________________________

  [table body missing from this copy]

_________________________________________________________________

            Table 1: Mapped Array Reference Approaches

Table 1 summarizes the approaches to translating references to mapped
arrays under the row-cyclic distribution on a 1-D grid of processors
and under the block distribution on a 2-D grid of processors.

4 Experiments

We experimented with three basic programs on the 16-processor Hector
system: Matrix Multiplication (MM), LU Decomposition (LU), and
Successive Over-Relaxation (SOR).

  o Matrix Multiplication: the regular matrix multiplication
    algorithm was parallelized over the outer loop (shown in Figure
    8 (a)).  The matrices to be multiplied each contained 300 x 300
    double precision numbers.

  o LU Decomposition: a matrix of 400 x 400 double precision numbers
    was chosen.  The middle loop was parallelized and scheduled in a
    cyclic fashion.  (See Figure 3 (b).)

  o Successive Over-Relaxation: SOR was implemented with a serial
    outer loop and a parallel inner loop (shown in Figure 8 (b)).
    Because every processor has to access elements held by its
    neighbors, locality plays a major role in obtaining good
    performance.  The matrix contained 400 x 400 double precision
    numbers.

_________________________________________________________________

  double A[N][N], B[N][N], C[N][N];
  dist_1D(A, 1, BLOCK, i, N)
  dist_1D(B, 1, BLOCK, i, N)
  dist_1D(C, 1, BLOCK, i, N)

  Main()
  { int i, j, k;
    ... ...
    do_blk(i, 0, N)
      for (j = 0; ...

  (a) Matrix Multiplication

  double A[N][N], B[N][N];
  dist_2D(A, BLOCK, i, N, BLOCK, j, N)
  dist_2D(B, BLOCK, i, N, BLOCK, j, N)

  Main()
  { int i, j, t;
    ... ...
    for (t = 0; t < 100; t++) {
      ...

  (b) SOR

  [remainder of Figure 8 missing from this copy]

_________________________________________________________________
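For concreteness, the sketch below shows how the matrix multiplication
kernel described above might be written in this style.  It is an
illustration only, not the paper's Figure 8(a): it assumes the
NUMACROS macros and the SPMD start-up of Section 2 are available, and
the zeroing of C inside the loop is our addition.

    /* Sketch only: a NUMACROS-style matrix multiplication, with the
     * rows of A, B, and C distributed in block fashion and the
     * outer loop parallelized with do_blk, as described in Section
     * 4.  This is an illustration, not the paper's Figure 8(a).    */

    #define N 300          /* matrix order used in the experiments  */

    double A[N][N], B[N][N], C[N][N];
    dist_1D(A, 1, BLOCK, i, N)
    dist_1D(B, 1, BLOCK, i, N)
    dist_1D(C, 1, BLOCK, i, N)

    Main()
    {
        int i, j, k;

        /* Each thread computes the rows of C in its own block; rows
         * of B owned by other processors are simply read through
         * the global address space, since no explicit communication
         * is needed on a NUMA machine.                             */
        do_blk(i, 0, N)
            for (j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }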