The following paper was originally published in the
Proceedings of the USENIX SEDMS IV Conference
(Experiences with Distributed and Multiprocessor Systems)
San Diego, California, September 22-23, 1993

For more information about the USENIX Association contact:
1. Phone:   510 528-8649
2. FAX:     510 548-5738
3. Email:   office@usenix.org
4. WWW URL: https://www.usenix.org

(This is an ASCII version of the paper presented at the Symposium on
Experiences with Distributed and Multiprocessor Systems (SEDMS IV),
San Diego, California, September, 1993.  Figures were omitted from
this version.)

************************

        On the Importance of Parallel Application Placement
                     in NUMA Multiprocessors

                         Timothy Brecht +

       Department of Computer Science, University of Toronto
                 Toronto, Ontario, CANADA M5S 1A4
                       brecht@cs.toronto.edu

Abstract

The thesis of this paper is that scheduling decisions in large-scale,
shared-memory, NUMA (Non-Uniform Memory Access) multiprocessors must
consider not only how many processors, but also which processors to
allocate to each application.  We call the problem of assigning
parallel processes of an application to processors application
placement.  We explore the importance of placement decisions by
measuring the execution time of several parallel applications using
different placements on a shared-memory NUMA multiprocessor.  The
results of these experiments lead us to conclude that, as expected, in
small-scale mildly NUMA multiprocessors, placement decisions have only
a minor effect on the execution time of parallel applications.
However, the results also show that placement decisions in large-scale
multiprocessors are critical and localization that considers the
architectural clusters inherent in these systems is essential.  Our
experiments also show that the importance of placement decisions
increases substantially with the size and NUMAness of the system and
that the placement of individual processes of an application within
the subset of chosen processors also significantly impacts performance.

____________
+ Author's current address: Department of Computer Science, York
University, 4700 Keele Street, North York, Ontario, CANADA M3J 1P3
email: brecht@cs.yorku.ca

1. Introduction

Small-scale, shared-memory multiprocessors based on a single shared
bus have become prevalent and the number of manufacturers building and
selling such systems continues to rise.  The success of such systems
can be partially attributed to the simple parallel programming model
they present, relative to strictly non-shared-memory systems.  This
simple programming model has allowed many applications to achieve
substantial increases in performance by making effective use of all of
the processors in the system.  The considerable improvements in
parallel application performance attained using small-scale
multiprocessors have fueled the desire for even greater performance
improvements.  One approach to increasing performance is to simply
build larger systems while maintaining the shared-memory model.
Single-bus systems, however, are not scalable because the bandwidth of
the bus limits their size.
As a result, research and design efforts in shared-memory
multiprocessors have focused on scalable architectures.  These
architectures distribute memory modules throughout the system in order
to optimize access times to some memory locations.  The result is an
important class of scalable shared-memory systems known as Non-Uniform
Memory Access (NUMA) multiprocessors.  Alternatively, all memory
accesses could be made uniform, but then they would be uniformly slow.

The emergence of large-scale, shared-memory multiprocessors presents a
number of new opportunities and challenges.  The opportunities are to
solve much larger problems than previously possible, with applications
that use more processors, and to solve many problems concurrently, by
simultaneously executing multiple parallel applications.  The
challenges are to effectively utilize the processors while enabling
the efficient execution of multiple applications.  The
multiprogramming of parallel applications is required because not all
applications will be capable of effectively utilizing all processors
in a large-scale system.

An obvious but critical difference between scheduling in this new
class of NUMA (Non-Uniform Memory Access) multiprocessors and
small-scale UMA (Uniform Memory Access) multiprocessors is that in UMA
systems all processors can be treated equally (aside from cache
contexts).  This is because in a UMA system the time to access any
memory location is the same from any processor.  NUMA system designers
must, however, consider the time it takes to access different memory
locations from different processors.  Therefore, an important aspect
of scheduling in large-scale, shared-memory, NUMA multiprocessors is
application placement.  That is, how should the parallel processes of
an application be placed in a NUMA multiprocessor?

This paper shows that in large-scale, shared-memory, NUMA
multiprocessors the execution time of a parallel application is
directly related to which processors it executes on.  As a result,
efficient and effective placement decisions become critical to
processor scheduling and overall system performance.  In fact,
application placement is likely to be a contributing factor in
ultimately determining the success or failure of large-scale NUMA
multiprocessors.

The remainder of this paper is organized as follows: Section 2
presents related work.  Section 3 describes three existing
shared-memory NUMA multiprocessors along with their architectural and
NUMA characteristics.  This is followed in Section 4 by a description
and explanation of the importance of localization during application
placement.  Section 5 describes the environment and Section 6 the
applications used in the experiments.  In Section 7 we present the
results of a series of experiments designed to demonstrate the
importance of localization.  The paper is concluded in Section 8 with
a summary of the results, conclusions and a brief discussion of future
work.

2. Related Work

As microprocessor technology continues to improve at a faster rate
than memory or interconnection network technology, the relative
increase in communication costs in multiprocessors has become a topic
of increasing importance.  A number of recent studies consider the
importance of memory access costs when making scheduling decisions in
shared-memory multiprocessors.
An important and common goal of this research is to reduce the number
of cache misses or remote memory references by co-locating lightweight
threads or kernel processes with the data being accessed, thus
reducing the time spent loading data into the local cache or memory.

Using an analytic model of a time-sliced, central ready-queue
scheduling environment and experimental evaluation on a UNIX-based
multiprocessor, Squillante and Lazowska [11] [10] argue and
demonstrate that applications can build considerable cache context, or
footprints [13].  Recognizing that it may be more efficient to execute
a process on a processor that already contains relevant data in that
processor's cache, they design and examine techniques that consider
the affinity a process has for a processor.  They observe that the
performance of applications, which release a processor because of
quantum expiration, preemption, or I/O, can be significantly improved
by making use of processor-cache affinity information.

Subsequent studies have arrived at different conclusions.  Gupta,
Tucker, and Urushibara [5] also consider the importance of
cache-affinity techniques but in a space-shared multiprocessor
environment.  They simulate a number of scheduling techniques and,
using processor utilization as a performance metric, conclude that
improvements due to processor-cache affinity are quite small,
improving mean processor utilization by only 3%.  Vaswani and Zahorjan
[16] draw similar conclusions in their study of the importance of
cache affinity.  They also suggest, using an analytic model, that even
with faster processors and larger caches the benefits due to cache
affinity will be minimal.

The apparent difference between the conclusions drawn in these studies
is largely due to the difference in the scheduling policies used to
multiprogram applications.  While Squillante and Lazowska use
time-sharing, the study by Gupta, Tucker, and Urushibara and the study
by Vaswani and Zahorjan both use space-sharing.  The time-sharing
policy employs a small reallocation interval.  This results in
relatively frequent context switches and ensures that processes do not
run long enough to interfere with each other significantly.  Vaswani
and Zahorjan found that with space-sharing the frequency of context
switches is reduced and that intervening applications ran long enough
to significantly disrupt the cache context of the previous process,
thus greatly reducing the benefits of processor-cache affinity.

While cache affinity studies investigate the benefits of reusing
cached data when executing more than one application on the same
processor (across applications), related work at the University of
Rochester concentrates on the benefits of reusing cached data when
executing lightweight threads of the same application on the same
processor (within applications).  The Rochester work also extends the
notion of locality management to include one more level in the memory
hierarchy by considering systems that may have local cache, local
memory and remote memory, such as the BBN TC2000.

Markatos [7] first demonstrates that fine-grain parallel programs,
because of the overhead required to load data into the local cache or
memory, typically perform much worse than coarse-grain implementations
even though the cost of thread management is negligible.  This
motivates the need for techniques that consider locality when
scheduling lightweight threads within an application.
Markatos then develops a technique called memory-conscious scheduling which, when used with fine-grain applications, yields execution times that are comparable to coarse-grained implementations. - 5 - Markatos and LeBlanc [8] consider the conflicting requirements for load balancing and locality management when scheduling lightweight threads of an application. They conclude that of the two important considerations, locality management should be the primary factor influencing the assignment of threads to processors. The importance of locality management is also explored in their work on loop scheduling [9]. They demonstrate how traditional loop scheduling techniques incur significant performance penalties on modern shared-memory multiprocessors. They then propose and compare new loop scheduling algorithms that consider the requirements of load balancing, minimizing synchronization, and co-locating loop iterations with the data being referenced. These new algorithms are shown to improve performance by up to 60% in some cases. Our work is complementary to processor-cache affinity and lightweight thread scheduling techniques for improving locality of data references. While these previous studies investigate the importance of scheduling techniques for reducing the number of non-local memory accesses by co- locating processes with the data being accessed, our work investigates the importance of scheduling techniques for reducing the cost of required non-local memory accesses in environments where processes and data cannot be co-located. We have conducted a preliminary simulation study [18] which indicates that placement is an important aspect of scheduling in large-scale, NUMA multiprocessors and has motivated the need for experimentation on a real NUMA multiprocessor. In this paper we experimentally investigate the problem of scheduling parallel processes of an application that concurrently access shared data in an environment in which there is no a priori knowledge of sharing or communication patterns. The complexity of the problem is increased by the architectural trend to cluster processors and memory elements and to connect clusters together in a hierarchical fashion in order to build larger systems. This results in systems with a number of levels in the memory hierarchy and memory access latencies that vary with the number of levels of the hierarchy that must be traversed. Therefore, the placement problem becomes one of placing processes of an application onto processors such that the costs of required accesses to shared data are minimized. - 6 - 3. Scalable Shared-Memory Multiprocessors Examples of three existing scalable shared-memory multiprocessors are the KSR1, from Kendall Square Research [2], DASH, developed at Stanford University [6], and Hector, developed at the University of Toronto [17]. Each of these systems incorporates a hierarchical design to build larger systems by using small-scale multiprocessor components as building blocks. In Hector and DASH the base component is essentially a bus-based multiprocessor containing a small number of processors (they are called stations and clusters, respectively). In the KSR1 the base component, called Ring:0, is a unidirectional ring connecting up to 32 processors. Both the KSR1 and Hector use a ring to connect base components together to form larger systems. The Hector design provides for another level in the hierarchy by connecting a collection of rings together with what is called a global ring. 
The DASH system uses a mesh interconnection network to connect base
components together.  The processing modules in the KSR1 and Hector,
besides containing a processor and associated cache, also contain
local processor memory, which is used to further optimize access times
to some memory locations and reduce contention for the base component
interconnection network.  In DASH each cluster is essentially a
Silicon Graphics multiprocessor which does not contain localized
processor memory but instead contains a secondary shared cache.  The
processor used in the KSR1 is a 20 MHz RISC processor developed by
Kendall Square Research.  DASH uses the 33 MHz MIPS R3000 processor
while Hector uses the 16.67 MHz Motorola MC88100.

 ______________________________________________________________________
|System  | Memory |         Memory          | Processor |   Approx.   |
|and CPU | Level  |        Location         |  Cycles   | System Size |
|________|________|_________________________|___________|_____________|
|KSR1    |   1    | Local Memory            |    18     |      1      |
|        |   2    | Ring 0                  |   126     |     32      |
|________|________|_________________________|___________|_____________|
|DASH    |   1    | Secondary Cache         |    15     |      1      |
|        |   2    | Local Bus Memory        |    29     |      4      |
|        |   3    | Remote Cluster Memory   |   132     |    16-64    |
|________|________|_________________________|___________|_____________|
|Hector  |   1    | Local Memory            |    19     |      1      |
|        |   2    | On Station Memory       |    29     |      4      |
|        |   3    | On Ring Memory          |    37     |     16      |
|________|________|_________________________|___________|_____________|

     Table 1.1: Memory reference hierarchies and latencies of some
                NUMA multiprocessors

Table 1.1 shows some memory latency times in processor cycles for each
of these systems.  The times for the KSR1 are in 50 nano-second cycles
and are the times required to read one 128 byte cache line [3].  DASH
and Hector have 30 and 60 nano-second cycle times respectively, and
the latencies shown in Table 1.1 are for loading one 16 byte cache
line [6] [12].  This table illustrates two of the key issues related
to the use of shared-memory NUMA multiprocessors:

1) The time to access remote memory can be significant.

2) The time to access remote memory depends on the distance to the
   location being accessed (the number of levels of the hierarchy that
   must be traversed).

It is therefore quite natural to hypothesize that placing the parallel
processes of an application close to each other in order to reduce
communication costs will be essential for their efficient execution.
The degree to which the execution time of an application will benefit
from a localized placement depends on the number, frequency and
latency of remote communication.

4. Application Placement for Localization

In this section we describe how a scheduler might choose a
``localized'' subset of processors on which to execute an application.
Fortunately, most scalable shared-memory architectures adhere to a
hierarchical design and as a result determining a ``localized'' subset
of processors is not difficult.  Note that processes of an application
must be placed individually, since we are assuming a dynamic
scheduling environment in which there is no a priori knowledge of the
number of processors an application will be allocated.

From any one processor, remote memory accesses can have successively
higher costs as the distance from the requesting processor increases.
These costs can be thought of as forming a hierarchy of levels, where
the access time from a given processor to any memory module within the
same level is the same.
If from each processor we define M_l to be the time to access memory
at level l in the hierarchy, for l = 1, 2, 3, ..., L, then:

        M_1 < M_2 < ... < M_l < ... < M_L.

One method of building large-scale, shared-memory multiprocessors
which is currently popular is to connect processors in a clearly
hierarchical fashion, as is shown in Figure 1.1a.  This is the type of
interconnection scheme used in the DASH and Hector systems.  Figure
1.1b is an example of an alternative interconnection scheme that is
not strictly hierarchical by design.  It is presented as an example of
how memory access times can be organized into a hierarchical structure
when viewed from individual processors.

Figures 1.1a and 1.1b assume that the system is built from
processor/memory pairs.  The labels indicate the level of the memory
access hierarchy that each processor/memory pair belongs to.  Labels
are assigned relative to the specified source processor, which is
labelled 1 and belongs to level 1.  If we assume that the first
process of an application is placed randomly on the processor/memory
pair labelled 1, a localized placement places the next process of that
application on any one of the processor/memory pairs labelled 2, since
they can all be accessed from the first processor with the same
latency.  Future placements for the same application continue to use
processor/memory pairs labelled 2, until they are all used, at which
point level 3 is used.  The levels in Figure 1.1a adhere to the
hierarchical structure of the system while the levels in Figure 1.1b
are determined by simply counting the number of hops required to reach
each processor/memory pair from the first processor.

Figure 1.1: Memory access hierarchy in two multiprocessor designs

A rough classification of memory reference times, such as the
hierarchical view just discussed, provides for localized processor
allocations by choosing processors from the lowest level possible,
relative to the first processor chosen, then moving up the hierarchy
once all processors at the current level have been allocated.  This is
repeated until the number of desired processors has been obtained.

5. The Experimental Environment

The experiments presented here were conducted using a prototype of a
scalable shared-memory NUMA multiprocessor called Hector, developed at
the University of Toronto [17].  Each processing module in the Hector
prototype contains a 16.67 MHz Motorola MC88100 CPU, a 16 Kbyte
instruction cache, a 16 Kbyte data cache, and 4 Mbytes of globally
addressable memory.  The hierarchical design used in Hector connects a
number of processing modules with a bus to comprise a station, several
stations are connected with a bit-parallel slotted ring, and rings can
be further connected using a hierarchy of rings to easily support up
to 256 processors.  The prototype used consists of 4 stations, each
containing 4 processor modules, for a total of 16 processors and 64
Mbytes of globally addressable memory.  There is no hardware support
for cache coherence, thus permitting a simple and elegant design that
has relatively mild NUMA characteristics.  Cache coherence is enforced
in software, by the HURRICANE operating system's memory manager at a
4 Kbyte page level of granularity, by permitting only unshared and
read-shared pages to be cacheable [14] [15].  HURRICANE also supports
page migration and replication but these features were disabled in
order to conduct these experiments.
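The localization procedure of Section 4 lends itself to a very simple
implementation.  The sketch below illustrates one way a scheduler for
a 16-processor, 4-station, Hector-like system might choose processors
one at a time, always taking the lowest available level of the memory
access hierarchy relative to the processor holding the application's
first process.  It is only an illustration under stated assumptions:
the function and variable names are hypothetical, the numbering of
processors within stations is assumed, and this is not the scheduler
actually used in the experiments.

    /*
     * Sketch of localized placement for a 16-processor, 4-station,
     * Hector-like system (hypothetical code, not the HURRICANE or
     * user-level scheduler used in the experiments).
     */
    #include <stdio.h>

    #define NPROCS  16   /* processors in the system */
    #define NLEVELS  3   /* levels in the memory access hierarchy */

    /* Memory access level of processor p as seen from processor "first":
     * 1 = same processor module, 2 = same station, 3 = elsewhere on the
     * ring.  Processors 4i .. 4i+3 are assumed to form station i. */
    static int level(int first, int p)
    {
        if (p == first)
            return 1;
        if (p / 4 == first / 4)
            return 2;
        return 3;
    }

    /* Choose a processor for the next process of an application whose
     * first process runs on "first": take the lowest level that still
     * has a free processor, moving up the hierarchy only when the
     * current level is exhausted. */
    static int place_next(int first, const int busy[NPROCS])
    {
        for (int l = 1; l <= NLEVELS; l++)
            for (int p = 0; p < NPROCS; p++)
                if (!busy[p] && level(first, p) == l)
                    return p;
        return -1;   /* no free processor */
    }

    int main(void)
    {
        int busy[NPROCS] = { 0 };
        int first = 13;                  /* first process on processor 13 */
        busy[first] = 1;

        for (int i = 2; i <= 4; i++) {   /* place three more processes */
            int p = place_next(first, busy);
            if (p < 0)
                break;
            busy[p] = 1;
            printf("process %d -> processor %d\n", i, p);
        }
        return 0;
    }

With the first process on processor 13 (Station 3), the sketch places
the remaining three processes on the other processors of Station 3
before considering any other station, analogous to the localized
placement ``---- ---- ---- 1234'' used in the experiments of Section 7.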
The parallel processes comprising each application are placed on processors in the system by a system scheduler. This is accomplished by having each HURRICANE process or thread creation first contact a user-level scheduler to determine which processor to execute on. The scheduler is designed for a dynamic multi-purpose, multiprogrammed environment so it is assumed that the desired number of processors is not known a priori. As a result processor requests and placement decisions occur one at a time, at the time of process creation. The scheduler was configured in such a way that the desired placements were obtained. Applications were linked with a special library that directs creation calls to the scheduler and notifies the scheduler whenever a process is finished executing. 6. The Applications Since the current Hector system is a prototype, and because much of the work being conducted on the system consists of operating systems research and performance evaluation, there is not a large body of regular users. As a result the applications collected for experimentation consist mainly of applications that were written either to evaluate this type of research or as part of a course project on parallel programming. Consequently some of the applications used are really kernels of what would be considered real parallel applications. All of the applications are of the data parallel or single program multiple data class of applications, which means that each process executes the same computational kernel on different portions of the data space. The data access patterns of each application are different, so the importance of the placement of parallel processes of the application should vary with each application. Since the placement of each parallel process of an application is important relative to the data that is being accessed, the HURRICANE operating system permits the application writer to roughly control where data will be located by specifying the policy to be used when requesting memory. In the applications used most of the shared data is allocated to memory according to a first-hit policy. That is, data will be physically located in the memory of the processor module that first touches the page containing that data. Some applications specify a round-robin policy for some of the shared data so that frequent access of the data by many - 10 - processors reduces the likelihood of hot-spots and also reduces remote memory access costs when executing on all 16 processors. Even though we do not use all of the processors in the system we have not modified the memory allocation policies used by the applications (currently there is no policy for allocating memory on a round-robin basis from the subset of processors assigned to the application). For each application the main process creates a number of children which act as slaves. Since each process of the application is allocated to a separate processor and we do not want processors to be unnecessarily idle, some applications were modified so that the master process not only controls and synchronizes the children but also performs its share of the computation, rather than simply waiting for the children to perform the computation. The applications are listed in Table 1.2 along with the problem size, precision used, the number of lines of C source code, and the speedup measured using four processors of one station, S(4). 
The speedup values shown were computed by comparing the execution
times of the parallel application using one and four processors (since
a serial version was not available for all applications).  The number
of source code lines may be slightly high due to the large number of
timing, tracing, and debugging calls used when tuning the
applications.  More detailed descriptions of each application can be
found in [1].

 ___________________________________________________________________________________________
|Application| Description                                    |  Size   |Precision|Lines|S(4)|
|___________|________________________________________________|_________|_________|_____|____|
|FFT        | 2D fast Fourier transform                      | 256x256 | Single  |1300 | 2.9|
|___________|________________________________________________|_________|_________|_____|____|
|HOUGH      | Hough transformation                           |         |         |     |    |
|___________|________________________________________________|_________|_________|_____|____|
|MM         | Matrix multiplication                          | 192x192 | Double  | 500 | 3.4|
|___________|________________________________________________|_________|_________|_____|____|
|NEURAL     | Neural network backpropagation                 |         |         |     |    |
|___________|________________________________________________|_________|_________|_____|____|
|PDE        | Partial differential equation solver using     |  96x96  | Double  | 700 | 3.7|
|           | successive over-relaxation                     |         |         |     |    |
|___________|________________________________________________|_________|_________|_____|____|
|SIMPLEX    | Simplex method for linear programming          |         |         |     |    |
|___________|________________________________________________|_________|_________|_____|____|

            Table 1.2: Summary of the applications used

The size of the system used is relatively small, and in order to
evaluate different application placements, each application is
executed using four processors (four processors was also chosen
because some applications constrained the number of processors used to
a power of two or to a number that evenly divides the size of the data
set used).  Although the size of the data sets may appear to be small,
they were chosen for a number of reasons:

1) They should execute on four processors in a reasonable amount of
   time since multiple executions of each application are used to
   compute means and confidence intervals.

2) The size of the data cache on each processor is relatively small
   (16 Kbytes).  Consequently cache misses and memory accesses will
   occur, even with a relatively small problem size.

3) The amount of memory currently configured per processor is
   relatively small (4 Mbytes).  If problem sizes are too large, data
   structures that are designed to be allocated to the local processor
   (by using the first-hit allocation policy) may have to be allocated
   to a different processor, resulting in remote memory references
   where the application programmer had not intended.  That is, once
   all of the physical memory of the local processor has been
   allocated, the memory manager will allocate memory from a
   neighbouring but remote module.

The seemingly poor speedup of some applications is the result of the
small data sets used to perform these experiments, since most of the
applications were designed to be used with larger data sets on more
processors (i.e., the parallelism is relatively coarse-grained).

7. Impacts of Placement on Performance

In order to examine the importance of localization in shared-memory
NUMA multiprocessors we conduct a series of experiments using the
Hector multiprocessor and six parallel applications (or application
kernels), each executing on four processors.  The main purpose of
these experiments is to determine the importance of application
placement.
That is, the importance of localization.  The experiments are
conducted by running each application in isolation under different
placement strategies.  The execution times of the localized placement
are then compared with the non-localized placements.

The prototype Hector system used to conduct the experiments is
configured with sixteen processors.  To avoid interference caused by
system processes and the workload generator, thus ensuring that
differences in execution times are due solely to different placements,
we dedicate one station (the four processors in Station 0) to their
execution.  That is, only Stations 1, 2, and 3 are used to execute the
applications being tested.

Figure 1.2 illustrates a localized and non-localized placement of four
processes of an application and the notation used to represent these
placements.  In the localized placement, the four dashes ``----''
above Stations 0, 1 and 2 indicate that the four processors in each of
these stations are not used, while the numbers ``1234'' above Station
3 indicate which processor each of the four processes of the
application is placed on.  The first process (1) is the main (master)
process of the program and the remaining three (2, 3 and 4) are the
child (slave) processes.  The placement is localized because all four
processes of the application execute within one station and the
notation is ``---- ---- ---- 1234''.  The non-localized placement
spreads the processes across the twelve processors being considered
(as mentioned previously, Station 0 is not used, in order to avoid
interference with system and workload generating processes which are
restricted to that station).  The first process executes on Station 3,
the second and fourth on Station 2, and the third on Station 1, and
the notation is ``---- ---3 --42 ---1''.

Figure 1.2: Localized and non-localized placements

Figure 1.3 shows the normalized mean execution times of the six
applications when executed using the localized and non-localized
placements.  This graph along with the detailed results in Table 1.3
show that localized application placement does improve the execution
time of some of the applications examined.  The table was constructed
by executing the applications eight times for each placement.  The
table contains the mean execution time (Mean) and 90 percent
confidence intervals (CI), for each of the placements, as well as the
improvements obtained by using the localized placement (% Impr).  This
is given as the percentage by which the mean execution time was
improved by using the localized placement rather than the
non-localized placement, expressed as a percentage of the mean
execution time of the non-localized placement.  Times are measured in
seconds.

Figure 1.3: Normalized execution times using localized and
            non-localized placements

Notice that the improvements obtained here are not large.  This is due
to the relatively small size and mild NUMA structure of the prototype
sixteen-processor Hector system.  We hypothesized that, under a
heavily multiprogrammed workload, contention for shared resources such
as the interconnection network (ring) would be reduced under a
localized placement, resulting in even greater benefits.  However,
preliminary experimental results indicate that, under the
multiprogrammed workloads tested, contention is not significant
(further experimentation with different workloads is ongoing).  The
remainder of the experiments is therefore conducted in a uniprogrammed
setting.
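To make the improvement metric concrete, applying this definition to
the FFT entry of Table 1.3 (below) gives:

    % Impr = 100 * (non-localized Mean - localized Mean) / non-localized Mean
           = 100 * (4.84 - 4.71) / 4.84
           = 2.7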
 ____________________________________________________________
|         |   Non-localized   |     Localized     |         |
|         |   Mean   |   CI   |   Mean   |   CI   | % Impr  |
|_________|__________|________|__________|________|_________|
|FFT      |   4.84   |  0.00  |   4.71   |  0.00  |   2.7   |
|_________|__________|________|__________|________|_________|
|HOUGH    |          |        |          |        |         |
|_________|__________|________|__________|________|_________|
|MM       |   5.25   |  0.01  |   4.93   |  0.01  |   6.1   |
|_________|__________|________|__________|________|_________|
|NEURAL   |          |        |          |        |         |
|_________|__________|________|__________|________|_________|
|PDE      |   5.45   |  0.00  |   5.36   |  0.01  |   1.7   |
|_________|__________|________|__________|________|_________|
|SIMPLEX  |          |        |          |        |         |
|_________|__________|________|__________|________|_________|

   Table 1.3: Mean execution times, in seconds, using localized and
              non-localized placements

The following sections consider larger systems, systems with different
architectures, and future multiprocessors, by studying the effects of
NUMAness on the importance of localization.

7.1. Increasing Memory Latencies

The NUMAness of a system can be thought of as the degree to which
memory access latencies are affected by the distance between the
requesting processor and the desired memory location.  It is
determined by:

+ The differences in memory access times between each of the levels in
  the memory access hierarchy.

+ The number of processors that can be accessed in the time determined
  by each level.

+ The number of levels.

In order to study the effects of changes in the NUMAness of the
system, Hector features a set of switches, called delay switches, that
add additional delays to off-station memory requests.  The possible
settings are 0, 1, 2, 4, 8, 16, 32, and 64 cycles.  Every packet
destined for a memory module not located on the same station is held
up at the requesting processor for the selected number of cycles.  The
delay switches are used to emulate and gain insight into the
performance of:

1) Larger systems - since increases in the system size will result in
   increased memory latencies.

2) Systems of different designs - because some systems have larger
   memory latencies due to the complexity of the interconnection
   network or hardware cache coherence techniques.

3) Future systems - because processor speeds continue to increase at a
   faster rate than memory and interconnection networks.

 _______________________________________________________________________
|         |  32-bit   |  32-bit   |   cache   |    cache    |  Delay   |
|         |           |           |   line    |    line     |          |
|_________|___________|___________|___________|_____________|__________|
|local    |    10     |    10     |    19     |     19      |          |
|_________|___________|___________|___________|_____________|__________|
|station  |           |           |           |             |          |
|_________|___________|___________|___________|_____________|__________|
|ring     |    27     |    17     |    37     |     42      |    0     |
|         |    35     |    21     |    49     |     58      |    4     |
|         |    43     |    25     |    61     |     74      |    8     |
|         |    59     |    33     |    85     |    106      |   16     |
|         |    91     |    49     |   133     |    170      |   32     |
|         |   155     |    81     |   229     |    298      |   64     |
|_________|___________|___________|___________|_____________|__________|

   Table 1.4: Memory reference times, in processor cycles, on a 16
              processor Hector system

Table 1.4 shows latencies for local, on-station, and off-station (or
ring) memory accesses in units of 60 nano-second cycles.  Off-station
requests, or those requiring the use of the ring, are shown for 0, 4,
8, 16, 32 and 64 cycle delays.  The values shown are pessimistic
values in the sense that the true values depend on the relative
positions of the source and destination stations, and the values shown
represent worst case relative positioning.
This is because even though the system is symmetric, asymmetry is
introduced, since cache line reads consist of one request packet but
two reply packets (in order to return the entire 16 byte cache line).
Note that the delay switches have no effect on local or on-station
requests.  For more detailed descriptions of Hector see [4] [17] [12].

To provide insight into the importance of localization on a slightly
larger system and in other shared-memory multiprocessors we set the
delay switches to 16 and conduct the same localized versus
non-localized placement experiment.  The results of this experiment
are shown in Figure 1.4.  Note that with a delay of 16 cycles the
sixteen processor Hector system used has memory access latencies that
are roughly equivalent to other existing shared-memory
multiprocessors.

Figure 1.4: Normalized response times, localized and non-localized
            placements, delay = 16

The experimental results of Figure 1.4 show substantial improvements
in execution times as a result of localization.  The matrix
multiplication application (MM) is improved by more than 50%, while
the fast Fourier transform (FFT) is improved by more than 25%, the
partial differential equation solver (PDE) by more than 20%, and the
Hough transformation (HOUGH) by more than 15%.  Only the neural
network application is not significantly improved.  The reason for
this is an unusually large number of system calls (a more detailed
explanation is provided in a subsequent section).

7.2. NUMAness

Figure 1.5 illustrates the effects that the NUMAness of the system has
on the execution of these applications under non-localized and
localized placements.  The graphs show the normalized execution times
of each application obtained with delay settings of 0, 4, 8, 16, 32
and 64.  The delay setting is shown just below the pair of bars
representing the localized and non-localized execution times.

Figure 1.5: The importance of localization using different degrees of
            NUMAness

These graphs demonstrate that as the latencies in the system increase
the performance benefits achieved through localization increase for
all applications except NEURAL.  If the communication and memory
references within an application are completely localized then the
increase in latencies should have no effect on the execution times
when the localized placement is used.  The results show this to be
true for MM and PDE.  Note, however, that this is not the case for
FFT, HOUGH, SIMPLEX and especially NEURAL.  There are a number of
reasons why, under a localized placement, some of these applications
are more affected by increased latencies than others.

1) The data being accessed is not entirely localized.  That is, some
   data is located on memory associated with processors outside of
   those the application is executing on.  This is true, for example,
   of FFT.  Because the computation is normally dominated by sine and
   cosine computations, pre-computed lookup tables of sine and cosine
   values are created.  For each of these tables, the memory
   allocation scheme used assigns pages to physical memory on a
   round-robin basis starting with processor 0 on Station 0.  In this
   case, because 64 bit double precision variables are used with a
   problem size of 256, two 4 Kbyte pages will be allocated for each
   table and references to these tables may require remote memory
   accesses.
   The round-robin allocation of these tables was done by the original
   author in order to achieve good performance, by reducing hot spots,
   when using all 16 processors.  For the same reasons HOUGH also uses
   pre-computed lookup tables for sine and cosine values, which, along
   with the input image, are allocated using a round-robin page
   allocation policy.

2) There are a relatively high number of system calls.

   a) Some system calls are performed by communicating, via message
      passing, with a server process that is executing on Station 0,
      thus incurring delays because of remote communication.  This is
      reflected in increases in execution time as the delays increase,
      because communication with Station 0 requires using the
      interconnection ring, which means incurring the delays.  This is
      the case with the SIMPLEX application.

   b) Some system calls are handled by a server process that is
      migrated to the processor of the calling process
      (handoff-scheduling).  The server may then access system data
      structures, many of which have been allocated on Station 0, thus
      requiring off-station memory requests which increase execution
      times as latencies are increased.  The application NEURAL
      performs a large number of such system calls, which is the
      reason that its performance is not improved by using a localized
      placement.  In fact, localization actually degrades performance
      in this case because the algorithm used executes synchronously,
      with each of the processes requiring access to shared resources
      at the same time.  The non-localized placement decreases the
      degree to which the processes are synchronized and decreases the
      contention for shared resources, thus slightly improving the
      execution times.

7.3. System Size

Another way to view the importance of application placement is to
consider possible increases in system size and the different
application placements possible, given a fixed number of required
processors.  For example, if an application requires four processors
and it is executed on a system with four processors, a localization
strategy is not required since any placement is localized.  The
potential benefits of localization increase in an eight processor
system but are not as large as the benefits that can be obtained in
much larger systems.  That is, if the number of processors allocated
to an application is fixed and different sized systems are considered,
the potential benefits from localization and therefore the importance
of localization increase with the size of the system.  This is
illustrated in Figure 1.6.

The different placements used correspond to considering localized
versus non-localized placements in systems of 4, 8 and 12 processors.
The localized placement, ``---- ---- ---- 1234'', is the same for each
system size, while the placement ``---- ---- --43 --21'' represents a
non-localized placement in a system of 8 processors, because only the
8 processors of Stations 2 and 3 are considered, and the placement
``---- ---3 --42 ---1'' represents a non-localized placement in a
system of 12 processors.  Therefore, each of the bars of the graphs in
Figure 1.6 represents a non-localized placement in a system of size 4,
8 or 12 and should be compared with the localized placement
``---- ---- ---- 1234'' to determine the improvements possible due to
localization for a system of that size.
Figure 1.6: The importance of localization with varying system sizes,
            delay = 16

The results of these experiments demonstrate that, for all
applications except NEURAL, the benefits obtained from using a
localized placement increase as the size of the system is increased,
thus demonstrating the need for, and increased importance of,
localization in larger systems.  Note also that the prototype system
being used is relatively small and as a result the performance of each
placement is also affected by the number of processors being used by
the application (four).  This can be seen by the small difference
between the non-localized placements of four processes on two stations
(8 processors) and three stations (12 processors).  We expect that the
difference in performance between the non-localized placements would
increase as the size of the systems tested is increased.

7.4. Placement of Processes within the Application

All of the applications used in these experiments consist of a parent
(main) process and three children.  Each application operates in a
master/slave fashion, with the parent process creating the children,
notifying the children of the functions they are to perform along with
the sub-section of the data the functions are to be performed on, and
controlling the synchronization.  The child and the master processes
each perform the same work on different subsets of the problem.
However, because the master process is created first, it may be
responsible for the initialization of some data, which may cause that
data to be located on the same processor as the master process.  This
may be as innocuous as a few variables, for example, the number of
processes used and the size of the problem, but if these variables are
not cached and are referenced often, the cumulative cost of the remote
memory accesses can affect execution times.  The graph in Figure 1.7
is the result of an experiment that was conducted in order to study
how the execution time of each application is affected by the location
of the child processes relative to the parent.  This study can be
thought of as examining the following question:

+ Once a localized subset of processors has been chosen for an
  application's execution, is the execution time affected by the
  location of its parallel processes within that subset of processors?

Intuitively this will depend on the symmetry, communication, and
remote memory access patterns of the application.

Figure 1.7: Importance of the placement of children relative to the
            parent, delay = 16

The experiment performed considers two non-localized placements, one
in which none of the child processes is placed in the same station as
the parent, ``---- ---3 --42 ---1'', and one in which one of the child
processes is placed in the same station as the parent,
``---- ---3 --41 ---2''.  We see in Figure 1.7 that the performance of
each application is affected by the placement of the child processes
relative to the parent, since exactly the same subset of processors is
used in each case and the only difference is that the locations of
processes 1 (the master) and 2 (the first child) have been switched.
A delay setting of 16 cycles was used and the results show that in two
cases, MM and PDE, the execution times differ by 15%.
These results are significant enough to indicate that the location of
the parallel processes of an application within a subset of localized
processors is important for some applications and that the child
processes should be placed as close to the main process as possible.

8. Conclusions and Future Work

In this paper we have demonstrated that, in large-scale, NUMA
multiprocessors, preserving the locality of parallel applications by
placing processes close to each other in order to minimize the costs
of accessing shared data is essential to achieving good performance.
In particular the experiments conducted in this paper have shown:

+ As expected, in small-scale mildly NUMA multiprocessors, placement
  decisions have only a minor effect on the execution time of parallel
  applications.

+ Application placement that considers the architectural grouping of
  processor and memory modules inherent in NUMA multiprocessors is
  essential and improves performance significantly.

+ The importance of placement decisions increases with the size and
  NUMAness of the system and will continue to increase as the gap
  between processor speeds and memory access times (including
  interconnection schemes) continues to widen.

+ Placement of the children relative to the parent (main) process
  affects application performance significantly.  Specifically,
  frequently referenced data is often located on or near the processor
  that the parent is placed on.  Thus, placing children as close as
  possible to the parent process reduces execution time.

Besides continuing to study the hypothesis that localization will
reduce contention in a multiprogrammed environment, we are currently
conducting an experimental evaluation of a technique, called processor
pool-based scheduling, designed to automatically ensure that the
locality of an application is preserved by the scheduler [1].
Preliminary simulation studies show that this technique does preserve
locality and improve execution times of parallel applications [18].

9. Acknowledgments

I am indebted to the Hector and HURRICANE groups for the countless
hours spent implementing, debugging and tuning the system hardware and
software, most notably: Ron White, Michael Stumm, Ron Unrau, Orran
Krieger, Ben Gamsa, and Jonathan Hanna.  I wish to thank Songnian
Zhou, Ken Sevcik, and the other members of the scheduling discussion
group for many discussions related to scheduling in multiprocessors
and Songnian Zhou for his helpful comments concerning the presentation
of this paper.  Thanks also to Karim Harzallah, Karen Reid, Brian
Carlson, and the referees for providing valuable comments.  James
Pang, Deepinder Gill, Thomas Wong and Ron Unrau contributed the
parallel applications.  I especially wish to thank Ron White for the
design and implementation of the delay switches in Hector.

10. References

[1]  T. B. Brecht, Processor Scheduling Techniques for Large-Scale
     Shared-Memory NUMA Multiprocessors, Ph.D. Thesis, Department of
     Computer Science, University of Toronto, Toronto, Ontario (in
     preparation, 1993).

[2]  H. Burkhardt, S. Frank, B. Knobe, and J. Rothnie, "Overview of
     the KSR1 Computer System," Kendall Square Research, Boston,
     Technical Report KSR-TR-9202001 (February, 1992).

[3]  T. H. Dunigan, "Kendall Square Multiprocessor: Early Experiences
     and Performance," Engineering and Mathematics Division, Oak Ridge
     National Laboratory, ORNL/TM-12065 (March, 1992).

[4]  B. Gamsa, Region-Oriented Main Memory Management in Shared-Memory
     NUMA Multiprocessors, M.Sc. Thesis, University of Toronto,
     Toronto, Ontario (September, 1992).
[5]  A. Gupta, A. Tucker, and S. Urushibara, "The Impact of Operating
     System Scheduling Policies and Synchronization Methods on the
     Performance of Parallel Applications," Proceedings of the 1991
     ACM SIGMETRICS Conference on Measurement and Modeling of Computer
     Systems, pp. 120-132 (May, 1991).

[6]  D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta,
     and J. Hennessy, "The DASH Prototype: Implementation and
     Performance," Proceedings of the 19th International Symposium on
     Computer Architecture, pp. 92-103 (May, 1992).

[7]  E. P. Markatos, Scheduling for Locality in Shared-Memory
     Multiprocessors, Ph.D. Thesis, Department of Computer Science,
     University of Rochester, Rochester, New York (May, 1993).

[8]  E. P. Markatos and T. J. LeBlanc, "Load Balancing vs. Locality
     Management in Shared-Memory Multiprocessors," 1992 International
     Conference on Parallel Processing, pp. 258-267 (August, 1992).

[9]  E. P. Markatos and T. J. LeBlanc, "Using Processor Affinity in
     Loop Scheduling on Shared-Memory Multiprocessors," Proceedings of
     Supercomputing '92, pp. 104-113 (November, 1992).

[10] M. S. Squillante, Issues in Shared-Memory Multiprocessor
     Scheduling: A Performance Evaluation, Ph.D. Thesis, Department of
     Computer Science and Engineering, University of Washington,
     Seattle, Washington, Technical Report 90-10-04 (October, 1990).

[11] M. S. Squillante and E. D. Lazowska, "Using Processor Cache
     Affinity Information in Shared-Memory Multiprocessor Scheduling,"
     Department of Computer Science and Engineering, University of
     Washington, Seattle, Washington, Technical Report 89-06-01
     (February, 1990).

[12] M. Stumm, Z. Vranesic, R. White, R. Unrau, and K. Farkas,
     "Experiences with the Hector Multiprocessor," Computer Systems
     Research Institute, University of Toronto, Technical Report
     CSRI-276 (October, 1992).

[13] D. Thiebaut and H. S. Stone, "Footprints in the Cache," ACM
     Transactions on Computer Systems, Vol. 5(4), pp. 305-329
     (November, 1987).

[14] R. Unrau, Scalable Memory Management through Hierarchical
     Symmetric Multiprocessing, Ph.D. Thesis, University of Toronto,
     Toronto, Ontario (January, 1993).

[15] R. Unrau, M. Stumm, and O. Krieger, "Hierarchical Clustering: A
     Structure for Scalable Multiprocessor Operating System Design,"
     Computer Systems Research Institute, University of Toronto,
     Toronto, Ontario, Report CSRI-268 (March, 1992).

[16] R. Vaswani and J. Zahorjan, "The Implications of Cache Affinity
     on Processor Scheduling for Multiprogrammed, Shared Memory
     Multiprocessors," Proceedings of the Thirteenth ACM Symposium on
     Operating Systems Principles, pp. 26-40 (October, 1991).

[17] Z. Vranesic, M. Stumm, D. Lewis, and R. White, "Hector: A
     Hierarchically Structured Shared-Memory Multiprocessor," IEEE
     Computer, Vol. 24(1), pp. 72-80 (January, 1991).

[18] S. Zhou and T. B. Brecht, "Processor Pool-Based Scheduling for
     Large-Scale NUMA Multiprocessors," Proceedings of the 1991 ACM
     SIGMETRICS Conference on Measurement and Modeling of Computer
     Systems, pp. 133-142 (May, 1991).