
Memory Hierarchy Performance

Memory hierarchy performance can be very sensitive to competition for shared resources. For example, the standard configuration of an IBM Regatta node has modules containing two Power4 processors that share a common cache and interface to main memory. Since many large scientific programs are known to be memory-bandwidth bound, there is also an HPC variant of the hardware that contains only a single processor per module. For bandwidth-limited applications the second processor adds little or no performance, and removing it reduces cost and also eliminates possible cache interference. While it would not save the cost of the extra processors, monitoring miss rates of the shared cache of a standard node would enable the system either to schedule only one thread per module or to identify "compatible" threads to co-schedule. Similar scheduling strategies [9,11] have been proposed for use with Simultaneous Multi-threading (SMT) [13]. The performance of non-uniform memory access (NUMA) machines likewise depends on the assignment of threads to processors. The kernel can monitor memory behavior by measuring remote references, cache miss behavior, or cycles per instruction (CPI), depending on the level of architectural support. Thread rescheduling decisions can then be based on this feedback.
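The co-scheduling idea above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: thread names, miss-rate values, and the classification threshold are all hypothetical, and the miss rates stand in for readings that would come from hardware performance counters. Memory-bound threads are paired with compute-bound partners so that no dual-processor module hosts two threads competing for the shared cache and memory interface.

```python
# Hypothetical sketch of miss-rate-driven co-scheduling on dual-core
# modules. The threshold and thread data are illustrative assumptions.

MISS_RATE_THRESHOLD = 0.05  # assumed cutoff for "memory-bound"

def co_schedule(miss_rates):
    """Greedily pair each memory-bound thread with a compute-bound one.

    miss_rates: dict mapping thread id -> shared-cache miss rate, as
    would be sampled from hardware performance counters.
    Returns a list of per-module assignments (tuples of 1 or 2 threads).
    """
    memory_bound = sorted(t for t, r in miss_rates.items()
                          if r >= MISS_RATE_THRESHOLD)
    compute_bound = sorted(t for t, r in miss_rates.items()
                           if r < MISS_RATE_THRESHOLD)
    modules = []
    # Pair a memory-bound thread with a compute-bound partner whenever
    # possible, so they do not compete for cache and memory bandwidth.
    while memory_bound and compute_bound:
        modules.append((memory_bound.pop(), compute_bound.pop()))
    # Leftover memory-bound threads run alone on a module, mimicking
    # the single-processor HPC configuration.
    modules.extend((t,) for t in memory_bound)
    # Leftover compute-bound threads can safely share a module.
    while len(compute_bound) >= 2:
        modules.append((compute_bound.pop(), compute_bound.pop()))
    modules.extend((t,) for t in compute_bound)
    return modules
```

In a real kernel this decision would be re-evaluated periodically as counter readings change, since a thread's miss rate varies across program phases.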
Sameh Mohamed Elnikety 2003-06-15