WIESS '02 Paper   
Enhancements for Hyper-Threading Technology in the Operating System – Seeking the Optimal

Jun Nakajima
Software and Solutions Group, Intel Corporation

Venkatesh Pallipadi
Software and Solutions Group, Intel Corporation

Abstract

Hyper-Threading Technology (HT) is Intel®’s implementation of Simultaneous Multi-Threading (SMT).

1. Introduction

1.1. Overview of HT

Hyper-Threading Technology (HT) is Intel®’s implementation of Simultaneous Multi-Threading (SMT) ([9],[10]).
One logical processor can utilize excess resource bandwidth that is not consumed by the other logical processor, allowing the other task to make progress. This way, both the overall utilization of execution resources and the overall performance of software applications in a multi-tasking/multi-threading environment increase. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. HT is not expected to make a given single-threaded application execute faster when executing alone, but when two or more unrelated applications are executing under HT, the overall system throughput can improve. See [5] for details.

1.2. General HT enhancements in the Operating System

This section describes the typical enhancements, to explain the implications of HT for the OS. The following is a summary of the enhancements recommended in the OS.

· Detection of HT – The OS needs to detect both the logical processors and the processor packages if HT is available on the processor(s).
· hlt at idle loop – The IA-32 Intel® Architecture has an instruction called hlt (halt) that stops processor execution and normally allows the processor to go into a lower-power mode. On a processor with HT, executing hlt transitions the processor from multi-task mode to single-task mode, giving the other logical processor full use of all processor execution resources; see [5] for the details.
· pause instruction at spin-waits – The OS typically uses synchronization primitives, such as spin locks, in multiprocessor systems. The pause instruction is equivalent to “rep;nop” on all known Intel® architecture processors prior to the Pentium® 4 and Intel® Xeon™ processors. Using the instruction in spin-waits can avoid the severe penalty incurred when a processor spins on a synchronization variable at full speed.
· Special handling for shared physical resources – MTRRs (Memory Type Range Registers) and the microcode are shared by the logical processors on a processor package. The OS needs to ensure that updates to those registers are synchronized between the logical processors and happen just once per processor package, as opposed to once per logical processor, if required by the spec.
· Preventing excessive eviction in first-level data cache – Cached data in the first-level data cache are tagged and indexed by virtual addresses. This means that two processes running on different logical processors on a processor package can cause repeated evictions and allocations of cache lines when they access the same or nearby virtual addresses in a competing fashion (e.g. the user stack).
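The stack-pointer offset that this item describes, ((pid%64) << 7), can be illustrated with a minimal Python sketch. Only the expression itself comes from the text; the function names, the example stack-top address, and the assumption that the offset is subtracted from a downward-growing stack are illustrative and are not taken from the kernel source.

```python
# Constants taken from the expression ((pid % 64) << 7) in the text.
OFFSET_SLOTS = 64   # the process ID is reduced modulo 64
OFFSET_SHIFT = 7    # shifting left by 7 yields multiples of 128 bytes


def stack_offset(pid: int) -> int:
    """Return the per-process stack offset in bytes: ((pid % 64) << 7)."""
    return (pid % OFFSET_SLOTS) << OFFSET_SHIFT


def initial_stack_pointer(stack_top: int, pid: int) -> int:
    """Hypothetical helper: lower the default stack top by the offset.

    Assumption: the stack grows downward, so the offset is subtracted.
    """
    return stack_top - stack_offset(pid)


if __name__ == "__main__":
    top = 0xC0000000  # assumed example value for the default stack top
    for pid in (1, 2, 65):
        print(pid, hex(initial_stack_pointer(top, pid)))
```

Under these assumptions the offsets range from 0 to 8064 bytes in steps of 128, and they repeat every 64 process IDs, so pids 1 and 65 receive the same offset.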
The original Linux kernel, for example, sets the initial user stack pointer to the same value in every user process. In our enhancement, we offset the stack pointer by a multiple of 128 bytes derived from the unique process ID modulo 64, i.e. ((pid%64) << 7), to resolve this issue.
· Scalability issues – The current Linux, for example, is scalable in most cases, at least up to 8 CPUs. However, enabling HT doubles the number of processors in the system, so it can expose scalability issues or fail to show performance enhancements.

Linux (2.4.17 or higher) supports HT, and it has all the above
changes in it. We identified and developed the essential code (only about 1000 lines) for those changes (except for the scalability issues) based on performance measurements, and then improved the code with the Linux community. The Appendix locates the relevant lines for the changes in Linux 2.4.19 (the latest tree as of this writing). Those changes or requirements should be applicable to any OS that supports HT, although some of them might need to be re-implemented for the target OS.

1.3. Basic scheduler optimizations for HT

This section describes the basic scheduler-related enhancements for HT.

Processor-c

Since the L2 cache is shared in a processor package, however, the hit (or miss) ratio can depend on how the other logical processor uses the L2 cache as well. If the current process on a logical processor consumes the L2 cache substantially, it can affect the processes running on the other logical processor. Therefore, to avoid performance degradation caused by cache thrashing between the two logical processors, we need to monitor and minimize such L2 cache misses in a processor package. Note that excessive L2 cache misses also can affect the entire
system, causing significant traffic on the system (front-side) bus.

Page coloring (for example, see [4]) could reduce the occurrence of severe impacts on two different processes or threads caused by competitive eviction of L2 cache lines in a processor package. If two different processes access their own data very frequently, and the pages associated with the data happen to have the same color, the possibility of competitive eviction of L2 cache lines is higher than in the case where page coloring is implemented. The same discussion applies to the threads in a multi-threaded application. Although some patches are available for page coloring in Linux, we haven’t measured the benefits of page coloring for HT.

· HT-Aware idle handling – This enhancement in the scheduler significantly improves performance when the load is relatively low. For scheduling purposes, the OS needs to find idle CPUs, if any. On HT systems, however, the processor package is not necessarily idle even if one of the logical processors is idle; the other may be very active. Therefore, to obtain higher performance, the scheduler needs to prioritize “real-idle” packages (both logical processors are idle) over “half-idle” ones (one of them is idle) when dispatching a process to a logical processor. This also helps to avoid the situation where two processes run on one processor package while the other package is completely idle in a 2-way SMP system. However, this kind of situation cannot always be prevented because the OS cannot predict when a particular process terminates. Once this situation occurs, the scheduler usually does not resolve it.
· Scalability of the scheduler
– The original Linux scheduler uses a single global run queue with a global spin lock. This scheduler works well in most cases, but it has some scalability issues, especially when handling a large number of processes/threads. The O(1) scheduler from Ingo Molnar has been proposed to resolve such issues; it uses a per-CPU run queue with a spin lock for each, and locking is not required as long as a CPU manipulates its own run queue.

2. Related works

We discuss how existing techniques can contribute to performance enhancements of HT systems. In this paper, we assume that processes are scheduled in a time-sharing fashion.

2.1. Performance Monitoring Counter

The work in [1] is interesting in that it combines hardware monitoring counters and program-centric code annotations to guide thread scheduling on SMP systems and improve thread locality. Its findings show that some workloads achieved speedup almost entirely through user annotations, and that for some long-lived ones speedup is gained by preserving locality for each thread. We need to run a process for some time to get information about its workload. Such user annotations (including processor binding) would be helpful.

We use hardware performance monitoring counters (simply “performance monitoring counters” hereafter) to get such micro-architectural information. See [1] for a general and detailed description of the performance monitoring counters available on various architectures and of their benefits. The major benefit of using performance monitoring counters is that they allow software to obtain detailed and specific information at runtime without impacting performance. Usually, performance monitoring counters are used to tune applications and the OS. There are some tuning tools that use them on the Intel® architectures, such as VTune [3].
We don’t see conflicts between such tools and the scheduler,
although the number of performance monitoring counters available will be reduced.

2.2. Symbiotic Jobscheduling

We share the following issues to resolve with the symbiotic jobscheduler ([6], [7]):
· Jobs in an SMT processor can conflict with each other on various shared system resources.
· The scheduling has to be aware of the SMT requirements to find the optimized utilization of execution resources.

However, the target system and the methodology are quite different:
· We don’t attempt to make a running set of jobs that will be coscheduled, by discovering efficient schedules in a processor on the fly. Instead, we attempt to detect interference or conflicts with the execution resources in SMP systems consisting of multiple SMT, i.e. HT, processors, and to balance the load among such