Next: Floating point Up: Improving performance Previous: Instruction Selection

## Register Allocation

The optimizing compiler relies on a variant of linear-scan [9] register allocation. To give the register allocator more freedom, we implemented a variant of linear scan that deals with holes in live ranges. Consider Figure 4. The basic linear-scan algorithm would not allocate x and y to the same physical register, as their live intervals (denoted by the rectangular bars) overlap. However, our enhanced algorithm represents the live intervals with holes, as represented by the black-shaded areas in the Figure. As a result, our allocator could allocate x and y to the same physical register.

Traub et al. [11] previously described a linear scan variant to deal with holes. Our approach differs in two ways.13First, we perform the intersection of two sparse live intervals, rather than insisting on perfect nesting of one interval within another. Although this introduces super-linear complexity to the algorithm, we do not believe this causes major problems in practice.

Figure 4: Example of linear scan live intervals with holes.
 ```1: x = ... 2: if (...) 3: y = ... 4: else 5: return x 6: return y ``` a)

Secondly, Traub et al.'s algorithm splits live ranges on the fly, with a post-pass clean-up phase to reconcile differences. In contrast, our algorithm marks certain symbolic registers as spilled. A post-pass clean-up phase deals with the spills. If an instruction uses a spilled register, the clean-up phase either introduces a memory operand referring to the spilled memory location, or moves the spilled value into a scratch register. On PowerPC, the original register allocator reserved three registers for use as scratch, so finding a free scratch register was almost always trivial. On IA32, we did not reserve any scratch registers, so the register allocator must create scratch registers upon demand.

Figure 5 details the algorithm for dealing with spilled values. The algorithm makes one pass over the statements. As it progresses, the algorithm keeps track of which symbolic register values are cached in each scratch register. Before each statement, the algorithm vacates any scratch registers which are needed for the next statement. Then, the algorithm processes a statement. It attempts to replace spilled symbolic registers with memory operands representing the spill location. If this is infeasible, the algorithm chooses a physical register as a victim to be vacated, so the victim can be used as a scratch register. The victim will continue to cache the symbolic value until either the physical register is needed for a future statement, or the victim is chosen to cache a different symbolic register. Scratch register mappings are not maintained across basic block boundaries; all victims are vacated at block exits.

As a side effect of vacating and introducing scratch registers, the code updates stack maps required for type-exact GC.

 ```for each statement s do if s can leave the basic block via a call, jump, fall-through, or exception then vacate cached value and restore original value for each scratch register before s vacate cached value and restore original value for any scratch register used by s for each spilled symbolic register r in s do if r is currently cached in scratch register p then replace r with p in statement s else if s needs a scratch register for r then choose a scratch register victim p to hold r in s vacate current value of p cache value of r in p replace r with p in statement s else replace r with a memory operand representing r's spill location done done ```

With the IA32 limited register set, spills are common and have a huge impact on performance. Subsequent to the initial functional port, we introduced several heuristics and optimizations to improve performance.

We now evaluate the following heuristics to reduce register pressure. We provide only a high-level overview of each heuristic; consult the open-source code for more details.

Intelligent scratch victim selection
The initial implementation of the algorithm in Figure 5 chooses a victim arbitrarily. Furthermore, the initial implementation does not use liveness to determine whether the victim needs to be vacated and restored. This heuristic uses liveness computed during linear scan and attempts to choose a victim that does not currently hold a live value. Furthermore, this optimization uses the same information to avoid vacating and restoring dead values.
Smarter linear scan spill heuristic
When facing a spill situation, the original linear scan implementation chooses a symbolic register to spill at random. This heuristic estimates the cost of spilling based on appearances of the register, weighted by loop depth, and chooses spill candidates accordingly.
Register-pressure-aware BURS
This optimization introduces a heuristic into instruction selection to attempt to generate code in an order to minimize overlapping live ranges. When faced with multiple expression trees which can be emitted in any order, this heuristic chooses the tree that consumes the most register values.
Global Coalescing
We implemented a separate pass to coalesce registers; basically, if there is a MOV x = y, if the live ranges of x and y do not overlap, then we replace all appearances of y with x.

Figure 6 shows the marginal performance improvement gained by enabling each of these four optimizations cumulatively, compared to performance with none enabled. Thus, the bar labeled ``Scratch Victim'' enables intelligent scratch victim selection; the next bar ``LS Spill'' further enables the linear scan spill heuristic, etc.

Altogether, the register pressure heuristics improve performance on average by 15%. The results show that the spill heuristics and scratch register selection heuristics are most effective across the board. Anomalously, the scratch victim heuristic hurts mpegaudio; we currently have no explanation for this. Instruction selection re-ordering has a minimal impact, slightly improving compress and db. Coalescing is usually insignificant; except it causes a substantial improvement for mpegaudio. Mtrt degrades as we add more optimizations; however, the floating-point results here should be taken with a grain of salt, since Figure 3 shows that Jikes RVM IA32 floating-point performance is mediocre, even with all optimizations enabled.

Next: Floating point Up: Improving performance Previous: Instruction Selection
Stephen Fink 2002-05-23