Check out the new USENIX Web site. next up previous
Next: Register Allocation Up: Improving performance Previous: VM Register Conventions

Instruction Selection

The heart of the instruction selection phase relies on a BURS-based tree-pattern-matching system. We have extended iburg [7] to generate Java code (instead of C) and work on a general dependence graph. The key idea is to partition the dependence graphs into a forest of expression trees based on their register-true dependencies. The BURS pattern matching system then processes each tree in the forest to perform instruction selection and emit code, one tree at a time, in an order that respects the inter-tree ordering constraints encoded by the original dependency graph [5,10].

In the initial port, we defined a ``bare bones'' grammar that described a straightforward translation of the 124 LIR operators into MIR operators using 142 rules and 7 non-terminals. As the performance work progressed, the grammar tripled in size. The current grammar includes additional patterns for optimizing floating point computations and conditional branching, exploiting memory operands, and other miscellaneous enhancements such as recognizing complex addressing modes, exploiting special instructions such as LEA, TEST, INC, etc., and avoiding needless sign extension of byte/short loads. Table 1 reports the contribution of each group of enhancements to the size of the rules.

Table 1: The Basic grammar was enhanced with 446 rules and 18 non-terminals to support optimizations in instruction selection. BURS must recognize 124 LIR operators.
Description Number of Rules Number of Non-terminals
Basic 142 7
Floating Point Stack 90 2
Conditional Branches 42 6
Memory Operands 170 8
Misc. Other Enhancements 144 2
Complete 588 25

In addition to extending the grammar as described above, we also enhanced our BURS driver with heuristics to reduce register pressure. We label each tree node with an estimate of the number of live values required to compute it, using the algorithm from section 9.10 of the Dragon book [1]. The BURS driver uses this numbering to choose which child node to emit first when generating code for a given tree and to select from the set of ready trees12which tree to emit next. In both cases, choosing the node/tree with the largest number of registers first tends to reduce the number of simultaneously live values and thus reduce register pressure. As reported in section 4.3, this heuristic made a small but measurable difference in overall performance and was extremely simple to implement.

While the PowerPC architecture supports three-operand ALU operations (a=b+c), IA32 supports two-operand ALU operations(x+=y). The IA32 compiler converts the three-operand LIR to two-operand form in a pre-pass to BURS, which tripped some subtle issues. Initially, this pre-pass took obvious actions to avoid introducing useless move instructions. For example, it would transform a=a+b to a+=b. In other cases, it would use local liveness information to avoid inserting moves. For example, if b is dead after a=b+c and this is the only definition of a, then the pass would emit b+=c and the replace uses of a with uses of b. This second optimization tripped a subtle difficulty; it can hinder instruction selection of other expression trees within the basic block by introducing additional inter-tree anti and output dependencies and by extending the live range of b beyond the current basic block. Thus, in some cases it is actually better to translate a=b+c into b+=c; a=b and rely on the register allocator to coalesce away the move instruction.

Table 2: Percentage speedup obtained by Complete rules over the initial Basic rules.
Benchmark % Speedup
compress 1.9
jess 7.2
db 1.4
javac -0.5
mpegaudio 52.3
mtrt 5.1
jack 8.0
geo. mean 9.7

Table 2 shows the performance improvements obtained by adding all of the enhancements to the basic grammar. The most important enhancement was adding rules to exploit the floating point stack (mpegaudio, mtrt), but the other enhancements also resulted in modest gains. The impact of instruction selection and register allocation on floating point performance is explored in more detail below.

next up previous
Next: Register Allocation Up: Improving performance Previous: VM Register Conventions
Stephen Fink 2002-05-23