
Floating point

The IA32 architecture provides an abstraction of a floating-point stack, a sharp difference from the flat floating-point register set of the PowerPC.

In the original functional port, to minimize changes to the system, we treated the floating-point stack locations as independent physical floating-point registers. During instruction selection, BURS treated symbolic floating-point registers just like symbolic integer registers. The linear scan register allocator allocated the symbolic floating-point registers to seven floating-point stack locations, as if these were seven physical registers. The eighth stack location was reserved as a scratch register for later compiler phases, which use it to generate code that moves values between stack locations and to memory.

This original scheme had the advantage that the linear scan allocator had full freedom to allocate the stack locations using global analysis. However, it had a severe drawback: since BURS instruction selection saw only orthogonal symbolic floating-point registers, it could not generate code that exploits the stack operations available in the IA32 instruction set.

An alternative scheme could allow BURS to generate floating-point stack code freely within a basic block, so that instruction selection could use the floating-point stack resources without restriction there. However, since the linear scan algorithm does not understand stack locations, it could not allocate values to stack locations across basic blocks. In effect, all register allocation would be confined to a single basic block, and values live across basic blocks would be spilled to memory.
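
To make "floating-point stack code within a basic block" concrete, the following is a minimal sketch of emitting stack code for a single expression tree by post-order traversal. It is not the Jikes RVM BURS implementation: the class names (StackCodeSketch, Expr, Load, BinOp) are hypothetical, the emitted strings are simplified x87-style mnemonics rather than complete instructions, and only the commutative operators + and * are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these classes are hypothetical and are not part of Jikes RVM.
class StackCodeSketch {

    // A node of a floating-point expression tree.
    interface Expr {
        // Appends code that leaves this expression's value on top of the FP stack.
        void emit(List<String> out);
    }

    // A leaf: load a named memory operand, pushing it onto the stack.
    static final class Load implements Expr {
        final String name;
        Load(String name) { this.name = name; }
        public void emit(List<String> out) { out.add("fld " + name); }
    }

    // An interior node: a commutative binary operation ('+' or '*').
    static final class BinOp implements Expr {
        final char op;
        final Expr left, right;
        BinOp(char op, Expr left, Expr right) {
            if (op != '+' && op != '*') {
                throw new IllegalArgumentException("unsupported operator: " + op);
            }
            this.op = op;
            this.left = left;
            this.right = right;
        }
        public void emit(List<String> out) {
            left.emit(out);   // left value ends up on top of the stack
            right.emit(out);  // right value is pushed above it
            // Combine the top two stack entries and pop one,
            // leaving the result on top of the stack.
            out.add(op == '+' ? "faddp" : "fmulp");
        }
    }

    // Returns the simplified stack code for one expression tree.
    static List<String> generate(Expr e) {
        List<String> out = new ArrayList<>();
        e.emit(out);
        return out;
    }

    public static void main(String[] args) {
        // (a + b) * c
        Expr e = new BinOp('*',
                new BinOp('+', new Load("a"), new Load("b")),
                new Load("c"));
        for (String insn : generate(e)) {
            System.out.println(insn);
        }
    }
}
```

For (a + b) * c the sketch emits five instructions and never needs more than two stack entries. No symbolic register is named, so a register allocator that knows only symbolic registers has nothing to carry across basic blocks.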

We chose a hybrid scheme. We give instruction selection the freedom to place a floating-point value either on the floating-point stack or in a symbolic floating-point register. The register allocator allocates symbolic registers to free stack locations. Note that if BURS allocates a value to a floating-point stack location, that stack location is not available for use by the register allocator. We model this by inserting dummy def and use instructions for physical stack locations reserved by instruction selection.
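
The effect of the dummy def and use instructions can be sketched as follows. This is a simplified model under stated assumptions, not the Jikes RVM allocator: the class HybridAllocSketch and its methods are hypothetical, live ranges are plain instruction-number intervals, and a first-fit search stands in for linear scan. A reservation made by instruction selection is recorded as a range from the dummy def to the dummy use, so the allocator sees that physical stack location as occupied over that range.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical model, not the Jikes RVM register allocator.
class HybridAllocSketch {

    // A live range [start, end], measured in instruction numbers.
    static final class Interval {
        final int start, end;
        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }
        boolean overlaps(Interval o) {
            return start <= o.end && o.start <= end;
        }
    }

    // busy.get(p) holds every range during which physical location p is occupied.
    private final List<List<Interval>> busy = new ArrayList<>();

    // numLocations is an assumption of this sketch, e.g. 7 allocatable locations.
    HybridAllocSketch(int numLocations) {
        for (int p = 0; p < numLocations; p++) {
            busy.add(new ArrayList<>());
        }
    }

    // Models a dummy def at 'def' and a dummy use at 'use' of location p:
    // instruction selection has claimed p for that range.
    void reserve(int p, int def, int use) {
        busy.get(p).add(new Interval(def, use));
    }

    // Assigns a symbolic register live over [start, end] to the first location
    // that is free for the whole range; returns -1 to signal a spill.
    int allocate(int start, int end) {
        Interval iv = new Interval(start, end);
        for (int p = 0; p < busy.size(); p++) {
            boolean free = true;
            for (Interval b : busy.get(p)) {
                if (b.overlaps(iv)) {
                    free = false;
                    break;
                }
            }
            if (free) {
                busy.get(p).add(iv);
                return p;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        HybridAllocSketch alloc = new HybridAllocSketch(2);
        alloc.reserve(0, 5, 10);                      // claimed by instruction selection
        System.out.println(alloc.allocate(6, 8));     // location 0 is reserved here
        System.out.println(alloc.allocate(11, 12));   // reservation has ended
    }
}
```

The design point is that the allocator needs no special knowledge of stack code: a reservation looks like any other occupied range, and the location becomes available again once the dummy use has passed.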

Table 3: Performance comparison of alternative floating-point code generation strategies (speed normalized to "None"). "RA" allows inter-block register allocation, while "BURS" allows intra-block generation of floating-point stack code for expressions.

Benchmark    None    RA only    BURS only    Both
mpegaudio    1       1.548      1.544        1.957
mtrt         1       0.668      1.251        1.181

Table 3 compares performance on the two floating-point SPECjvm98 codes. The table shows that each technique helps mpegaudio, but it also shows an anomaly: inter-block register allocation hurts mtrt. Our initial functional port used only the "RA" register allocation strategy, as this option most closely matches the extant PowerPC port. Later we added the "BURS" floating-point stack code generation. We did not seriously consider the other two possibilities ("None" and "BURS only"), but we include them to enable comparisons.
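
The marginal effect of each technique can be read off Table 3 by dividing columns. The sketch below does that arithmetic; the class and method names are hypothetical, and the only data are the normalized speeds copied from the table. For mtrt, adding inter-block register allocation on top of BURS gives a ratio of about 0.944 (1.181 / 1.251), so the slowdown is much smaller than the 0.668 seen with "RA only", but it is still a slowdown. For mpegaudio the same ratio is about 1.27.

```java
// Illustrative only: simple ratios computed from the Table 3 figures.
class Table3Ratios {

    // Column indices, in the order the table lists them.
    static final int NONE = 0, RA = 1, BURS = 2, BOTH = 3;

    // Normalized speeds copied from Table 3.
    static final double[] MPEGAUDIO = {1.0, 1.548, 1.544, 1.957};
    static final double[] MTRT = {1.0, 0.668, 1.251, 1.181};

    // Speed of "Both" relative to "BURS only": the effect of adding RA.
    static double raOnTopOfBurs(double[] row) {
        return row[BOTH] / row[BURS];
    }

    // Speed of "Both" relative to "RA only": the effect of adding BURS.
    static double bursOnTopOfRa(double[] row) {
        return row[BOTH] / row[RA];
    }

    public static void main(String[] args) {
        System.out.printf("mpegaudio: RA on top of BURS %.3f, BURS on top of RA %.3f%n",
                raOnTopOfBurs(MPEGAUDIO), bursOnTopOfRa(MPEGAUDIO));
        System.out.printf("mtrt:      RA on top of BURS %.3f, BURS on top of RA %.3f%n",
                raOnTopOfBurs(MTRT), bursOnTopOfRa(MTRT));
    }
}
```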

Although we have improved RVM floating-point performance compared to the initial functional port, performance still lags behind the IBM product DK. We still face the two anomalies reported for floating-point performance: recall that the smart scratch victim selection hurts mpegaudio, and that floating-point register allocation degrades mtrt. We have not yet investigated these anomalies, and we hope to improve Jikes RVM floating-point performance in the future.

Stephen Fink 2002-05-23