

Implementation and Performance Evaluation

For ease of presentation, we have used vtable entry numbers in the offset table entries of the examples. In an actual implementation, a ``processed offset'' can be used instead: the vtable entry number multiplied by the size of a vtable entry (the size of a pointer to code). This transfers some of the burden from run time to compile time, since the scaling is performed once when the entry is filled in rather than at every call.
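The following sketch illustrates the idea; the names (fill_otable_entry, code_ptr) are hypothetical and do not reflect GCJ's actual runtime interfaces.

#include <cstddef>

typedef void (*code_ptr)();   /* a vtable slot holds a pointer to code */

/* Hypothetical resolution routine: instead of storing the raw vtable
   entry number (which would require scaling at every call site), store
   the entry number multiplied by the slot size.  The generated code can
   then add the table entry to the vtable pointer directly. */
void fill_otable_entry(std::size_t *otable, int entry, int vtable_index)
{
    otable[entry] = vtable_index * sizeof(code_ptr);   /* done once */
}

For example, with 4-byte code pointers, vtable entry 29 would be stored as the processed offset 116, matching the constant seen in Figure 10.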

Bryce McKinlay has implemented part of our solution for GCJ. His implementation so far provides full support for virtual methods and partial support for interface methods; support for fields is still work in progress. The implementation is scheduled for inclusion in the upcoming GCC 3.1 release.

When compiling with the -O2 optimization flag, our new scheme generates one more assembly instruction per virtual method invocation. Figure 10 shows the code generated for a virtual method invocation under both schemes. In our indirect dispatch scheme, the code fetches the offset from the corresponding offset table (otable) entry and adds it to the vtable pointer before calling the method. In contrast, the original direct vtable dispatch scheme adds a constant offset to the vtable pointer and calls the method. When the same call occurs in a loop (or in succession), the compiler moves the otable load out of the loop, reducing the overhead.

Figure 10: Sample method invocation code
 -------------------------------------------------
| movl _ZN4java4lang6System3outE, %eax  ;object   |
| movl (%eax), %edx                     ;vtable   |
| movl %eax, (%esp)                     ;this     |
| movl _CD_C+4, %eax                    ;argument |
| movl %eax, 4(%esp)                              |
| call *116(%edx)                                 |
 -------------------------------------------------
direct dispatch (vtable)
 -------------------------------------------------
| movl _ZN4java4lang6System3outE, %ecx  ;object   |
| movl _CD_C+4, %eax                    ;argument |
| movl (%ecx), %edx                     ;vtable   |
| movl %ecx, (%esp)                     ;this     |
| movl %eax, 4(%esp)                              |
| movl otable+4, %eax                   ;offset   |
| call *(%eax,%edx)                               |
 -------------------------------------------------
indirect dispatch (offset table)
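
At the source level, the two sequences correspond roughly to the sketch below. This is a hand-written illustration rather than actual GCJ output; the names are hypothetical, and the constants mirror the figure (byte offset 116 for direct dispatch, the otable+4 slot for indirect dispatch), assuming 4-byte pointers.

#include <cstddef>

typedef void (*code_ptr)(void *self, void *arg);

extern std::size_t otable[];          /* filled in at load time */

void direct_call(void *obj, void *arg, char *vtable)
{
    /* direct dispatch: the byte offset (116) is a compile-time constant */
    code_ptr fn = *(code_ptr *)(vtable + 116);
    fn(obj, arg);
}

void indirect_call(void *obj, void *arg, char *vtable)
{
    /* indirect dispatch: one extra load fetches the processed offset
       from the otable (otable[1] corresponds to otable+4 in Figure 10);
       this load is loop-invariant, so the compiler can hoist it when
       the call appears in a loop */
    code_ptr fn = *(code_ptr *)(vtable + otable[1]);
    fn(obj, arg);
}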

Our tests are based on the Java Grande 2.0 benchmarks [9] (the current version of GCJ cannot compile the SPECjvm98 benchmark suite). All results were obtained on a Dell Precision 410 workstation running Red Hat Linux 7.1, with 512MB of main memory and a 500MHz Pentium III processor with 512KB of cache. Table 3 shows the average results over 3 rounds of tests, using dynamic linking with the -O2 flag for both the direct vtable dispatch scheme of GCJ (Direct) and our indirect offset table dispatch scheme (Indirect). The ``Ratio (%)'' column is computed as Direct/Indirect: numbers less than 100 indicate a slowdown under our scheme, while numbers greater than 100 indicate a speedup.


Table 3: Java Grande 2.0 benchmarks
Benchmark        Direct   Indirect  Unit      Ratio (%)
Same:Instance     34.59      34.84  ns/call       99.27
Other:Instance    38.55      37.29  ns/call      103.39
Crypt (A)          7.26       7.27  s             99.82
HeapSort (A)       2.14       2.15  s             99.67
Series (A)        46.90      47.62  s             98.49
Crypt (B)         48.41      48.49  s             99.82
HeapSort (B)      14.69      14.67  s            100.14
Series (B)       485.87     493.12  s             98.53
AlphaBeta         24.76      25.04  s             98.89
MonteCarlo        32.89      32.64  s            100.77
Euler            327.55     328.65  s             99.66
RayTracer         45.21      44.81  s            100.90
MolDyn           518.09     529.13  s             97.91


The first two benchmarks are taken from Section 1 (Low Level Operations) of the benchmark suite. They measure the cost of invoking virtual (instance) methods on an object of the same class and on an object of another class. Each performs a large number of iterations over 17 method invocations, where every invoked method simply increments a global static counter. The results indicate that much of the offset table overhead was optimized away in these cases. Somewhat surprisingly, the performance on ``Other:Instance'' improved, possibly due to different instruction scheduling.

The remaining benchmarks (IDEA Encryption, Heap Sort, Fourier Coefficient Analysis, Alpha Beta Search, Monte Carlo Simulation, Computational Fluid Dynamics, 3D Ray Tracer, and Molecular Dynamics Simulation) are chosen from Sections 2 (Kernels) and 3 (Large Scale Applications) of the benchmark suite; among these, we chose the ones with the most method invocations. Some are run on two data sizes (A and B). The performance penalty is on average less than 2%, and again some of the test cases show a slight speedup.

Another interesting observation concerns the size of the object files. The new indirect dispatch scheme puts extra offset tables in the object files; however, vtables no longer need to be stored there. On the Java Grande 2.0 benchmarks, the object files produced by the new scheme are on average 1% smaller than those produced by the standard vtable dispatch scheme.

