
Experimental results.

A number of experiments have been performed to identify both the potential and the actual benefits of this approach. All tests were run on a 266 MHz Pentium II laptop running Linux 2.4.4-2.

Table 1: CPU cycles for entry and exit

                 Entry          Exit
System Call      140 (173-33)   189 (222-33)
Procedure Call   3 (36-33)      4 (37-33)

As a baseline, we first measured the cost of a system call versus a procedure call. Table 1 gives the results of these experiments, which were obtained using the rdtscl call; this reads the lower half of the 64-bit Time Stamp Counter that the RDTSC instruction exposes on Intel Pentium processors. Entry and exit together cost 329 cycles (140 + 189) for a system call, compared with 7 cycles (3 + 4) for a procedure call. These results indicate that clustering even two system calls and replacing them with a multi-call can save over 300 cycles every time the pair of system calls is executed.
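The figure of over 300 cycles can be reproduced from Table 1 with a little arithmetic. The sketch below is illustrative only: it assumes that a multi-call wrapping n calls pays one system call entry and exit plus one procedure call entry and exit per constituent call, which is a simplified cost model rather than a measured result. The function names are hypothetical.

```c
#include <assert.h>

/* Net entry/exit costs in CPU cycles, taken from Table 1. */
enum {
    SYSCALL_ENTRY = 140,
    SYSCALL_EXIT  = 189,
    PROC_ENTRY    = 3,
    PROC_EXIT     = 4
};

/* Boundary-crossing cost of issuing n separate system calls. */
long separate_cost(int n)
{
    assert(n >= 1);
    return (long)n * (SYSCALL_ENTRY + SYSCALL_EXIT);
}

/* ASSUMED model: one multi-call pays a single system call entry/exit,
 * then invokes each of its n constituent calls as a procedure call. */
long multicall_cost(int n)
{
    assert(n >= 1);
    return (SYSCALL_ENTRY + SYSCALL_EXIT)
         + (long)n * (PROC_ENTRY + PROC_EXIT);
}

/* Cycles saved by replacing n separate system calls with one multi-call. */
long multicall_savings(int n)
{
    return separate_cost(n) - multicall_cost(n);
}
```

Under this model, multicall_savings(2) evaluates to 315 cycles, and each further call added to the cluster saves another 322 cycles.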

Table 2: Optimization of a copy program with block size of 4096 bytes

File Size   Original          Multi-call                    Looped Multi-call
            Cycles ($10^6$)   Cycles ($10^6$)   % Savings   Cycles ($10^6$)   % Savings
80K         0.3400            0.3264            4%          0.3185            6.3%
925K        4.371             4.235             3.1%        4.028             7.8%
2.28M       10.93             10.65             2.6%        10.37             5.2%

Table 2 gives the results of applying system call clustering, using both the multi-call and the looped multi-call, to the copy program shown in figure 1. To do this, the multi-call or looped multi-call was assigned system call number 240 and added as a loadable kernel module. The numbers reported in table 2 are averages over 10 runs on files of three sizes, ranging from a small 80K file to a large file of around 2MB. The block size was chosen as 4096 bytes since this is the page size and hence the optimal block size for both the optimized and unoptimized versions of the copy program. The relative benefit in this example is greatest for small and medium file sizes, since the cost of disk and memory operations dominates for larger files.
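For reference, a conventional block-by-block copy loop of the kind being optimized might look like the following. This is a minimal sketch and not the program from figure 1; the function name copy_file and the returned block count are assumptions made for illustration. Each iteration issues one read and, normally, one write, which is the pair of system calls that a multi-call would replace.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SIZE 4096  /* block size used for Table 2 */

/* Copies src to dst one block at a time using read() and write().
 * Returns the number of blocks copied, or -1 on error.
 * (Hypothetical helper, not taken from the paper.) */
long copy_file(const char *src, const char *dst)
{
    char buf[BLOCK_SIZE];
    long blocks = 0;
    ssize_t n;

    assert(src != NULL && dst != NULL);

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while ((n = read(in, buf, sizeof buf)) > 0) {
        /* write() may accept fewer bytes than requested, so retry
         * until the whole block has been written. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                close(in);
                close(out);
                return -1;
            }
            off += w;
        }
        blocks++;
    }

    close(in);
    close(out);
    return n < 0 ? -1 : blocks;
}
```

Assuming 80K means 81920 bytes, this loop copies 20 blocks for the smallest file in table 2, so roughly 20 read/write pairs are candidates for clustering.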

Table 3: Optimization of mpeg_play using multi-calls

         CPU Cycles ($10^9$)      %
Size     Original    Optimized    Savings
4.7M     23.75       21.74        8.47%
9.5M     63.65       52.09        18.17%
9.5M     31.00       21.70        30.00%
10.3M    51.51       41.12        20.17%
15.1M    60.18       52.10        13.42%

The second example program is the popular mpeg_play software video decoder [18]. The effects of optimizing this program using our approach are shown in table 3. Although profiling revealed several candidate system call sequences, only one was optimized, since the others lay partially or completely in the X Window System libraries used by the player. The program was executed on different input files taken from [13], with sizes varying from 4.7MB to 15MB. Overall, the savings are larger than for the copy program, largely because the system calls here are not I/O bound, as they were for copy. In addition to the savings in CPU cycles, this optimization also improved the frame rate and execution time of mpeg_play: averaged across all file sizes, the frame rate improved by 25% and the execution time fell by 20%.
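The savings column can be checked against the cycle counts with the usual relative-reduction formula. The helper below is a small sketch with a hypothetical name; applied to the rows of table 3 it agrees with the reported percentages to within about 0.01 percentage points, the remaining differences presumably coming from rounding of the published cycle counts.

```c
#include <assert.h>

/* Relative reduction in cycles, as a percentage of the original count.
 * (Hypothetical helper, not taken from the paper.) */
double percent_savings(double original, double optimized)
{
    assert(original > 0.0);
    return 100.0 * (original - optimized) / original;
}
```

For example, percent_savings(31.00, 21.70) corresponds to the 30.00% row of table 3.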

More details on these and other examples can be found in [17].

Mohan Rajagopalan 2003-06-16