Experimental results.

Table 1: CPU cycles for entry and exit

	Entry	Exit
System Call	140 (173-33)	189 (222-33)
Procedure Call	3 (36-33)	4 (37-33)

As a baseline, we first measured the cost of a system call versus a procedure call. Table 1 gives the results of these experiments. The results were obtained using the rdtscl call, which reads the lower half of the 64-bit hardware time stamp counter accessed by the RDTSC (Read Time Stamp Counter) instruction on Intel Pentium processors. These results indicate that clustering even two system calls and replacing them with a multi-call can save over 300 cycles every time the pair of system calls is executed, since one system call entry and exit (140 + 189 = 329 cycles) is eliminated.
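The savings estimate can be reproduced from the Table 1 figures with a few lines of arithmetic. The following is a minimal sketch, not part of the measurement code; the function name `cycles_saved` is hypothetical, and it assumes that a multi-call pays exactly one entry and one exit and that its in-kernel dispatch cost is negligible.

```python
# Net cycle counts taken from Table 1.
SYSCALL_ENTRY_CYCLES = 140
SYSCALL_EXIT_CYCLES = 189


def cycles_saved(calls_clustered: int) -> int:
    """Estimate cycles saved by replacing `calls_clustered` system calls
    with a single multi-call.

    Assumption (not measured): the multi-call itself costs one entry and
    one exit, and its in-kernel dispatch overhead is ignored.
    """
    if calls_clustered < 1:
        raise ValueError("need at least one system call")
    crossing = SYSCALL_ENTRY_CYCLES + SYSCALL_EXIT_CYCLES  # 329 cycles
    # Every call beyond the first no longer crosses the kernel boundary.
    return (calls_clustered - 1) * crossing


if __name__ == "__main__":
    print(cycles_saved(2))  # prints 329
```

Under these assumptions, each additional system call folded into the same multi-call saves a further 329 cycles.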

Table 2: Optimization of a copy program with block size of 4096

File Size	Original Cycles ( )	Multi-call Cycles ( )	Multi-call % Savings	Looped Multi-Call Cycles ( )	Looped Multi-Call % Savings
80K	0.3400	0.3264	4%	0.3185	6.3%
925K	4.371	4.235	3.1%	4.028	7.8%
2.28M	10.93	10.65	2.6%	10.37	5.2%

Table 2 gives the results of applying system call clustering, using both the multi-call and the looped multi-call, to the copy program shown in figure 1. To do this, the multi-call or looped multi-call was assigned system call number 240 and added as a loadable kernel module. The numbers reported in table 2 are averages over 10 runs on files of three sizes, ranging from a small 80K file to large files of around 2MB. The block size was chosen as 4096 bytes because this is the page size and hence the optimal block size for both the optimized and unoptimized versions of the copy program. The maximum benefit in this example is for small and medium file sizes, since the cost of disk and memory operations dominates for larger files.
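Figure 1 is not reproduced in this section. As a rough illustration of the kind of loop being optimized, the sketch below copies a file in 4096-byte blocks using `os.read` and `os.write`, which map directly onto the read and write system calls. The function name `copy_file` and its return value are assumptions made for illustration and are not taken from the measured program.

```python
import os

BLOCK_SIZE = 4096  # block size used for Table 2


def copy_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path and return the number of loop iterations,
    i.e. the number of non-empty blocks read and written."""
    iterations = 0
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                block = os.read(src, BLOCK_SIZE)  # one read system call
                if not block:  # zero bytes returned: end of file
                    break
                written = 0
                # Normally a single write system call; loop only to
                # handle a short write.
                while written < len(block):
                    written += os.write(dst, block[written:])
                iterations += 1
        finally:
            os.close(dst)
    finally:
        os.close(src)
    return iterations
```

The read and write issued in each iteration form the pair of system calls that a multi-call would replace, so the per-iteration boundary-crossing cost is what the optimization targets.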

Table 3: Optimization of mpeg_play using multi-calls

Size	Original CPU Cycles ( )	Optimized CPU Cycles ( )	% Savings
4.7M	23.75	21.74	8.47%
9.5M	63.65	52.09	18.17%
9.5M	31.00	21.70	30.00%
10.3M	51.51	41.12	20.17%
15.1M	60.18	52.10	13.42%

The second example program is the popular mpeg_play video software decoder [18]. The effects of optimizing this program using our approach are shown in table 3. Although profiling revealed several candidate system call sequences, only one was optimized, since the others lay partially or completely inside the X-windows libraries used by the player. The program was executed using different input files taken from [13], with sizes varying from 4.7MB to 15MB. Overall, our approach shows a more dramatic effect than for the copy program, largely because the system calls here are not I/O bound, as was the case for copy. In addition to the savings in CPU cycles, this optimization also improved the frame rate and execution time of mpeg_play. Specifically, there was an average 25% improvement in the frame rate and an average 20% reduction in execution time across all file sizes.
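The savings column of table 3 is the relative reduction in cycles with respect to the original program. The sketch below recomputes it from the published cycle counts; the helper name `percent_savings` is hypothetical, and the rows are copied from table 3.

```python
def percent_savings(original: float, optimized: float) -> float:
    """Relative reduction in cycles, as a percentage of the original count."""
    if original <= 0:
        raise ValueError("original cycle count must be positive")
    return 100.0 * (original - optimized) / original


# (size, original cycles, optimized cycles, reported % savings) from table 3
TABLE_3 = [
    ("4.7M", 23.75, 21.74, 8.47),
    ("9.5M", 63.65, 52.09, 18.17),
    ("9.5M", 31.00, 21.70, 30.00),
    ("10.3M", 51.51, 41.12, 20.17),
    ("15.1M", 60.18, 52.10, 13.42),
]

if __name__ == "__main__":
    for size, original, optimized, reported in TABLE_3:
        computed = percent_savings(original, optimized)
        print(f"{size}: computed {computed:.2f}%, reported {reported:.2f}%")
```

The recomputed values agree with the reported ones to within about 0.01 percentage points; the small differences are presumably due to rounding of the published cycle counts.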