Proxy server performance

We then measured the actual performance of our simple proxy server, using either select() or our new API. In these experiments, all requests are for the same (static) 1 Kbyte file, which is therefore always cached in the Web server's memory. (We ran additional tests using 8 Kbyte files; space does not permit showing the results, but they display analogous behavior.)

In the first series of tests, we always used 32 hot connections, but varied the number of cold connections between 0 and 2000. The hot-connection S-Clients are configured to generate requests as fast as the proxy system can handle; thus we saturated the proxy, but never overloaded it. Figure 8 plots the throughput achieved for three kernel configurations: (1) the "classical" implementation of select(), (2) our improved implementation of select(), and (3) the new API described in this paper. All kernels use a scalable version of the ufalloc() file-descriptor allocation function [4]; the normal version does not scale well. The results clearly indicate that our new API performs independently of the number of cold connections, while select() does not. (We also found that the proxy's throughput is independent of array_max.)
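
To make the hot/cold distinction concrete, the following is a minimal user-level sketch, not the proxy or the kernel code measured here. It uses Python's standard `select.select()` and `socket.socketpair()`; the connection counts (4 hot, 32 cold) and the helper names `make_connections` and `poll_once` are assumptions chosen for illustration. It shows that every select() call must be handed the full interest set, cold connections included, even though only the hot ones are ready.

```python
import select
import socket


def make_connections(n):
    """Create n connected socket pairs as (client_end, proxy_end) tuples."""
    return [socket.socketpair() for _ in range(n)]


def poll_once(proxy_ends, timeout=1.0):
    """Issue one select() call over every connection the proxy holds."""
    readable, _, _ = select.select(proxy_ends, [], [], timeout)
    return readable


# Hypothetical, scaled-down counts; the experiments above use 32 hot
# connections and up to 2000 cold ones.
hot = make_connections(4)
cold = make_connections(32)

# Only the hot clients send requests; the cold connections stay idle.
for client_end, _ in hot:
    client_end.sendall(b"GET /file HTTP/1.0\r\n\r\n")

hot_proxy_ends = [proxy_end for _, proxy_end in hot]
cold_proxy_ends = [proxy_end for _, proxy_end in cold]

# select() is stateless: the whole interest set is passed on every call.
interest_set = hot_proxy_ends + cold_proxy_ends
ready = poll_once(interest_set)

print(f"{len(interest_set)} descriptors passed to select(), {len(ready)} ready")

# Release the descriptors.
for client_end, proxy_end in hot + cold:
    client_end.close()
    proxy_end.close()
```

In this sketch the work per call grows with the size of `interest_set`, while the useful result is bounded by the number of hot connections; that mismatch is the effect Figure 8 measures at the kernel level.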

**Figure 8:** HTTP rate vs. cold connections

**Figure 9:** Latency vs. request rate

In the second series of tests, we fixed the number of cold connections at 750, and measured response time (as seen by the clients). Figure 9 shows the results. When using our new API, the proxy system exhibits much lower latency, and saturates at a somewhat higher request load (1348 requests/sec., vs. 1291 requests/sec. for the improved select() implementation).
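
As a quick check on what "somewhat higher" means, the relative difference between the two saturation points can be computed directly from the rates quoted above. The variable names are illustrative only.

```python
# Saturation request rates quoted in the text (requests/sec).
new_api_rate = 1348
improved_select_rate = 1291

# Relative gain of the new API over the improved select() implementation.
gain = (new_api_rate - improved_select_rate) / improved_select_rate

print(f"New API saturates {gain:.1%} higher than improved select()")
```

The gain in saturation throughput is thus a few percent; the larger effect in Figure 9 is the reduction in latency.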

Table 2 shows DCPI profiles of the proxy server in the three kernel configurations. These profiles were made using 750 cold connections, 50 hot connections, and a total load of 400 requests/sec. They show that the new event API significantly increases the amount of CPU idle time, by almost eliminating the event-notification overhead. While the classical select() implementation consumes 34% of the CPU, and our improved select() implementation consumes 12%, the new API consumes less than 1% of the CPU.

**Table 2:** Effect of event API on system CPU profile

| Classical select() CPU % | Scalable select() CPU % | New event API CPU % | Procedure | Mode |
|---|---|---|---|---|
| 18.09% | 33.01% | 59.01% | all idle time | kernel |
| 33.51% | 12.02% | 0.68% | all kernel select or event functions | kernel |
| 13.78% | N.A. | N.A. | soo_select() | kernel |
| 9.11% | N.A. | N.A. | selscan() | kernel |
| 8.40% | N.A. | N.A. | undo_scan() | kernel |
| 2.22% | 12.02% | N.A. | select() | kernel |
| N.A. | 0.57% | N.A. | new_soo_select() | kernel |
| N.A. | 0.47% | N.A. | new_selscan_one() | kernel |
| N.A. | N.A. | 0.40% | get_next_event() | kernel |
| N.A. | N.A. | 0.15% | declare_interest() | kernel |
| N.A. | N.A. | 0.13% | revoke_interest() | kernel |
| 2.01% | 1.95% | 1.71% | _Xsyscall() | kernel |
| 1.98% | 1.88% | 1.21% | main() | user |
| 1.91% | 1.90% | 1.69% | _doprnt() | user |
| 1.63% | 1.58% | 1.54% | memset() | user |
| 1.29% | 1.31% | 1.47% | read_io_port() | kernel |
| 1.11% | 1.15% | 1.20% | syscall() | kernel |
| 1.09% | 1.11% | 1.11% | _XentInt() | kernel |
| 1.08% | 1.06% | 1.19% | malloc() | kernel |

750 cold connections, 50 hot connections, 400 requests/second, 1KB/request
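
The summary rows of Table 2 can be turned into ratios with a few lines of arithmetic. The sketch below copies the "all kernel select or event functions" and "all idle time" rows from the table; the dictionary layout and variable names are assumptions made for illustration.

```python
# Values copied from Table 2 (percent of CPU time).
event_overhead = {
    "classical select()": 33.51,
    "scalable select()": 12.02,
    "new event API": 0.68,
}
idle_time = {
    "classical select()": 18.09,
    "scalable select()": 33.01,
    "new event API": 59.01,
}

baseline = "new event API"

# How many times more CPU each select() variant spends on event
# notification than the new API does.
overhead_ratio = {
    config: event_overhead[config] / event_overhead[baseline]
    for config in event_overhead
    if config != baseline
}

# Idle time gained by the new API, in percentage points of CPU.
idle_gain = {
    config: round(idle_time[baseline] - idle_time[config], 2)
    for config in idle_time
    if config != baseline
}

for config in overhead_ratio:
    print(
        f"{config}: {overhead_ratio[config]:.1f}x the event overhead, "
        f"{idle_gain[config]:.2f} points less idle time"
    )
```

Under this load, the classical select() path spends roughly 49 times as much CPU on event notification as the new API, and the scalable select() path roughly 18 times as much.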