The NetBIOS interface allows an application to wait for incoming data on multiple network connections. NetBIOS does not provide a procedure-call interface; instead, an application creates a ``Network Control Block'' (NCB), loads its address into specific registers, and then invokes NetBIOS via a software interrupt. NetBIOS provides a command's result via a callback.
The NetBIOS ``receive any'' command returns (calls back) when data arrives on any network ``session'' (connection). This allows an application to wait for arriving data on an arbitrary number of sessions, without having to enumerate the set of sessions. It does not appear possible to wait for received data on a subset of the active sessions.
The ``receive any'' command has numerous limitations, some of which are the result of a non-extensible design. The NCB format allows at most 254 sessions, which obviates the need for a highly-scalable implementation. The command does not allow an application to discover when a once-full output buffer becomes writable, nor does it apply to disk files.
In the Win32 programming environment, the use of NetBIOS is strongly discouraged. Win32 includes a procedure named WaitForMultipleObjects(), declared as:
DWORD WaitForMultipleObjects( DWORD cObjects, // number of handles in handle array CONST HANDLE * lphObjects, // address of object-handle array BOOL fWaitAll, // flag: wait for all or for just one DWORD dwTimeout // time-out interval in milliseconds );This procedure takes an array of Win32 objects (which could include I/O handles, threads, processes, mutexes, etc.) and waits for either one or all of them to complete. If the fWaitAll flag is FALSE, then the returned value is the array index of the ready object. It is not possible to learn about multiple objects in one call, unless the application is willing to wait for completion on all of the listed objects.
This procedure, like select(), might not scale very well to large numbers of file handles, for a similar reason: it passes information about all potential event sources every time it is called. (In any case, the object-handle array may contain no more than 64 elements.) Also, since WaitForMultipleObjects must be called repeatedly to obtain multiple events, and the array is searched linearly, a frequent event rate on objects early in the array can starve service for higher-indexed objects.
Windows NT 3.5 added a more advanced mechanism for detecting I/O events, called an I/O completion port (IOCP)[10,21]. This ties together the threads mechanism with the I/O mechanism. An application calls CreateIoCompletionPort() to create an IOCP, and then makes an additional call to CreateIoCompletionPort() to associate each interesting file handle with that IOCP. Each such call also provides an application-specified ``CompletionKey'' value that will be associated with the file handle.
An application thread waits for I/O completion events using the GetQueuedCompletionStatus() call:
BOOL GetQueuedCompletionStatus( HANDLE CompletionPort, LPDWORD lpNumberOfBytesTransferred, LPDWORD CompletionKey, LPOVERLAPPED *lpOverlapped, DWORD dwMillisecondTimeout);Upon return, the CompletionKey variable holds the value associated, via CreateIoCompletionPort(), with the corresponding file handle. Several threads might be blocked in this procedure waiting for completion events on the same IOCP. The kernel delivers the I/O events in FIFO order, but selects among the blocked threads in LIFO order, to reduce context-switching overhead.
The IOCP mechanism seems to have no inherent limits on scaling to large numbers of file descriptors or threads. We know of no experimental results confirming its scalability, however.
Once a handle has been associated with an IOCP, there is no way to disassociate it, except by closing the handle. This somewhat complicates the programmer's task; for example, it is unsafe to use as the CompletionKey the address of a data structure that might be reallocated when a file handle is closed. Instead, the application should use a nonce value, implying another level of indirection to obtain the necessary pointer. And while the application might use several IOCPs to segregate file handles into different priority classes, it cannot move a file handle from one IOCP to another as a way of adjusting its priority.
Some applications, such as the Squid proxy[5,18], temporarily ignore I/O events on an active file descriptor, to avoid servicing data arriving as a lengthy series of small dribbles. This is easily done with the UNIX select() call, by removing that descriptor from the input bitmap; it is not clear if this can be done using an IOCP.
Hu et al. discuss several different NT event dispatching and concurrency models in the context of a Web server, and show how the server's performance varies according to the model chosen. However, they did not measure how performance scales with large numbers of open connections, but limited their measurements to at most 16 concurrent clients.
In summary, the IOCP mechanism in Windows NT is similar to the API we propose for UNIX, and predates our design by several years (although we were initially unaware of it). The differences between the designs may or may not be significant; we look forward to a careful analysis of IOCP performance scaling. Our contribution is not the concept of a pending-event queue, but rather its application to UNIX, and our quantitative analysis of its scalability.