The following paper was originally published in the
Proceedings of the USENIX 1996 Annual Technical Conference
San Diego, California, January 1996

For more information about USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

An Analysis of Process and Memory Models to Support High-Speed Networking in a UNIX Environment

BJ Murphy
Computer Laboratory, University of Cambridge, UK

S Zeadally, CJ Adams
Department of Computer Science, University of Buckingham, UK

Abstract

In order to reap the benefits of high-speed networks, the performance of the host operating system must at least match that of the underlying network. A barrier to achieving high throughput is the cost of copying data within current host architectures. We present a performance comparison of three styles of network device driver designed for a conventional monolithic UNIX kernel. Each driver performs a different number of copies. The zero-copy driver works by allowing the memory on the network adapter to be mapped directly into user address space. This maximises performance at the cost of: 1) breaking the semantics of existing network APIs such as BSD sockets and SVR4 TLI; 2) pushing responsibility for network buffer management up from the kernel into the application layer. The single-copy driver works by copying data directly between user space and adapter memory obviating the need for an intermediate copy into kernel buffers in main memory.
This approach can be made transparent to existing application code but, like the zero-copy case, relies on an adapter with a generous quantity of on-board memory for buffering network data. The two-copy driver is a conventional STREAMS driver. The two-copy approach sacrifices performance for generality. We observe that the STREAMS overhead for small packets is significant. We report on the benefit of the hardware cache in ameliorating the effect of the second copy, although we note that streaming network data through the cache reduces the level of cache residency seen by the rest of the system.

A barrier to achieving low jitter is the non-deterministic nature of many operating system schedulers. We describe the implementation and report on the performance of a kernel streaming driver that allows data to be copied between a network adapter and another I/O device without involving the process scheduler. This provides performance benefits in terms of increased throughput, increased CPU availability and reduced jitter.

Introduction

The integrated support of distributed continuous media systems poses design issues for the architecture of the network over which such a system is distributed and for the operating system running in the end-user terminals. This paper addresses the subject of operating system support for audio and video, particularly focussing on the reduction of delay and jitter.

The UNIX operating system was not designed to provide optimal support for continuous media. Traditionally, fair sharing of resources between multiple users has been the prime concern. However, as UNIX moves from main-frame to desktop, high throughput and deterministic response times become increasingly important.

This paper considers a number of techniques that are relevant to the support of high-speed networks in general and continuous media in particular within a conventional, and unmodified, monolithic UNIX kernel. We make two contributions.
Firstly, we present the design of a network interface based on a zero-copy memory model, we compare its performance with single-copy and conventional two-copy models, and we discuss compatibility, protection and other implications. Secondly, we present the design of a process model in which data is streamed between device drivers without traversing the user-kernel boundary. Our motivation was to explore the options for integrating continuous media into a traditional operating system. Architectural Background Data Copying The most important factor that limits application-network throughput in current generation workstations is the high cost of copying data relative to the cost of processing that data. This is because improvements in memory performance have not kept pace with improvements in processor performance [11]. Furthermore, improvements in network technology now mean that memory bandwidth is often within the same order of magnitude as network bandwidth. In order to exploit the throughput potential of the network, therefore, the number of copies between application and network adapter must be kept to a minimum. A conventional network subsystem in a monolithic kernel such as UNIX requires two copies of the data in both the transmit and receive directions: one copy is required between application and kernel; an additional copy is required between kernel and network adapter. The network subsystem generally performs some level of protocol processing depending on the requirements of the application. For example, in order to provide a reliable byte-stream service, the network subsystem implements a communications protocol such as TCP which performs flow control and error recovery. The network subsystem manages kernel buffers according to the specification of the protocol in order to handle functions such as packet retransmission and re-assembly. One opportunity to improve performance is to eliminate the copy from kernel to network adapter. 
This requires that the network adapter be equipped with on-board memory which can be mapped into host address space. In the transmit direction, the network subsystem copies data from application address space directly into adapter memory. Protocol processing may be performed and protocol headers inserted before the adapter is instructed to transmit the packet. In the receive direction, the network adapter assembles an incoming packet in adapter memory. The network subsystem may then perform protocol processing on the packet before copying the data directly into application address space. This is the approach taken by the designers of the Medusa FDDI adapter [1] and the Afterburner [3] TCP/IP accelerator board. A further optimisation is the addition of hardware assist for on-the-fly checksum calculations as the data is copied between kernel and user space. The only changes that are required are to the network subsystem; existing applications reap the benefits of reduced copying without change. Another opportunity to improve performance is to eliminate the copy between application and kernel. A number of virtual memory (VM) techniques have been proposed including page-remapping (in which pages are unmapped from one protection domain and mapped into the other) and copy-on-write (in which pages are shared between protection domains and copying is delayed until a process in one domain attempts to write to a shared page). These techniques are particularly relevant to micro-kernel architectures in which data may traverse multiple protection domains [5]. An alternative technique for eliminating the user-kernel copy is by the use of statically shared memory. All of these techniques normally require some degree of co-operation from the application (for example, using page-aligned buffers or pinning pages into memory) as well as possible changes to the VM and network subsystems. 
By themselves, each approach (eliminating kernel-adapter copies and eliminating user-kernel copies) provides single-copy transfers between application and network. In combination, and with the appropriate hardware, zero-copy transfers are achievable. Context Switches Another factor that limits application-network throughput is the cost of a context switch. Normally, a hardware interrupt is generated every time a packet arrives from the network [Footnote: An intelligent network adapter may re-assemble lower-level protocol data units (such as ATM cells) in order to reduce the interrupt load on the host.]. On a heavily loaded system, each such interrupt will be accompanied by a context switch from the currently running user process to the user process for which the packet is intended. Context switches between user processes are expensive. The explicit overhead involves the saving of one context (register contents, memory management information, etc) and the restoration of another. However, there is an implicit cache-performance cost that, depending on the host architecture, may dominate other overheads [14]. Apart from these throughput overheads, context switches can have a serious impact on the timely delivery of data. Traditional UNIX systems provide no bounds on context switch latency. This is because a process running in kernel mode cannot be pre-empted. Furthermore, the conventional time-sharing scheduling policy leads to non-deterministic response times. Modern versions of UNIX (such as System V Release 4) attempt to alleviate these problems by providing a pre-emptible kernel together with real-time scheduling classes [10]. Unfortunately, early experiences of such systems have not been encouraging [15]. Applications that involve the transfer of jitter-sensitive continuous media between devices may be better served by a mechanism that completely eliminates the need for context switches. 
Context switches can be avoided by exploiting kernel-level or hardware-level streaming [4]. Kernel-level streaming involves the transfer of data from source to sink over the system bus without an application process being included in the data path. Since the data passes through memory, data manipulations by the CPU are possible. Hardware-level streaming involves the transfer of data from source to sink over a private data bus or by peer-to-peer DMA over the system bus. No CPU manipulations of the data are possible. These two types of streaming contrast with the more conventional user-level streaming arrangement in which a user process is involved in the data path. Our interest in kernel-level streaming stems from the observation that many continuous media applications involve the transfer of data between a network adapter and another device without requiring the manipulation of that data. For example, a server process may move video between network and disk; a client process may move audio between network and codec. In these cases, there is no requirement to touch the data beyond the presentation-layer conversions performed by the protocol stack. The fact that the data needs to cross the user-kernel boundary (twice) is an artefact of the design of the I/O sub-system. Performing the transfer in the kernel reduces the amount of copying involved and eliminates the context-switch overhead that would otherwise be incurred. This improves throughput, improves CPU availability, and reduces jitter. Implementation Copy Avoidance Three different styles of device driver (two-copy, one-copy and zero-copy) have been implemented for the CHARISMA Asynchronous Transfer Mode (ATM) network adapter. ATM is a switching and multiplexing scheme based on the use of small fixed-size "cells" which is likely to be deployed in many high-speed networks of the future [17]. 
The CHARISMA ATM adapter has been designed for the EISA bus and features 1 MB of dual-ported RAM which is memory-mapped into host address space. The on-board 25 MHz T801 transputer offers an ATM Adaptation Layer 5 (AAL5) interface to the host processor. AAL5 provides a variable-length packet transfer service with error detection on top of the fixed-length cell relay service provided by ATM [17]. All communication between the two processors is performed by the manipulation of a pair of software FIFOs in shared memory and by hardware interrupts. As noted in [3, 6, 13], the use of a pair of shared memory FIFOs to pass buffer descriptors coupled with the atomicity of load and store instructions allow two processors to exchange data without the need for synchronisation primitives. Except for the small amount of space used by these FIFOs and ancillary control structures, the shared memory is organised as two pools of fixed-sized buffers (4 KB, equal to the VM page size), one pool for transmission and the other for reception. Data is transferred using memory load/store instructions. The CHARISMA host adapter is described in detail in [13]. The two-copy "atm" driver is a STREAMS [19] driver that provides both a DLPI [24] interface and a "raw" interface to the underlying AAL5 service. Within the STREAMS model, device drivers, modules and multiplexors interact by passing messages between uni-directional queues. A STREAM head provides user processes with the point of access to a particular stream (queue-pair). No copying is involved as data is passed between queues since messages are passed by reference. The DLPI interface allows the driver to be linked under the standard IP multiplexor thereby supporting the transmission of IP packets over our ATM network. The raw interface allows a user process to transmit and receive AAL5 packets using the read() and write() system calls. The atm driver provides two-copy transfers between application and network. 
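The pair of shared-memory FIFOs described above for passing buffer descriptors between host and adapter can be illustrated with a single-producer, single-consumer ring. The following is a minimal sketch under stated assumptions, not the CHARISMA implementation: the names desc_fifo, fifo_put, fifo_get and FIFO_SLOTS are hypothetical, a descriptor is assumed to be a plain 32-bit buffer index, and C11 atomics stand in for the naturally atomic load and store instructions that the original design relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FIFO_SLOTS 8u   /* hypothetical depth; must be a power of two */
static_assert((FIFO_SLOTS & (FIFO_SLOTS - 1u)) == 0u,
              "FIFO_SLOTS must be a power of two");

/* One direction of the host/adapter channel. Only the producer writes
 * `head` and only the consumer writes `tail`, so no lock is required. */
struct desc_fifo {
    _Atomic uint32_t head;       /* count of descriptors ever queued   */
    _Atomic uint32_t tail;       /* count of descriptors ever consumed */
    uint32_t slot[FIFO_SLOTS];   /* buffer descriptors (assumed: indices) */
};

/* Producer side: returns 1 on success, 0 if the FIFO is full. */
static int fifo_put(struct desc_fifo *f, uint32_t desc)
{
    uint32_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);

    if (head - tail == FIFO_SLOTS)
        return 0;                               /* no free slot */
    f->slot[head % FIFO_SLOTS] = desc;
    /* Publish the slot contents before advancing head. */
    atomic_store_explicit(&f->head, head + 1u, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 and stores a descriptor, 0 if the FIFO is empty. */
static int fifo_get(struct desc_fifo *f, uint32_t *desc)
{
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&f->head, memory_order_acquire);

    if (head == tail)
        return 0;                               /* nothing queued */
    *desc = f->slot[tail % FIFO_SLOTS];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&f->tail, tail + 1u, memory_order_release);
    return 1;
}
```

In this sketch one such structure would carry descriptors from host to adapter and a second one from adapter to host, with each processor acting as producer on one ring and consumer on the other. The design works without synchronisation primitives only because each index has exactly one writer.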
The single-copy "cpatm" driver is a non-STREAMS character driver that simply copies data between application and network. It provides a read()/write() interface to the underlying AAL5 service. The cpatm driver provides single-copy transfers between application and network but does not permit integration with in-kernel network protocols [Footnote: A STREAMS procedure (esballoc()) enables a device driver to eliminate the copy from a memory-mapped adapter to a kernel buffer by allowing a STREAMS message to be wrapped around a user-supplied buffer. (One of the arguments to esballoc() is a pointer to a function to be called to de-allocate the buffer.) We suggest that the SVR4 STREAMS implementation be modified to provide the STREAM head with the same facility. This would permit the implementation of an integrated single-copy STREAMS-based network subsystem in both the receive and transmit directions. None of our experiments described here made use of the esballoc() procedure.]. The zero-copy "mmatm" driver is a non-STREAMS driver that eliminates copies between a co-operating application and the network. It supplies a mmap() entry point that allows an application process to map buffers (pages) from adapter memory directly into user address space. This allows the data path between application and network to bypass the kernel entirely. The control path is managed by the mmatm driver. Before an application can transmit a packet, it must make an ioctl() request to the driver to obtain a buffer identifier. The buffer identifier specifies the area of adapter memory that has been allocated to the application. The application then writes its data directly into the specified buffer on the adapter. In order to transmit the data, the application makes an ioctl() call to the driver passing the buffer identifier as the argument. 
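The transmit sequence just described (obtain a buffer identifier, write the data in place, request transmission) can be sketched as a user-space simulation. This is an illustration only: the paper does not give the ioctl() command names or argument layouts, so mock_ioctl_get_buf, mock_ioctl_transmit, send_counting_packet and NUM_TX_BUFS are hypothetical, a static array stands in for the adapter memory that a real application would reach through mmap(), and the sketch assumes that a buffer returns to the pool as soon as it has been handed to the adapter.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE    4096   /* one 4 KB buffer, equal to the VM page size */
#define NUM_TX_BUFS 4      /* hypothetical transmit pool size */

/* Stand-in for the adapter's transmit pool (assumption: a real
 * application would see this memory via the driver's mmap() entry). */
static unsigned char adapter_mem[NUM_TX_BUFS][BUF_SIZE];
static int    buf_busy[NUM_TX_BUFS];     /* 1 while owned by the application */
static size_t last_tx_len[NUM_TX_BUFS];  /* length seen by the mock adapter  */

/* Mock of the buffer-management ioctl(): returns a buffer identifier,
 * or -1 if the pool is exhausted. */
static int mock_ioctl_get_buf(void)
{
    for (int id = 0; id < NUM_TX_BUFS; id++) {
        if (!buf_busy[id]) {
            buf_busy[id] = 1;
            return id;
        }
    }
    return -1;
}

/* Mock of the transmit ioctl(): takes the buffer identifier as argument.
 * Assumption: the buffer is free again once handed to the adapter. */
static int mock_ioctl_transmit(int id, size_t len)
{
    assert(id >= 0 && id < NUM_TX_BUFS);
    if (!buf_busy[id] || len > BUF_SIZE)
        return -1;
    last_tx_len[id] = len;
    buf_busy[id] = 0;
    return 0;
}

/* Application side: build a packet in place and send it.
 * Returns the buffer identifier used, or -1 on failure. */
static int send_counting_packet(size_t len)
{
    if (len > BUF_SIZE)
        return -1;

    int id = mock_ioctl_get_buf();            /* call 1: buffer management */
    if (id < 0)
        return -1;

    unsigned char *p = adapter_mem[id];
    for (size_t i = 0; i < len; i++)          /* data is generated directly */
        p[i] = (unsigned char)(i & 0xff);     /* in the adapter buffer      */

    if (mock_ioctl_transmit(id, len) != 0)    /* call 2: data transfer */
        return -1;
    return id;
}
```

The point of the sketch is that the payload is produced directly in the mapped buffer, so the data path involves no copy, while the control path still costs two calls into the driver per packet.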
In order to receive a packet, the application makes a (potentially blocking) ioctl() call; when a packet arrives, the driver passes back the buffer identifier as a return parameter to this call. The application consumes the data and makes an ioctl() call to release the buffer. Notice that the mmatm driver does not provide read() or write() entry points. Notice also that two system calls (one for buffer management and one for data transfer) are required to transfer a packet in each direction. These issues are discussed later. Context-Switch Avoidance The kernel streaming "kproc" driver is a STREAMS multiplexor under which two device drivers may be linked. The kproc driver can be instructed to copy data from the read queue of one driver to the write queue of the other, and vice-versa (Figure 1). This is exactly the behaviour of a STREAMS-based IP multiplexor that has been configured to act as a router. The presence of the kproc driver is transparent to the devices linked below it; each driver is unaware of whether its queues are connected to a user process or to another driver. The kproc driver allows data to be transferred between a pair of devices using only two copies instead of four (since the two copies across the user-kernel boundary are eliminated) and without any context switches. A key feature of this model is the separation of the control path from the data path [18]. The data path is within the kernel, while an application process may control the flow by means of an out-of-band channel. This is a radical departure from the conventional UNIX I/O model. Although our kproc driver does nothing more than transfer data between devices, it would be possible to perform in-band processing should that be required. For example, flow-specific events (such as video frame boundaries) could be signalled to the application in order to support higher level synchronisation services. 
Furthermore, additional STREAMS modules could be "pushed" onto the data path to perform higher level protocol processing such as transport-level control and presentation-level format conversions.

Experimental Detail

In order to evaluate the performance of these models, a simple traffic generator/receiver program was written for the T801 transputer on the network adapter. The program can be instructed (via a shared memory interface) to generate a series of packets of specified size and at a specified rate or to sink a series of packets. In combination with an equivalent program running as a UNIX user process, measurements were taken to determine the performance of each driver. Each program has access to the 1 µs clock provided by the T801; the transputer program copies the value of the clock into shared memory (at a granularity of approximately 4 µs) for the benefit of the UNIX program. The programs timestamp outgoing and incoming packets in local memory. At the end of a test run, the transputer program is instructed to dump its array of timestamps into shared memory. Elapsed times are calculated by simply subtracting corresponding timestamps. These values represent the time taken to transfer a series of packets between a network adapter and a UNIX user-level process.

In order to measure the performance of the kernel streaming driver, we equipped the host with two of our network adapters. Each board runs a separate copy of the transputer test program. A UNIX control program first links the two instances of the device driver below the kproc multiplexor. It then configures the transputer program on one board to act as a sink of packets and the other transputer program to act as a source of packets. At the end of the test sequence, the control program instructs each transputer to dump its array of timestamps. Elapsed times are calculated allowing for the offset between the two clocks.
These values represent the time taken to transfer a series of packets between a network adapter and another hardware device using kernel-level streaming. Two variations of kernel-level streaming were investigated: synchronous streaming, in which packets are transferred directly by the interrupt handler, and asynchronous streaming, in which packets are only queued by the interrupt handler and transferred some time later by the STREAMS scheduler (with most interrupts enabled). It is possible to configure the style of message passing (synchronous or asynchronous) on a per-stream basis by means of an ioctl() call. The kernel streaming configuration just described was compared with a conventional user-level streaming configuration. A UNIX user-level streaming program was used to read packets from one board and write packets to the other board in the traditional manner using the "raw" read()/write() interface provided by the STREAMS atm driver. The streaming program was run as both a time-sharing and a real-time process. In order to gain a proper understanding of the relative merits of kernel-level streaming versus user-level streaming, we conducted these experiments under a number of different operating conditions. The first set of readings was taken without artificially loading the system. A second set of readings was taken while running a compute-intensive "soak" program comprising a tight loop. The "soak" program ensures that the processor is never idle. It spends no time in the kernel. A third set of readings was taken while running an I/O-intensive "find" program comprising a recursive search of an NFS filing system over Ethernet. The "find" program spends a considerable proportion (over 95%) of its execution time in the kernel performing system calls. Furthermore, the "find" program causes interrupts to be generated by the Ethernet adapter. It spends approximately 85% of the time sleeping (waiting for I/O). 
The interrupt priority level of the Ethernet driver was the same as that of our ATM driver. The "soak" and "find" processes were run in the default time-sharing scheduling class.

All our measurements were made on a 33 MHz Intel 80486 EISA machine running an otherwise unloaded UNIX System V Release 4.2 kernel in the default multi-user state. The machine features 32 MB of RAM, 8 KB of physically-indexed four-way set associative write-through primary cache, and 256 KB of physically-indexed two-way set associative write-back secondary cache.

Results

Comparison of Device Driver Performance

Figures 2 and 3 illustrate the performance of the three types of device driver for different sizes of packet. The data points represent the median of 1000 timing measurements for each size of packet using an inter-packet transmission interval of 10 ms.

Relative Cost of Transmission versus Reception

The first point to notice is that the cost of reception is approximately twice that of transmission for all three types of driver. This is typical of network subsystems in general. It is explained by the fact that, whereas packets are transmitted synchronously with respect to the source, they are received asynchronously with respect to the sink. This normally demands that the operating system handle an interrupt, wake up the waiting process and possibly switch context for every incoming packet.

Relative Performance of Zero-Copy versus Single-Copy Drivers

The second point to observe is that for small packets (less than approximately 500 bytes) the performance of the single-copy cpatm driver is better than that of the zero-copy mmatm driver. This is because we have included the cost of the system call required to perform buffer management for each packet transfer in our readings for the zero-copy driver. Recall that an mmatm user wishing to transmit a packet must first ask the driver for a free buffer.
Likewise, when an mmatm user has finished processing an incoming packet, it must return the buffer to the driver. This additional overhead is included in our analysis in order to permit a fair comparison. Relative Cost of STREAMS versus non-STREAMS Drivers As expected, the STREAMS atm driver provides the worst performance. There are two explanations for this. Firstly, the STREAMS driver performs two copies of the data. Secondly, there is an inherent overhead associated with STREAMS as indicated by the difference between the points of intersection on the ordinate for the STREAMS atm and the non-STREAMS cpatm drivers. That is, for very small packets for which the overhead associated with copying is negligible, the performance of the STREAMS driver is worse (by a factor of approximately 2) than that of the non-STREAMS driver. Part of the reason for this is that every packet on a stream takes the form of a STREAMS message with an associated header that requires allocation and de-allocation by the driver and STREAM head. A further reason is that the movement of a message between driver and application involves processing by the STREAM head and possibly by the STREAMS scheduler in addition to the driver. The inherent STREAMS overhead becomes less significant with increasing packet size as the copying overhead becomes dominant. Effects of the Cache The final point to notice is that the performance of the two-copy driver does not degrade with packet size as quickly as one might anticipate given the extra copy involved relative to the single-copy driver. This is illustrated by the fact that the gradient of the lines relating to the two-copy atm driver is very nearly the same as the gradient of the lines relating to the single-copy cpatm driver. We speculated that this was due to the hardware cache, but decided to investigate further. Figures 4 and 5 show a repeat of the experiments with all caches (primary and secondary) disabled. 
Under these conditions, the throughput characteristics of the two-copy atm driver demonstrate the cost of the extra (un-cached) copy. Specifically, the gradient of lines relating to the two-copy atm driver is greater (by a factor of approximately 1.7) than the gradient of lines relating to the single-copy cpatm driver.

With caches on (Figures 2 and 3), the effect of the copy between main memory and main memory (across the user-kernel boundary) is ameliorated by the fact that the data is already cached. This degree of cache residency is unlikely to be achievable under realistic load conditions in which manipulation of network buffers is interleaved with memory accesses from other processes. Furthermore, the benefit of the cache to a network subsystem that performs multiple copies is at the expense of reduced cache residency seen by other processes in the system. Neither of these drawbacks is revealed by our experiments.

The cost of the copy between adapter memory and main memory is almost independent of whether or not caches are enabled. This is illustrated by the fact that the gradient of the lines relating to the single-copy cpatm driver in Figures 2, 3, 4 and 5 is virtually constant. This is because the adapter memory is configured to be non-cacheable to prevent cache consistency problems [20]. The cost of accessing adapter memory over the system bus dominates the cost of accessing main memory regardless of whether or not caches are enabled.

Comparison of User-level Streaming versus Kernel-level Streaming

Figures 6, 7 and 8 illustrate the performance of the various streaming implementations under three operating conditions. We have adopted the graphical presentation format devised by Faller [9]. The ordinate shows the time taken to transfer a 4072-byte block (the maximum size of an AAL5 service data unit that fits into a 4 KB buffer) between one network adapter and the other. The abscissa shows the percentage cumulative frequency of these transfer times.
The value on the ordinate corresponding to P% on the abscissa is called the Pth percentile of the distribution. For example, an 80th percentile of 2 ms means that 80% of the measured transfer times were at or below 2 ms. The value of the 50th percentile is, by definition, the median. A flat graph (i.e. a horizontal line) represents zero variance. We have used a logarithmic scale on the ordinate in order to dampen the visual distortion that would otherwise be caused by a small number of large readings. We have used an exponential scale on the abscissa in order to highlight the relatively small number of readings of particular interest.

Each test run comprised the transfer of 1000 packets with an inter-packet transmission time of 10 ms. The results are plotted at 1% intervals on the abscissa. The slight positive gradient that is observable in the graphs (approximately 130 µs over the 10 s measurement period) is due to drift between the two transputer clocks.

Kernel-level Streaming with no Artificial Load

Figure 6 shows the performance of the four streaming implementations with no artificial load. Kernel-level streaming provides a throughput increase of approximately 25%. Furthermore, average CPU utilisation (as measured by the public domain top program) is reduced by between 30% and 50% depending on whether the transfer is performed by the STREAMS scheduler or by the interrupt handler respectively. A comparison of the graphs for the two variations of kernel-level streaming demonstrates the overhead imposed by the STREAMS scheduler (approximately 5%).

Kernel-level Streaming with Compute-Intensive Load

Figure 7 shows the minimal impact of the compute-intensive load on the two user-level streaming models. The only performance penalty is the cost of pre-empting the "soak" process for each packet transfer. This overhead (which is just discernible from a comparison of Figures 6 and 7) is approximately 90 µs.
The reason why the "soak" process does not have a more serious impact on the performance of the time-sharing user-level streaming process is that, although both are competing for the processor within the same scheduling class, the scheduler ensures that the priority of the I/O-bound process remains higher. This policy is to help provide better response times for interactive processes compared with compute-bound processes. There is no discernible reduction in the performance of the kernel-level streaming implementations.

Kernel-level Streaming with I/O-Intensive Load

Figure 8 shows the impact of an I/O-intensive load. The real-time user-level streaming process displays much better jitter characteristics than the time-sharing process. This is because pre-emption points within the kernel allow the real-time streaming process to pre-empt the "find" process while the latter is in the kernel. The synchronous implementation of kernel-level streaming is virtually unaffected by the extra I/O load; this is significant, but not surprising since all the work is being done in the interrupt handler. The asynchronous implementation of kernel-level streaming performs badly for approximately 5% of the data set; we are unable to explain this behaviour.

The maximum throughput of the kernel-level streaming model (approximately 18 Mbits/s), which involves two copies of the data, is bounded by adapter memory to main memory bandwidth (approximately 40 Mbits/s). Since both copies cross this path in series, the achievable ceiling is presumably about half that figure (roughly 20 Mbits/s), which is consistent with the measured rate. It is reasonable to speculate that higher rates would be achievable using faster memory.

Discussion

Copy Avoidance

The elimination of all copies (using the shared memory technique described here), whilst attractive for performance reasons, raises a number of issues that we address below.

Hardware Constraints

Not all network adapters offer a memory-mapped interface, without which copy elimination is impossible.
Many peripherals reside on an I/O bus and make use of DMA or programmed I/O for data transfer; the choice between the two is often dictated by the host bus architecture [1]. We suggest that the potential performance benefits of zero-copy transfers need to be taken into account when making the choice between the memory bus and the I/O bus during the adapter design stage and when evaluating the merits of competing host bus architectures. An additional problem is that on-board memory is a limited resource. Buffering protocol data (in order to provide single-copy transfers) or even application data (in order to provide zero-copy transfers) in adapter memory rather than main memory may not be realistic for high volume flows. Nevertheless, where memory-mapped designs are applicable, while the discrepancy between CPU performance and memory performance remains, and while the cost of memory continues to fall, we believe that there is a strong case for furnishing the network adapter with generous quantities of RAM. Application Program Interface Current network APIs, such as BSD sockets and SVR4 TLI, allow applications to allocate arbitrarily aligned buffers located anywhere in process address space and which are contiguous in virtual (but not necessarily physical) address space. The cost of this amenity is generally an extra copy by the network subsystem thereby limiting the scope for copy avoidance. We endorse a model in which network buffers are allocated by, and at the convenience of, the network subsystem rather than the application [4]. Responsibility for buffer de-allocation is shared between the network subsystem and the application. The transmission paradigm becomes one in which the application requests a buffer from the network subsystem, writes to the buffer, and then requests transmission. 
The reception paradigm becomes one in which the application requests reception of data (optionally specifying a limit on size), is handed a buffer by the network subsystem, consumes the data and de-allocates the buffer. A network buffer should be represented as an abstract data type rather than a single contiguous buffer [12]. This leaves the network subsystem free to implement such types in an optimal manner. The advantage of such an API is its generality. Where hardware permits, it allows zero-copy transfers. Where hardware or other constraints rule out copy elimination, it facilitates single-copy transfers without having to "bend" existing APIs (by mandating page-aligned buffers, for example). We suggest that user-level protocols [7, 23] would benefit from a new API: the semantics of existing APIs are likely to force a copy even when the protocol is implemented as a user-level library and no protection domains are being crossed. The disadvantage of such an API is that it is incompatible with current network APIs and therefore cannot be of benefit to the great many existing applications that are based on these APIs. Nevertheless, we believe that there is a new class of applications that could benefit from the performance improvements made possible by an API that is better integrated with the network subsystem. Clearly, a variety of APIs can exist side by side: continuous media applications would benefit from the new API and backward compatibility could be provided by traditional APIs at the cost of reduced performance. Protection A ramification of this new API is that responsibility for the de-allocation of buffers used by the network subsystem is pushed up from the kernel into the application layer. Furthermore, there are protection implications when such an API is underpinned by a shared memory implementation such as that described here. In order to circumvent these problems, buffers need to be managed as pools each of which is owned by a separate process. 
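The buffer model argued for above can be made concrete with a small sketch. The Python below is purely illustrative and is not the interface of the driver described in this paper: the names NetSubsystem, BufferPool, NetBuffer, get_buffer, transmit and receive are all hypothetical, a bytearray stands in for adapter memory mapped into the process, and the loopback "wire" is an assumption made only so that the example is self-contained.

```python
class PoolExhausted(Exception):
    """Raised when a process has used up every buffer in its own pool."""


class NetBuffer:
    """Opaque buffer handle (hypothetical); the application never picks its address."""

    def __init__(self, pool, index, view):
        self._pool = pool
        self._index = index
        self._view = view          # window onto the owning pool's memory
        self.length = 0            # number of valid bytes

    def write(self, data):
        if len(data) > len(self._view):
            raise ValueError("data does not fit in one buffer")
        self._view[:len(data)] = data
        self.length = len(data)

    def payload(self):
        return bytes(self._view[:self.length])

    def free(self):
        # Idempotent, so a double free cannot corrupt the free list.
        if self._pool is not None:
            self._pool.release(self._index)
            self._pool = None


class BufferPool:
    """Fixed-size pool owned by one process (stands in for mapped adapter pages)."""

    def __init__(self, nbufs, bufsize):
        self._memory = memoryview(bytearray(nbufs * bufsize))
        self._bufsize = bufsize
        self._free = list(range(nbufs))

    def alloc(self):
        if not self._free:
            raise PoolExhausted("pool is empty")
        index = self._free.pop()
        start = index * self._bufsize
        return NetBuffer(self, index, self._memory[start:start + self._bufsize])

    def release(self, index):
        self._free.append(index)

    def free_count(self):
        return len(self._free)


class NetSubsystem:
    """Toy network subsystem: allocates buffers and loops transmissions back."""

    def __init__(self):
        self._pools = {}
        self._inbound = {}

    def open(self, owner, nbufs, bufsize=64):
        self._pools[owner] = BufferPool(nbufs, bufsize)
        self._inbound[owner] = []

    def get_buffer(self, owner):
        return self._pools[owner].alloc()

    def transmit(self, buf, dest):
        # Loopback "wire": the copy below models the network, not a host copy.
        rx = self._pools[dest].alloc()
        rx.write(buf.payload())
        self._inbound[dest].append(rx)
        buf.free()                 # the subsystem frees a buffer once it is sent

    def receive(self, owner):
        return self._inbound[owner].pop(0)

    def free_count(self, owner):
        return self._pools[owner].free_count()


net = NetSubsystem()
net.open("sender", nbufs=2)
net.open("receiver", nbufs=2)
net.open("leaky", nbufs=2)

# Transmission: request a buffer, write to it, request transmission.
tx = net.get_buffer("sender")
tx.write(b"video frame")
net.transmit(tx, dest="receiver")

# Reception: be handed a buffer, consume the data, de-allocate the buffer.
rx = net.receive("receiver")
received = rx.payload()
rx.free()

# A process that never frees its buffers exhausts only its own pool.
net.get_buffer("leaky")
net.get_buffer("leaky")
try:
    net.get_buffer("leaky")
    leaky_blocked = False
except PoolExhausted:
    leaky_blocked = True
```

In this sketch de-allocation is shared in the way the text suggests: the subsystem frees a buffer after transmission, while the application frees a buffer after consuming received data. Because each owner draws only on its own pool, the "leaky" owner runs dry without affecting the free counts of the other two.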
The size of each pool could be based on the throughput requirements of the application as specified at call setup time or on the dynamics of the flow subject to a system-specified per-application limit. The device driver would permit the application to map in only the shared memory pages belonging to its pool. Thereafter, the VM subsystem ensures that an application cannot interfere with data belonging to another process. A malicious or badly programmed application that does not free its buffers only exhausts its private pool and does not compromise the operation of the entire system. Operating System Integration The zero-copy driver discussed here does not add value to the network service provided by the network adapter. Higher-level protocols could be implemented in user-space based on this design. If it is necessary to implement such protocols in the kernel, however, changes need to be made to the operating system to enable the network subsystem to handle the abstract buffer type outlined above. Early Demultiplexing A key requirement in this design is that an early demultiplexing decision can be made on incoming data [4]. In ATM networks, the VCI is ideal for this purpose provided that multiple higher-level protocols are not multiplexed over the same channel [22]. Performance For Small Packets A problem with our zero-copy implementation is that, for small packets, the buffer management overhead outweighs the benefits of copy elimination. This is due to the cost of making a system call. A solution is to employ a shared memory interface between application and driver for the purpose of buffer management much like the shared memory interface between driver and adapter for the purpose of data transfer. Other researchers have noted this problem and propose a similar solution [21]. Cache Performance A feature of the zero-copy model presented here is that network data is not brought into the cache unless and until it is explicitly copied by the processor. 
This hurts protocol stack implementations that perform multiple touches of the data since all accesses to network buffers go over the bus. However, there are a number of benefits. Firstly, the level of cache residency seen by the rest of the system increases if network data does not enter the cache. Secondly, incoming network data is only brought into the cache if and when the application consumes the data (ie as late as possible). This maximises cache residency by eliminating the potential for context switches between the data being brought into the cache (by the network subsystem) and the data being consumed by the application [16]. Note that the performance penalty incurred by making non-cacheable accesses to adapter memory is reduced with protocols that touch only part of a packet (eg the header) rather than the entire packet. Such protocols generally sacrifice error detection (by eliminating the checksum, for example), but many "raw" multimedia flows are resilient to some degree of data corruption. Furthermore, implementation strategies based on integrated layer processing [2] ensure that the cost of accessing adapter memory is kept to a minimum. Context-Switch Avoidance Kernel-level streaming has the potential for delivering increased throughput, increased CPU availability, and reduced jitter. In order to realise these benefits in a generic fashion, however, the entire I/O sub-system needs restructuring. For example, in order to be able to stream data directly from disk to the network, the filing system needs to offer an appropriate interface. Nevertheless, we believe that this technique is applicable to a large class of multimedia I/O devices. An added feature of this model is that careful implementations can avoid streaming continuous media through the cache. 
For example, if an adapter is equipped with on-board memory, data could be transferred from that device by copying pointers to the data (in the form of STREAMS messages wrapped around adapter buffers, for example) rather than the data itself. Alternatively, data could be transferred between adapters using DMA via main memory and, provided that protocol processing does not require touching the entire packet, the impact on the cache is minimised. The concepts of kernel-level and hardware-level streaming are not new [4, 8, 18]. We believe that kernel-level streaming is a good compromise between hardware-level streaming and user-level streaming. Hardware-level streaming does not involve the processor: all manipulation of the data, including protocol processing, must be performed by intelligent peripherals. User-level streaming is the most flexible but incurs the highest costs in terms of performance and resource usage. We observe that, unlike other implementations [8], the STREAMS I/O sub-system supports kernel-level streaming without requiring modifications to the operating system. We have not addressed the issue of streaming multiple flows between devices within the kernel. Under such circumstances, there will be "cross-talk" between the flows. The FIFO scheduling discipline provided by our prototype implementation does not serve the needs of multiple flows with conflicting "quality of service" requirements. In passing, we note that the STREAMS scheduler supports multiple priority levels which could be exploited in a more sophisticated implementation to minimise cross-talk. Nevertheless, we believe that a better solution would be an operating system that provided a single integrated scheduling mechanism for all CPU activity including that related to inter-device flows, conventional application-terminated flows, and compute-bound tasks.
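The cross-talk concern can be illustrated with a toy queueing model. The Python below is only a sketch under stated assumptions: it is not a model of the STREAMS scheduler or of our prototype, the function name simulate and the flow names "audio" and "bulk" are hypothetical, the workload is invented for illustration, and service is assumed to be non-pre-emptive on a single server. It compares the worst-case queueing delay seen by each flow when all messages share one FIFO band with the delay when the latency-sensitive flow is given its own, higher-priority band.

```python
import heapq


def simulate(arrivals, priority_of):
    """Serve messages on one non-pre-emptive server; return worst wait per flow.

    arrivals: list of (arrival_time, flow, service_time), sorted by arrival_time.
    priority_of: maps a flow name to a band; a lower band is served first.
    """
    pending = []      # heap of (band, arrival_time, seq, flow, service_time)
    worst = {}
    clock = 0
    i = 0
    n = len(arrivals)
    while i < n or pending:
        if not pending and arrivals[i][0] > clock:
            clock = arrivals[i][0]            # server idles until next arrival
        while i < n and arrivals[i][0] <= clock:
            t, flow, cost = arrivals[i]
            heapq.heappush(pending, (priority_of[flow], t, i, flow, cost))
            i += 1
        band, t, _, flow, cost = heapq.heappop(pending)
        worst[flow] = max(worst.get(flow, 0), clock - t)
        clock += cost
    return worst


# Invented workload (arbitrary time units): a burst of five bulk messages at
# time 0, each costing 4 units, plus two short audio messages.
workload = [(0, "bulk", 4)] * 5 + [(1, "audio", 1), (11, "audio", 1)]

# Same band for both flows: ordering falls back to arrival time, i.e. FIFO.
fifo = simulate(workload, {"audio": 0, "bulk": 0})

# Audio in a higher-priority band than bulk.
banded = simulate(workload, {"audio": 0, "bulk": 1})
```

With this workload the worst audio wait falls from 19 time units under FIFO to 3 with priority bands, while the worst bulk wait rises slightly from 16 to 18. The residual audio delay arises because service is non-pre-emptive: an audio message must still wait for any bulk message already being served.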
Conclusions We have reported on the performance characteristics of a number of process and memory models to support high-speed networks within a monolithic UNIX kernel. We have presented the design of a zero-copy device driver for an ATM network adapter. The design uses shared memory which is mapped directly into user address space. We have demonstrated the performance benefits and discussed the implications of such a design. We suggest that existing APIs, such as BSD sockets and SVR4 TLI, are a barrier to achieving high network throughput. We advocate a new style of API in which buffers are allocated by the network subsystem rather than by the application. We suggest that such an API improves the scope for copy avoidance. We have demonstrated that streaming data between two devices in the kernel provides performance benefits in terms of increased throughput, increased CPU availability and reduced jitter. We are not suggesting that kernel-level streaming is the ideal mechanism to manage the resources of a multimedia workstation. Nor do we casually make the suggestion that application functionality be embedded in the kernel. Rather, we view kernel-level streaming as a pragmatic approach to improving support for continuous media within the constraints of a conventional operating system. Acknowledgements We gratefully acknowledge the contribution of our colleagues at the Rutherford Appleton Laboratory, UK, who designed and built the ATM network adapter on which our experiments were based. We would also like to thank the various reviewers who commented on earlier versions of this paper. The work was conducted at the University of Buckingham, UK, as part of the CHARISMA project and funded by the European RACE II research programme (project number R2071). References [1] D Banks and M Prudence, "A High-Performance Network Architecture for a PA-RISC Workstation", IEEE Journal On Selected Areas in Communications, Vol 11, No 2, February 1993. 
[2] DD Clark and DL Tennenhouse, "Architectural Considerations for a New Generation of Protocols", Proc ACM SIGCOMM '90, Philadelphia, USA, September 1990.
[3] C Dalton, G Watson, D Banks, C Calamvokis, A Edwards and J Lumley, "Afterburner", IEEE Network, Vol 7, No 4, pp 36-43, July 1993.
[4] P Druschel, M Abbot, M Pagels, L Peterson, "Network Subsystem Design", IEEE Network, Vol 7, No 4, pp 8-17, July 1993.
[5] P Druschel and L Peterson, "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility", Proc 14th Symposium on Operating System Principles, 1993.
[6] P Druschel, L Peterson, B Davie, "Experience with a High-Speed Network Adaptor: A Software Perspective", Proc ACM SIGCOMM '94, London, England, September 1994.
[7] A Edwards, G Watson, J Lumley, D Banks, C Calamvokis, C Dalton, "User-space Protocols Give High Performance to Applications on a Low-cost Gb/s LAN", Proc ACM SIGCOMM '94, London, England, September 1994.
[8] K Fall and J Pasquale, "Exploiting In-Kernel Data Paths to Improve I/O Throughput and CPU Availability", Proc USENIX Winter Technical Conference, San Diego, CA, USA, January 1993.
[9] Newton Faller, "Measuring the Latency Time of Real-Time Unix-like Operating Systems", ftp://icsi.berkeley.edu/pub/techreports/1992/tr-92-037.ps.Z, 1992.
[10] Berny Goodheart and James Cox, The Magic Garden Explained: The Internals of UNIX System V Release 4, Prentice Hall, ISBN 013-098138-9, 1994.
[11] JL Hennessy and DA Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc, ISBN 1-55860-188-0, 1990.
[12] NC Hutchinson and LL Peterson, "The x-Kernel: An Architecture for Implementing Network Protocols", IEEE Transactions on Software Engineering, Vol 17, No 1, January 1991.
[13] BJ Murphy, CJ Adams, S Zeadally, "The CHARISMA ATM Host Interface", Proc 3rd International Conference on Broadband Islands, Hamburg, Germany, June 1994. Also available electronically as ftp://ftp.cl.cam.ac.uk/users/bjm22/bris94.ps.gz.
[14] JC Mogul and A Borg, "The Effect of Context Switches on Cache Performance", Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
[15] J Nieh, JG Hanko, J Duane Northcutt and GA Wall, "SVR4 UNIX Scheduler Unacceptable for Multimedia Applications", Proc 4th International Workshop on Network and Operating System Support for Digital Audio and Video, Lancaster, UK, 1993.
[16] MA Pagels, P Druschel and LL Peterson, "Cache and TLB Effectiveness in the Processing of Network Data", Technical Report TR 93-4, Dept of Computer Science, University of Arizona, Tucson, USA, 1993.
[17] C Partridge, Gigabit Networking, Addison-Wesley Professional Computing Series, ISBN 0-201-56333-9, 1993.
[18] J Pasquale, "I/O System Design for Intensive Multimedia I/O", Proc IEEE Workshop on Workstation Operating Systems, Key Biscayne, FL, USA, April 1992.
[19] Stephen A Rago, UNIX System V Network Programming, Addison-Wesley Professional Computing Series, ISBN 0-201-56318-5, 1993.
[20] Curt Schimmel, UNIX Systems for Modern Architectures, Addison-Wesley Professional Computing Series, ISBN 0-201-63338-8, 1994.
[21] JM Smith and CBS Traw, "Giving Applications Access to Gb/s Networking", IEEE Network, Vol 7, No 4, pp 44-52, July 1993.
[22] DL Tennenhouse, "Layered Multiplexing Considered Harmful", Proc IFIP Workshop on Protocols for High-Speed Networks, Zurich, Switzerland, May 1989.
[23] C Thekkath, T Nguyen, E Moy, E Lazowska, "Implementing Network Protocols at User Level", Proc ACM SIGCOMM '93, San Francisco, USA, September 1993.
[24] UNIX International OSI Special Interest Group, Data Link Provider Interface Specification, UNIX International, Waterview Corporate Center, 20 Waterview Boulevard, Parsippany, NJ 07054, USA, Revision 2.0.0, August 1991.
Author Information Brendan Murphy is a post-doctoral Research Associate at the University of Cambridge currently working on the integration of continuous media "streams" into a CORBA environment. His other research interests include network protocol design and operating system support for high-speed networks. Sherali Zeadally is finishing his PhD in Computer Science at the University of Buckingham. His research interests include operating system support for multimedia, I/O sub-system design, especially to support high-speed networks, and distributed systems. Chris Adams is Professor of Computer Science at the University of Buckingham and a member of the Advanced Communications Unit at the Rutherford Appleton Laboratory. His research interests range from global network strategies to hacking processor microcode. They include satellite communications, multimedia applications, multiservice networks and operating systems. The authors can be reached via electronic mail to charisma@buck.ac.uk. An electronic version of this paper is available as ftp://ftp.cl.cam.ac.uk/users/bjm22/usenix96.ps.gz. Further information on the CHARISMA project can be found via https://www.brookes.ac.uk/cms/research/dsproj.html#CHARISMA.