Connection Handoff Policies for TCP Offload Network Interfaces

Hyong-youb Kim and Scott Rixner
Rice University
Houston, TX 77005
{hykim, rixner}@rice.edu

Abstract

This paper presents three policies for effectively utilizing TCP offload network interfaces that support connection handoff. These policies allow connection handoff to reduce the computation and memory bandwidth requirements for packet processing on the host processor without causing the resource constraints on the network interface to limit overall system performance. First, prioritizing packet processing on the network interface ensures that its TCP processing does not harm performance of the connections on the host operating system. Second, dynamically adapting the number of connections on the network interface to the current load avoids overloading the network interface. Third, the operating system can predict connection lifetimes to select long-lived connections for handoff to better utilize the network interface. The use of the first two policies improves web server throughput by 12-31% over the baseline throughput achieved without offload. The third policy helps improve performance when the network interface can only handle a small number of connections at a time. Furthermore, by using a faster offload processor, offloading can improve server throughput by 33-72%.

1  Introduction

In order to address the increasing bandwidth demands of modern networked computer systems, there has been significant interest in offloading TCP processing from the host operating system to the network interface card (NIC). TCP offloading can potentially reduce the number of host processor cycles spent on networking tasks, reduce the amount of local I/O interconnect traffic, and improve overall network throughput [14,18,27,30,31]. However, the maximum packet rate and the maximum number of connections supported by an offloading NIC are likely to be less than those of a modern microprocessor due to resource constraints on the network interface [24,28,30]. Since the overall capabilities of an offloading NIC may lag behind modern microprocessors, the system should not delegate all TCP processing to the NIC. Nevertheless, a custom processor with fast local memory on a TCP offloading NIC should still be able to process packets efficiently, so the operating system should treat such a NIC as an acceleration coprocessor and use as many of the NIC's resources as possible in order to speed up a portion of network processing.
Connection handoff has been proposed as a mechanism to allow the operating system to selectively offload a subset of the established connections to the NIC [18,22,23]. Once a connection is handed off to the NIC, the operating system switches the protocol for that connection from TCP to a stateless bypass protocol that simply forwards application requests to send or receive data to the NIC. The NIC then performs all of the TCP processing for that connection. If necessary, the operating system can also reclaim the connection from the NIC after it has been handed off. Thus, the operating system retains complete control over the networking subsystem and can control the division of work between the NIC and host processor(s). At any time, the operating system can easily opt to reduce the number of connections on the NIC or not to use offload at all. Furthermore, with handoff, the NIC does not need to make routing decisions or allocate port numbers because established connections already have correct routes and ports.
While previous proposals have presented interfaces and implementations of TCP offload NICs that support connection handoff, they have not presented policies to utilize these NICs effectively. This paper shows that there are three main issues in systems that utilize connection handoff and evaluates policies to address these issues. First, the NIC must ensure that TCP processing on the NIC does not degrade the performance of connections that are being handled by the host. Second, the operating system must not overload the NIC since that would create a bottleneck in the system. Finally, when the NIC is only able to store a limited number of connections, the operating system needs to hand off long-lived connections with high packet rates in order to better utilize the NIC.
Full-system simulations of four web workloads show that TCP processing on the NIC can degrade the performance of connections being handled by the host by slowing down packet deliveries to the host processor. The amount of time for a packet to cross the NIC increases from under 10 microseconds without handoff to over 1 millisecond with handoff. For three of the four workloads, the resulting request rate of the server is 14-30% lower than the baseline request rate achieved without handoff. The NIC can minimize delays by giving priority to those packets that must be delivered to the host. The use of this host first packet processing on the NIC increases the server's request rate over the baseline by up to 24%, when the NIC is not overloaded. However, the NIC still becomes overloaded when too many connections are handed off to the NIC, which reduces the request rate of the server by up to 44% below the baseline performance. The NIC can avoid overload conditions by dynamically adapting the number of connections to the current load indicated by the length of the receive packet queue. By using both techniques, handoff improves the request rate of the server by up to 31% over the baseline throughput. When the NIC can support a large number of connections, handing off connections in a simple first-come, first-served order is sufficient to realize these performance improvements. However, when the NIC has limited memory for storing connection state, handing off long-lived connections helps improve performance over a simple first-come, first-served handoff policy. Finally, using a NIC with a faster offload processor, handoff improves server throughput by 33-72%.
The rest of the paper is organized as follows. Section 2 briefly describes connection handoff. Section 3 presents techniques to control the division of work between the NIC and the operating system. Section 4 describes the experimental setup, and Section 5 presents results. Section 6 discusses related work. Section 7 draws conclusions.

2  Connection Handoff

TCP packet processing time is dominated by expensive main memory accesses, not computation [10,17]. These main memory accesses occur when the network stack accesses packet data and connection data structures, which are rarely found in the processor caches. Accesses to packet data are due to data touching operations like data copies and checksum calculations. These accesses can be eliminated by using zero-copy I/O techniques to avoid data copies between the user and kernel memory spaces [9,12,26] and checksum offload techniques to avoid computing TCP checksums on the host processor [19]. These techniques have gained wide acceptance in modern operating systems and network interfaces. However, no such techniques exist to eliminate accesses to connection data structures. While these structures are small, around 1KB per connection, a large number of connections can easily overwhelm modern processor caches, significantly degrading performance. Previous experiments show that main memory accesses to connection data structures can degrade performance as much as data touching operations [17].
TCP offload can be used to reduce the impact of expensive main memory accesses to connection data structures. By moving TCP processing to the NIC, connection data structures can be efficiently stored in fast, dedicated memories on the NIC. Typically, all TCP processing is moved to the NIC. However, such full TCP offload is not scalable. First, the resource limitations of a peripheral device will limit the maximum processing capability and memory capacity of a TCP offload NIC. Second, TCP offload complicates the existing software architecture of the network stack, since the operating system and the NIC now need to cooperatively manage global resources like port numbers and IP routes [24]. Connection handoff solves these problems by enabling the host operating system to select a subset of the established connections and move them to the network interface [18,22,23]. Using handoff, the operating system remains in control of global resources and can utilize TCP offload NICs to accelerate as many TCP connections as the resources on the NIC will allow.
Figure 1: Modified network stack architecture to support connection handoff. The shaded region indicates the path used by connections on the NIC.
Using connection handoff, all connections are established within the network stack of the host operating system. Then, if the operating system chooses to do so, the connection can be handed off to the NIC. Once a connection is handed off to the NIC, the NIC handles all TCP packets for that connection. Figure 1 shows a diagram of a network stack that supports connection handoff. The left portion of the diagram depicts the regular (non-offload) stack, and the dotted lines show data movement. When a connection is offloaded to the NIC, the operating system switches its stack to the right (shaded) portion of the diagram. This new stack executes a simple bypass protocol on the host processor for offloaded connections, and the rest of the TCP/IP stack is executed directly on the NIC. The bypass protocol forwards socket operations between the socket layers on the host and the NIC via the device driver, as shown by the dashed lines in the figure. For an offloaded connection, packets are generated by the NIC and also completely consumed by the NIC, so packet headers are never transferred across the local I/O interconnect. The solid lines show the packet movement within the NIC. The lookup layer on the NIC determines whether an incoming packet belongs to a connection on the NIC or to a connection on the host. It only adds a small amount of overhead to all incoming packets by using a hash-based lookup table [21].
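As a concrete illustration, the lookup step might resemble the following C sketch, which assumes a chained hash table keyed on the TCP 4-tuple; the names and table size here are hypothetical rather than taken from the actual firmware.

    /*
     * Sketch of the NIC's lookup layer: a chained hash table keyed on
     * the TCP 4-tuple decides whether a received packet belongs to an
     * offloaded connection. Table size and names are illustrative.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define LOOKUP_BUCKETS 4096          /* power of two, so masking works */

    struct conn_key {
        uint32_t local_ip, remote_ip;
        uint16_t local_port, remote_port;
    };

    struct nic_conn {
        struct conn_key  key;
        struct nic_conn *hash_next;      /* chain for colliding keys */
        /* ... offloaded TCP state, roughly 1KB per connection ... */
    };

    static struct nic_conn *buckets[LOOKUP_BUCKETS];

    static uint32_t conn_hash(const struct conn_key *k)
    {
        uint32_t h = k->local_ip ^ k->remote_ip;
        h ^= ((uint32_t)k->local_port << 16) | k->remote_port;
        h ^= h >> 16;                    /* cheap mixing; any decent hash works */
        return h & (LOOKUP_BUCKETS - 1);
    }

    /* Returns the offloaded connection for the packet's 4-tuple, or
     * NULL if the packet belongs to a host connection. */
    struct nic_conn *lookup_rx(const struct conn_key *k)
    {
        struct nic_conn *c;
        for (c = buckets[conn_hash(k)]; c != NULL; c = c->hash_next)
            if (c->key.local_ip   == k->local_ip   &&
                c->key.remote_ip  == k->remote_ip  &&
                c->key.local_port == k->local_port &&
                c->key.remote_port == k->remote_port)
                return c;
        return NULL;                     /* host packet: DMA up as usual */
    }

On a miss the packet is a host packet and is simply transferred to main memory as a conventional NIC would; on a hit the firmware performs its own TCP input processing.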
The bypass layer communicates with the device driver using a connection handoff API. So, the operating system can transparently support multiple, heterogeneous NICs using the same API. Furthermore, the API also allows the actual socket buffer data to be stored in main memory in order to reduce the amount of buffering required on the NIC.
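For illustration only, such a driver-level API might look like the following C sketch; the operation names, signatures, and types are assumptions made for exposition and not the prototype's actual interface [18].

    /*
     * A hedged sketch of what a connection handoff API between the
     * bypass layer and the device driver might contain. All names and
     * signatures are invented for illustration.
     */
    struct nic;            /* per-device state */
    struct tcp_state;      /* TCP control block being moved */
    struct socket;         /* host socket for the connection */
    struct conn_handle;    /* NIC-side identifier after a handoff */
    struct mbuf;           /* socket buffer data kept in main memory */

    struct handoff_ops {
        /* Move an established connection down to the NIC; may fail,
         * e.g. when the NIC has reached its connection limit. */
        int (*ho_handoff)(struct nic *n, struct tcp_state *ts,
                          struct socket *so, struct conn_handle **ch);

        /* Bypass protocol: forward socket send/receive events. Data
         * stays in main memory; the NIC fetches it by DMA as needed. */
        int (*ho_send)(struct nic *n, struct conn_handle *ch, struct mbuf *m);
        int (*ho_rcvd)(struct nic *n, struct conn_handle *ch, unsigned consumed);

        /* Reclaim the connection; the NIC returns up-to-date TCP state
         * so the host stack can resume processing it. */
        int (*ho_restore)(struct nic *n, struct conn_handle *ch,
                          struct tcp_state *ts);
    };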
Previous experiments show that TCP offload, whether it is based on connection handoff or not, can reduce cycles, instructions, and cache misses on the host CPU as well as traffic across the local I/O interconnect [14,18]. Since socket-level operations occur less frequently than Ethernet packet transmits and receives, handoff can reduce the number of message exchanges across the local I/O interconnect. For instance, because acknowledgment packets (ACKs) are processed by the NIC, the NIC may aggregate multiple ACKs into one message so that the host operating system can drop the acknowledged data in a single socket operation.
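A minimal sketch of this kind of ACK aggregation, with all names hypothetical:

    /*
     * Instead of notifying the host for every ACK, the firmware
     * accumulates newly acknowledged bytes and posts one message, so
     * the host can free the acked data with a single socket-buffer
     * operation. Illustrative only.
     */
    #include <stdint.h>

    struct off_conn {
        uint32_t snd_una;       /* oldest unacknowledged sequence number */
        uint32_t acked_bytes;   /* bytes acked since the last host message */
    };

    extern void post_acked_message(struct off_conn *c, uint32_t nbytes);

    void process_ack(struct off_conn *c, uint32_t ack_seq)
    {
        uint32_t newly_acked = ack_seq - c->snd_una;  /* mod-2^32 arithmetic */
        c->snd_una = ack_seq;
        c->acked_bytes += newly_acked;
    }

    /* Called at a batch boundary, e.g. once per firmware loop iteration. */
    void flush_acked(struct off_conn *c)
    {
        if (c->acked_bytes > 0) {
            post_acked_message(c, c->acked_bytes);    /* one message, many ACKs */
            c->acked_bytes = 0;
        }
    }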

3  Connection Handoff Framework

Figure 2: Framework for connection handoff and dynamically controlling the load on the NIC.
In addition to the features described in the previous section, both the operating system and the network interface must also implement policies to ensure that the performance of connections that are being handled by the host operating system is not degraded, that the network interface does not become overloaded, and that the appropriate connections are handed off to the network interface. Figure 2 illustrates the proposed framework for the implementation of such packet prioritization, load control, and connection selection policies. These policies are integral to the effective utilization of connection handoff network interfaces. The proposed framework and policies will be discussed in detail in the following sections; all supporting data was collected while running a simulated web server using the methodology that will be described in Section 4.

3.1  Priority-based Packet Processing

Traditional NICs act as a bridge between the network and main memory. There is very little processing required to send or receive packets from main memory (host packets), so the network interface can process each packet fairly quickly. However, with connection handoff, the NIC must perform TCP processing for packets that belong to connections that have been handed off to the network interface (NIC packets). Because NIC packets require significantly more processing than host packets, NIC packet processing can delay the processing of host packets, which may reduce the throughput of connections that remain within the host operating system. In the worst case, the NIC can become overloaded with TCP processing and drop host packets, which further reduces the throughput of host connections.
Offloaded    Packet      Host Packet Delay (usec)   Host      NIC       Requests/s
Connections  Priority    Send        Receive        Idle (%)  Idle (%)
0            no handoff  2           2              0         62        23031
256          FCFS        3           3              0         49        23935
1024         FCFS        679         3016           62        3         16121
1024         Host first  10          6              0         5         26663
Table 1: Impact of a heavily loaded NIC on the networking performance of a simulated web server. Packet delays are median values, not averages. Idle represents the fraction of total cycles that are idle.
Table 1 illustrates the effect of connection handoff on host packet delays and overall networking performance for a web workload that uses 2048 simultaneous connections. The first row of the table shows the performance of the system when no connections are handed off to the NIC. In this case, the web server is able to satisfy 23031 requests per second with a median packet processing time on the NIC of only 2 us for host packets. The second row shows that when 256 of the 2048 connections (12.5%) are handed off to the NIC, the request rate increases by 4% with only slight increases in host packet processing time. However, as shown in the third row, when 1024 connections are handed off to the NIC, the NIC is nearly saturated and becomes the bottleneck in the system. The median packet delay on the NIC for host packets increases dramatically and the NIC drops received packets as the receive queue fills up. As a result, the server's request rate drops by 30%.
For both the second and third rows of Table 1, the NIC processes all packets on a first-come, first-served (FCFS) basis. As the load on the NIC increases, host packets suffer increasing delays. Since an offloading NIC is doing additional work for NIC packets, some increase in host packet delay is inevitable. However, such delays need to be minimized in order to maintain the performance of host connections. Since host packets require very little processing on the NIC, they can be given priority over NIC packets without significantly reducing the performance of connections that have been handed off to the NIC. The priority queues, shown in Figure 2, enable such prioritization. As discussed in Section 2, all received packets must first go through the lookup layer, which determines whether a packet belongs to a connection on the NIC. After the lookup completes, the NIC places each packet on one of two queues: one holds only host packets, and the other holds only NIC packets. The NIC also maintains a queue of host packets to be sent and another queue of handoff command messages from the host. In order to give priority to host packets, the NIC always processes the queue of received host packets and the queue of host packets to be sent before NIC packets and handoff command messages.
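In firmware terms, the discipline reduces to a strict priority order in the main event loop. The following C sketch assumes the queue and handler primitives are provided elsewhere in the firmware; only the ordering is the point here.

    /*
     * Host first scheduling: drain both host-packet queues before
     * doing any TCP work for offloaded connections or processing
     * handoff commands. All primitives below are assumed firmware
     * helpers, not actual firmware code.
     */
    struct pkt;                          /* opaque packet or message descriptor */
    struct queue;                        /* opaque FIFO of descriptors */

    extern struct queue *rx_host;        /* received packets for the host */
    extern struct queue *tx_host;        /* host packets to transmit */
    extern struct queue *rx_nic;         /* received packets for NIC connections */
    extern struct queue *cmd_q;          /* handoff commands from the driver */

    extern int  queue_empty(struct queue *q);
    extern struct pkt *dequeue(struct queue *q);
    extern void deliver_to_host(struct pkt *p);  /* DMA up and interrupt */
    extern void transmit(struct pkt *p);         /* hand to the MAC */
    extern void tcp_input_nic(struct pkt *p);    /* full TCP processing */
    extern void handle_command(struct pkt *m);   /* handoff or restore message */
    extern void wait_for_event(void);

    void firmware_loop(void)
    {
        for (;;) {
            if (!queue_empty(rx_host))           /* host packets first: cheap */
                deliver_to_host(dequeue(rx_host));
            else if (!queue_empty(tx_host))
                transmit(dequeue(tx_host));
            else if (!queue_empty(rx_nic))       /* then offloaded TCP work */
                tcp_input_nic(dequeue(rx_nic));
            else if (!queue_empty(cmd_q))
                handle_command(dequeue(cmd_q));
            else
                wait_for_event();
        }
    }

Because host packets are so cheap to process, serving them first delays NIC packets only marginally.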
The fourth row of Table 1 shows the impact of using the priority queues to implement a host first packet processing policy on the NIC. With the host first policy on the NIC, the median packet delay of host packets is about 6-10 us even though 1024 connections are handed off to the NIC. Handoff now results in a 16% improvement in request rate. Thus, this simple prioritization scheme can be an effective mechanism to ensure that TCP processing on the NIC does not hurt the performance of connections that are handled by the host operating system. Further evaluation is presented in Section 5.

3.2  Load Control

Packet prioritization can ensure that host packets are handled promptly, even when the NIC is heavily loaded. However, this will not prevent the NIC from becoming overloaded to the point where there are not enough processing resources remaining to process NIC packets. In such an overloaded condition, the network interface becomes a system bottleneck and degrades the performance of connections that have been handed off to the network interface. A weighted sum of the packet rate of NIC packets and the packet rate of host packets is an approximate measure of the load on the network interface. The number of connections that have been handed off to the NIC indirectly determines this load. In general, increasing the number of connections on the NIC increases the load on the NIC because more connections tend to increase overall packet rates. Likewise, decreasing the number of connections generally reduces the load on the NIC.
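Written out (an illustrative rendering, not a quantity the mechanism actually computes), the load is approximately w_nic × R_nic + w_host × R_host, where R_nic and R_host are the NIC and host packet rates and w_nic > w_host because a NIC packet requires full TCP processing while a host packet merely crosses the NIC. Because the weights depend heavily on the workload, the mechanism described below tracks the receive queue length instead of evaluating this sum directly.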
Due to the finite amount of memory on the NIC, there is a hard limit on the total number of connections that can be handed off to the NIC. However, depending on the workload and the available processing resources on the NIC, the NIC may become saturated well before the number of connections reaches the hard limit. Therefore, the network interface must dynamically control the number of connections that can be handed off based on the current load on the network interface.
As discussed in Section 3.1, the NIC maintains a queue of received NIC packets. As the load on the NIC increases, the NIC cannot service the receive queue as promptly. Therefore, the number of packets in the receive queue (queue length) is a good indicator of the current load on the NIC. This holds for send-dominated, receive-dominated, and balanced workloads. For send-dominated workloads, the receive queue mainly stores acknowledgment packets. A large number of ACKs on the receive queue indicates that the load is too high because the host operating system is sending data much faster than the NIC can process ACKs returning from a remote machine. For receive-dominated workloads, the receive queue mainly stores data packets from remote machines. A large number of data packets on the receive queue indicates that data packets are being received much faster than the NIC can process and acknowledge them. In balanced workloads, a combination of the above factors will apply. Therefore, a large number of packets in the receive queue indicates that the NIC is not processing received packets in a timely manner, which increases packet delays for all connections.
The NIC uses six parameters to control the load on the network interface: Hard_limit, Soft_limit, Qlen, Hiwat, Lowat, and Cnum. Hard_limit is the maximum possible number of connections that can be handed off to the network interface and is determined based on the amount of physical memory available on the network interface. Hard_limit is set when the network interface firmware is initialized and remains fixed. Soft_limit is the current maximum number of connections that may be handed off to the NIC. This parameter is initially set to Hard_limit, and is always less than or equal to Hard_limit. Qlen is the number of NIC packets currently on the receive packet queue. Hiwat is the high watermark for the receive packet queue. When Qlen exceeds Hiwat, the network interface is overloaded and must begin to reduce its load. A high watermark is needed because once the receive packet queue becomes full, packets will start to be dropped, so the load must be reduced before that point. Similarly, Lowat is the low watermark for the receive packet queue, indicating that the network interface is underloaded and should allow more connections to be handed off, if they are available. As with Hard_limit, Hiwat and Lowat are constants that are set upon initialization based upon the processing capabilities and memory capacity of the network interface. For example, a faster NIC with larger memory can absorb bursty traffic better than a slower NIC with smaller memory, so its watermark values should be higher. Currently, Hiwat and Lowat need to be set empirically. However, since the values only depend on the hardware capabilities of the network interface, only the network interface manufacturer would need to tune the values, not the operating system. Finally, Cnum is the number of currently active connections on the NIC.
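Rendered as a C struct for illustration (the field comments restate the invariants from the text; the struct itself is not firmware source):

    /* The six load-control parameters of Section 3.2. */
    struct nic_load_ctl {
        unsigned hard_limit; /* absolute max connections; fixed at firmware init */
        unsigned soft_limit; /* current cap; starts at hard_limit, never exceeds it */
        unsigned qlen;       /* NIC packets currently on the receive queue */
        unsigned hiwat;      /* qlen > hiwat: overloaded, reduce soft_limit */
        unsigned lowat;      /* qlen < lowat: underloaded, raise soft_limit */
        unsigned cnum;       /* connections currently active on the NIC */
    };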
Figure 3: State machine used by the NIC firmware to dynamically control the number of connections on the NIC.
Figure 3 shows the state machine employed by the NIC in order to dynamically adjust the number of connections. The objective of the state machine is to maintain Qlen between Lowat and Hiwat. When Qlen grows above Hiwat, the NIC is assumed to be overloaded and should attempt to reduce the load by reducing Soft_limit. When Qlen drops below Lowat, the NIC is assumed to be underloaded and should attempt to increase the load by increasing Soft_limit.
The state machine starts in the monitor state. When Qlen becomes greater than Hiwat, the NIC reduces Soft_limit, sends a message to the device driver to advertise the new value, and transitions to the decrease state. While in the decrease state, the NIC waits for connections to terminate. Once Cnum drops below Soft_limit, the state machine transitions back to the monitor state to assess the current load. If Qlen decreases below Lowat, then the NIC increases Soft_limit, sends a message to the device driver to notify it of the new Soft_limit, and transitions to the increase state. In the increase state, the NIC waits for new connections to arrive. If Cnum increases to Soft_limit, then the NIC transitions to the monitor state. However, while in the increase state, if Qlen increases above Hiwat, then the NIC reduces Soft_limit, sends a message to the device driver to alert it of the new value, and transitions to the decrease state. The state machine is simple and runs only when a packet arrives, the host hands off a new connection to the NIC, or an existing connection terminates. Thus, the run-time overhead of the state machine is insignificant.
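In code form, the transitions of Figure 3 might look like the following sketch, building on the nic_load_ctl struct above; shrink(), grow(), and advertise_soft_limit() are assumed helpers that pick the new limit and send the notification message to the device driver.

    enum lc_state { LC_MONITOR, LC_DECREASE, LC_INCREASE };

    extern unsigned shrink(unsigned soft_limit);
    extern unsigned grow(unsigned soft_limit, unsigned hard_limit);
    extern void advertise_soft_limit(unsigned soft_limit);

    /* Run on packet arrival, connection handoff, and connection
     * termination; cheap enough that its overhead is insignificant. */
    enum lc_state lc_step(struct nic_load_ctl *lc, enum lc_state s)
    {
        switch (s) {
        case LC_MONITOR:
            if (lc->qlen > lc->hiwat) {              /* overloaded */
                lc->soft_limit = shrink(lc->soft_limit);
                advertise_soft_limit(lc->soft_limit);
                return LC_DECREASE;
            }
            if (lc->qlen < lc->lowat) {              /* underloaded */
                lc->soft_limit = grow(lc->soft_limit, lc->hard_limit);
                advertise_soft_limit(lc->soft_limit);
                return LC_INCREASE;
            }
            return LC_MONITOR;
        case LC_DECREASE:
            /* Passively wait for enough connections to terminate. */
            return lc->cnum < lc->soft_limit ? LC_MONITOR : LC_DECREASE;
        case LC_INCREASE:
            if (lc->qlen > lc->hiwat) {              /* overload while growing */
                lc->soft_limit = shrink(lc->soft_limit);
                advertise_soft_limit(lc->soft_limit);
                return LC_DECREASE;
            }
            /* Wait for new handoffs to fill up to Soft_limit. */
            return lc->cnum >= lc->soft_limit ? LC_MONITOR : LC_INCREASE;
        }
        return s;
    }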
As mentioned above, the NIC passively waits for connections to terminate while in the decrease state. Instead, the NIC could also actively restore connections to the host operating system and recover from an overload condition more quickly. The described framework easily supports such active restoration to the operating system. However, for this to be effective, the NIC would also need a mechanism to determine which connections are generating the most load and so should be restored first. Actively restoring connections in this manner was not necessary for the workloads studied in this paper, but it may help improve performance for other types of workloads.
While the receive packet queue length is easy to exploit, there are other measures of load, such as idle time and packet rate. The network interface could calculate either of these metrics directly and use them to control the load. However, packet processing time is extremely dependent on the workload. Therefore, metrics such as the packet rate are difficult to use, as they are not directly related to the load on the NIC. This makes it more desirable to control the load based on resource use, such as the length of the receive queue or the idle time, than based on packet rate. The receive queue length was chosen over idle time because it requires no NIC resources to compute.

3.3  Connection Selection

Whenever the network interface can handle additional connections, the operating system attempts to hand off established connections. The connection selection policy component of the framework depicted in Figure 2 decides whether the operating system should attempt to hand off a given connection. As described previously, the device driver then performs the actual handoff. The operating system may attempt handoff at any time after a connection is established. For instance, it may hand off a connection right after it is established, or when a packet is sent or received. If a handoff attempt fails, the operating system can try to hand off the connection again in the future. For simplicity, the current framework invokes the selection policy upon connection establishment or upon a send request by the application, and does not consider a connection for handoff again if its first handoff attempt fails.
The simplest connection selection policy is first-come, first-served. If all connections in the system have similar packet rates and lifetimes, then this is a reasonable choice, as all connections will benefit equally from offload. However, if connections in the system exhibit widely varying packet rates and lifetimes, then it is advantageous to consider the expected benefit of offloading a particular connection. These properties are highly dependent on the application, so a single selection policy may not perform well for all applications. Since applications typically use specific ports, the operating system should be able to employ multiple application-specific (per-port) connection selection policies.
Furthermore, the characteristics of the NIC can influence the types of connections that should be offloaded. Some offload processors may only be able to handle a small number of connections, but very quickly. For such offload processors, it is advantageous to hand off connections with high packet rates in order to fully utilize the processor. Other offload processors may have larger memory capacities, allowing them to handle a larger number of connections, but not as quickly. For these processors, it is more important to hand off as many connections as possible.
The expected benefit of handing off a connection is the packet processing savings over the lifetime of the connection minus the cost of the handoff. Here, the lifetime of a connection refers to the total number of packets sent and/or received through the connection. Therefore, it is clear that offloading a long-lived connection is more beneficial than offloading a short-lived one. A long-lived connection accumulates enough per-packet savings to compensate for the handoff cost and also produces greater total savings than a short-lived connection over its lifetime.
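In symbols (an illustrative rendering): with s the per-packet processing savings, L the connection's lifetime in packets, and h the one-time handoff cost, the expected benefit is s × L − h, so a handoff pays off only when L > h / s. A threshold on the observed packet count approximates this test without knowing L in advance.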
Figure 4: Distribution of connection lifetimes from SPECweb99 and the IBM and World Cup traces. Connection rank is based on the number of sent packets.
In order for the operating system to compute the expected benefit of handing off a connection, it must be able to predict the connection's lifetime. Fortunately, certain workloads, such as web requests, show characteristic connection lifetime distributions, which can be used to predict a connection's lifetime. Figure 4 shows the distribution of connection lifetimes from several web workloads. The figure plots the cumulative fraction of sent packets and sent bytes of all connections over the length of the run. As shown in the figure, there are many short-lived connections, but the packets due to these connections account for a small fraction of total packets and bytes. For example, for all three workloads, half of the connections are responsible for sending less than 10% of all packets. The other half of the connections send the remaining 90% of the packets. In fact, more than 45% of the total traffic is handled by less than 10% of the connections. The data shown in Figure 4 assumes that persistent connections are used. A persistent connection allows the client to reuse the connection for multiple requests. Persistent connections increase the average lifetime, but do not change the shape of the distribution of lifetimes. Previous studies have shown that web workloads exhibit this kind of distribution [2,7,8]. The operating system may exploit such a distribution in order to identify and hand off long-lived connections. For instance, since the number of packets transferred over a long-lived connection far exceeds that of a short-lived connection, the system can use a threshold to differentiate long-lived and short-lived connections. The operating system can simply keep track of the number of packets sent over a connection and hand it off to the NIC only when the number reaches a certain threshold.
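A sketch of such a threshold policy (the Tn policies evaluated in Section 5), counting send-buffer enqueues as a proxy for sent packets; the hook and helper names are hypothetical, and in the prototype this logic would sit in the socket layer's send path.

    /*
     * Threshold selection policy (Tn): count enqueue operations to a
     * connection's send socket buffer and attempt a handoff once the
     * count reaches the threshold. attempt_handoff() is assumed to
     * fail if the NIC has reached its current Soft_limit.
     */
    #define HANDOFF_THRESHOLD 20         /* e.g. the T20 policy */

    struct connection;                   /* host TCP connection state */
    extern int attempt_handoff(struct connection *c);

    struct sel_state {
        unsigned enqueues;               /* send-buffer enqueues so far */
        int      tried;                  /* at most one handoff attempt */
    };

    /* Called from the socket layer on each send-buffer enqueue. */
    void select_on_send(struct sel_state *st, struct connection *c)
    {
        if (st->tried)
            return;
        if (++st->enqueues >= HANDOFF_THRESHOLD) {
            st->tried = 1;               /* no retry if the NIC refuses */
            (void)attempt_handoff(c);
        }
    }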

4  Experimental Setup

The authors have previously implemented connection handoff within FreeBSD 4.7 using the architecture described in Section 2 [18]. To evaluate the policies described in Section 3, this prototype is augmented with those policies using the existing framework. Since, to the best of the authors' knowledge, there are no offload controllers with open specifications, an extended version of the full-system simulator Simics [20] is used for performance evaluations. Simics models the system hardware in enough detail to run complete and unmodified operating systems.

4.1  Simulation Setup

Simics is a functional full system simulator that allows the use of external modules to enforce timing. For the experiments, Simics has been extended with a memory system timing module and a network interface card module. The processor core is configured to execute one x86 instruction per cycle unless there are memory stalls. The memory system timing module includes a cycle accurate cache, memory controller, and DRAM simulator. All resource contention, latencies, and bandwidths within the memory controller and DRAM are accurately modeled [29]. Table 2 summarizes the simulator configuration.
The network interface simulator models a MIPS processor, 32 MB of memory, and several hardware components: PCI and Ethernet interfaces and a timer. The PCI and Ethernet interfaces provide direct memory access (DMA) and medium access control (MAC) capabilities, respectively. These are similar to those found on the Tigon programmable Gigabit Ethernet controller from Alteon [1]. Additionally, checksums are computed in hardware on the network interface. The firmware of the NIC uses these checksum values to support checksum offload for host packets and to avoid computing the checksums of NIC packets in software. The NIC does not employ any other hardware acceleration features such as hardware connection lookup tables [15]. The processor on the NIC runs the firmware and executes one instruction per cycle at a rate of 400, 600, or 800 million instructions per second (MIPS). The instruction rate is varied to evaluate the impact of NIC performance. Modern embedded processors are capable of such instruction rates with low power consumption [11]. At 400MIPS, the NIC can achieve 1Gb/s of TCP throughput for one offloaded connection and another 1Gb/s for a host connection simultaneously, using maximum-sized 1518B Ethernet frames. The maximum number of connections that can be stored on the NIC is also varied in order to evaluate the impact of the amount of memory dedicated for storing connections. The network wire is set to run at 10Gb/s in order to eliminate the possibility of the wire being the bottleneck. The local I/O interconnect is not modeled due to its complexity. However, DMA transfers still correctly invalidate processor cache lines, as others have shown the importance of invalidations due to DMA [5].
The testbed consists of a server and a client machine, directly connected through a full-duplex 10Gb/s wire. Both are simulated using Simics. The server uses the configuration shown in Table 2, while the client is simulated purely functionally, so it will never be a performance bottleneck.
Configuration
CPU       Functional, single-issue, 2GHz x86 processor;
          instantaneous instruction fetch
L1 cache  64KB data cache; line size: 64B; associativity: 2-way;
          hit latency: 1 cycle
L2 cache  1MB data cache; line size: 64B; associativity: 16-way;
          hit latency: 15 cycles; prefetch: next-line on a miss
DRAM      DDR333 SDRAM, 2GB; access latency: 195 cycles
NIC       Functional, single-issue processor; varied instruction rates;
          varied maximum number of connections; 10Gb/s wire
Table 2: Simulator configuration.

4.2  Web Workloads

The experiments use SPECweb99 and two real web traces to drive the Flash web server [25]. SPECweb99 emulates multiple simultaneous clients. Each client issues requests for both static content (70%) and dynamic content (30%) and tries to maintain its bandwidth between 320Kb/s and 400Kb/s. The request sizes are statistically generated using a Zipf-like distribution in which a small number of files receive most of the requests. For static content, Flash sends HTTP response data through zero-copy I/O (the sendfile system call). All other types of data including HTTP headers and dynamically generated responses are copied between the user and kernel memory spaces.
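For reference, the zero-copy send that Flash issues for a static response looks roughly like the following on FreeBSD; error handling and HTTP header transmission (which sendfile can perform via its sf_hdtr argument) are omitted from this sketch.

    /*
     * Zero-copy static response on FreeBSD: sendfile() lets the kernel
     * move file pages to the socket without copying them through user
     * space. Trimmed sketch; descriptors are assumed already open.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    ssize_t send_static(int file_fd, int sock_fd, off_t off, size_t len)
    {
        off_t sbytes = 0;
        if (sendfile(file_fd, sock_fd, off, len, NULL, &sbytes, 0) == -1)
            return -1;            /* caller handles errno (e.g. EAGAIN) */
        return (ssize_t)sbytes;   /* bytes actually queued to the socket */
    }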
The two web traces are from an IBM web site and the web site for the 1998 Soccer World Cup. A simple trace replayer program reads requests contained in the traces and sends them to the web server [4]. Like SPECweb99, the client program emulates multiple simultaneous clients. Unlike SPECweb99, it generates requests for static content only and sends new requests as fast as the server can handle them. Both the replayer and SPECweb99 use persistent connections by default. The replayer uses a persistent connection for requests from the same client that arrive within a fifteen-second period in the given trace. SPECweb99 statistically chooses to use persistent connections for a fraction of all requests. To compare SPECweb99 against the two traces, the experiments also evaluate a non-default SPECweb99 configuration in which all requests are for static content (SPECweb99 Static).
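One plausible reading of the replayer's reuse rule, as a C sketch (the replayer's actual bookkeeping may differ):

    /*
     * A client's next request reuses the open persistent connection if
     * it arrives within fifteen seconds of the previous one in the
     * trace; otherwise a new connection is opened. Illustrative only.
     */
    #include <time.h>

    #define REUSE_WINDOW 15          /* seconds */

    struct client {
        time_t last_request;         /* trace timestamp of previous request */
        int    conn_open;            /* is a persistent connection open? */
    };

    /* Returns 1 to reuse the open connection, 0 to open a new one. */
    int reuse_connection(struct client *cl, time_t now)
    {
        int reuse = cl->conn_open &&
                    (now - cl->last_request) <= REUSE_WINDOW;
        cl->last_request = now;
        cl->conn_open = 1;           /* either way, a connection is open now */
        return reuse;
    }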
For all experiments, the first 400000 packets are used to warm up the simulators, and measurements are taken during the next 600000 packets. Many recent simulation-based studies use purely functional simulation during the warmup phase to reduce simulation time. However, one recent study shows that such a method can produce misleading results for TCP workloads and that the measurement phase needs to be long enough to cover several round trip times [16]. In this paper, the warmup phase simulates timing, and 600000 packets lead to at least one second of simulated time for the experiments presented in the paper.

5  Experimental Results

5.1  Priority-based Packet Processing and Load Control

Configuration shorthands have the form Workload-NIC Connections-Packet Priority-Load Control-Selection Policy-NIC MIPS.
Workload          Web server workload
                    W: World Cup, 2048 clients
                    I: IBM, 2048 clients
                    D: SPECweb99, 2048 clients
                    S: SPECweb99 Static, 4096 clients
NIC Connections   Maximum number of connections on the NIC
                    (0 means handoff is not used)
Packet Priority   Host first priority-based packet processing on the NIC
                    P: Used, N: Not used
Load Control      Load control mechanism on the NIC
                    L: Used, N: Not used
Selection Policy  Connection selection policy used by the operating system
                    FCFS: First-come, first-served
                    Tn: Threshold with value n
NIC MIPS          Instruction rate of the NIC in million instructions per second
Table 3: Configuration shorthands used in Section 5.
Figure 5: Impact of host first packet processing and load control on the simulated web server.
Figure 5 shows the execution profiles of the simulated web server using various configurations. The Y axis shows abbreviated system configurations (see Table 3 for an explanation of the abbreviations). The first graph shows the fraction of host processor cycles spent in the user application, the operating system, and idle loop. The second graph shows the amount of idle time on the NIC. The third and fourth graphs show connection and packet rates. These graphs also show the fraction of connections that are handed off to the NIC, and the fraction of packets that are consumed and generated by the NIC while processing the connections on the NIC. The last two graphs show server throughput in requests/s, and HTTP content in megabits/s. HTTP content throughput only includes HTTP response bytes that are received by the client. Requests/s shows the request completion rates seen by the client.
W-0-N-N-FCFS-400 in Figure 5 shows the baseline performance of the simulated web server for the World Cup trace. No connections are handed off to the NIC. The host processor has zero idle time, and 57% of host processor cycles (not shown in the figure) are spent executing the network stack below the system call layer. Since the NIC has 62% idle time, handoff should be able to improve server performance. However, simply handing off many connections to the NIC can create a bottleneck at the NIC, as illustrated by W-1024-N-N-FCFS-400.
In W-1024-N-N-FCFS-400, the NIC can handle a maximum of 1024 connections at a time. At first, 2048 connections are established, and 1024 of them are handed off to the NIC. As the NIC becomes nearly saturated with TCP processing (only 3% idle time), it takes too long to deliver host packets to the operating system. On average, it now takes more than 1 millisecond for a host packet to cross the NIC. Without handoff, it takes less than 10 microseconds. The 62% idle time on the host processor also shows that host packets are delivered too slowly. So, the connections on the NIC progress and terminate much faster than the connections on the host. When the client establishes new connections, they are most likely to replace terminated connections on the NIC, not the host. Consequently, the NIC processes a far greater share of new connections than the host. Overall, 88% of all connections during the experiment are handed off to the NIC. Note that at any given time, roughly half the active connections are being handled by the NIC and the other half are being handled by the host. Since the NIC becomes a bottleneck in the system and severely degrades the performance of connections handled by the host, the request rate drops by 30%. This configuration clearly shows that naive offloading can degrade system performance.
In W-1024-P-N-FCFS-400, the NIC still has a maximum of 1024 connections but employs host first packet processing to minimize delays to host packets. The mean time for a host packet to cross the NIC drops to less than 13 microseconds even though the NIC is still busy with TCP processing (only 5% idle time). The fraction of connections handed off to the NIC is now 48%, close to one half, as expected. The host processor shows no idle time, and server throughput continues to improve.
In W-4096-P-N-FCFS-400, the NIC can handle a maximum of 4096 connections at a time. 100% of connections are handed off to the NIC since there are only 2048 concurrent connections in the system. The NIC is fully saturated and again becomes a bottleneck in the system. Processing each packet takes much longer, and there are also dropped packets. As a result, the host processor shows 64% idle time, and the request rate drops by 52% from 26663/s to 12917/s. Thus, giving priority to host packets cannot prevent the NIC from becoming the bottleneck in the system. Note that host first packet processing still does its job, and host packets (mainly packets involved in new connection establishment) take only several microseconds to cross the NIC.
Figure 6: Dynamic adjustment of the number of connections on the NIC by the load control mechanism for configuration W-4096-P-L-FCFS-400.
In W-4096-P-L-FCFS-400, the NIC can handle a maximum of 4096 connections at a time, just like W-4096-P-N-FCFS-400, but uses the load control mechanism discussed in Section 3.2. Figure 6 shows how the NIC dynamically adjusts the number of connections during the experiment. Initially 2048 connections are handed off to the NIC, but received packets start piling up on the receive packet queue. As time progresses, the NIC reduces the number of connections in order to keep the length of the receive packet queue under the 1024-packet threshold. The number of connections on the NIC stabilizes around 1000 connections. The resulting server throughput is very close to that of W-1024-P-N-FCFS-400, in which the NIC is manually set to handle up to 1024 concurrent connections. Thus, the load control mechanism is able to adjust the number of connections on the NIC in order to avoid overload conditions. The NIC now has 9% idle time, slightly greater than the 5% shown in W-1024-P-N-FCFS-400, which indicates that the watermark values used in the load control mechanism are not optimal. Overall, handoff improves server throughput by 12% in packet rate, 12% in request rate, and 10% in HTTP content throughput (compare W-0-N-N-FCFS-400 and W-4096-P-L-FCFS-400). The server profiles during the execution of the IBM trace also show that both host first packet processing and load control on the NIC are necessary, and that by using both techniques, handoff improves server throughput for the IBM trace by 19% in packet rate, 23% in request rate, and 18% in HTTP content throughput (compare I-0-N-N-FCFS-400 and I-4096-P-L-FCFS-400).
Unlike the trace replayer, SPECweb99 tries to maintain a fixed throughput for each client. Figure 5 also shows server performance for SPECweb99 Static and SPECweb99. The static version is the same as SPECweb99 except that the client generates only static content requests, so it is used to compare against the results produced by the trace replayer. S-0-N-N-FCFS-400 shows the baseline performance for SPECweb99 Static. Since each client of SPECweb99 is throttled to a maximum of 400Kb/s, 4096 connections (twice the number used for the trace replayer) are used to saturate the server. Like W-0-N-N-FCFS-400, the host processor has no idle cycles and spends more than 70% of cycles in the kernel, and the NIC has 69% idle time. When 2048 connections are handed off, the request rate actually drops slightly. As in W-1024-N-N-FCFS-400, host packets are delivered to the operating system too slowly, and the host processor shows 50% idle time. The use of host first packet processing on the NIC overcomes this problem, and server throughput continues to increase. Increasing the number of connections further would simply overload the NIC, as it has only 8% idle time. S-4096-P-L-FCFS-400 uses both host first packet processing and the load control mechanism on the NIC. Although the NIC can store all 4096 connections, the load control mechanism reduces the number of connections to around 2000 in order to avoid overload conditions. Overall, by using host first packet processing and the load control mechanism on the NIC, handoff improves the request rate for SPECweb99 Static by 31%. These techniques help improve server performance for regular SPECweb99 as well. Handoff improves the request rate by 28%.

5.2  Connection Selection Policy

As mentioned in Section 3.3, the system may use a threshold to differentiate long-lived connections that transfer many packets from short-lived ones. Handing off long-lived connections has the potential to improve server performance when the NIC has memory for only a small number of connections. For instance, offload processors may use a small on-chip memory to store connections for fast access. In this case, it is necessary to be selective and hand off connections that transfer many packets in order to utilize the available compute power on the NIC as much as possible. On the other hand, when the NIC can handle a much larger number of connections, it is more important to hand off as many connections as possible, and a threshold-based selection policy has either negligible impact on server throughput or degrades it because fewer packets are processed by the NIC.
Figure 7: Impact of first-come, first-served and threshold-based connection selections on the simulated web server.
Figure 7 compares FCFS and threshold-based connection selection policies when the maximum number of connections on the NIC is much smaller than the value used in the previous section. For threshold-based policies, denoted by Tn, the trailing number indicates the minimum number of enqueue operations to the send socket buffer of a connection that must occur before the operating system attempts to hand off the connection. The number of enqueue operations is proportional to the number of sent packets. For instance, using T4, the operating system attempts a handoff when the fourth enqueue operation to the connection's send socket buffer occurs. As shown in the figure, the use of a threshold enables the operating system to hand off longer-lived connections than FCFS, but the resulting throughput improvements are small. For instance, W-256-P-L-FCFS-400 shows a case in which the NIC can handle up to 256 connections and the operating system hands off connections on a FCFS basis. 13% of connections and 12% of packets are processed by the NIC, as expected. The NIC shows 47% idle time. When a threshold policy is used (W-256-P-L-T20-400), the NIC now processes 24% of packets, and the request rate improves by 6%. However, the NIC still has 34% idle time. The lifetime distribution shown in Figure 4 suggests that if the operating system were able to pick the longest-lived 10% of connections, the NIC would process over 60% of packets. Thus, with a more accurate selection policy, the NIC would be able to process a greater fraction of packets and improve system performance further.

5.3  NIC Speed

Figure 8: Impact of the instruction rate of the NIC on the simulated web server.
The results so far have shown that the NIC must employ host first packet processing and dynamically control the number of connections. As the instruction rate of the NIC increases, the NIC processes packets more quickly. The load control mechanism on the NIC should then be able to increase the number of connections handed off to the NIC. Figure 8 shows the impact of increasing the instruction rate of the NIC. W-4096-P-L-FCFS-400 in the figure is the same as the one in Figure 5 and is used as the baseline case. As the instruction rate increases from 400 to 600 and 800MIPS, the fraction of connections handed off to the NIC increases from 45% to 70% and 85%. Accordingly, the request rate of the server increases from 25830/s to 29398/s and 36532/s (14% and 41% increases). For the IBM trace, increasing the instruction rate from 400 to 600MIPS results in a 21% increase in request rate. At 600MIPS, nearly all connections (95%) are handed off to the NIC. So, the faster 800MIPS NIC improves the request rate by only 3%.
Faster NICs improve server throughput for SPECweb99 Static as well. As the instruction rate increases from 400 to 600MIPS, the request rate improves by 16%. The 800MIPS NIC further improves the request rate by 13%. Faster NICs do not benefit SPECweb99 because the 400MIPS NIC already achieves more than the specified throughput. With 2048 connections, SPECweb99 aims to achieve a maximum HTTP throughput of about 819Mb/s = 2048 × 400Kb/s. In reality, throughput can become greater than the specified rate as it is difficult to maintain throughput strictly under the specified rate. With the 400MIPS NIC, HTTP content throughput is near 1Gb/s. So, faster NICs simply have greater idle time.
These results show that the system can transparently exploit increased processing power on the NIC by using the load control mechanism and host first packet processing on the NIC. Thus, hardware developers can improve NIC capabilities without worrying about software changes as the firmware will adapt the number of connections and be able to use the increased processing power.
Finally, HTTP response times, measured as the amount of time elapsed between when a request is sent and when the full response is received, follow server request rates, as expected. For instance, the mean response time for the World Cup trace is 61ms without offload (W-0-N-N-FCFS-400). It increases to 99ms when 1024 connections are offloaded without host first packet processing or load control (W-1024-N-N-FCFS-400). The use of both host first packet processing and load control drops the mean response time to 60ms (W-1024-P-L-FCFS-400). Increasing the instruction rate of the NIC from 400 to 600 and 800MIPS further reduces the mean response time to 53ms and 40ms, respectively. Mean response times for other workloads follow trends similar to that of the World Cup trace, except that mean response times for SPECweb99 are larger than those for the World Cup and IBM traces because of throttling and dynamic content generation.

6  Related Work

There are a number of previous studies on full TCP offload to both network interfaces and dedicated processors in the system. TCP Servers was an early TCP offload design [27]. TCP Servers, based on the Split-OS concept [3], splits TCP from the rest of the operating system and lets a dedicated processor or a dedicated system execute TCP. Brecht et al. expand this concept by providing an asynchronous I/O interface to communicate with the dedicated TCP processing resources [6]. Intel has dubbed such approaches, which dedicate a general-purpose processor to TCP processing, TCP onloading [28]. Regardless of the name, these approaches are effectively full TCP offload, as TCP and the rest of the system's processing are partitioned into two components.
Freimuth et al. recently showed that full offload reduces traffic on the local I/O interconnect [14]. They used two machines for their evaluation, one acting as the NIC and the other as the host CPU. A central insight is that with offload, the NIC and the operating system communicate at a higher level than with a conventional network interface, which creates opportunities for optimization. Westrelin et al. also evaluated the impact of TCP offload [31]. They used a multiprocessor system in which one processor is dedicated to executing TCP, like TCP onloading, and showed a significant improvement in microbenchmark performance. Finally, an analytical study of the performance benefits of TCP offload shows that offload can be beneficial but that its benefits vary widely depending on application and hardware characteristics [30].
However, while these studies have shown the benefits of TCP offload, they have not addressed the problems that have been associated with full TCP offload. These problems include creating a potential bottleneck at the NIC, difficulties in designing software interfaces between the operating system and the NIC, modifying the existing network stack implementations, and introducing a new source of software bugs (at the NIC) [24].
Connection handoff, which addresses some of these concerns, has been previously proposed and implemented. Microsoft has proposed to implement a device driver API for TCP offload NICs based on connection handoff in the next generation of the Windows operating system, as part of the Chimney Offload architecture [22]. Mogul et al. argued that exposing transport (connection) state to the application creates opportunities for enhanced application features and performance optimizations, including moving connection state between the operating system and offload NICs [23]. The authors have implemented both the operating system and network interface components of connection handoff, with the architecture described in Section 2 [18]. The policies presented in this paper apply to all of these previous proposals and implementations; they improve efficiency and performance and prevent the network interface from becoming a performance bottleneck.
A recent study shows that a commercial offloading NIC can achieve over 7Gb/s and substantially improve web server throughput [13]. This is an encouraging result since it shows that a specialized offload processor can handle high packet rates.

7  Conclusion

Offloading TCP processing to the NIC can improve system throughput by reducing computation and memory bandwidth requirements on the host processor. However, the NIC inevitably has limited resources and can become a bottleneck in the system. Offload based on connection handoff enables the operating system to control the number of connections processed by the host processor and the NIC, thereby controlling the division of work between them. Thus, the system should be able to treat the NIC as an acceleration coprocessor by handing off as many connections as the resources on the NIC will allow.
A system that implements connection handoff can employ the policies presented in this paper in order to fully utilize the offload NIC without creating a bottleneck in the system. First, the NIC gives priority to those packets that belong to the connections processed by the host processor. This ensures that packets are delivered to the operating system in a timely manner and that TCP processing on the NIC does not degrade the performance of host connections. Second, the NIC dynamically controls the number of connections that can be handed off. This avoids overloading the NIC, which would create a performance bottleneck in the system. Third, the operating system can differentiate connections and hand off only long-lived connections to the NIC in order to better utilize offloading NICs that lack memory capacity for a large number of connections. Full-system simulations of web workloads show that without any of these policies handoff reduces the server request rate by up to 44%. In contrast, connection handoff augmented with these policies improves the server request rate by 12-31%. When a faster offload processor is used, the system transparently exploits the increased processing capacity of the NIC, and connection handoff achieves request rates that are 33-72% higher than a system without handoff.

Acknowledgments

The authors thank Alan L. Cox for his interest and comments on the paper. The authors also thank Robbert van Renesse for shepherding and the reviewers for their valuable comments. This work is supported in part by a donation from Advanced Micro Devices and by the National Science Foundation under Grant Nos. CCR-0209174 and CCF-0546140.

References

[1]
Alteon Networks. Tigon/PCI Ethernet Controller, Aug. 1997. Revision 1.04.
[2]
M. F. Arlitt and C. L. Williamson. Internet Web Servers: Workload Characterization and Performance Implications. IEEE/ACM Transactions on Networking, 5(5):631-645, Oct. 1997.
[3]
K. Banerjee, A. Bohra, S. Gopalakrishnan, M. Rangarajan, and L. Iftode. Split-OS: An Operating System Architecture for Clusters of Intelligent Devices. Work-in-Progress Session at the 18th Symposium on Operating Systems Principles, Oct. 2001.
[4]
G. Banga and P. Druschel. Measuring the Capacity of a Web Server. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, Dec. 1997.
[5]
N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Proceedings of the Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), pages 36-43, Feb. 2003.
[6]
T. Brecht, G. J. Janakiraman, B. Lynn, V. Saletore, and Y. Turner. Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O. In Proceedings of EuroSys 2006, pages 265-278, Apr. 2006.
[7]
L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Schenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proceedings of IEEE INFOCOM '99, volume 1, pages 126-134, Mar. 1999.
[8]
P. Cao and S. Irani. Cost-Aware WWW Proxy Caching Algorithms. In Proceedings of the 1997 USENIX Symposium on Internet Technology and Systems, pages 193-206, Dec. 1997.
[9]
H. K. J. Chu. Zero-Copy TCP in Solaris. In Proceedings of the 1996 Annual USENIX Technical Conference, pages 253-264, 1996.
[10]
D. D. Clark, V. Jacobson, J. Romkey, and H. Salwen. An Analysis of TCP Processing Overhead. IEEE Communications Magazine, pages 23-29, June 1989.
[11]
L. T. Clark, E. J. Hoffman, J. Miller, M. Biyani, Y. Liao, S. Strazdus, M. Morrow, K. E. Velarde, and M. A. Yarch. An Embedded 32-b Microprocessor Core for Low-Power and High-Performance Applications. IEEE Journal of Solid-State Circuits, 36(11):1599-1608, Nov. 2001.
[12]
P. Druschel and L. L. Peterson. Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. In Proceedings of the 14th Symposium on Operating Systems Principles (SOSP-14), pages 189-202, Dec. 1993.
[13]
W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and D. K. Panda. Performance Characterization of a 10-Gigabit Ethernet TOE. In Proceedings of the 13th IEEE Symposium on High-Performance Interconnects, 2005.
[14]
D. Freimuth, E. Hu, J. LaVoie, R. Mraz, E. Nahum, P. Pradhan, and J. Tracey. Server Network Scalability and TCP Offload. In Proceedings of the 2005 Annual USENIX Technical Conference, pages 209-222, Apr. 2005.
[15]
Y. Hoskote, B. A. Bloechel, G. E. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. Narendra, G. Ruhl, J. W. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Xu, and N. Borkar. A TCP Offload Accelerator for 10 Gb/s Ethernet in 90-nm CMOS. IEEE Journal of Solid-State Circuits, 38(11):1866-1875, Nov. 2003.
[16]
L. R. Hsu, A. G. Saidi, N. L. Binkert, and S. K. Reinhardt. Sampling and Stability in TCP/IP Workloads. In Proceedings of the First Annual Workshop on Modeling, Benchmarking, and Simulation (MoBS), pages 68-77, 2005.
[17]
H. Kim and S. Rixner. Performance Characterization of the FreeBSD Network Stack. Computer Science Department, Rice University, June 2005. Technical Report TR05-450.
[18]
H. Kim and S. Rixner. TCP Offload through Connection Handoff. In Proceedings of EuroSys 2006, pages 279-290, Apr. 2006.
[19]
K. Kleinpaste, P. Steenkiste, and B. Zill. Software Support for Outboard Buffering and Checksumming. In Proceedings of the ACM SIGCOMM '95 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 87-98, Aug. 1995.
[20]
P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. Computer, 35(2):50-58, 2002.
[21]
P. E. McKenney and K. F. Dove. Efficient Demultiplexing of Incoming TCP Packets. In Proceedings of the ACM SIGCOMM '92 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 269-279, 1992.
[22]
Microsoft Corporation. Scalable Networking: Network Protocol Offload - Introducing TCP Chimney, Apr. 2004. WinHEC Version.
[23]
J. Mogul, L. Brakmo, D. E. Lowell, D. Subhraveti, and J. Moore. Unveiling the Transport. ACM SIGCOMM Computer Communication Review, 34(1):99-106, 2004.
[24]
J. C. Mogul. TCP offload is a dumb idea whose time has come. In Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, pages 25-30, 2003.
[25]
V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An Efficient and Portable Web Server. In Proceedings of the USENIX 1999 Annual Technical Conference, pages 199-212, June 1999.
[26]
V. S. Pai, P. Druschel, and W. Zwaenepoel. I/O-Lite: A Unified I/O Buffering and Caching System. In Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation, pages 15-28, Feb. 1999.
[27]
M. Rangarajan, A. Bohra, K. Banerjee, E. V. Carrera, R. Bianchini, L. Iftode, and W. Zwaenepoel. TCP Servers: Offloading TCP/IP Processing in Internet Servers. Design, Implementation, and Performance. Computer Science Department, Rutgers University, Mar. 2002. Technical Report DCR-TR-481.
[28]
G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong. TCP Onloading for Data Center Servers. Computer, 37(11):48-58, Nov. 2004.
[29]
S. Rixner. Memory Controller Optimizations for Web Servers. In Proceedings of the 37th Annual International Symposium on Microarchitecture, pages 355-366, Dec. 2004.
[30]
P. Shivam and J. S. Chase. On the Elusive Benefits of Protocol Offload. In Proceedings of the ACM SIGCOMM Workshop on Network-I/O Convergence, pages 179-184, 2003.
[31]
R. Westrelin, N. Fugier, E. Nordmark, K. Kunze, and E. Lemoine. Studying Network Protocol Offload With Emulation: Approach And Preliminary Results. In Proceedings of the 12th Annual IEEE Symposium on High Performance Interconnects, pages 84-90, Aug. 2004.


