USENIX '05 Paper
Server Network Scalability and TCP Offload
Doug Freimuth, Elbert Hu, Jason LaVoie, Ronald Mraz,
Erich Nahum, Prashant Pradhan, John Tracey
IBM T. J. Watson Research Center
Hawthorne, NY, 10532
Server network performance is increasingly dominated by poorly scaling
operations such as I/O bus crossings, cache misses and interrupts.
Their overhead prevents performance from scaling even with increased
CPU, link or I/O bus bandwidths. These operations can be reduced by
redesigning the host/adapter interface to exploit additional
processing on the adapter. Offloading processing to the adapter is
beneficial not only because it allows more cycles to be applied but also
because of the changes it enables in the host/adapter interface. As opposed to
other approaches such as RDMA, TCP offload provides benefits without
requiring changes to either the transport protocol or API.
We have designed a new host/adapter interface that exploits offloaded
processing to reduce poorly scaling operations. We have implemented a
prototype of the design including both host and adapter software
components. Experimental evaluation with simple network benchmarks
indicates our design significantly reduces I/O bus crossings and holds
promise to reduce other poorly scaling operations as well.
Server network throughput is not scaling with CPU speeds. Various
studies have reported CPU scaling factors of 43%, 60%, and 33% to 68%,
which fall short of an ideal
scaling of 100%. In this paper, we show that even increasing CPU
speeds and link and bus bandwidths does not generate a commensurate
increase in server network throughput. This lack of scalability points
to an increasing tendency for server network throughput to become the
key bottleneck limiting system performance. It motivates the need for
an alternative design with better scalability.
Server network scalability is limited by operations heavily used in
current designs that themselves do not scale well, most notably
bus crossings, cache misses and interrupts. Any significant
improvement in scalability must reduce these operations. Given that the
problem is one of scalability and not simply performance, it will not
be solved by faster processors. Faster processors merely expend more
cycles on poorly scaling operations.
Research in server network performance over the years has yielded
significant improvements including: integrated checksum and copy,
checksum offload, copy avoidance, interrupt coalescing, fast path
protocol processing, efficient state lookup, efficient timer management and
segmentation offload, a.k.a. large send. Another technique, full TCP
offload, has been pursued for many years. Work on offload has
generated both promising and less than compelling results.
Good performance data and analysis on offload is scarce.
Many improvements in server scalability were described more than
fifteen years ago by Clark et al. The authors
demonstrated that the overhead incurred by network protocol processing,
per se, is small compared to both per-byte (memory access) costs and
operating system overhead, such as buffer and timer management. This
motivated work to reduce or eliminate data touching operations, such as
copies, and to improve the efficiency of operating system services heavily
used by the network stack. Later work showed that the
overhead of non-data touching operations is, in fact, significant for real
workloads, which tend to feature a preponderance of small messages.
Today, per-byte overhead has been greatly reduced through checksum
offload and zero-copy send. This leaves per-packet overhead,
operating system services and zero-copy receive as the main remaining
areas for further improvement.
Nearly all of the enhancements described by Clark et al. have seen
widespread adoption. The one notable exception is "an efficient
network interface." This is a network adapter with a fast
general-purpose processor that provides a much more efficient
interface to the network than the current frame-based interface
devised decades ago. In this paper, we describe an effort to develop a much
more efficient network interface and to make this enhancement a
reality as well.
Our work is pursued in the context of TCP for three reasons:
1) TCP's enormous installed base,
2) the methodology employed with TCP will transfer to other protocols, and
3) the expectation that key new architectural features, such as zero copy
receive, will ultimately demonstrate their viability with TCP.
The work described here is part of a larger effort to improve server
network scalability. We began by analyzing server network performance
and recognizing, as others have, a significant scalability problem.
Next, we identified the specific operations responsible:
bus crossings, cache misses, and interrupts. We formulated a design
that reduces the impact of these operations. This design exploits
additional processing at the network adapter, i.e. offload, to improve
the efficiency of the host/adapter interface which is our primary
focus. We have implemented a prototype of the new design which
consists of host and adapter software components and have analyzed the
impact of the new design on bus crossings. Our findings indicate that
offload can substantially decrease bus crossings and holds promise to
reduce other scalability limiting operations such as cache misses.
Ultimately, we intend to evaluate the design in a cycle-accurate hardware
simulator. This will allow us to comprehensively quantify the impact of
design alternatives on cache misses, interrupts and overall performance
over several generations of hardware.
This paper is organized as follows.
Section 2 provides motivation and background.
Section 3 presents our design,
and the current prototype implementation is described in Section 4.
Section 5 presents our experimental infrastructure and results.
Section 6 surveys and contrasts related work, and
Section 7 summarizes
our contributions and plans for future work.
2 Motivation and Background
To provide the proper motivation and background for our work,
we first describe the current best practices of techniques
and optimizations for network server performance.
Using industry standard benchmarks, we then show that,
despite these practices, server performance is still not
scaling with CPU speeds.
Since TCP offload has been a controversial topic in the research
community, we review the critiques of offload, providing counterarguments
to each point. How TCP offload addresses these scaling
issues is described in more detail in Section 3.
2.1 Current Best Practices
Current high-performance servers have adopted many techniques to
maximize performance. We provide a brief overview of them here.
Sendfile with zero copy. Most operating systems have a sendfile
or transmitfile operation that allows sending a file over a socket
without copying the contents of the file into user space. This can
have substantial performance benefits.
However, the benefits are limited to send-side processing; it does not
affect receive-side processing. In addition, it requires the server
application to maintain its data in the kernel, which may not be
feasible for systems such as application servers, which generate
content dynamically.
Checksum offload. Researchers have shown that calculating
the IP checksum over the body of the data can be expensive.
Most high-performance adapters have the
ability to perform the IP checksum over both the contents of the data
and the TCP/IP headers. This removes an expensive data-touching
operation on both send and receive. However, adapter-level checksums
will not catch errors introduced by transferring data over the I/O
bus, which has led some to advocate caution with checksum offload.
Interrupt coalescing. Researchers have shown
that interrupts are costly, and generating an interrupt for each
packet arrival can severely throttle a system.
In response, adapter vendors have enabled the ability to delay interrupts
by a certain amount of time or number of packets in an effort to batch
packets per interrupt and amortize the costs.
While effective, it can be difficult to determine the proper trigger
thresholds for firing interrupts, and large amounts of batching may
cause unacceptable latency for an individual connection.
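The trade-off between batching and latency can be illustrated with a small simulation. The following Python sketch is not taken from any adapter: the function name and the thresholds `max_packets` and `max_delay_us` are hypothetical, and it assumes an interrupt fires as soon as either the packet count or the delay since the first unserviced packet reaches its threshold.

```python
def coalesce(arrivals_us, max_packets, max_delay_us):
    """Simulate count/delay interrupt coalescing over sorted arrival times.

    Returns (number of interrupts, worst wait of a batch's first packet).
    All names and thresholds here are illustrative assumptions.
    """
    if max_packets < 1 or max_delay_us < 0:
        raise ValueError("need max_packets >= 1 and max_delay_us >= 0")
    interrupts = 0
    worst_wait_us = 0
    i = 0
    while i < len(arrivals_us):
        first = arrivals_us[i]
        deadline = first + max_delay_us
        batch = 0
        # Collect packets until the count threshold or the deadline is hit.
        while (i < len(arrivals_us) and batch < max_packets
               and arrivals_us[i] <= deadline):
            batch += 1
            i += 1
        # A full batch fires on its last arrival; otherwise the timer fires.
        fire_time = arrivals_us[i - 1] if batch == max_packets else deadline
        interrupts += 1
        worst_wait_us = max(worst_wait_us, fire_time - first)
    return interrupts, worst_wait_us


arrivals = [0, 10, 20, 30, 40, 500]
print(coalesce(arrivals, max_packets=1, max_delay_us=100))  # one interrupt per packet
print(coalesce(arrivals, max_packets=4, max_delay_us=100))  # fewer interrupts, added delay
```

Raising either threshold lowers the interrupt count, but a packet that arrives alone waits for the full delay, which is the latency penalty described above.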
Large send/segmentation offload.
TCP/IP implementers have long known that larger MTU sizes provide
greater efficiency, both in terms of network utilization (fewer headers
per byte transferred) and in terms of host CPU utilization (fewer
per-packet operations incurred per byte sent or received).
Unfortunately, larger MTU sizes are not usually available due to
Ethernet's standard 1500-byte MTU. Gigabit Ethernet provides "jumbo
frames" of 9 KB, but these are only useful in specialized local
environments and cannot be preserved across the wide-area Internet.
As an approximation, certain operating systems, such as AIX and Linux,
provide large send or TCP segmentation offload (TSO) where the
TCP/IP stack interacts with the network device as if it had a large MTU size.
The device in turn segments the larger buffers into 1500-byte Ethernet
frames and adjusts the TCP sequence numbers and checksums accordingly.
However, this technique is also limited to send-side processing.
In addition, as we demonstrate in Section 2.2,
the technique is limited by the way TCP performs congestion control.
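The segmentation step performed by the device can be sketched as follows. This is an illustration only, assuming a 1500-byte MTU and 40 bytes of IP and TCP headers without options, which gives a 1460-byte maximum segment size; the function name `segment` is hypothetical, and checksum recomputation is omitted.

```python
def segment(payload: bytes, start_seq: int, mss: int = 1460):
    """Split one large send buffer into (sequence number, chunk) pairs.

    mss defaults to 1500 - 20 (IP) - 20 (TCP); this is an assumption,
    not a value taken from any particular adapter.
    """
    segments = []
    for offset in range(0, len(payload), mss):
        seq = (start_seq + offset) % 2**32  # TCP sequence numbers wrap at 2^32
        segments.append((seq, payload[offset:offset + mss]))
    return segments


# A 4 KB buffer handed to the device as a single unit becomes three segments.
for seq, chunk in segment(bytes(4096), start_seq=1000):
    print(seq, len(chunk))
```

The host builds one large buffer and one descriptor; the per-packet work of advancing the sequence number moves to the device.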
Efficient connection management. Early networked servers
did not handle large numbers of TCP connections efficiently,
for example by using a linear linked-list to manage state.
This led to operating systems using
hash table based approaches
and separating table entries in the TIME_WAIT state.
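A minimal sketch of the hash-based approach follows, assuming connections are keyed by the usual four-tuple of addresses and ports. The class and method names are hypothetical, a Python dictionary stands in for the kernel hash table, and the reduced TIME_WAIT record is an illustrative assumption rather than a description of any particular stack.

```python
class ConnectionTable:
    """Toy connection lookup with a separate TIME_WAIT table."""

    def __init__(self):
        self.active = {}     # four-tuple -> full connection state
        self.time_wait = {}  # four-tuple -> minimal record

    def insert(self, four_tuple, state):
        self.active[four_tuple] = state

    def enter_time_wait(self, four_tuple):
        # Drop the full state and keep only a small record, so that
        # lingering connections do not crowd the active table.
        self.active.pop(four_tuple)
        self.time_wait[four_tuple] = {"state": "TIME_WAIT"}

    def lookup(self, four_tuple):
        # Expected constant-time lookup, independent of connection count.
        if four_tuple in self.active:
            return self.active[four_tuple]
        return self.time_wait.get(four_tuple)


table = ConnectionTable()
key = ("10.0.0.1", 80, "10.0.0.2", 40000)
table.insert(key, {"state": "ESTABLISHED"})
table.enter_time_wait(key)
print(table.lookup(key))
```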
Asynchronous interfaces. To maximize concurrency,
high-performance servers use asynchronous interfaces so as not
to block on long-latency operations.
Server applications interact using an event notification interface
such as select() or poll(), which in turn can have
performance implications.
Unfortunately, these interfaces are typically only for network
I/O and not file I/O, so they are not as general as they could be.
In-kernel implementations. Context switches, data
copies, and system calls can be avoided altogether by implementing
the server completely in kernel space [17,18].
While this provides the best performance, in-kernel implementations
are difficult to implement and maintain, and the approach is hard to
generalize across multiple applications.
RDMA. Others have also noticed these scaling problems,
particularly with respect to data copying, and have offered RDMA as
a solution. Interest in RDMA and Infiniband is
growing in the local-area case, such as in storage networks or
cluster-based supercomputing. However, RDMA requires
modifications to both sides of a conversation, whereas Offload can be
deployed incrementally on the server side only. Our interest is in
supporting existing applications in an inter-operable way, which
precludes using RDMA.
While effective, these optimizations are limited in that they do not
address the full range of scenarios seen by a server. The main
restrictions are: 1) that they do not apply to the receive side, 2)
they are not fully asynchronous in the way they interact with the
operating system, 3) they do not minimize the interaction with the
network interface, or 4) they are not inter-operable. Additionally,
many techniques do not address what we believe to be the fundamental
performance issue, which is overall server scalability.
2.2 Server Scalability
The recent arrival of 10 gigabit Ethernet and the promise of 40 and
100 gigabit Ethernet in the near future show that raw network bandwidth
is scaling at least as quickly as CPU speed. However, it is well-known
that memory speeds are not scaling as quickly as CPU speed increases.
As a consequence of this and other factors,
researchers have observed that the performance of host TCP/IP
implementations is not scaling at the same rate as CPU speeds in spite
of raw network bandwidth increases.
Table 1: Properties for Multiple Generations of Machines
To quantify how performance scales over time, we ran a number of experiments
using several generations of machines, described in detail in Table
1. We break the machines into 2 classes: desk-side
workstations and rack-mounted servers with aggressive memory systems
and I/O busses. The workstations include a 500 MHz Intel Pentium 3,
a 933 MHz Intel Pentium 3, and a 1.7 GHz Pentium 4. The servers
include a 450 MHz Pentium II-Xeon, a 1.6 GHz P4 Xeon, and a 3.2 GHz P4
Xeon. In addition, each of the P4-Xeon servers has a 1 MB L3 cache.
Each machine runs Linux 2.6.9 and has a number of Intel E1000 MT
server gigabit Ethernet adapters, connected via a Dell gigabit switch.
Load is generated by five 3.2 GHz P4-Xeons acting as clients, each using an
E1000 client gigabit adapter and running Linux 2.6.5. We chose
the E1000 MT adapters for the servers since these have been shown
to be one of the highest-performing conventional adapters on the market,
and we did not have access to a 10 gigabit adapter.
|Machine ||BIOS Release Date ||Clock Speed (MHz) ||Cycle Time (ns) ||Bus Width (bits) ||Bus Speed (MHz) ||L1 Size (KB) ||L2 Size (KB) ||E1000 NICs (num) |
|500 MHz P3 ||Jul 2000 ||500 ||2.000 ||32 ||33 ||32 ||512 ||1 |
|933 MHz P3 ||Mar 2001 ||933 ||1.070 ||32 ||33 ||32 ||256 ||1 |
|1.7 GHz P4 ||Sep 2003 ||1700 ||0.590 ||64 ||66 ||8 ||256 ||2 |
|450 MHz P2-Xeon ||Jan 2000 ||450 ||2.200 ||64 ||33 ||32 ||2048 ||2 |
|1.6 GHz P4-Xeon ||Oct 2001 ||1600 ||0.625 ||64 ||100 ||8 ||256 ||3 |
|3.2 GHz P4-Xeon ||May 2004 ||3200 ||0.290 ||64 ||133 ||8 ||512 ||4 |
Table 2: Memory Access Times for Multiple Generations of Machines
We measured the time to access various locations in the memory hierarchy
for these machines, including from the L1 and L2 caches, main memory, and
the memory-mapped I/O registers on the E1000. Memory hierarchy times
were measured using LMBench . To measure
the device I/O register times, we added some modifications to the
initialization routine of the Linux 2.6.9 E1000 device driver code.
Table 2 presents the results.
Note that while L1 and L2 access times remain relatively consistent in
terms of processor cycles, the number of cycles needed to access main
memory and the device registers is increasing over time. If access times
were improving at the same rate as CPU speeds, the number of clock cycles
would remain constant.
|Machine ||L1 Cache Hit Time (ns) ||L1 Cache Hit (cycles) ||L2 Cache Hit Time (ns) ||L2 Cache Hit (cycles) ||Main Memory Time (ns) ||Main Memory (cycles) ||I/O Register Read Time (ns) ||I/O Register Read (cycles) ||I/O Register Write Time (ns) ||I/O Register Write (cycles) |
|500 MHz P3 ||6 ||3 ||44 ||22 ||162 ||80 ||600 ||300 ||300 ||150 |
|933 MHz P3 ||3.25 ||3 ||7.5 ||7 ||173 ||161 ||700 ||654 ||400 ||373 |
|1.7 GHz P4 ||1.2 ||2 ||10.9 ||18 ||190 ||323 ||800 ||1355 ||100 ||169 |
|450 MHz P2-Xeon ||6.75 ||3 ||38.3 ||17 ||207 ||93 ||800 ||363 ||200 ||90 |
|1.6 GHz Xeon ||1.37 ||2 ||11.57 ||18 ||197 ||315 ||900 ||1440 ||300 ||480 |
|3.2 GHz Xeon ||0.6 ||2 ||5.8 ||18 ||111 ||376 ||500 ||1724 ||200 ||668 |
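The cycle counts in Table 2 appear to follow from dividing each access time by the cycle time listed in Table 1. The short Python sketch below reproduces that arithmetic for the I/O register read column of the three servers; the helper name `access_cycles` is ours, and small differences from the table are due to rounding.

```python
def access_cycles(access_ns, cycle_ns):
    """Convert an access time in nanoseconds to processor clock cycles."""
    return access_ns / cycle_ns


# (cycle time in ns from Table 1, I/O register read time in ns from Table 2)
servers = {
    "450 MHz P2-Xeon": (2.200, 800),
    "1.6 GHz P4-Xeon": (0.625, 900),
    "3.2 GHz P4-Xeon": (0.290, 500),
}
for name, (cycle_ns, read_ns) in servers.items():
    cycles = access_cycles(read_ns, cycle_ns)
    print(f"{name}: {read_ns} ns is about {cycles:.0f} cycles")
```

Although the register read on the newest server is faster in absolute terms (500 ns versus 800 ns), it costs several times as many processor cycles, which is the sense in which the operation scales poorly.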
Table 3: SPECWeb99 Performance Scalability over Multiple Generations of Machines
To see how actual server performance is scaling over time, we ran the
static portion of SPECweb99 using a recent version of Flash.
In these experiments, Flash exploits all
the available performance optimizations on Linux, including sendfile()
with zero copy, TSO, and checksum offload on the E1000.
Table 3 shows the results.
Observe that server performance is not scaling with CPU speed, even though
this is a heavily optimized server making use of all current best practices.
This is not because of limitations in the network bandwidth; for example,
the 3.2 GHz Xeon-based machine has 4 gigabit interfaces and multiple
10 gigabit PCI-X busses.
|Machine ||Throughput (ops/sec) ||Requested Connections ||Conforming Connections ||Scale (achieved) ||Scale (ideal) ||Ratio (%) |
|500 MHz P3 ||1231 ||375 ||375 ||1.00 ||1.00 ||100 |
|933 MHz P3 ||1318 ||400 ||399 ||1.06 ||1.87 ||56 |
|1.7 GHz P4 ||3457 ||1200 ||1169 ||3.20 ||3.40 ||94 |
|450 MHz P2-Xeon ||2230 ||700 ||699 ||1.00 ||1.00 ||100 |
|1.6 GHz P4-Xeon ||8893 ||2800 ||2792 ||4.00 ||3.56 ||112 |
|3.2 GHz P4-Xeon ||11614 ||3500 ||3490 ||5.00 ||7.10 ||71 |
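The ratio column of Table 3 can be read as achieved scale divided by ideal scale, where the ideal scale appears to be the clock speed relative to the slowest machine in the same class. The Python sketch below makes that reading explicit; the function name is hypothetical and the interpretation of "ideal" is inferred from the table values rather than stated in the text.

```python
def scaling_ratio(achieved_scale, clock_mhz, baseline_mhz):
    """Return (ideal scale, achieved/ideal as a percentage).

    Assumes ideal scaling is linear in clock speed relative to the
    baseline machine of the same class.
    """
    ideal = clock_mhz / baseline_mhz
    return ideal, 100.0 * achieved_scale / ideal


# Achieved scale values are taken from Table 3.
for name, achieved, mhz, base in [
    ("933 MHz P3", 1.06, 933, 500),
    ("1.7 GHz P4", 3.20, 1700, 500),
    ("1.6 GHz P4-Xeon", 4.00, 1600, 450),
    ("3.2 GHz P4-Xeon", 5.00, 3200, 450),
]:
    ideal, pct = scaling_ratio(achieved, mhz, base)
    print(f"{name}: ideal {ideal:.2f}, ratio {pct:.0f}%")
```

Under this reading, doubling the clock from 1.6 GHz to 3.2 GHz raises the achieved scale only from 4.00 to 5.00, so the ratio falls well below 100%.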
2.3 Offload: Critiques and Responses
In this paper, we study TCP offload as a solution to the scalability
problem. However, TCP offload has been hotly debated by
the research community, perhaps best exemplified by Mogul's paper,
"TCP offload is a dumb idea whose time has come".
That paper effectively summarizes the criticisms of TCP offload, and so,
we use the structure of that paper to offer our counterarguments here.
Limited processing requirements.
One argument is that Clark et al. show that the
main issue in TCP performance is implementation, not the TCP protocol
itself, and a major factor is data movement; thus Offload does not address
the real problem. We point out that
Offload does not simply mean TCP header processing; it includes the
entire TCP/IP stack, including poorly-scaling, performance-critical
components such as data movement, bus crossings, interrupts, and
cache misses. Offload provides an improved interface to the adapter that reduces
the use of these scalability-limiting operations.
Moore's Law: Moore's Law states that CPU speeds are doubling every
18 months, and thus one claim is that Offload cannot compete with
general-purpose CPUs. Historically, chips used by adapter vendors have not
increased at the same rate as general-purpose CPUs due to the economies of
scale. However, offload can use commodity CPUs with software implementations,
which we believe is the proper approach. In addition, speed needs only to
be matched with the interface (e.g., 10 gigabit Ethernet), and we argue
proper design reduces the code path relative to the non-offloaded case
(e.g. with fewer memory copies).
Sarkar et al. and Ang
show that when the NIC CPU is under-provisioned with respect to the host
CPU, performance can actually degrade. Clearly the NIC processing capacity
must be sized properly.
Finally, increasing CPU speeds does not address the scalability issue,
which is what we focus on here.
Efficient host interface: Early critiques are that TCP Offload Engine
(TOE) vendors recreated "TCP over a bus". Development of an elegant and
efficient host/adapter interface for offload is a fundamental research
challenge, one we are addressing in this paper.
Bad buffer management: Unless Offload engines
understand higher-level protocols, there is still an application-layer
header copy. While true, copying of application headers is not as
performance-critical as copying application data. One complication is
the application combining its own headers on the same connection with
its data. This can only be solved by changing the application, which
is already proposed in RDMA extensions for NFS and iSCSI.
Connection management overhead: Unlike conventional NICs,
offload adapters must maintain per-connection state. Opponents argue
that offload cannot handle large numbers of connections, but Web
server workloads have forced host TCP stacks to discover techniques to
efficiently manage tens of thousands of connections. These techniques
are equally applicable for an interface-based implementation.
Resource management overhead: Critics argue that tracking resource
management is "more difficult" for offload. We do not believe this is
the case. It is straightforward to extend the notion of resource management
across the interface without making the adapter aware of every process
as we will show in Sections 3 and 4.
Event management: The claim is that offload does not address
managing the large numbers of events that occur in high-volume servers.
It is true that offload, per se, does not address application
visible events, which are better addressed by the API.
However, offload can shield the host operating system from spurious
unnecessary adapter events, such as TCP acknowledgments or window
advertisements. In addition, it allows batching of other events to
amortize the cost of interrupts and bus crossings.
Partial offload is sufficiently effective: Partial offload
approaches include checksum offload and large send (or TCP Segmentation
Offload), as discussed in Section 2.1.
While useful, they have limited value and do not fully solve the scalability
problem as was shown in Section 2.2.
Other arguments include that checksum offload actually
masks errors from the host. In contrast,
offload allows larger batching and the opportunity to perform more
rigorous error checking (by including the CRC in the data descriptors).
Maintainability: Opponents argue that offload-based approaches
are more difficult to update and maintain in the presence of security
and bug patches. While this is true of an ASIC-based approach, it is
not true of a software-based approach using general-purpose hardware.
Quality assurance: The argument here is that offload is harder
to test for bugs. However, testing tools such as TBIT and ANVL allow remote testing of the
offload interface. In addition, software based approaches based on
open-source TCP implementations such as Linux or FreeBSD
facilitate both maintainability and quality assurance.
System management interface: Opponents claim that offload adapters
cannot have the same management interface as the host OS. This is incorrect:
one example is SNMP. It is trivial to extend this to an offload adapter.
Concerns about NIC vendors: Third-party vendors may go out of
business and strand the customer. This has nothing to do with offload;
it is true of any I/O device: disk, NIC, or graphics card.
Economic incentives seem to address customer needs.
In addition, one of the largest NIC vendors is Intel.
3 System Design
In this section we describe our Offload design and how it addresses scalability.
3.1 How Offload Addresses Scalability
A higher-level interface.
Offload allows the host operating system to interact with the device at
a higher level of abstraction. Rather than simply queuing MTU-sized
packets for transmission or reception, the host issues commands at
the transport layer (e.g., connect(), accept(), send(),
close()). This allows the adapter to shield the host from
transport layer events (and their attendant interrupt costs) that may
be of no interest to the host, such as arrivals of TCP acknowledgments
or window updates. Instead, the host is only notified of meaningful
events. Examples include a completed connection establishment or termination
(rather than every packet arrival for the 3-way handshake or 4-way
tear-down) or application-level data units.
Sufficient intelligence on the adapter can determine the appropriate
time to transfer data to the host, either through knowledge of
standardized higher-level protocols (such as HTTP or NFS) or through
a programmable interface that can provide an application signature
(i.e., an application-level equivalent to a packet filter).
By interacting at this higher level of abstraction, the host will
transfer less data over the bus and incur fewer interrupts and
device register accesses.
Ability to move data in larger sizes. As described in Section
2.1, the ability to use large MTUs has a significant
impact on performance for both sending and receiving data. Large
send/TSO only approximates this optimization, and only for the send
side. In contrast, offload allows the host to send and receive data
in large chunks unaffected by the underlying MTU size. This reduces
use of poorly scaling components by making more efficient use of the
I/O bus. Utilization of the I/O bus is not only affected by the data
sent over it, but also by the DMA descriptors required to describe
that data; offload reduces both. In addition, data that is typically
DMA'ed over the I/O bus in the conventional case is not transferred
here, for example TCP/IP and Ethernet headers.
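A rough count illustrates the savings. The Python sketch below compares a conventional interface with an offload interface for a single large send; it assumes one descriptor per packet, a 1500-byte MTU, 14 + 20 + 20 bytes of Ethernet, IP and TCP headers without options, and 4 KB offload buffers. These figures and the function names are assumptions made for the illustration.

```python
import math

ETH_HDR, IP_HDR, TCP_HDR = 14, 20, 20  # header sizes in bytes, no options


def conventional_cost(nbytes, mtu=1500):
    """Descriptors and header bytes crossing the bus, one packet per descriptor."""
    mss = mtu - IP_HDR - TCP_HDR
    packets = math.ceil(nbytes / mss)
    return packets, packets * (ETH_HDR + IP_HDR + TCP_HDR)


def offload_cost(nbytes, buf_size=4096):
    """Descriptors for raw application data; headers are built on the adapter."""
    return math.ceil(nbytes / buf_size), 0


print(conventional_cost(64 * 1024))
print(offload_cost(64 * 1024))
```

For a 64 KB send, the conventional path needs roughly three times as many descriptors and also moves the packet headers across the bus, while the offload path moves only application data.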
Improving memory reference behavior.
We believe offload will not only increase available cycles to the
application but improve application memory reference behavior.
By reducing cache and TLB pollution, cache hit rates and CPI will
improve, increasing application performance.
3.2 Current Adapter Designs
Figure 1: Conventional Protocol Stack
Perhaps the simplest way to understand an architecture that offloads
all TCP/IP processing is to outline the ways in which offload differs from
conventional adapters in the way it interacts with the OS. Figure
1 illustrates a conventional protocol
architecture in an operating system. Operating systems tend to
communicate with conventional adapters only in terms of data transfer
by providing them with two queues of buffers. One queue is made up of
ready-made packets for transmission; the other is a queue of empty
buffers to use for packet reception. Each queue of buffers is
identified, in turn, by a descriptor table that describes the size and
location of each buffer in the queue. Buffers are typically described
in physical memory and must be pinned to ensure that they are
accessible to the card, i.e., so that they are not paged out. The
adapter provides a memory-mapped I/O interface for telling the adapter
where the descriptor tables are located in physical memory, and provides an
interface for some control information, such as what interrupt number
to raise when a packet arrives. Communication between the host CPU
and the adapter tends to be in one of three forms, as is shown in
Figure 1: DMA's of buffers and descriptors to
and from the adapter; reads and writes of control information to and
from the adapter, and interrupts generated by the adapter.
3.3 Offloaded Adapter Design
Figure 2: Offload Architecture
An architecture that seeks to offload the full TCP/IP stack has both
similarities and differences in the way it interacts with the adapter.
Figure 2 illustrates our offload architecture. As in
the conventional scenario, queues of buffers and descriptor tables are
passed between the host CPU and the adapter, and DMA's, reads, writes
and interrupts are used to communicate. In the offload architecture,
however, the host and the adapter communicate using a higher level of
abstraction. Buffers have more explicit data structures imposed on
them that indicate both control and data interfaces. As with a
conventional adapter, passed buffers must be expressed as physical
addresses and must be in pinned memory. The control interface allows
for the host to command the adapter (e.g., what port numbers to listen
on) and for the adapter to instruct the host (e.g., to notify the host
of the arrival of a new connection). The control interface is
invoked, for example, by conventional socket functions that control
connections: socket(), bind(), listen(),
connect(), accept(), setsockopt(), etc. The data
interface provides a way to transfer data on established connections
for both sending and receiving and is invoked by socket functions such
as send(), sendto(), write(), writev(),
read(), readv(), etc. Even the data interface is at a higher
layer of abstraction, since the passed buffers consist of
application-specific data rather than fully-formed Ethernet frames
with TCP/IP headers attached. In addition, these buffers need to
identify which connection that the data is for. Buffers containing
data can be in units much larger than the packet MTU size. While
conceptually they could be of any size, in practice they are unlikely
to be larger than a VM page size.
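The kind of structure imposed on these buffers might look like the following Python sketch. It is not the prototype's actual descriptor format: every field name, the string operation codes, and the one-page limit on data buffers are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CONTROL = "control"  # e.g. listen, accept, connect, setsockopt
    DATA = "data"        # e.g. send, recv


PAGE_SIZE = 4096  # assumed VM page size


@dataclass(frozen=True)
class Descriptor:
    kind: Kind
    op: str         # hypothetical operation name
    conn_id: int    # connection the request refers to (-1 if none yet)
    phys_addr: int  # physical address of a pinned buffer (0 if no buffer)
    length: int     # bytes of application data, not a framed packet
    arg: int = 0    # operation-specific argument, such as a port number


def listen_descriptor(port):
    """Control request: ask the adapter to listen on a port."""
    return Descriptor(Kind.CONTROL, "listen", -1, 0, 0, arg=port)


def send_descriptor(conn_id, phys_addr, length):
    """Data request: application bytes for an established connection."""
    if not 0 < length <= PAGE_SIZE:
        raise ValueError("this sketch limits a data buffer to one page")
    return Descriptor(Kind.DATA, "send", conn_id, phys_addr, length)


print(listen_descriptor(80))
print(send_descriptor(conn_id=7, phys_addr=0x10000, length=4096))
```

Note that the data descriptor names a connection and carries application bytes only; there is no per-packet header for the host to construct.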
As with a conventional adapter, the interface to the offload adapter
need not be synchronous. The host OS can queue requests to the
adapter, continue doing other processing, and then receive a
notification (perhaps in the form of an interrupt) that the operation
is complete. The host can implement synchronous socket operations by
using the asynchronous interface and then block the application until
the results are returned from the adapter. We believe asynchronous
operation is key in order to ameliorate and amortize fixed overheads.
Asynchrony allows larger-scale batching and enables other
optimizations such as polling-based approaches to servers.
The offload interface allows supporting conventional user-level APIs,
such as the socket interface, as well as newer APIs that allow more
direct access to user memory such as DAFS, SDP, and RDMA. In
addition, offload allows performing zero-copy sends and receives
without changes to the socket API. The term zero-copy refers to
the elimination of memory-to-memory copies by the host. Even in the
zero-copy case, data is still transferred across the I/O bus by the
adapter via DMA.
For example, in the case of a send using a conventional adapter, the
host typically copies the data from user space into a pinned kernel buffer,
which is then queued to the adapter for transmission.
With an intelligent adapter, the host can block the user application
and pin its buffers, then invoke the adapter to DMA the data directly
from the user application buffer.
This is similar to previous "single-copy" approaches
[13,20], except that the transfer
across the bus is done by the adapter DMA and not via an explicit copy
by the host CPU.
Observe from Figure 2 that the interaction between
the host and the adapter now occurs between the socket and TCP layers.
A naive implementation may make unnecessary transfers across the
PCI bus for achieving socket functionality.
For example, accept() would now cause a bus crossing in addition
to a kernel crossing, as could setsockopt() for actions such as
changing the send or receive buffer sizes or the Nagle algorithm.
However, each of these costs can be amortized via batching multiple
requests into a single request that crosses the bus. For example,
multiple arrived connections can be aggregated into a single accept()
crossing which then translates into multiple accept() system calls.
On the other hand, certain events that would generate bus crossings
with a conventional adapter might not do so with an offload adapter, such
as ACK processing and generation.
The relative weight of these advantages and disadvantages depends on
the implementation and workload of the application using the adapter.
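The amortization of accept() can be sketched as follows. This Python model is illustrative only: the class `AcceptQueue` and its methods are hypothetical, and a counter stands in for actual bus traffic.

```python
from collections import deque


class AcceptQueue:
    """Host-side queue filled by batched connection notifications."""

    def __init__(self):
        self.ready = deque()
        self.bus_crossings = 0

    def deliver_batch(self, conn_ids):
        # One crossing carries every connection that has arrived so far.
        self.bus_crossings += 1
        self.ready.extend(conn_ids)

    def accept(self):
        # Served from host memory; no additional bus crossing is needed.
        return self.ready.popleft() if self.ready else None


q = AcceptQueue()
q.deliver_batch([101, 102, 103])
accepted = [q.accept() for _ in range(3)]
print(accepted, q.bus_crossings)
```

Three accept() system calls are satisfied by a single crossing; without batching, each would require its own.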
4 System Implementation
To evaluate our design and the impact of design decisions, we
implemented a software prototype. Our decision to implement the
prototype purely in software, rather than building or modifying actual
adapter hardware, was motivated by several factors.
Since our goal is to study not just performance, but scalability, we
ultimately intend to model different hardware characteristics, for
both the host and adapter, using a cycle accurate hardware simulator.
Limiting our analysis to only currently available hardware would
hinder our evaluation for future hardware generations.
Ultimately, we envision an adapter with a general purpose processor,
in addition to specialized hardware to accelerate specific operations
such as checksum calculation. Our prototype software is intended
to serve as a reference implementation for a production adapter.
Our prototype is composed of three main components:
- OSLayer, an operating system layer that provides the socket
interface to applications and maps it to the descriptor interface
shared with the adapter;
- Event-driven TCP, our offloaded TCP implementation;
- IOLib, a library that encapsulates interaction
between OSLayer and Event-driven TCP.
At the moment, OSLayer is implemented as a library that is statically linked
with the application.
Ultimately, it will be decomposed into two components: a library linked
with applications and a component built in the kernel.
Event-driven TCP currently runs as a user-level process that accesses
the actual network via a raw socket. It will eventually become the
main software loop on the adapter.
The IOLib implementation currently communicates via TCP sockets, but the
design allows for implementations that communicate over a PCI bus or
other interconnects such as Infiniband.
This provides a vehicle for experimentation and analysis and allows us
to measure bus traffic without having to build a detailed simulation of
a PCI bus or other interconnect.
Figure 3: Offload Prototype
We used the Flash Web server for our evaluation, with Flash and OSLayer
running on one machine and Event-driven TCP running on another.
We use httperf running on a separate machine to drive load. To compare the behavior
of the prototype with the conventional case, we evaluate a similar
client-server configuration using an E1000 device driver that has been
instrumented to measure bus traffic. Figure 3
illustrates how the components fit together.
The implementation is described in more detail below.
4.1 OSLayer
OSLayer is essentially the socket interface decoupled from the TCP
stack.
OSLayer is a library that exposes an asynchronous socket interface to
applications.
As seen in Figure 3,
the application employs the socket API and OSLayer communicates with
Event-driven TCP via IOLib through descriptors, discussed in more detail
in Section 4.3.
After OSLayer creates the descriptor appropriate to the particular API
function, the call returns.
OSLayer and Event-driven TCP interact via a byte stream abstraction.
Towards that end, OSLayer transfers buffers of 4 KB to Event-driven TCP.
For large transfers, this reduces the number of DMAs and
bus crossings significantly.
To further limit bus crossings and increase scalability, OSLayer employs
several techniques. Descriptors can be batched before transferring them
to the card. In the current implementation, we allow batching using a
configurable batching level. However, a timer ships any
descriptors that have waited too long before the batching
level is reached.
Even if batching is set to one, descriptors can still batch significantly
with large data transfers.
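As an illustration only, the batching policy described above might be sketched as follows in Python. The class name, the `ship` callback (where one call stands for one transfer across the bus) and the injectable `clock` are hypothetical and are not part of the prototype.

```python
import time
from typing import Callable, List, Optional


class DescriptorBatcher:
    """Collects descriptors and ships them to the adapter in one transfer.

    A batch is shipped when it reaches `batch_level` descriptors, or when
    the oldest queued descriptor has waited at least `timeout_s` seconds.
    """

    def __init__(self, batch_level: int, timeout_s: float,
                 ship: Callable[[List[object]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.batch_level = batch_level
        self.timeout_s = timeout_s
        self.ship = ship            # hypothetical: one call models one bus transfer
        self.clock = clock
        self.pending: List[object] = []
        self.oldest: Optional[float] = None   # arrival time of the oldest pending descriptor

    def submit(self, descriptor: object) -> None:
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(descriptor)
        if len(self.pending) >= self.batch_level:
            self.flush()

    def poll_timer(self) -> None:
        """Called periodically; ships a partial batch that has waited too long."""
        if self.pending and self.clock() - self.oldest >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            batch, self.pending = self.pending, []
            self.oldest = None
            self.ship(batch)
```

With a batching level of one, `submit` ships every descriptor immediately and the timer never fires, which corresponds to the unbatched configuration.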
In addition, OSLayer performs buffer coalescing (similar
to the TCP_CORK socket option on Linux), which is utilized by the Flash Web
server. When using the sendfile() operation, this allows HTTP headers
and data to be aggregated, thus sharing a descriptor and therefore a
transfer. While the conventional Linux stack is limited to two sk_buffs,
OSLayer can combine any number of sk_buffs into one.
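The effect of coalescing on descriptor count can be illustrated with a small Python sketch. It assumes the byte stream abstraction and the 4 KB buffer size described above; the function name `coalesce` is hypothetical, and the sketch copies data for simplicity, which a real implementation would likely avoid.

```python
from typing import Iterable, List


def coalesce(chunks: Iterable[bytes], buffer_size: int = 4096) -> List[bytes]:
    """Pack application writes back to back into buffers of at most buffer_size.

    Each returned buffer would be described by a single SEND descriptor,
    so fewer buffers means fewer descriptors and fewer bus transfers.
    """
    stream = b"".join(chunks)
    return [stream[i:i + buffer_size] for i in range(0, len(stream), buffer_size)]


# An HTTP header followed by a 1 KB body shares one buffer (one descriptor)
# instead of using two.
header, body = b"H" * 200, b"D" * 1024
buffers = coalesce([header, body])
print(len(buffers), [len(b) for b in buffers])
```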
4.2 Event-driven TCP
Event-driven TCP (EDT) performs the majority of the TCP processing for the
adapter and was derived from the Arsenic user-level TCP stack.
Normally a TCP stack running on the host is animated by three types of
events: a calling user-space application process or thread, a packet
arrival, or a timer interrupt.
Since there are no applications on the adapter, an event-driven architecture
was chosen, as it scales better than a process- or thread-based
approach.
EDT is thus a single-threaded event-based closed loop, implemented
as a stand-alone user-space process. On every
iteration of the loop, each of the following is checked:
pending packets, new descriptors, DMA completions and TCP timers.
Execution is thus animated by packet and descriptor arrivals,
DMA completions, and TCP timer firings.
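The structure of such a loop can be sketched in a few lines of Python. This is a simplified illustration rather than the EDT code: each of the four event sources is modelled as a queue, the source names and the `EventLoop` class are hypothetical, and draining each queue completely on every iteration is an assumption.

```python
from collections import deque
from typing import Callable, Deque, Dict

# Order in which the loop polls its event sources on every iteration.
SOURCES = ("packet", "descriptor", "dma_done", "timer")


class EventLoop:
    """Single-threaded loop: no threads and no blocking calls, only polling."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[object]] = {name: deque() for name in SOURCES}
        self.handlers: Dict[str, Callable[[object], None]] = {}

    def on(self, source: str, handler: Callable[[object], None]) -> None:
        self.handlers[source] = handler

    def post(self, source: str, event: object) -> None:
        self.queues[source].append(event)

    def run_once(self) -> int:
        """One loop iteration: check every source and dispatch what is pending."""
        handled = 0
        for source in SOURCES:
            queue = self.queues[source]
            while queue:
                self.handlers[source](queue.popleft())
                handled += 1
        return handled
```

The adapter's main loop would then amount to calling `run_once()` repeatedly.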
Event-driven TCP does not necessarily notify OSLayer of every packet.
For example, instead of being informed of every acknowledgment,
OSLayer is alerted only when an entire transfer completes.
Similarly, OSLayer receives only a single event for a connection establishment or
termination, rather than one for each packet of the TCP handshake. This reduces
the number of descriptors (and their corresponding events) to be
transferred and processed by the host.
EDT communicates with OSLayer by passing descriptors through IOLib,
discussed in Sections 4.3 and 4.4.
Since it is a user-space process, EDT sends and receives packets
over the network using raw sockets and libpcap.
4.3 Descriptors
Descriptors are a software abstraction intended to capture the hardware-level
communication mechanism that occurs over the I/O bus between host and
adapter.
Descriptors are primarily typed by how they are used: request
descriptors for issuing commands (e.g., CONNECT, SEND), and response
descriptors for the results of those commands (e.g., success, failure).
Descriptors are further categorized as control (SOCKET, BIND, etc.)
or data (SEND, RECV).
Two separate sets of tables are used for each transfer direction.
When an application calls send, a SEND request descriptor is transferred
from OSLayer to Event-driven TCP, containing the address and length
of the buffer to be sent. A request for DMA is queued at Event-driven
TCP and the next descriptor is processed. After the DMA completes,
the event is picked up by Event-driven TCP, and a response descriptor
is created with the address of the buffer. This descriptor informs
OSLayer that the buffer is no longer used. Upon receipt of this SEND
response, OSLayer cleans up the send buffer. Of course, many send
descriptors can be sent to Event-driven TCP at once. Buffers described
by a SEND descriptor can be up to 4 KB. We chose 4 KB since this is the
standard page size in most architectures; however, our implementation
has the ability to transfer up to 64 KB.
When receive() is called, the RECV descriptor is transferred from
OSLayer to Event-driven TCP, containing the address and length of the
buffer to DMA received data into. If data is available on
Event-driven TCP's receive queue, a DMA is immediately initiated.
Later, after DMA is finished, a RECV response descriptor is created,
notifying OSLayer that the data is available, and Event-driven TCP
can free its own sk_buff.
If data is not available upon receipt of a RECV descriptor, the buffer is
placed on a receive buffer queue for that connection. When the data does
arrive later on the appropriate connection, the buffer is removed from the
queue, the DMA is performed, and a RECV response descriptor is created
and sent to the host.
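The two RECV cases above (data already queued, or buffer posted first) can be sketched as follows. This is a simplified Python model, not the prototype's code: the DMA is treated as completing immediately, whereas the prototype queues it, and the `Connection` class, its field names and the `RECV_RESPONSE` tag are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class Connection:
    """Adapter-side receive state for one connection (hypothetical model)."""
    recv_queue: Deque[bytes] = field(default_factory=deque)              # data from the network
    recv_buffers: Deque[Tuple[int, int]] = field(default_factory=deque)  # posted (addr, length)
    responses: List[Tuple[str, int, int]] = field(default_factory=list)  # sent back to the host

    def _dma_to_host(self, addr: int, length: int) -> None:
        # Move up to `length` bytes into the host buffer, keep any remainder queued.
        chunk = self.recv_queue.popleft()
        n = min(len(chunk), length)
        if n < len(chunk):
            self.recv_queue.appendleft(chunk[n:])
        self.responses.append(("RECV_RESPONSE", addr, n))

    def on_recv_descriptor(self, addr: int, length: int) -> None:
        if self.recv_queue:
            self._dma_to_host(addr, length)           # data already waiting
        else:
            self.recv_buffers.append((addr, length))  # park the buffer until data arrives

    def on_data_arrival(self, payload: bytes) -> None:
        self.recv_queue.append(payload)
        while self.recv_queue and self.recv_buffers:
            addr, length = self.recv_buffers.popleft()
            self._dma_to_host(addr, length)
```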
Most of the control descriptors work in a
relatively straightforward manner; however, the CLOSE operation is
worth describing in more detail. A CLOSE descriptor is
transferred from OSLayer to Event-driven TCP when the application
initiates a close. After sending out all of the sk_buffs on the write
queue, Event-driven TCP will signal the close to the remote peer by
sending a packet with the FIN bit set. After the final ACK is sent,
a response descriptor is created. In the event the other side closes the
connection, a CLOSE command descriptor is created in Event-driven TCP and
sent to OSLayer. OSLayer need not reply with a CLOSE response descriptor
in this case; OSLayer just notifies the application and cleans up.
All DMAs involving data are initiated by Event-driven TCP, allowing
EDT to control the flow up to the host. Note that a DMA is not necessarily
performed immediately. Since the request for a DMA is queued, it may be
some time before a response to a descriptor is received by OSLayer.
This queued DMA approach required some changes to the TCP stack because
pending sends did not prevent a CLOSE descriptor from being processed
before the send's DMA completed. Since it was difficult to determine how
many sends were queued for DMA (and when they were finished), "empty"
sk_buffs are placed on the write queue with a flag marking the data as
not yet present. When the DMA completes, the flag is updated to show that
the data has arrived, and the sk_buff is ready for sending. This flag is
therefore checked before sending any sk_buff. This required changes in
several components of the TCP stack.
For example, close processing is now split into two pieces.
The first part indicates the connection is in the process of closing.
The second part actually completes the close,
after the last DMA is complete and the buffer is sent.
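The interaction between placeholder sk_buffs and the two-part close can be sketched as follows. This is an illustrative Python model under simplifying assumptions: TCP windowing and retransmission are ignored, transmission is recorded in a list, and the names `SkBuff`, `WriteQueue` and `data_ready` are hypothetical (the flag is expressed here as "data is ready" rather than "data not yet present").

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class SkBuff:
    data: Optional[bytes] = None
    data_ready: bool = False      # False while the DMA from the host is pending


class WriteQueue:
    def __init__(self) -> None:
        self.queue: Deque[SkBuff] = deque()
        self.closing = False
        self.wire: List[str] = []  # log of what was transmitted, in order

    def on_send_descriptor(self) -> SkBuff:
        skb = SkBuff()             # "empty" placeholder keeps the send ordered
        self.queue.append(skb)
        return skb

    def on_dma_complete(self, skb: SkBuff, data: bytes) -> None:
        skb.data = data
        skb.data_ready = True
        self.transmit()

    def on_close_descriptor(self) -> None:
        self.closing = True        # first part: mark the connection as closing
        self.transmit()

    def transmit(self) -> None:
        # Only send buffers whose data has actually arrived, in queue order.
        while self.queue and self.queue[0].data_ready:
            skb = self.queue.popleft()
            self.wire.append(f"DATA({len(skb.data)})")
        # Second part: complete the close once every queued send has gone out.
        if self.closing and not self.queue:
            self.wire.append("FIN")
            self.closing = False
```

Because the placeholders occupy the write queue from the moment the SEND descriptor arrives, a CLOSE that is processed before the DMAs finish cannot overtake the pending data.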
4.4 IOLib
IOLib provides a communication library to the OSLayer and Event-driven
TCP code by abstracting the I/O layer to a generic Put/Get
interface. We chose this approach for ease of porting
the offload prototype to bus, fabric or serial communication
interfaces. Thus, only IOLib needs to understand the specific
properties of the underlying communication link, while the calls
within OSLayer and Event-driven TCP remain unchanged.
The IOLib Put/Get library has an asynchronous queuing interface for
sending and receiving data. This interface is augmented by virtual
interface registers that can be used for base address references
traditionally used in the PCI bus interface. The Put/Get interface
can be supported by several communication mechanisms, such as
shared memory or message passing. Figure
3 shows an example of how the server and adapter
components communicate using IOLib, where support for the Put/Get
interface is provided over a standard TCP/IP socket.
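To make the layering concrete, the following Python sketch shows one way a generic Put/Get interface could hide the underlying link. It is not the IOLib API: the `Transport`, `InMemoryTransport`, `Endpoint` and `make_pair` names are hypothetical, and an in-memory queue stands in for the socket, bus or fabric.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class Transport(ABC):
    """One direction of the host/adapter link; only this layer knows the medium."""

    @abstractmethod
    def put(self, message: bytes) -> None: ...

    @abstractmethod
    def get(self) -> Optional[bytes]: ...


class InMemoryTransport(Transport):
    """Stand-in medium; a socket- or PCI-backed transport would replace it."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()

    def put(self, message: bytes) -> None:
        self._queue.append(message)

    def get(self) -> Optional[bytes]:
        # Non-blocking: returns None when nothing is pending.
        return self._queue.popleft() if self._queue else None


class Endpoint:
    """What OSLayer or Event-driven TCP would see: just put() and get()."""

    def __init__(self, outbound: Transport, inbound: Transport) -> None:
        self._outbound, self._inbound = outbound, inbound

    def put(self, message: bytes) -> None:
        self._outbound.put(message)

    def get(self) -> Optional[bytes]:
        return self._inbound.get()


def make_pair(make_transport: Callable[[], Transport] = InMemoryTransport
              ) -> Tuple[Endpoint, Endpoint]:
    """Build connected host and adapter endpoints over the chosen transport."""
    host_to_adapter, adapter_to_host = make_transport(), make_transport()
    return (Endpoint(host_to_adapter, adapter_to_host),
            Endpoint(adapter_to_host, host_to_adapter))
```

Porting to a different interconnect would then mean supplying a different `Transport` subclass, while the callers of `put` and `get` remain unchanged.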
Since IOLib provides the interface between the host and the adapter,
it is a natural place to monitor traffic between the two.
To facilitate comparisons to conventional adapter implementations,
we instrumented IOLib to measure three different aspects of I/O
traffic: number of DMAs requested, number of bytes transferred,
and number of I/O bus cycles consumed by a transfer.
The model for capturing the number of bus cycles consumed is based
on a 133 MHz, 64 bit PCI-X bus and is calculated as follows:
cycles = 4 + ((transfer_size + 7) / 8)
where transfer_size is in bytes and the division is integer division.
Four cycles are required for initiation and termination,
the bus is eight bytes (64 bits) wide, and a transfer whose size
is not a multiple of eight bytes still consumes the bus for the entire
final cycle. Charges are incurred for all data transferred: not only the
packet buffers but also the DMA descriptors that describe
them.
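The bus model and the three measured quantities can be expressed directly in Python. The function and class names below are hypothetical, and the 32-byte descriptor size used in the example is an assumption chosen only to show why batching reduces cycles under this model.

```python
def pcix_bus_cycles(transfer_size: int) -> int:
    """Cycles charged for one transfer on the modelled 64-bit PCI-X bus.

    Four cycles of fixed overhead plus one cycle per eight bytes, rounded up.
    """
    if transfer_size < 0:
        raise ValueError("transfer_size must be non-negative")
    return 4 + (transfer_size + 7) // 8


class BusTrafficCounter:
    """Accumulates DMA count, bytes transferred and bus cycles consumed."""

    def __init__(self) -> None:
        self.dma_count = 0
        self.bytes_transferred = 0
        self.bus_cycles = 0

    def record(self, transfer_size: int) -> None:
        self.dma_count += 1
        self.bytes_transferred += transfer_size
        self.bus_cycles += pcix_bus_cycles(transfer_size)


# Ten separate 32-byte transfers versus one 320-byte batched transfer
# (the 32-byte descriptor size is assumed for illustration).
separate = sum(pcix_bus_cycles(32) for _ in range(10))
batched = pcix_bus_cycles(320)
print(separate, batched)
```

Under this model the fixed four-cycle charge is paid once per transfer, so the batched transfer costs 44 cycles against 80 for the ten separate ones.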
4.5 Limitations of the Prototype
OSLayer is still under active development. Many of the socket options not
used by Flash are not fully implemented. OSLayer requires a single-threaded
application because there is no current mechanism to distinguish descriptors
between threads or processes. A feedback mechanism is
still required so that OSLayer knows how many send buffers are available
in Event-driven TCP. If there are no send buffers available,
then OSLayer can return a failure code to the application invoking the send.
More work can be done to improve and extend Event-driven TCP.
For example, it could be made current with the latest version of the
Linux TCP stack. We believe performance would be improved if sk_buffs
could reference multiple noncontiguous pages.
Certain non-essential descriptors are not yet implemented.
An immediate mode descriptor, that is, one with the data embedded directly
in the descriptor, would also reduce the number of bus crossings.
Descriptors for sending status (e.g., the number of available send buffers)
and option querying could also improve performance and allow more dynamic
behavior. Finally, descriptors to cancel a send or a receive are needed.
Several new operations with associated descriptors are planned.
A batching ACCEPT operation will allow OSLayer to instruct Event-driven
TCP to wait for N connections to be established before returning a
response to the host. A single response descriptor would contain all
of the requisite information about each connection. In the appropriate
scenarios, this should reduce ACCEPT descriptor traffic.
The next logical step is to provide the option of delaying the connection
notification until the arrival of the first data on a
connection.
Another item is the addition of a "close" option to a SEND descriptor
that lets a close operation be combined with a send.
This eliminates the need for a separate close descriptor, and can
increase the likelihood that the FIN bit is piggybacked on the final
data packet.
We are also designing cumulative completion descriptors. Instead of
completing each send or receive request individually with its own
SEND/RECV complete descriptor, we intend to have a send complete descriptor
that indicates completion of all requests up to and including that one.
This change requires no syntactic changes to the descriptors; it
simply changes the semantics of the response so that completion of a
send/receive implicitly indicates completion of any previous sends.
This approach is employed by OE, and
we believe the benefits can be achieved in our stack as well.
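The host-side bookkeeping for cumulative completions might look like the following Python sketch. It assumes, as described for SEND responses earlier, that a completion identifies a buffer by its address and that completions refer to sends in the order they were posted; the `SendTracker` name and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, List


class SendTracker:
    """Tracks outstanding send buffers on the host, in posting order."""

    def __init__(self) -> None:
        self._outstanding: Deque[int] = deque()

    def post_send(self, buffer_addr: int) -> None:
        self._outstanding.append(buffer_addr)

    def on_send_complete(self, buffer_addr: int) -> List[int]:
        """Cumulative semantics: free this buffer and every earlier one."""
        if buffer_addr not in self._outstanding:
            return []                      # unknown or already completed
        freed: List[int] = []
        while True:
            head = self._outstanding.popleft()
            freed.append(head)
            if head == buffer_addr:
                return freed
```

A single response naming the most recent completed buffer thus replaces one response per send, without any change to the descriptor layout.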
4.6 E1000 Driver
Figure 4: Baseline Configuration
To provide comparisons with a baseline system, we modified the
Linux 2.6.9 Intel E1000 device driver code to measure the same
three components of bus traffic as was done for IOLib: DMAs
requested, bytes transferred, and bus cycles consumed. The
bus model is the same as that described in Section 4.4.
Sends are measured in e1000_tx_queue(); receives are
monitored in e1000_clean_rx_irq().
Figure 4 shows how
the instrumented driver is used in our experiments.
5 Experimental Results
In this section, we present the results of our prototype
described in Section 4 and compare it to
a baseline implementation meant to represent the current
state of the art in conventional (i.e., non-offloaded) systems.
The goal is to show that offload provides a more efficient
interface between the host and the adapter for the metric we are
able to measure, namely, I/O traffic.
We again use a simple Web server workload to evaluate our prototype.
For software we use the Flash Web server and the httperf
client workload generator.
We use multiple nodes within an IBM Blade Center
to produce the offload prototype configuration depicted in Figure 3.
The baseline configuration is shown in Figure 4.
We examine three scenarios: transferring a small file (1 KB), a
moderately-sized file (64 KB), and a large file (512 KB). This is
intended to capture a spectrum of data transfer sizes and vary the ratio of
per-connection costs to per-byte costs. We measure transfers in both
directions (send and receive) using three metrics for utilizing the
I/O bus: DMA count, which counts the number of times a DMA is
requested from the bus; bus cycles, which measures the number of
cycles consumed on the bus (based on the model in Section
4.4); and bytes transferred, to determine the raw
number of bytes sent over the bus.
5.1 Baseline Results
Table 4: Comparing IO Traffic for E1000 and Offload
Table 4 shows our results. Overall, we see that
offload is effective at reducing bus activity, with improvements up to
70 percent. We look at each transfer size in turn.
Examining the results for 1 KB transfers in Table 4,
note that there is a significant improvement on the receive path,
mainly due to shielding the host from ACK packets.
However, this is a send-side test, and the number of bytes sent
from application to application does not change. Even so, we see
a moderate reduction (4-17%) in bytes transferred on the send side.
This is partly because TCP, IP and Ethernet headers are not
transferred over the bus in the offload prototype, whereas they
are in the baseline case. Note that the number of DMAs and the
utilization of the bus are also reduced, by up to 70% and 16%,
respectively.
Looking at the results for 64 KB transfers, again we see significant
improvement on the receive side. A larger amount of data is being
sent in this experiment, and thus the byte savings on the send side
are relatively small at four percent. However, note that the efficiency
of the bus has greatly improved: the number of send DMAs
requested falls by 64 percent, and the number of bus cycles consumed
falls by 9 percent.
These trends are also reflected in the 512 KB results.
|                          | 1 KB                    | 64 KB                   | 512 KB                  |
|                          | E1000  | Offload | Diff | E1000  | Offload | Diff | E1000  | Offload | Diff |
|                          | (num)  | (num)   | (%)  | (num)  | (num)   | (%)  | (num)  | (num)   | (%)  |
| Recv DMA Count           | 12     | 11      | 8    | 62     | 39      | 37   | 363    | 189     | 48   |
| Recv Bus Cycles Consumed | 119    | 78      | 34   | 559    | 269     | 52   | 3260   | 1394    | 57   |
| Recv Bytes Transferred   | 538    | 252     | 53   | 2385   | 821     | 66   | 13900  | 4706    | 66   |
| Send DMA Count           | 9      | 9       | 0    | 159    | 56      | 64   | 1222   | 370     | 70   |
| Send Bus Cycles Consumed | 237    | 200     | 16   | 9366   | 8532    | 9    | 74244  | 67683   | 8    |
| Send Bytes Transferred   | 1572   | 1301    | 17   | 69474  | 66389   | 4    | 552132 | 529131  | 4    |
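The Diff columns appear to give the percentage reduction relative to the E1000 baseline. That reading is our assumption, checked against a few entries of Table 4 in the Python sketch below; the printed values agree with the computed ones to within one percentage point, and the function name is hypothetical.

```python
def reduction_pct(baseline: float, offload: float) -> float:
    """Percentage by which the offload value is lower than the baseline."""
    return 100.0 * (baseline - offload) / baseline


# (metric, E1000, offload, Diff as printed in Table 4)
samples = [
    ("Recv bytes, 1 KB", 538, 252, 53),
    ("Send DMA count, 64 KB", 159, 56, 64),
    ("Send DMA count, 512 KB", 1222, 370, 70),
    ("Send bus cycles, 1 KB", 237, 200, 16),
]
for name, e1000, offload, printed in samples:
    print(f"{name}: computed {reduction_pct(e1000, offload):.1f}%, table {printed}%")
```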
5.2 Batching Descriptors
Table 5: Comparing IO Traffic for E1000 and Offload Batching Descriptor Traffic.
One obvious method to reduce bus crossings is to transfer multiple
descriptors at a time rather than one. Table 5 presents
results for a minimum
batching level of ten descriptors at a time, using an idle timeout value
of ten milliseconds. These can be tuned to
the transfer size, but are held constant for these experiments.
Observe that the numbers have improved for the 1 KB, 64 KB and 512 KB transfers
over the previous comparison in Table 4.
The improvements are limited, since increasing the minimum batching
threshold and timeouts did not significantly help for this type of traffic.
This is because multiple response and socket descriptor messages
are produced at nearly the same time. This technique is similar in
concept to interrupt coalescing in adapters; the distinction is that
information is batched at a higher level of abstraction.
|                          | 1 KB                    | 64 KB                   | 512 KB                  |
|                          | E1000  | Offload | Diff | E1000  | Offload | Diff | E1000  | Offload | Diff |
|                          | (num)  | (num)   | (%)  | (num)  | (num)   | (%)  | (num)  | (num)   | (%)  |
| Recv DMA Count           | 12     | 7       | 42   | 62     | 9       | 85   | 363    | 24      | 93   |
| Recv Bus Cycles Consumed | 119    | 60      | 50   | 559    | 134     | 76   | 3260   | 648     | 80   |
| Recv Bytes Transferred   | 538    | 244     | 55   | 2385   | 761     | 68   | 13900  | 4375    | 69   |
| Send DMA Count           | 9      | 5       | 44   | 159    | 21      | 87   | 1222   | 148     | 88   |
| Send Bus Cycles Consumed | 237    | 183     | 23   | 9366   | 8385    | 11   | 74244  | 66683   | 10   |
| Send Bytes Transferred   | 1572   | 1294    | 18   | 69474  | 66319   | 5    | 552132 | 528686  | 4    |
6 Related Work
Several performance studies on TCP offload have been conducted using an
emulation approach which partitions an SMP and uses a processor as
an offload engine. These studies have shown that offloading TCP processing
to an intelligent interface can provide significant performance improvements
when compared to the standard TCP/IP networking stack.
However, the study by Westrelin et al. lacks an
effective way to model the I/O bus traffic that occurs between
the host and offload adapter. They use the host memory bus to emulate the
I/O bus, but this emulation lacks the characteristics necessary
to capture the performance impact of an I/O bus such as PCI.
In practice, a high-speed memory bus is not representative of the
performance seen by an I/O bus.
Our implementation is designed with a modular I/O library that can
be used to model different I/O bus types.
Our paper considers multiple factors that affect server
scalability, including I/O.
Additionally, when using a partitioned SMP emulation approach, there is
coherency traffic necessary to keep the memory state consistent between
the processors.
This coherency overhead can affect the results, since it perturbs the
interaction between the host and offload adapter and
includes overhead that will not exist in a real system.
Our modeled offload system does not suffer from this issue.
Rangarajan et al.
and Regnier et al. also use a partitioned SMP
approach and show greater absolute performance when dedicating a processor to
packet processing. This approach can measure server scalability
with respect to the CPU but does not address the underlying scalability
issues that exist in other parts of the system, such as the memory bus.
TCP offload designs that do not address the scalability issues
discussed in this paper might improve CPU utilization on the host for
large block sizes but harm throughput and latency for small block sizes.
The current generation of offload adapters in the market has simply
moved the TCP stack from the host to the offload adapter without the
necessary design considerations for the host and adapter interface. For
some workloads this creates a bottleneck on the adapter.
Handshaking across the host and adapter
interface can be quite costly and reduce performance, especially for small
messages. Additionally, Ang found that there appears
to be no cheap way of moving data between host memory and an intelligent
interface.
Performance analysis of current generation network adapters only
reveals the characteristics of networking at a given point in time. In order
to understand the performance impacts of various design tradeoffs, all
of the components of the system need to be modeled so that performance
characteristics that change over time can be revealed.
Binkert et al. propose the execution-driven
simulator M5 to model network-intensive workloads. M5 is
capable of full system simulation including the OS, the memory model,
caching effects, DMA activity and multiple networked systems. M5 faithfully
models the system so it can boot an unmodified OS kernel and execute
applications in the simulated environment. In Section 7 we describe
the use of Mambo, an instruction level simulator for the
PowerPC, in order to faithfully model
Shivam and Chase showed that offload
can enable direct data placement, which can serve to eliminate some
communication overheads, rather than merely shifting them from the host to
the adapter.
They also provide a simple model to quantify the benefits of offload based
on the ratio of communication to computation and the ratio of the host
CPU processing power to the NIC processing power.
Thus a workload can be characterized based on the parameters of the
model and one can determine whether offload will benefit that workload.
Their paper can be seen as an application of Amdahl's Law to TCP
offload. Their analysis suggests that offload best supports low-lambda
applications such as storage servers.
Foong et al. found that performance is scaling
at about 60 percent of CPU speeds. This implies that the generally accepted
rule of thumb that 1 bps of network link requires 1 Hz of CPU
processing will not hold up over time. They point out that as CPU speed
increases the performance gap widens between it and the memory and I/O bus.
However, their study did not produce an implementation, and their results
come from an emulated offload system. Our work has focused on these
server scalability issues and created a design and implementation to
address them.
We have presented experimental evidence that quantifies
how poorly server network throughput is scaling with CPU speed
even with sufficient link and I/O bandwidth.
We argue the scalability problem is due to specific operations
that limit scalability,
in particular bus crossings, cache misses and interrupts.
Furthermore, we have shown experimental evidence that quantifies
how bus crossings and cache misses are scaling poorly with CPU speed.
We have designed a new host/adapter interface that exploits
additional processing at the network interface
to reduce scalability-limiting operations.
Experiments with a software prototype of our offloaded TCP stack
show that it can substantially reduce bus crossings.
By allowing the host to deal with network data in fewer pieces,
we expect our design to reduce cache misses and interrupts as well.
Work is ongoing to continue development of the prototype
and extend our analysis to study the effects
on cache misses and interrupts.
As described in Section 4.5,
the current prototype does not yet implement all aspects of the design.
We are continuing development with an emphasis on further aggregation
and reduction of operations that limit scalability.
Future additions include a batching accept operation,
an accept that returns after data arrives on the connection,
a send-and-close function, and cumulative completion semantics.
We are also preparing to evaluate
our prototype in Mambo, a simulation environment
for PowerPC systems.
Running in Mambo provides the ability to measure cache behavior
and quantify the impact of hardware parameters
such as processor clock rates, cache sizes, associativity and miss penalties.
Mambo allows us to run the OSLayer (host) and Event-driven TCP (adapter)
portions of the prototype on distinct simulated processors.
We can thus determine the hardware resources needed on the adapter
to support a given host workload.
Finally, we intend to extend the prototype and simulation to encompass
low-level device interaction.
This will entail replacing the socket-based version of IOLib
with a version that communicates across a hardware interconnect
such as PCI or InfiniBand.
This will allow us to predict throughput and latency on
simulated next-generation interconnects.
References
Boon S. Ang.
An evaluation of an attempt at offloading TCP/IP protocol
processing onto an i960rn-based NIC.
Technical Report 2001-8, HP Labs, Palo Alto, CA, Jan 2001.
Mohit Aron and Peter Druschel.
TCP implementation enhancements for improving Webserver
Technical Report TR99-335, Rice University Computer Science Dept.,
Mohit Aron and Peter Druschel.
Soft timers: Efficient microsecond software timer support for network
ACM Transactions on Computer Systems, 18(3):197-228, 2000.
The Infiniband Trade Association.
The Infiniband architecture.
Gaurav Banga, Jeffrey Mogul, and Peter Druschel.
A scalable and explicit event delivery mechanism for UNIX.
In Proceedings of the USENIX 1999 Technical Conference,
Monterey, CA, June 1999.
Nathan L. Binkert, Erik G. Hallnor, and Steven Reinhardt.
Network-oriented full-system simulation with M5.
In Proceedings Sixth Workshop on Computer Architecture
Evaluation using Commercial Workloads CAECW, Anaheim, CA, Feb 2003.
Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, and Omer
NFS over RDMA.
In Proceedings ACM SigComm Workshop on Network-I/O Convergence
(NICELI), Karlsruhe, Germany, Aug 2003.
Mallikarjun Chadalapaka, Uri Elzur, Michael Ko, Hemal Shah, and Patricia
A study of iSCSI extensions for RDMA (iSER).
In Proceedings ACM SigComm Workshop on Network-I/O Convergence
(NICELI), Karlsruhe, Germany, Aug 2003.
David D. Clark, Van Jacobson, John Romkey, and Howard Salwen.
An analysis of TCP processing overhead.
IEEE Communications Magazine, 27(6), June 1989.
Small packet traffic performance optimization for 8255x and 8254x
Technical Report Application Note (AP-453), Sept 2003.
ANVL TCP testing tool.
The Standard Performance Evaluation Corporation.
Chris Dalton, Greg Watson, David Banks, Costas Clamvokis, Aled Edwards, and
IEEE Network, 11(2):36-43, July 1993.
Peter Druschel, Larry Peterson, and Bruce Davie.
Experiences with a high-speed network adaptor: A software
In ACM SIGCOMM Symposium, London, England, August 1994.
Annie P. Foong, Thomas R. Huff, Herbert H. Hum, Jaidev P. Patwardhan, and
Greg J. Regnier.
TCP performance re-visited.
In Proceedings International Symposium on Performance Analysis
of Systems and Software ISPASS, Austin, TX, March 2003.
John L. Hennessy and David A. Patterson.
Computer Architecture: A Quantitative Approach (2nd Edition).
Morgan Kaufmann Publishers Inc., San Francisco, CA, 1995.
Red Hat Inc.
The Tux WWW server.
Philippe Joubert, Robert King, Richard Neves, Mark Russinovich, and John
High-performance memory-based Web servers: Kernel and user-space
In Proceedings of the USENIX Annual Technical Conference,
Boston, MA, June 2001.
Jonathan Kay and Joseph Pasquale.
Profiling and reducing processing overheads in TCP/IP.
IEEE/ACM Transactions on Networking, 4(6):817-828, December
Karl Kleinpaste, Peter Steenkiste, and Brian Zill.
Software support for outboard buffering and checksumming.
In ACM SIGCOMM Symposium, pages 196-205, Cambridge, MA, August
The libpcap Project.
Srihari Makineni and Ravi Iyer.
Measurement-based analysis of TCP/IP processing requirements.
In 10th International Conference on High Performance Computing
(HiPC 2003), Hyderabad, India, December 2003.
Evangelos P. Markatos.
Speeding up TCP/IP: Faster processors are not enough.
In Proceedings 21st IEEE International Performance, Computing,
and Communication Conference IPCCC, pages 341-345, Phoenix, AZ, April
Paul E. McKenney and Ken F. Dove.
Efficient demultiplexing of incoming TCP packets.
In ACM SIGCOMM Symposium, pages 269-279, Baltimore, Maryland,
August 1992. ACM.
Larry McVoy and Carl Staelin.
LMBENCH: Portable tools for performance analysis.
In USENIX Technical Conference of UNIX and Advanced Computing
Systems, San Diego, CA, January 1996.
Jeffrey C. Mogul.
Operating systems support for busy Internet servers.
In Proceedings Fifth Workshop on Hot Topics in Operating Systems
(HotOS-V), Orcas Island, WA, May 1995.
Jeffrey C. Mogul.
TCP offload is a dumb idea whose time has come.
In USENIX Workshop on Hot Topics on Operating Systems (HotOS),
Hawaii, May 2003.
Jeffrey C. Mogul and K. K. Ramakrishnan.
Eliminating receive livelock in an interrupt-driven kernel.
ACM Transactions on Computer Systems, 15(3):217-252, 1997.
David Mosberger and Tai Jin.
httperf - a tool for measuring Web server performance.
In Proceedings 1998 Workshop on Internet Server Performance
(WISP), Madison, WI, June 1998.
Erich M. Nahum, Tsipora Barzilai, and Dilip Kandlur.
Performance issues in WWW servers.
IEEE/ACM Transactions on Networking, 10(2):2-11, Feb 2002.
Jitendra Padhye and Sally Floyd.
On inferring TCP behavior.
In ACM SIGCOMM Symposium, pages 287-298, 2001.
Vijay Pai, Scott Rixner, and Hyong-Youb Kim.
Isolating the performance impacts of network interface cards through
In Proceedings ACM Sigmetrics, New York, NY, June 2004.
Vivek Pai, Peter Druschel, and Willy Zwaenepoel.
Flash: An efficient and portable Web server.
In USENIX Annual Technical Conference, Monterey, CA, June 1999.
Ian Pratt and Keir Fraser.
Arsenic: A user-accessible gigabit ethernet interface.
In Proceedings of the Conference on Computer Communications
(IEEE Infocom), Anchorage, Alaska, April 2001.
Murali Rangarajan, Aniruddha Bohra, Kalpana Banerjee, Enrique V. Carrera,
Ricardo Bianchini, and Liviu Iftode.
TCP servers: Offloading TCP processing in Internet servers,
design, implementation and performance.
Technical Report DCS-TR-481, Rutgers University, Department of
Computer Science, Piscataway, NJ, March 2003.
Greg Regnier, Dave Minturn, Gary McAlpine, Vikram Saletore, and Annie Foong.
ETA: Experience with an Intel Xeon processor as a packet processing
In 11th Annual Symposium on High Performance Interconnects,
Palo Alto, CA, August 2003.
Yaoping Ruan and Vivek Pai.
Making the "box" transparent: System call performance as a
In USENIX Annual Technical Conference, Boston, MA, June 2004.
Prasenjit Sarkar, Sandeep Uttamchandani, and Kaladhar Voruganti.
Storage over IP: When does hardware support help?
In USENIX Conference on File and Storage Technologies (FAST),
San Francisco, CA, March 2003.
H. Shafi, P.J. Bohrer, J. Phelan, C.A. Rusu, and J.L. Peterson.
Design and validation of a performance and power simulator for
IBM Journal of Research and Development, 47(5/6):641-651,
Piyush Shivam and Jeffrey S. Chase.
On the elusive benefits of protocol offload.
In ACM SigComm Workshop on Network-IO Convergence (NICELI),
Germany, August 2003.
Jonathan Stone, Michael Greenwald, Craig Partridge, and James Hughes.
Performance of checksums and CRC's over real data.
IEEE ACM Transactions on Networking, 6(5):529-543, 1998.
R. Westrelin, N. Fugier, E. Nordmark, K. Kunze, and E. Lemoine.
Studying network protocol offload with emulation: Approach and
In 12th Annual IEEE Symposium on High Performance
Interconnects, Stanford, CA, Aug 2004.