Latency Analysis of TCP on an ATM Network
 

       Alec Wolman, Geoff Voelker, and Chandramohan A. Thekkath
	    Department of Computer Science and Engineering
		       University of Washington
 

(This work was supported in part by the National Science Foundation
under Grants No.  CCR-8907666, CDA-9123308, and CCR-9200832, by the
Washington Technology Center, Apple Computer, Boeing Computer
Services, Digital Equipment Corporation, and the Hewlett-Packard
Corporation.  Chandramohan A. Thekkath was also supported by an Intel
Foundation Graduate Fellowship.)

			       Abstract

    In this paper we characterize the latency of the BSD 4.4 alpha
    implementation of TCP on an ATM network.  Latency reduction is a
    difficult task, and careful analysis is the first step towards
    reduction.  We investigate the impact of both the network controller
    and the protocol implementation on latency.  We find that a low
    latency network controller has a significant impact on the overall
    latency of TCP.  We also characterize the impact on latency of some
    widely discussed improvements to TCP, such as header prediction and
    the combination of the checksum calculation with data copying.


 Introduction 

In this paper we investigate the latency characteristics of the TCP
transport protocol on an Asynchronous Transfer Mode (ATM) network
[FORE].  The characteristics of LAN technologies have changed a great
deal in the last few years.  With faster network hardware, the
disparity between software and hardware costs is even greater.  This
increases the importance of efficient protocol implementations and
efficient operating system interfaces.  The following factors in
network communication make measuring TCP performance, especially
latency, interesting:

  * The existence of a high quality TCP software implementation: the
BSD 4.4 alpha TCP code.
 
  * The availability of low latency network interfaces: e.g., the FORE
TCA-100 ATM interface FORE .
 
  * The wide use of applications and subsystems (like RPC) that can
benefit from reduced latency.
 
Prior studies have concentrated on characterizing and optimizing the
throughput of TCP on substantially different hardware or networks than
the ones we describe here (e.g., [OSF,Kay]).  In addition to focusing
on an ATM network, we investigate how optimizations previously
suggested for improving throughput affect latency.  We believe that
studying the latency characteristics of TCP on ATM networks is
particularly interesting for two reasons.  First, ATM is an emerging
communication standard that is likely to be widely deployed.  Second,
our study allows us to answer the following questions: Can we provide
evidence that TCP is a viable option for a transport layer for RPC?
How have the changes in technology affected the results of earlier
studies (e.g., [Clark89])?  Is latency dominated by the cost of
operating system services, such as buffer management?  If so, can the
use of such services be reduced enough to make latency acceptable for
applications that require low latency?

 System Overview 

All of our experiments were run on a pair of DECstation 5000/200
workstations, which use a MIPS R3000 processor running at 25 MHz.
Each DECstation was equipped with a FORE TCA-100 ATM network interface
on the TurboChannel I/O bus.  The ATM network interface uses a memory
mapped receive FIFO that stores up to 292 53-byte ATM cells, and a
similar transmit FIFO that stores up to 36 cells.  The transmit engine
starts reading from the transmit FIFO as soon as there is one complete
cell in the FIFO.  The ATM driver and adapter implement the Class 3/4
ATM Adaptation Layer (AAL), which is responsible for all segmentation
and reassembly of datagrams and the detection of transmission errors
and dropped cells.

We used the ULTRIX 4.2A kernel as a foundation and replaced its TCP
implementation with the BSD 4.4 alpha TCP implementation.  The BSD TCP
implementation has lower latency than the ULTRIX implementation, in
part because of its use of one fewer mbuf for small packets and its
use of a protocol control block cache (a comparison of the two
implementations can be found in University of Wash. CSE Dept.  Tech.
Report 93-03-02).  Since ULTRIX 4.2A is a BSD derivative, substituting
the new BSD TCP implementation was relatively straightforward because
the two implementations have nearly identical interfaces to the rest
of the protocol stack.

 Measurement Techniques 

Many of our experiments measured the round-trip latency of two
user-level processes roughly simulating client-server communication.
Unless stated otherwise, the processes ran on otherwise idle machines
and communicated over a switchless private ATM network. The client
connected to the server using TCP, started a timer, and then
repeatedly executed the following steps: it sent _size_ bytes to the
server, and then waited to receive _size_ bytes from the server.  It
then stopped the timer and recorded its value.  For all the round-trip
measurements in this paper, we ran 40000 iterations for at least 3
repetitions and took the average to get the final result.

Our measurements used a range of packet sizes.  Based upon previous
studies of RPC and TCP traffic behavior, we chose a variety of packet
lengths sized 500 bytes and smaller [LRPC,Kay].  We also measured
packets of 1400 bytes (the Ethernet MTU minus protocol headers), 4000
bytes (fits on a single memory page including protocol headers), and
8000 bytes (fits on two memory pages including protocol headers, also
close to our ATM MTU of 9K).

Latency measurements typically involve estimates of small code paths
that take on the order of usecs.  To measure at this level of
granularity, we used a real time clock on a TurboChannel card with a
40ns period (The TurboChannel card is the AN-1 controller from DEC SRC
Autonet . Note that we did not employ the AN-1 network in this study,
only the clock on its controller).  One advantage of using this clock
is that we avoid instruction counting as a technique for estimating
the execution time of small code sections.  One disadvantage of this
approach, however, is that our measurements include cache effects.

The clock is initialized at boot time, and user-level processes gain
access to it by issuing a system call that maps the clock address into
the process's address space. Reading the clock is then just a matter
of dereferencing a pointer.  Code inside the kernel can read the clock
in a similar manner.  We also added system calls to extract timings
from the kernel to measure events that started in user space and ended
in the kernel, or vice-versa.

  Paper Outline  

The rest of this paper is organized as follows.  Section 2 summarizes
our measurements of TCP latency on the baseline system.  Sections 3
and 4 study the effect of several modifications motivated by the
results in Section 2.  These modifications are not new and have been
suggested by others to improve throughput [Clark89,Kay2]. However, our
focus here is on the effect of these modifications on latency.

 Measurement of the Baseline System 

The baseline system that we measured is the BSD 4.4 alpha TCP release
operating on an ATM network.  From our measurements, we investigate:
(1) the contribution to latency of the network driver and adapter; (2)
the cost of TCP protocol processing; and (3) the overhead of
protocol-independent operating system mechanisms.

   Effect of the Network on Latency 

To demonstrate the effects of the network driver, adapter, and
physical link on latency, we compared the round-trip times of the BSD
4.4 TCP implementation communicating over the ATM network with the
same TCP implementation communicating over Ethernet.  The results are
listed in Table[atmether].  For the small transfer sizes, the network
has a large effect on overall latency (e.g., a 919 difference in the 4
byte case).  For large transfer sizes, much of the effect can be
attributed to the lower bandwidth of the Ethernet driver, adapter, and
physical link.

  [ Table atmether ]

 Detailed Measurements of Latency 

To obtain detailed latency measurements, we instrumented the transmit
and receive sides separately.  We used the same benchmark program
described above to measure both sides. The results for the transmit
side are shown in Table [b-xmit], and the results for the receive side
are shown in Table [b-receive].

 [ Table b-xmit ]

In characterizing the latency of transmitting data using TCP, we
divided the transmit operation into four time spans.  The first span,
User, measures the time from the write system call to the beginning of
the TCP protocol implementation.  This span of time includes copying
data from user space into kernel mbufs at the socket layer.

 [ Table b-receive ]


The second span, TCP, measures the time spent doing the TCP protocol
output processing.  It consists of three components, checksum, mcopy,
and segment .  Checksum is the time spent calculating the TCP checksum
over the data and header.  Mcopy is the time spent copying data from
the socket mbufs into driver mbufs.  Segment is the remaining TCP
protocol processing time.

The third time span, IP, measures the time spent in IP output
processing, and the last span, ATM, measures the time spent in the ATM
network driver.  To obtain an accurate measurement of latency for the
last span, we only measure up to when the ATM adapter is signaled to
send the last byte of data. We do not include the time of any
operations after that because these operations are effectively
overlapped with network transmission, which is separately accounted
for.

The rows in Table [b-receive] have similar meanings.  The User time
span refers to the time from when the data leaves the TCP layer until
the time the user process runs again (except for the scheduling time,
described below).  TCP is the time spent doing the TCP input
processing, and has a similar breakdown as on the transmit side.
Note, however, that the TCP input processing does not have a mcopy row
because the extra copy operation is only used on the transmit side to
support retransmissions.  IP is the time spent doing IP input
processing, and ATM is the time spent receiving and reassembling
incoming ATM cells.

We also introduced two more time spans on the input side.  The first,
IPQ, measures the IP queue scheduling time, i.e., the time from when
the ATM driver places received data on the IP queue and signals a
software interrupt until the time the data is removed from the IP
queue.  The second, Wakeup, is the user process scheduling time, i.e.,
the time from when the user process is placed on the run queue until
the time it runs.

The nonlinear response of the ATM adapter, as observed in the ATM rows
of the receive data, is due to overlap between sending the data and
receive processing.  When the sending ATM adapter is sending a large
number of cells, the receiving ATM adapter can process the first cells
while the sending adapter is still sending the later cells.  We only
measure the portion of the receive processing that actually
contributes to the overall latency.  This is the time from the arrival
of the last group of ATM cells comprising the last TCP segment of a
data transfer to the time when the read system call returns to the
user-level process.  We use the arrival of the last group of ATM cells
comprising the last TCP segment to initiate our timings because we
know at that point that the sending adapter has finished sending all
of the data for that transmission.

The following subsections present an analysis of the data in these
tables.

   Mbuf Manipulation 

One to eight mbufs are used for transfers of less than 1 KB.  Beyond
this size, cluster mbufs are used.  Cluster mbufs are used for large
transfers because they hold 4 KB of data, the size of a memory page,
whereas normal mbufs hold only 108 bytes of data.  The measured time
to allocate and free an mbuf (independent of type) is just over 7
usecs, making the mbuf manipulation a small cost relative to
the overall cost of sending or receiving data.

The nonlinear response between the 500 and 1400 byte transfer sizes of
the User and mcopy rows of Table [b-xmit] is due to a switch in the
use of mbuf types in the ULTRIX 4.2A socket layer.  Once the data
transfer size grows above 1 KB, ULTRIX uses cluster mbufs to store
user data.

In the User row, the copy from the user buffer to the mbuf takes less
time because user data does not have to be fragmented into multiple
mbufs.

In the mcopy row, using cluster mbufs reduces latency because the
mbuf-to-mbuf copy semantics of cluster mbufs differs from normal
mbufs.  When normal mbufs are copied, the data is actually copied into
separately allocated mbufs.  However, cluster mbufs use reference
counts for copying; no storage is allocated or data copied.  Since TCP
makes a copy of the mbufs passed from the socket layer on the transmit
path, the copy for transfers larger than 1 KB takes less time than for
smaller transfers.

We note, however, that these effects are artifacts of a particular
buffer management implementation choice rather than inherent protocol
behavior.

 Checksum  

The checksum does not scale linearly with the small transfer sizes
because the checksum is done over the data and the TCP/IP header (20
bytes for TCP header + 20 bytes for IP overlay + length of TCP
options).  Also, as transfer sizes grow, the checksum calculation
begins to dominate the cost of protocol processing.  In a later
section, we discuss optimizing the checksum for better latency.

 Data Copies 

The times in three rows of the tables (User, mcopy, and ATM) include
the cost of a data copy: the User time includes copying data between
kernel space and user space; the mcopy row in Table [b-xmit] contains
the time make copies of the data for retransmissions; and the ATM row
includes the time spent copy data between the host and the device.

>From this breakdown we see that data is copied at least twice on both
sends and receives.  The copy in mcopy only occurs on sends, and is
made from the mbuf chain for retransmissions.  Eliminating the
checksum (discussed in Section [sec-elim]) opens the possibility of
eliminating these data copying costs given a network adapter that
supports DMA.  With a combined copy and TCP checksum, Clark et al.
discuss a network adapter design that eliminates the need for a second
copy [Clark89].  In a later section, we investigate how combining a
copy and checksum affects latency using the ATM adapter.

 Scheduling 

The scheduling times (the sum of the IPQ and Wakeup rows) for
switching contexts on the receive side are noticeable for small data
transfers (68 out of 1021 usecs, or 6.7% of the round trip time
for the 4 byte case), particularly when compared to the costs of mbuf
allocation and deallocation.  However, scheduling costs do not
contribute greatly to the overall latency of transferring large
messages.

 Measurement Summary 

The detailed measurements have shown the contributions to latency of
the various layers used in TCP communication.  For large packet sizes,
most of the overall processing time is spent in data copies and the
checksum calculation significantly.  For small packet sizes, the
scheduling time and the time to do the TCP processing become
noticeable when compared with the overhead of mbuf allocation and
deallocation.  However, for large transfers, the checksumming and
copying data operations dominate the round trip times.

For the TCP layer in particular, the protocol processing time can be
split into the time to perform the checksum, the time to do the copy
during transmit, and the remainder.  Although we do not further
address the issue of the data copy, we address the problem of reducing
the remaining protocol processing time using header prediction in the
next section and the problem of optimizing the checksum in a
subsequent section.

 Header Prediction 

Header prediction has often been suggested as a performance benefit
for TCP [Clark89].  There are two distinct kinds of optimizations that
are often called header prediction.  The first, involving prefilling
parts of the transport header, is a known optimization for lowering
latency Peregrine,Firefly , and is not discussed further here.  The
second technique involves exploiting traffic locality to predict the
next incoming packet to avoid the protocol control block (PCB) lookup
cost.  Others have studied using traffic locality to improve
throughput for bulk data transfer protocols [CarterBulk,PacketTrains];
we study its impact on latency.

In the BSD implementation, the TCP input processing engine keeps a
single entry cache of the most recently used PCB.  If the incoming
packet is from the same connection as the previous packet, the call to
the PCB lookup routine is avoided.  The BSD 4.4 alpha TCP also
precomputes the values it expects to find in the next incoming packet
header, and can then execute a faster processing path if the
prediction is correct.  This notion is similar to the RPC ``fast
path'' found in high performance RPC systems such as SRC RPC
[Firefly].

 [ Table hdrpredict and Figure fig-hdrpredict ]

A related issue is the organization of PCBs, so that lookup is
efficient in the case where there is a miss in the PCB cache.  The
insertion algorithm for the linked list of PCBs places the most recent
creation at the head of the list.  The lookup algorithm for the PCBs
is just a linear search through the linked list of PCBs.  McKenney and
Dove study alternative data structures for PCB lookup, and analyze
these data structures by the expected average search length
[McKenney].  However, they do not discuss how long a search of any
given length will take.  While this facilitates comparisons, it is
difficult to study the absolute effect of header prediction.

We measured the cost of a search for a variety of lengths, ranging
from 20 entries (26 usecs) to 1000 entries (1280 usecs), and found
that the results scaled linearly.  The cost per element on a
DECstation 5000/200 is just less than 1.3 usecs.  In addition, the
typical number of active PCBs appears to be quite modest.  For
example, our departmental mail server has less than 250 active PCBs,
and sampling thirty of our department workstations we found that all
had less than 50.  Given the relatively small memory requirements
(even for 1000 PCBs), it seems that a simple hash table implementation
could eliminate the lookup problem entirely.

In light of the above discussion, we decided to neglect the cost of
the lookup and analyze the overall benefit of header prediction given
that lookups are free.  We built a kernel where both the PCB cache and
the precomputation of the next incoming packet header (i.e. the TCP
fast path) were disabled.  By default, in our test environment, there
will only be a very small number of TCP connections because our
machines are only running the standard ULTRIX daemons and our test
program.

Table [hdrpredict] shows the results of this experiment, comparing a
kernel with header prediction disabled to a kernel with it enabled.
Figure [fig-hdrpredict] plots the same data graphically.  For all the
cases less than 8000 bytes, we notice only a very small improvement
with header prediction, which is basically independent of data size.
This small improvement is caused by a hit in the PCB cache, since the
header precomputation and check (TCP fast path) fails in these cases
(as explained below).  In the 8000 byte case, the larger difference
comes from the header precomputation and check succeeding for half the
received packets, as well as the hit in the PCB cache.  The savings
from the PCB cache hit are not large because the number of PCBs is
small (1.3 usecs per PCB), and the TCP connection for our test program
is likely to be near the head of the PCB list since recently created
connections go at the head of the list.  Even if there were many
connections, a hash table implementation of PCBs would yield similar
results.

The precomputation and check of the next header fails in all cases
except the 8000 byte tests, where it succeeds half the time.  In the
8000 byte case, this accounts for a small but noticeable difference.
This is because two packets are being sent in the 8000 byte case, so
the precomputation and check succeeds for the second packet.  Upon
closer inspection of the header prediction code, we discovered that
the BSD 4.4 TCP header prediction only works in the two common cases
of unidirectional data transfer.  As the sender in a unidirectional
transfer, header prediction succeeds when receiving an in-sequence
acknowledgment with no data.  As the receiver in a unidirectional
transfer, header prediction succeeds when receiving an in-sequence
data segment with no acknowledgment.  Our test code creates the common
case for a round-trip RPC style of communication where one receives
data with a piggybacked acknowledgment, and this does not arise in a
single sender, high throughput style of communication, which is what
this code has been optimized for.

To summarize our results concerning header prediction, we found that
the PCB cache accounted for a only a small improvement in latency
(about 4% on average), and that the current implementation of
header precomputation does not improve latency in a bidirectional RPC
style of communication.

 TCP Checksums 

>From the breakdown of the latency costs above, we see that, for large
transfers, the cost of calculating the TCP checksum is a significant
portion of round trip latency.  In this section we introduce an
optimized checksum algorithm and then discuss the kernel
implementation issues of combining the checksum with a data copy.  We
then address the issue of eliminating the checksum for particular
combinations of link types and applications.

 Optimizing the Checksum 

Others have noted that the ULTRIX 4.2A checksum algorithm could be
improved by eliminating halfword accesses and using loop unrolling
[Kay2].  We implemented a similar optimized checksum algorithm; the
performance of this algorithm and the ULTRIX algorithm at user level
are shown in Table [copysum].

An optimization suggested in [OSF,Clark89] combines the checksum
calculation with one of the data copies to eliminate redundant
movement of data over the memory bus.  In ULTRIX 4.2A, data is copied
at least twice on both send and receive in addition to calculating the
TCP checksum.  One copy moves the data between user and kernel space,
and the other copy moves the data between kernel and device memory.

We combined our optimized checksum algorithm with a data copy at user
level to investigate its potential performance benefits.  The results
are show in Table [copysum].  The benefits are large: in the 8000 byte
case, integrating the checksum and copy is 40% faster than performing
the operations separately, and the effective bandwidth limitation
imposed by the combined copy and checksum loop is just above 9 MB/s on
the DECstation 5000/200.

The graph in Figure [fig-copysum] shows the relative performance of
the three methods for calculating the TCP checksum and copying the
data.

	[ Table copysum and Figure fig-copysum ]

We compare the performance of our implementation of the integrated
checksum and copy with a user-level implementation on a Sun-3
described in [Clark89].  Their measurements provide an interesting
comparison of the scale in performance of a combined checksum and copy
algorithm when changing hardware platforms.  For example, with 1 KB of
data they reported 130 usecs to perform the checksum, and 140 usecs to
perform the memory to memory copy.  The cost of their combined
algorithm was 200 usecs.  On the DECstation 5000/200, our optimized
checksum takes 96 usecs to checksum 1 KB of data, and the copy takes
91 usecs.  The combined checksum and copy algorithm takes 111 usecs.
The savings from the combined algorithm on the Sun-3 is 35 usecs, and
on the DECstation 5000/200 is 68 usecs.  The overall improvement when
switching from the Sun to the DECstation is 80 usecs.

 Kernel Implementation Issues 

On the transmit side, the design of our ATM interface makes it
impossible to defer the checksum calculation until the copy from
kernel to device memory.  Recall that it uses a simple memory mapped
transmit FIFO.  As soon as a single cell has been copied into the FIFO
memory, the device begins to send it as later cells are still being
copied to the device; there is no explicit action by the device driver
to trigger the send.  To compute the checksum, one must copy all of
the data, and then write the checksum into the header of the packet.
Therefore, it is impossible to combine the checksum and copy loops at
the driver level given the FORE interface design.

Instead, we chose to integrate calculating the checksum during the
copy from user to kernel space.  The TCP checksum has the convenient
property that one can calculate the checksums for pieces of a packet
and then combine those partial checksums later.  We calculate the
checksum for each chunk of data copied into an mbuf at the socket
layer, and store the partial checksum in the mbuf header.  As long as
all of the data in the mbuf are transmitted in the same TCP segment,
then the TCP layer will not have to recalculate the checksum for that
data.  In our implementation, the socket layer chooses the amount of
data to place in each mbuf independent of the current TCP segment
size.  One possible improvement to this scheme would be for the socket
layer to predict future TCP segment sizes based on recent behavior.
Another alternative would be to split the data in an mbuf into smaller
chunks and calculate more than one checksum per mbuf, thus increasing
the chance that a chunk will be transmitted in a single segment.

On the receive side, it will be difficult to postpone the checksum
calculation until the kernel to user space copy because the protocol
processing needs to know whether or not the incoming data is corrupt.
Therefore, we have implemented the combined copy and checksum from the
device memory to kernel memory.  One disadvantage of this approach is
that the device driver for each network interface needs to be modified
to support this.  Implementing the combined copy and checksum on the
receive side is conceptually much simpler than on the send side
because all the issues of mbufs and partial checksums disappear.
However, the details of the implementation can be quite difficult and
heavily dependent on the details of the device driver.

For comparison, the Digital OSF study also dicusses implementing a
combined copy and checksum in the kernel [OSF].  It appears that their
combined copy and checksum is only used on the receive side for
incoming UDP packets.  Also, their implementation combines the
checksum with the copy from kernel to user space, rather than from
device to kernel memory.  This requires that the user enable this code
with a socket option because, in the case where the checksum fails,
they must overwrite the user buffer with zeros to clear out the data
that was just copied.

	[ Table comb-chksum ]

As an approximate measure of complexity, we added about 800 lines of
code to implement the combined copy and checksum on both send and
receive. The assembly language routines that implement the combined
copy and checksum algorithm were less than half of the total number of
lines of code, the rest was integration with the socket layer, the TCP
layer, and the ATM driver.

The performance of our initial kernel implementation of the combined
copy and checksum is shown in Figure [comb-chksum].  As expected, when
the size of the data transfers increases, the combined checksum
calculation provides significant savings in overall latency.  In the
8000 byte case, the overall improvement has reached 24%.  However, our
initial implementation incurs significant costs in the smaller length
cases, and the break-even point occurs somewhere between 500 and 1400
bytes.

 Eliminating the TCP Checksum 

The previous section has demonstrated that combining the checksum
calculation with a data copy reduces latency. However, it is clear
that latency can be further reduced by eliminating the checksum
calculation altogether.  It is already common practice to eliminate
the UDP checksum for local area NFS traffic.  Kay and Pasquale
describe a mechanism using the Alternate Checksum Option to negotiate
connections that do not use the checksum [Kay].  We therefore restrict
ourselves to an analysis of the error characteristics and the
implications of eliminating the checksum for local-area ATM traffic.
We define local-area traffic as packets that go from source host to
destination host without passing through any IP routers.

To measure the latency effect of eliminating the TCP checksum, we
compare the round-trip times of the various packet sizes with and
without the checksum calculation. Table [no-chksum] shows the results
of eliminating the checksum on round trip measurements.  The packet
sizes are in bytes, and all times are in microseconds.  The Checksum
column shows the average round-trip latency when the checksum is
calculated; No Checksum shows the average round-trip latency when the
checksum is not calculated; and Percentage Saving is the relative
saving when the checksum is eliminated.  On the 4 byte case where the
checksum overhead is minimal, nothing is gained.  But, as the packet
size increases, eliminating the TCP checksum significantly improves
communication latency, e.g., the latency of the 8000 byte case is
reduced by about 40%.

	[ Table no-chksum ]

With proper support from the host-network interface and the
processor-memory subsystem, eliminating the TCP checksum can also
benefit throughput oriented applications.  For example, having DMA
capability in the host-network interface and a snoopy cache as found
in [alpha-manual], allows data to be moved at near bus bandwidth
speeds to the application layer. In contrast, as Section [sec-optcsum]
indicates, even an integrated copy and checksum routine limits
bandwidth to about 9% of the bus bandwidth on the DECstation 5000/200.

 System Issues in Checksum Elimination 

Eliminating the TCP checksum on local-area ATM networks is a delicate
system design decision and we discuss some of the relevant
system-level issues first before discussing its performance implications.

The ``end-to-end argument'', a classic principle in system design,
says that the two ends of a reliable communication path should not
depend on any of the intervening system components for correctness
[end-to-end]. In other words, to assure the integrity of the
communicated data, the communication end points must do a check
independent of any checks done by intermediary components.  If checks
are done by intermediate or internal layers, they serve only as
potential performance optimizations and do not subsume the end-to-end
correctness check.

Intermediate checks can serve as performance optimizations, for
example, by providing an inexpensive check that detects frequently
occurring errors that would otherwise invoke a more expensive
end-to-end recovery mechanism. On the other hand, an intermediate
check can result in overall performance loss if, for example, it is
expensive to perform and detects only infrequent errors; in such a
case, even a very expensive end-to-end recovery could be preferable
since the error seldom happens and the recovery cost can therefore be
amortized.

In cases where TCP is used by a higher level service that performs its
own checks, such as RPC systems that check their arguments, there is
some debate on whether it is prudent to eliminate TCP checksums. The
original environment that TCP was developed in used low-bandwidth
links with little support for link-level error detection in hardware.
Thus, the cost of detecting an error at the application layer impacted
performance significantly.  The use of the TCP checksum was therefore
a performance optimization, consistent with the spirit of the
end-to-end argument.

However, ATM networks with fiber optic links have very low error
rates. For example, the bit error rate is on the order of errors/s
(i.e., one bit error in 3 hours if the network is used continuously at
a bandwidth of 100 Mbits/s) [Greene].  Further, standard ATM
adaptation layers (e.g., AAL3/4 and AAL5) specify end-to-end CRC
checksums on the data, and host-network interfaces implement these in
hardware.  Thus, in cases where TCP is used as an intermediate layer
to deliver data from an ATM network to a higher layer that will
perform its own data integrity checks, eliminating the TCP checksum
seems acceptable both on the grounds of performance and the end-to-end
principle.

The main purpose of layering a TCP checksum over a link-level CRC is
to potentially detect errors that the CRC does not catch. These errors
can arise from four sources: (1) errors introduced by switches in
transferring data between their input and output ports, (2) errors
introduced by the network controllers in moving data between host and
controller memories, (3) erroneous data injected into the network
through external gateways or bridges, and (4) errors introduced by the
link that are theoretically not detectable by the CRC because of the
properties of the erroneous bit pattern and the CRC.

The first source of errors is not a problem since AAL payload
checksums are end-to-end, i.e., intermediate switches do not recompute
the checksum.  The second type of errors is a potential problem, if
for example, a buggy network controller introduces errors in
transferring data between host and controller memory. We view this as
a hardware problem in the controller that can be fixed using a
hardware solution, e.g., incorporating parity or ECC into the
controller memory.

The impact of the third type of errors can be eliminated by the
application selectively using the TCP checksum elimination option only
for local-area traffic. In fact, experiments conducted with and
without wide-area traffic on our departmental Ethernet indicate that
TCP detects two orders of magnitude fewer errors than the Ethernet CRC
when wide-area traffic is included.  Without wide-area traffic, TCP
detected no checksum errors.  We expect similar behavior on a local
ATM network with quieter fibers.  The quieter fiber also reduces the
likelihood of errors from the fourth source.

We do not believe that eliminating the TCP transport entirely will be
an option because the ATM network does not guarantee freedom from cell
loss and some form of reliable transport mechanism will be required.
The cost savings of optionally eliminating the TCP checksum, though,
may be attractive to some applications.  Other applications that are
unable or unwilling to perform the necessary application-specific
checks to guard against data corruption can continue to use the
checksum at a cost.

 Summary 

To summarize, our measurements involving the TCP checksum suggest that
there are definite performance advantages in making it optional.
Further, for certain combinations of link types and applications, we
believe it is possible to eliminate the TCP checksum without
compromising error detection efficiency or violating established
system design principles.


 Conclusions 

In this paper we have investigated the latency characteristics of TCP,
and how optimizations originally proposed to improve throughput affect
latency.

In previous experience with designing high performance ``lightweight''
RPC systems, we found that network driver and adapter design have a
significant impact on performance [Thekkath].  For TCP, we also found
that the network driver, adapter, and physical link have a significant
impact on the latency.

In this paper, we first characterized the latency costs of TCP by
breaking down round trip times for data transfers ranging in size from
4 bytes to 8000 bytes.  We found that operating system services such
as memory allocation contributed little to protocol processing time,
particularly compared to the cost of scheduling.  We also found that
data-touching operations, such as copying and checksumming, dominate
latency for transfers larger then 200 bytes.

We also found that header prediction had only a small impact on
latency, primarily because the TCP fast path for input processing is
not used for the round trip style of communication we measured.

We have found that computing the TCP checksum is a major cost of the
overall TCP processing, and have characterized the effect of two
optimizations to the checksum process.  The first is a optimized
implementation of the checksum algorithm that uses loop unrolling.
The second optimization combines the optimized checksum calculation
with copying the data.  For large transfer sizes, the combined
checksum and copy mechanism decreases round trip latency by as much as
24%.

In addition to optimizing the TCP checksum process, we have argued
that, for particular combinations of link types and applications, it
is possible to eliminate the TCP checksum calculation.  Once the
checksum is eliminated, round trip latency improves by as much as 41%.

 Acknowledgements 

We would like to gratefully acknowledge Ed Lazowska for his
encouragement and comments on this project and report, as well as the
helpful suggestions of the program committee.  We also wish to thank
Brian Bershad for his diligent shepherding that added greatly to the
clarity of the paper.


Bibliography

[LRPC] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and
Henry M. Levy. ``Lightweight Remote Procedure Call.''  In ACM
Transactions on Computer Systems ,8(1):37--55, February, 1990.

[CarterBulk] John B. Carter and Willy Zwaenepoel.  ``Optimistic
Implementation of Bulk Data Transfer.''  In Proceedings of the 1989
ACM SIGMETRICS Conference on Measurement and Modeling of Computer
Systems , May 1989, pp. 61--69.

[OSF] Chran-Ham Chang, Dick Flower, John Forecast, Heather Gray, Bill
Hawe, Ashok Nadkarni, K. K. Ramakrishna, Uttam Shikarpur and Kathy
Wilde.  ``High-Performance TCP/IP and UDP/IP Networking in DEC OSF/1
for Alpha AXP.''  Digital Technical Journal , Vol. 5 No. 1, Winter
1993, pp. 44--61.

[Clark89] David D. Clark, Van Jacobson, John Romkey, and Howard
Salwen.  ``An Analysis of TCP Processing Overhead.''  IEEE
Communications Magazine, June 1989, 23--39.

[alpha-manual] Digital Equipment Corporation, Maynard, MA.  Alpha
Architecture Reference Manual , 1992.

[FORE] FORE Systems.  TCA-100 TURBOchannel ATM Computer Interface,
User's Manual , 1992.

[Greene] Daniel H. Greene and J. Bryan Lyles. ``Reliability of
Adaptation Layers'' Protocols for High-Speed Networks, III , Elsevier
Science Publishers B.V., 1993, pp. 185--200.

[Kay] Jonathan Kay and Joseph Pasquale. ``A Performance Analysis of
TCP/IP and UDP/IP Networking Software for the DECstation 5000'' Tech
Report, CSL U.C. San Diego/Sequoia , December 1992.

[Kay2] Jonathan Kay and Joseph Pasquale. ``Measurement, Analysis, and
Improvement of UDP/IP Throughput for the DECstation 5000'' Tech
Report, CSL U.C. San Diego/Sequoia , January 1993.

[Jacobson92] Van Jacobson, Robert Braden, and David Borman.  ``TCP
Extensions for High Performance.'' RFC 1323, LBL, USC/ISI, and Cray
Research, May 1992.

[Peregrine] David B. Johnson and Willy Zwaenepoel.  The Peregrine
high-performance RPC system.  Software -- Practice and Experience ,
23(2):201--221, February 1993.

[McKenney] Paul E. McKenney and Ken F. Dove. ``Efficient
Demultiplexing of Incoming TCP Packets.''  In Proceedings of SIGCOMM
'92 , August 1992, pp. 269--79.

[Ouster] John K. Ousterhout.  ``Why Aren't Operating Systems Getting
Faster As Fast as Hardware?'' In Proceedings of the USENIX 1990 Summer
Conference , June 1990, pp. 247--256.

[end-to-end] Jerome H. Saltzer, David P. Reed, and David D. Clark.
End-to-end arguments in system design.  ACM Transactions on Computer
Systems , 2(4):277--288, November 1984.

[Firefly] Michael D. Schroeder and Michael Burrows.``Performance of
Firefly RPC.''  ACM Transactions on Computer Systems , 8(1):1--17,
February 1990.

[Autonet] Michael D. Schroeder, Andrew D. Birrell, Michael Burrows,
Hal Murray, Roger M.  Needham, Thomas L. Rodeheffer, Edwin H.
Satterthwaite, and Charles P.  Thacker.  Autonet: A high-speed,
self-configuring local area network using point-to-point links.  IEEE
Journal on Selected Areas in Communications , 9(8):1318--1335, October
1991.

[PacketTrains] Cheng Song and Lawrence Landweber.  ``Optimizing Bulk
Data Transfer Performance: A Packet Train Approach.''  In Proceedings
of SIGCOMM '88 , September 1988, pp. 134--144.

[Thekkath] Chandramohan A. Thekkath and Henry M. Levy.  ``Limits to
Low-Latency Communication on High-Speed Networks.''  ACM Transactions
on Computer Systems ,11(2):179--203, May 1993.


   Alec Wolman is a graduate student at the University of Washington,
currently on leave from Digital Equipment Corporation's Cambridge
Research Lab.  He holds an A.B. in Computer Science from Harvard
University.  His electronic mail address is wolman@cs.washington.edu.

   Geoff Voelker is a graduate student at the University of
Washington.  He received the B.S. in Electrical Engineering and
Computer Science from the University of California at Berkeley in
1992.  His electronic mail address is voelker@cs.washington.edu.

   Chandramohan A. Thekkath is a candidate for the Ph.D. degree in the
Department of Computer Science Engineering at the University of
Washington. His electronic mail address is thekkath@cs.washington.edu.