4th USENIX Windows Systems Symposium Paper 2000   
Pp. 113–124 of the Proceedings
WSDLite: A Lightweight Alternative to Windows Sockets Direct Path‡
Abstract
This paper describes WSDLite, a thin software layer that maps a useful subset of the WinSock2 API onto a system area network. The development of WSDLite was motivated by our experience with an early version of Windows Sockets Direct Path (WSDP). WSDP was developed by Microsoft to allow unmodified network applications to exploit the performance and reliability advantages of System Area Networks (SANs). This is accomplished through the use of a software “switch” that, when appropriate, redirects message traffic through the SAN provider protocol stack instead of the standard TCP/IP protocol stack. In addition to the performance advantages, the WSDP architecture offers several other benefits, including automatic support for legacy code, a single well-known API for supporting many different underlying SAN network protocols, and substantially simpler buffer management than that required by the native SAN API. The beta version of WSDP that we examined did not perform as well as expected, achieving only 26% of the native SAN throughput on the system studied. In an effort to determine whether or not this performance difference was intrinsic, we developed WSDLite, a simple alternative to WSDP. WSDLite is a user-level runtime library that implements a small but commonly used subset of the WinSock2 API. For those applications that do not require full WinSock2 functionality, WSDLite provides both the transparency of WSDP and much of the performance benefit of the underlying SAN architecture. In low-level network tests, WSDLite achieves an average of 70% of the native SAN performance. In this paper we describe the design of WSDLite, and present results comparing the performance of both parallel applications and low-level benchmarks using WSDLite, WSDP, TCP, and a native SAN programming library API as the network programming layer.
1. Introduction
System area networks (SANs) are
characterized by high bandwidth; low latency (on the order of 10μsec or
less for zero-length messages); a switched network environment; reliable
transport service implemented directly in hardware; no kernel intervention to
send and receive messages; and little or no copying on either the sending or
receiving side. SANs may be used for
enterprise applications such as databases, web servers, reservation systems,
and small to medium scale parallel computing environments. System area networks have not yet
enjoyed wide adoption, in part because of the difficulty associated with
writing applications to take advantage of network programming libraries that
generally ship with SAN hardware. In order to provide low latency, zero or
single-copy messaging between nodes in a SAN, programmers must address a
variety of buffer management and flow control issues not typically associated
with TCP/IP-style network programming.
These issues stem primarily from the use of DMA between the network
interface card and the host memory, a process that allows system area networks
to provide orders-of-magnitude lower latencies and lower processor utilizations
than previous network architectures and protocols. Addressing these requirements can represent a significant burden,
not only to programmers developing new applications, but also to those who wish
to obtain the benefits of system area networks for the many millions of lines
of existing network application code. To address these concerns,
Microsoft, working with SAN implementers, has developed an alternative that
will allow network applications to obtain many of the performance benefits
associated with system area networks while retaining the familiar programming
interface of Berkeley-style sockets in the WinSock2 API. This technology, called Windows Sockets Direct Path (WSDP) [4], fits immediately below the network application and
routes network communication calls to either the standard TCP/IP protocol stack
or to the WinSock SAN Provider stack, which utilizes the SAN’s native network
communication mechanism to achieve low latency, high throughput messaging. One of the principal benefits of WSDP is
that existing WinSock2-compliant applications do not have to be rewritten, or
even recompiled. Currently, WSDP is restricted
to use with the Data Center version of the Windows 2000 operating system. WSDP necessarily implements the
entire WinSock2 API, and as a result, incurs overhead costs associated with
providing full functionality. In the
beta version of WSDP that we have examined, this overhead is quite
substantial. While we expect release
versions of WSDP software to exhibit better performance than the current beta
version, we also believe that there are attractive design alternatives for
those applications that do not require full WinSock2 functionality. This paper explores one such alternative. We have implemented WSDLite, a
protocol layer that implements a subset of the WinSock2 API on top of the raw
programming interface provided by the GigaNet cLAN implementation of the Virtual Interface Architecture (VIA) [3]. The VI
Architecture is the proposed standard for user-level networks developed by
Microsoft, Compaq, and Intel. The cLAN architecture provides 9μsec latency
for zero-byte messages in our system area network environment when using the VI
Programming Library (VIPL) API.
WSDLite, similar to WSDP, allows programs written to use TCP/IP to
obtain the performance benefits associated with an underlying network
architecture that supports VIA. We make
use of the Detours [9] binary rewriting software package to intercept the
TCP calls implemented by WSDLite and route them to the WSDLite implementation
of these functions, while forwarding TCP calls not implemented within WSDLite
to the standard WinSock2 protocol stack. Detours allows us to run
WinSock2-compliant applications without recompilation. Unlike WSDP, however,
WSDLite only implements a subset (approximately 10%) of WinSock2 functions. The
functions implemented were chosen based upon their common use in a variety of
software available at our site. A
lighter-weight protocol layer such as WSDLite can provide substantial
performance benefit relative to full-functioned protocol layers for
applications that do not need the full TCP/IP functionality provided by
WSDP. Additionally, WSDLite can be used
on any Windows NT or Windows 2000 system for which VIA support is available; it
is not restricted to Windows 2000 Data Center.
We have successfully tested WSDLite on clusters comprised of Windows NT
4.0 workstations and servers, Windows 2000 Professional Workstations, and
Windows 2000 Data Center Servers.
Simple network latency tests show WSDLite to be an average of 59% faster
than the beta WSDP implementation across all message sizes up to 32 Kbytes. We examine the performance of
WSDLite using several network benchmark programs. First, we compare the
performance of a series of low-level benchmarks with (1) TCP/IP using WinSock
only, (2) TCP/IP using WSDP, (3) TCP/IP using WSDLite, and (4) a version
written to use the native VIPL API. For
each of the low-level benchmarks, we report roundtrip latency and network
throughput. We also report processor utilization,
as well as throughput per CPU second, which brings into focus the tradeoff
between network and application performance.
We next examine the overhead associated with the use of the Detours [9] library to provide WinSock2 transparency. Finally, we use the same four network layer
implementations as the messaging layer for the Brazos Parallel Programming
Library. By running a set of parallel applications utilizing Brazos, we can
evaluate the performance of each network alternative on real applications. The rest of this paper is
organized as follows. Section 2
provides a brief overview of the Virtual Interface Architecture in order to
provide the context for the discussion of Windows Sockets Direct Path in
Section 3. Section 4 describes the
design and implementation of
WSDLite. In Section 5 we report
the results of our experimental comparison of WSDLite and WSDP. Related work is described in Section 6. We conclude and discuss future work in
Section 7.
2. Overview of the VI Architecture
Although Windows Sockets Direct
Path is designed to work with a variety of system area network architectures,
we are only aware of current WSDP support in the context of the Virtual
Interface Architecture. In this
section, we present an overview of the VI Architecture as implemented on the
GigaNet cLAN GNN1000 network interface card.
Figure 1 depicts the organization
of the Virtual Interface Architecture.
The VI Architecture is comprised of four basic components: Virtual Interfaces,
Completion Queues, VI Providers, and VI Consumers. The VI Provider consists of
the VI Network Adapter and a Kernel Agent device driver. The VI Consumer is
composed of an application program and an operating system communication
facility such as MPI or sockets, although some “VI-aware” applications
communicate directly with the VI Provider API. After connection setup by the
Kernel Agent, all network actions occur without kernel intervention. This results in significantly lower
latencies than network protocols such as TCP/IP. Traps into kernel mode are only required for creation/destruction
of VI’s, VI connection setup and teardown, interrupt processing, registration
of system memory used by the VI NIC, and error handling. VI Consumers access
the Kernel Agent using standard operating system mechanisms. A VI consists of a Send Queue and
a Receive Queue. VI Consumers post requests (Descriptors) on these queues to
send or receive data. Descriptors
contain all of the information that the VI Provider needs to process the
request, including pointers to data buffers. VI Providers asynchronously
process the posted Descriptors and mark them when completed. VI Consumers
remove completed Descriptors from the Send and Receive Queues and reuse them
for subsequent requests. Both the Send
and Receive Queues have an associated “Doorbell” that is used to notify the VI
network adapter that a new Descriptor has been posted to either the Send or
Receive Queue. The Doorbell is directly implemented on the VI Network Adapter
and no kernel intervention is required to perform this signaling. The Completion Queue allows the VI Consumer
to combine the notification of Descriptor completions of multiple VI’s without
requiring an interrupt or kernel call.
2.1. Memory Registration
In order to eliminate the copying
between kernel and user buffers that accounts for a large portion of the
overhead associated with traditional network protocol stacks, the VI
Architecture requires the VI Consumer to register all send and receive memory
buffers with the VI Provider. This
registration process locks down the appropriate pages in memory, which allows
for direct DMA operations into user memory by the VI hardware, without the
possibility of an intervening page fault.
After locking the buffer memory pages in physical memory, the virtual to
physical mapping and an opaque handle for each memory region registered are
provided to the VI Adapter. Memory
registration allows the VI Consumer to reuse registered memory buffers, thereby
avoiding duplication of locking and translation operations. Memory registration also takes page-locking
overhead out of the performance-critical data transfer path.
2.2. Data Transfer Modes
The VI Architecture provides two
different modes of data transfer: traditional send and receive semantics, and direct
reads and writes to and from the memory of remote machines. Remote data reads and writes provide a
mechanism for a process to send data to another node or retrieve data from
another node, without any action on the part of the remote node (other than VI
connection). The send/receive model of
the VI Architecture follows the common approach to transferring data between
two endpoints, except that all send and receive operations complete
asynchronously. The VI Consumers on
both the sending and receiving nodes specify the location of the data. On the
sending side, the sending process specifies the memory regions that contain the
data to be sent. On the receiving side, the receiving process specifies the
memory regions where the data will be placed.
The VI Consumer at the receiving end must post a Descriptor to the
Receive Queue of a VI before the data is sent. The VI Consumer at the sending
end can then post the message to the corresponding VI’s Send Queue.
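The ordering requirement above (the receiver posts a Descriptor before the sender posts the message) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names below are hypothetical, the model does not use the VIPL API, and it ignores Doorbells, Completion Queues, and asynchronous completion.

```python
from collections import deque


class VirtualInterface:
    """Toy model of a VI endpoint; only the Receive Queue is modeled explicitly."""

    def __init__(self):
        self.recv_queue = deque()  # receive Descriptors, oldest first
        self.peer = None

    def connect(self, other):
        """Pair two endpoints, standing in for VI connection setup."""
        self.peer = other
        other.peer = self

    def post_recv(self, buffer):
        """Post a receive Descriptor pointing at a caller-owned bytearray."""
        self.recv_queue.append(buffer)

    def post_send(self, data):
        """Place data into the peer's oldest posted receive buffer.

        Raises RuntimeError when the peer has posted no Descriptor, because
        this model has no kernel buffer to fall back on.
        """
        if self.peer is None:
            raise RuntimeError("VI is not connected")
        if not self.peer.recv_queue:
            raise RuntimeError("no receive Descriptor posted on the peer VI")
        if len(data) > len(self.peer.recv_queue[0]):
            raise ValueError("message larger than the posted receive buffer")
        buffer = self.peer.recv_queue.popleft()  # Descriptor is consumed
        buffer[:len(data)] = data
        return len(data)


sender, receiver = VirtualInterface(), VirtualInterface()
sender.connect(receiver)
landing = bytearray(16)
receiver.post_recv(landing)          # must happen before the send
sent = sender.post_send(b"hello")
print(bytes(landing[:sent]))         # prints b'hello'
```

A second send without another post_recv() raises in this model, which is the situation that a sockets layer on top of VI has to prevent through its own buffer management and flow control.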
3. Windows Sockets Direct Path
Windows Sockets Direct Path
(WSDP) allows programs written for TCP/IP to transparently realize the
performance advantages of user-level networks such as VIA. Programs developed to the WinSock2 API do
not have to be rewritten to take advantage of changes in underlying network
architecture to a SAN, nor is recompilation of these programs necessary. This enables legacy network code to work
“out of the box” and enjoy at least some benefit of the low message latency
associated with SANs. Although WSDP is
designed to work with a variety of low-latency SAN architectures, we restrict
our discussion here to how WSDP interacts with the cLAN VIA architecture
described in Section 2. WSDP removes many of the pedantic
tasks that must be addressed by programs that directly access the VIPL
API. These include memory registration,
certain aspects of buffer management, and the effort required to port and
recompile a sockets-compliant application to use the VIPL API. In the following sections we describe the
basic technology associated with WSDP as well as some programming
considerations that must be addressed to use WSDP effectively. Figure 2 depicts a block diagram of the WSDP architecture. The key component of the WSDP architecture is the software switch, which is responsible for routing network operations initiated by WinSock2 API calls to either the standard TCP/IP protocol stack, or to the vendor-supplied SAN WS Provider. In addition to providing access to both of these pathways to the network on an operation-by-operation basis, the switch provides several important functions through the use of a lightweight session executed on top of the SAN provider. This session provides OOB (out of band) support, flow control, and support for the select operation. None of these mechanisms are traditionally provided by a typical SAN architecture. There are several operations that require the support of the TCP/IP protocol stack (i.e., do not use WSDP), including:
· Connections to remote subnets.
· Socket creation.
· Raw sockets and UDP sockets - Because SANs support connection-oriented reliable communication, all connectionless and uncontrolled communication must be handled by the TCP/IP protocol stack. This limits the applicability of WSDP to those applications that (a) use TCP, and (b) do not make use of group communication.
In addition to these restrictions
on the use of WSDP, system calls are required to complete most overlapped I/O
calls, increasing the latency of these calls due to induced operating system
overhead. The switch component is also responsible
for taking care of several programming details that usually must be addressed
by the programmer writing directly to the programming library supplied with
SANs. A brief discussion of these
details follows:
· Buffer registration – As discussed in Section 2.1, buffer space used for messaging must be registered with a SAN provider in order to allow direct DMA into and out of host memory by the NIC. However, there is no provision for this functionality in the WinSock2 specification, as the operating system handles message buffering through copying in a standard WinSock environment. Therefore, the switch component is responsible for ensuring that all buffer regions used for communication are registered with the SAN provider prior to use.
· Buffer placement – Another buffer management issue in a system area network is that a buffer must be posted to a network endpoint prior to receipt of an incoming message. This is again related to the use of DMA between the network interface card and the host memory and the lack of flow control associated with SAN NICs. The switch software pre-posts small buffers to each connection opened through the WS SAN Provider in order to handle incoming messages.
· Support for RDMA – Most system area networks include support for remote memory operations, allowing a host node to write data directly to, and/or read data directly from, a remote node’s address space. No such API exists in the WinSock2 specification. WSDP makes use of the remote write capability of the cLAN architecture in a manner similar to that of WSDLite, as discussed in the next section.
4. WSDLite
WSDLite implements approximately
10% of the WinSock2 API. The following
functions are currently implemented by WSDLite: WSAStartup(), WSACleanup(),
WSASocket(), socket(), connect(), listen(), accept(),
bind(), send(), WSASend(), recv(), WSARecv(),
select(), closesocket(), and WSAGetLastError(). When an application calls a
function supported by WSDLite, the function call is intercepted by the Detours
[9] runtime library and redirected to the version of the function implemented
by WSDLite. In order to leverage functionality existing in the WinSock TCP/IP
protocol stack that is not directly related to messaging performance (such as
connection procedures and name resolution), some of the WSDLite functions make
calls to their WinSock counterparts from within the WSDLite library. For instance, during connection procedures,
the WSDLite implementation of bind() calls the WinSock2 version of bind()
internally to check for errors such as two sockets being bound to the same
port. In fact, WSDLite duplicates the
entire connection process internally on the default TCP/IP protocol stack in
order to catch such errors, greatly reducing the code size of the WSDLite
implementation.
4.1. Sending Data in WSDLite
When a message is to be sent on a connected pair of sockets, the WSDLite implementation of WSASend() or send() first must register the buffer containing the data to be sent, if it is not already registered with the cLAN NIC.
Memory Registration Issues
Registering memory is an
expensive operation for two reasons.
First, registering and deregistering memory on each network access would
add unacceptable latency to network operations, especially for small messages.
We measured the cost of registering memory for buffer sizes up to 32 Kbytes,
and found that it takes roughly 15 μsec to register and deregister a region
of memory with the VI Provider, regardless of buffer size within this range. This time increases linearly with buffer
size once the size exceeds the 64K segment size used by the NT virtual memory
manager. To address this issue, WSDLite
maintains a hash table of address ranges that have been used as messaging
buffers previously, and this table is consulted before a message can be
sent. There are three possible outcomes
from the initial hash table lookup:
To reduce the amount of registering that must be performed by WSDLite, it is important for application programmers to reuse buffers as much as possible. The second source of overhead
associated with memory registration results from the fact that a part of the
memory registration process involves pinning messaging buffers into physical
memory, which may reduce the resources available for applications. To address
this problem, WSDLite employs a simple garbage collection scheme based on
timestamps to reclaim unused message buffer space before the amount of pinned
RAM impacts application performance.
Choosing the Correct Send Semantic
We have found that minimum latency for messages may be obtained in one of two ways, depending on the size of the message. For small messages, the best performance is achieved by copying data out of temporary receive buffers into the application buffers posted by the corresponding receive operation. For larger messages, lower latency can be achieved by taking advantage of VIA’s RDMA capability. When a large message is to be sent, the sending process first sends a setup message to the receiver. This message contains the length of the message to be sent. The receiver registers the memory region to be received into (if it is not already available), and then returns the virtual address and memory region handle to the sending process. The sending process then remote-writes the data directly into the address space of the receiving process and, when the operation has completed, sends the receiver a completion message containing the size of the message written. The message size at which WSDLite switches from memory copying to RDMA depends on the speed of the host processors, the efficiency of the memory hierarchy, and the latency of network operations.
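The dependence of the switch-over size on copy speed and message latency can be made concrete with a short sketch. It is an illustration under stated assumptions, not the WSDLite implementation: the function names are hypothetical, the copy test uses Python bytearrays rather than registered message buffers, and the cost of the extra RDMA control messages is passed in as a number instead of being measured on a SAN.

```python
import time


def measure_copy_time(size, repeats=200):
    """Return the average time in seconds to copy a buffer of `size` bytes."""
    src = bytearray(size)
    dst = bytearray(size)
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src
    return (time.perf_counter() - start) / repeats


def find_cutoff(copy_times, control_cost):
    """Pick the smallest measured size at which copying costs more than RDMA setup.

    copy_times: dict mapping message size in bytes to copy time in seconds.
    control_cost: time in seconds for the extra control messages RDMA needs.
    Returns None when copying is cheaper at every measured size.
    """
    for size in sorted(copy_times):
        if copy_times[size] > control_cost:
            return size
    return None


def choose_send_mode(message_size, cutoff):
    """Return "copy" below the cutoff and "rdma" at or above it."""
    if cutoff is None or message_size < cutoff:
        return "copy"
    return "rdma"


# Hypothetical timings for illustration only; not measurements from this paper.
copy_times = {1024: 1e-6, 8192: 8e-6, 16384: 16e-6}
cutoff = find_cutoff(copy_times, control_cost=10e-6)
print(cutoff, choose_send_mode(4096, cutoff), choose_send_mode(32768, cutoff))
# prints 16384 copy rdma
```

Because the sketch only compares measured sizes, the cutoff it reports is the first sampled size past the true crossover; sampling more sizes narrows that gap.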
Figure 3. Bandwidth Crossover Point
The crossover point can be clearly seen in Figure 3, which shows the sustainable bandwidth of WSDLite when copying is always used regardless of message size (labeled memcpy()), and when RDMA is always used. In the case of our system, the crossover point occurs between 8K and 16K. More precise measurements pinpoint it at 11.9 Kbytes. In general, if copying a memory region of size n takes less time than the two additional small messages necessary for the RDMA transfer, memory copying will achieve better performance. Because this value is likely to be different on different machines, WSDLite attempts to automatically determine the optimum value for this cutoff the very first time a socket is created. When the first socket on a machine is created, a small test is run that measures the time to copy regions of memory of varying sizes. When a connection is first made to a remote machine, a test to determine the latency of message sizes corresponding to the setup and acknowledgement messages required for RDMA transfer is also run. The cutoff point for this particular machine can then be determined, and this value is stored in a registry entry that is consulted each time an application makes a connection through WSDLite to a specific remote machine. This step only occurs once during the connection to a remote machine. Subsequent network programs that connect to the remote machine can simply retrieve the cutoff value from the registry based on the remote machine to which the connection is being made. The registry value may be deleted by an administrator at any time to force a recalculation of this parameter, or overridden manually.
4.2. Choice of WSDLite Functions
Finally, we conclude this section
with a brief discussion on the functions that we chose to implement in
WSDLite. We implemented only those
calls that provide the network functionality required by our suite of network
programs used for this evaluation. We believe these to be representative of a
larger class of network applications that only use basic TCP
functionality. By keeping the number of
functions small, and the implementation thin, we are able to realize a high
percentage of the performance available from the SAN. Many other WinSock2 functions
could easily be added to the WSDLite implementation by using our initial
functions as a starting point. The
downside to our strictly user-level approach is that a different version of
WSDLite must be used for each SAN network programming library. However, precisely because we have kept the
number of functions both small and basic, this is not a difficult thing to
do. The approach taken by WSDP, on the
other hand, is one of providing full functionality regardless of the underlying
SAN network. This implies that 1) many
functions, whose implementations may not easily map to the SAN programming API,
will have high overhead; and 2) another level of indirection must exist between
the switch software provided by Microsoft and the hardware vendor-provided SAN
layer. These two observations
necessitate an implementation with higher overhead than a simple user-level
library such as WSDLite. Therefore, WSDLite is proposed as a performance
alternative to WSDP in certain situations, not a replacement for applications
requiring full TCP functionality.
5. Experimental Results
In this section we begin by describing our experimental platform. We then
present results comparing several important low-level network performance
measurements run under WSDP, WSDLite, TCP, and VIPL on two uniprocessor
nodes. Next, we discuss these same
measurements when SMP nodes are used. Finally, we conclude the section with
results showing the performance of four scientific parallel applications using
the four network layer alternatives when run on a larger cluster of SMP
servers.
5.1. SAN Configuration
All experiments were performed
using a cluster of Compaq Proliant 6400 servers running the Beta 2 release of
Windows 2000 Data Center Server, build 2195.
Each machine contains one to four 500 MHz Pentium-III processors, 512
Mbytes of SDRAM, and dual 64-bit PCI busses running at 66 MHz. The interconnection network is implemented
with a single GigaNet GNN1000 NIC in each machine connected via a GNX5000 switch.
The switch cut-through latency is 580 ns.
The unidirectional latency for a zero-byte message on this system is 9
μsec, and the peak sustainable bandwidth that we have observed is 102
Mbytes/sec.
5.2. Low Level Results
In this section we compare the performance
of a message ping-pong test that simply sends messages between two nodes in the
cluster. Each node waits for a reply
before sending the next message. We
compare the performance of this test when using WSDLite, the TCP/IP protocol
stack shipped with Windows 2000, WSDP, and the same test written directly to
the VIPL API. Note that the first three
tests are the same executable; no modifications were necessary when using
WSDLite or WSDP to take advantage of the underlying VI hardware. We examine the performance of each of these
schemes for message sizes up to 32Kbytes with respect to roundtrip latency,
peak sustainable bandwidth, processor utilization, and Mbytes/CPU-second. Finally, we look at the overhead associated
with using the Detours [9] package to provide transparent access to WSDLite
through the WinSock2 API. Results in this section have been obtained with a
single processor in each of the two machines being used. The results of making
the same measurements with four processors in each machine are discussed in
Section 5.3. Figures 4 and 5 show the performance of our ping-pong test as measured by roundtrip latency and peak sustainable bandwidth for message sizes from 1 byte to 32 Kbytes. With a single processor in each system, we see that the latency of WSDLite is on average only 19.2% higher than that of native VIPL across all message sizes. The differences between WSDLite and VIPL stem from the extra overhead on each network call of traversing through the TCP-to-VIPL translation layer, the overhead associated with trapping WinSock2 calls using Detours, and the buffer management and flow control that WSDLite must implement. As expected, TCP performs poorly on latency and peak bandwidth measurements with respect to either WSDLite or VIPL. WSDP performs similarly to TCP, but actually has higher latency at all message sizes and averages 28.8% higher than TCP. The performance of WSDP lags that of WSDLite by an average of 67.9% for all message sizes. This performance advantage of WSDLite is slightly higher at smaller message sizes, with a 69.5% improvement for single-byte messages and a 59.1% improvement for 32Kbyte messages.
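The structure of the ping-pong test described at the start of this section (each side waits for a reply before sending the next message) can be sketched as follows. This is a minimal illustration, not the benchmark used in the paper: it runs both endpoints in one process over a local socket pair from the Python standard library, so the times it reports say nothing about TCP, WSDP, WSDLite, or VIPL on a SAN. The function names are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes, since stream sockets may return partial data."""
    chunks = []
    remaining = size
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo(sock, size, rounds):
    """Reply to each incoming message with a message of the same size."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))


def ping_pong(size, rounds=100):
    """Return the mean roundtrip time in seconds for `size`-byte messages."""
    left, right = socket.socketpair()
    worker = threading.Thread(target=echo, args=(right, size, rounds))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(rounds):
        left.sendall(payload)      # send one message ...
        recv_exact(left, size)     # ... and wait for the full reply
    elapsed = time.perf_counter() - start
    worker.join()
    left.close()
    right.close()
    return elapsed / rounds


for size in (1, 1024, 32 * 1024):
    print(size, ping_pong(size, rounds=50))
```

Roundtrip latency is the value returned; a bandwidth figure for a given size follows from dividing the bytes moved per roundtrip by that time.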
Figure 4. Roundtrip Latency
Figure 5. Peak Sustainable Bandwidth
Figure 5 shows that the bandwidth
of TCP and WSDP peaks at around 30-35 Mbytes/sec, whereas VIPL
achieves nearly 80 Mbytes/sec, and WSDLite around 72 Mbytes/sec. The
performance of WSDLite is restricted below the 16 Kbyte message size from
additional copying out of the pre-posted receive buffers, and from the extra
setup and acknowledgement messages necessary to implement the RDMA transfer at
16 and 32 Kbyte message sizes. However,
these overheads still allow WSDLite to perform within 22% of VIPL. The
significantly higher overheads of WSDP, caused by multiple software layers and
polling between these layers, result in performance that is worse than just
using TCP directly, regardless of message size.
Figure 6 shows the average processor utilization for the uniprocessor execution of our ping benchmark. For small messages, VIPL has a much higher processor utilization than any of the other three implementations, resulting from a time compression effect due to the small amount of time the message requires “on the wire”, and the small fixed costs due to the low overhead of the network protocol. WSDLite and TCP display similar utilizations at small message sizes due to their higher fixed-cost overhead relative to VIPL. WSDP shows the lowest overall utilization for message sizes less than 1K. All implementations that use VI in some layer (WSDP, WSDLite, and VIPL) show low processor utilizations at large message sizes due to the fact that large messages require relatively long DMA times to transfer the message to the NIC hardware, during which time the processor is idle. TCP, on the other hand, buffers and copies messages internally, keeping the utilization high throughout the entire range of message sizes.
Figure 6. Processor Utilization
The data presented in Figure 6 is misleading, seeming to indicate that WSDP is the most efficient protocol because the processor utilization is lower at smaller message sizes, and the VI Architecture was designed to maximize the performance of small messages [4]. By dividing the peak bandwidth achieved (as presented in Figure 5) by the processor utilization necessary to sustain this bandwidth (as shown in Figure 6), we can track the relative efficiency of a particular network protocol or architecture and find out how much processing time is required to send a fixed amount of data. Figure 7 shows this measurement for the ping test using TCP, WSDP, WSDLite, and VIPL, and is expressed in Mbytes/CPU-second. With only a single processor, TCP and WSDP perform particularly poorly using this metric at small message sizes.
The relatively low processor utilization displayed by WSDP in Figure 6 is offset by the extremely low network throughput shown in Figure 5, causing WSDP’s performance to nearly mirror that of TCP for message sizes below 8Kbytes.
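The Mbytes/CPU-second metric used above is a simple ratio, sketched here to show why low utilization alone does not imply efficiency. The function name and the sample inputs are hypothetical and are not measurements from this paper; utilization is assumed to be expressed as a fraction between 0 and 1.

```python
def mbytes_per_cpu_second(bandwidth_mbytes_per_sec, cpu_utilization):
    """Divide sustained bandwidth by the CPU fraction needed to sustain it."""
    if not 0.0 < cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be a fraction in (0, 1]")
    return bandwidth_mbytes_per_sec / cpu_utilization


# Hypothetical inputs for illustration only; not measurements from this paper.
low_throughput = mbytes_per_cpu_second(30.0, 0.5)
high_throughput = mbytes_per_cpu_second(72.0, 0.25)
print(low_throughput, high_throughput)  # prints 60.0 288.0
```

A layer that keeps the processor mostly idle but also moves little data scores poorly on this ratio, which is the effect described for WSDP at small message sizes.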