
3 Fast Sockets Design

 

Fast Sockets is an implementation of the Sockets API that provides high-performance communication and interoperability with existing programs. It achieves high performance through a low-overhead protocol layered on top of a low-overhead transport mechanism (Active Messages). It achieves interoperability by supporting most of the Sockets API and by transparently using existing protocols to communicate with non-Fast Sockets programs. In this section, we describe the design decisions behind Fast Sockets and their consequent trade-offs.

3.1 Built For The Local Area

Fast Sockets is targeted at local-area networks of workstations, where processing overhead is the primary limitation on communications performance. For low-overhead access to the network through a defined, portable interface, Fast Sockets uses Active Messages [Mainwaring & Culler 1995][Culler et al. 1994][Martin 1994][von Eicken et al. 1992]. An active message is a network packet that contains the name of a handler function and data for that handler. When an active message arrives at its destination, the handler is looked up and invoked with the data carried in the message. While conceptually similar to a remote procedure call [Birrell & Nelson 1984], an active message is constrained in the amount and types of data that can be carried and passed to handler functions. These constraints allow an Active Messages layer to be structured for high performance. Also, Active Messages uses protected user-level access to the network interface, removing the operating system kernel from the critical path. Active messages are reliable, but not ordered.
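The following fragment sketches the active message model in C. The am_request() call, its signature, and the loopback stub are illustrative stand-ins, not the actual Generic Active Messages interface:

    /* Sketch of the active message model: a message names a handler
     * and carries a few word-sized arguments for it. */
    #include <stdio.h>

    typedef void (*am_handler_t)(void *token, int arg0, int arg1);

    /* Loopback stub so the sketch runs standalone; a real transport
     * would marshal the handler name and arguments onto the wire. */
    static void am_request(int node, am_handler_t handler, int a0, int a1)
    {
        (void)node;
        handler(NULL, a0, a1);   /* "deliver" and invoke at the destination */
    }

    /* Runs on message arrival at the destination, much like a
     * network interrupt. */
    static void ping_handler(void *token, int seq, int payload)
    {
        (void)token;
        printf("ping %d arrived with payload %d\n", seq, payload);
    }

    int main(void)
    {
        am_request(/*node=*/1, ping_handler, /*seq=*/0, /*payload=*/42);
        return 0;
    }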

Using Active Messages as a network transport involves a number of trade-offs. Active Messages has its own `on-the-wire' packet format; this makes a full implementation of TCP/IP on Active Messages infeasible, as the transmitted packets would not be comprehensible to other TCP/IP stacks. Instead, we elected to implement our own protocol for local-area communication and fall back to normal TCP/IP for wide-area communication. Active Messages operates primarily at user level; although access to the network device is granted and revoked by the operating system kernel, data transmission and reception are performed by user-level code. For maximum performance, Fast Sockets is written as a user-level library. While this organization avoids user-kernel transitions on communications events (data transmission and reception), it makes the maintenance of shared and global state, such as the TCP and UDP port name spaces, difficult. Some global state can be maintained by simply using existing facilities; for example, the port name spaces can use the in-kernel name management functions. Other shared or global state can be maintained by a server process, as described in [Maeda & Bershad 1993b]. Finally, using Active Messages limits Fast Sockets communication to the local-area domain. Fast Sockets supports wide-area communication by automatically switching to standard network protocols for non-local addresses. This is a reasonable trade-off, as endpoint processing overheads are generally not the limiting factor for internetwork communication.

Active Messages does have a number of benefits, however. The handler abstraction is extremely useful. A handler executes upon message reception at the destination, analogous to a network interrupt. At this point, the protocol can store packet data for later use, pass it to the user program, or deal with exceptional conditions. Message handlers allow a wide variety of control operations within a protocol without slowing down the critical path of data transmission and reception. Handler arguments enable the easy separation of packet data and metadata: packet data (that is, application data) is carried as a bulk-transfer argument, and packet metadata (protocol headers) is carried in the remaining word-sized arguments. This is only possible if the headers fit into the argument words provided; for our local-area protocol, the eight words supplied by Active Messages are sufficient.
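The fragment below sketches this separation; the handler signature and field names are illustrative, not the Fast Sockets sources. Header fields arrive as separate word-sized arguments, so no header is ever prepended to, or stripped from, the application data:

    #include <stddef.h>
    #include <stdio.h>

    /* The protocol header rides in word-sized arguments; `payload'
     * holds application bytes only. */
    static void fs_data_handler(unsigned src_port, unsigned dst_port,
                                unsigned seq, void *payload, size_t len)
    {
        printf("port %u->%u, seq %u: %zu data bytes, no header to strip\n",
               src_port, dst_port, seq, len);
        (void)payload;
    }

    int main(void)
    {
        char data[] = "application bytes only";
        /* A real sender would pass these through the bulk-transfer
         * call; we invoke the handler directly for illustration. */
        fs_data_handler(1234, 80, 0, data, sizeof data - 1);
        return 0;
    }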

Fast Sockets further optimizes for the local area by omitting features of TCP/IP that are unnecessary in that environment. For example, Fast Sockets relies on the checksum or CRC of the network hardware instead of carrying one in the packet header; software checksums make little sense when packets traverse only a single network. Fast Sockets has no equivalent of IP's `Time-To-Live' field or of IP's internetwork routing support. Since maximum packet sizes do not change within a local-area network, Fast Sockets does not support IP-style in-flight packet fragmentation.

3.2 Collapsing Layers

To avoid the performance and structuring problems created by the multi-layered implementation of Unix TCP/IP, Fast Sockets collapses the API and protocol layers of the communications stack together. This avoids abstraction conflicts between the programming interface and the protocol and reduces the number of conversions between layer abstractions, lowering processing overhead.

The actual network device interface, Active Messages, remains a distinct layer from Fast Sockets. This facilitates the portability of Fast Sockets between different operating systems and network hardware. Active Messages implementations are available for the Intel Paragon [Liu & Culler 1995], FDDI [Martin 1994], Myrinet [Mainwaring & Culler 1995], and ATM [von Eicken et al. 1995]. Layering costs are kept low because Active Messages is a thin layer, and all of its implementation-dependent constants (such as maximum packet size) are exposed to higher layers.

The Fast Sockets layer stays lightweight by exploiting Active Message handlers. Handlers allow rarely used functionality, such as connection establishment, to be implemented without affecting the critical path of data transmission and reception. There is no need to test for uncommon events when a packet arrives: the event type is encoded directly in the choice of handler. Unusual data transmission events, such as out-of-band data, also use their own handlers to keep normal transfer costs low.
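The sketch below illustrates the point; the handler names are ours, not the Fast Sockets sources. A conventional stack tests a type field on every arrival, whereas with Active Messages the sender chooses the handler, so the common path carries no such branch:

    /* Each event type gets its own handler; the rare-event code never
     * appears on the data path. */
    void fs_data_handler(void)    { /* common case: in-order data   */ }
    void fs_oob_handler(void)     { /* rare: out-of-band data       */ }
    void fs_connect_handler(void) { /* rare: connection bootstrap   */ }

    /* A conventional stack demultiplexes every packet on arrival:
     *
     *     switch (pkt->type) {
     *     case DATA: ... case OOB: ... case SYN: ...
     *     }
     *
     * Naming fs_oob_handler (say) in the message itself makes that
     * choice once, at the sender. */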

Reducing the number of layers and exploiting Active Message handlers lowers the protocol- and API-specific costs of communication. While collapsing layers means that every protocol-API combination has to be written anew, the number of such combinations is small, and the number of distinct operations required for each API is small as well.

3.3 Simple Buffer Management

Fast Sockets avoids the complexities of mbuf-style memory management by using a single, contiguous virtual memory buffer for each socket. Data is transferred directly into this buffer via Active Message data transfer messages. The message handler places data sequentially into the buffer to maintain in-order delivery and make data transfer to a user buffer a simple memory copy. The argument words of the data transfer messages carry packet metadata; because the argument words are passed separately to the handler, there is no need for the memory management system to strip off packet headers.

Fast Sockets eliminates send buffering. Because many user applications rely heavily on small packets and on request-response behavior, delaying packet transmission only serves to increase user-visible latency. Eliminating send-side buffering also reduces protocol overhead: there are no copies on the send side of the protocol path, and because Active Messages already provides reliability, data need not be retained for retransmission.
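The fragment below sketches both paths under the scheme just described (wrap-around, flow control, and error handling are omitted, and am_store() is a stand-in name for the Active Messages bulk-transfer primitive, not the actual interface):

    #include <stddef.h>
    #include <string.h>
    #include <sys/types.h>

    struct fast_socket {
        int    dest_node;
        char   buf[65536];   /* single contiguous socket buffer */
        size_t head, tail;   /* consumed / filled offsets       */
    };

    /* Stand-in for the AM bulk-transfer primitive. */
    extern void am_store(int node, const void *data, size_t len);

    /* Send path: transmit straight from the user buffer; no
     * send-side copy, no queueing for retransmission. */
    ssize_t fs_send(struct fast_socket *fs, const void *ubuf, size_t len)
    {
        am_store(fs->dest_node, ubuf, len);
        return (ssize_t)len;
    }

    /* Receive handler: append sequentially to preserve in-order
     * delivery and keep recv() a plain copy. */
    void fs_data_handler(struct fast_socket *fs, void *data, size_t len)
    {
        memcpy(fs->buf + fs->tail, data, len);
        fs->tail += len;
    }

    /* Receive path: one memcpy from socket buffer to user buffer. */
    ssize_t fs_recv(struct fast_socket *fs, void *ubuf, size_t len)
    {
        size_t avail = fs->tail - fs->head;
        size_t n = len < avail ? len : avail;
        memcpy(ubuf, fs->buf + fs->head, n);
        fs->head += n;
        return (ssize_t)n;
    }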

Figure 1 shows Fast Sockets' send mechanism and buffering techniques.


[Figure 1: Data transfer in Fast Sockets. A send() call transmits the data directly from the user buffer into the network. When it arrives at the remote destination, the message handler places it into the socket buffer, and a subsequent recv() call copies it into the user buffer.]

A possible problem with this approach is that many Fast Sockets consume considerably more memory than the global mbuf pool used in traditional kernel implementations. This is not a major concern, for two reasons. First, the memory capacity of current workstations is very large; the scarce physical-memory situations that the traditional mechanisms were designed for are generally not a problem. Second, the socket buffers are located in pageable virtual memory: if memory and scheduling pressures are severe enough, a buffer can be paged out. Although paging out the buffer will hurt performance, we expect this to be an extremely rare occurrence.

A more serious problem with placing socket buffers in user virtual memory is that it becomes extremely difficult to share the socket buffer between processes. Such sharing can arise due to a fork() call, for instance. Currently, Fast Sockets cannot be shared between processes (see section 3.7).

3.4 Copy Avoidance

The standard Fast Sockets data transfer mechanism involves two copies along the receive path, from the network interface to the socket buffer, and then from the socket buffer to the user buffer specified in a recv() call. While the copy from the network interface cannot be avoided, the second copy increases the processing overhead of a data packet, and consequently, round-trip latencies. A high-performance communications layer should bypass this second copy whenever possible.

It is possible, under certain circumstances, to avoid the copy through the socket buffer. If the data's final memory destination is already known upon packet arrival (through a recv() call), the data can be directly copied there. We call this technique receive posting. Figure 2 shows how receive posting operates in Fast Sockets. If the message handler determines that an incoming packet will satisfy an outstanding recv() call, the packet's contents are received directly into the user buffer. The socket buffer is never touched.

Receive posting in Fast Sockets is possible because of the integration of the API and the protocol, and the handler facilities of Active Messages. Protocol-API integration allows knowledge of the user's destination buffer to be passed down to the packet processing code. Using Active Message handlers means that the Fast Sockets code can decide where to place the incoming data when it arrives.
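A sketch of this decision logic, continuing the illustrative types used earlier (again, not the actual Fast Sockets code): recv() publishes its destination buffer, and the handler checks for it on arrival:

    #include <stddef.h>
    #include <string.h>

    struct fast_socket {
        char   buf[65536];   /* fallback socket buffer             */
        size_t head, tail;
        char   *posted;      /* user buffer from a pending recv()  */
        size_t posted_room;  /* space remaining in it              */
        size_t posted_done;  /* bytes delivered into it so far     */
    };

    /* recv() posts its destination before the data arrives. */
    void fs_post_recv(struct fast_socket *fs, void *ubuf, size_t len)
    {
        fs->posted      = ubuf;
        fs->posted_room = len;
        fs->posted_done = 0;
    }

    /* The handler chooses the destination at packet-arrival time. */
    void fs_data_handler(struct fast_socket *fs, void *data, size_t len)
    {
        if (fs->posted && len <= fs->posted_room - fs->posted_done) {
            /* Satisfies the outstanding recv(): copy once, straight
             * into the user buffer; the socket buffer is untouched. */
            memcpy(fs->posted + fs->posted_done, data, len);
            fs->posted_done += len;
        } else {
            /* No recv() outstanding: fall back to the socket buffer. */
            memcpy(fs->buf + fs->tail, data, len);
            fs->tail += len;
        }
    }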


[Figure 2: Data transfer via receive posting. If a recv() call is issued prior to the arrival of the desired data, the message handler can route the data directly into the user buffer, bypassing the socket buffer.]

3.5 Design Issues In Receive Posting

Many high-performance communications systems are now using sender-based memory management, where the sending node determines the data's final memory destination. Examples of communications layers with this memory management style are Hamlyn [Buzzard et al. 1996] and the SHRIMP network interface [Blumrich et al. 1994]. Also, the initial version of Generic Active Messages [Culler et al. 1994] offered only sender-based memory management for data transfer.

Sender-based memory management has a major drawback for use in byte-stream and message-passing APIs such as Sockets. With sender-based management, the sending and receiving endpoints must synchronize and agree on the destination of each packet. This synchronization usually takes the form of a message exchange, which imposes a time cost and a resource cost (the message exchange uses network bandwidth). To minimize this synchronization cost, the original version of Fast Sockets used the socket buffer as a default destination. This meant that when a recv() call was made, data already in flight had to be directed through the socket buffer, as shown in Figure 1. These synchronization costs lowered Fast Sockets' throughput relative to Active Messages' on systems where the network was not a limiting factor, such as the Myrinet network.

Generic Active Messages 1.5 introduced a new data transfer message type that did not require sender-based memory management. This message type, the medium message, transferred data into an anonymous region of user memory and then invoked the handler with a pointer to the packet data. The handler was responsible for long-term data storage, as the buffer was deallocated upon handler completion. Keeping memory management on the receiver improves Fast Sockets performance considerably. Synchronization is now only required between the API and the protocol layers, which is simple due to their integration. The receive handler now determines the memory destination of incoming data at packet arrival time, enabling a recv() of in-flight data to benefit from receive posting and bypass the socket buffer. The net result is that receiver-based memory management did not significantly affect Fast Sockets' round-trip latencies and improved large-packet throughput substantially, to within 10% of the throughput of raw Active Messages.
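The ownership rule for medium messages is the subtle point here; the sketch below shows the contract (the helper names are hypothetical):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helpers: a posted recv() buffer with room for
     * `len' bytes (or NULL), and space in the socket buffer. */
    extern char *posted_buffer(size_t len);
    extern char *socket_buffer(size_t len);

    /* A medium-message handler receives a pointer into an anonymous
     * staging region that the transport reclaims when the handler
     * returns, so the handler must copy anything it wants to keep. */
    void fs_medium_handler(void *data, size_t len)
    {
        char *dst = posted_buffer(len);  /* placement decided at arrival */
        if (dst == NULL)
            dst = socket_buffer(len);    /* no recv() pending: buffer it */
        memcpy(dst, data, len);
        /* Keeping `data' past this return would be a use-after-free. */
    }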

The use of receiver-based memory management has some trade-offs relative to sender-based systems. In a true zero-copy transport layer, where packet data is transferred via DMA to user memory, transfer to an anonymous region can place an extra copy into the receive path. This is not a problem for two reasons. First, many current I/O architectures, like that on the SPARC, are limited in the memory regions that they can perform DMA operations to. Second, DMA operations usually require memory pages to be pinned in physical memory, and pinning an arbitrary page can be an expensive operation. For these reasons, current versions of Active Messages move data from the network interface into an anonymous staging area before either invoking the message handler (for medium messages) or placing the data in its final destination (for standard data transfer messages). Consequently, receiver-based memory management does not impose a large cost in our current system. For a true zero-copy system, it should be possible for a handler to be responsible for moving data from the network interface card to user memory.

Based on our experiences implementing Fast Sockets with both sender- and receiver-based memory management schemes, we believe that, for messaging layers such as Sockets, the performance increase delivered by receiver-based management outweighs the implementation costs.

3.6 Other Fast Socket Operations

Fast Sockets supports the full Sockets API, including socket creation, name management, and connection establishment. A socket() call for the AF_INET address family with the ``default protocol'' creates a Fast Socket; programs can thus still explicitly request a TCP or UDP socket in the standard way. Fast Sockets utilizes existing name management facilities. Every Fast Socket has a shadow socket associated with it; this shadow socket is of the same type and shares the same file descriptor as the Fast Socket. Whenever a Fast Socket is asked to bind() to an AF_INET address, the operation is first performed on the shadow socket to determine whether the operation is legal and the name is available. If the shadow socket bind succeeds, the AF_INET name is bound to the Fast Socket's Active Message endpoint name via an external name service. Other operations, such as setsockopt() and getsockopt(), work in a similar fashion.
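A sketch of the bind() flow (fs_name_service_register() is a hypothetical stand-in for the external name service interface):

    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Hypothetical: publish the AF_INET name -> AM endpoint binding. */
    extern int fs_name_service_register(const struct sockaddr_in *name,
                                        int am_endpoint);

    int fs_bind(int shadow_fd, int am_endpoint,
                const struct sockaddr_in *name)
    {
        /* Reuse the kernel's port name space: if this bind() fails,
         * the name is illegal or taken, and the kernel's error is
         * reported unchanged. */
        if (bind(shadow_fd, (const struct sockaddr *)name,
                 sizeof *name) < 0)
            return -1;

        /* Legal and available: associate the AF_INET name with this
         * socket's Active Message endpoint. */
        return fs_name_service_register(name, am_endpoint);
    }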

Shadow sockets are also used for connection establishment. When the user application makes a connect() call, Fast Sockets determines whether the destination address is in the local subnet. For local addresses, the shadow socket performs a connect() to a port number derived from a hash function. If this connect() succeeds, a handshake is performed to bootstrap the connection. Should the connect() to the hashed port number fail, or the connection bootstrap process fail, a normal connect() call is performed. This mechanism allows a Fast Sockets program to connect to a non-Fast Sockets program without difficulty.
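The logic might look like the following sketch (fs_hash_port(), fs_bootstrap(), and addr_is_local() are hypothetical names, and socket re-creation after a failed connect() is elided):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    extern unsigned short fs_hash_port(unsigned short user_port);
    extern int fs_bootstrap(int fd);  /* handshake to set up AM endpoints */
    extern int addr_is_local(const struct sockaddr_in *a);

    int fs_connect(int fd, const struct sockaddr_in *addr)
    {
        if (addr_is_local(addr)) {
            struct sockaddr_in fsaddr = *addr;
            fsaddr.sin_port = htons(fs_hash_port(ntohs(addr->sin_port)));

            /* A Fast Sockets peer listens on the hashed port; if both
             * the connect and the bootstrap handshake succeed, this is
             * a Fast Sockets connection. */
            if (connect(fd, (const struct sockaddr *)&fsaddr,
                        sizeof fsaddr) == 0 && fs_bootstrap(fd) == 0)
                return 0;
        }
        /* Non-local address, or the peer is not a Fast Sockets
         * program: fall back to a normal connect() on the user port. */
        return connect(fd, (const struct sockaddr *)addr, sizeof *addr);
    }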

An accept() call becomes more complicated as a result of this scheme, however. Because both Fast Sockets and non-Fast Sockets programs can connect to a socket that has performed a listen() call, there are two distinct port numbers for a given socket. The port supplied by the user accepts connection requests from programs using normal protocols. The second port, derived by hashing on the user port number, accepts connection requests from Fast Sockets programs. An accept() call multiplexes connection requests from both ports.

The connection establishment mechanism has some trade-offs. It utilizes existing name and connection management facilities, minimizing the amount of code in Fast Sockets. However, using TCP/IP to bootstrap the connection can impose a high time cost, which limits the realizable throughput of short-lived connections. Using two port numbers also introduces the potential for conflicts: the Fast Sockets-generated port number could collide with a user port. We do not expect this to be a problem in practice.
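To make the two-port accept() concrete, here is a sketch of the multiplexing step using select() over the two listening descriptors (the names and the use of select() are illustrative; the mechanism is not specified above):

    #include <sys/select.h>
    #include <sys/socket.h>

    /* tcp_lfd listens on the user's port (ordinary protocols);
     * fs_lfd listens on the hash-derived Fast Sockets port. */
    int fs_accept(int tcp_lfd, int fs_lfd)
    {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(tcp_lfd, &rd);
        FD_SET(fs_lfd, &rd);

        int maxfd = (tcp_lfd > fs_lfd ? tcp_lfd : fs_lfd) + 1;
        if (select(maxfd, &rd, NULL, NULL, NULL) < 0)
            return -1;

        if (FD_ISSET(fs_lfd, &rd))          /* Fast Sockets peer */
            return accept(fs_lfd, NULL, NULL);
        return accept(tcp_lfd, NULL, NULL); /* ordinary protocol peer */
    }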

3.7 Fast Sockets Limitations

 

Fast Sockets is a user-level library, which limits full compatibility with the Sockets abstraction. First, applications must be relinked to use Fast Sockets, although no code changes are required. More seriously, Fast Sockets cannot currently be shared between two processes (for example, via a fork() call), and all Fast Sockets state is lost upon an exec() or exit() call. This poses problems for traditional Internet server daemons and for ``super-server'' daemons such as inetd, which depend on a fork() for each incoming request. User-level operation also causes problems for socket termination: standard TCP/IP sockets are shut down gracefully on process termination, which a user-level library cannot guarantee. These problems are not insurmountable. Sharing Fast Sockets requires an Active Messages layer that allows endpoint sharing, plus either a dedicated server process [Maeda & Bershad 1993b] or shared memory for every Fast Socket's state. Fast Sockets state lost during an exec() call can be recovered via a dedicated server process, with the Fast Socket's state migrated to and from the server before and after the exec() - similar to the method used by the user-level Transport Layer Interface [Stevens 1990].

The Fast Sockets library is currently single-threaded. This is problematic because, in current versions of Active Messages, an application must explicitly touch the network to receive messages. Since a user application can engage in an arbitrarily long computation between network operations, it is difficult to implement operations such as asynchronous I/O. While multi-threading offers one solution, it makes the library less portable and imposes synchronization costs.


