The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

An implementation of the Hamlyn sender-managed interface architecture

Greg Buzzard, David Jacobson, Milon Mackey, Scott Marovich, and John Wilkes
Computer Systems Laboratory, Hewlett-Packard Laboratories, Palo Alto, CA

As the latency and bandwidth of multicomputer interconnection fabrics improve, there is a growing need for an interface between them and host processors that does not hide these gains behind software overhead. The Hamlyn interface architecture does this. It uses sender-based memory management to eliminate receiver buffer overruns, provides applications with direct hardware access to minimize latency, supports adaptive routing networks to allow higher throughput, and offers full protection between applications so that it can be used in a general-purpose computing environment. To test these claims we built a prototype Hamlyn interface for a Myrinet network connected to a standard HP workstation and report here on its design and performance. Our interface delivers an application-to-application round trip time of 28us for short messages and a one way time of 17.4us + 32.6ns/byte (30.7MB/s) for longer ones, while requiring fewer cpu cycles than an aggressive implementation of Active Messages on the CM-5.

1 Introduction

Processors are rapidly getting faster, and message-passing multicomputer interconnections are doing the same, thanks to recent developments in Gb/s links and low-latency packet switches. But the cost of passing messages between applications also includes the overhead of crossing interfaces between the operating system (OS), a device driver, and the hardware, which can be orders of magnitude more than the cost of moving a message's bits across the wires.

Hamlyn is an architecture for processor-interconnection interfaces that addresses this difficulty. It achieves both low latency and high bandwidth, isolates applications from each other's mistakes, and supplies a rich set of message-delivery semantics. It does so by exploiting several techniques:

* Sender-based memory management. Senders, not receivers, choose the destination memory address at which messages are deposited. This means that messages are sent only when the sender knows that there is memory space for them, eliminating buffer overrun and retransmission under heavy loads.

* Direct application access to interface hardware. Send and receive operations require no OS intervention, yielding very low latencies.

* Zero-copy protocols. Data are transferred directly between application memory and the network with no memory-to-memory copying or page remapping.

* Automatic message reassembly. The interface allows out-of-order packet delivery in order to support adaptive routing networks, which have greater throughput and fault tolerance [Davis92].

* Data movement and message arrival notification are separate.
Data can be moved without interrupting the remote host if desired, which provides greater application control and lower overheads [Thekkath94].

The original Hamlyn design [Wilkes92, Wilkes95], which incorporated most of these features, was intended to support a packet-based, fault-tolerant, adaptive routing network for a large-scale, mimd multicomputer, derived from the Mayfly project [Davis92]. This paper extends the original Hamlyn work by describing:

* performance data from a working prototype;

* improved methods of message arrival notification;

* a more powerful packet counting scheme that supports generalized group-receive semantics;

* layered protocols that provide in-order message streams and application buffer management.

We describe the Hamlyn architecture and our application interface library, present performance measurements of our prototype, discuss related work, and then summarize what we have learned.

2 The Hamlyn architecture

Hamlyn was intended to support scalable, concurrent, fault-tolerant applications, running on a mimd multicomputer or a closely-coupled computer cluster. Such applications often require high-bandwidth bulk data transfers, low-latency control messages (a few microseconds per round-trip), and multiple, independent protection domains provided concurrently on each processor.

For low latency, Hamlyn gives applications direct access to the interface hardware for sending messages with no OS intervention. It provides a fast, low-cost, message-arrival notification mechanism that does not require interrupts or system calls in order to receive messages. (Interrupts may be used as an optional alternative, in which case Hamlyn tries to coalesce them instead of delivering one for each packet.)

For high bandwidth, Hamlyn includes a scatter-gather direct memory access (DMA) capability that frees the host processor for other operations during long transfers. Applications can use it directly: it does not require OS intervention for each use. This allows Hamlyn to avoid all memory-to-memory copying in the host.

For security in a multiple-user environment, Hamlyn prevents mutually suspicious applications from sending, reading, or overwriting each other's data even though they use the interface hardware directly. This avoids the need to statically partition a multicomputer, or to use gang-scheduling and interconnection fabric draining for inter-application protection, as on the CM-5.

Hamlyn deliberately exploits several features of the short-distance interconnection networks commonly used in modern, multicomputer systems:

* Very low transient error rates. Dedicated, enclosed, multicomputer interconnection fabrics are better thought of as extended backplanes than as unreliable networks-they rarely lose or corrupt packets. Hard failures can occur, but transient errors are so infrequent (perhaps one every few months) that it is reasonable to handle them using high-level, application-program mechanisms, such as aborting and restarting a transaction [Saltzer84]. This means that automatic retransmission by a lower level of a protocol stack is unnecessary, improving performance; that packet reception need not be acknowledged, eliminating a potential cause of deadlock; and that transmission buffers can be released as soon as their contents have been sent, simplifying buffer management.

* Hardware flow control in the interconnection fabric. This avoids packet loss by applying back pressure to senders when resources fill up.

* Small packet sizes.
These permit simpler, faster switches and better throughput guarantees, at the cost of requiring message segmentation and reassembly in the interface.

* A physically secure network. In such networks, messages need not be encrypted to protect them against eavesdropping.

Hamlyn was designed with a RISC-like philosophy: to make common cases fast and less common ones possible. An explicit goal was that Hamlyn should be simple enough to be implementable using hardware state machines, since programmable controllers are often slow. The next subsections describe the features of the Hamlyn design that allow these goals to be met.

2.1 Sender-based memory management

The first-and perhaps most important-feature is sender-based memory management, which is a technique to avoid software-induced packet loss. Packet loss is a serious problem in low-latency data communication systems because coping with it usually means retaining transmission buffers, acknowledging packet reception in order to release buffers when they are no longer needed, and using a low-level time-out mechanism to trigger retransmission. Moreover, the problem usually occurs under heavy loads, when retransmission will only make it worse, so even a low rate of packet loss can produce a much higher rate of message loss.

There are two main causes of packet loss: interconnection network problems, such as damaged or lost packets, and receiver buffer overrun. Our design assumptions legislate away the former, and we use sender-based memory management to prevent the latter. The basic idea is to determine a message's final destination in a receiving host's memory before sending it, so that receiver buffer overrun is impossible. The receiving network interface places incoming packets directly into their final resting place instead of leaving them on an interface card or temporarily copying them elsewhere in the receiving host's memory.

2.2 Slots: naming and protecting message areas

Data are sent to and from message areas, which are contiguous regions of an application's virtual-memory address space, protected by OS mechanisms in the usual way (Figure 1). Message areas are wired down (pinned into memory) while they remain allocated. This is a deliberate design decision that trades greater physical memory use for lower latency and greatly simplified interface design.

Message areas are referred to by slots, which are in turn indexed by small integer slot numbers. A slot contains base and bound registers for the message area to which it refers (several slots may refer to the same message area), and a protection key that must also be held by applications wishing to send messages to it. A slot is implemented by a data structure that can only be modified by the OS; this data structure lives in the network interface hardware in our prototype. Memory addresses in Hamlyn messages are represented by (slot, offset) tuples; this indirection allows senders to be isolated from the details of virtual and physical memory addressing at receivers.

Since packets can potentially arrive out of order, each one needs to be self-describing. This is accomplished by having the Hamlyn interface add a header to each packet that it sends out, as shown in Table 1.

Table 1: packet header format (slightly simplified). Each line represents 32 bits.
    Destination host ID
    Slot number
    Metadata index
    Protection key (64 bits)
    Packet offset
    Packet length
    Delta (used in packet counting)
    Metadata length
    Flags
    (user data follows here ...)
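As a concrete illustration of Table 1, the header might be declared roughly as follows. This is our own sketch, not a definition taken from the hardware: the field names, C types, and packing are assumptions, and the interface lays the fields out in a hardware-defined order.

    // Illustrative declaration of the packet header of Table 1 (a sketch only).
    // Every field is one 32-bit word except the 64-bit protection key.
    #include <stdint.h>

    struct hamlyn_packet_header {
        uint32_t dest_host_id;     // destination host ID
        uint32_t slot_number;      // names a message area at the receiver
        uint32_t metadata_index;   // selects a metadata entry within that slot
        uint64_t protection_key;   // must match the key held by the slot
        uint32_t packet_offset;    // byte offset of this packet's data in the message area
        uint32_t packet_length;    // bytes of user data carried by this packet
        uint32_t delta;            // used in packet counting (section 2.5)
        uint32_t metadata_length;  // bytes of out-of-band metadata (first packet only)
        uint32_t flags;
    };                             // user data follows the header in the packet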
When a packet arrives at a receiving interface (Figure 2), the interface locates the named slot, adds the base address of the target message area to the offset specified in the header, and then moves the packet's data to that address using DMA, after checking that the packet will fit into the message area. When the last packet of a message arrives the interface will also notify the receiving application if desired (see section 2.6).

To ensure that data cannot be written into a buffer without permission, the receiving interface compares the protection key in the packet header to the one in the slot. Only if the keys match is writing allowed. Hamlyn protection keys are large (64 bits) and sparsely allocated to provide good inter-application protection.

Application software can often be simplified if a message carries a small amount of out-of-band metadata to be deposited in a separate buffer, so our prototype Hamlyn interface allows up to 60 bytes of it in the first packet sent. Secondary base and bound registers and a protection key may be used to prevent unauthorized overwrites of metadata at the receiver (see section 2.7).

2.3 Sending termini

Hamlyn gives each sending application a private hardware send terminus, implemented as a set of control registers and a fifo work queue in the interface card. These are mapped into the application's virtual-memory address space and protected by OS mechanisms in the usual way. In our prototype, each work queue holds up to 63 entries, so that applications can quickly post several messages without blocking. When the interface sends a message, it writes a sequence number in a prearranged, per-terminus, host memory word; this number, modulo the work queue's size, identifies the corresponding entry, thereby telling the application that the entry and any buffer memory associated with its message can be reused.

Short messages are pushed from a host processor to the terminus queue using ordinary store instructions. We call this direct I/O. To send a message, the application writes a transmission-request block-basically a packet header-into the send terminus' work queue, followed by any metadata, then the data. It then notifies the hardware of the new entry and proceeds to other work; no system call or interrupt occurs when sending a message.

Long messages are pulled from host memory by the interface using asynchronous DMA. Our prototype prevents applications from accessing data belonging to other applications in the following way. Each send terminus has 8 base and bounds registers, settable only by the OS, that are used to identify special send buffer areas that are named by small integer send buffer tags. A single message can contain parts from one or more of these areas; the Hamlyn interface's DMA engine gathers them up on the fly as the message is sent out. To send a long message, the application writes a transmission-request block to the terminus' work queue that consists of the header followed by a sequence of tuples describing the location of metadata and data. It then notifies the interface of the new entry.

The send terminus automatically segments messages larger than a single packet, replicating the header in each packet-except for the offset field, which is adjusted automatically to reflect the address of the new packet at the receiver. All metadata are put into the first packet.
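The segmentation step can be pictured with the following sketch. It is an illustration of the behavior just described, not the interface's actual microcode; packet_payload_bytes, emit_packet, and the argument list are our assumptions, and hamlyn_packet_header is the structure sketched in section 2.2.

    // Sketch: how a send terminus might segment a long message into packets.
    // The header is replicated into every packet; only the offset, length,
    // metadata-length, and delta fields change.  All metadata go in packet 0.
    #include <stddef.h>
    #include <stdint.h>

    const size_t packet_payload_bytes = 256;   // e.g. the 256-byte packets of section 4.2

    // Assumed helper: hands one finished packet to the outgoing DMA engine.
    void emit_packet(const hamlyn_packet_header *h, const uint8_t *data, size_t n,
                     const uint8_t *meta, size_t meta_n);

    void segment_and_send(hamlyn_packet_header hdr,         // prototype header from the
                          const uint8_t *data, size_t len,  //   transmission-request block
                          const uint8_t *meta, size_t meta_len)
    {
        uint32_t counter = hdr.delta;          // initial value Y (0 for a standalone message)
        uint32_t base    = hdr.packet_offset;  // receiver-side offset of the whole message
        size_t   sent    = 0;
        for (;;) {
            size_t chunk = (len - sent < packet_payload_bytes) ? len - sent
                                                               : packet_payload_bytes;
            bool last = (sent + chunk >= len);
            hdr.packet_offset   = base + (uint32_t)sent;                // adjusted per packet
            hdr.packet_length   = (uint32_t)chunk;
            hdr.metadata_length = (sent == 0) ? (uint32_t)meta_len : 0; // first packet only
            if (last) {
                hdr.delta = counter;   // last packet carries the remaining count (section 2.5)
            } else {
                hdr.delta = 1;         // every other packet carries delta = 1 ...
                counter  -= 1;         // ... and decrements the terminus' packet counter
            }
            emit_packet(&hdr, data + sent, chunk,
                        sent == 0 ? meta : nullptr, hdr.metadata_length);
            sent += chunk;
            if (last) return;
        }
    }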
The Hamlyn interface interleaves packets from all the send termini with non-empty work queues, providing approximately equal bandwidth to competing processes sending large amounts of data and, in the absence of network delays, bounding the time any message waits in a work queue. This scheme could be embellished with priorities, although our prototype didn't include them.

2.4 The metadata area

Message areas are intended to receive most incoming data, but we needed three other storage structures for each arriving message:

* a place in which to deposit metadata;

* packet counters for the message-arrival notification mechanism (see section 2.5);

* information for a finer-grained, per-message protection scheme (see section 2.7).

There might be hundreds of application processes, with hundreds of interleaved, concurrently arriving messages per process, so we might need tens of thousands of these structures-too many to store on the interface card. We chose one mechanism to solve all of these problems: a 128-byte metadata entry is provided for each expected message (Table 2). These entries are arranged in a vector, called a metadata area, whose base and bound are stored in a slot data structure. It lives in a receiving application's virtual-memory address space.

Each packet header specifies the index of a metadata entry associated with its destination slot, which may be thought of as a message identifier. A metadata entry may be reused as soon as all packets of a message referring to it have been assembled by the interface and processed by the receiving application. A slot must therefore have enough metadata entries associated with it to accommodate the largest number of messages that might arrive concurrently.

Table 2: a metadata entry.
    User metadata (60 bytes)
    Packet accumulator
    Paranoid-mode protection key
    Paranoid-mode message area low limit
    Paranoid-mode message area high limit
    Hamlyn library receive class pointer

2.5 Packet counting

In some interconnection fabrics, packets of a segmented message may arrive at a receiving interface out of order, so a mechanism is needed to determine when all the packets of a message have arrived and the receiving application can be notified. Hamlyn does this by maintaining a 32-bit packet accumulator in the metadata entry for each expected message. The value of the accumulator starts out at zero and returns to this when all the packets have arrived [Jacobson95].

Each send terminus has a 32-bit packet counter, which is initialized to a value (Y) specified by the sender in the delta field of a transmission-request block. For each packet except the last in a message, the send terminus sets a 32-bit delta field in the packet header to 1 and decrements its packet counter using 2's complement arithmetic. For the last packet it sets the delta field to the final value of the packet counter. It is easily seen that the sum of all delta fields in a transmitted message's packet headers equals Y, modulo 2^32.

In receiving interfaces, the delta field in a packet is added to its associated packet accumulator as the packet arrives. When the accumulator reaches zero again, the entire message has arrived and a receiving application can be notified (see section 2.6). If a packet delta field is zero, notification occurs without consulting the packet accumulator. This is used as an optimization for single-packet messages.

This deceptively simple mechanism provides two important capabilities:

* Single message receive. Out-of-order packet arrival is handled as described above.
To summarize: if a message has only one packet, then its sender sets the packet header's delta field to 0 and notification occurs immediately upon arrival. If a message has several packets, then the sum of the packet headers' delta fields is 2^32. Since the receiving packet accumulator is also counting modulo 2^32, its value will return to 0 exactly when all packets have arrived.

* Group receive. A single notification can be generated when a set of messages from a known group of senders has arrived (e.g., for a distributed barrier operation). In this case, the ith sender is given an initial value Yi such that the sum of the Yi over all participating senders is 2^32. This sum reaches 2^32 and wraps around to zero exactly when all packets in the message set have arrived. Scatter-gather I/O is a special case of this in which all messages come from the same sender. Moreover, a process receiving a packet counter value Yi can delegate work to (say) two other processes, j and k, as long as their initial packet counter values, Yj and Yk, sum to Yi. The identity of a group's senders need never be known by a receiver. The Hamlyn library (section 3) provides a means for dividing buffer space and counter values among the delegates of the group.

Individual, multiple-packet messages could be handled using a simpler mechanism in which each packet header's delta field carries the total packet count, and a packet accumulator is reset by the first packet to arrive; however this would require the sending interface to calculate the total number of packets before sending any. Our scheme allows group reception and scatter-gather I/O while a receiver remains oblivious to the number of packets sent, and requires no extra memory.

2.6 Message arrival notification

A Hamlyn interface indicates that all packets of a message have arrived by appending an entry to a circular notification queue in main memory. There is one such queue for each receiving application. The interface then generates an interrupt only if one was requested, the receiving process is asleep, and the processor has no pending interrupt requests from the interface for other processes. All this reduces interrupt overheads to a minimum.

Notification queues are identified by notification queue control blocks (NQCBs) in the interface card's SRAM, which are in turn referred to by slot data structures (Figures 3 and 4). Each NQCB has a cursor that points to the tail of its notification queue; when all of a message's packets have arrived, an entry (Table 3) is written at the location indicated by the cursor and the cursor is advanced. When it advances beyond the queue's storage area, the cursor is reset and a wrap flag in the NQCB is toggled-that is, the notification queue is treated as a circular list.

Table 3: notification queue entry.
    Wrap flag
    Slot index
    Notification index
    Metadata entry pointer
    (padding for alignment)

A receiver application wishing to busy-wait for a message maintains its own cursor and wrap flag and polls the next available notification queue entry until the wrap flags match. (The cache-coherent I/O of our prototype's workstation ensures that this busy-waiting causes no I/O bus activity until the entry is rewritten.) Because the notification queue is read-only by the application and write-only by the interface, no other locking or synchronization is necessary.
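A small software model may help make the counting and notification machinery concrete. It is purely illustrative: the real logic is implemented by the interface card, the structures below are simplified versions of Tables 2 and 3, and all names are ours. Group receive falls out naturally, since several senders can aim at the same metadata entry and the accumulator returns to zero only when every sender's packets have arrived, because the initial values Yi are chosen to sum to 2^32.

    // Toy model of packet counting (section 2.5) and arrival notification (section 2.6).
    #include <stddef.h>
    #include <stdint.h>

    struct metadata_entry {                  // one per expected message (cf. Table 2)
        uint32_t packet_accumulator;         // starts at 0; returns to 0 on completion
    };

    struct notification_entry {              // simplified version of Table 3; the real
        uint32_t        wrap_flag;           //   entry also carries slot and notification
        metadata_entry *metadata;            //   indices
    };

    struct notification_queue {              // circular; interface writes, application reads
        notification_entry *entries;
        size_t              size;
        size_t              cursor;          // interface-side tail pointer (lives in the NQCB)
        uint32_t            wrap_flag;       // toggled each time the cursor wraps
    };

    // Interface side: called once per arriving packet for the named metadata entry.
    void packet_arrived(metadata_entry *m, uint32_t delta, notification_queue *q)
    {
        if (delta != 0) {                    // delta == 0 marks a single-packet message
            m->packet_accumulator += delta;  // unsigned add, i.e. 2's complement mod 2^32
            if (m->packet_accumulator != 0)
                return;                      // packets (or group members) still outstanding
        }
        // The whole message (or message group) has arrived: post a notification.
        q->entries[q->cursor].metadata  = m;
        q->entries[q->cursor].wrap_flag = q->wrap_flag;
        if (++q->cursor == q->size) {
            q->cursor    = 0;                // treat the queue as a circular list
            q->wrap_flag ^= 1u;
        }
    }

    // Application side: busy-wait for the next entry using a private cursor and wrap
    // flag.  Unwritten entries must start with a wrap flag the application does not expect.
    notification_entry *poll_next(notification_queue *q, size_t *cursor, uint32_t *wrap)
    {
        volatile notification_entry *e = &q->entries[*cursor];
        while (e->wrap_flag != *wrap)
            ;                                // spin; cache-coherent I/O keeps this off the bus
        if (++*cursor == q->size) {
            *cursor = 0;
            *wrap  ^= 1u;
        }
        return const_cast<notification_entry *>(e);
    }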
This is the only part of the Hamlyn design in which buffer overrun might occur: a faulty or malicious sender could transmit messages faster than a receiver can consume them, overwriting an older notification queue entry. This can be prevented by ensuring that the queue's size exceeds the number of messages arriving in the worst case. Section 2.7 describes a mechanism that can enforce an upper bound on this count; section 2.8 describes a discipline-based solution to bounding it.

2.7 Special modes of operation

The mechanism described so far accommodates cooperating processes in a single application, but we wanted Hamlyn to provide increased robustness, privacy, and security for client-server systems of mutually suspicious processes: specifically, several senders sharing a common receiver slot. To this end, slots can be put into two special modes of operation called paranoid and paranoid_one_shot. Both modes use extra fields in a message's metadata entry: secondary base and bound registers, and a protection key that supersedes the slot data structure's key.

When a packet arrives in a slot operating in one of these special modes, the destination range given by the packet header's offset and length fields is checked first against the slot's message area (primary) base and bounds, then against the metadata entry's (secondary) base and bounds, while the packet header's key field is compared to the secondary protection key. All tests must pass before the packet is accepted. Additionally, in paranoid_one_shot mode, no further use of the metadata entry is allowed without application software intervention after the first message for the entry has been received, preventing subsequent messages from being received before the first one is processed.

These modes provide several advantages:

* Senders' data can be confined to a small part of a message area, limiting the damage that a faulty or malicious client can cause.

* The paranoid_one_shot mode prevents a faulty or malicious client from overrunning a notification queue as long as the queue's size exceeds the number of metadata entries used. This mode also ensures that a packet's data cannot be overwritten after message arrival, so the data need not be copied elsewhere for safe-keeping.

* Every sender can have its own protection key, allowing revocation of one sender's access rights without affecting others. For example, if a node appears to have failed, all of its senders' keys can be revoked, potentially allowing message and metadata areas to be immediately reassigned.

* Since metadata areas are allocated in main memory, applications can change the secondary base, bounds, and key registers for their own metadata areas without OS intervention.

These modes entail more complicated interface logic and extra tests during packet arrival, which introduce a small amount of extra latency. (There need not be more host memory accesses, since a metadata entry's packet accumulator must be updated anyway.)

A third special mode of operation, called fast mode, represents a special-case optimization for single-packet messages: if an arriving packet header's metadata index is all 1's, then the packet header is used to carry exactly one word of metadata, which is written in a notification queue entry in place of a metadata entry pointer. This lets a single-packet message carry a small amount of metadata with minimal overhead. In retrospect, it may have been a premature optimization.
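The acceptance test for a slot in one of the paranoid modes might look roughly like the sketch below. The structure layouts and names are ours, and the treatment of the limits as offsets relative to the message area's base is a guess at a hardware detail; the real interface performs the equivalent checks itself.

    // Sketch of packet acceptance for a slot in paranoid or paranoid_one_shot mode.
    #include <stdint.h>

    struct slot_checks {
        uint32_t area_length;               // size of the slot's message area (primary bound)
    };

    struct paranoid_fields {                // the extra metadata-entry fields of Table 2
        uint64_t secondary_key;             // supersedes the slot's key in these modes
        uint32_t low_limit, high_limit;     // secondary limits on where this sender may write
        bool     used;                      // set after the first message in paranoid_one_shot
    };

    // Returns true if the packet may be accepted.
    bool accept_paranoid(const slot_checks &s, paranoid_fields &m,
                         uint64_t key, uint32_t offset, uint32_t length,
                         bool one_shot, bool completes_message)
    {
        uint32_t end = offset + length;
        if (one_shot && m.used)       return false;   // entry disarmed until software re-arms it
        if (key != m.secondary_key)   return false;   // per-sender protection key
        if (end > s.area_length)      return false;   // primary (message area) bounds
        if (offset < m.low_limit || end > m.high_limit)
            return false;                             // secondary (per-sender) bounds
        if (one_shot && completes_message)
            m.used = true;                            // no further messages until re-armed
        // ... DMA the data, update the packet accumulator, and notify as usual ...
        return true;
    }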
2.8 Flow control and deadlock prevention

An important consideration in Hamlyn's design was to ensure that, in the absence of a failing sender, receiver, or interconnection fabric, data are never lost and communication never deadlocks. Deadlock is a treacherous issue. If Hamlyn employed low-level acknowledgment of each packet's transmission, and communications were ever blocked because the fabric could not accept more packets, then acknowledgments could be blocked as well, potentially causing deadlock. We avoid deadlock by not depending upon hardware packet acknowledgments and by arranging that the only hardware moderating the flow of incoming packets is the host's I/O bus. Incoming transfers are given priority over outgoing transfers for access to the I/O bus, so that the maximum time an incoming packet stalls at the interface is the time for one packet to traverse the I/O bus plus a small amount of overhead in the interface.

When power is first applied to a Hamlyn interface card, it enters a state in which it simply accepts and discards arriving packets. It does the same if the host fails to reset a periodic handshake timer. The Myricom switch used in our prototype detects powered-down interfaces and discards packets destined for them. It uses a round-robin service discipline for incoming ports to avoid starvation of incoming packets. If arriving packets are delayed while waiting for the Hamlyn interface to process them, or if there is conflicting traffic at a switch port, the switch and interface hardware generate link-level back-pressure, halting the sender. Eventually the sending application will block when its terminus' work queue fills up.

To streamline higher-level protocols, we exploit the fact that the interconnection fabric never loses packets in the absence of failures, which is a reasonable design decision for a small-area, multicomputer interconnection, although less so for a wide-area network. Sender management of memory automatically imposes a higher level of flow control for buffer space and metadata management, so a sending application blocks when no resources are available to service a request at the receiver.

A final concern is to prevent notification queue overrun. This can be accomplished by ensuring that the gap between the number of messages processed at the receiver and the number that can have been transmitted by the senders is always smaller than the number of metadata entries available. This limit can be enforced in paranoid_one_shot mode.

2.9 Architectural costs

Hamlyn consciously makes some design choices that impose additional costs compared to more traditional approaches:

1. Message, metadata, and notification queue storage areas must be wired down.

2. Applications are responsible for buffer management, including reclamation after sends and receives.

In return for these costs-which we think are relatively small, and not unlike those imposed by other high-performance network designs-Hamlyn provides fully protected, direct application access to a network, automatic message segmentation and assembly, group reception notification, rejection of messages from failing or malicious processors, and both direct I/O and DMA sends.

3 The Hamlyn interface library

In order to make the Hamlyn architecture easier to use, we built a two-level application interface library. The upper layer provides a set of convenient programming abstractions and hides the details of buffer memory management.
The lower layer provides a simple, efficient procedural interface to our communication hardware. The library was designed to provide a convenient infrastructure for popular middleware, such as mpi [Corbett95], Active Messages [vonEicken92], and Oracle's distributed lock manager. We describe these layers from the bottom up (Figure 5).

3.1 OS interface

The Hamlyn interface card is managed by a device driver module in the host operating system. OS modifications to support Hamlyn in Unix systems are largely confined to this driver, which provides all interface management services requiring OS mediation, such as creating slots and termini, installing slot protection keys, wiring and unwiring memory, and arranging to suspend or resume an application pending a message's arrival. All other interface management services reside in unprivileged library code, linked in with application programs.

3.2 Low-level library procedures

The Hamlyn library includes a procedural interface to the network interface hardware and hardware-manipulated data structures. It uses a data structure called a ticket, which contains a message's destination, slot number, metadata index, protection key, data buffer base and bound, and some flags. Tickets are location-independent in that they may be exported in a message and used later to reply from a remote Hamlyn interface.

The library's lower layer has two main functions: h_send_msg and h_recv. The former accepts a ticket, a data buffer's address and length, and a metadata buffer's address and length. Buffer addresses are converted to offsets in sender message areas. A small message is written to the interface using direct I/O, while a DMA request is built for a large message. The function returns a handle that can later be used to decide when to release a buffer. The h_recv function checks whether a message has arrived and, if so, it returns the address of the corresponding notification queue entry. A variant, h_recv_block, accepts a time-out argument and waits until either a message arrives or the specified interval expires.

3.3 Higher-level protocols

The Hamlyn library's upper layer supports a set of protocols with varying semantics to send and receive data. Three protocols, embodying most of its key ideas, are described below. This layer was written in C++ because the language lets communication end points be represented as objects, keeping an application's name space clean and letting each protocol use the same operation names (e.g., send, receive). We were inspired by the work of Stepanov and Lee on the C++ Standard Template Library [Stepanov95] in which aggressive inlining and optimization are combined to achieve highly efficient object code.

The library creates a Hamlyn manager (an instance of the hamlyn_manager class) for each send terminus. This class is responsible for managing the device driver interface and notification queues, and for allocating buffer memory. Applications create an end point for message reception by instantiating a receiver class, which contains one or more metadata areas and buffers, and a queue of received messages. It also provides a C++ virtual procedure (process_arrival) that is called by the Hamlyn manager to record a new message's arrival.

When receive is applied to an end point, a message is returned from the arrival queue in the receiver class instance if possible. Otherwise-if the queue is empty-the receiver instance calls a poll_nowait routine in the hamlyn_manager.
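To give a flavor of the lower layer, here are illustrative C++ declarations for the calls described in section 3.2. The exact types, argument order, and return values of the real library may differ; everything below should be read as an assumption based on the description above.

    // Illustrative declarations only; the real Hamlyn library defines these.
    #include <stddef.h>
    #include <stdint.h>

    struct h_ticket {                        // exportable description of one remote buffer
        uint32_t dest_host_id;               // message destination
        uint32_t slot_number;
        uint32_t metadata_index;
        uint64_t protection_key;
        uint32_t buffer_base, buffer_bound;  // data buffer base and bound
        uint32_t flags;
    };

    struct notification_entry;               // as written by the interface (Table 3)
    typedef int h_handle;                    // opaque; representation assumed

    // Send a message described by a ticket.  Buffer addresses are translated to
    // offsets in the sender's areas; small messages go out by direct I/O, large
    // ones by DMA.  The handle says when the buffers may be reused.
    h_handle h_send_msg(const h_ticket *t,
                        const void *data, size_t data_len,
                        const void *meta, size_t meta_len);

    // Non-blocking receive: the address of the next notification queue entry,
    // or a null pointer if nothing has arrived.
    notification_entry *h_recv();

    // As h_recv, but waits until a message arrives or the time-out expires.
    notification_entry *h_recv_block(unsigned timeout_us);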
The poll_nowait routine gets the next notification queue entry if there is one, follows its pointers to the metadata area and receiver end point, and then calls process_arrival there, passing it the address of the metadata. Control then returns to the receiver end point originally called by the application, which looks again for a new message. (It may not have received one if the notification queue was empty or the queue entry just processed represented a message for a different receiver instance.) This process continues until the original request is satisfied, a time interval expires, or the receiver end point chooses to block instead of busy-wait.

Although the Hamlyn architecture and the library's lower level are thread-safe, the upper level is not. Making it so remains a research topic.

All of the connection-oriented receiver classes support a make_seed call, which returns a seed object, containing tickets for preallocated buffers, metadata, and other information. A seed can be sent to a remote node in order to create an instance of the corresponding end point sender class. (It's so named by analogy with the similar purpose seeds serve in plants.)

The Hamlyn library uses these techniques to support the following protocols:

Simple datagram protocol. This provides access to raw Hamlyn hardware semantics. The send call is a wrapper for the h_send_msg routine. It creates a small amount of metadata that tells the receiver the offset and length of the transmitted data and can include a reply ticket. There is a separate receiver instance for each metadata entry, and each instance is either "ready" or "not ready": there is no queue.

Stream protocol. This provides a one-way, one-to-one, in-order connection from a sender to a receiver. The sender class supports send and flush. Large buffers are sent as-is, and small ones may be coalesced by copying. Calling flush forces transmission of all previously-posted data. The receiver class supports receive and release; the former returns a pair describing its result; no copying occurs. It automatically allocates more buffers if needed. The release procedure frees all records up to and including that identified by its argument. Senders block if they ever get so far ahead of the receiver that they run out of tickets (which provide permission to write to a metadata entry).

Tagged Remote Write. In this protocol, a send call specifies a ticket, a source buffer's address and length, the destination buffer offset, and an integer tag. Tags are enqueued in the receiver and can be retrieved by calling get_tag or get_specific_tag. This protocol uses fast mode, so messages must fit in a single packet. In-order delivery is not guaranteed.

4 Performance evaluation

In order to evaluate the Hamlyn architecture, we collaborated with the University of California at Berkeley and Myricom, Inc., to build a prototype interface card for a Myrinet network [Boden95, Buzzard95]. The Myrinet switch we used is a non-blocking, 8x8 crossbar, which uses wormhole routing. It provides 80MB/s of bandwidth per port in each direction with about 0.5us of latency. We used Myricom's LANai Version 4.0 network controller chip with 512KB of on-card static RAM, microcoded to implement the Hamlyn design. The LANai has a 32-bit cpu and three DMA units (Figure 6): incoming from the switch, outgoing to the switch, and to/from host memory. The host DMA engine is the only mechanism available to the LANai controller to access the host's main memory.
Our host computers were early-production HP 9000 Series 770 (J200) PA-RISC workstations with 100MHz cpus, running Version 10.00 of the hp-ux operating system. The interface cards plugged into the workstations' graphics I/O bus, which operates at the same frequency as the LANai cpu (40 MHz). The bus and its processor interface can support incoming DMA at 106MB/s, but outgoing transfers are limited to 32MB/s because of reduced opportunities for pipelining requests through the bus interface chip. The workstations have a cache-coherent I/O architecture, which obviates the need for software to flush or purge data cache lines in DMA buffers. It also lets applications busy-wait for Hamlyn DMA completion without consuming I/O bus bandwidth. DMA buffers' I/O bus addresses are mapped to physical memory addresses by the workstations' memory-to-I/O bus interface hardware, so that our card can access a multiple-page buffer in a contiguous I/O bus address range-a considerable simplification.

Unless otherwise stated, our performance measurements were taken by sending a message from one application-level process to a second remote one, which returned the message to the sender. The round-trip times we measured were divided by two to get one-way data. Where we report single measurement numbers, we generated them by timing at least 10000 messages. There was less than 1% variation between independent runs of our test suite. The performance data reported here apply only to our test systems and do not necessarily represent products currently in production.

4.1 Short single-packet messages

The lowest latency is obtained using our interface's fast mode of operation (see section 2.7). We measured two cases using 16-byte payloads. In one, application code wrote data to the interface using direct I/O without the Hamlyn library, while the other case used the library's Tagged Remote Write protocol. The first case took 12.7us one way (25.4us round-trip), and the second case added 1.4us of overhead, yielding 14.1us in all (28.2us round-trip). Table 4 shows where the time goes.

Table 4: one-way short-message transfer time (us).
                              raw interface    tagged-remote-write protocol
    DMA                           3.3              3.3
    LANai                         6.7              6.7
    Switch                        0.5              0.5
    Host I/O writes               1.4              1.4
    Host protocol software        0.8              2.2
    Total                        12.7             14.1

Notice that the 1.4us due to the host I/O writes represents only 8 store instructions: these are slow because of the cost of traversing the I/O bus. By contrast the host protocol software costs represent tens or hundreds of instructions.

4.2 Latency for normal messages

Figure 7 shows the one-way latency as a function of message size with 0, 4, and 60 bytes of metadata and 256-byte packets. Several effects are visible:

* A baseline cost of about 17.4us due to host software overhead, LANai control program overhead, and the cost of writing a notification queue entry.

* Increases in latency at message sizes that are multiples of 256 bytes. Our LANai code takes about 12.5us more to receive a packet of this size than to send one, so the overall time increases at each packet boundary. The first step is larger than the others because the code changes from never updating the packet accumulator to updating it twice, while subsequent packets cause one update each.

* A marginal transmission cost of 55.5ns/byte (18.0MB/s).
This results because the three steps involved are handled serially for each packet: about 31.3ns/byte to move outgoing data to the interface using DMA, 12.5ns/byte to move it across the network at 80MB/s, and 11.7ns/byte for DMA into the destination host.

* Repeating fine structure with a 32-byte period due to power-of-2 I/O bus transaction sizes. For example, a 28-byte transfer requires three transactions (for 16, 8, and 4 bytes) while a 32-byte transfer is done in just one.

* An extra 7.7us for sending metadata, including low-level library overhead to translate buffer addresses and then start incoming and outgoing DMA. Metadata bytes incur the same transfer cost as other data but are not counted in the x-axis of Figure 7, so sending more metadata shifts the lines to the left.

Figure 8 shows the one-way latency for 4KB packets. Here, the bottleneck is moving data from the sending host to the Hamlyn interface card across the I/O bus controller. The latency exhibits alternating costs of 55.5ns/byte and 11.6ns/byte: if the last, partial packet of a message is almost full, each additional byte waits for outgoing DMA, transmission, and incoming DMA, totalling 55.5ns/byte. But small, final packets arrive early enough that they need only wait for the previous packet's DMA into the host memory to finish, which sets the marginal cost of 11.6ns/byte.

If a receiving process is asleep, latency is dominated by interrupt service and process-context switching time. We observed 78us for packets with no payload, which compares favorably with other recent reports [Jones96, Keeton95, vonEicken95]. (On the same machine, a context switch provoked by a semaphore takes 31us.)

4.3 Bandwidth and packet size

Using 4KB packets, the bottleneck is the I/O bus interface. The observed slope of the latency function in Figure 7 is 30.7MB/s, from which we infer that the LANai control program achieves a 96% payload utilization of the 32MB/s outgoing DMA channel. The actual utilization is somewhat higher, since counter values are fetched and stored for each packet. (A potential optimization that we did not explore would be to cache some counters in the interface card.)

Our packet size can be altered by recompiling the LANai control program, so it is instructive to examine the latency achieved versus message size for various packet sizes. Figure 9 shows the overall one-way bandwidth as a function of message length for several packet sizes. The receiving interface takes 12.9us per packet. For 256-byte packets, this limits bandwidth to 20MB/s, which agrees with our observations. Asymptotic performance increases with packet size, but 2KB packets outperform 4KB packets for smaller messages because there is more concurrency between outgoing DMA, transmission, and incoming DMA. The advantage of 4KB packets is slight and inconsistent up to message sizes of 40KB, the largest we measured.

Even though the interface does not pad packets to their maximum size, there are conflicting pressures on the choice of packet size:

* Choosing a packet size large enough that network transit time exceeds LANai packet-processing time keeps the LANai from being a bottleneck. The larger the packet size, the more LANai time remains for other tasks, including processing packets moving in the opposite direction.

* Long delays should be avoided. Transferring a 4KB packet occupies a link and a DMA channel for about 50us, a long time to block another packet needing the same resources. This argues in favor of shorter packets.
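As a cross-check, the per-byte and per-packet costs combine into the quoted bandwidth figures as one would expect. The following stand-alone calculation (not part of the Hamlyn software) reproduces the 18.0MB/s, 30.7MB/s, 96%, and 20MB/s numbers from the measured constants given here and in the abstract:

    // Sanity checks on the measured constants quoted in sections 4.2 and 4.3.
    #include <stdio.h>

    int main()
    {
        double out_dma = 31.3, wire = 12.5, in_dma = 11.7;        // ns/byte, 256-byte packets
        double serial  = out_dma + wire + in_dma;                 // 55.5 ns/byte
        printf("serial cost        = %.1f ns/byte\n", serial);
        printf("serial bandwidth   = %.1f MB/s\n", 1e3 / serial); // ~18.0
        printf("bulk bandwidth     = %.1f MB/s\n", 1e3 / 32.6);   // ~30.7 (32.6ns/byte, 4KB packets)
        printf("outgoing-DMA usage = %.0f%%\n", 100.0 * (1e3 / 32.6) / 32.0);  // ~96% of 32MB/s
        printf("256B packet limit  = %.1f MB/s\n", 256.0 / 12.9); // ~19.8, i.e. about 20
        return 0;
    }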
Taking these into account, we recommend a packet size of 1KB for the particular combination of bandwidths and overheads measured for our prototype.

The effect of competing for LANai cpu cycles and I/O bus bandwidth can be seen when a message's source and destination are the same process on the same host, as shown in Figure 10. With only 4 bytes of data, the latency in this loopback test rises to 19.1us because sending and receiving compete for the same LANai controller chip. The bandwidth figures tell a similar story: the asymptotic bandwidth of 28MB/s from Figure 9 drops to 22MB/s in Figure 10 because of I/O bus contention.

4.4 Projections for alternative hardware

The LANai performance is relatively low because it is a programmable controller. If we were to implement a Hamlyn interface using hardware state machines, we estimate that the one-way, application-to-application short-message transfer time would decrease to about 6us, but the large-message bandwidth, which is limited by the I/O bus, would not change appreciably.

5 Related work

There has been a great deal of work in the field of interface design for high-speed interconnections, especially since the original Hamlyn design was written up. The history of these ideas is not entirely clear: several teams were inventing similar-sounding approaches at around the same time. It is neither fair to say that Hamlyn copied from them, nor that they copied from Hamlyn. (For the record, the earliest extant reference to Hamlyn is dated July 1992.) Although we think that Hamlyn's main contribution lies in its coupling of sender-based memory management to its protection scheme, we present a somewhat broader summary of what we consider to be the work most relevant to our finished design.

IBM's OS/360 provided variants of put and get file-system calls that avoided data copying by having the OS specify the location of the buffer to use, rather than the application [Clark66, Belady81]. Hamlyn uses a variant of this mechanism in its interface library.

5.1 Load/store interfaces

A store instruction can be thought of as a degenerate, sender-directed message-indeed, there is a large and active literature that views large-scale shared memory machines in this way, of which the Cray T3D, Convex Exemplar, ksr AllCache architecture, Alewife [Kranz93], Typhoon [Reinhardt94], and low-level sci protocols [IEEE92] are representative examples. All require dedicated hardware support that is tightly integrated with the host processors.

Several groups have used the load/store paradigm in less tightly coupled systems to provide an interface to cross-network communication. For example, the Alto remote memory reference protocol [Spector81] used network messages in this way; [Thekkath94] discusses the idea of separating data movement from notification in remote load/store operations (Hamlyn also allows this); and shrimp [Blumrich94] provides low-latency remote-memory access using hardware support for automatic data replication, coupled to a virtual-memory protection scheme. Many of these schemes provide excellent performance for the particular operations that they support-specialization is a powerful tool for lowering latency-but sometimes at the expense of relatively high processor utilization. Most implicitly depend upon in-order packet delivery.
The hybrid deposit model [Osborne94] combines sender-based addressing with the execution of small programs on a remote node, using both local and remote data-a considerable generalization of the remote fetch-and-op proposed in [Wilkes92]. Implemented in software on top of a 155Mb/s ATM system, it achieved a round-trip time of 49us without a switch and 60us with one. Osborne credits [Subhlok93] with introducing the term "deposit model" for what we call sender-based memory management.

5.2 Copy avoidance

Several projects have used page-remapping and smart interface buffer allocation to accelerate processor-to-interface communication, including the fbufs work at the University of Arizona [Druschel93], the Medusa fddi interface [Lumley92, Banks93] and the follow-on Afterburner project [Dalton93]. The Nectar system [Cooper90] allowed applications direct access to its communication interface memory in order to eliminate copies at the cost of all accesses being to memory in the I/O space. It achieved round-trip rpc latencies of 500us across a 100MB/s network.

ATM network interfaces can use virtual circuit identifiers (VCIs) to provide early demultiplexing of incoming data to user data buffers. One such use occurred in the Osiris project [Druschel94], which combined stream demultiplexing using ATM VCIs into fbufs, some support for out-of-order delivery, and direct access to the network interface for a limited number of applications. Together, these achieved a round-trip latency of 154us and a maximum throughput of about 41Mb/s on a 622Mb/s ATM network.

CMU's hardware-assisted remote put (harp) interface to the CreditNet ATM adapter card allows applications to send directly from their own buffers (akin to Hamlyn message areas), and to provide a set of buffers into which data for a virtual circuit is placed, but it does not appear to allow direct addressing of remote memory on a per-message basis [Mummert96]. We were unable to locate any published latency figures for this interface.

5.3 Cranium

Cranium [McKenzie94], like Hamlyn, was designed to provide a host interface to a packet-switching fabric that performed adaptive routing. Like Hamlyn, it has message areas that are used to send and receive data, although it appears that these are restricted to 2KB pages, and the expectation is that there is a single message per area, since there seems to be no provision for a message-offset field. Multiple packets use a sequence number to allow reassembly, rather than offsets; this simplifies the hardware, but it requires that all packets in a message be of the same size. (This means that Cranium could not handle variable-length metadata.) Receivers specify the identity of senders expected to write to a message area, and this is used for protection checks. (There is no discussion of how spoofing is prevented.) Cranium also provides queueing channels, which allow messages to be appended to the end of a message area. We thought about providing these for Hamlyn but eventually decided not to: (1) to force us to work through all of the details of pure, sender-based memory management; (2) to avoid introducing a prima facie source of receiver buffer overruns; and (3) because such messages have to be restricted to single packets.

Cranium supports many of the goals of Hamlyn, but its designers made several decisions to reduce functionality in order to simplify the interface. It thus represents a different point in the design space.
5.4 Active Messages

Active Messages [vonEicken92, vonEicken94, Martin94] provide a set of arrival semantics for single-packet messages by including the address of a function to call in each one. The function is typically invoked in a restrictive environment on the interrupt stack, with no protection barriers around it. As a result, aggressive implementations of Active Messages are the standard performance target for this kind of work. The main difference is that Hamlyn provides security between applications; data placement is controlled by the sender, rather than the receiver; and all protocol processing happens in a well-defined application context. "Although the restrictions and limitations of previous interfaces [to Active Message systems] made their implementations simple and efficient, the same restrictions and limitations prevent them from supporting the broader spectrum of applications now required" [Mainwaring95a].

[Karamcheti94] reported instruction counts (but no timings) for Active Messages on a cm-5 (cmam), which are roughly comparable to ours although they were measured on a Sparc processor and ours are for PA-RISC. The cmam finite-sequence, multiple-packet delivery protocol seems to provide functionality that approaches our simple datagram protocol: it does not support our group-receive operations, but, like ours, it does handle out-of-order packet delivery. [Karamcheti94] quotes 397 instructions to do a 16-word (64-byte) unidirectional send. A send using Hamlyn's tagged remote write consumes 260 processor cycles (fewer instructions), most of which are consumed when the processor stalls while writing to the I/O bus, and a receive consumes 120 cycles.

[Wallach95] reports the lifting of one of the restrictions on Active Messages: that the handlers must not block.

We think that the idea of Active Messages is good, and we are gratified that some Hamlyn features are making their way into a revised proposal [Mainwaring95a], which supports protection, the caching of end point descriptions, and multiple send and receive areas per end point. On the other hand, we think that a scheme requiring host processor intervention on every packet would not be such a good idea because the process-context switches would prove too expensive. Indeed, the current trend in processor design seems to be toward ever-larger amounts of machine state, which will make this more costly still. Hamlyn addresses this concern by automating message reassembly in the interface card.

U-Net [vonEicken95] embodies some of the same principles as Hamlyn, including direct user-level access to the interface in order to eliminate OS involvement whenever possible, and end points that can route incoming messages directly to application memory. Like Hamlyn, the prototype U-Net implementation is built by re-microcoding an existing interface card-a Fore Systems ATM interface. (By their definition, Hamlyn is using a "standard network interface"!) Since it is built on ATM, which is inherently unreliable, U-Net has to deal with lost packets. Its performance is slightly worse than Hamlyn's: [vonEicken95] reports Active Message round-trip times on top of U-Net of 79us for 32 bytes of data or less (41us of which is due to the ATM switch; the equivalent Myrinet time is 1us) and 135us + 0.2us/byte for bulk data transfers. (The equivalent Hamlyn numbers are probably the 28us for a round-trip tagged remote write, and the one-way bulk data transfer cost of 17.4us + 32.6ns/byte with 4KB packets.)
6 Conclusions

What did we learn from this exercise? First, the basic Hamlyn approach seems to have been validated: we can provide low-latency, high-bandwidth, protected communication directly from multiple application programs, with little or no OS intervention. We also picked up a few other observations and lessons along the way.

6.1 Network demands

[Karamcheti94] argues that the underlying network should provide in-order delivery, deadlock freedom, and fault-tolerant packet transmission. We conclude instead that Hamlyn can synthesize in-order delivery cheaply, assuming a deadlock- and error-free network, giving interconnection designers freedom to optimize for performance, rather than high-level protocol support. [Davis92] argues that an adaptive-routing network can achieve roughly twice the throughput of a non-adaptive one.

6.2 Buffer management

Hamlyn does not copy outgoing messages, so applications must be coded to avoid reusing buffers until transmission is complete. To help with this, the Hamlyn library provides a function that determines whether a message buffer can be reclaimed.

Metadata often originate in a few small variables on an application program's run-time stack, but if transmission is done using DMA, they must be copied to a special, wired-down metadata area. This proved burdensome; if we were to redesign Hamlyn, we might always send metadata using direct I/O.

6.3 Opening and monitoring connections

There is a "bootstrapping" problem when contacting a long-lived server: a potential client cannot transmit to the server because it has no resources allocated there, and the server cannot send a ticket or seed because it does not know of the client's existence. We considered adding an unreliable fifo message queue, but we decided that since these operations are not time-critical, they could be done with standard OS services, which might themselves use Hamlyn inside the operating system. (The lowest-level bootstrapping problem here can be solved by allocating well-known slots, one for each remote processor, which the OS instances can use to establish higher-level communication paths.)

A similar observation applies to detecting peer-process failures. We once thought that an "Are you there?" message should be sent periodically between processes in a highly available system, but if such a polling interval expires without a reply, an application does not know whether the system is overloaded, or the polled host has failed, or a peer process is dead, or the process is stuck in a long computation. On the other hand, the OS has definitive knowledge of processes' states and so can prevent much of this confusion. The moral is to let the OS do what it is good at.

6.4 Closing connections

Traditional networks have difficulty providing reliable connection close because of potential message loss. In the absence of such loss, they can do a good job because the OS knows about the connection setup and can tear it down even after the application dies. (This is even true in most application-level protocol suites, which invoke the OS for connection setup/teardown.) In Hamlyn, the OS cannot fulfill this role because it has no knowledge of the connections, so we reverted to a model where our prototype only allows graceful close operations by a sender.

6.5 Stronger security

Hamlyn uses 64-bit protection keys. We estimate that our prototype can detect and discard a packet having an invalid key within 3us. At this rate, a brute-force attack is likely to take about 877,000 years.
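For reference, the arithmetic behind that estimate, assuming an attacker must search about half of the 2^64 possible keys at 3us per attempt:

    2^63 guesses x 3us/guess = about 2.8 x 10^13 seconds, or roughly 877,000 years.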
Keys can be generated using cryptographic-quality pseudo-random number generators, or generators embodying true random processes, so we think that attack by guessing keys is futile. But Hamlyn keys reside in applications' memory address spaces, so our defense against forged messages depends upon memory privacy. For this reason, and because some users find probabilistic protection unsatisfactory, we thought briefly about other techniques: It would be easy to make protection keys accessible only indirectly, and have applications specify indices into a secure, per-process table of keys, maintained by the OS. A secure OS would then make keys unforgeable. But this scheme would fundamentally alter the Hamlyn paradigm, since almost all communication channel management would then require OS intervention. In practice, this is largely irrelevant because the main security problem is user passwords, which are much simpler to attack than 64-bit binary keys. 6.6 Interface memory cost The original Hamlyn design proposed that slots be cached by interface hardware in order to make them abundant and cheap. Our prototype allocates slot data structures in expensive interface card SRAM, and metadata areas in main memory. We never satisfactorily established the right trade-off between function and complexity here. 6.7 Summary The Hamlyn architecture provides users with a message passing interface having a combined hardware and software latency of just a few microseconds, while providing full protection between mutually suspicious applications. We described the most important techniques underlying our implementation, including design trade- offs that we can make (and have made), and we presented performance measurements. Our design is optimized for closely-coupled, multicomputer systems. It yields better performance than loosely-coupled clusters of autonomous computers and, due to the inherent isolation of message-passing systems, provides much better fault tolerance than shared-memory systems, as well as inter-application protection at low cost. All of these needs must be addressed if large-scale, parallel machines are to have a significant impact upon general-purpose computing. The Hamlyn architecture is an important step in that direction. Acknowledgments Martin Fouts and Bill Worley of HP Laboratories provided the initial encouragement to turn Hamlyn from a random thought into a proper architecture. Monroe Bridges and his colleagues in Hewlett-Packard's Networked Computing Division provided valuable insight from the perspective of product designers, forcing us to justify and simplify our design. Cedric Krumbein and other members of the Network of Workstations project in the Computer Science and Engineering Division of the University of California at Berkeley helped us to design our interface cards, while the staff of Myricom, Inc., helped us to build them and gave us a considerable amount of advice, sample firmware, and early access to their technology, enabling our success. References [Banks93] D. Banks and M. Prudence. A high performance network architecture for a PA-RISC workstation. IEEE Journal on Selected Areas in Communications 11(2), February 1993. [Belady81] L. A. Belady and R. P. Parmelee, and C. A. Scalzi. The IBM history of memory management technology. IBM Journal of Research and Development, 25(5):491-503, September 1981. [Blumrich94] Matthias A. Blumrich, Kai Li, Richard Alpert, Cezary Dubnicki, Edward W. Felton, and Jonathan Sandberg. 
[Blumrich94] Matthias A. Blumrich, Kai Li, Richard Alpert, Cezary Dubnicki, Edward W. Felten, and Jonathan Sandberg. Virtual memory mapped network interface for the SHRIMP multicomputer. Proceedings of the 21st International Symposium on Computer Architecture (Chicago, IL). Published as Computer Architecture News, 22(2):142-53. ACM/IEEE, April 1994.
[Boden95] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: a Gigabit-per-second local area network. IEEE Micro, pages 29-36, February 1995.
[Buzzard95] Greg Buzzard, David Jacobson, Scott Marovich, and John Wilkes. Hamlyn: a high-performance network interface with sender-based memory management. Presented at Hot Interconnects III (Stanford, CA), August 1995. Available from https://www.hpl.hp.com/personal/John_Wilkes/ftp-index.html#Hamlyn.
[Clark66] W. A. Clark. The functional structure of OS/360: part III, data management. IBM Systems Journal, 5(1):30-51, 1966.
[Cooper90] Eric Cooper, Peter Steenkiste, Robert Sansom, and Brian Zill. Protocol implementation on the Nectar communication processor. Proceedings of the ACM SIGCOMM '90 Symposium (Philadelphia, PA), September 1990.
[Corbett95] Peter Corbett, Dror Feitelson, Sam Fineberg, Yarsun Hsu, Bill Nitzberg, Jean-Pierre Prost, Marc Snir, Bernard Traversat, and Parkson Wong. Overview of the MPI-IO parallel I/O interface. 3rd Annual Workshop on I/O in Parallel and Distributed Systems (IOPADS '95) (Santa Barbara, CA), pages 1-15, April 1995.
[Dalton93] Chris Dalton, Greg Watson, David Banks, Costas Calamvokis, Aled Edwards, and John Lumley. Afterburner. IEEE Network, 7(4):36-43, July 1993.
[Davis92] Al Davis. Mayfly: a general-purpose, scalable, parallel processing architecture. Lisp and Symbolic Computation, 5(1-2):7-48, May 1992.
[Druschel93] Peter Druschel and Larry L. Peterson. Fbufs: a high-bandwidth cross-domain transfer facility. Proceedings of the 14th ACM Symposium on Operating Systems Principles (Asheville, NC), pages 189-202, December 1993.
[Druschel94] Peter Druschel, Larry L. Peterson, and Bruce S. Davie. Experiences with a high-speed network adaptor: a software perspective. Proceedings of the 1994 ACM SIGCOMM Conference on Communications Architectures, Protocols and Applications, pages 2-13, August 1994.
[IEEE92] Standard for Scalable Coherent Interface (SCI). IEEE Standard 1596-1992.
[Jacobson95] David M. Jacobson. Method and apparatus for determining when all packets of a message have arrived. US patent application, filed 24 February 1995, allowed 25 June 1996.
[Jones96] Rick Jones. NetPerf. https://www.cup.hp.com/netperf/NetperfPage.html, Hewlett-Packard Company, Cupertino, CA.
[Karamcheti94] Vijay Karamcheti and Andrew A. Chien. Software overhead in messaging layers: where does the time go? Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA). ACM, October 1994.
[Keeton95] Kimberly Keeton, Thomas Anderson, and David Patterson. LogP quantified: the case for low-overhead local area networks. Presented at Hot Interconnects III (Stanford, CA), August 1995. Available at https://now.cs.berkeley.edu/Papers/Papers/hotinter95-tcp.ps.
[Kranz93] David Kranz, Kirk Johnson, Anant Agarwal, John Kubiatowicz, and Beng-Hong Lim. Integrating message-passing and shared-memory: early experience. Proceedings of the 4th ACM Annual Symposium on Principles and Practice of Parallel Programming, May 1993.
[Lumley92] J. Lumley. A high-throughput network interface to a RISC workstation. Hewlett-Packard Laboratories technical report HPL-92-7, January 1992.
[McKenzie94] Neil R. McKenzie, Kevin Bolding, Carl Ebeling, and Lawrence Snyder. Cranium: an interface for message passing on adaptive routing networks. Proceedings of the Parallel Computer Routing and Communication Workshop (Seattle, WA), pages 266-80, May 1994.
[Osborne94] Randy Osborne. A hybrid deposit model for low overhead communication in high speed LANs. Proceedings of the 4th International IFIP Workshop on Protocols for High-Speed Networks, August 1994. Available as https://www.merl.com/TR/TR94-02c/Welcome.html.
[Reinhardt94] Steven K. Reinhardt, James R. Larus, and David A. Wood. Tempest and Typhoon: user-level shared memory. Proceedings of the 21st International Symposium on Computer Architecture (Chicago, IL). Published as Computer Architecture News, 22(2):325-36. ACM/IEEE, April 1994.
[Saltzer84] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):277-88, November 1984.
[Stepanov95] Alexander Stepanov and Meng Lee. The Standard Template Library. Technical Report HPL-95-11 (R.1), Hewlett-Packard Laboratories, 1995.
[Subhlok93] J. Subhlok, J. Stichnoth, D. O'Hallaron, and T. Gross. Exploiting task and data parallelism on a multicomputer. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (San Diego, CA), pages 13-22, May 1993.
[Thekkath94] Chandramohan A. Thekkath, Henry M. Levy, and Edward D. Lazowska. Separating data and control transfer in distributed operating systems. Proceedings of ASPLOS VI (4-7 October 1994, San Jose, CA). Published as Operating Systems Review, 28(5):2-11, December 1994.
[vonEicken92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: a mechanism for integrated communication and computation. Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Australia), pages 256-66, May 1992.
[vonEicken94] Thorsten von Eicken, Veena Avula, Anindya Basu, and Vineet Buch. Low-latency communication over ATM networks using active messages. Presented at Hot Interconnects II (Stanford, CA), August 1994.
[vonEicken95] Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. U-Net: a user-level network interface for parallel and distributed computing. Proceedings of the 15th ACM Symposium on Operating Systems Principles (Copper Mountain Resort, CO). Published as Operating Systems Review, 29(5):40-53, December 1995.
[Wilkes92] John Wilkes. Hamlyn---an interface for sender-based communications. Technical report HPL-OSR-92-13, Operating Systems Research Department, Hewlett-Packard Laboratories, Palo Alto, CA, November 1992. Available from https://www.hpl.hp.com/personal/John_Wilkes/ftp-index.html#Hamlyn.
[Wilkes95] John Wilkes. Inter-processor communication system in which messages are stored at locations specified by the sender. US patent number 5,448,698, granted September 1995.

The authors can be contacted as follows: gdb@geoplex.com, {jacobson, marovich, mackey, wilkes}@hpl.hp.com. Please address correspondence to David Jacobson.