The following paper was originally published in the Proceedings of the USENIX Conference on Object-Oriented Technologies (COOTS), Monterey, California, June 1995.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Object-Oriented Components for High-speed Network Programming

Douglas C. Schmidt, Tim Harrison, and Ehab Al-Shaer
schmidt@cs.wustl.edu, harrison@cs.wustl.edu, and ehab@cs.wustl.edu
Department of Computer Science
Washington University
St. Louis, MO 63130
(314) 935-7538

Abstract

This paper makes two contributions to the development and evaluation of object-oriented communication software. First, it reports performance results from benchmarking several network programming mechanisms (such as sockets and CORBA) on Ethernet and ATM networks. These results illustrate that developers of bandwidth-intensive and delay-sensitive applications (such as interactive medical imaging or teleconferencing) must evaluate their performance requirements and the efficiency of their communication infrastructure carefully before adopting a distributed object solution. Second, the paper describes the software architecture and design principles of the ACE object-oriented network programming components. These components encapsulate UNIX and Windows NT network programming interfaces (such as sockets, TLI, and named pipes) with C++ wrappers. Developers of object-oriented communication software have traditionally had to choose between the high-performance, lower-level interfaces provided by sockets or TLI and the less efficient, higher-level interfaces provided by communication frameworks like CORBA or DCE. ACE represents a midpoint in the solution space by improving the correctness, programming simplicity, portability, and reusability of performance-sensitive communication software.

Introduction

Distributed object computing (DOC) frameworks like the Common Object Request Broker Architecture (CORBA), OODCE, and OLE/COM are well-suited for applications that exchange richly typed data via request-response or oneway communication. However, current implementations of DOC frameworks may be less suitable for an important class of bandwidth-intensive and delay-sensitive applications that stream relatively simple datatypes over high-speed networks. Medical imaging, interactive teleconferencing, and video-on-demand are common examples of these streaming applications.

Streaming applications with stringent throughput and delay requirements are ideal candidates for high-speed networks such as ATM and FDDI. However, these applications may not be able to tolerate the overhead associated with contemporary DOC frameworks. This overhead stems from non-optimized presentation layer conversions, data copying, and memory management; inefficient receiver-side demultiplexing and dispatching operations; synchronous stop-and-wait flow control; and non-adaptive retransmission timer schemes.

Meeting the throughput demands of streaming applications has traditionally involved direct access to network programming interfaces such as sockets or System V TLI. These lower-level interfaces are efficient since they omit unnecessary functionality (such as presentation layer conversions for ASCII data) and allow fine-grained control over memory management, protocol buffering, demultiplexing, and flow control.
However, conventional network programming interfaces are low-level, non-portable, and non-typesafe, which complicates programming and permits subtle run-time errors. For instance, communication endpoints in the socket interface are identified by weakly-typed integer handles (also known as socket descriptors). Weak type-checking increases the potential for run-time errors since compilers cannot detect or prevent improper use of handles. Thus, operations can be applied to handles incorrectly (such as invoking a read or write on a passive-mode handle that can only accept connections).

Traditionally, developers of high-performance streaming applications had to choose between two solutions:

Higher-level, but less efficient network programming interfaces -- such as DOC frameworks or RPC toolkits;

Lower-level, but more efficient network programming interfaces -- such as sockets or TLI.

This paper describes object-oriented network programming components that provide a midpoint in the solution space. These components are part of the ACE toolkit, which encapsulates conventional network programming interfaces with a family of C++ wrappers. The ACE toolkit improves the correctness, ease of use, portability, and reusability of communication software without sacrificing performance.

This paper is organized as follows: Section compares the performance of several network programming mechanisms (C sockets, C++ wrappers for sockets, and two implementations of CORBA) for a representative streaming application over Ethernet and ATM networks; Section outlines the design of the object-oriented ACE components that encapsulate UNIX and Windows NT network programming interfaces (such as sockets, TLI, STREAM pipes, and named pipes); Section illustrates the differences between programming with C sockets, ACE, and CORBA; Section summarizes the design principles of the ACE wrappers; and Section presents concluding remarks.

Performance Experiments

This section describes performance results from comparing several network programming mechanisms that transfer large streams of data using TCP/IP over Ethernet and ATM networks. The network programming mechanisms compared below include C sockets, C++ wrappers for sockets, and two implementations of CORBA. The benchmark tests are representative of applications written by the authors for the Motorola Iridium project (which is a next-generation satellite-based global personal communication system) and Project Spectrum (which is an enterprise-wide medical imaging system that transports radiology images across high-speed ATM LANs and WANs).

Test Platform and Benchmarks

The performance results in this section were collected using a Bay Networks LattisCell 10114 ATM switch connected to two uni-processor SPARCstation 20 Model 50s. The LattisCell is a 16-port OC-3 switch running at 155 Mbps per port. The SPARCstations contain 100 MIPS SuperSPARC CPUs running SunOS 5.4. The SunOS 5.4 TCP/IP protocol stack is implemented using the STREAMS communication framework. Each SPARCstation 20 has 64 Mbytes of RAM and an ENI-155s-MF ATM adaptor card, which supports 155 Mbits/sec (Mbps) SONET multimode fiber. The Maximum Transmission Unit (MTU) size of a SONET frame on the ENI ATM adaptor is 9,180 bytes. Each ENI card has 512 Kbytes of on-board memory, of which 32 Kbytes is allotted per ATM virtual circuit connection for receiving and transmitting frames (for a total of 64K per connection). This allows up to 8 connections per card.
Data for the experiments was produced and consumed by an extended version of the widely available ttcp protocol benchmarking tool. This tool measures end-to-end data transfer throughput in Mbps from a transmitter process to a remote receiver process. The flow of user data is uni-directional, with the transmitter flooding the receiver with a user-specified number of data buffers. Various sender and receiver parameters (such as the number of data buffers transmitted, the size of data buffers, and the size of the socket transmit and receive queues) may be selected at run-time. The following versions of ttcp were implemented and benchmarked: C version -- this is the standard ttcp program implemented in C. It uses C socket calls to transfer and receive data via TCP/IP. ACE version -- this version replaces all C socket calls in ttcp with the C++ wrappers for sockets provided by the ACE network programming components (version 3.2) . The ACE wrappers encapsulate sockets with efficient and typesafe C++ interfaces. CORBA versions -- two implementations of CORBA were used: version 1.3 of Orbix from IONA Technologies and version 1.2 of ORBeline from Post Modern Computing. These versions replace all C socket calls in ttcp with stubs and skeletons generated from a pair of CORBA IDL definitions. One IDL definition uses a sequence parameter for the data buffer and the other uses a string parameter. Each version of ttcp was compiled using SunC++ 4.0.1 with the highest level of optimization ( -O4 ). To control for confounding factors the timing mechanisms, command-line options, socket options, and communication protocols were held constant for all implementations of ttcp . Only the connection establishment and data transfer mechanisms were varied. Results We ran a series of tests that transferred 64 Mbytes of user data in buffers ranging from 1 byte to 128 Kbytes using TCP/IP over Ethernet and ATM networks. Data buffers were run in increments of 1 byte, 1K, 2K, 4K, 8K, 16K, 32K, 64K, and 128K sizes. Two different sizes for socket queues were also used: 8K (the default on SunOS 5.4) and 64K (the maximum size supported by SunOS 5.4). Each test was run 20 times to account for performance variation due to transient load on the networks and hosts. The variance between runs was very low since the tests were conducted on otherwise unused networks. Figure summarizes the performance results for all the benchmarks using 64K socket queues over a 155 Mbps ATM link and a 10 Mbps Ethernet (the 8K socket queue results are presented below and Tables and summarize the results for all the tests). The C and ACE C++ wrapper versions of ttcp obtained the highest throughput: 62 Mbps using 8K data buffers. In contrast, the Orbix and ORBeline CORBA versions of ttcp peaked at around 39 Mbps with 64K data buffers using IDL sequences . The results for Ethernet show much less variation, with the performance for all tests ranging from around 8 to 8.7 Mbps with 64K socket queues. None of the Ethernet benchmarks ran faster than 8.7 Mbps, which is 87 percent of the maximum speed of a 10 Mbps Ethernet. Although the absolute throughput of ttcp is much faster over ATM, the relative utilization of the network channel speed was much lower (62 Mbps represents only 40 percent of the 155 Mbps ATM link). The disparity between network channel speed and end-to-end application throughput is known as the throughput preservation problem , where only a portion of the available bandwidth is actually delivered to applications. 
This problem stems from operating system and protocol processing overhead (such as data movement, context switching, and synchronization). As shown in Section , the throughput preservation problem is exacerbated by contemporary implementations of DOC frameworks like CORBA, which copy data multiple times during fragmentation/reassembly, marshalling, and demarshalling. Sections and examine these performance results in detail and Section presents recommendations based on the results.

C and ACE Wrapper Implementations of TTCP

Figure illustrates the performance results from the C and ACE wrapper versions of ttcp over ATM and Ethernet. The performance of C sockets and ACE C++ wrappers is roughly equivalent, indicating there is no significant performance penalty for using the ACE wrappers. Both peak at 62 Mbps over ATM using 8K data buffers and 64K socket queues. When the data buffers exceeded 8K, performance began to decline, leveling off at around 48 Mbps with 64K data buffers. This behavior is caused primarily by the MTU size of the ATM network, which is 9,180 bytes (the MTU size of a SONET frame). When data buffers exceed the MTU size they are fragmented and reassembled, thereby lowering performance.

Figure also illustrates the impact of socket queue size on throughput. Larger socket queues increase the TCP window size, which allows the transmission of multiple TCP segments back-to-back. In the case of ATM, increasing the socket queue from 8K to 64K improves ttcp performance significantly, from 23 Mbps to 62 Mbps. The Ethernet results for large and small socket queues are more similar than the ATM results. They peak at 8.4 Mbps with 8K socket queues and 8.7 Mbps with 64K socket queues. In both cases, the factor limiting performance is the slow speed of the network.

CORBA Implementations of TTCP

Figure illustrates the results of measuring two versions of ttcp implemented with two different versions of CORBA. The CORBA implementations were developed using single-threaded versions of Orbix 1.3 and ORBeline 1.2. At the time these tests were performed, neither Orbix nor ORBeline fully supported the OMG 2.0 CORBA standard. This complicated the CORBA versions of ttcp somewhat since different implementations were required to account for differences in Orbix and ORBeline.

Extending ttcp to use CORBA required several modifications to the original C/socket code. All C socket calls were replaced with stubs and skeletons generated from a pair of CORBA interface definitions. One IDL interface uses a sequence to transmit the data and the other IDL interface uses a string, as follows:

typedef sequence<char> ttcp_sequence;

interface TTCP_Sequence {
  oneway void send (in ttcp_sequence ttcp_seq);
};

interface TTCP_String {
  oneway void send (in string ttcp_string);
};

The send operations use oneway semantics since the ttcp benchmarks measure the performance of uni-directional data transfers. The client-side of ttcp was modified to obtain object references to the server-side TTCP_Sequence and TTCP_String object implementations, as follows:

// Use locator service to acquire bindings.
TTCP_String *t_str = TTCP_String::_bind ();
TTCP_Sequence *t_seq = TTCP_Sequence::_bind ();

Data buffers of the appropriate size were initialized and then transmitted by calling the IDL-generated send stubs, as follows:

// String transfer.
char *buffer = new char[buffer_size];
// Initialize data in char * buffer...

while (--buffers_sent >= 0)
  t_str->send (buffer);

// Sequence transfer.
TTCP_Sequence sequence_buffer;
// Initialize data in TTCP_Sequence buffer...

while (--buffers_sent >= 0)
  t_seq->send (sequence_buffer);

The server-side was modified to create object implementations for TTCP_Sequence and TTCP_String. CORBA IDL compilers generate skeletons that translate IDL interface definitions (such as TTCP_Sequence) into C++ base classes (such as TTCP_SequenceBOAImpl). Each IDL operation (such as oneway void send) is mapped to a corresponding C++ pure virtual method (such as virtual void send). Programmers then define C++ derived classes that override these virtual methods to implement application-specific functionality, as follows. (Both CORBA implementations of ttcp used inheritance since ORBeline does not support Orbix's ``TIE'' technique, which uses object composition to tie application-specific classes to the generated IDL skeletons.)

// Implementation class for IDL interface
// that inherits from automatically-generated
// CORBA skeleton class.
class TTCP_Sequence_i : virtual public TTCP_SequenceBOAImpl
{
public:
  TTCP_Sequence_i (void): nbytes_ (0) {}

  // Upcall invoked by the CORBA skeleton.
  virtual void send (const ttcp_sequence &ttcp_seq,
                     CORBA::Environment &IT_env)
  {
    this->nbytes_ += ttcp_seq._length;
    // ...
  }

private:
  // Keep track of bytes received.
  u_long nbytes_;
};

The server-side used the CORBA impl_is_ready event loop to demultiplex incoming requests to the appropriate object implementation, as follows:

int main (int argc, char *argv[])
{
  // Implements the Sequence object.
  TTCP_Sequence_i ttcp_sequence;

  // Implements the String object.
  TTCP_String_i ttcp_string;

  // Tell the ORB that the objects are active.
  CORBA::BOA::impl_is_ready ();
  /* NOTREACHED */
  return 0;
}

Porting ttcp to use CORBA over ATM demonstrated the importance of having sufficient hooks to manipulate underlying OS mechanisms (such as transport layer and socket layer options) that significantly affect performance. In particular, high-performance data transfers over TCP and ATM require large socket queues. This is illustrated by the considerable difference in throughput for the 8K and 64K socket queues in Figures and . Orbix provides hooks to enlarge socket queues via setsockopt by invoking a user-defined callback function whenever a new socket is connected. In contrast, it was hard to enlarge the socket queues using ORBeline 1.2 since it did not provide direct access to sockets (subsequent versions of ORBeline will provide this functionality).
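The hook itself amounts to only a few lines of socket-level code. The following is a minimal sketch of the per-connection tuning such a callback would perform; the function name is hypothetical, but setsockopt with the SO_SNDBUF and SO_RCVBUF options is the standard mechanism for enlarging a socket's queues:

#include <sys/types.h>
#include <sys/socket.h>

// Hypothetical per-connection callback: enlarge the socket queues to 64K
// so that TCP can advertise a larger window and keep the ATM link full.
static void enlarge_socket_queues (int handle)
{
  int size = 64 * 1024;

  // Expand the kernel send and receive buffers for this connection.
  setsockopt (handle, SOL_SOCKET, SO_SNDBUF, (char *) &size, sizeof size);
  setsockopt (handle, SOL_SOCKET, SO_RCVBUF, (char *) &size, sizeof size);
}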
By comparing Figure with Figure it is clear that the CORBA-based ttcp implementations ran considerably slower than the C and ACE wrapper versions on the ATM network, particularly for 8K data buffers. The highest throughput (39 Mbps) was obtained by the Orbix sequence implementation using 64K data buffers and 64K socket queues. The performance leveled off beyond 64K data buffers. Unlike the C and ACE wrapper results in Figure , the performance of the CORBA versions did not decrease when the size of the data buffers exceeded 8K. This behavior stems from the higher fixed overhead of CORBA (such as demultiplexing and memory management), which lowers its performance for small buffer sizes. As the buffer size increases, however, the relative impact of this fixed overhead is reduced. However, as the buffers increase in size the overhead of data copying grows, which ultimately limits the throughput achievable with the CORBA implementations.

Further profiling and examination of the IDL stubs and skeletons generated by Orbix and ORBeline revealed that the CORBA overhead stems from the following sources:

Data Copying: The data buffers exchanged between the sender and receiver in ttcp are treated as a stream of untyped bytes. This is similar to the type of data transmitted by streaming applications such as teleconferencing and medical imaging. Since the data is untyped, the CORBA presentation layer need not perform complex marshalling to handle byte-ordering differences between sender and receiver. Although marshalling is not required, the CORBA implementations incurred significant data copying overhead. The UNIX profiler prof was used to pinpoint the sources of this overhead. prof measures the amount of time spent in functions during program execution. Figure lists, for all the tests, the functions where the most time was spent sending and receiving 64 Mbytes using 128K data buffers and 64K socket queues.

The read and write system calls accounted for most of the execution time in the C and ACE wrapper implementations of ttcp. The remaining time for the sender-side was spent preparing the data for transmission. Note that although the data was transmitted as 512 buffers of 128K each, it was read by the receiver in much smaller chunks (around 4K). This illustrates the fragmentation and reassembly performed by the ATM network adaptors.

The read and write system calls dominated the execution of the CORBA implementations, as well. However, unlike the C and ACE wrapper versions, these implementations spent 4 to 15 percent of their time performing other tasks, such as copying and/or inspecting data (memcpy, strcpy, and strlen), checking for activity on other handles (poll), and manipulating signal handlers (sigaction). The highest cost tasks involved data copying. The IDL stubs and sequences copy data multiple times, e.g., from the TCP data buffer into a marshalling buffer, and then again into the parameter passed to the send upcall.

The results in Figure illustrate that the choice of CORBA IDL parameter datatypes has a significant impact on performance. The sequence implementations shown in Figure peaked at 39 Mbps for Orbix and 38 Mbps for ORBeline. In contrast, the string implementations peaked at 34 Mbps for Orbix and 30 Mbps for ORBeline. The performance variation between the sequence and string versions results from differences in their IDL to C++ mappings. In particular, the IDL sequence mapping contains a length field, whereas the string mapping does not. Thus, the generated stubs and skeletons use this length field to avoid searching each sequence parameter for a terminating NUL character. In contrast, the IDL string implementations use strlen to determine the length of their parameters.

The performance variation between Orbix and ORBeline results from differences in their message fragmentation/reassembly implementations, as well as the design of their socket event handling. As shown in Figure , ORBeline copies data approximately three more times than Orbix on the sender and receiver for both sequence and string. In addition, ORBeline invokes the poll and sigaction system calls over 1,000 times. The Orbix implementation does not perform these extra operations, which is one reason why ORBeline's performance is consistently lower than Orbix's in Figure .
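The length-field difference between the two mappings can be illustrated with a simplified sketch; this is an illustration of the cost, not the ORBs' actual marshalling code:

#include <string.h>
#include <stddef.h>

// Simplified stand-in for the generated sequence type: the data is
// accompanied by an explicit length.
struct ttcp_sequence_t
{
  unsigned long length;
  char *buffer;
};

// A sequence parameter already knows how many bytes to marshal ...
inline size_t payload_size (const ttcp_sequence_t &seq)
{
  return seq.length;
}

// ... whereas a string parameter must be scanned for its terminating
// NUL character on every invocation, touching the entire buffer.
inline size_t payload_size (const char *str)
{
  return strlen (str);
}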
Demultiplexing: Each CORBA request message contains the name of its remote operation, represented as a string. Orbix demultiplexes incoming messages to the upcall by performing a linear search through the list of operations in the IDL interface. In the case of ttcp, linear search suffices since there was only one choice (send). However, this strategy does not scale since search time grows linearly with the number of operations in the IDL interface. Moreover, the order of operations will determine the demultiplexing performance. Therefore, operations in Orbix should be ordered by decreasing frequency of use. In contrast, ORBeline uses hashing to determine the appropriate upcall associated with an incoming request. Hashing is likely to scale better for large IDL interfaces, but may be less efficient for small interfaces. Thus, demultiplexing may benefit from adaptive optimizations that select customized strategies depending on the properties of the IDL interface. Alternatively, perfect hashing or some type of integral indexing scheme could be negotiated between sender and receiver to improve performance and to shield developers from having to manually tune their IDL interfaces.

Memory allocation: CORBA-generated skeletons do not know how the user-supplied upcall will use the parameters passed to it from the request message. Thus, they use conservative memory management techniques that dynamically allocate and release copies of messages before and after an upcall, respectively. These memory management policies are important in some circumstances (e.g., if an upcall is used in a multi-threaded application). However, this strategy needlessly increases processing overhead for streaming applications like ttcp that immediately consume their data without modifying it.
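To make the demultiplexing trade-off above concrete, the following sketch contrasts a linear operation lookup with a hashed one; it is purely illustrative and does not reproduce either ORB's actual dispatching code:

#include <string.h>

// An operation table entry: the IDL operation name and the upcall to run.
typedef void (*upcall_t) (void *request);
struct op_entry { const char *name; upcall_t upcall; };

// Linear search (cf. Orbix): cost grows with the number of operations
// and depends on where the operation sits in the table.
upcall_t lookup_linear (op_entry table[], int n, const char *op)
{
  for (int i = 0; i < n; i++)
    if (strcmp (table[i].name, op) == 0)
      return table[i].upcall;
  return 0;
}

// Hashed lookup (cf. ORBeline): roughly constant cost per request once
// the table is built, at the price of hashing the name on every message.
// Collision handling is omitted for brevity.
upcall_t lookup_hashed (op_entry buckets[], int n_buckets, const char *op)
{
  unsigned long h = 0;
  for (const char *p = op; *p != '\0'; p++)
    h = 31 * h + (unsigned char) *p;

  op_entry &e = buckets[h % n_buckets];
  return (e.name != 0 && strcmp (e.name, op) == 0) ? e.upcall : 0;
}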
Evaluation and Recommendations

Section compared the performance of C, ACE wrapper, and CORBA versions of ttcp in terms of their ability to stream large quantities of data using TCP/IP over Ethernet and ATM networks. Tables and summarize the results for all the ATM and Ethernet tests, respectively. All tests perform roughly the same on Ethernet. However, the data copying overhead of the CORBA implementations significantly limits their throughput on ATM. This illustrates that the overhead of CORBA implementations may not be revealed until the network is no longer the limiting factor. In addition, the profiler results in Figure illustrate that small design and implementation differences have a large performance impact on high-speed networks.

As users and organizations migrate to high-speed networks, the performance limitations of contemporary CORBA implementations will become more evident. This should encourage vendors to optimize the performance of their ORBs for streaming applications running over high-speed networks such as ATM. Key areas of optimization include presentation layer conversions, memory management and memory copying, and receiver-side demultiplexing and dispatching. In particular, implementations must reduce the number of times that large data buffers are copied on the sender and receiver. The need for these optimizations is widely recognized in the communication protocol community, and prototypes that implement these optimizations are becoming available.

Until these optimizations are widely implemented in production systems, however, we recommend that developers of bandwidth-intensive and delay-sensitive streaming applications on high-speed networks consider the following when adopting a distributed object computing solution:

Carefully measure the performance of the communication infrastructure (i.e., the network/host hardware and software). The ttcp benchmarks and ACE source code described in this paper are freely available and may be obtained via anonymous ftp from ics.uci.edu in the file /C++_wrappers.tar.Z or from URL http://www.cs.wustl.edu/~schmidt/. We encourage others to replicate our ttcp experiments using different implementations of CORBA and other network/host platforms and report the results.

Evaluate tools based on empirical measurements and a thorough understanding of application requirements, rather than adopting a particular communication model or implementation unconditionally.

Integrate higher-level DOC frameworks with high-performance object-oriented encapsulations of lower-level network programming interfaces (such as the ACE socket wrappers described in Section ).

Insist that CORBA implementors provide hooks to manipulate the underlying protocol layer and socket layer options conveniently. It is particularly important to increase the size of the socket queues to the largest values supported by the OS.

Tune the size of transmitted data buffers to match the MTU of the network where appropriate.

Use IDL sequences rather than strings to avoid unnecessary data access.

The performance results and recommendations in this paper are not intended as a criticism of the CORBA model or of particular ORB vendors. It is beyond the scope of this paper to discuss the benefits (such as extensibility and maintainability) of CORBA, as well as its limitations. Clearly, implementations of other DOC frameworks (such as OODCE or OLE/COM) that do not address key sources of overhead on high-speed networks will exhibit similar performance problems.

An Object-Oriented Network Programming Interface

Low-level network programming interfaces like sockets or TLI are difficult to program. They require strict attention to many tedious details, making them hard to learn and error-prone to program. In addition, programming directly to low-level interfaces limits portability and reuse. One solution is to develop applications using higher-level distributed object computing (DOC) frameworks like CORBA. DOC frameworks shield developers from low-level programming details and facilitate a reasonably portable distributed computing platform. As described in the previous section, however, the performance of conventional implementations of CORBA may be inadequate for bandwidth-intensive and delay-sensitive streaming applications on high-speed networks.

One method for satisfying the tension between programming simplicity, portability, and run-time efficiency is to encapsulate lower-level network programming interfaces with object-oriented wrappers. By judicious use of language features (such as inlining and templates) and design patterns (such as Factories, Connectors, and Acceptors) it is possible to create reusable object-oriented components that are typesafe, portable, convenient to program, and efficient. This section outlines the design of the IPC SAP object-oriented network programming components provided by the ACE toolkit.
ACE contains a set of object-oriented networking programming components that perform active and passive connection establishment, data transfer, event demultiplexing, event handler dispatching, routing, dynamic (re)configuration of application services, and concurrency control . IPC SAP stands for ``InterProcess Communication Service Access Point.'' It consists of a family of class categories shown in Figure that encapsulate handle-based network programming interfaces such as sockets ( SOCK SAP ), TLI ( TLI SAP ), UNIX SVR4 STREAM pipes ( SPIPE SAP ), and UNIX named pipes ( FIFO SAP ). These network programming wrappers are designed to improve the correctness, programming simplicity, portability, and reusability of performance-sensitive communication software. This section describes the SOCK SAP socket wrappers, focusing on interface design techniques that shield programmers from shortcomings of C, C++, and existing OS network programming interfaces. Limitations with Sockets Sockets were originally developed in BSD UNIX to provide an interface to the TCP/IP protocol suite . From an application's perspective, a socket is a local endpoint of communication that can be bound to an address residing on a local or a remote host. Sockets are accessed via handles , which are unsigned integers that index into a table maintained in the OS. Handles shield applications from the internal representation of OS data structures. In UNIX and Windows NT, socket handles share the same name space as other handles (such as files, named pipes, and terminal devices). The standard socket interface is defined by the C functions shown in Figure . It contains several dozen routines that perform tasks such as locating address information for network services, establishing and terminating connections, and sending and receiving data . Although the socket interface is widely available and widely used, its design has several notable limitations discussed below. These limitations are shared by other lower-level network programming interfaces such as TLI, STREAM pipes, and named pipes. High Potential for Error In UNIX any integral value can be passed as a handle to a socket routine. Therefore, compilers are unable to detect or prevent the erroneous use of handles. This weak type-checking allows subtle errors to occur at run-time since the socket interface cannot enforce the correct use of routines for different communication roles (such as active vs. passive connection establishment or datagram vs. stream communication). Operations (such as invoking a data transfer operation on a handle designated for establishing connections) may therefore be applied improperly on handles. Figure depicts the following subtle (and common) errors that occur when using the socket interface: Forgetting to initialize the length parameter (used by accept ) to the size of struct sockaddr in ; Forgetting to ``zero-out'' all bytes in the socket address structure; Using an address family type that is inconsistent with the protocol family of the socket ( e.g., PF UNIX vs. AF INET ); Neglecting to use the htons library function to convert port numbers from host byte-order to network byte-order and vice versa; Applying the accept function on a SOCK DGRAM socket; Erroneously omitting parentheses in an assignment expression; Trying to read from a passive-mode socket that should only be used to accept connections; Failing to properly detect and handle ``short-writes'' that occur due to buffering in the OS and flow control in the transport protocol. 
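Although the original figure is not reproduced here, the following sketch conveys the kind of code it depicts: a fragment of an echo server that commits several of the errors listed above yet still compiles (the port number and buffer size are arbitrary):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int echo_server (unsigned short port)
{
  struct sockaddr_in s_addr;  /* Error: never zeroed out. */
  socklen_t length;           /* Error: not initialized to sizeof s_addr. */
  char buf[4096];
  int s_fd, n_fd;

  s_fd = socket (AF_INET, SOCK_STREAM, 0);

  s_addr.sin_family = AF_INET;
  s_addr.sin_port = port;     /* Error: missing htons () conversion. */

  bind (s_fd, (struct sockaddr *) &s_addr, sizeof s_addr);
  listen (s_fd, 5);

  /* Error: missing parentheses, so n_fd is assigned the result of the
     comparison (0 or 1) rather than the connected socket handle. */
  if (n_fd = accept (s_fd, (struct sockaddr *) &s_addr, &length) == -1)
    return -1;

  /* Error: reads from the passive-mode handle s_fd rather than the
     connected handle, and ignores short-reads and short-writes. */
  read (s_fd, buf, sizeof buf);
  write (n_fd, buf, sizeof buf);
  return 0;
}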
Other common misuses of sockets not shown in this example are forgetting to call listen when creating a passive-mode SOCK STREAM listener socket and miscalculating the length of the pathname in a UNIX-domain socket address (the trailing NUL should not be counted). Several of the problems listed above are classic problems with programming in C. For instance, by omitting the parentheses in the expression if (n_fd = accept (s_fd, (struct sockaddr *) &s_addr, &length) == -1) the value of n fd will always be set to either 0 or 1, depending on whether accept() == -1 . This problem is exacerbated by the fact that accept returns the handle of the newly connected socket. If this handle were passed back as an out parameter there would be less incentive to use accept in an assignment expression. A deeper problem is that C's lack of support for data abstraction and object-oriented programming makes it hard to define typesafe, extensible, and reusable component interfaces. For example, the generic sockaddr socket address structure provides a crude form of inheritance to express the commonality between Internet domain and UNIX domain address structures ( sockaddr in and sockaddr un , respectively). These ``subclass'' address structures require the use of a non-typesafe cast to overlay the sockaddr ``base class.'' In an object-oriented language this commonality would be expressed more cleanly and robustly using inheritance and dynamic binding. In general, the use of unsafe typecasts, combined with the weakly-typed handle-based socket interface, makes it impossible for a compiler to detect mistakes at compile-time. Instead, error checking is deferred until run-time, which complicates error handling and reduces application robustness. Most of the error checking has been omitted in these examples to save space. Naturally, robust programs should check the return values of library and system calls. Complex Interface Sockets support multiple protocol families (such as TCP/IP, IPX/SPX, ISO OSI, and UNIX domain sockets) with a single interface. The socket interface contains many functions to support different communication roles (such as active vs. passive connection establishment), communication optimizations (such as writev / readv that send/receive multiple buffers in a single system call), and protocol options (such as broadcasting, multicasting, asynchronous I/O, and urgent data delivery). Although sockets combine this functionality into a common interface, the result is complex and hard to master. Much of this complexity stems from the overly broad and one-dimensional design of the socket interface. That is, all the routines appear at a single level of abstraction (as shown in Figure ). This design increases the amount of effort required to learn and use sockets correctly. In particular, programmers must understand most of the interface to use any part of it effectively. If the socket routines are examined carefully, however, it is clear that the interface decomposes naturally into the following communication dimensions: Type of communication service -- i.e., stream vs. datagram vs. connected datagram; Communication role -- i.e., active vs. passive (clients are typically active, whereas servers are typically passive); Communication domain -- i.e., local IPC only vs. local/remote IPC. Figure classifies the socket routines according to these dimensions. Since the socket interface is one-dimensional, however, this natural clustering of functionality is obscured. 
Another problem with the socket interface is that its several dozen routines lack a uniform naming convention. Non-uniform naming makes it hard to determine the scope of the socket interface. For example, it is not immediately obvious that socket , bind , accept , and connect routines are related. Other network programming interfaces solve this problem by prepending a common prefix before each routine. For example, a t is prepended before each routine in the TLI library. * The SOCK SAP Class Category SOCK SAP is designed to overcome the limitations with sockets described above. It improves the correctness, ease of learning and ease of use, reusability, and portability of communication software without sacrificing performance. This section outlines the software architecture of SOCK SAP and explains the classes used by the programming examples in Section . Readers who are not interested in this level of detail may want to skip to Section , which discusses the general principles underlying the design of the SOCK SAP wrappers. SOCK SAP consists of around one dozen C++ classes that are related by multiple inheritance and composition. These components and their relationships are illustrated via Booch notation in Figure . Dashed clouds indicate classes and directed edges indicate inheritance relationships between these classes ( e.g., SOCK Stream inherits from SOCK ). The general structure of SOCK SAP corresponds to the taxonomy of communication services , communication roles , and communication domains shown in Figure . It is instructive to compare Figure with Figure . The latter is more concise since it uses C++ wrappers to encapsulate the behavior of multiple socket mechanisms within classes related by inheritance. Each class in SOCK SAP provides an abstract interface for a subset of mechanisms that together comprise the overall class category. The functionality of various types of Internet-domain and UNIX-domain sockets is achieved by inheriting mechanisms from the appropriate classes described below. These classes are presented below according to the groupings shown in Figure . Base Classes: The SOCK and LSOCK classes anchor the inheritance hierarchy and enable subsequent derivation and code sharing. Objects of these classes cannot be instantiated since their constructors are declared in the protected section of the class definition. SOCK: this class is the root of the SOCK SAP hierarchy. It provides mechanisms common to all other classes, such as opening and closing local endpoints of communication and handling options (such as selecting socket queue sizes and enabling group communication). LSOCK: this class provides mechanisms that allow applications to send and receive open file handles between unrelated processes on the local host machine (hence the prefix 'L'). Note that System V and BSD UNIX both support this feature, though Windows NT does not. Other classes inherit from LSOCK to obtain this functionality. SOCK SAP distinguishes the LSOCK* and SOCK* classes on the basis of network address formats and communication semantics. In particular, the LSOCK* classes use UNIX pathnames as addresses and only allow intra-machine IPC. The SOCK* classes, on the other hand, use Internet Protocol (IP) addresses and port numbers and allow both intra- and inter-machine IPC. Connection Establishment: Communication software is typified by asymmetric connection behavior between clients and servers. In general, servers listen passively for clients to initiate connections actively . 
The structure of passive/active connection establishment and data transfer relationships is captured by the following connection-oriented SOCK SAP classes:

SOCK Acceptor and LSOCK Acceptor: The *Acceptor classes are factories that passively establish new endpoints of communication in response to active connection requests. The SOCK Acceptor and LSOCK Acceptor factories produce SOCK Stream and LSOCK Stream connection endpoint objects, respectively.

SOCK Connector and LSOCK Connector: The *Connector classes are factories that actively establish new endpoints of communication. These classes establish connections with remote endpoints and produce the appropriate *Stream object when a connection is established. A connection may be initiated either synchronously or asynchronously. The SOCK Connector and LSOCK Connector factories produce SOCK Stream and LSOCK Stream connection endpoint objects, respectively.

Note that the *Acceptor and *Connector classes do not provide methods for sending or receiving data. Instead, they are factories that produce the *Stream data transfer objects described below. The use of strongly-typed interfaces detects accidental misuse of local and non-local *Stream objects at compile-time. In contrast, the socket interface can only detect these type mismatches at run-time.

Stream Communication: Although establishing connections requires a distinction between active and passive roles, once a connection is established data may be exchanged in any order according to the protocol used by the endpoints. SOCK SAP isolates the data transfer behavior in the following classes:

SOCK Stream and LSOCK Stream: These two classes are produced by the *Acceptor or *Connector factories described above. The *Stream classes provide mechanisms for transferring data between two processes. LSOCK Stream objects exchange data between processes on the same host machine; SOCK Stream objects exchange data between processes that may reside on different host machines. The overloaded send and recv *Stream methods provide standard UNIX write and read semantics. Thus, a send may write less (and a recv may read less) than the requested number of bytes. These ``short-writes'' and ``short-reads'' occur due to buffering in the OS and flow control in the transport protocol. To reduce programming effort, the *Stream classes provide send_n and recv_n methods that transmit and receive exactly n bytes. ``Scatter-read'' and ``gather-write'' methods are also provided to efficiently send and receive multiple buffers of data simultaneously.

Datagram Communication:

SOCK CODgram and LSOCK CODgram: These classes provide a ``connected-datagram'' mechanism, which allows the send and recv operations to omit the address of the service when exchanging datagrams. Note that the connected-datagram mechanism is only a syntactic convenience since there are no additional semantics associated with the data transfer (i.e., datagram delivery remains unreliable). SOCK CODgram inherits mechanisms from the SOCK base class. LSOCK CODgram inherits mechanisms from both SOCK CODgram and LSOCK (which provides the ability to pass file handles).

SOCK Dgram and LSOCK Dgram: These classes provide mechanisms for exchanging datagrams between processes running on local and/or remote hosts. Unlike the connected-datagram classes described above, each send and recv operation must provide the address of the service with every datagram sent or received. LSOCK Dgram inherits all the operations of both SOCK Dgram and LSOCK, but only exchanges datagrams between processes on the same host. The SOCK Dgram class, on the other hand, may exchange datagrams between processes on local and/or remote hosts.
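The distinction between the *Dgram and *CODgram wrappers mirrors the two styles of datagram I/O in the underlying socket interface, sketched below with the BSD calls the classes encapsulate (error handling omitted):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stddef.h>
#include <unistd.h>

void datagram_styles (const sockaddr_in &peer, const char *buf, size_t len)
{
  // Unconnected datagram socket: the peer address accompanies every
  // transmission (the style wrapped by SOCK Dgram).
  int dg = socket (AF_INET, SOCK_DGRAM, 0);
  sendto (dg, buf, len, 0, (const sockaddr *) &peer, sizeof peer);

  // ``Connected'' datagram socket: the peer is fixed up front, so
  // subsequent sends omit the address (the style wrapped by SOCK CODgram).
  // Delivery remains unreliable -- connect () merely caches the address.
  int codg = socket (AF_INET, SOCK_DGRAM, 0);
  connect (codg, (const sockaddr *) &peer, sizeof peer);
  send (codg, buf, len, 0);

  close (dg);
  close (codg);
}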
Group Communication:

SOCK Dgram Bcast: This class provides mechanisms for broadcasting UDP datagrams to processes running on local and/or remote hosts attached to local subnets. The interface for this class supports the broadcast of datagrams to (1) all network interfaces connected to the host machine or (2) a particular network interface. This class shields the end-user from the low-level details required to utilize broadcasting effectively.

SOCK Dgram Mcast: This class provides mechanisms for multicasting UDP datagrams to processes running on local and/or remote hosts attached to local subnets. The interface for this class supports the multicast of datagrams to a particular multicast group. This class shields the end-user from the low-level details required to utilize multicasting effectively.

Network Addressing

Designing an efficient, general-purpose network addressing interface is hard. The difficulty stems from trying to represent different network address formats with a space-efficient and uniform interface. Different address formats store diverse types of information represented with various sizes. For example, an Internet-domain service (such as ftp or telnet) is identified using two fields: (1) a four-byte IP address (which uniquely identifies the remote host machine throughout the Internet) and (2) a two-byte port number (which is used to demultiplex incoming protocol data units to the appropriate client or server process on the remote host machine). In contrast, UNIX-domain sockets rendezvous via UNIX pathnames (which may be up to 108 bytes in length and are meaningful only on a single local host machine).

The existing sockaddr-based network addressing structures provided by the socket interface are cumbersome and error-prone. They require developers to explicitly initialize all the bytes in the address structure to 0 and to use explicit casts. In contrast, the SOCK SAP addressing classes shown in Figure contain mechanisms for manipulating network addresses. The constructors for the Addr base class ensure that all fields are automatically initialized correctly. Moreover, the different sizes, formats, and functionality that exist between different address families are encapsulated in the derived address subclasses. This makes it easier to extend the network addressing scheme to encompass new communication domains. For example, the UNIX Addr subclass is associated with the LSOCK* classes, the INET Addr subclass is associated with the SOCK* and TLI* classes, and the SPIPE Addr subclass is associated with the STREAM Pipe classes.

Programming with SOCK SAP C++ Wrappers

This section illustrates the ACE SOCK SAP wrappers by using them to develop a client/server streaming application. This application is a simplified version of the ttcp program described in Section . For comparison, this application is also written with sockets and CORBA.

Figures and present a client/server program that uses Internet-domain sockets to implement the stream application. The server shown in Figure creates a passive-mode listener socket and waits for clients to connect to it. Once connected, the server receives the data transmitted from the client and displays the data on its standard output stream. The client-side shown in Figure establishes a TCP connection with the server and transmits its standard input stream across the connection. The client uses non-blocking connections to limit the amount of time it waits for a connection to be accepted or refused. Most of the error checking for return values has been omitted to save space. However, it is instructive to note all the socket initialization, network addressing, and flow control details that must be programmed explicitly to make even this simple example work correctly. Moreover, the code in Figures and is not portable to platforms that do not support sockets or select.
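Since those figures are not reproduced here, the following minimal sketch conveys the flavor of the socket-based server they describe; it is a reconstruction rather than the original figure, the port number is arbitrary, and most error handling is omitted:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

int main (int, char *[])
{
  sockaddr_in s_addr;
  int s_fd = socket (AF_INET, SOCK_STREAM, 0);

  /* Zero out the address, fill it in, and convert the port to network
     byte order -- details that are easy to get wrong. */
  memset (&s_addr, 0, sizeof s_addr);
  s_addr.sin_family = AF_INET;
  s_addr.sin_port = htons (5001);
  s_addr.sin_addr.s_addr = htonl (INADDR_ANY);

  bind (s_fd, (sockaddr *) &s_addr, sizeof s_addr);
  listen (s_fd, 5);

  for (;;)
    {
      socklen_t length = sizeof s_addr;
      int n_fd = accept (s_fd, (sockaddr *) &s_addr, &length);

      char buf[8 * 1024];
      ssize_t n;

      /* Copy everything the client sends to standard output. */
      while ((n = read (n_fd, buf, sizeof buf)) > 0)
        write (1, buf, n);

      close (n_fd);
    }
}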
Figures and use SOCK SAP to reimplement the C versions of the client/server programs. The SOCK SAP programs implement the same functionality as those presented in Figure and Figure . The SOCK SAP C++ programs exhibit the following benefits compared with the socket-based C implementation:

Decreased program size -- e.g., a substantial reduction in the lines of code results from localizing active and passive connection establishment in the SOCK Acceptor and SOCK Connector connection factories. In addition, default values are provided for constructor and method parameters, which reduces the number of arguments needed for common usage patterns.

Increased clarity -- e.g., network addressing and host location are handled by the Addr class, which hides the subtle and error-prone details that must be programmed explicitly in Figures and . Moreover, the low-level details of non-blocking connection establishment are performed by the SOCK Connector.

Increased typesafety -- e.g., the SOCK Acceptor and SOCK Connector connection factory objects must be passed arguments of the SOCK Stream type. This prevents the type errors shown in Figure from occurring at run-time.

Increased portability -- e.g., switching between sockets and TLI simply requires instantiating the client's send_data function and the server's recv_data function with the TLI SAP wrapper classes in place of the SOCK SAP classes. Conditional compilation directives can be used to further decouple the communication software from reliance upon a particular type of network programming interface.

However, the ACE wrappers share some of the same drawbacks as sockets. In particular, too much of the code required to program at this level is not directly related to the application.
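A sketch of what the SOCK SAP version of the server looks like appears below. It is reconstructed from the class descriptions in the previous section; the header names and exact method signatures are assumptions rather than the verbatim ACE 3.2 interfaces:

// Reconstruction of the SOCK SAP server; header names and exact
// signatures are assumed from the class descriptions, not copied from ACE.
#include "SOCK_Acceptor.h"
#include "SOCK_Stream.h"
#include "INET_Addr.h"
#include <unistd.h>

int main (int, char *[])
{
  // The SOCK_Acceptor constructor performs socket(), bind(), and listen().
  INET_Addr addr (5001);
  SOCK_Acceptor acceptor (addr);

  for (;;)
    {
      SOCK_Stream stream;

      // Only a SOCK_Stream may be accepted here, so mixing up
      // communication roles is caught at compile-time.
      acceptor.accept (stream);

      char buf[8 * 1024];
      ssize_t n;

      while ((n = stream.recv (buf, sizeof buf)) > 0)
        write (1, buf, n);

      stream.close ();
    }
}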
In contrast, Figures and illustrate the CORBA version of the stream application implemented using Orbix 1.3. This implementation is considerably more concise than both the C and ACE wrapper versions. CORBA handles the low-level communication details associated with service location, passive and active connection establishment, message framing, marshalling and demarshalling, demultiplexing, and upcall dispatching. This allows developers to concentrate on defining application-specific behavior, rather than wrestling with the details of network programming. The persistent server shown in Figure creates an implementation of a Data Stream IDL interface and informs the ORB that it is ready to receive send requests. It uses a standard ACE class CORBA Handler to register the server and object name with the Orbix daemon automatically. The client shown in Figure uses the Orbix locator service to bind to the marker exported by the Data Stream server. Once bound, the client transmits all data from its standard input to the server via the Data Stream::send proxy. This example behaves slightly differently than the C and ACE wrapper versions since CORBA does not provide a standard means to obtain the host and port of the sender. Moreover, CORBA communication semantics are request-oriented rather than connection-oriented. Thus, other clients could conceivably bind to the same marker name and transmit data via its send method.

Socket Wrapper Design Principles

This section describes the design principles applied throughout the SOCK SAP class category. Although these principles are widely used in domains such as graphical user interfaces, they are less widely applied in the communication software domain.

Only permit typesafe operations: Several limitations with sockets discussed in Section stem from the lack of typesafety in its interface. To enforce typesafety, SOCK SAP ensures all of its objects are properly initialized via constructors. In addition, to prevent accidental violations of typesafety, only legal operations are permitted on SOCK SAP objects. This latter point is illustrated in the SOCK SAP revision of the echo server shown in Figure . This version fixes the problems with sockets and C identified in Figure . Since SOCK SAP classes are strongly typed, invalid operations are rejected at compile-time rather than at run-time. For example, it is not possible to invoke recv or send on a SOCK Acceptor connection factory since these methods are not part of its interface. Likewise, return values are used to convey success or failure of operations, rather than returning more detailed information. This reduces the potential for misuse in assignment expressions.

Simplify for the common case: The key to this principle is ``make it easy to use SOCK SAP correctly, hard to use it incorrectly, but not impossible to use it in ways the class designers did not anticipate originally.'' This principle is exemplified by the get_handle and set_handle methods provided by the IPC SAP root class. These methods extract and assign the underlying handle, respectively. By providing get_handle and set_handle, IPC SAP allows applications to circumvent its type-checking mechanisms in unforeseen situations where applications must interface directly with UNIX system calls (such as select) that expect a handle.

Define parsimonious interfaces: This principle localizes the cost of using a particular abstraction. The IPC SAP interfaces limit the number of details that application developers must remember. IPC SAP provides developers with distinct clusters of classes that perform various types of communication (such as connection-oriented vs. connectionless) and various roles (such as active vs. passive). For example, to reduce the chance of error, the SOCK Acceptor class only permits operations that apply to programs playing a passive role and the SOCK Connector class only permits operations that apply to programs playing an active role. In addition, sending and receiving open file handles has a much simpler calling interface using SOCK SAP compared with using the highly-general UNIX sendmsg/recvmsg routines.

Replace one-dimensional interfaces with hierarchically-related class categories: This principle involves using hierarchically-related class categories to restructure existing one-dimensional socket interfaces. The criteria used to structure the SOCK SAP class category involved identifying and clustering related socket routines to maximize the reuse and sharing of class components. Inheritance makes it possible to support different subsets of functionality in the SOCK SAP class categories. For instance, not all operating systems support passing open file handles (e.g., Windows NT).
Thus, it is possible to omit the LSOCK class (described in Section ) from the inheritance hierarchy without affecting the interfaces of other classes in the SOCK SAP design. Inheritance also increases code reuse and improves modularity. Base classes express similarities between class category components and derived classes express the differences. For example, the SOCK SAP design places shared mechanisms towards the ``root'' of the inheritance hierarchy. These mechanisms include operations for opening/closing and setting/retrieving the underlying socket handles, as well as certain option management functions that are common to all the derived SOCK SAP classes. Subclasses located towards the ``bottom'' of the inheritance hierarchy implement specialized operations that are customized for the type of communication provided (such as stream vs. datagram communication or local vs. remote communication). This approach avoids unnecessary duplication of code since the more specialized derived classes reuse the more general mechanisms provided at the root of the inheritance hierarchy.

Enhance portability with parameterized types: Wrapping sockets with C++ classes (rather than stand-alone C functions) helps to improve portability by allowing the wholesale replacement of network programming mechanisms via parameterized types. Parameterized types decouple applications from reliance on specific network programming interfaces. Figure illustrates this technique by modifying the echo server to become a C++ function template. Depending on certain properties of the underlying OS platform (such as whether it implements TLI or sockets more efficiently), the echo server may be instantiated with either SOCK SAP or TLI SAP classes, as shown in Figure . In general, the use of parameterized types is less intrusive and more extensible than conventional alternatives (such as implementing multiple versions or littering conditional compilation directives throughout the source code). For example, the SOCK SAP and TLI SAP classes offer the same object-oriented interface (depicted in Figure ). Certain OS platforms may possess different underlying network programming interfaces such as sockets but not TLI or vice versa. Using IPC SAP, applications can be written that are transparently parameterized with either the SOCK SAP or TLI SAP class category.

C++ templates support a loose form of type conformance that does not constrain an interface to encompass all potential functionality. Instead, templates are used to parameterize application code that is carefully designed to invoke only a subset of methods that are common to the various communication abstractions (e.g., open, close, send, recv, etc.). The type abstraction provided by templates helps improve portability among platforms that support different network programming interfaces (such as sockets or TLI). For example, parameterizing the transport interface turned out to be useful for developing applications across various SunOS platforms: the socket implementation in SunOS 5.2 was not thread-safe and the TLI implementation in SunOS 4.x contained a number of serious defects.
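Since the figure showing the templatized echo server is not reproduced here, the following sketch illustrates the technique. The method names follow the wrapper descriptions above; the exact ACE signatures are assumptions:

#include <sys/types.h>

// Parameterized echo server: the acceptor and stream types are supplied
// at instantiation time, so the same code works over sockets or TLI.
template <class ACCEPTOR, class STREAM, class ADDR>
int echo_server (const ADDR &addr)
{
  // Factory that creates a passive-mode endpoint (socket/bind/listen for
  // the SOCK SAP instantiation, the TLI equivalents for TLI SAP).
  ACCEPTOR acceptor (addr);

  for (;;)
    {
      STREAM stream;
      char buf[4096];
      ssize_t n;

      if (acceptor.accept (stream) == -1)
        return -1;

      // Echo data back using only methods common to both instantiations.
      while ((n = stream.recv (buf, sizeof buf)) > 0)
        stream.send_n (buf, n);

      stream.close ();
    }
}

// Possible instantiations (class names as used in the text):
//   echo_server<SOCK_Acceptor, SOCK_Stream, INET_Addr> (addr);
//   echo_server<TLI_Acceptor, TLI_Stream, INET_Addr> (addr);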
Inline performance critical methods: To encourage developers to replace existing low-level network programming interfaces with C++ wrappers, the SOCK SAP implementation must operate efficiently. To ensure this, methods in the critical performance path (such as the SOCK Stream recv and send methods) are specified as C++ inline functions to eliminate run-time overhead. Inlining is both time and space efficient since these methods are very short (approximately 2 or 3 lines per method). The use of inlining implies that virtual functions should be used sparingly since most contemporary C++ compilers do not fully optimize away virtual function overhead.

Design auxiliary classes that shield applications from error-prone details: e.g., SOCK SAP contains the Addr class hierarchy (shown in Figure ). This hierarchy supports several diverse network addressing formats via a typesafe C++ interface. The Addr hierarchy eliminates several common programming errors (such as forgetting to zero-out a sockaddr addressing structure) associated with using the C-based family of struct sockaddr data structures directly.

Combine several operations to form a single operation: e.g., the SOCK Acceptor is a factory for passive connection establishment. Its constructor performs the socket calls socket, bind, and listen required to create a passive-mode listener endpoint.

Supply default parameters for typical method argument values: e.g., the addressing parameters to accept are frequently NULL pointers. To simplify programming, these values are given as defaults in SOCK Acceptor::accept so that programmers need not provide them.

Concluding Remarks

An important class of applications requires high-performance streaming communication. Bandwidth-intensive and delay-sensitive streaming applications like medical imaging or teleconferencing are not supported efficiently by contemporary CORBA implementations due to data copying, demultiplexing, and memory management overhead. As shown in Section , this overhead is often masked on low-speed networks like Ethernet and Token Ring. On high-speed networks like ATM or FDDI, however, this overhead becomes a significant factor limiting communication performance.

The ACE socket wrappers described in this paper provide a high-performance network programming interface that shields developers from lower-level details of sockets or TLI without sacrificing performance. The ACE wrappers automate and simplify many aspects (such as initialization, addressing, and handling short-writes) of using lower-level network programming interfaces. They improve portability by shielding applications from platform-specific network programming interfaces. Wrapping sockets with C++ classes (rather than stand-alone C functions) makes it convenient to switch wholesale between different network programming interfaces by using parameterized types. In addition, as shown in Figure , the ACE socket wrappers do not introduce any significant overhead compared with programming with sockets directly.

The primary drawback with the ACE network programming wrappers is that they do not address higher-level issues related to system reliability and availability, flexibility of object location and selection, support for transactions, security, deferred process activation, and the exchange of binary data between different computer architectures. For example, programmers must provide explicit support for presentation layer conversions in conjunction with the ACE wrappers. Therefore, these wrappers are most useful when the datatypes are simple, like those used by the high-performance streaming applications described in this paper.

The ACE C++ wrappers for sockets may be integrated with CORBA to enhance the performance of streaming applications. We've combined CORBA and the ACE wrappers in a high-speed teleradiology system that transfers 10-40 Mbyte medical images over ATM.
In this system, CORBA is used as a signaling mechanism to identify endpoints of communication in a location-independent manner. The ACE wrappers are then used to establish point-to-point TCP connections and transmit bulk data efficiently across the connections. This strategy builds on the strengths of both CORBA and ACE. ACE has been ported to many versions of UNIX and Windows NT and is currently being used in many commercial products including the Bellcore and Siemens Q.port ATM signaling software product, the Ericsson EOS family of telecommunication monitoring applications, the System Control Segment of the Motorola Iridium project, and a high-speed enterprise-wide medical image delivery system for Kodak Health Imaging Systems.