USENIX Technical Program - Paper - 5th USENIX Conference on Object Oriented Technologies 99
Applying Optimization Principle Patterns to Real-time ORBs
Irfan Pyarali, Carlos O'Ryan, Douglas Schmidt, Nanbor Wang, and Vishal Kachroo
Washington University, Campus Box 1045, St. Louis, MO 63130

Aniruddha Gokhale
Bell Labs, Lucent Technologies, 600 Mountain Ave Rm 2A-442, Murray Hill, NJ 07974
Our findings indicate that ORBs must be highly configurable and adaptable to meet the QoS requirements of a wide range of real-time applications. In addition, we show how TAO can be configured to perform predictably and scalably, which is essential to support real-time applications. A key result of our work is to demonstrate that the ability of CORBA ORBs to support real-time systems is mostly an implementation detail. Thus, relatively few changes are required to the standard CORBA reference model and programming API to support real-time applications.

Many companies and research groups are developing distributed applications using middleware components like CORBA Object Request Brokers (ORBs). CORBA helps to improve the flexibility, extensibility, maintainability, and reusability of distributed applications. However, a growing class of distributed real-time applications also requires ORB middleware that provides stringent quality of service (QoS) support, such as end-to-end priority preservation, hard upper bounds on latency and jitter, and bandwidth guarantees. Figure 1 depicts the layers and components of an ORB endsystem that must be carefully designed and systematically optimized to support end-to-end application QoS requirements.
First-generation ORBs lacked many of the features and optimizations [4,5,6,7] shown in Figure 1. This situation was not surprising, of course, since the focus at that time was largely on developing the core infrastructure components, such as the ORB and its basic services, defined by the OMG specifications. In contrast, second-generation ORBs, such as The ACE ORB (TAO), explicitly focus on providing end-to-end QoS guarantees to applications by vertically (i.e., from the network interface to the application layer) and horizontally (i.e., end-to-end) integrating highly optimized CORBA middleware with OS I/O subsystems, communication protocols, and network interfaces.
Our previous research has examined many dimensions of high-performance and real-time ORB endsystem design, including static and dynamic scheduling, event processing, I/O subsystem integration, ORB Core connection and concurrency architectures, systematic benchmarking of multiple ORBs, and design patterns for ORB extensibility. This paper focuses on four more dimensions in the high-performance and real-time ORB endsystem design space: Object Adapter and ORB Core optimizations for (1) request demultiplexing, (2) collocation, (3) memory management, and (4) ORB protocol overhead.
The optimizations used in TAO are guided by a set of principle patterns that have been applied to optimize middleware and lower-level networking software, such as TCP/IP. Optimization principle patterns document rules for avoiding common design and implementation problems that degrade the performance, scalability, and predictability of complex systems. The optimization principle patterns we applied to TAO include: optimizing for the common case; eliminating gratuitous waste; shifting computation in time, e.g., by precomputing; avoiding unnecessary generality; passing hints between layers; not being tied to reference implementations; using specialized routines; leveraging system components by exploiting locality; adding state; and using efficient data structures. Below, we outline how these optimization principle patterns address the following TAO Object Adapter and ORB Core design and implementation challenges.

Section 2 describes how Object Adapter demultiplexing strategies impact the scalability and predictability of real-time ORBs. This section also illustrates how TAO's Object Adapter optimizations enable constant-time request demultiplexing in the average- and worst-case, regardless of the number of objects or operations configured into an ORB. The principle patterns that guide our request demultiplexing optimizations include precomputing, using specialized routines, passing hints in protocol headers, and not being tied to reference implementations.

Section 3.1 describes how TAO's collocation optimizations are completely transparent to clients, i.e., collocated objects can be used as regular CORBA objects, with TAO handling all aspects of collocation.

Section 3.2 describes the mechanisms used in TAO to allocate and manipulate the internal buffers it uses for parameter (de)marshaling. We illustrate how TAO minimizes fragmentation, data copying, and locking for most application use-cases. The principle patterns of exploiting locality and optimizing for the common case influence these optimizations.

Section 3.3 shows how TAO can be configured to reduce the overhead of GIOP/IIOP without affecting the standard CORBA programming APIs exposed to application developers. This optimization is based on the principle pattern of avoiding unnecessary generality.
The remainder of this paper is organized as follows: Section 2 outlines the Portable Object Adapter (POA) architecture of CORBA ORBs and evaluates the design and performance of POA optimizations used in TAO; Section 3 outlines the ORB Core architecture of CORBA ORBs and evaluates the design and performance of ORB Core optimizations used in TAO; Section 4 describes related work; and Section 5 provides concluding remarks.
The OMG CORBA 2.2 specification standardizes several components on the server side of CORBA-compliant ORBs. These components include the Portable Object Adapter (POA), standard interfaces for object implementations (i.e., servants), and refined definitions of skeleton classes for various programming languages, such as Java and C++.
These standard POA features allow application developers to write more flexible and portable CORBA servers. They also make it possible to conserve resources by activating objects on demand and to generate ``persistent'' object references that remain valid after the originating server process terminates. Server applications can configure these new features portably using policies associated with each POA.
CORBA 2.2 allows server developers to create multiple Object Adapters, each with its own set of policies. Although this is a powerful and flexible programming model, it can incur significant run-time overhead because it complicates the request demultiplexing path within a server ORB. This is particularly problematic for real-time applications, since naive Object Adapter implementations can increase priority inversion and non-determinism.
Optimizing a POA to support real-time applications requires resolving several design challenges. This section outlines these challenges and describes the optimization principle patterns we applied to maximize the predictability, performance, and scalability of TAO's POA. These POA optimizations include constant-time demultiplexing strategies, reducing run-time object key processing overhead during upcalls, and generally improving POA predictability and reducing memory footprint by selectively omitting non-deterministic POA features.

Scalable and predictable POA demultiplexing is important for many applications, such as real-time stock quote systems that service a large number of clients and avionics mission systems that have stringent hard real-time timing constraints. Below, we outline the steps involved in demultiplexing a client request through the server side of a CORBA ORB and then qualitatively and quantitatively evaluate alternative demultiplexing strategies. A standard GIOP-compliant client request contains the identity of its object and operation. An object is identified by an object key, which is an octet sequence. An operation is represented as a string. As shown in Figure 2, demultiplexing such a request to the designated servant operation involves several layers.
The conventional deeply-layered ORB endsystem demultiplexing implementation shown in Figure 2 is generally inappropriate for high-performance and real-time applications for the following reasons:
Conventional implementations of CORBA incur significant demultiplexing overhead. For instance, [4,6] show that conventional ORBs spend 17% of the total server time on request demultiplexing. Unless this overhead is reduced and demultiplexing is performed predictably, ORBs cannot provide uniform, scalable QoS guarantees to real-time applications.
The remainder of this section focuses on demultiplexing optimizations performed at the ORB layer, i.e., steps 3 through 6. Information on OS kernel layer demultiplexing optimizations for real-time ORB endsystems is available in [22,12].

As illustrated in Figure 2, demultiplexing a request to a servant and dispatching the designated servant operation involves several steps. Below, we qualitatively outline the most common demultiplexing strategies used in CORBA ORBs. Section 2.2.3 then quantitatively evaluates the strategies that are appropriate for each layer in the ORB.

Linear search: Some conventional ORBs [4] use linear search for operation demultiplexing. Linear search is simple to implement, but it does not scale as the number of operations grows.

Dynamic hashing: Other ORBs [6] use dynamic hashing, which provides O(1) average-case lookup but degrades to O(n) in the worst case.

Perfect hashing [23]: Perfect hashing is based on the principle pattern of precomputing and using specialized routines. A demultiplexing strategy based on perfect hashing executes in constant time and space. This property makes perfect hashing well-suited for deterministic real-time systems that can be configured statically, i.e., where the number of objects and operations can be determined off-line.

Active demultiplexing: TAO's active demultiplexing [6] strategy provides a low-overhead, O(1) lookup technique that can be used throughout an Object Adapter.
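The active demultiplexing idea can be sketched as follows. This is a minimal illustration rather than TAO's actual code: the POA embeds the servant's slot index directly in the object key it hands out, so demultiplexing an incoming request is a bounds-checked array access. The generation counter used to reject stale keys is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch of active demultiplexing (not TAO's actual code).
struct Servant { std::string name; };

struct ObjectKey {
  std::uint32_t index;       // direct slot in the active object map
  std::uint32_t generation;  // guards against reuse of a freed slot
};

class ActiveObjectMap {
  struct Slot { Servant* servant = nullptr; std::uint32_t generation = 0; };
  std::vector<Slot> slots_;
public:
  ObjectKey activate(Servant* s) {
    // Reuse a free slot if one exists, otherwise grow the map.
    for (std::uint32_t i = 0; i < slots_.size(); ++i)
      if (!slots_[i].servant) {
        slots_[i].servant = s;
        return {i, slots_[i].generation};
      }
    slots_.push_back(Slot{s, 0});
    return {static_cast<std::uint32_t>(slots_.size() - 1), 0};
  }
  void deactivate(const ObjectKey& k) {
    slots_[k.index].servant = nullptr;
    ++slots_[k.index].generation;  // invalidate outstanding keys
  }
  // O(1) worst-case lookup: no hashing, no search.
  Servant* find(const ObjectKey& k) const {
    if (k.index >= slots_.size()) return nullptr;
    const Slot& slot = slots_[k.index];
    return (slot.generation == k.generation) ? slot.servant : nullptr;
  }
};
```

Because the index travels inside the object key (a hint passed between layers), the lookup cost is independent of the number of active objects.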
Table 1 summarizes the demultiplexing strategies considered in the implementation of TAO's POA:

    Linear search: simple to implement, but does not scale.
    Dynamic hashing: O(1) average case, O(n) worst case.
    Perfect hashing: O(1) worst case, for static configurations.
    Active demultiplexing: O(1) worst case; for system-generated
      keys, direct indexing information is added to the keys.
We conducted an experiment to measure the effect of increasing the POA
nesting level on the time required to lookup the appropriate POA in
which the servant is registered. We used a range of POA depths, 1
through 25. The results are shown in Figure 3.
Since most ORB server applications do not have deeply nested POA hierarchies, TAO currently uses a POA demultiplexing strategy where each POA finds its child using dynamic hashing and delegates to the child POA where this process is repeated until the search is complete. This POA demultiplexing strategy results in O(n) growth for the lookup time and does not scale up to deeply nested POAs. Therefore, we are adding active demultiplexing to the POA lookup phase, which operates as follows:
Using active demultiplexing for POA lookup should provide optimal predictability and scalability, just as it does when used for servant demultiplexing, which is described next.

Once the ORB Core demultiplexes a client request to the right POA, this POA demultiplexes the request to the correct servant. The following discussion compares the various servant demultiplexing techniques described in Section 2.2.2. TAO uses the Service Configurator, Bridge, and Strategy design patterns to defer the configuration of the desired servant demultiplexing strategy until ORB initialization, which can be performed either statically (i.e., at compile time) or dynamically (i.e., at run time). Figure 4 illustrates the class hierarchy of strategies that can be configured into TAO's POAs.
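The Strategy-based configuration described above can be sketched as follows. The class names are hypothetical stand-ins for TAO's strategy hierarchy, and only one concrete strategy is shown:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct Servant { std::string name; };

// Abstract strategy: the POA sees only this interface, so the concrete
// demultiplexing policy can be deferred until ORB initialization.
class ServantDemuxStrategy {
public:
  virtual ~ServantDemuxStrategy() = default;
  virtual void bind(const std::string& key, Servant* s) = 0;
  virtual Servant* find(const std::string& key) const = 0;
};

// One concrete strategy: map-based lookup. A real ORB would offer
// dynamic hashing, perfect hashing, active demultiplexing, and so on.
class MapDemuxStrategy : public ServantDemuxStrategy {
  std::map<std::string, Servant*> map_;
public:
  void bind(const std::string& key, Servant* s) override { map_[key] = s; }
  Servant* find(const std::string& key) const override {
    auto it = map_.find(key);
    return it == map_.end() ? nullptr : it->second;
  }
};

// The POA is configured with a strategy at creation time, either
// statically or from a run-time configuration mechanism.
class POA {
  std::unique_ptr<ServantDemuxStrategy> demux_;
public:
  explicit POA(std::unique_ptr<ServantDemuxStrategy> d) : demux_(std::move(d)) {}
  void activate(const std::string& key, Servant* s) { demux_->bind(key, s); }
  Servant* dispatch(const std::string& key) const { return demux_->find(key); }
};
```

Swapping in a different strategy requires no change to the POA's dispatching code, which is the point of deferring the choice to initialization time.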
To evaluate the scalability of TAO, our experiments used a range of
servants, 1 to 500 by increments of 100, in the server.
Figure 5 shows the latency for servant
demultiplexing as the number of servants increases.
Note that we did not implement the perfect hashing strategy for servant demultiplexing. Although it is possible to know the set of servants on each POA for certain statically configured applications a priori, creating perfect hash functions repeatedly during application development is tedious. We omitted binary search for similar reasons, i.e., it requires maintaining a sorted active object map every time an object is activated or deactivated. Moreover, since the object key is created by a POA, active demultiplexing provides equivalent, or better, performance than perfect hashing or binary search.

The final step at the Object Adapter layer involves demultiplexing a request to the appropriate skeleton, which demarshals the request and dispatches the designated operation upcall in the servant. To measure operation demultiplexing overhead, our experiments defined a range of operations, 1 through 50, in the IDL interface.
For ORBs like TAO that target real-time embedded systems, operation demultiplexing must be efficient, scalable, and predictable. Therefore, we generate efficient operation lookup using GPERF, which is a freely available perfect hash function generator we developed. GPERF automatically constructs perfect hash functions from a user-supplied list of keywords. In addition to perfect hash functions, GPERF can also generate linear and binary search strategies.
Figure 6 illustrates the interaction between the TAO
IDL compiler and GPERF.
The lookup key for this phase is the operation name, which is a string defined by developers in an IDL file. It is not permissible, however, to modify the operation name string to include active demultiplexing information, so active demultiplexing cannot be used without modifying the GIOP protocol. TAO therefore uses perfect hashing for operation demultiplexing. Perfect hashing is well-suited for this purpose since all operation names are known at compile time.
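A gperf-style perfect hash can be illustrated with a toy operation set. The operation names below are hypothetical, and for this particular key set the string length alone happens to be a perfect hash; GPERF would additionally mix in associated values for selected character positions:

```cpp
#include <cassert>
#include <cstring>

// Illustrative gperf-style perfect hash for a fixed set of IDL
// operation names (hypothetical operations, not a real interface).
enum OpId { OP_GET, OP_PUSH, OP_SHUTDOWN, OP_SUBSCRIBE, OP_NONE };

struct Entry { const char* name; OpId id; };

// Slot = strlen(name) - 3; unused slots hold a null name.
static const Entry op_table[] = {
  {"get", OP_GET},            // len 3 -> slot 0
  {"push", OP_PUSH},          // len 4 -> slot 1
  {nullptr, OP_NONE},         // slot 2
  {nullptr, OP_NONE},         // slot 3
  {nullptr, OP_NONE},         // slot 4
  {"shutdown", OP_SHUTDOWN},  // len 8 -> slot 5
  {"subscribe", OP_SUBSCRIBE} // len 9 -> slot 6
};

// Constant-time lookup: hash, then a single strcmp to confirm,
// exactly the shape of code a perfect hash generator emits.
OpId lookup_operation(const char* name) {
  const std::size_t len = std::strlen(name);
  if (len < 3 || len > 9) return OP_NONE;
  const Entry& e = op_table[len - 3];
  return (e.name && std::strcmp(e.name, name) == 0) ? e.id : OP_NONE;
}
```

Because the table and hash function are computed at compile time, the lookup cost is constant regardless of how many operations the interface defines.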
Figure 7 plots operation demultiplexing
latency as a function of the number of operations.
However, certain POA operations and policies require lookups on the Active Object Map to be based on the servant pointer rather than the Object Id. For instance, the _this method on the servant can be used with the IMPLICIT_ACTIVATION POA policy outside the context of a request invocation. This operation activates the servant implicitly if the servant is not already active. If the servant is already active, it returns the object reference corresponding to the servant.
Unfortunately, naive implementations of a POA's Active Object Map incur worst-case performance for servant-based lookups. Since the primary key is the Object Id, servant-based lookups degenerate into a linear search, even when Active Demultiplexing is used for Object Id-based lookups. As shown in Figure 5, linear search is prohibitively expensive as the number of servants in the Active Object Map increases. This overhead is particularly problematic for real-time applications, such as avionics mission computing systems, that (1) create a large number of objects using _this during their initialization phase and (2) must reinitialize rapidly to recover from transient power failures.
To alleviate servant-based lookup bottlenecks, we apply the principle
pattern of adding extra state to the POA in the form of a
Reverse-Lookup map that associates each servant with its Object Id in
O(1) average-case time. In TAO, this Reverse-Lookup map is used in
conjunction with the Active Demultiplexing map that associates each
Object Id to its servant. Figure 8 shows the time
required to find a servant, with and without the Reverse-Lookup map,
as the number of servants in a POA increases.
Servants are allocated from arbitrary memory locations. Since we have no control over the pointer value format, TAO uses a hash map for the Reverse-Lookup map. The value of the servant pointer is used as the hash key. Although hash maps do not guarantee O(1) worst-case behavior, they do provide a significant average-case performance improvement over linear search.
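The two-table arrangement can be sketched as follows. This is a minimal illustration using standard hash maps, not TAO's actual classes:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct Servant { std::string name; };

// Illustrative sketch: the Object Id -> servant map is paired with a
// Reverse-Lookup map keyed on the servant pointer, so servant-based
// lookups (e.g., _this under UNIQUE_ID) avoid a linear scan.
class ActiveObjectMap {
  std::unordered_map<std::string, Servant*> by_id_;      // forward map
  std::unordered_map<Servant*, std::string> by_servant_; // reverse map
public:
  void activate(const std::string& object_id, Servant* s) {
    by_id_[object_id] = s;       // both tables updated here: the cost is
    by_servant_[s] = object_id;  // paid at (de)activation, not on lookup
  }
  void deactivate(const std::string& object_id) {
    auto it = by_id_.find(object_id);
    if (it == by_id_.end()) return;
    by_servant_.erase(it->second);
    by_id_.erase(it);
  }
  Servant* find_servant(const std::string& object_id) const {
    auto it = by_id_.find(object_id);
    return it == by_id_.end() ? nullptr : it->second;
  }
  // O(1) average case instead of O(n) linear search.
  const std::string* find_object_id(Servant* s) const {
    auto it = by_servant_.find(s);
    return it == by_servant_.end() ? nullptr : &it->second;
  }
};
```

Note that Object Id-based lookups never touch the reverse map, so the critical request-dispatch path is unaffected.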
A Reverse-Lookup map can be used only with the UNIQUE_ID POA policy, since with the MULTIPLE_ID POA policy a servant may support many Object Ids. This constraint is not a shortcoming because servant-based lookups are only required with the UNIQUE_ID policy. One downside of adding a Reverse-Lookup map to the POA, however, is the increased overhead of maintaining an additional table. For every object activation and deactivation, two updates are required in the Active Object Map: (1) one to the Reverse-Lookup map and (2) one to the Active Demultiplexing map used for Object Id-based lookups. However, this additional processing does not affect the critical path of Object Id-based lookups at run time.

Figure 9 summarizes the demultiplexing strategies that we have determined to be most appropriate for real-time applications.
All of TAO's optimized demultiplexing strategies described above are fully compliant with the CORBA specification. Thus, no changes are required to the standard POA interfaces.
To enable applications to select the optimal POA synchronization, TAO provides the following POA creation policy extensions, sketched here in IDL (the enum declaration is reconstructed from the surrounding text):

    enum SynchronizationPolicyValue
    {
      NULL_LOCK, THREAD_LOCK, DEFAULT_LOCK
    };

    interface SynchronizationPolicy : CORBA::Policy
    {
      readonly attribute SynchronizationPolicyValue value;
    };

    SynchronizationPolicy create_synchronization_policy
      (in SynchronizationPolicyValue value);
Objects that support the SynchronizationPolicy interface can be obtained using TAO's POA extension method create_synchronization_policy, which is modeled on the standard POA policy factories. Instances of SynchronizationPolicy are passed to the POA::create_POA operation to specify the synchronization policy used in the created POA. The value attribute of SynchronizationPolicy contains the value supplied to the create_synchronization_policy operation from which it was obtained. The following values can be supplied by server developers: NULL_LOCK, which configures the POA with a no-op lock suitable for single-threaded servers; THREAD_LOCK, which configures the POA with a thread synchronization lock; and DEFAULT_LOCK, which defers to the lock configured as the ORB's default.
If no SynchronizationPolicy object is passed to create_POA, the synchronization policy defaults to DEFAULT_LOCK. The DEFAULT_LOCK option allows applications to make the synchronization decision once for all the POAs created in the server. For example, if the server is single threaded, the application can configure the ORB at initialization-time to use the null lock as the default lock. Hence, the application need not specify the NULL_LOCK policy in every call to create_POA.
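The effect of the NULL_LOCK and THREAD_LOCK policies can be sketched with a strategized lock hierarchy. The class names are illustrative, not TAO's:

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Illustrative strategized locking: the POA acquires its lock through
// an abstract interface, and the concrete lock is chosen by policy at
// POA creation time.
class Lock {
public:
  virtual ~Lock() = default;
  virtual void acquire() = 0;
  virtual void release() = 0;
};

class NullLock : public Lock {   // NULL_LOCK: no-op, zero overhead
public:                          // for single-threaded servers
  void acquire() override {}
  void release() override {}
};

class ThreadLock : public Lock { // THREAD_LOCK: a real mutex for
  std::mutex m_;                 // multi-threaded servers
public:
  void acquire() override { m_.lock(); }
  void release() override { m_.unlock(); }
};

enum SynchronizationPolicyValue { NULL_LOCK, THREAD_LOCK };

std::unique_ptr<Lock> make_lock(SynchronizationPolicyValue v) {
  if (v == THREAD_LOCK) return std::make_unique<ThreadLock>();
  return std::make_unique<NullLock>();
}

// A POA demultiplexing path would bracket map updates like this;
// the counter stands in for an active object map mutation.
int guarded_increment(Lock& lock, int& counter) {
  lock.acquire();
  int result = ++counter;
  lock.release();
  return result;
}
```

The demultiplexing code is identical for both policies; only the cost of acquire/release changes, which is how a single-threaded server avoids paying for synchronization it does not need.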
Table 2 shows the footprint reduction achieved when the
features listed above are excluded from TAO.
To ensure consistent behavior throughout the layers in an ORB endsystem, TAO's POA is designed to support TAO's various ORB Core configurations. The important variations are (1) each ORB Core in a process has its own POA and (2) all ORB Cores in a process share one POA, as described below. Figure 10 shows the POA per ORB Core configuration.
When the POA per ORB Core configuration is used, each POA is accessed by only one thread in the process. Thus, no locking is required within a POA, thereby reducing the overhead and non-determinism incurred to demultiplex servant requests. However, the drawback of the POA per ORB Core configuration is that registering servants becomes more complicated if servants must be registered in multiple POAs.
Optimizing a CORBA ORB Core to support real-time applications requires the resolution of many design challenges. This section outlines several of these challenges and describes the optimization principle patterns we applied to maximize the predictability, performance, and scalability of TAO's ORB Core. These optimizations include transparently collocating clients and servants that reside in the same address space, minimizing dynamic memory allocations and data copies, and minimizing GIOP/IIOP protocol overhead. Additional optimizations for real-time ORB Core connection management and concurrency strategies are described in our earlier work.
A common ORB concurrency model is a thread pool in which one thread is dedicated to I/O. This thread reads the request or reply from the network into a dynamically allocated buffer, which is placed into a queue. Threads in the pool then process the user upcalls, e.g., demarshaling the data in the buffers into storage supplied either by the application or by the stubs and skeletons generated by the IDL compiler.
This approach is popular because it (1) bounds the resources dedicated to threads, (2) isolates the I/O threads from the concurrency strategies, (3) is relatively easy to implement, (4) lets users provide callback objects to control thread creation and management, and (5) supports other concurrency mechanisms, such as thread-per-request or thread pools with lanes, as variations.
Unfortunately, this threading model is not adequate for real-time systems because:
The stub and skeleton classes shown in Figure 12 are required by the POA specification; the collocation class is specific to TAO. Collocation is transparent to the client since it only accesses the abstract interface and never uses the collocation class directly. Therefore, the POA provides the collocation class, rather than the regular stub class, when the servant resides in the same address space as the client.
Since the collocation class bypasses the POA, care must be taken to ensure that the following invariants are met, so that servant developers can create servants without concerning themselves with collocation issues. One such invariant concerns interceptors, which allow programmers to specify additional code to be executed before or after the normal code of an operation. This enables applications to perform security checks, provide debugging traps, maintain audit trails, and so on. The ORB must run these interceptors regardless of whether the client and the server are collocated.
Clients can obtain an object reference in several ways, e.g., from a CORBA Naming Service or from a Lifecycle Service generic factory operation. Likewise, clients can use string_to_object to convert a stringified interoperable object reference (IOR) into an object reference. To ensure locality transparency, an ORB's collocation optimization must determine whether an object is collocated. If it is, the ORB returns a collocated stub; if it is not, the ORB returns a regular stub to a distributed object.
The specific steps used by TAO's collocation optimizations are described below. Figure 13 shows the internal structure for collocation table management in TAO.
Multiple ORBs can reside in a single server process. Each ORB can support multiple transport protocols and accept requests from multiple transport endpoints. Therefore, TAO maintains multiple collocation tables, one for each transport protocol used by ORBs within a single process. Since different protocols have different addressing methods, maintaining protocol-specific collocation tables allows us to strategize and optimize the lookup mechanism for each protocol. A client follows the steps shown in Figure 14 to obtain a reference to the collocated object.
As shown in Figure 14, when a client process tries to resolve an imported object reference (1), the ORB checks (2) the collocation table maintained by TAO's ORB Core to determine if any object endpoints are collocated. If a collocated endpoint is found this check succeeds and the RootPOA corresponding to the endpoint is returned. Next, the matching Object Adapter is queried for the servant, starting at its RootPOA (3). The ORB then instantiates a generic CORBA::Object (4) and invokes the _narrow operation on it. If a servant is found, the ORB's _narrow operation (5) invokes the servant's _narrow method (6) and a collocated stub is instantiated and returned to the client (7). Finally, clients invoke operations (8) on the collocated stub, which forwards the operation to the local servant via a virtual method call.
If the imported object reference is not collocated, then either operation (2) or (3) will fail. In this case, the ORB invokes the _is_a method to verify that the remote object matches the target type. If the test succeeds, a distributed stub is created and returned to the client. All subsequent operations are invoked remotely. Thus, the process of selecting collocated or non-collocated stubs is completely transparent to clients and is performed only at the time of object reference creation.
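The stub-selection logic can be sketched as follows. This is a simplified model, with the collocation table reduced to a set of endpoint strings and the Object Adapter lookup reduced to a single servant; the class and method names are hypothetical:

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Illustrative sketch of transparent collocation: when a reference is
// resolved, the ORB consults a per-process table of local endpoints
// and hands back either a collocated stub (direct virtual call) or a
// remote stub (which would marshal and transmit).
struct Stub {
  virtual ~Stub() = default;
  virtual std::string invoke(const std::string& op) = 0;
};

struct Servant {
  std::string handle(const std::string& op) { return "local:" + op; }
};

struct CollocatedStub : Stub {   // forwards straight to the servant
  Servant* servant;
  explicit CollocatedStub(Servant* s) : servant(s) {}
  std::string invoke(const std::string& op) override { return servant->handle(op); }
};

struct RemoteStub : Stub {       // a real ORB would marshal and send here
  std::string invoke(const std::string& op) override { return "remote:" + op; }
};

class ORB {
  std::set<std::string> local_endpoints_;  // the collocation table
  Servant servant_;                        // stand-in for the POA lookup
public:
  void add_local_endpoint(const std::string& ep) { local_endpoints_.insert(ep); }
  std::unique_ptr<Stub> resolve(const std::string& endpoint) {
    if (local_endpoints_.count(endpoint))
      return std::make_unique<CollocatedStub>(&servant_);
    return std::make_unique<RemoteStub>();  // fall back to remote invocation
  }
};
```

The client only ever sees the abstract Stub interface, so the collocation decision made at resolution time is invisible at every subsequent invocation.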
Although executing an operation in the client's thread is very efficient, it is undesirable for certain types of real-time applications . For instance, priority inversion can occur when a client in a lower priority thread invokes operations on a collocated object in a higher priority thread. To provide greater access control over the scope of TAO's collocation optimizations, applications can associate different access policies to endpoints so they only appear collocated to certain priority groups. Since endpoints and priority groups in many real-time applications are statically configured, this access control lookup does not impose additional overhead.
Figure 15 shows the performance improvement,
measured in calls-per-second, using TAO's collocation optimizations.
Each operation cubed a variable-length sequence of longs that
contained 4 and 1,024 elements, respectively.
TAO's collocation optimizations are not totally compliant with the CORBA standard since its collocation class forwards all requests directly to the servant class. Although this makes the common case very efficient, this implementation does not support the following advanced POA features:
Adding support for these features to TAO's collocation class would slow down the collocation optimization, which is why TAO currently omits them. We plan to support these advanced features in future releases of TAO so that applications that do not require them can selectively disable them.
One source of memory management overhead stems from the use of dynamic memory allocation, which is problematic for real-time ORBs. For instance, dynamic allocation can fragment the global process heap, which decreases ORB predictability. Likewise, locks used to access a global heap from multiple threads can increase synchronization overhead and incur priority inversion.
Another significant source of memory management overhead is excessive data copying. For instance, conventional ORBs often resize their internal marshaling buffers multiple times when encoding large operation parameters. Naive memory management implementations use a single buffer that is resized automatically as necessary, which can cause excessive data copying.
TAO's memory management optimizations leverage the design of its concurrency strategies, which minimize thread context switching overhead and priority inversions by eliminating queueing within the ORB's critical path. For example, on the client side, the thread that invokes a remote operation is the same thread that completes the I/O required to send the request, i.e., no queueing exists within the ORB. Likewise, on the server side, the thread that reads a request completes the upcall to user code, also eliminating queueing within the ORB. These optimizations are based on the principle patterns of exploiting locality and optimizing for the common case.
By avoiding thread context switches and queueing, TAO can benefit from memory management optimizations based on thread-specific storage. Thread-specific storage is a common design pattern for optimizing buffer management in multi-threaded middleware. This pattern allows multiple threads to use one logically global access point to retrieve thread-specific data without incurring locking overhead for each access, which is an application of the pattern of avoiding waste. TAO uses this pattern to place its memory allocators into thread-specific storage. Using a thread-specific memory pool eliminates the need for allocator locks, reduces fragmentation in the allocator, and helps to minimize priority inversion in real-time applications.
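A thread-specific allocator of this kind can be sketched as follows. This is a minimal bump-pointer pool for illustration, not TAO's allocator:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative thread-specific storage allocator: each thread owns a
// private bump-pointer pool, so allocations on the request path take
// no lock and cause no contention across threads.
class ThreadPoolAllocator {
  std::vector<char> arena_;
  std::size_t used_ = 0;
public:
  explicit ThreadPoolAllocator(std::size_t bytes) : arena_(bytes) {}
  void* allocate(std::size_t n) {
    n = (n + 7) & ~std::size_t(7);                  // 8-byte alignment
    if (used_ + n > arena_.size()) return nullptr;  // pool exhausted
    void* p = arena_.data() + used_;
    used_ += n;
    return p;
  }
  void reset() { used_ = 0; }   // recycle the pool after the upcall
  std::size_t used() const { return used_; }
};

// One pool per thread behind a logically global access point; no
// mutex is ever taken on this path.
ThreadPoolAllocator& tss_pool() {
  thread_local ThreadPoolAllocator pool(64 * 1024);
  return pool;
}
```

Because the thread that starts a request also finishes it in TAO's concurrency model, buffers allocated from this pool never need to cross threads, which is what makes the lock-free design safe.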
In addition, TAO minimizes unnecessary data copying by keeping a
linked list of CDR buffers. As shown in Figure 16,
operation arguments are marshaled into TSS allocated buffers. The
buffers are linked together to minimize data copying. Gather-write
I/O system calls, such as writev, can then write these buffers
atomically without requiring multiple OS calls, unnecessary data
allocation, or copying.
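The chained-buffer approach can be sketched as follows. The gather() method stands in for the single writev call and is included so the sketch is testable:

```cpp
#include <cassert>
#include <cstring>
#include <list>
#include <string>
#include <vector>

// Illustrative sketch of a chained CDR buffer: marshaling appends new
// blocks instead of resizing (and copying) one large buffer, and the
// chain maps naturally onto the iovec array a gather-write call such
// as writev consumes.
struct Block { std::vector<char> data; };

class ChainedBuffer {
  std::list<Block> chain_;
  std::size_t block_size_;
public:
  explicit ChainedBuffer(std::size_t block_size) : block_size_(block_size) {}
  void write(const char* bytes, std::size_t n) {
    while (n > 0) {
      if (chain_.empty() || chain_.back().data.size() == block_size_)
        chain_.push_back(Block{});          // grow by linking, not copying
      Block& b = chain_.back();
      std::size_t room = block_size_ - b.data.size();
      std::size_t take = n < room ? n : room;
      b.data.insert(b.data.end(), bytes, bytes + take);
      bytes += take;
      n -= take;
    }
  }
  std::size_t block_count() const { return chain_.size(); }
  // Stand-in for writev(): one pass over the iovec-like chain.
  std::string gather() const {
    std::string out;
    for (const Block& b : chain_) out.append(b.data.begin(), b.data.end());
    return out;
  }
};
```

In a real ORB the chain would be translated into an array of iovec entries so the kernel performs the gather, leaving the marshaled data uncopied in user space.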
In this experiment, we perform 16 ORB buffer allocations and 1,000 regular data allocations. The exact series of allocations is not important, as long as both experiments perform the same number. If there is even one series of allocations for which the global allocator behaves non-deterministically, it is not suitable for hard real-time systems.
Our results in Figure 17 illustrate that TAO's TSS allocators isolate the ORB from variations in global memory allocation strategies. In addition, this experiment shows that TSS allocators are more efficient than global memory allocators since they eliminate locking overhead. In general, reducing locking overhead throughout an ORB is important to support real-time applications with deterministic QoS requirements.
Since embedded and real-time systems typically run the same ORB implementation on similar hardware, we have modified TAO to optionally remove some fields from the GIOP header and the GIOP Request header when the -ORBgioplite option is given to the client and server CORBA::ORB_init method. The fields removed by this optimization are shown in Table 3. These optimizations are guided by the principle patterns of relaxing system requirements and avoiding unnecessary generality.
Table 3: Fields removed by the -ORBgioplite option

    GIOP magic number: 4 bytes
    GIOP version: 2 bytes
    GIOP flags (byte order): 1 byte
    Request Service Context: variable number of bytes
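The byte savings from the removed fields can be illustrated by encoding both header forms. The standard layout follows GIOP 1.x; the GIOPlite layout here is an assumption for illustration, since both peers compiled with -ORBgioplite can agree on magic, version, and byte order out of band:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Standard GIOP message header: 12 bytes.
std::vector<std::uint8_t> encode_giop_header(std::uint8_t msg_type,
                                             std::uint32_t body_len) {
  std::vector<std::uint8_t> h = {'G', 'I', 'O', 'P',  // magic: 4 bytes
                                 1, 0,                // version: 2 bytes
                                 0};                  // flags/byte order: 1
  h.push_back(msg_type);                              // message type: 1
  for (int i = 0; i < 4; ++i)                         // message size: 4
    h.push_back(static_cast<std::uint8_t>(body_len >> (8 * i)));
  return h;
}

// Hypothetical GIOPlite header: only type and size go on the wire,
// saving 7 bytes per message.
std::vector<std::uint8_t> encode_gioplite_header(std::uint8_t msg_type,
                                                 std::uint32_t body_len) {
  std::vector<std::uint8_t> h;
  h.push_back(msg_type);
  for (int i = 0; i < 4; ++i)
    h.push_back(static_cast<std::uint8_t>(body_len >> (8 * i)));
  return h;
}
```

The absolute saving per message is small, which matches the modest improvement reported below; the win matters most for the short requests typical of control-oriented real-time traffic.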
Our empirical results reveal a slight, but measurable, improvement from removing the GIOP message footprint ``overhead.'' More importantly, these changes do not affect the standard CORBA APIs used to develop applications. Therefore, programmers can focus on the development of applications and, where necessary, TAO can be configured to use this lightweight version of GIOP.
To obtain more significant protocol optimizations, we are adding a pluggable protocols framework to TAO. This framework generalizes TAO's current -ORBgioplite option to support both pluggable ORB protocols (ESIOPs) and pluggable transport protocols. All IOP and transport protocols are registered with the framework's connector and acceptor registries, which are responsible for keeping track of available protocols, creating protocol objects, and interpreting profiles and object addresses.

Demultiplexing is an operation that routes messages through the layers of an ORB endsystem. Most protocol stack models, such as the Internet model or the ISO/OSI reference model, require some form of demultiplexing to support interoperability with existing operating systems and peer protocol stacks. Likewise, conventional CORBA ORBs utilize several extra levels of demultiplexing at the application layer to associate incoming client requests with the appropriate servant and operation (as shown in Figure 2).
Related work on demultiplexing focuses largely on the lower layers of the protocol stack, i.e., the transport layer and below, as opposed to the CORBA middleware. For instance, [21,35,22,36] study demultiplexing issues in communication systems and show how layered demultiplexing is not suitable for applications that require real-time quality of service guarantees.
Packet filters are a mechanism for efficiently demultiplexing incoming packets to application endpoints. A number of schemes for implementing fast and efficient packet filters are available, including the BSD Packet Filter (BPF), the Mach Packet Filter (MPF), PathFinder, demultiplexing based on automatic parsing, and the Dynamic Packet Filter (DPF).
As mentioned above, most existing demultiplexing strategies are implemented within the OS kernel. However, optimally reducing ORB endsystem demultiplexing overhead requires a vertically integrated architecture that extends from the OS kernel to the application servants. Since our ORB is currently implemented in user space, our work focuses on minimizing the demultiplexing overhead in steps 3, 4, 5, and 6 (which are shaded in Figure 2).
SunSoft IIOP uses an interpretive marshaling/demarshaling engine. An alternative approach is compiled marshaling/demarshaling. A compiled marshaling scheme relies on a priori knowledge of the type of the object being marshaled. Thus, there is no need to decipher the type of the data at run time; instead, the type is known in advance and can be used to marshal the data directly.
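The sketch below shows what a compiled stub amounts to for a fixed IDL-like struct: because the type is known at compile time, the generated code is straight-line with no run-time type interpretation. This is a minimal illustration only; it ignores CDR byte-order negotiation, alignment rules, and variable-length types that a real marshaling engine must handle:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// A fixed IDL-like struct whose layout is known a priori.
struct Point {
    std::int32_t x;
    std::int32_t y;
};

// Compiled marshaling: the stub "knows" Point, so it emits each field
// directly into the buffer with no type dispatch at run time.
void marshal_point(std::vector<unsigned char>& buf, const Point& p) {
    unsigned char tmp[sizeof(std::int32_t)];
    std::memcpy(tmp, &p.x, sizeof tmp);
    buf.insert(buf.end(), tmp, tmp + sizeof tmp);
    std::memcpy(tmp, &p.y, sizeof tmp);
    buf.insert(buf.end(), tmp, tmp + sizeof tmp);
}

Point demarshal_point(const std::vector<unsigned char>& buf) {
    Point p;
    std::memcpy(&p.x, buf.data(), sizeof p.x);
    std::memcpy(&p.y, buf.data() + sizeof p.x, sizeof p.y);
    return p;
}
```

An interpretive engine would instead walk a run-time description of `Point`'s fields in a generic loop, trading the speed of the straight-line code above for a single, smaller marshaling routine shared by all types.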
 describes the tradeoffs of using compiled and interpreted marshaling schemes. Although compiled stubs are faster, they are also larger. In contrast, interpretive marshaling is slower, but smaller in size.  describes a hybrid scheme that combines compiled and interpretive marshaling to achieve better performance. This work was done in the context of the ASN.1/BER encoding .
According to the SunSoft IIOP developers, interpretive marshaling is preferable because it decreases code size and increases the likelihood of remaining in the processor cache. Our goal is to generate efficient stubs and skeletons by extending the optimizations provided in USC  and ``Flick'' , which is a flexible, optimizing IDL compiler. Flick uses an innovative scheme in which intermediate representations guide the generation of optimized stubs. In addition, these intermediate stages enable Flick to map different IDLs (e.g., CORBA IDL, ONC RPC IDL, and MIG IDL) to a variety of target languages, such as C and C++. TAO's IDL compiler implements optimizations that improve the performance of its interpretive stubs; in contrast, the stubs and skeletons produced by USC and Flick are compiled.

Developers of real-time systems are increasingly using off-the-shelf middleware components to lower software lifecycle costs and decrease time-to-market. In this economic climate, the flexibility offered by CORBA makes it an attractive middleware architecture. Because CORBA is not tightly coupled to a particular OS or programming language, it can be adapted readily to ``niche'' markets, such as real-time embedded systems, which are not well covered by other middleware. In this sense, CORBA has an advantage over other middleware, such as DCOM  or Java RMI , since it can be integrated into a wider range of platforms and languages.
The POA and ORB Core optimizations and performance results presented
in this paper support our contention that the next-generation of
standard CORBA ORBs will be well-suited for distributed real-time
systems that require efficient, scalable, and predictable performance.
Table 5 summarizes which TAO optimizations are
associated with which principle patterns, as well as which
optimizations conform to the CORBA standard.
[Table 5 omitted; the principle patterns it lists include: precomputing and avoiding waste, passing hints in headers, relaxing system requirements, using specialized routines, not being tied to reference models, adding extra state, and optimizing for the common case.]
Our primary focus in the TAO project has been to research, develop, and optimize policies and mechanisms that allow CORBA to support hard real-time systems, such as avionics mission computing . In hard real-time systems, the ORB must meet deterministic QoS requirements to ensure proper overall system functioning. These requirements motivate many of the optimizations and design strategies presented in this paper. However, the architectural design and performance optimizations in TAO's ORB endsystem are equally applicable to many other types of real-time applications, such as telecommunications, network management, and distributed multimedia systems, which have statistical QoS requirements.
The C++ source code for TAO and ACE is freely available at www.cs.wustl.edu/schmidt/TAO.html. This release also contains the ORB benchmarking test suites described in this paper.

We would like to thank our COOTS shepherd, Steve Vinoski, whose comments helped improve this paper. In addition, we would like to thank the COOTS Program Committee and the anonymous reviewers for their constructive suggestions for improving the paper.
This paper was originally published in the
Proceedings of the 5th USENIX Conference on Object-Oriented Technologies and Systems, May 3-7, 1999, San Diego, California, USA