The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

(This is an ASCII version of the paper presented at the Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, California, September 1993. Figures were omitted from this version.)

Panda: A Portable Platform to Support Parallel Programming Languages

Raoul Bhoedjang, Tim Ruhl, Rutger Hofman, Koen Langendoen, Henri Bal
Vrije Universiteit Amsterdam
Department of Mathematics and Computer Science

Frans Kaashoek
MIT Laboratory for Computer Science, Cambridge MA
kaashoek@amsterdam.lcs.mit.edu

This research is supported in part by a PIONIER grant from the Netherlands Organisation for Scientific Research (N.W.O.).

Abstract

Current parallel programming languages require advanced run-time support to implement communication and data consistency. As such run-time systems are usually layered on top of a specific operating system, they are nonportable. This paper reports on our early experiences with Panda, a portable virtual machine that provides general and flexible support for implementing run-time systems for parallel programming languages. Panda has two interfaces: a Panda interface providing threads, RPC, and totally-ordered group communication, and a system interface that encapsulates machine dependencies by providing machine-independent thread and communication abstractions. We describe the interfaces, our experience with an initial Unix implementation, and the development of a new, portable, and scalable run-time system for the Orca parallel programming language on top of Panda.

Unix is a trademark of Unix Systems Laboratories, Inc.

1. Introduction

Modern parallel programming languages require advanced run-time support for communication and data consistency. In order to fully exploit a machine's particular features, run-time systems for parallel languages tend to be built directly on top of the host operating system. Our experience with the parallel programming language Orca [4] on the Amoeba [19, 24] distributed operating system is that this strategy results in a language implementation that is difficult to port to other operating systems.

Orca is based on shared objects. As shared objects may be replicated to speed up read accesses, their implementation on distributed-memory machines requires advanced run-time support. To investigate the scalability and portability of shared data objects, we aim to port Orca to a variety of parallel architectures. Although operating systems like Amoeba offer a virtual machine abstraction on which shared objects can be implemented, they are generally tied to a particular machine architecture and thus nonportable. Also, current operating systems generally provide more functionality than is needed or wanted for parallel processing (e.g., virtual memory management).
Some modern operating systems, like Clouds [12], support their own object model. Unless such a model is very simple, flexible, and lightweight, layering another object model on top of it can be troublesome and inefficient. Higher-level approaches to supporting parallel programming include page-based Distributed Shared Memory (DSM) systems (e.g., Munin [9]). Page-based DSM systems, however, often rely on manipulation of the virtual memory management unit, and therefore also suffer from portability problems. Highly portable message passing systems exist, but they provide limited functionality. PVM [22] and p4 [8], for instance, provide low-level message passing, but they support neither threads nor totally-ordered group communication.

Instead of relying on page-based DSM, operating systems, or low-level message passing, we have developed a portable virtual machine, called Panda. Panda was designed with the portability requirements of parallel languages in mind and is currently used to implement a new Orca run-time system. Panda, however, does not restrict its users to Orca's object model. It provides the following general abstractions:

- threads,
- Remote Procedure Call (RPC),
- totally-ordered group communication.

Experience with similar Amoeba abstractions has shown that efficient implementations of shared objects can be built on top of them [23]. Threads provide a simple, lightweight unit of activity. RPC [7] is a general mechanism for high-level point-to-point communication between nodes (and thus for the implementation of remote object invocation). Totally-ordered group communication [16] has been successfully employed in previous Orca run-time systems for keeping replicated objects consistent and for the implementation of a distributed checkpointing algorithm [17]. It assures that all members of a group receive all group messages in the same order, which makes many parallel applications easier to implement. Hardware broadcast mechanisms usually do not guarantee these strong semantics. Since Panda's abstractions are language-independent, we believe that Panda can be used to implement run-time systems for languages other than Orca as well.

The Panda architecture, illustrated in Figure 1, consists of two layers that reflect our design goals: portability and support for parallel programming languages. Support for parallel programming languages is achieved by providing high-level abstractions in the Panda interface. The software that implements the Panda interface is called the Panda layer. Portability is achieved by implementing the Panda layer on top of the system interface, which encapsulates machine dependencies. This makes the Panda layer fully machine-independent. An implementation of the system interface, a system layer, can be constructed with only some basic operating system support, but can also exploit features of the underlying operating system (e.g., kernel threads, scatter-gather interfaces, or hardware broadcasting and multicasting).

Panda takes a layering approach towards portability. Although layering is an effective way to abstract from machine dependencies, it carries with it the danger of poor performance. Thoughtless layering may well result in a loss of information that is essential to achieving good performance [20]. Therefore, we have identified Panda's performance-critical parts: threads, message manipulation, and the nature of the underlying network.
These performance-critical parts are all implemented in the system layer, where they have access to low-level, operating system-specific features. The main elements of the system layer are threads, message manipulation primitives, and communication primitives. By implementing threads and messages in the system layer, we can benefit from operating system-specific features and thus achieve better performance. Communication takes place between virtual processors, called platforms, which are identified by platform identifiers. The communication primitives provide unreliable point-to-point communication and multicasting between these platforms.

Porting Panda to a new architecture requires porting the system layer only. The minimal support required from the operating system for implementing the system layer consists of a facility for unreliable message passing, and a facility for handling signals generated by incoming messages and expired timers. If the host operating system offers no more, all of the system interface has to be implemented from scratch on top of this operating system. However, most current operating systems offer threads and usable communication facilities. Implementing the system layer on such systems is easier.

We have constructed an initial implementation of Panda for a collection of SPARC-based workstations, running Unix, and connected by a 10 Mbit/s Ethernet. We intend to port Panda in the near future to a T9000-based parallel machine, the Alewife [1], and the CM-5 [25].

Section 2 describes the Panda and system interface in more detail. The machine-independent implementation of the Panda interface is outlined in Section 3. In Section 4, we describe our experience with an initial implementation of the system interface on Unix. Section 5 discusses the implementation of a new, portable Orca run-time system on top of Panda. In Section 6, Panda is compared with related systems. Finally, in Section 7, we present our conclusions.

2. The Panda and System Interface

In this section, we describe the relevant parts of the current Panda and system interfaces. Based on further experience with Panda and Orca these interfaces may evolve. For reasons of efficiency, threads and messages are implemented in the system layer (and are part of the system interface), but most of the primitives associated with them are also visible in the Panda interface. To distinguish between Panda layer and system layer functions, each Panda layer function name is prefixed by "pan_" and each system layer function by "sys_". Functions that are part of both interfaces are not prefixed.

2.1 The Panda Interface

The Panda interface provides the RPC, totally-ordered group communication, and thread abstractions with which Panda applications can be built.

Threads

The thread interface (see Table 1) is based on the Pthreads [15, 18] and C Threads [11] interfaces. Since threads are implemented in the system layer (see Figure 1), thread primitives do not have a pan prefix. From experience with Amoeba threads we have learned that a thread package for parallel programming languages should support priority scheduling to handle incoming messages immediately when they arrive. This automatically implies preemption of running threads when a new message arrives. Priorities are supported by the operations thread_getprio and thread_setprio, which return and set priorities. The sketch below illustrates how these operations might be used.
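As an illustration only (Table 1 is omitted from this ASCII version, so the exact names and signatures shown here are assumptions, not the actual Panda thread interface), a run-time system might create a dedicated message-handler thread and raise its priority so that it preempts ordinary worker threads:

    /* Hedged sketch, not the Panda source: thread_t, thread_create,
     * thread_getprio, and thread_setprio are assumed for illustration. */
    #include <stddef.h>

    typedef int thread_t;                    /* assumed thread handle type */
    extern thread_t thread_create(void (*func)(void *arg), void *arg);
    extern int  thread_getprio(thread_t t);
    extern void thread_setprio(thread_t t, int prio);

    static void handle_messages(void *arg)
    {
        /* ... wait for incoming messages and dispatch them ... */
    }

    void start_handler(void)
    {
        thread_t handler = thread_create(handle_messages, NULL);

        /* Run the handler above the compute threads, so an arriving
         * message preempts whatever computation is currently running. */
        thread_setprio(handler, thread_getprio(handler) + 1);
    }

Because the handler has the highest priority, it is scheduled as soon as the signal for an incoming message arrives, which is exactly the preemptive behaviour described above.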
Since we do not specify a scheduling policy among threads with the same priority, we also provide the function thread_yield, which tries to run another runnable thread with the same priority. Synchronization between threads is based on mutexes and condition variables. Together these provide strong enough semantics to construct monitors [14].

RPC

The RPC interface (see Table 2) is based on the notion of a service that provides a number of operations. A service is implemented by one or more servers. A server can register its services with Panda's name server using pan_export_service, giving as arguments the number of operations it supports and an array of pointers to these operations. Before an operation can be called, the client must get a handle to a server (pan_import_service). This handle can be used to identify the server that must handle the RPC request (pan_do_rpc). When a request message comes in, a thread is started. This thread calls the registered function for the specified service and operation. This function has three parameters: an operation index number, an input message, and a reply message.

Totally-Ordered Group Communication

The group abstraction of Panda (see Table 3) supports totally-ordered, closed groups [16]. The total ordering assures that every group member receives all group messages in the same order. A group is closed if only its members can send messages to the group. This makes an efficient implementation possible. Each group is identified by a character string, which is registered with Panda's name server. A platform that wants to join the group calls pan_group_join, which initializes a group structure. If the group does not exist, it will be created. Group messages are handled asynchronously by an upcall to a specified receive routine, which handles incoming messages one by one to ensure total ordering.

2.2 The System Interface

The system interface hides machine dependencies by providing three abstractions: threads, messages, and communication primitives. As explained before, threads are implemented in the system layer; the system and Panda interface are identical with respect to threads.

Communication

The communication facilities are divided into two parts: send primitives (see Table 4) and addressing. At startup time each platform gets a unique platform identifier (pid), which is an integer ranging from 1 to the number of platforms. This pid is used as a point-to-point address. Pids can be grouped together in a platform set (pset) that serves as a logical multicast address. The send primitives provided by the system layer are sys_unicast for (unreliable) point-to-point communication and sys_multicast for (unreliable) one-to-many communication. When the Panda layer initializes itself, it registers a message receive handler with the system layer. All platforms run a system layer receive daemon. Each time a (unicast or multicast) message arrives, this daemon makes an upcall to the message receive handler in the Panda layer.

Messages

At the interface level, messages look like stacks. To construct a message, senders push data fields of a specified size and alignment onto a message's stack; these fields are popped in reverse order by the receivers (see Table 5). message_look is similar to message_pop, but it does not pop the data field off the message. The sketch below shows how a sender and a receiver might use this stack discipline.
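As a hedged illustration (Table 5 is omitted from this ASCII version, so the names and signatures of the message primitives below are assumptions), a request could be built and unpacked as follows; the key point is the LIFO discipline, so the field pushed last is popped first:

    /* Hedged sketch, not the Panda source: message_t, message_push, and
     * message_pop are assumed; push/pop are taken to reserve or release a
     * field of the given size and alignment and return a pointer to it. */
    #include <string.h>

    typedef struct message message_t;        /* opaque message handle (assumed) */
    extern void *message_push(message_t *m, int size, int alignment);
    extern void *message_pop(message_t *m, int size, int alignment);

    void sender_side(message_t *m, int opcode, const char *name)
    {
        /* Push the payload first and the opcode last; the receiver pops
         * fields in reverse order, so it sees the opcode first. */
        strncpy((char *)message_push(m, 32, 1), name, 32);
        *(int *)message_push(m, sizeof(int), sizeof(int)) = opcode;
    }

    void receiver_side(message_t *m)
    {
        int  opcode = *(int *)message_pop(m, sizeof(int), sizeof(int));
        char name[32];

        memcpy(name, message_pop(m, 32, 1), 32);
        /* ... dispatch on opcode, use name ... */
        (void)opcode;
    }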
Although the communication primitives hide machine dependencies, they do not handle messages longer than the underlying system supports. Instead, the system interface provides primitives to fragment messages so that they can be handled by the communication primitives. This fragmentation is based on a common header, a header that is placed in front of every fragment. With sys_message_mark the Panda layer can specify the end of the data part and the start of the common header. Every data field pushed after the mark belongs to the common header. sys_message_fragment initializes a new fragment message containing the common header and part of the data of the original message. This function takes as a parameter an offset indicating the start of a fragment's data in the original message, and it returns the offset of the next fragment's data. After getting a fragment from a message, some fields in the common header part can be filled in with information that identifies this fragment. At the receiving side, sys_message_assemble is used to reassemble the original message. These primitives resemble the x-kernel primitives for fragmenting messages [20].

A fragment message need not contain copies of the common header and data fields of the original message; pointers may be used instead. To support sharing of the common header among fragments, only one fragment may exist at a time (i.e., before creating the next fragment, the predecessor fragment must be released). This way, it is always clear to which fragment the identification information in the common header refers. Using pointers to the original message in the fragment message avoids unnecessary copying.

2.3 Portability and Efficiency

Both the Panda and the system interface have been designed to allow efficient implementations. Some of the abstractions present in the system layer may seem high-level, but providing these abstractions rather than low-level primitives gives us the opportunity to exploit advanced features offered by many modern operating systems. Among these features are efficient, user-level thread packages or kernel threads, scatter-gather message transmission, access to hardware broadcasting and multicasting, etc.

We have decided to make message passing in the system interface unreliable. Reliable message passing would have prohibited an efficient RPC implementation on architectures that only provide unreliable message passing, since sending a message reliably requires at least two network packets.

3. Implementation of the Panda Interface

The Panda interface is implemented using the primitives provided by the system interface. Therefore, this code is entirely machine-independent.

3.1 Group Communication Structure and Protocol

The group communication implementation is based on [16], which describes an efficient, totally-ordered, and atomic group communication protocol. Since we are not concerned with fault tolerance, we have implemented this protocol in a non-resilient way, thereby losing atomicity (all-or-none delivery, even in the presence of processor crashes). It is possible, however, to do synchronous checkpointing on top of totally-ordered group communication without atomicity [17].

Totally-ordered group communication is achieved by having a special member in each group, the sequencer, which assigns a sequence number to each group message. This gives two possibilities for a group send [16]. The first method is to send a point-to-point message to the sequencer, and the sequencer then broadcasts the message after filling in the sequence number (the so-called PB method). The second method is to let the sender itself do the broadcast.
When the sequencer receives this broadcast message, it assigns a sequence number to it and broadcasts a short acknowledgement message containing this sequence number (the BB method). This method saves network bandwidth (because the data is transmitted only once), but it generates more interrupts. A choice between the two methods is made dynamically, based on the message size and on information from the system layer. Either way, when a message arrives its sequence number is checked against the last sequence number received. If the sequence number indicates that this is the next message, the message can be delivered to the application level; otherwise the receiver asks the sequencer for the missing messages.

Incoming group messages are handled by a single daemon thread, which makes an upcall to the receive handler specified by pan_group_join. To avoid losing the ordering of group messages through unpredictable thread scheduling, we use only one daemon thread per group.

Since the underlying architecture may have stronger semantics than we actually need, the system layer can define the following two compilation flags: RELIABLE, to specify that multicast messages are never lost, and ORDERED, which specifies that all multicast messages arrive in total order. The group communication code is adapted according to these flags.

3.2 RPC Structure and Protocol

The RPC protocol is based on Birrell and Nelson [7]. An RPC requires three messages during normal execution: a request, a reply, and an acknowledgement. On some architectures (e.g., a network of T9000 Transputers) reliable message passing is already provided, so the acknowledgement is not necessary. In that case the system layer can define the compilation flag RELIABLE, which implies that messages are reliable. When compiled with this flag set, no acknowledgements are sent.

4. Experience with Panda on Unix

We have implemented Panda on Unix (SunOS 4.1.2). The following subsections describe the implementation of the system layer and the performance of Panda. Not all parts of the implementation have been tuned yet.

4.1 Implementation of the System Interface

In contrast to modern operating systems for parallel computers, Unix provides neither threads nor multicasting. Nevertheless, we have selected Unix as our initial target operating system, because it provides a complete programming environment and because it is widely available. To avoid writing a large amount of software that we expect to be provided by future target platforms, we have used public-domain software for our threads and (unreliable) multicast implementation. We have implemented our threads interface with Pthreads [18], a POSIX-conformant, user-space threads implementation. We have extended the kernels of our SPARC workstations with IP multicast, a kernel extension for multicasting [13]. Point-to-point message passing has been implemented on top of UDP [21].

Pthreads provides all the functionality we need, including priority scheduling, and runs entirely in user space. User-space threads are more efficient than (pure) kernel-based implementations, because thread context switches do not involve trapping to the kernel. However, they suffer from poor integration with virtual memory management and blocking I/O [2]. Virtual memory makes performance less predictable: a page fault will block all threads while the missing page is brought in from the disk. Blocking network I/O is a more serious problem: the thread that waits for incoming messages should not block all other threads contained in the same process. The sketch following this paragraph illustrates the kind of nonblocking receive loop this requires.
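The fragment below is a hedged sketch of such a receive loop, not the actual Panda source: the function name receive_daemon and the upcall type are assumptions, and the code uses present-day POSIX names (O_NONBLOCK, sigwait, sigprocmask), whereas the original SunOS 4.1.2 and draft-Pthreads code would differ in detail.

    /* Hedged sketch of a per-platform receive daemon: read from a
     * nonblocking UDP socket, and sleep in sigwait() until SIGIO announces
     * new data, so that under user-level Pthreads only this thread blocks. */
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/socket.h>

    void receive_daemon(int udp_fd, void (*upcall)(void *buf, int len))
    {
        char buf[8192];
        sigset_t sigio_set;
        int sig;

        sigemptyset(&sigio_set);
        sigaddset(&sigio_set, SIGIO);
        sigprocmask(SIG_BLOCK, &sigio_set, NULL);    /* take SIGIO via sigwait()    */

        fcntl(udp_fd, F_SETOWN, getpid());           /* deliver SIGIO for this socket */
        fcntl(udp_fd, F_SETFL, O_NONBLOCK | FASYNC); /* nonblocking + async notify
                                                        (FASYNC is the BSD/SunOS name) */
        for (;;) {
            int n = recv(udp_fd, buf, sizeof buf, 0);
            if (n >= 0)
                upcall(buf, n);                      /* hand the packet to the Panda layer */
            else if (errno == EWOULDBLOCK || errno == EAGAIN)
                sigwait(&sigio_set, &sig);           /* sleep until SIGIO arrives   */
        }
    }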
In our implementation, each platform's receive daemon thread uses Unix's asynchronous and nonblocking I/O options to prevent blocking the entire process when reading from the network. If it finds no pending messages, it waits for a signal. Since Pthreads supports signals on a per-thread basis, only the receive daemon is blocked, not the entire process. Incoming messages generate SIGIO signals that cause the receive daemon to be rescheduled immediately (since it has the highest priority).

Since UDP has no support for Panda's stack-based message manipulation and fragmentation routines, most of our system layer code is devoted to the implementation of these routines. This code is machine-independent and need not be changed when Panda is ported. However, it may be beneficial to adapt the code to platform-specific features (e.g., scatter-gather facilities). The system interface was designed to allow such modifications in its implementation. No changes need to be made to the interface itself.

4.2 Performance

To compare the overhead of our protocols, Table 6 gives an overview of the performance of the communication primitives in the system and the Panda layer. These performance figures were obtained on a collection of diskless SPARCstation SLCs, running at 20 MHz, and connected by a 10 Mbit/s Ethernet. Also included is the overhead of thread context switching. The message passing latencies were measured with two platforms (running on two different machines), one sending 10,000 messages and the other sending acknowledgements. The null RPC latency is measured with 10,000 empty request and reply messages to an empty server routine, and the throughput with 1000 RPC messages with a request message size of 8000 bytes and an empty reply message.

The latency of an empty group message is 6.7 ms. Since the protocol uses negative acknowledgements, this latency is almost independent of the number of platforms [16]. RPC and group communication performance of our initial Panda implementation is within a factor of 4 of Amoeba, which has RPC and group communication built into its microkernel. (On comparable hardware, Amoeba does a null RPC in 1.7 ms; a null group message on a collection of 20 MHz MC68030s takes 2.7 ms.)

5. Implementing the Orca RTS using Panda

Orca is a type-secure parallel and distributed programming language. Orca programs consist of processes that communicate solely through shared objects, which are instances of abstract data types. To speed up read accesses to shared objects, such objects may be replicated. The replication strategy is based on a combination of compile-time and run-time techniques [3]. The Orca run-time system (RTS) is responsible for keeping replicas in a consistent state.

Orca is being re-implemented to obtain a programming system that is portable, efficient, and scalable. A new Orca compiler generating fast and portable ANSI-C code has already been implemented, and we are now reimplementing the run-time system on top of Panda. The new RTS makes heavy use of all Panda facilities:

- Threads: The Orca RTS uses Panda's threads for the implementation of Orca processes. Threads are also created implicitly as a result of incoming RPC requests and group messages.

- RPC: RPC is used by the RTS for performing operations on remote, nonreplicated objects and for transmitting objects when they are migrated or replicated.
- Group communication: When a shared, replicated object is updated simultaneously by two Orca processes, all replica holders of this object must apply the updates in the same order. To achieve this, the RTS uses totally-ordered group communication. All RTSs belong to a single group and simply send their update messages to this group. Since communication in this group is totally-ordered, all RTSs receive and process the update messages in the same order, thus keeping the replicas consistent.

In contrast to previous Orca implementations based on Amoeba, machine dependencies are now hidden from the RTS by Panda, thus making the RTS portable. Moreover, as can be seen from the interface descriptions, the primitives in the Panda interface are language-independent and can be used for the implementation of other language run-time systems.

6. Related Work

Panda differs from many existing parallel programming platforms in that it has been designed with the requirements of run-time systems for parallel programming languages in mind. As previous Orca implementations have demonstrated, such run-time systems can benefit from high-level support in the form of RPC and totally-ordered group communication. In this section we compare Panda with other systems that can be used for implementing parallel programming languages; some of these systems are language-based themselves, whereas others come in library form. We consider portable message passing systems (p4, PVM), Distributed Shared Memory systems (Munin and Midway), ARCADE, and ISIS/HORUS.

Like Panda, PVM [22] and p4 [8] provide portable communication primitives. PVM and p4, however, provide message passing primitives only, and neither provides high-level communication in the form of RPC or totally-ordered group communication. In our experience, RPC and group communication simplify the implementation of complex run-time systems. Neither PVM nor p4 supports lightweight threads; because of their high context-switching overhead, processes are not suitable for hiding communication latencies. Both PVM and p4 provide visualization tools and support for heterogeneity. Work on extending Orca and Panda with performance debugging tools is in progress. Unlike PVM and p4, Panda does not support heterogeneity.

DSM systems like Munin [9] and Midway [5] support parallel programming by providing a shared memory abstraction that hides all message passing from the programmer. Although Panda does not provide such an abstraction by itself, it gives sufficient support to layer a shared memory model on top of it. Munin programmers annotate shared variables with their expected access pattern. These shared variables are kept consistent through a release consistency protocol. The Munin implementation of this protocol relies on the Memory Management Unit (MMU) to detect writes to pages containing shared data, thus rendering the implementation machine-dependent. In the Midway system, shared variables are associated with their synchronization objects and kept consistent through a memory consistency protocol called entry consistency. Although Midway does not rely on MMU manipulation to enforce entry consistency, it does need the MMU to implement stronger memory consistency models (release consistency and processor consistency) [5]. Munin and Midway thus support parallel programming by providing a shared memory abstraction and weak consistency models.
We consider this support too low-level for application programming: programmers should not have to annotate their variables or use low-level locking. Munin (and sometimes Midway) needs to manipulate the MMU, while Orca implementations guarantee sequential consistency, which is stronger than all previously mentioned forms of consistency, without MMU manipulation. Thus, layering an Orca run-time system on top of Panda results in a portable implementation of sequential consistency.

Like Panda, ARCADE [10] supports the implementation of parallel programming languages through high-level abstractions. The ARCADE abstractions, however, are based on language-independent data units rather than communication mechanisms. A data unit is an abstraction of a typed region of memory that can be named, moved, and shared across multiple nodes in a distributed environment. Language-specific objects can be mapped onto ARCADE's data units.

Like Orca, ISIS [6] is currently being reimplemented for reasons of portability and scalability. The new ISIS system, HORUS [26], has a core interface that provides reliable, causal multicasting (CBCAST). Other services are implemented on top of this interface. This interface is somewhat like the Panda interface, although the ordering semantics of CBCAST is weaker than that of Panda's group communication; totally-ordered group communication and RPC are implemented on top of CBCAST. The CBCAST layer is implemented on top of a portable operating system abstraction, MUTS (Multicast Transport Service), that is similar to Panda's system layer.

7. Conclusion

This paper described the motivation, design, and implementation of Panda, an implementation platform for parallel programming languages that combines portability with flexibility and efficiency. Panda achieves portability by defining a machine-independent system interface in addition to the Panda interface. The Panda interface is implemented on top of this system interface and is thus machine-independent. The implementation of the system interface requires only basic operating system support for the context switching of threads and unreliable message passing. Most of the current system interface implementation is machine-independent and can easily be reused. Its careful interface design and modular implementation allow for the incorporation of efficient, native thread packages and communication facilities (e.g., scatter-gather message passing and hardware broadcasting and multicasting). Porting the system layer to other parallel architectures should be straightforward.

Panda provides its users with three flexible abstractions that have been effectively employed for the implementation of several Orca run-time systems: threads, RPC, and totally-ordered group communication. We use these abstractions to implement Orca's object model. Early experience with a SPARC/Unix implementation of Panda has shown the feasibility of a layering approach towards support for parallel programming languages.

Acknowledgements

We would like to thank Gerard Kok and Anil Sukul for testing Panda's RPC implementation. We also greatly appreciate the helpful comments of Ceriel Jacobs and Leendert van Doorn on earlier drafts of this paper.

References

[1] A. Agarwal, D. Chaiken, G. D'Souza, K. Johnson, D. Kranz, J. Kubiatowicz, K. Kurihara, B-H. Lim, G. Maa, D. Nussbaum, M. Parkin, and D. Yeung. The MIT Alewife Machine: A large-scale distributed-memory multiprocessor. Technical Report MIT/LCS TM-454, MIT, 1991.
[2] T.E. Anderson, B.N. Bershad, E.D. Lazowska, and H.M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. In Proc. of the 13th Symposium on Operating Systems Principles, pages 95-109. ACM, 1991.

[3] H.E. Bal and M.F. Kaashoek. Object Distribution in Orca using Compile-time and Run-time Techniques. In Conference on Object-Oriented Programming Systems, Languages and Applications, Washington D.C., 26 September-1 October 1993. To be published.

[4] H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum. Orca: A Language for Parallel Programming of Distributed Systems. IEEE Transactions on Software Engineering, 18(3), pages 190-205, March 1992.

[5] B. Bershad, M. Zekauskas, and W.A. Sawdon. The Midway Distributed Shared Memory System. In Proc. of the IEEE COMPCON Conference, 1993.

[6] K.P. Birman and T.A. Joseph. Exploiting Virtual Synchrony in Distributed Systems. In Proc. of the 11th ACM Symposium on Operating Systems Principles, pages 123-138, 1987.

[7] A.D. Birrell and B.J. Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems, 2(1), pages 39-59, February 1984.

[8] R. Butler and E. Lusk. Monitors, Messages, and Clusters: the p4 Parallel Programming System. Journal of Parallel Computing. (submitted).

[9] J.B. Carter, J.K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proc. of the 13th Symposium on Operating Systems Principles. ACM, 1991.

[10] D.L. Cohn, A. Banerji, M.R. Casey, P.M. Greenawalt, and D.C. Kulkarni. Basing Micro-kernel Abstractions on High-Level Language Models. In Proc. of the Autumn 1992 OpenForum Technical Conference, pages 323-336, Utrecht, Holland, November 1992.

[11] E.C. Cooper and R.P. Draves. C Threads. Technical Report CMU-CS-88-154, Department of Computer Science, Carnegie Mellon University, Pittsburgh, 1988.

[12] P. Dasgupta, R.C. Chen, S. Menon, M. Pearson, R. Ananthanarayanan, U. Ramachandran, M. Ahamad, R. LeBlanc Jr., W. Applebe, J.M. Bernabeu-Auban, P.W. Hutto, M.Y.A. Khalidi, and C.J. Wilkenloh. The Design and Implementation of the Clouds Distributed Operating System. Computing Systems Journal, 3, 1990.

[13] S.E. Deering and D.R. Cheriton. Multicast Routing in Datagram Internetworks and Extended LANs. ACM Transactions on Computer Systems, 17(1), January 1991.

[14] C.A.R. Hoare. Monitors: An Operating System Structuring Concept. Communications of the ACM, 17(10), pages 549-557, October 1974.

[15] IEEE. Threads Extensions for Portable Operating Systems P1003.4a, draft 6 edition, February 1992.

[16] M.F. Kaashoek. Group Communication in Distributed Computer Systems. PhD thesis, Vrije Universiteit Amsterdam, 1992.

[17] M.F. Kaashoek, R. Michiels, H.E. Bal, and A.S. Tanenbaum. Transparent Fault-Tolerance in Parallel Orca Programs. In Proc. of the Symposium on Experiences with Distributed and Multiprocessor Systems III, pages 297-312, March 1992.

[18] F. Mueller. Implementing POSIX Threads under UNIX: Description of Work in Progress. In Proc. of the 2nd Software Engineering Research Forum, pages 253-261, November 1992.

[19] S.J. Mullender, G. van Rossum, A.S. Tanenbaum, R. van Renesse, and H. van Staveren. Amoeba: A Distributed Operating System for the 1990s. IEEE Computer, 1990.

[20] L. Peterson, N. Hutchinson, S. O'Malley, and H. Rao. The x-kernel: A Platform for Accessing Internet Resources. IEEE Computer, pages 23-33, May 1990.

[21] J. Postel. User Datagram Protocol. Internet Request for Comments RFC 768, September 1981.

[22] V. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4), December 1990.
[23] A.S. Tanenbaum, M.F. Kaashoek, and H.E. Bal. Parallel Programming Using Shared Objects and Broadcasting. IEEE Computer, 25(8), pages 10-19, August 1992.

[24] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G.J. Sharp, S.J. Mullender, A.J. Jansen, and G. van Rossum. Experiences with the Amoeba Distributed Operating System. Communications of the ACM, 33(12), pages 46-63, December 1990.

[25] Thinking Machines Corporation. The Connection Machine CM-5 Technical Summary, 1991.

[26] R. van Renesse, K. Birman, R. Cooper, B. Glade, and P. Stephenson. Reliable Multicast between Microkernels. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, pages 269-283, April 27-28 1992.