The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Notice to readers: this text does not contain any figures, nor does it contain any cross references. For a full and complete paper, please refer to:

%A D. Milojicic
%A W. Zint
%A P. Giese
%T Experiences with Load Distribution on top of the Mach Microkernel
%Z Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems
%D September 1993
%L Milo93b

************************************************************************

Experiences with Load Distribution on top of the Mach Microkernel

Dejan S. Milojicic, Peter Giese and Wolfgang Zint
University of Kaiserslautern, Informatik, Geb. 36, Zim. 424
Erwin-Schrödingerstraße, 67663 Kaiserslautern, Germany
e-mail: [dejan,giese,zint]@informatik.uni-kl.de

Abstract

The paper describes our experiences in the design, implementation and use of load distribution on top of Mach. As a first step towards load distribution, we provided task migration, which is our base mechanism for distributed scheduling. We compared task migration with initial placement. In order to make more accurate scheduling decisions, we instrumented Mach to account for network IPC and network paging. Processing is still the prevailing factor, but we also consider information on VM and IPC. We have conducted experiments with well-known distributed scheduling strategies in order to prove our assumptions. We are primarily interested in the microkernel aspects of load distribution. We report on the lessons learned while dealing with the Mach interface, and on the relationship of task migration to process migration and the file system.

Introduction

Load Distribution (LD) has always attracted the interest of the research community. Unfortunately, it has never been widely used, despite many successful implementations [Bara85,Doug91,Zajc93] and promising simulation research [Krue88,Zhou88,Krem92]. One of the reasons for the modest use of LD may be the absence of widely used distributed operating systems and adequate parallel applications. New microkernels, such as Mach [Blac92] and Chorus [Rozi92], are inherently distributed. New distributed applications are evolving, such as PVM [Sund90]. In this new environment, we expect new opportunities for LD. There are also significant improvements in hardware architectures. Massively Parallel Processors (MPP) and mobile computers may be yet another reason for wider LD use.

In our research we mainly target operating system issues in LD. New operating systems tend to be small microkernels with various servers, usually running in user space [Golu90,Rozi92,Cher90]. On top of microkernels there are emulations of various operating system personalities, such as BSD UNIX [Golu90], AT&T UNIX System V [Rozi92,Cher90], OS/2 [Phel93] and Sprite [Kupf93].

Contemporary applications communicate using two types of interface: message based and (Distributed) Shared Memory (DSM), as presented in Figure xx.

Compared to earlier systems, which supported process migration, microkernels also provide task migration. Task migration means moving a task abstraction from one node to another, transparently to the task itself as well as to the other tasks in a distributed system. A UNIX process is mapped onto a task with a single thread of control. Process migration requires the transfer of more state, such as file descriptors and enabled signals.

What are the implications of this environment for LD design? We claim the following issues to be important:

- LD should be performed on top of the microkernel, which is a common denominator for all running OS personalities.
- Besides traditional information, such as processing and paging, LD should consider information on network IPC and distributed shared memory.
- LD should exploit all possible mechanisms, such as process migration and particularly Task Migration (TM). Migration has often been neglected in favor of initial placement.

In order to verify these hypotheses, we designed and implemented LD on top of the Mach microkernel. Our work consists of three phases. First we implemented task migration [Milo93], which is our main mechanism for LD. User space task migration allows us to experiment with various address space migration strategies. It provides the necessary flexibility without a significant performance penalty, while retaining full transparency. It should be noted, though, that compared to process migration we have to deal with less state. For the time being we do not migrate UNIX processes, which remain on the source node. As the second step, we exported the information necessary for LD decisions [Milo93a]. We instrumented Mach to account for network IPC and DSM. Finally, we implemented a distributed scheduler which uses task migration as a mechanism and takes advantage of load information in order to make more accurate scheduling decisions [Milo93a].

In this paper we report on the experiences gained during our project. The experiences are related to load distribution as well as to microkernel issues. The remainder of the paper is organized in the following manner. In Section zz we present background and previous work. In Section yy we discuss the design and implementation of our LD scheme. Lessons learned are summarized in Section xx. Finally, in Section zz we draw conclusions and suggest future work.

Background and Previous Work

Due to the implementation character of the project, particularly task migration, our research is on the border between the fields of distributed scheduling and operating systems. There is therefore a lot of previous work in this area; we mention only the work most closely related to our research. In this section we describe the Mach microkernel and mention the important LD systems that influenced our work.

Mach

In this subsection we briefly mention a few important Mach characteristics and describe the Mach NORMA version. Interested readers may consult the extensive Mach literature for more information [Blac92,Boyk93]. The Mach microkernel is well known for its portability. It has been ported to various multiprocessor computers (shared memory with uniform and non-uniform access, and non-shared memory, such as MPP architectures) and to distributed systems.

Its extensibility is proved by the extensive research conducted on top of Mach, such as X-kernel protocols [Orma93] and scheduler activations [Bart93], although some experiments showed that Mach does not fit the requirements of all architectures sufficiently well [Kupf93]. There are particular Mach versions extended for real-time [Toku90] and fault tolerance [Chen90]. Its modularity is expressed by the minimal set of supported abstractions: the task and thread abstractions for processing, ports and messages for IPC, and memory objects for VM.

The Mach microkernel is transparently extended to a distributed system with NORMA support for network IPC [Barr91] and distributed shared memory.

NORMA IPC is an in-kernel implementation of the user space network message (Netmsg) server [Juli89]. While the Netmsg server provides the functionality, the in-kernel network IPC is an attempt to improve performance. It is optimized for short in-line messages (less than 128 bytes) and large out-of-line messages (one or two pages). In Mach, out-of-line messages are logically copied, avoiding physical copying. The NORMA network IPC provides a distributed capability space, taking care of notifications for deleted capabilities and of reference counting.

The NORMA distributed shared memory, called XMM (eXtended Memory Management), is an in-kernel reimplementation of the user space DSM [Fori89]. XMM extends the consistency semantics of the current pagers (e.g. the inode and default pagers) to a distributed environment. Allowing the existing pagers to support multiple kernels without XMM would require significant changes, incurring incompatibility. XMM overcomes this problem by transparently interposing between the pager and multiple kernels.

Previous Work

One of the most complete load balancing schemes, including process migration and sophisticated load information dissemination, was developed for the MOS(IX) system [Bara85]. MOS(IX) is one of the first microkernel-like systems, since it is divided into two layers.

The precopy technique was implemented in the V kernel [Thei85], one of the first task migration implementations on a message-passing kernel. The precopy technique improved the freeze time (the period during which the process is not active), but it negatively influenced overall system performance. A significant performance improvement for address space transfer was achieved by the copy-on-reference technique introduced in Accent [Zaya87]. The address space is virtually mapped, and pages are transferred only when they are referenced.

Sprite process migration contributed the idea of the home node that maintains the state of the process [Doug91]. Load distribution is supported through the use of parallel make. Other interesting issues concern the flushing technique for address space migration, and optimizations based on its relationship to the file system.

Relevant research in the area of distributed scheduling is the load balancing work conducted by Ferrari and Zhou [Zhou88]. The authors investigated load indices for various scheduling strategies. Of particular interest is the work done by Krueger, whose PhD thesis represents an excellent overview of the issues involved in the field of load distribution [Krue88]. The author compared load balancing and load sharing, preemptive and nonpreemptive load distribution, and other objectives of load distribution in distributed systems.

Cabrera measured typical process execution times [Cabr86]. He found that most UNIX processes are short-lived:
more than 78% of the observed processes had a lifetime shorter than 1 s, and 97% shorter than 4 s. Of historical importance is the work by Eager et al. [Eage86], which demonstrated that even a small amount of information can lead to dramatic performance improvements.

More recent research closely related to ours is the following. Load balancing in Chorus is based on process migration, but otherwise it has a background similar to our work [Phil93]. Similar problems are solved: migrating capabilities, threads, etc. The Stealth project introduced depressing the priority of the processing and VM activities of incoming tasks [Krue91]. This requires modifications to the inode and default pagers. It is the opposite of what we do: we do load balancing, targeted at clusters and MPP architectures, and therefore we try to distribute all types of load (processing, VM and IPC). The goal of Stealth is load sharing on autonomous workstations; the incoming load (processing and VM) is therefore depressed so that it does not influence the local computations. In Locus Transparent Network Computing (TNC), migration remains at the OS personality level, involving the kernel layer only when necessary [Zajc93]. Interesting contributions of TNC are Vprocs and the work on streams migration. Extending TNC with our TM scheme would provide a complete migration solution.

Load Distribution Design and Implementation

Our LD scheme consists of three major elements: task migration, load information management and distributed scheduling. Each part is briefly described in Subsections xx, yy and zz respectively. More detailed descriptions can be found in [Milo93,Milo93a]. Subsection xx overviews the implementation history and environment.

Task Migration

Our task migration deals with the Mach task abstraction, leaving the OS personality abstraction (e.g. a UNIX process) on the source machine. This introduces a kind of home dependency, since most OS personality calls are redirected back to the source node.

We designed TM in user space. Unfortunately, some modifications to the kernel are necessary. These are, however, minor (200 lines of code) and originate from a Mach limitation with regard to the task and thread kernel ports. The kernel ports represent kernel objects. User tasks control kernel objects by sending a message to the corresponding send capability for the kernel port. The message is intercepted by the kernel, which recognizes that it is directed to a kernel object; instead of queuing the message, the appropriate kernel procedure is invoked on behalf of the object. Kernel ports do not have corresponding receive capabilities, since these are owned by the kernel. With the current Mach interface it is therefore not possible to migrate the task and thread kernel ports from within user space. We provide two additional calls for interposing the task and thread kernel ports. The interpose calls take the interpose port as an argument. The interpose port replaces the original task kernel port, which is returned as an output parameter of the interpose call. After the interpose call, only the caller has the send capability for the migrated task, and the communication directed towards the task ends up in the original task kernel port, which now resides in the capability space of the calling task. After migration, the original kernel port is interposed back, and the collected messages are restarted in order to invoke the appropriate actions on behalf of the new migrated task instance.

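The text does not give the names or exact signatures of the two interposition calls; as a minimal sketch only, assuming hypothetical names and the in/out parameters described above, they could be declared as follows.

    /* Hypothetical prototypes for the two kernel-port interposition calls
     * described above; the real names and signatures in the modified kernel
     * are not given in this text. */
    kern_return_t task_interpose_kernel_port(
            task_t       task,              /* task whose kernel port is replaced   */
            mach_port_t  interpose_port,    /* port supplied by the calling server  */
            mach_port_t *old_kernel_port);  /* out: the original task kernel port   */

    kern_return_t thread_interpose_kernel_port(
            thread_t     thread,
            mach_port_t  interpose_port,
            mach_port_t *old_kernel_port);

    /* Typical use during migration (sketch):
     *   task_interpose_kernel_port(victim, my_port, &orig);
     *   ... migrate the task, collecting messages that arrive on orig ...
     *   then interpose the original port back onto the new task instance
     *   and restart the collected messages. */
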
Our TM scheme is transparent; we can migrate at any point in time. If a thread in the migrated task is executing a system call, it is aborted and brought to a clean point where it does not contain any kernel state. This is supported by the Mach thread_abort() call. Typical examples are sending or receiving messages, or page faults. In each case, the action is either awaited, if it is guaranteed to finish within a short time, or interrupted otherwise. If it was interrupted, the system call is restarted after migration, which is handled by the system call library.

There are no limitations on the system calls a task may issue, which is not true of most other user space implementations, such as Condor [Litz92]. Like other user space implementations, our TM scheme is easy to extend and modify. Like in-kernel implementations, we do not sacrifice performance, transparency or functionality. Therefore, we manage to combine most of the good characteristics of both kernel and user space implementations. Our TM scheme significantly benefits from the Mach network IPC and DSM. The DSM support is a prerequisite when a part of the task address space is shared or needs some kind of consistency (not provided by the default pager), as in the case of mapped files.

We implemented two kinds of TM servers, a Simple Migration Server (SMS) and an Optimized Migration Server (OMS). OMS supports various address space transfer strategies, such as copy-on-reference, precopy, flushing and eager copy, in a way similar to the implementations in systems such as Accent, the V kernel, Sprite and MOS(IX). However, in OMS these strategies are supported in user space (unless there is a need for DSM, where we rely on NORMA DSM). SMS was recently reimplemented in the kernel. The integration of SMS into the kernel is discussed in Section xx. Only SMS is used for the LD experiments, due to its simplicity and robustness. OMS is too complex, and the in-kernel migration is not mature enough. A more detailed description of SMS and OMS can be found in [Milo93].

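As an illustration only, the following condensed C sketch shows the overall sequence a user space migration server such as SMS goes through, under our reading of the text; error handling and the details of capability transfer are omitted, and migrate_state_to() is a placeholder, not a Mach call.

    #include <mach/mach.h>    /* header names vary across Mach releases */

    /* Placeholder for the server-to-server state transfer. */
    extern kern_return_t migrate_state_to(task_t task, mach_port_t dest_server);

    kern_return_t migrate_task(task_t victim, mach_port_t dest_server)
    {
        kern_return_t kr;

        kr = task_suspend(victim);          /* freeze all threads of the task */
        if (kr != KERN_SUCCESS)
            return kr;

        /* 1. Interpose the task and thread kernel ports (see the sketch in
         *    the Task Migration subsection) so that messages sent to the
         *    task are collected and can be restarted later.
         * 2. Abort the threads to a clean point so interrupted system calls
         *    can be restarted after migration.
         * 3. Transfer the address space, e.g. by walking it with vm_region()
         *    and vm_read(), or by mapping it copy-on-reference via NORMA XMM.
         * 4. Enumerate the capability space with mach_port_names() and move
         *    the capabilities to the new task.
         * 5. Extract thread state with thread_get_state() and recreate the
         *    task and its threads on the destination node. */
        kr = migrate_state_to(victim, dest_server);
        if (kr != KERN_SUCCESS) {
            (void) task_resume(victim);     /* migration failed: continue locally */
            return kr;
        }

        /* 6. Interpose the original kernel ports onto the new task instance,
         *    restart the collected messages and resume the new instance. */
        return KERN_SUCCESS;
    }
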
Load Information Management

Our load information management scheme is similar to other implementations, such as TNC [Zajc93] and MOS [Bara85]. It consists of information collection, dissemination and negotiation. It differs from the other schemes in the level at which it is performed, and in the kind of information it is based on. Beside processing, which is regarded as the prime factor for distributed scheduling, we also consider information on VM and IPC. We instrumented Mach to collect information on network paging and network IPC. Information is collected at the node and task level. At the node level we account for the number and size of network messages, pagein/pageout requests and the number of in/out migrations.

Our scheme for information collection is an attempt to trade off the completeness of the collected information against the costs and minimal modifications to the underlying kernel. Therefore, we do not collect all possible information at the task level, but only the amount that is simple and cheap enough to collect. We account for the number of remote messages sent from the task and the number of remote pageins. It is too costly to account for received messages in the current Mach implementation, because there is no back pointer from the Mach port to the task. Therefore, the only option is to store the information in the ports and then to loop through the entire task capability space, which may be time-expensive if many tasks have to be searched, and space-expensive, since we need to reserve space in each port instead of only once in the task. It is both impossible and inappropriate to account for remote pageouts on behalf of a task. It is impossible because there are no back pointers from the page to the task. It is inappropriate because, if a few tasks share the same page, all of them would be charged for the pageout, although in reality only one of them might have accessed the page, in some sense bearing the responsibility for it.

We currently use only three nodes for measurements, and therefore we disseminate information with a circulating token for the strategies that rely on periodic information exchange. Due to the small number of nodes, there is no significant overhead in the periodic circulation of the token; however, as soon as we start using more nodes, we shall switch to the dissemination scheme used in MOS(IX) [Bara85].

Distributed Scheduling

Our distributed scheduler (LD server) is a user level program running on every node in the LD cluster. Nodes can join or leave the cluster at any time. The LD server is a Mach application which communicates using Mach IPC. In order to allow for the comparison of different LD strategies, the LD server is highly parameterized. We can specify the following input parameters to the scheduler:

- The type of strategy: no LD is the case when no load distribution is performed; the random strategy is activated if the local load exceeds a threshold, in which case tasks are distributed randomly onto other nodes in the cluster without considering the load of the node we migrate to; the sender initiated strategy polls a specified number of nodes in order to find a suitable one; in the receiver initiated strategy, nodes whose load is lower than a threshold try to find an overloaded node; the symmetrical strategy is a combination of the sender and receiver initiated strategies; etc.
- The level of considered information: no information at all; only processing; or information on processing, network IPC and XMM.
- Strategy-specific parameters, such as thresholds, the frequency of load collection and dissemination, the server priority, the user time a task must accumulate before being considered for TM, etc.

The LD server periodically inspects the load on the local node using the load information management interface. If the local load crosses a threshold value, the LD server acts according to the specified strategy. Task migration is our basic mechanism for LD. Based on the specified criteria a task is selected and, if appropriate, migrated to a suitable node. Depending on the underlying strategy, negotiation takes place before migration in order to find a destination node, or to verify its suitability and willingness to accept the migrated task.

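The following C fragment sketches the periodic decision loop of the LD server for the sender initiated strategy. The structure and all names (query_local_load, select_candidate_task, poll_nodes, migrate_task_to, and the parameter fields) are assumptions for illustration, not the actual server interface.

    #include <unistd.h>

    struct ld_params {          /* strategy-specific parameters (assumed names) */
        unsigned poll_interval; /* seconds between load inspections             */
        double   cpu_threshold; /* local load above which we try to migrate     */
        int      poll_limit;    /* maximum nodes polled during negotiation      */
    };

    struct node_load {          /* subset of the collected information          */
        double cpu;             /* processing load                              */
        long   net_msgs;        /* NORMA IPC messages                           */
        long   net_pageins;     /* XMM pageins                                  */
    };

    /* Placeholders for the load information and migration interfaces. */
    extern struct node_load query_local_load(void);
    extern int select_candidate_task(const struct node_load *load);  /* -1: none */
    extern int poll_nodes(int limit, const struct node_load *load);  /* -1: none */
    extern int migrate_task_to(int task, int node);

    void ld_server_loop(const struct ld_params *p)
    {
        for (;;) {
            struct node_load load;
            int task, node;

            sleep(p->poll_interval);             /* e.g. once per second        */
            load = query_local_load();

            if (load.cpu <= p->cpu_threshold)    /* not overloaded: do nothing  */
                continue;

            task = select_candidate_task(&load); /* enough accumulated user time,
                                                    suitable IPC/XMM profile    */
            if (task < 0)
                continue;

            node = poll_nodes(p->poll_limit, &load); /* negotiate for a willing,
                                                        underloaded destination */
            if (node >= 0)
                (void) migrate_task_to(task, node);  /* TM is the basic mechanism */
        }
    }
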
Implementation History and Environment

We started our LD project in the fall of 1991. We migrated a task for the first time in May 1992. Migration became stable after a few months of improvements and with more robust versions of the kernel. Load information management was finished by the end of 1992. We have been conducting load distribution experiments since the beginning of 1993. New strategies are continuously added for new experiments.

The underlying environment consists of 3 PCs interconnected via Ethernet. Each PC contains a 33 MHz i80486 processor with 8 MB RAM and a 400 MB SCSI disk. For the implementation and experiments we used Mach NORMA versions 7, 12, 13 and 14, and the UNIX server UX28. Currently we are moving to the OSF/1 environment. The routines concerning the kernel modifications fit in a file of 200 lines of C code, most of which are comments, debugging code and assertions. The SMS server has around 600 lines of code, while the LD server and the load information management part consist of around 1800 lines.

For most of our measurements we used an artificial load and our computing environment as a testbed. At the very beginning we intended to use real applications; however, it was quite hard to find an appropriate distributed application. PVM was being ported to Mach but was still not available. Most other applications would have required porting to Mach. UNIX applications are not suitable since we do not support process migration. Therefore, we implemented an Artificial Load Task (ALT). ALT is an attempt to provide simple and reproducible behavior for processing, IPC and VM. Load is specified by parameters for processing, (network) IPC and VM (XMM) access. Processing is specified as the amount of CPU user time, IPC as the number of messages, and XMM as the number of network pageins. VM access and IPC are equally distributed over the processing time, and their amount per unit of time is constant and the same for all ALTs. The load is specified separately for each node in the cluster.

If not otherwise noted, the mean interarrival time of the ALTs on each node is computed by dividing the mean user time of the tasks (given by a hyperexponential distribution) by the load for the specific node. The ALT interarrival times are drawn from an exponential distribution with the above computed mean interarrival time. The node of the server which an ALT communicates with is randomly chosen and remains the same throughout the task's lifetime. In the presented experiments, tasks are migrated at most once. Tasks cannot be migrated before they accumulate a given amount of user time.

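The interarrival rule can be made concrete with a small sketch (our reconstruction; the actual load generator is not listed in this text): with a mean task user time of 4 s and a target load of 0.8, the mean interarrival time is 4 / 0.8 = 5 s, and individual interarrival times are drawn from an exponential distribution with that mean.

    #include <math.h>
    #include <stdlib.h>

    /* Draw an exponentially distributed value with the given mean. */
    static double exp_random(double mean)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0,1) */
        return -mean * log(u);
    }

    /* Next ALT interarrival time for one node: the mean interarrival time is
     * the mean task user time divided by the target load of that node. */
    double next_interarrival(double mean_user_time, double target_load)
    {
        return exp_random(mean_user_time / target_load);
    }
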
Lessons Learned

Our research project is a practical one. During its life, we dealt with various kernel and load distribution issues. This section summarizes the lessons we have learned while developing and using LD on top of Mach. Throughout the section, we refer specifically to our TM scheme and to the Mach microkernel; however, similar reasoning could be applied to other microkernels as well. We would like to discuss the following observations:

- Task migration is easy to implement and insulate.
- User space and in-kernel TM are similar regarding performance and implementation; they differ in maintainability, interface and kernel integrity.
- Task migration is not necessarily inferior to initial placement.
- Task migration is not always enough.
- Network IPC is powerful but also complex to implement and optimize.
- Information on IPC and VM improves scheduling decisions.

The presented measurements have been conducted on our testbed of three computers. Although it may seem that having only three computers limits the generality of our conclusions, we do not believe that the rather small configuration has a significant impact in this case. Our scheme is designed to be scalable. However, the poor performance and functionality of network IPC may mean that some of the improvements we have achieved could vanish with a better network IPC implementation. We believe, though, that just as newer and faster processors have not eliminated the need for load distribution, a better implementation of network IPC would not render the communication optimizations useless.

Task Migration is Easy to Implement and Insulate

By implementing two user space task migration servers, as well as in-kernel task migration, we have demonstrated that there are no significant complexities involved in TM design and implementation. The changes to Mach for a user space implementation are minor. The main effort in moving user space TM into the kernel consisted of changing the interface.

Mach's object orientation allows for a much easier design. Transparently accessing Mach objects across the network simplifies the effort of controlling (extracting, inserting and accessing) the various objects in the migrated task.

Insulating process/task migration from the other modules in the system may be hard to achieve. Douglis reports that it is hard to insulate process migration from the other modules in Sprite, due to the significant dependencies [Doug91]. Compared to other process/task migration experiences, we had no problems with insulating task migration from the other modules. We make a potentially unfair comparison between process and task migration, because there are differences that may affect our conclusions. We do plan to upgrade our scheme with process migration as well, and only then can we firmly prove some of our claims. However, there are also enough reasons to predict a lot of the behavior already, since most of the state is already contained within the task. For instance, an OSF/1 AD task contains most of the process state that resides within the emulator. Although the OSF/1 environment is moving towards the emulator-free task [Pati93], in which case all of the process state will reside in the process manager, similar activities would be required to extract and migrate the required state.

While designing and implementing TM, the NORMA interface (network IPC and XMM) has been of significant help. A similar development history exists for other systems. For example, process migration in Sprite [Doug91] relies on the file system implementation [Welc90], particularly on the migration of open I/O streams. Load distribution for the V kernel [Stum88] is based on the related task migration work [Thei85]. Similarly, we make use of NORMA IPC and XMM for the implementation of our TM and LD. The simplicity of our task migration stems from the NORMA support.

The Mach NORMA version provides a message based and a memory mapped interface, compared to the traditional UNIX-like open-close-read-write-ioctl interface. Douglis compares the UNIX interface with the message based interface and concludes that, although on the surface it seems that the message based interface simplifies state management problems, there is not really much difference between the two types of interface from the complexity point of view [Doug91]. He admits, though, that message based systems make migration somewhat easier. Another Sprite implementor argues that a message based interface is a general tool, providing powerful functionality at the lower level [Welc93]. The UNIX-like interface, however, being at a higher level, provides for various optimizations for a particular implementation, e.g. of a file system. Sprite, a representative of the UNIX-like interface, proved this opinion to be correct.

Our findings are in line with both Sprite implementors. We argue that our Mach task migration was easy to implement, but performance optimizations are limited to improving NORMA functionality.

User Space vs. In-kernel Task Migration

After two versions of migration servers, we implemented task migration in the kernel. The in-kernel task migration is achieved by moving SMS into the kernel. This required switching to the appropriate in-kernel interface, and we also applied some optimizations. The performance is better than that of the user space SMS server, and it is in the range of the performance of OMS. The performance improvement is not a consequence of running in kernel space, but rather of the various optimizations. The dominant costs for migration are the network messages. Since SMS is deliberately unoptimized and relies only on the LD server on one node, it involves a new message for each state transfer, incurring high costs. This was, however, a matter of choice, and is certainly not inherent to a user space implementation. In OMS, for example, there are servers on each node which cooperate in packing and unpacking the task state in messages, thereby improving performance. For the in-kernel migration, the kernels play the role of the servers, and therefore allow for the optimization of packing more state into one message. OMS has not been moved to the kernel, since its optimizations (e.g. packing more capabilities into a message) would incur too much complexity for an in-kernel implementation. Besides, the overhead of the initial costs is not significant, therefore the tradeoff of simplicity for performance is reasonable.

Most of the complexity of the SMS in-kernel reimplementation involves optimization: e.g., if there are several send capabilities for a port, only one send capability and the reference count are migrated, instead of migrating each send capability; the thread states are combined into an array instead of migrating each state separately; etc. The implementation effort was a matter of hours and the debugging a matter of days. The ease of the in-kernel implementation further demonstrates the modularity of the Mach kernel, although it also owes a lot to the expertise gained during the user space development.

The in-kernel TM was developed because of the interpose calls. In a way, the interpose calls represent an attack on kernel integrity, and they certainly break the current Mach interface. Therefore, we do not believe that these calls could ever progress into the standard Mach distribution, contrary to our intention to provide TM to a wider community. Performing kernel port interposition as a part of the in-kernel migration does not raise such concerns.

There are advantages and disadvantages to both user space and in-kernel task migration. User space migration provides for easier experiments with various address space migration strategies. For example, in OMS we support flushing, precopy, copy-on-reference and other strategies. On the other hand, since in some cases DSM is needed, it is better to use the in-kernel address space transfer or to have some other kind of DSM support, such as the one described in [Fori89]. From the design point of view, we do not regard the difference between in-kernel and user space migration as a significant one. Having both implementations available, we shall collect some more data and observe the (dis)advantages of each version.

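As an illustration of the packing optimizations mentioned above, the following structures sketch how the in-kernel migration could coalesce the transferred state; the layout and names are ours, and only the basic types come from the standard Mach headers.

    #include <mach/mach.h>      /* mach_port_t, natural_t */

    /* Several send rights for the same port travel as one entry plus a
     * reference count, instead of one message per capability. */
    struct packed_send_right {
        mach_port_t  name;      /* port name in the source task's IPC space */
        unsigned int refs;      /* number of user references to migrate     */
    };

    /* All thread states are combined into one array instead of being sent
     * in separate messages. */
    struct packed_thread_state {
        int           flavor;   /* thread state flavor (machine dependent)  */
        unsigned int  count;    /* number of state words that follow        */
        natural_t    *state;    /* register block for this thread           */
    };
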
Task Migration is not Necessarily Inferior to Initial Placement

There has been analytical, practical and simulation work demonstrating that process migration is not suitable for LD [Eage88,Doug91,Shiv92]. Most researchers start their reports by stating that process migration, despite being a useful tool, still involves significant implementation complexity and incurs costs sufficient to render it useless for LD. Even one of the researchers who considers the advantages of process migration [Krue88] is mostly worried about the operating system complexities involved with process migration.

We try to show that the above observations are not always valid for our task migration. We showed that our task migration implementations do not involve any significant complexity; on the contrary, we provide both user space and in-kernel task migration facilities without significant effort. With regard to costs, since our task migration supports a copy-on-reference data transfer scheme, the initial migration costs are comparable with the costs of initial placement. Run-time costs for message redirection can be neglected. The only significant costs may be due to the lazy copying of pages. It is true that an inconvenient access pattern could result in higher costs for TM. On the other hand, only referenced pages are transferred, and we may potentially predict the task behavior based on the size of its address space and its resident set size.

On the positive side, task migration provides advantages due to the additional information a running task can provide, such as information on whom the task communicates with, and the task execution time. Using TM is an implicit way to filter out short-running tasks that are not suitable for migration. Besides, migration is the only appropriate mechanism for the receiver initiated strategy, a preferred strategy for systems with higher load [Krue88].

We conducted an experiment to compare task migration with initial placement. Once started, a task can provide more information on its behavior. For example, we can find out whom and where a task communicates with, and potentially migrate the task to the appropriate location. Initial placement does not offer such an advantage, since we know nothing about the task behavior in advance. Instead of initial placement, which would involve accessing the code pages remotely, we use remote invocation (sending just the parameters, while the text segment is not migrated but resides on each node), which represents the lower bound cost for initial placement.

Figure xx presents the average execution time of the tasks as a function of the amount of IPC they perform per second of user time. The tasks are started on all three nodes as described in Subsection xx. The nodes are unequally loaded with workloads of 0.8, 0.5 and 0.1. Initial placement is event driven: when a task is created, it is initially placed on an underloaded node. In the case of migration, tasks can be migrated if they execute longer than 800 ms. Load distribution is started each second and, if appropriate, a task is migrated. We can notice that the sender initiated algorithm (based on TM) does not perform much worse than initial placement, while the sender initiated algorithm that also considers information on IPC slightly outperforms initial placement for more intensive IPC traffic. All three cases are better than the case when no LD is performed.

The reasons why we do not obtain even better performance improvements are twofold.
The first reason has to do with NORMA IPC. On an overloaded host, NORMA IPC is a bottleneck leading to unacceptable migration times. Whereas on an underloaded host (from the NORMA IPC point of view) migration lasts a few hundred milliseconds, on an overloaded host it goes up to a few seconds, which increases the average task execution time and decreases the benefits obtained by considering IPC. The second reason has to do with the small number of nodes that we are currently using. In the case of initial placement, there is a high probability that a suitable node with respect to IPC traffic is selected. With more hosts, this probability would be much lower, and the benefit of having correct information on communication in the case of task migration would be more pronounced.

Task Migration is not Always Enough

Extending an environment running on Mach to a distributed system is not painless. There are many issues that require significant modifications in order to satisfy functionality and performance requirements. Some of them include the file system, process migration, the emulator, etc. Our task migration scheme is a Mach level mechanism. In the case that the user applications need a UNIX interface, TM is not enough. We observed that the performance penalties due to home dependency can be significant enough to render task migration useless (for experimental results see [Milo93]). In such cases it is necessary to provide migration of the OS personality abstraction. In other cases, though, task migration can still suffice as the lone migration mechanism. On MPPs, for example, it may be inappropriate to have an OS personality server on each node (due to paging, memory consumption, etc.). In such cases, providing the Mach interfaces may be enough, possibly upgraded with a library which would emulate the UNIX-like VM. Many applications conform to this requirement, e.g. simulations or numerical computations, which generally do not need file access (except for the initiation and termination phases) or other UNIX interfaces.

There are a few possibilities for combining task and process migration. Once we select a task to be migrated, we can inform the OS personality to migrate its related abstraction; e.g., after the task is migrated it may be appropriate also to migrate the OS personality abstraction. The reverse strategy is also possible: an OS personality decides to migrate its process abstraction, consults the load information module, and then performs task and process migration. Finally, a process abstraction can remain on the source host if the performance requirements and the task behavior allow for that.

Files are more related to the process than to the task abstraction, but we are certainly concerned with the influence of files on task migration. In the current TM scheme, files are transparently supported by using DSM. Since files are mapped into the task address space, after task migration they are mapped as shared, retaining the necessary consistency. Therefore, we can still benefit within the UNIX environment, although only the task, and not the process abstraction, is migrated. Functionally this solution suffices, but it is unacceptable from the performance point of view.

There are two related activities in this area. In the OSF/1 AD operating system a significant effort is invested in Distributed File System (DFS) support [Zajc93]. Although targeted at NORMA architectures in general, its main application is foreseen for MPP architectures.

Another file system is being developed particularly for clusters [Roga93]. Since most of the functionality in either of the two DFS implementations is primarily based on XMM and NORMA IPC, we do not see any limitations on process/task migration. Similarly to the existing implementation, files will be supported by remapping memory mapped areas from one node to another (thereby retaining consistency), and by migrating the capabilities that represent open files.

Compared to the Sprite experience, where migrating I/O streams (files, devices, etc.) is a primary source of complexity, migrating Mach tasks with open files is mainly supported by NORMA IPC. However, performance is still a concern. While in Sprite there is an optimized interaction between the file system and process migration, the question remains open of how well the current Mach interface (XMM and NORMA IPC) matches the needs of a DFS. It should be thoroughly tested and measured in order to verify all the performance issues related to DFS activities and to the interaction with task/process migration. Our current efforts to provide at least a partial process migration which would interact with the cluster DFS [Roga93] are in line with this investigation.

Network IPC is Powerful but also Complex to Implement and Optimize

Our experience with NORMA IPC shows that it has been neither thoroughly optimized nor fully debugged. While using it we encountered bugs and performance drawbacks. As a message based interface, it is more convenient than RPC, in particular with respect to transparency and scalability; however, its design and implementation are fairly complex. Once it is correctly supported, there remains the big question of providing adequate performance.

We noticed a few performance drawbacks. For example, in the case that a send-once capability (a special case of a send capability, used for only one message transmission) stems from a migrated receive capability, the existing implementation requires an additional message to the source node for this send-once capability, unless there is already a proxy port and the correct node has been set. The need for the additional message is quite a frequent case, resulting in a double cost for the send-once capability transfer.

We also observed a problem related to stressed IPC activity. TM can be significantly delayed when IPC is stressed. Delays can be up to a few seconds, which is unacceptable. This raises the question of prioritized IPC. Since NORMA IPC has not been optimized (currently it runs in a stop-and-wait mode; there is no sliding window, no piggy-backing, etc.), the behavior might be somewhat exaggerated; however, even with an optimized implementation, delays would probably be observable.

Prioritizing Mach processing and paging is done in the Stealth project [Krue91]. The migrated-in tasks have a depressed priority in order to reduce their influence on the local tasks. Similar reasoning could be applied to IPC.

The presented problems and drawbacks are intended to criticize neither the message based interface nor its current implementation, but rather to show examples of the complexities involved in providing a correct and well-performing message based interface. Despite our objections, we benefited significantly from the existing NORMA IPC, and our task would have been much harder without it.

Information on IPC and VM Improves Scheduling Decisions

Traditionally, processing load has been the main source for load distribution decisions. Other factors, such as files, device access, virtual memory, networking, etc., have been considered, but rarely used in practice. In microkernels, all operating system personality abstractions, such as files, flavors of IPC and device access, are mapped to kernel abstractions, e.g. in Mach: tasks/threads, memory objects and IPC objects. This provides for unified accounting of operating system resources. In particular, it is relevant to the kernel abstractions extended to distributed environments, namely network IPC and DSM. Accounting for only three elements (processing, VM and IPC) simplifies our information management, while providing a more accurate insight into the system load.

In order to demonstrate the potential benefits of the additional information, we conducted some preliminary experiments with and without considering network IPC and XMM. Figures xx and xx present the average execution time of 12 experimental tasks (ALTs) started on one node every 500 milliseconds and distributed over 3 nodes. The average execution time is presented as a function of the processing and communication the tasks perform. The amount of communication is expressed as the number of network messages/pageins that each task exchanges with a server on a randomly chosen node. Messages are sent in pairs (16 bytes and 256 bytes), and pages are 4 KB in size. The amount of processing is expressed as a function of the CPU time the tasks consume (user time). The figures demonstrate the advantage of considering the additional information. The upper surfaces represent LD without considering the additional information, while the lower ones represent LD that does. There is an obvious advantage to considering it. The benefit depends on the amount of communication that an application performs. For example, the improvement can be over 100% if there is a significant amount of communication with the remote server.

There are two more interesting details. Peaks in the surfaces demonstrate cases where there is a small amount of processing while the IPC/XMM activity is stressed; NORMA IPC becomes a bottleneck in such cases. The surfaces intersect for zero IPC/XMM activity, demonstrating that our scheme incurs no significant overhead in considering the additional information.

It should be pointed out that our goal is not to migrate each client towards the server. An obvious example is a file server, where we certainly would not allow migration of each client to the file server. In such cases the LD server would not be started on the file server node, preventing any task from being migrated towards it. We are rather targeting new distributed applications, with a variety of clients and servers. Under some circumstances we shall decide to migrate a task. If it also happens that the server node is a convenient destination for migration, then we can take advantage of our knowledge of the DSM and network IPC information.

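The kind of per-task information involved, and one simple way a destination choice could be biased by it, can be sketched as follows; the structure and function names are assumptions for illustration, not the actual interface.

    /* Per-task information made available by the instrumentation (assumed
     * names). */
    struct task_load_info {
        unsigned int user_time_ms;      /* accumulated user CPU time          */
        unsigned int remote_msgs_sent;  /* NORMA IPC messages sent remotely   */
        unsigned int remote_pageins;    /* pageins satisfied over XMM         */
        int          peer_node;         /* node the task talks to most, or -1 */
    };

    /* Prefer the peer node if it is also underloaded; otherwise fall back to
     * the least loaded node found during negotiation, if any is suitable. */
    int choose_destination(const struct task_load_info *t,
                           const double node_load[], int nnodes,
                           double threshold)
    {
        int best = -1, i;

        if (t->peer_node >= 0 && t->peer_node < nnodes &&
            node_load[t->peer_node] < threshold)
            return t->peer_node;

        for (i = 0; i < nnodes; i++)
            if (best < 0 || node_load[i] < node_load[best])
                best = i;

        return (best >= 0 && node_load[best] < threshold) ? best : -1;
    }
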
This experiment demonstrates that in some cases LD decisions can be improved by using information on network IPC and XMM. Performance improvements are gained in two ways. The obvious advantage comes from the fact that moving a client towards the server improves the average task execution time. The less obvious advantage is a consequence of the limited IPC bandwidth, which, similar to the processor, can become a bottleneck. Distributing load from the client node to other nodes improves performance because communication is parallelized; it still ends up at the same server node, but now in parallel from a few client nodes.

There is no intent to introduce any complex or perfect load information management. We propose only simple and straightforward extensions of existing schemes. Out of the task candidates for migration and the host candidates for destination, we select those that also satisfy the communication criteria. The necessary information is obtained either locally (for task selection) or during negotiation (for node selection). Neither poses significant overhead. Accounting information for a large number of machines only influences the amount of information that it is reasonable to store within the particular structures. Therefore, either a limited number of nodes can be considered (on a first-come-first-served policy), or the space for storing the information can be added dynamically.

In order to inspect how our LD scheme performs for various strategies, we repeated the experiments with a few well-known strategies, as presented in Figures xx. The figures show the average execution time of the ALTs as a function of load for the symmetrical, sender initiated, random and receiver initiated strategies. The parameters are the same as described in Section xx, except for the workload, which varies from 0.1 to 1.5 on one node and is 0 on the two other nodes. A performance improvement is observed for all strategies.

Conclusion

Our research consists of the design and implementation of load distribution on top of the Mach microkernel. It was achieved in the following three phases: task migration, instrumenting Mach to provide information on network IPC and XMM, and distributed scheduling. The conducted research allowed us to gather the experiences presented in Section xx.

Once more we would like to point out the benefits of using additional load information, and the comparison between task migration and initial placement. We demonstrated that considering additional information on network IPC and XMM for distributed scheduling can improve the average execution time of the tasks. We demonstrated improvements for various scheduling strategies. We also revisited the usefulness of task migration for LD compared to initial placement. We showed instances where the information provided by a running task leads to better scheduling decisions and decreases the average execution time. Our research is not intended for underutilized workstation environments, but rather for new MPP and cluster architectures and new distributed applications. We believe that such a new environment should make use of all possible mechanisms for load distribution.

We foresee the following work as a prospect for our future research:

- Providing at least a primitive form of process migration and experimenting with its relationship to a new distributed file system.
- Working with real applications, most probably with the PVM port on Mach and suitable PVM applications.
- Repeating the conducted experiments in a larger installation with more computers than the current environment.

Availability

All referenced programs and utilities are available upon request. We also hope to provide our in-kernel task migration to OSF or CMU in order to merge it into the source code tree.

Acknowledgements

We would like to thank CMU and OSF for providing us with Mach, as well as for their continuous support.

The OSF/1 source code license is provided to us as a part of our collaboration with the Grenoble OSF Research Institute. The following is an alphabetically ordered list of the paper reviewers, whose significant help is appreciated: David Black, Henry Chang, Orly Kremien, Reinhard Lüling, Laurent Philippe, Bryan Rosenburg, Nikola Šerbedžija, Brent Welch and Roman Zajcew. We are particularly indebted to Alan Langerman for his profound review and sharp criticism. Our non-native English has been ironed out by Simon Patience. Special thanks go to the program committee for the extensive comments which improved our article in many respects. Prof. Nehmer provided support and kept the whole project alive.