The following paper was originally published in the Proceedings of the USENIX Microkernels and Other Kernel Architectures Symposium, San Diego, California, September 20-23, 1993.


              Experimentation with a Reconfigurable Micro-Kernel

                  Bodhisattwa Mukherjee and Karsten Schwan
                           College of Computing
                      Georgia Institute of Technology
                          Atlanta, Georgia 30332
                   e-mail: {bodhi, schwan}@cc.gatech.edu


Abstract
--------

Since the implementation of operating system functions can significantly affect the performance of parallel programs, it is important to customize operating system functionality for specific application programs. In this paper, we propose an architecture for a reconfigurable micro-kernel. This kernel can be configured at compile-time and at execution-time to suit varying application requirements. Such a micro-kernel can be used for the development of high performance operating systems and applications for parallel and distributed systems. We have implemented the reconfigurable micro-kernel on multiple parallel machines, including a 32-node GP1000 BBN Butterfly, SGI multiprocessors, and a 32-node Kendall Square supercomputer.


1. Introduction
---------------

Earlier research in operating systems for parallel and distributed systems has demonstrated the need for reconfiguration and customization of operating system mechanisms to suit each class of application programs. For example, the set of functions provided by HYDRA was designed to enable the user of C.mmp to create his own operating environment without being confined to predetermined mechanisms and policies. Similarly, for real-time applications executing on shared memory multiprocessors, the object-based operating system kernels described in earlier work offer several representations of objects and object invocations to support the different degrees of coupling, task granularities, and invocation semantics existing in real-time applications. Such experiences in the real-time domain are mirrored by work in multiprocessor scheduling that demonstrates the importance of using application-dependent information or algorithms when making scheduling decisions. In addition, tradeoffs in program performance due to the use of alternative synchronization constructs have been demonstrated for various parallel architectures, including the experimental C.mmp and Cm* multiprocessors, interconnection-network-based machines like the Ultracomputer, and bus-based machines like the Sequent Symmetry. For example, for bus-based UMA machines, Anderson showed that spin locks can put a significant load on the shared bus, so that efficient use of the parallel machine requires a back-off strategy for spin locks similar to the one used by low-level Ethernet devices. Such a back-off strategy is a typical example of dynamic reconfiguration, since it dynamically alters the back-off delay.
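As a concrete illustration of such a strategy, the sketch below shows a spin lock whose back-off delay grows under contention. It is a minimal sketch written for this discussion -- the lock type, the test_and_set primitive, and the delay bounds are assumptions -- and it does not reproduce Anderson's code or the lock interface of the micro-kernel described later.

    /* Illustrative sketch of a spin lock with exponential back-off; the
     * names and constants are hypothetical and are not taken from
     * Anderson's implementation or from the micro-kernel in this paper. */

    typedef struct { volatile int held; } lock_t;

    /* assumed to be an atomic test-and-set provided by the hardware      */
    extern int test_and_set(volatile int *word);

    #define MIN_DELAY 4          /* initial busy-wait, in loop iterations */
    #define MAX_DELAY 4096       /* bound keeps worst-case latency sane   */

    void backoff_lock(lock_t *l)
    {
        int delay = MIN_DELAY;
        int i;

        while (test_and_set(&l->held)) {
            /* dynamic reconfiguration: the back-off delay grows with
             * contention, much like Ethernet's exponential back-off      */
            for (i = 0; i < delay; i++)
                ;                                /* spin */
            if (delay < MAX_DELAY)
                delay *= 2;
        }
    }

    void backoff_unlock(lock_t *l)
    {
        l->held = 0;
    }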
Lastly, recent research addressing efficient memory models for shared memory and for distributed memory machines has made it clear that support for multiple semantics of memory consistency can result in improvements in parallel program efficiency.

Developing micro-kernel based operating systems for parallel and distributed systems has been a recent thrust in operating system research, and it has led to systems structured as collections of user-level server processes or threads layered on a minimal kernel. Our research is complementary to such efforts, in that we investigate the reconfigurability of different abstractions implemented as part of the micro-kernel itself. Specifically, we investigate:

- How can a micro-kernel support multiple configurations (configurability)?
- How can a micro-kernel be structured so that it can be reconfigured to suit the changing runtime requirements of application programs (reconfigurability)?
- Is such dynamic reconfiguration advantageous? Namely, do the runtime costs incurred by dynamic reconfiguration justify the possible gains due to such reconfiguration (adaptability)?
- What functionality is required to build an adaptive micro-kernel?

The next section first discusses a few sample examples of reconfiguration in operating systems, then presents the structure of the operating system kernel being built by our group. Section 3 describes the basic architecture of a reconfigurable micro-kernel. A few measurements of the basic micro-kernel functionalities are presented in Section 4. Section 5 compares our work with related research and, finally, Section 6 concludes the paper and presents some future directions.


2. Reconfiguration in Operating Systems
----------------------------------------

The feasibility of dynamic reconfiguration has already been demonstrated in the real-time domain, and for parallel applications with regard to their runtime synchronization on large-scale multiprocessors. Specifically, past research demonstrates that a single parallel program exhibits dynamic changes in its locking pattern during execution (i.e., it exhibits changes in the distributions of locking requests and of the number of threads waiting for a lock). Therefore, it is possible to utilize information about such patterns to dynamically change the behavior and, consequently, the runtime performance of the synchronization primitives implemented as part of the operating system kernel.

Similarly, earlier research demonstrates that dynamic reconfiguration is often useful in process/thread management and scheduling. For example, dynamic process migration and load balancing are often used in existing operating systems for improved performance. Such dynamic load balancing techniques monitor the load of individual processors and migrate threads or processes from one processor to another at run-time (in contrast, a static load balancing technique considers the load of a processor only at process creation time and does not migrate the process after initial placement).

Recent studies show that improvements in processor speed and, therefore, application code performance are not matched by similar improvements in the performance of basic operating system functions (system call, trap, page table entry change, context switch). We argue that application performance can be improved significantly by matching operating system kernel functionalities with specific application characteristics and specific hardware.
Specifically, we posit that:

- The run-time behavior of applications differs across multiple applications and across multiple phases of a single application (this has been demonstrated experimentally in earlier work).
- Operating system kernel configurations can provide high-performance applications with the policies and mechanisms best suited to their characteristics and to their target hardware.
- Dynamic kernel reconfiguration can improve performance by satisfying each application's changing behavior.
- Efficient application state monitoring can detect changes in application requirements that are not known prior to program execution time.

As stated above, the object of our research is to improve operating system performance by using application information to select and/or customize the operating system facilities used by the application program. We are constructing a dynamically configurable kernel -- the Kernel Toolkit (KTK) -- that offers the flexibility to customize and/or reconfigure its mechanisms and policies to suit an application's runtime needs and the underlying multiprocessor hardware.

The Kernel Toolkit provides support for the construction of operating system kernels suited to specific classes of application programs. KTK offers a small set of abstractions (built-in object classes) and policies that have been extended and customized for several parallel application programs, including real-time programs like those required for autonomous robotics, highly dynamic parallel applications like parallel branch-and-bound codes, and numerical simulations. KTK is also the basis for our current research addressing configurable communication protocols for parallel and real-time systems.

As shown in the accompanying figure, the whole system is based on a micro-kernel which consists of two well-defined portions -- a machine-dependent portion which encapsulates the underlying hardware, and a portable machine-independent portion built on top of the interface provided by the machine-dependent portion. The micro-kernel provides a basic set of functionalities primarily at the threads level, whereas, among other things, KTK (the object layer) offers the notions of classes, objects, and invocations on top of the micro-kernel. It supports various types of objects. For example, an object of class TADT (threaded abstract data type) is active and is used for the representation of parallelism in applications. Similarly, a TASK object is like an Ada task in that it consists of a single active thread of control. KTK also provides support for fragmented objects using a distributed shared abstraction (DSA) module. This module permits programmers to define and create encapsulated, fragmented objects, and it offers low-level mechanisms for implementing efficient, abstraction-specific communications among object fragments. The KTK interface and/or the micro-kernel interface can be used to build applications and operating systems (by means of servers) suited to a specific application class and hardware.
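To picture this layering, the following sketch invents a pair of interfaces -- one for raw micro-kernel threads and one for KTK objects. None of these declarations are the actual KTK or micro-kernel interfaces, which this paper does not list; they are assumptions made purely for illustration.

    /* Hypothetical illustration of the two layers; these names are
     * assumptions made for this sketch, not the real KTK interface.      */

    /* micro-kernel layer: raw threads within one address space           */
    typedef struct mk_thread mk_thread_t;
    mk_thread_t *mk_thread_create(void (*entry)(void *), void *arg,
                                  int stack_size /* static attribute */);

    /* object layer (KTK): classes, objects, and invocations              */
    typedef struct ktk_object ktk_object_t;
    ktk_object_t *ktk_create(const char *class_name);   /* e.g. "TADT"    */
    int           ktk_invoke(ktk_object_t *obj, const char *operation,
                             void *args);

    void example(void)
    {
        /* an active TADT object represents parallelism in the
         * application; a TASK object would instead wrap a single
         * active thread of control                                       */
        ktk_object_t *work = ktk_create("TADT");
        ktk_invoke(work, "put_work", 0);
    }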
Past research in dynamic reconfiguration of parallel computations for improved performance has concentrated primarily either at the application level or at the object level in the operating system. For example, MACH and CHORUS support operating system reconfiguration at the server level. The CHOICES operating system supports reconfiguration at the object level. PRESTO supports reconfiguration at the application level. Similarly, KTK supports object reconfiguration (using attributes and a special kind of object called policies) and various flavors of invocation at the object level. We argue that it is more appropriate to incorporate reconfiguration at the micro-kernel level:

- Since the operating system and the applications are layered on top of the micro-kernel, both layers benefit from any performance improvement caused by reconfiguration at the micro-kernel level.
- The micro-kernel encapsulates the hardware and so is the most suitable candidate to capture hardware reconfiguration (network reconfiguration, device configurations, etc.) and changes in hardware.

The next section describes the architecture of such a reconfigurable micro-kernel.


3. The Basic Architecture
-------------------------

The basic components (see figure) of the micro-kernel being developed by our group are:

- Processor and thread management
- Low-level memory management
- Thread synchronization/communication management
- Minimal exception handling
- Customized monitoring support

The micro-kernel provides specific support for configurability, reconfigurability, and adaptability of its different components:

- It defines a set of abstraction-specific attributes. We are concerned with attributes that characterize a component's internal implementation, such as its data structures and some of its methods. Such attributes are altered dynamically by applications, either explicitly or implicitly through a ``user-provided adaptation policy''. We assume that changes in attributes may be performed either synchronously or asynchronously with method invocations. This requires the introduction of two additional time-dependent properties of attributes: (1) attribute mutability and (2) attribute ownership. An attribute is mutable whenever its current value may be changed. For example, a lock object's attribute specifying its waiting policy (not its scheduling policy) is permanently mutable because it may be changed at any time. However, its scheduling policy is likely to be immutable whenever threads are waiting on the lock, due to the potentially large expense of reorganizing internal lock data structures like thread waiting queues.

- It provides support for efficient customized state monitoring. Such information (about kernel components) may be used at the application level in the form of an ``adaptation policy'' to build adaptive operating system components. For example, a simple adaptive lock samples the lock contention (in terms of the number of waiting threads) periodically at a user-defined rate, and adjusts its implementation-specific attributes (such as spin-time, sleep-time, release) according to an adaptation policy.

- It permits an application to define adaptation policies that guide the reconfiguration of the kernel components it uses.

A further figure shows the structure of a ``closely coupled'' reconfigurable kernel component. Such components provide mechanisms to alter the implementation of one or more of their methods dynamically (at execution time). A reconfigurable component consists of its state, a set of mutable attributes, a set of acquisition methods (used by an external agent to explicitly acquire one or more mutable attributes for reconfiguration), a customized monitor module (to monitor a set of predefined state variables and attributes), and an adaptation policy (to guide the reconfiguration operation). The reconfigurable micro-kernel consists of a collection of such adaptive/reconfigurable components.
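The sketch below renders that structure in C. It is purely illustrative: the types, field names, and the acquisition function are assumptions made for this sketch, not the kernel's actual data structures.

    /* Illustrative model of a ``closely coupled'' reconfigurable component.
     * All names are hypothetical; they do not reproduce the kernel's API. */
    #include <string.h>

    typedef struct {
        const char *name;          /* e.g. "spin-time"                     */
        int         value;         /* current attribute value              */
        int         mutable_now;   /* attribute mutability at this instant */
        int         owner;         /* attribute ownership (holding agent)  */
    } attribute_t;

    typedef struct component component_t;

    struct component {
        void        *state;        /* component state (e.g. a lock word)   */
        attribute_t *attrs;        /* set of mutable attributes            */
        int          n_attrs;
        int         *monitor;      /* customized monitor: predefined state
                                      variables sampled at a given rate    */
        /* adaptation policy: reads monitored values, rewrites attributes  */
        void (*adapt)(component_t *self);
    };

    /* acquisition method: an external agent explicitly acquires a mutable
     * attribute and reconfigures it                                        */
    int acquire_and_set(component_t *c, const char *name, int new_value)
    {
        for (int i = 0; i < c->n_attrs; i++)
            if (c->attrs[i].mutable_now &&
                strcmp(c->attrs[i].name, name) == 0) {
                c->attrs[i].value = new_value;
                return 0;                     /* reconfigured               */
            }
        return -1;                            /* immutable or unknown       */
    }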
Another figure illustrates the conceptual structure of the micro-kernel. Each component contains a set of implementation-dependent attributes (IA) and a customized monitor module (CM). The reconfiguration of each individual component is guided by adaptation policies, which constitute the ``adaptation policy space''. A few sample adaptation policies are described elsewhere.

As demonstrated by earlier research, individual abstractions often cannot be configured independently. For example, Ousterhout demonstrated that co-scheduling considerably decreases inter-thread communication/synchronization overhead. Similarly, lock configurations and cache affinity affect scheduling overhead and scheduling policy, which in turn affect thread exception handling overhead. In NUMA multiprocessors, memory management policy has a significant impact on IPC overhead and scheduling policy. Hence, the reconfigurable micro-kernel provides support to jointly reconfigure a group of kernel components using the notion of attribute families. We define an attribute family to be a group, collection, or set of attributes belonging to more than one kernel component. The run-time support for attribute families includes simultaneous reconfiguration of more than one component, customized monitoring of attribute families designed to sense inter-abstraction interactions, and support for user-defined adaptation policies that adapt multiple abstractions simultaneously.

The remainder of this section briefly describes the individual components of the micro-kernel.

Micro-kernel threads: Like most other operating systems, the micro-kernel supports the thread abstraction for managing execution. It supports multiple threads of control within one address space by providing minimal support for thread management such as thread creation and termination. Each individual thread has its own program counter, its own stack, and a predefined set of hardware registers (currently, all of the hardware registers are used by each thread due to the lack of support for register partitioning). At the micro-kernel level, threads communicate only via shared memory, as opposed to the next layer -- the object layer -- which provides support for multiple inter-thread communication mechanisms such as object invocations, message passing (DSA), and RPC. Currently, the threads component has only one attribute -- the stack-size -- which is configured statically.

Synchronization component: The micro-kernel threads synchronize using reconfigurable locks. Such locks provide mechanisms for static and dynamic lock reconfiguration and permit applications to define their own adaptation policies. Reconfigurable locks contain two kinds of attributes: (1) wait attributes (spin-time, sleep-time, delay-time, timeout, etc.), which implement a spectrum of lock waiting policies, and (2) scheduling attributes (a registration attribute logging all threads desiring lock access, an acquisition attribute determining the waiting mechanism and policy to be applied to each registered thread, and a release attribute that grants new threads access to the lock upon release), which determine the way lock requests are served. The scheduling attributes of a reconfigurable lock determine the delay in lock acquisition experienced by threads, whereas its wait attributes specify the manner in which a thread is delayed while attempting to acquire the lock.
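The following sketch illustrates this split between wait attributes and scheduling attributes for a reconfigurable lock; the field and function names are invented for illustration and do not reproduce the lock's real interface.

    /* Hypothetical layout of a reconfigurable lock's attribute groups.    */

    typedef struct thread THREAD_T;

    /* wait attributes: how a thread is delayed while acquiring the lock   */
    typedef struct {
        int spin_time;     /* iterations to spin before blocking           */
        int sleep_time;    /* ticks to sleep between retries               */
        int delay_time;    /* back-off delay between probes                */
        int timeout;       /* give up after this long                      */
    } wait_attrs_t;

    /* scheduling attributes: how queued lock requests are served          */
    typedef struct {
        void          (*registration)(THREAD_T *t);  /* log a requester    */
        wait_attrs_t *(*acquisition)(THREAD_T *t);   /* pick the waiting
                                                        policy applied to
                                                        each registrant    */
        THREAD_T     *(*release)(void);              /* choose who gets the
                                                        lock upon release  */
    } sched_attrs_t;

    typedef struct {
        volatile int  held;
        wait_attrs_t  wait;      /* permanently mutable                    */
        sched_attrs_t sched;     /* immutable while threads are waiting    */
    } reconfig_lock_t;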
Scheduling component: The scheduling component of the micro-kernel schedules application threads onto allocated processors. The micro-kernel assumes the existence of a pre-allocated cluster of processors for each application. Earlier research demonstrates that parallel applications exhibit a wide range of scheduling requirements. For example, interactive applications require preemptive scheduling, whereas search applications (e.g., branch-and-bound algorithms) perform better using non-preemptive scheduling with a specific, application-dependent queue ordering scheme. The scheduling policies and scheduling mechanisms matching application requirements can differ widely, and can be expressed with a few dynamic, implementation-dependent attributes such as granularity, storage/retrieval policy, locality, preemption, and hints.

The granularity attribute of a scheduler defines the scheduling unit. Group scheduling decreases interprocess communication cost, thus improving the performance of applications that require frequent thread communication. However, group scheduling techniques exhibit higher scheduling overhead. Therefore, information regarding the pattern and frequency of communication among the application's threads can be used to determine the appropriate granularity of the thread scheduler.

The storage/retrieval policy attributes of a thread scheduler specify the scheme used to store and retrieve threads to and from the thread repository (the queue ordering scheme for run queues and free lists). Different applications require different queue ordering schemes for enhanced performance. For example, optimization codes using search methods are easily mapped to algorithm-specific queue ordering schemes (FIFO, LIFO), whereas applications implementing prioritized services require priority queues. It should be apparent that the use of queue structures supporting specific storage/retrieval policies is superior in performance to the use of queue structures able to implement multiple policies.

The locality attribute of a scheduler determines the structure of the thread repository (e.g., centralized or distributed) and the load balancing strategies used in distributed repositories. Especially on the KSR machine, global queue organizations are not suitable for scalable performance in thread scheduling and thread management, so that queues must be internally fragmented across different processors. Additional scheduler attributes must capture whether threads may be migrated across processors, which affects load balancing and cache performance. Alternative implementations of load balancing techniques have been shown to be useful in multiprocessors for increasing processor performance by decreasing processor idle time. However, certain applications exhibit performance degradation because load balancing strategies ignore the cache affinity of application threads. Therefore, for enhanced performance, application characteristics will determine the locality attribute of the thread scheduler.

Scheduling policies may be classified as enforcing (1) involuntary preemption, which means that a scheduler can preempt a thread at any time, (2) voluntary preemption, which means that threads must release processors voluntarily, or (3) no preemption, which implies that threads are never preempted. Such differences result in major changes in the implementation of thread context switching (saving or not saving signal masks, for instance); they arise, for example, between real-time and non-real-time schedulers. Preemption introduces extra overhead in scheduling cost. However, interactive applications and master/slave programs often exhibit improved performance with preemption. Furthermore, earlier work demonstrates performance improvements for some applications (especially applications consisting of threads that frequently synchronize and/or communicate) when scheduling techniques employ application-provided hints, such as discouragement hints and hand-off hints, in making scheduling decisions.
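The following descriptor, whose enumerators and field names are invented for this summary, collects the scheduler attributes just discussed; it does not reproduce the micro-kernel's actual data structures.

    /* Hypothetical encoding of the scheduler attributes discussed above.   */

    typedef enum { GRAN_THREAD, GRAN_GROUP }            sched_granularity_t;
    typedef enum { ORDER_FIFO, ORDER_LIFO, ORDER_PRIO } sched_order_t;
    typedef enum { REPO_CENTRAL, REPO_DISTRIBUTED }     sched_locality_t;
    typedef enum { PREEMPT_INVOLUNTARY, PREEMPT_VOLUNTARY,
                   PREEMPT_NONE }                       sched_preemption_t;

    typedef struct {
        sched_granularity_t granularity;  /* thread vs. group (co-)scheduling */
        sched_order_t       ordering;     /* storage/retrieval policy         */
        sched_locality_t    locality;     /* central vs. per-processor queues */
        int                 migrate;      /* may threads move across CPUs?    */
        sched_preemption_t  preemption;
        int                 use_hints;    /* honor discouragement/hand-off
                                             hints from the application      */
    } sched_attrs_t;

    /* e.g., a branch-and-bound search phase might choose:
     *   { GRAN_THREAD, ORDER_LIFO, REPO_DISTRIBUTED, 0, PREEMPT_NONE, 0 }   */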
Even though the micro-kernel defines a collection of default schedulers, such as a non-preemptive FIFO scheduler, a preemptive round-robin scheduler, and a priority scheduler, applications are encouraged to write their own customized schedulers (static configuration) for improved performance using the following framework:

    CLASS Scheduler is
        STATE internal_state is
            LOCK     sched_lock;
            int      node;
            int      active_threads;
            QUEUE_T  free_list;
            QUEUE_T *lqueue;
            THREAD_T scheduler_thread;
        END
        ATTRIBUTESET static_scheduler_attributes is
            int    no_of_readyqs;
            int    no_of_threads_per_proc;
            HINT_T hint;
            OPERATION global_init (..);
            OPERATION vproc_init (..);
            OPERATION thread_init (..);
            OPERATION get_thread (..);
            OPERATION put_thread (..);
            OPERATION empty_readyq (..);
            OPERATION proc_idle (..);
            OPERATION schedule (..);
    BEGIN
    END

The above class defines an application-specific scheduler. The internal state of the scheduler provides data structures to define and implement multiple scheduling policies. The attributeset of a scheduler consists of various levels of initialization operations (global_init initializes at the processor level, vproc_init initializes at the virtual processor level, and thread_init initializes at the thread level), different policy operations (get_thread to get the next thread to run, put_thread to return a thread to a bank of ready queues, empty_readyq to determine whether a ready queue is empty, and proc_idle to specify work for an idle processor), and a set of operations to handle various exceptions/interrupts (primarily used to implement preemptive scheduling based on inter-processor interrupts, timer interrupts, etc.). An application defines the scheduler(s) at pre-execution time and installs them on specific processors using a call of the form:

    install_scheduler(processor_number, scheduler)
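For illustration, the fragment below sketches what a customized non-preemptive FIFO scheduler might look like in C. The operation signatures and queue helpers are assumptions -- the paper does not give the framework's concrete C interface -- so this is a sketch of the idea rather than code for the actual system.

    /* A hypothetical customized scheduler, sketched in C. The operation
     * signatures and queue helpers below are assumptions.                 */

    typedef struct thread THREAD_T;
    typedef struct queue  QUEUE_T;

    extern THREAD_T *queue_get_head(QUEUE_T *q);    /* assumed helpers     */
    extern void      queue_put_tail(QUEUE_T *q, THREAD_T *t);
    extern int       queue_empty(QUEUE_T *q);

    static QUEUE_T *readyq;   /* single ready queue, set up in vproc_init  */

    /* get_thread: FIFO retrieval -- no priorities, no preemption          */
    THREAD_T *fifo_get_thread(void)
    {
        return queue_get_head(readyq);
    }

    /* put_thread: return a thread to the tail of the ready queue          */
    void fifo_put_thread(THREAD_T *t)
    {
        queue_put_tail(readyq, t);
    }

    /* proc_idle: an idle processor simply polls the ready queue           */
    void fifo_proc_idle(void)
    {
        while (queue_empty(readyq))
            ;
    }

    /* At pre-execution time the application would bundle these operations
     * into a scheduler descriptor and bind it to a processor, e.g.:
     *     install_scheduler(3, &fifo_scheduler);                          */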
The performance of many applications can be improved if different, concurrently executable parts of their computations can be scheduled with differing thread schedulers -- termed heterogeneous scheduling. To demonstrate performance gains, we have studied distributed implementations of the ``work queue'' abstraction (implemented using the DSA module of the object layer) in a parallel branch-and-bound application. In this program, performance improves when a non-preemptive FIFO scheduler is used for application threads while a preemptive priority scheduler is simultaneously used for asynchronously executable threads implementing the communications among the different fragments of the work queue abstraction. Similarly, performance improvements are possible when a FIFO scheduler is used for application threads while a priority scheduler is used for the transmission of monitoring data collected about the application during its execution.

Other components: The exception handling component of the micro-kernel permits applications to define their own customized exception handlers. The micro-kernel also provides a set of default handlers for various exception conditions. The memory handler module provides support for low-level dynamic storage allocation and is configurable in that it permits applications to vary the minimum chunk of memory allocation for performance improvement.


4. Performance
--------------

A prototype reconfigurable micro-kernel has been implemented on a 32-node GP1000 BBN Butterfly, SGI multiprocessors, and a 32-node KSR. Since we are not currently interested in issues regarding protection, the prototype is implemented completely at the user level. The accompanying table lists the costs of a few basic operations provided by the micro-kernel implemented on two different machines -- the BBN Butterfly and the KSR. The operation thread-yield simply yields the processor to the next ready thread; the operation pingpong bounces the processor between two threads. The numbers listed in the table for spin locks and blocking locks are obtained from the reconfigurable lock configured as a spin lock and as a blocking lock, respectively; these numbers are comparable to those of primitive spin and blocking locks. The operations monitor-an-attribute and reconfigure-an-attribute are the basic attribute monitoring and attribute reconfiguration operations, respectively.

We have already investigated the possible application performance gains due to dynamic reconfiguration of locks. Using artificial workloads and a representative multiprocessor application, a parallel branch-and-bound program, on a 32-node BBN Butterfly multiprocessor, we demonstrate that:

- For improved performance, an application requires lock schedulers that are most appropriate for the locking patterns exhibited by its threads. An experiment with a client-server application on the BBN Butterfly multiprocessor demonstrates that priority and handoff lock schedulers outperform FIFO lock schedulers by 13%, thus demonstrating the importance of application-specific lock schedulers (configurability).
- The optimal waiting policy for a lock depends on the application's dynamically changing locking pattern. An experiment with reconfigurable locks and a workload generator on a BBN Butterfly multiprocessor demonstrates that dynamic reconfiguration of the waiting policy considerably enhances application performance (reconfigurability).
- Experimentation with the parallel branch-and-bound program and a Time-Warp kernel demonstrates that ``closely coupled'' adaptive locks outperform other available locks by up to 17%, thus demonstrating the need for adaptation in operating system abstractions (adaptability).

Repetition of the above experiments with the Travelling Salesperson application and the Time-Warp kernel on the KSR demonstrated similar results.

Our current research concerns the demonstration of similar performance gains from the dynamic configuration of thread schedulers. Specifically, for improved performance and ease of programming, we posit that an application should be provided with the scheduler that best suits its needs. We are studying the following issues. First, we are investigating the runtime configuration of a single thread scheduler -- termed scheduler configuration -- performing real-time scheduling for Time-Warp discrete event simulations. Second, as noted earlier, the performance of many applications can be improved if different, concurrently executable parts of their computations can be scheduled with differing thread schedulers (heterogeneous scheduling). Currently, we are studying the scheduling characteristics of different applications to investigate whether the heterogeneous scheduling technique results in application performance improvements. Initial results of the experimentation on dynamic scheduler reconfiguration and heterogeneous scheduling are documented in a companion paper.
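To convey how measurements of this kind can be taken, the sketch below shows a pingpong-style microbenchmark written against POSIX threads as a stand-in for the micro-kernel's own thread interface; it is not the measurement program used for the table above, and on a multiprocessor both threads would need to be confined to a single processor to measure context-switch cost.

    /* Sketch of a pingpong-style microbenchmark, using POSIX threads as a
     * stand-in for the micro-kernel's thread interface.                   */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define ITERS 100000

    static volatile int turn = 0;                /* whose turn to run      */

    static void *bounce(void *arg)
    {
        int me = (int)(long)arg;
        for (int i = 0; i < ITERS; i++) {
            while (turn != me)
                sched_yield();                   /* hand the CPU over      */
            turn = 1 - me;
        }
        return 0;
    }

    int main(void)
    {
        pthread_t t0, t1;
        struct timeval start, end;

        gettimeofday(&start, 0);
        pthread_create(&t0, 0, bounce, (void *)0);
        pthread_create(&t1, 0, bounce, (void *)1);
        pthread_join(t0, 0);
        pthread_join(t1, 0);
        gettimeofday(&end, 0);

        double usec = (end.tv_sec - start.tv_sec) * 1e6
                    + (end.tv_usec - start.tv_usec);
        printf("pingpong: %.2f usec per switch\n", usec / (2.0 * ITERS));
        return 0;
    }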
5. Related Research
-------------------

Some of the notions introduced in PRESTO, CHOICES, and CHAOS are similar to our work. PRESTO uses the encapsulation property of objects to build a configurable parallel programming environment. Structurally, PRESTO's synchronization object is somewhat similar to an adaptive lock. However, a synchronization object does not support dynamic reconfiguration of implementation-dependent attributes, adaptation, or object state monitoring. CHOICES is an example of an object-based reconfigurable operating system which can be tailored to a particular hardware configuration or to a particular application. The focus of CHOICES is to structure the operating system kernel in an object-oriented way, whereas the focus of our research is to build a micro-kernel that can be reconfigured at the thread level. Even though we use the object model at the application level, we do not use objects to build the run-time system. CHAOS (a Concurrent, Hierarchical, Adaptable Operating System supporting atomic, real-time computations) is a family of object-based real-time operating system kernels. The family is customizable in that existing kernel abstractions and functions can be modified easily. CHAOS supports reconfiguration at the object level and is structured on top of the reconfigurable micro-kernel.


6. Conclusion
-------------

The major contributions of this paper are twofold: First, it proposes an architecture for a reconfigurable micro-kernel, including the introduction of implementation-specific mutable attributes, object attribute monitoring, implementation-dependent adaptation policies, and attribute families. Second, it describes the implementation of reconfigurable kernel components such as locks and schedulers. In this paper, we demonstrate the need for reconfiguration at the micro-kernel level by implementing prototype systems on the KSR and Butterfly multiprocessors. Currently, we are using the system to experiment with different component-specific adaptation policies. In the future, we will investigate the group reconfiguration problem using the concept of attribute families.

This paper does not address issues regarding operating system protection. Currently, we do not assume any hardware protection boundary between the user and the kernel. We hypothesize, however, that the concepts will hold in the presence of a protection boundary; but, as with any other micro-kernel based operating system, applications will experience performance degradation due to frequent crossings of the protection boundary.