################################################ # # # ## ## ###### ####### ## ## ## ## ## # # ## ## ## ## ## ### ## ## ## ## # # ## ## ## ## #### ## ## ## ## # # ## ## ###### ###### ## ## ## ## ### # # ## ## ## ## ## #### ## ## ## # # ## ## ## ## ## ## ### ## ## ## # # ####### ###### ####### ## ## ## ## ## # # # ################################################ The following paper was originally published in the Proceedings of the USENIX Microkernels and Other Kernel Architectures Symposium San Diego, California, September 20-23, 1993 For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: office@usenix.org 4. WWW URL: https://www.usenix.org Fast Interrupt Priority Management in Operating System Kernels Daniel Stodolsky J. Bradley Chen Brian N. Bershad School of Computer Science Department of Computer Science Carnegie Mellon University and Engineering 5000 Forbes Avenue University of Washington Pittsburgh PA 15213 Seattle WA 98195 danner@cs.cmu.edu bershad@cs.washington.edu bchen@cs.cmu.edu ABSTRACT In this paper we describe a new, low-overhead technique for manipulating processor interrupt state in an operating system kernel. Both uniprocessor and multiprocessor operating systems protect against uniprocessor deadlock and data corruption by selectively enabling and disabling interrupts during critical sections. This happens frequently during latency-critical activities such as IPC, scheduling, and memory management. Unfortunately, the cycle cost of modifying the interrupt mask has increased by an order of magnitude in recent processor architectures. In this paper we describe optimistic interrupt protection, a technique which substantially reduces the cost of interrupt masking by optimizing mask manipulation for the common case of no interrupts. We present results for the Mach 3.0 microkernel_operating_system,_although_the technique is applicable to other kernel architectures, both micro and monolithic, that rely on interrupts to manage devices. --------- This research was sponsored in part by The Advanced Research Projects Agency, Information Science and Technology Office, under the title ``Research on Parallel Computing'', ARPA Order No. 7330, issued by ARPA/CMO under Contract MDA972-90-C-0035, by the Xerox Corporation, and by Digital Equipment Corporation. Bershad was partially supported by a National Science Foundation Presidential Young Investigator Award. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of ARPA, XEROX, DEC, the NSF, or the U.S. government. 1 1 Introduction This paper describes a new technique, optimistic interrupt protection, that reduces interrupt management cost over conventional methods. While modern processor architectures have led to substantial overall performance improvements, operating systems have received significantly less benefit than application code [Anderson et al. 91, Chen & Bershad 93, Ousterhout 90]. One processor function that has not scaled well with processor speed is interrupt management. Operating systems use interrupts to control scheduling and I/O, and use interrupt masking to guarantee integrity of system resources shared across interrupt levels. This approach was efficient in many previous processor architectures (e.g, VAX), where the cost changing interrupt levels was small - generally less than ten instructions [Morse et al. 82, Levy & Eckhouse 89]. In modern architectures, however, interrupt masking may be up to an order of magnitude more expensive, contributing to poorer performance of system code. Optimistic interrupt protection avoids the performance penalty of interrupt mask manipulation while preserving the semantics of the interrupt model. We have implemented optimistic interrupt protection in the Mach 3.0 microkernel for several different processor architectures. On the Omron Luna88k, we observed a 50% reduction in interrupt management overhead, resulting in a 5.3% speedup for interprocess communication. The rest of this paper describes the technique and its performance. In Section 2 we review the basic problems introduced by interrupts, discuss the general model of interrupt handling into which optimistic interrupt protection fits, and motivate the need for a high performance mechanism. In Section 3 we describe the use and implementation of optimistic interrupt protection. In Section 4 we discuss the performance of our approach. In Section 5 we discuss related work. Finally, in Section 6 we present our conclusions. 2 Interrupt management Operating systems generally rely on interrupts to respond to asynchronous events. Because interrupts introduce concurrency into the operating system kernel, system-level mechanisms are necessary to avoid deadlocks and protect system data structures from inadvertant concurrent accesses. Interrupt masking is a common technique for data protection in the presence of asynchronous events. Access to a potentially concurrent data is protected by setting the processor interrupt level to prevent all events that could potentially alter the data in question. 2 ______________________________________________________________________________ |_Machine_______________|Processor_________________|__Instructions__|_Cycles_| | Luna88k | Motorola 88100 (25MHz) | 68 | 96 | | Flamingo | Alpha 21064 (150MHz) | 25 | 21 | | DECstation 5000/120 | R3000 (20MHz) | 65 | 80 | | DECstation 5000/200 | R3000 (25MHz) | 14 | 14 | | 386 PC | Intel 386DX-25 (25MHz) | 51 | 179 | |_Gateway_66V___________|_Intel_486DX2-66_(66MHz)__|_______51_______|___279__| Table 1: Overhead of changing the interrupt mask. Cycle counts are estimated, assuming no cache misses. Interrupt masking has been used successfully in a large number of operating systems, including Mach, Unix, VMS, and NT [Accetta et al. 86, Leffler et al. 89, Levy & Eckhouse 89, Custer 93]. It maps well onto a diverse array of hardware, from systems with a single interrupt level to processors with a rich interrupt structure [Bell et al. 82, Intel 90]. On a uniprocessor, no additional synchronization constructs are required. An important property of the interrupt masking model is that latency-sensitive events can preempt long-running low priority activities. Although alternatives to the interrupt model have been proposed [Cheriton 88, Massalin & Pu 89], simplicity, as well as the significant investment in existing system code and programmer experience provide significant economic incentives for preservation of interrupts as a model of system data protection. Traditionally, interrupt masking has been efficient, requiring only a few cycles. Unfortunately, the time required to modify the hardware interrupt level has not scaled with processor speed improvements. In pipelined processors, writing the processor interrupt mask typically requires a pipeline flush [Motorola 90, DEC 92]. In superscalar systems, interrupt level manipulations require scalar instruction issue, further limiting performance [Sites 92]. Many recent RISC CPU implementations provide only a part of the interrupt mask logic on the processor package, with the remainder of interrupt masking implemented by off-processor hardware [Motorola 90, DEC 92]. For these systems, interrupt masking is a three step process: 1) disable processor interrupts, 2) write the off-chip mask register(s), and 3) finally re-enable processor interrupts. The first stage requires a pipeline flush, and the second stage requires a potentially expensive off-chip access. This represents a significant increase in the relative latency of interrupt mask manipulations. Table 1 shows the cost of a general interrupt mask raise/lower pair within the Mach 3.0 microkernel on a variety of architectures. 3 3 Optimistic interrupt protection Optimistic interrupt protection exploits the fact that, in the common case, interrupts do not occur during critical sections. When a processor executing in the kernel enters a critical section, its sets a software interrupt mask, which indicates the interrupts that need to be masked. The hardware interrupt mask is not changed. In the uncommon case that a lower-priority interrupt does occur, the interrupt handler prologue constructs an interrupt continuation (described below), updates the hardware interrupt mask as specified by the software interrupt mask, and returns control to the interrupted activity. Updating the hardware interrupt mask when the interrupt actually occurs prevents additional logically masked interrupts from occurring until the deferred handler has been executed. Though not strictly necessary, this tends to simplify the code. Moreover, it occurs after the interrupt, and is therefore off the anticipated fast path. [This figure has been omitted for the ASCII version of the paper.] Figure 1: Conventional and Optimistic Interrupt Protection An interrupt continuation is a data structure containing the state of the system at the time an interrupt is deferred. The interrupt continuation contains sufficient information to service the interrupt condition at a later time. The amount of information is typically quite small (e.g, the program counter and interrupt vector). At the end of the critical section, the processor checks for an interrupt continuation. Normally there is none, and processing continues following the critical section. If an interrupt continuation does exist, the processor handles the corresponding interrupt condition before resuming ``normal'' computation (see Figure 1). The interrupt continuation handles the deferred interrupt, restores the hardware interrupt mask to its original level, and returns to the normal execution stream. As with traditional interrupt control, optimistic interrupt protection defers the execution of a masked interrupt handler until the end of the protected critical section. Unlike the traditional masking mechanisms, it requires that the (hardware and software) execution of the interrupt prologue code be both allowed and safe during protected sequences. As an example, if the interrupt prologue required a valid stack pointer, any code which places the stack pointer in an invalid state could not use optimistic interrupt protection. For the Mach 3.0 kernel, there are no such sequences on the Omron Luna88k, DECstation, or DEC Alpha. In the optimistic case (the protected sequence runs without interruption), protection overhead is minimal. One variable is set before the critical section, and at the end of the critical section that variable is reset and another variable (corresponding to the interrupt continuation) is checked. In the 4 Omron Luna 88k implementation, this corresponds to two stores, one load and a test, all of which are executed by the processor at full speed. Not only is protection overhead small, it also scales with processor performance. 4 Performance We have implemented optimistic interrupt protection in the Mach 3.0 kernel on the Omron Luna88k and Mips R3000 DECstation series. In both architectures, the interrupt continuation consisted of the register state at the time of the trap and a few additional words of state. Implementation took less than 3 days and and required no modification to assembler code routines. Table 2 shows the fast path overhead for interrupt management on these architectures. This sequence replaces the interrupt mask manipulations of Table 1. By using optimistic interrupt protection the length of the interrupt management path has been roughly halved. ___________________________________________________________________ |_Machine_______________|_Processor_______|_Instructions_|_Cycles_| | Luna88k | Motorola 88100 | 51 | 51 | | DECstation 5000/120 | R3000 | 31 | 31 | |_DECstation_5000/200___|_R3000___________|______31______|___31___| Table 2: Overhead of virtual interrupt mask manipulation. Cycle counts are estimated, assuming no cache misses. The Luna88k is a multiprocessor, so the virtual interrupt state is maintained on a per CPU basis. Most of the extra 20 cycles of overhead on the Luna88k are directly attributable to multiprocessor induced array indexing computations. To measure the impact of optimistic interrupt protection, we measured the performance of the Mach interprocess communication path. This path has already been highly optimized and contains only one interrupt protected critical section [Draves et al. 91]. Table 3 shows the performance of a cross address space null RPC with conventional and optimistic interrupt protection. The performance gain is larger than suggested by Tables 1 and 2 due to the idealized nature of those numbers. Both tables assume no TLB misses, cache misses, invalidation traffic or write buffer stalls; in practice, operating system code incurs a large contribution to cycles per instruction from all these factors [Chen & Bershad 93]. The reduction in path length and number of memory references in the interrupt management path therefore produces a greater than predicted benefit. 5 _____________________________________________________________________________ |_Machine_____________|Conventional_|__Optimistic_|_Speedup_|_Cycles__saved_| | Luna88k | 4400 | 4225 | 5.3% | 175 | | DECstation 5000/120 | 2140 | 1840 | 14% | 300 | |_DECstation_5000/200_|____1234_____|____1198_____|__2.9%___|______36_______| Table 3: IPC performance. Shown are the cycles for a null RPC with optimistic and conventional interrupt management. One cycle on the Luna88k is 40 nanoseconds, so IPC latency is reduced by 7 microseconds from 176 to 169. The cycle times on the 5000/120 and 5000/200 are 50 and 40 nanoseconds respectively. 5 Related Work One of the fundamental design decisions in an operating system is how to handle coordination between synchronous and asynchronous event handlers. Synchronous events happen within the context of the current execution stream (e.g, a system call), while a given asynchronous event can occur in the context of any instruction stream (e.g, I/O completion interrupts). Three approaches have been taken: interrupt masking as previously described, non-preemptable handlers, and lock-free synchronization. In the non-preemptable approach, both synchronous and asynchronous event handlers run uninterruptably to completion. The V kernel and many real time systems follow this approach [Berglund 86, Stankovic & Ramamritham 88]. Unfortunately, non-preemptable interrupt handlers impose serious constraints on handler structure: all handlers must be short to ensure that the latency of high priority events is low, and handlers cannot containing blocking operations (e.g. device status register polling). While this approach can lead to a high performance operating systems, difficulties inherent in this code style have prevented its widespread use. Recent research has demonstrated the use of highly concurrent lock-free data structures [Herlihy 90, Wing & Gong 92]. A system using lock-free synchronization can be free from data corruption, deadlock and priority inversion even in the case of interrupts [Massalin & Pu 91]. In addition, lock-free data structures provide the necessary synchronization for both multiprocessors and non-preemptive execution. Consequently, lock-free data structures suggest an attractive approach for structuring operating systems. Unfortunately, lock-free data structures can require special synchronization hardware that is neither generally available nor inexpensive [Herlihy_91_,__Motorola_90_](1)._ Recently, researchers (1)With only minimal hardware support (e.g, load linked / store conditional), implementation of nontrivial data structures (doubly linked lists, trees) requires substantial data copying. In addition, when these primitives have been implemented, their 6 have proposed architectural modifications to efficiently support lock-free operations [Herlihy & Moss 93]. The division of synchronization mechanisms into an inexpensive optimistic and (relatively more) expensive pessimistic case has been applied elsewhere. Restartable atomic sequences offers a mechanism for constructing efficient user-level synchronization primitives in a preemptively scheduled environment [Bershad et al. 92] . 6 Conclusions Optimistic interrupt protection is an application of optimistic synchronization to interrupt priority management in operating system kernels. It provides the same semantics as traditional interrupt management with much less overhead. A measurable speedup of the IPC path in the Mach 3.0 microkernel was obtained by using this technique. The method is applicable to any kernel that uses interrupt masking to guarantee data integrity. References [Accetta et al. 86] Accetta, M. J., Baron, R. V., Bolosky, W., Golub, D. B., Rashid, R. F., Tevanian, Jr., A., and Young, M. W. Mach: A New Kernel Foundation for Unix Development. In Proceedings of the Summer 1986 USENIX Conference, pages 93--113, July 1986. [Anderson et al. 91] Anderson, T. E., Levy, H. M., Bershad, B. N., and Lazowska, E. D. The Interaction of Architecture and Operating System Design. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 108--121, April 1991. [Bell et al. 82] Bell, C. G., Newell, A., and Siewiorek, D. P. Computer Structures:Principles and Examples, pages 110--128. McGraw-Hill, 1982. [Berglund 86] Berglund, E. J. An Introduction to the V-System. IEEE Micro, 10(8):35--52, August 1986. [Bershad et al. 92] Bershad, B. N., Redell, D. D., and Ellis, J. R. Fast Mutual Exclusion for Uniprocessors. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992. [Chen & Bershad 93] Chen, J. B. and Bershad, B. N. The Impact of _________Operating_System_Structure_on Memory System Performance. performance is often quite poor (alpha L_l/S_c) - often as long as an uncached load and store [DEC 92]. 7 In Proceedings of the 14th ACM Symposium on Operating System Principles, December 1993. To appear. [Cheriton 88] Cheriton, D. R. The V Distributed System. Commun. of the ACM, 31:314--333, March 1988. [Custer 93] Custer, H. Inside Windows NT. Microsoft Press, Redmond, WA, 1993. [DEC 92] DEC. DECChip 21064-AA RISC Microprocessor Preliminary Data Sheet. Digital Press, Manyard, MA, 1992. [Draves et al. 91] Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. Using Continuations to Implement Thread Management and Communication in Operating Systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 122--136, October 1991. [Herlihy & Moss 93] Herlihy, M. and Moss, E. Transactional Memory - Architectural Support for Lock Free Data Structures. In The 20th Annual International Symposium on Computer Architecture, May 1993. [Herlihy 90] Herlihy, M. A Methodology for Implementing Highly Concurrent Data Structures. In Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 197--206, March 1990. [Herlihy 91] Herlihy, P. M. Wait-free synchronization. ACM Transcations on Programming Languages and Systems, 13(1):124(26), January 1991. [Intel 90] Intel. i486 Microprocessor Programmer's Reference Manual. Intel, Mt. Prospect, IL, 1990. [Leffler et al. 89] Leffler, S. J., McKusick, M., Karels, M., and Quarterman, J. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1989. [Levy & Eckhouse 89] Levy, H. M. and Eckhouse, R. H. Computer Programming and Architecture: The VAX-11 (2nd Edition). Digital Press, Bedford, MA, 1989. [Massalin & Pu 89] Massalin, H. and Pu, C. Threads and Input/Output in the Synthesis Kernel. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 191--201, December 1989. [Massalin & Pu 91] Massalin, H. and Pu, C. A Lock-Free Multiprocessor OS Kernel. Technical Report CUCS-005-91, Department of Computer Science, Columbia University, 1991. [Morse et al. 82] Morse, S. P., Ravenel, B. W., Mazor, S., and Pohlman, W. B. Computer Structures:Principles and Examples, pages 615--646. McGraw-Hill, 1982. [Motorola 90] Motorola. MC88100 RISC Microprocessor User's Manual. Prentice Hall, Englewood Cliffs, NJ, 1990. 8 [Ousterhout 90] Ousterhout, J. K. Why Operating Systems Aren't Getting Faster As Fast As Hardware. In Proceedings of the Summer 1991 USENIX Conference, pages 247--256, June 1990. [Sites 92] Sites, R. L., editor. Alpha Architecture Reference Manual. Digital Press, 1992. [Stankovic & Ramamritham 88] Stankovic, J. and Ramamritham, K. A New Hard Real-Time Kernel. In Hard Real-Time Systems, pages 361--370. IEEE, 1988. [Wing & Gong 92] Wing, J. M. and Gong, C. A Library of Concurrent Objects. CMU CS TR 90-151, School of Computer Science, Carnegie Mellon University, 1992. 9