Turning the AIX Operating System into an MP-capable OS Jacques Talbot Bull Abstract This paper describes those MP features that Bull and IBM together introduced into the AIX operating system to support the Symmetric Multiprocessor machine marketed by Bull under the Escala name and by IBM under the RS/6000 Models G30, J30 and R30 names. The PowerPC architecture and the AIX operating system present some specific challenges. We present the major problems encountered and how they were solved. 1. Introduction This paper is quite broad in the sense that changing a sophisticated UNIX implementation such as the AIX operating system into an MP-capable system is a complex task. So we have tried to briefly describe the problems encountered and then to provide a more detailed description of some of them, which were specific to the AIX operating system and the PowerPC. 2. The Hardware Architecture: a summary Only the HW features which have a significant impact on the SW will be described here. The architecture is a so-called SMP, Symmetric Multi Processor based on PowerPC [IBM94]. An SMP is a system where several CPUs share a common memory address space and I/O subsystem. There is no memory local to CPUs (except caches which are almost invisible to SW). The Symmetric attribute is perhaps not the most significant one since most machines in the MP category to-day are symmetric, in the sense that all processors get equal access to all I/Os. The exceptions to this rule are primarily low-end Intel-based MP servers derived from PC technology or machines dedicated to specific needs (e.g. real-time). So SMP is better described as "Shared memory with HW coherency". It means that all CPUs see the "good old programming model" of a single memory with a more or less uniform access time, i.e. not depending on the addresses accessed. Moreover, all CPUs, when reading the "same" address, get the same value. In presence of copy-back caches, it is not so easy to achieve this simple property; this is the purpose of the HW cache coherence protocol, named according to the MESI acronym which designates the 4 possible states of a cacheline in the state machine (Modified, Exclusive, Shared, Invalid). The SW can safely ignore this complexity, assuming that the HW is properly functioning. The same characteristics of uniform access time and coherent memory allow to run a single copy of the operating system on the machine. So all CPUs can share a single queue of ready-to-run execution threads, resulting into better system load balancing. Getting applications for the platform is pretty simple. There is actually no porting activity in the general case, since applications can rely on the traditional assumptions that they have made since computers are with us. Binary compatibility with the uniprocessor applications is the rule. In contrast, architectures such as MPP (Massively Parallel Processors) and other kinds of tightly coupled clusters, where the above two characteristics are not present, generally require that applications have some knowledge of the HW architecture. 2.1 PowerScale architecture This specific SMP architecture is called PowerScale. There are up to 8 PowerPC (601, 604 or 620) CPUs accessing a common memory through a cross-bar for data and a bus for addresses. A cross-bar is a data exchange device modeled after a telephone cross-bar switch where each agent has a private data path to the interconnection HW and multiple connections can occur simultaneously. An I/O bridge allows access to two MCA busses. The PowerScale architecture is optimal from a price- performance standpoint for the target we were shooting at, i.e. a mid-range SMP optimized for 4 CPUs, still efficient at 8 CPUs and sellable in monoprocessor. 2.2 Caches Each of the 8 PowerPC CPUs has a (relatively) small internal (32 or 64 Kbyte) Level-1 cache and a large external (1 or 2 MByte) Level-2 cache. Each cache is organized in "lines". Data transfers and coherency updates take place in unit of lines. A cache line is 32 or 64 bytes wide, depending on the PowerPC model. Both caches are copy-back (sometimes called write-back), meaning that when a write is performed, the modified value is written only in the cache, and pushed in lower memory hierarchy levels only when required, typically when the "dirty" cache line is needed for other data. The other option is write-through, where main memory is updated on each write. This technique is much simpler but considerably less efficient. In a copy-back cache, when several copies of a "clean" (unmodified) line exist in several caches, and one of these is modified, there are two options for other copies: update or invalidate. Write-invalidate is best from a performance standpoint for data which will probably not be used in the near future by the CPUs where the line is invalidated. Write-update is best when the probability of reuse is high. In most cases, for UNIX applications and kernel pages, write-invalidate is the preferred mode since most lateral misses (cache to cache) are related to a working set migration which is a one-way event and seldom associated with real data sharing. Since the PowerPC architecture implements only write-invalidate, there is anyway no choice to make. It is possible, HW-wise, to tag pages as cached (copy-back or write-through) or uncached. Except for some phases of bootstrap in AIX, all memory is tagged as cached and copy- back. 2.3 Atomic operations To construct the locks that SMP SW needs, an atomic operation is needed. It is not the traditional test_and_set or compare_and_swap instructions of CISC processors that are used. The PowerPC, like many RISC architectures, implements load_and_reserve and store_conditional instructions such that if a reservation bit is "broken" between the Load and the Store operations, the Store fails. These two instructions allow the efficient implementation of all the atomic operations that concurrent SW needs, including test_and_set and compare_and_swap. 2.4 Memory order model Until recently, UNIX multiprocessors had a strongly ordered memory model. This means that the order of loads and stores as seen by the program cannot be changed between the output of a CPU and the memory system, where the other CPUs see the effect of these operations. Modern architectures are weakly ordered. The HW is allowed to reorder loads and stores between the CPU and the memory. This is done to optimize performance, when the first operation has to go to main memory for completion while the second one can be rapidly globally performed in the Level-1 cache. In that case, the second is globally performed before the first. Of course, the order of operations as seen by one CPU is not changed, because the HW keeps track of data dependencies and always provides the latest value of a data to the CPU, even if not yet globally performed. Only the visibility of these operations from the other CPUs is impacted. This has some importance only when CPUs synchronize or communicate with one another, because one CPU could get stale values for some data. A similar issue results from aggressive prefetching of data (speculative execution found on 604 and 620) which allows a CPU to read some data very much in advance, and even across a test-and-branch sequence. This again allows a CPU to read data which has stale value. Special instructions (isync and sync), also called import and export fences, allow to solve these problems. However, inserting these instructions stalls the pipelines and is therefore a performance drain. 2.5 Interrupt handling The HW allows interrupts from one bus adapter to be addressed either to a specific CPU or to a set of CPUs. This set actually designates "one among all CPUs" and this CPU is dynamically chosen by the HW at each interrupt from a set of registers, one register associated with each CPU. Each CPU writes in this register a value indicating its willingness to receive interrupts. 3. Some high-level SW issues Let's now jump abruptly to a very different level of abstraction. The main issue to transform a UniProcessor (UP) Operating System like the AIX operating system Version 3 into an MP Operating System like AIX Version 4 is to protect the coherency of the data structures managed by the kernel. This is because several CPUs can simultaneously manipulate these data structures. The conflict can be a thread-thread conflict or a thread-interrupt conflict. On a UP machine, the thread-thread conflict cannot exist since only one flow of execution exists at any time (true only for a non- preemptable kernel); the thread-interrupt conflict is solved by masking the processor against interrupts. In an MP OS, locks must be added. Apart from becoming an MP OS, the AIX operating system was also turned from a process-based kernel to a thread-based kernel. Several threads of execution exist within the scope of a process which is now a passive entity representing an address space and access rights. Threads are especially useful for MP architectures since they enable finer-grain parallel programming, compared to current multi-process applications; this gives an opportunity to have a better speed-up on multi-threaded applications, as opposed to the scale-up (more throughput) which can be obtained without parallel programming. A single multithreaded application will run faster for a single user on an SMP machine. Threads-based programming requires a new API (POSIX 1003.1c) and thread-safe libraries. It opens the way to portable thread-based applications. We will not describe threads in more detail in this paper. Turning a UP UNIX kernel into an MP kernel has been described several times during the last years in various USENIX proceedings [Cam91] [Car93] [Pow91]. We will therefore try to focus below on AIX-specific or PowerPC- specific issues. 4. Locks 4.1 Some interesting AIX Version 3 features AIX Version 3 has two specific properties which affect the MP process. + preemptability: when an interrupt occurs that makes schedulable a thread which has a higher priority than the currently running thread(s), a context switch is required. If the current thread is running in user mode, there is no issue. It is temporarily suspended, and this process is called preemption. However, if the current thread is running in kernel mode, most UNIXes do not allow the preemption to occur. This is to prevent corruption of the kernel data structures. The context switch is delayed until the current thread goes back to user mode. Since this can last quite a while on long kernel operations, this behavior is detrimental to response time and real time requirements. There are two possibilities to overcome this: preemption points and full preemptability. Preemption points are the poor man's solution, meaning that the kernel is not preemptable except at some specific points. Full preemptability means that the kernel is always preemptable except during a few code sequences. AIX Version 3 is fully preemptable. This implies a locking scheme to maintain data coherency. + pageability: the AIX kernel code and data structures are pageable, except for some parts pinned in memory to avoid deadlock scenarios. This minimizes the amount of real physical memory required by the kernel by paging- out seldom used code and data. It allows to over- dimension kernel tables so that the hideous error message "too many processes or messages or whatever" encountered on traditional UNIXes is never seen on AIX. When a demand peak for a resource occurs, one or more pages of the corresponding table are paged-in and used. When no longer required, they are paged-out. So page faults and their associated context switches can be encountered at any point in time when kernel code runs. This is another case of unexpected kernel preemption. To implement the two properties above, AIX Version 3 has a simple locking scheme with a coarse granularity; this scheme cannot be used for an MP system because the coarse lock granularity results in a very high contention rate on the few locks and a very poor overall scalability versus the number of processors. Processors spend their time waiting for the locked resources. However, the V3 locking scheme provided pointers to the critical areas that needed to be addressed. These two AIX features obviously had to be kept for the MP version. It should be noted that the refinement of the locking scheme adopted for the sake of MP efficiency had the side effect of improving the preemptability of the kernel. In fact, tuning the locks for low contention tends to reduce the pathlength of the code spent with preemption off. 4.2 What should be locked? Two basic approaches to locking exist: + code-based locks are coarse grain locks which encapsulate large pieces of code and associated data. Basically, when entering a subsystem, a lock is taken and released only at subsystem exit. This is simple but collision rates can be high since such locks are taken for a long time. + data-based locks are associated with data structures, or part of them, and lock-unlock operations are inserted in the code so that data protected by the lock are kept coherent when the lock is free. Also data- based locks can be coarse-grained (a whole table), medium grained (some elements of a table) or fine grained (one element of a table, or even a single item in a structure). Both types of locks are used in AIX Version 4. The choice of lock type is dependent upon how critical to performance is a given subsystem. For example, the CD-ROM filesystem has few requirements for simultaneous access due to the physical constraints of the device. So a code-lock can be used. On the other hand, the table of messages for System V IPC must be fine-grained because this is critical to the aggregate message switching capabilities of the system, in presence of several message queues, readers and writers. 4.3 Which type of locks? Many lock semantics can be imagined. Because we wanted to be able to borrow some code from the OSF/1 kernel, and because of the experience Bull had accumulated with the MP aspects of OSF/1, we decided to model our locks on the OSF/1 kernel locking scheme. In the case of a resource being unavailable, the requester thread blocks. This can be either by spinning (sometimes called "busy wait") or by releasing the CPU and going to sleep. Sleeping is not allowed if called from an interrupt level because it leads to deadlock situations. Two types of locks are used: + simple locks: as the name implies, this is the most rudimentary form and as a result the most efficient and frequently used. The lock is a simple word and if busy, the requester spins. However, the lock implementation has been optimized as follows: - After some amount of spinning, the requester of a busy lock will sleep (except if the request is made from a interrupt-level or masked against interrupts). The appropriate spinning time is approximately the time spent to execute a process switch. - If a lock is busy and the owner of the lock is not running on any of the CPUs, the requester goes to sleep. This is because the probability that the lock is freed in the spinning time-frame is almost zero. The owner of a lock is not running either because it is runnable but not currently running on a processor, or because it is waiting for some event (e.g. page fault resolution). + complex locks: these are multiple-readers, single- writer semaphores. They can optionally be tagged as recursive. Recursivity is generally discouraged in the locking scheme to keep it simple and deadlock free. However, to be able to import recursive code, this option is left open to the programmer. Threads requesting a busy complex lock spin and then sleep just like requesters of simple locks. They are used only if Reader-Writer semantics make sense (e.g. in the filesystem) or if recursivity is needed. Both types of locks can create the priority-inversion syndrome. This happens when high priority threads wait for a resource held by a lower priority thread. It results in a degradation of the response time. To overcome this, the priority of the thread owning a lock is temporarily raised to the priority of the highest priority thread waiting for the lock. This is sometimes called "priority promotion". 4.4 How many locks, how often? A system may have many or few locks. It may use them often or seldom. In a first approach, fine-grain parallelization implies many locks, taken often. Coarse-grain parallelization has few locks seldom taken. Obviously, there is a continuum in a two-dimensional space. It is not always obvious to characterize a locking scheme. At first sight, it looks that fine-grain locking is always better. It results in very short time spent while holding a lock, so low probability for collision (finding a lock busy). However, locking does not come for free. As opposed to older architectures where writing a word and accessing a lock through a test_and_set instruction were in the same ballpark in terms of execution time, there are orders of magnitude of difference between reading or writing a word in Level-1 cache (1 cycle) and taking a lock (around 50 cycles in the best case). The high number of cycles associated to locking is due to several factors, in order of importance: + need to synchronize the pipelines on lock operations + cache misses: processors are much faster these days, but memory does not keep up. + SW overhead: for various reasons and despite all the tuning, several bookkeeping instructions are needed around the atomic core So too much locking and unlocking is detrimental to performance. Instead of taking a lock for the minimal amount of time needed from a data structure coherency standpoint, the programmer has to consider whether the lock will not be needed again soon, and keep the lock during this interval, even if not strictly needed. The same applies to whether it is better to have several locks for a set of data structures and play subtle games or have a single lock held for a longer time but toggled less often. So locking the kernel is finding a trade-off between too few locks and too much contention or too many locks and too much overhead. In both cases, performance suffers. There is an optimum but finding it is a matter of calculations, experience and finally trial-and-error process [Cam91]. Some figures below give an idea of the locking scheme complexity in AIX version 4. They must be considered as orders of magnitude more than exact figures. There are around 200 classes (types) of locks. The actual number of instances of locks depends on the size of various tables and is not very significant. The number of lock operations (statically counted) is around 500. The number of locks operations performed per second on a timesharing workload (SPEC SDET) is around 10,000 lock-unlock pairs per second, on a quad machine with 75MHz 601 processors. 4.5 How to avoid locks Some parts of the kernel fit into well defined frameworks and in that case the burden of locking can be delegated to the framework code. + Drivers which have no strong performance requirements can be "funneled". In that case the drivers framework locks the driver on a "master" processor and all accesses are serialized via a single lock. This scheme is used only for third party-devices. None of the Bull or IBM-developed drivers use it. + The streams framework [Rit84], taken from OSF/1, can provide locking at various levels, described below from the highest level of concurrency to the the lowest level : - queue level: only access to message queues is protected. Modules must provide their own locking for shared state or upstream - downstream synchronization. - queue pair level: both upstream and downstream queues are protected. If some data are not "per- stream", i.e. they are shared between several streams, the module must take care of their locking. This is used by TTY line disciplines, which have only per-stream data. - module level: all queues and shared data associated with a module are protected. This is the default mode which can be safely used by most UP streams modules, which want to ignore MP issues. - arbitrary level: allows to arbitrarily group together from the locking standpoint modules which share data other than via queues. - global: one single lock for the whole streams subsystem, for debug purpose only In AIX Version 4, the TTY subsystem, SNA and Netware stacks are implemented within the streams framework. 4.6 Deadlocks As soon as we have more than one lock in a system, a potential for deadlock exists. One well-known strategy to avoid deadlocks is to implement a lock hierarchy, so that locks are always taken in the same order. We chose to implement a partial hierarchy, i.e. not global to the whole kernel but per subsystem. This hierarchy is defined in the registration process (see below under Lock instrumentation). However, it is not global nor enforced dynamically for reasons similar to the ones explained in [Pac91]: - strict ordering is hard to enforce globally and not always necessary, if the programmer knows that the deadlock cannot occur for some reason. - impact on lock code pathlength and therefore performance - desire to be able to import code from OSF/1 without major restructuring We chose instead to use a static deadlock analyzer, called SDLA (Static DeadLock Analyzer). This tool processes kernel code to detect potential deadlocks by exploring the locking scheme tree [Kor89]. It is derived from lint. SDLA does not use the static description of the lock hierarchy but constructs its own vision of the lock hierarchy while walking the tree. Two common problems found in many similar tools are the "noise" (false deadlocks) and the time taken to explore the tree and perform the analysis. We found several techniques to overcome these in a practical manner (patents pending) and we were able to discover several kernel design errors. We use the SDLA to continuously inspect the source tree to ensure that the addition of features and correction of bugs do not introduce new deadlocks. We foresee an enhancement to SDLA to be able to detect underlocking. Underlocking happens when data normally protected by a lock are manipulated without the lock being held. This can lead to kernel data corruption and system crashes. It is the most critical area for an MP kernel, since deadlocks are less difficult to debug. This will be similar to WARLOCK [Ste93]. 4.7 Lock instrumentation A comprehensive lock instrumentation is key to the kernel tuning process [Car93]. A symbolic naming and registration scheme has been adopted for all locks. This facilitates lock designation by the analysis tools. Locks are named as: Subsystem_name$Lock_family_name$Occurrence_number We implemented two levels of lock instrumentation: + One is almost always on during validation phases (selection at bootstrap time) and records the number of acquisitions, misses and sleeps. The overhead has to be as low as possible. The current implementation adds 10% to the lock-unlock pair execution time. A tool named lockstat allows the customer or field-service to analyze the behavior of kernel locks. Below is a display of lockstat output, taken during the run of a database benchmark early during development. The 20 locks with largest %Ref are shown: ____________________________________________________________________ Subsys Name Ocn Ref/s %Ref %Block %Sleep ____________________________________________________________________ PROC PROC_INT_CLASS -1 8434 29.89 18.77 0.00 IOS UPHYSIO_LOCK_CLASS -1 2483 8.80 6.74 0.00 PMAP PMAP 0 2108 7.47 1.60 0.00 DISK SCDISK_LOCK_CLASS -1 1495 5.30 4.22 0.00 IPC SEM_LOCK_CLASS 2 1472 5.22 7.75 0.00 LOCKL LOCKL 7 1035 3.67 0.00 0.00 IOS IOS_IPOLL_CLASS 2 1030 3.65 0.48 0.00 PFS JFS_LOCK_CLASS -1 997 3.53 1.46 0.00 XLVM LVM_LOCK_CLASS 0 996 3.53 3.24 0.00 IPC SEM_LOCK_CLASS 2 569 2.02 4.12 0.00 IPC SEM_LOCK_CLASS 2 524 1.86 3.73 0.00 IPC SEM_LOCK_CLASS 2 523 1.86 3.90 0.00 VMM VMM_LOCK_SCB 443 499 1.77 1.66 0.00 IPC SEM_LOCK_CLASS 2 475 1.68 3.49 0.00 PROC TRB_LOCK_CLASS 3 445 1.58 0.05 0.00 PROC TRB_LOCK_CLASS 2 443 1.57 0.03 0.00 PROC TRB_LOCK_CLASS 0 441 1.56 0.02 0.00 PROC TRB_LOCK_CLASS 1 438 1.55 0.04 0.00 TCPKER DEMUXER_LOCK_FAMILY 45172 417 1.48 3.08 0.00 PROC TOD_LOCK_CLASS -1 399 1.42 12.23 0.00 ____________________________________________________________________ Subsys Name Ocn designate the lock. Ref/s is the number of requests for this lock per second. %Ref is the % of requests for this lock versus all lock requests %Block is the % of blocking (spin or sleep) requests versus all requests for this lock. %Sleep is the % of blocking requests resulting in sleeping versus all requests for this lock. + For a more detailed lock analysis, we have a trace- based instrumentation, which can be turned on-off dynamically. We use the AIX trace tool and insert trace hooks in lock code. This allows us to compute the time spent under locks and get a comprehensive view of lock behavior. The LCA (Lock Contention Analyzer) is a Motif-based tool giving a complete picture of all lock parameters. It is used only in the development organization. Below is a display of LCA output, taken during the run of a database benchmark early during development. The 20 locks with largest Coll Sec are shown: TABLE Collision Section (times in 10 s) +---------------------+--------+---------+--------+-------------+-----------+---------+---------+ | Class | Ocn | Type | Count | Coll Sec(%) | CPU (ms) | CPU Avg | CPU Max | +---------------------+--------+---------+--------+-------------+-----------+---------+---------+ | PROC_INT_CLASS | -- | Simple | 93579 | 6.108 | 2885.994 | 30.84 | 1712.00 | | SEM_LOCK_CLASS | 2 | Simple | 15367 | 2.816 | 1330.436 | 86.58 | 2323.46 | | UPHYSIO_LOCK_CLASS | -- | Simple | 25658 | 2.466 | 1165.270 | 45.42 | 1609.22 | | SCDISK_LOCK_CLASS | -- | Simple | 15397 | 2.039 | 963.444 | 62.57 | 1336.19 | | LVM_LOCK_CLASS | 0 | Simple | 10265 | 1.552 | 733.498 | 71.46 | 576.38 | | DEMUXER_LOCK_FAMILY | 45172 | Simple | 4267 | 1.132 | 534.809 | 25.34 | 365.82 | | IOS_IPOLL_CLASS | 30 | Simple | 2580 | 1.114 | 526.462 | 04.05 | 546.30 | | LOCKL | 45 | Lockl | 13043 | 1.071 | 506.051 | 38.80 | 1790.21 | | IOS_IPOLL_CLASS | 27 | Simple | 2383 | 1.049 | 495.487 | 07.93 | 545.28 | | DEMUXER_LOCK_FAMILY | 41076 | Simple | 3783 | 0.967 | 456.725 | 20.73 | 518.27 | | SEM_LOCK_CLASS | 2 | Simple | 5899 | 0.805 | 380.185 | 64.45 | 1300.61 | | SEM_LOCK_CLASS | 2 | Simple | 5231 | 0.719 | 339.934 | 64.98 | 1806.98 | | SEM_LOCK_CLASS | 2 | Simple | 5407 | 0.713 | 336.979 | 62.32 | 1276.16 | | DEMUXER_LOCK_FAMILY | 61556 | Simple | 2104 | 0.673 | 318.065 | 51.17 | 309.89 | | SEM_LOCK_CLASS | 2 | Simple | 4800 | 0.654 | 309.150 | 64.41 | 1171.71 | | DEMUXER_LOCK_FAMILY | 116 | Simple | 1937 | 0.608 | 287.322 | 48.33 | 318.59 | | VMM_LOCK_SCB | 489 | Simple | 5234 | 0.559 | 264.332 | 50.50 | 387.33 | | JFS_LOCK_CLASS | -- | Simple | 11736 | 0.373 | 176.423 | 15.03 | 1192.45 | | U_TIMER_CLASS | 41 | Simple | 13639 | 0.366 | 173.051 | 12.69 | 581.76 | | IOS_IPOLL_CLASS | 2 | Simple | 11182 | 0.302 | 142.643 | 12.76 | 47.62 | +---------------------+--------+---------+--------+-------------+----------+---------+----------+ Class Ocn designate the lock Type is the lock type: simple or R/W or Lockl (AIX V3 backward compatibility) Count is the number of lock requests during the period Coll Sec(%) is the collision section, i.e. the % of CPU time when the lock is held CPU (ms) is total CPU time (in ms) spent with the lock held during the period CPU Avg is average CPU time (in microseconds) spent with the lock held during one locking (i.e. between one lock and the corresponding unlock) CPU Max is maximum CPU time (in microseconds) spent with the lock held during one locking The data above were taken during the development process and do not reflect the behavior of the shipped system. For example, at that point in time, %Sleep was always 0 because the locks were tuned to spin for a very long time. A lock in the process management (PROC_INT_CLASS) was a point of contention. This is clear from the %Ref and %Block in the lockstat display or the Coll Sec(%) in the LCA display. This lock is both taken often (%Ref) and often missed (%Block). With this information, this problem was taken care of in the MP tuning process. Precedence rules among locks in a subsystem are registered in a database. This allows the LCA tool to detect rule violations and issue warnings. 4.8 Debugging The identification of the thread owning a lock is systematically registered so that the deadlock detector contained in the crash analyzer can provide extended information. This is true even in production kernels. During the development phase, locks can be compiled with additional sanity checks turned on so that "asserts" can be used. They are however too costly to go into production. The symbolic kernel debugger has been enhanced so that when a CPU encounters a breakpoint, the other CPUs are also stopped through the CPU to CPU communication channel (MPC, see below). It is then possible to switch the debugger from one CPU to another for CPU-specific data. 4.9 Testing For each subsystem, we have a set of tests specifically targeted to lock stressing. With the lockstat tool, we are able to ensure that the collision rate is high enough on each and every lock so that we are sure that they are appropriately stressed, looking either for overlocking (-> deadlocks) or underlocking (-> race situations leading to data corruption). The collision rate is also randomly varied between maximum value and 0 to explore windows of vulnerability. 4.10 Kernel Implementation issues The price of locks After carefully coding lock and unlock primitives for simple locks, in assembly language, it was determined that the lock-unlock pair price in the lock free case was around 100 cycles (on a PowerPC 601). Notice that the number of instructions is much smaller. Some lock instructions (e.g. sync) are costly in terms of cycles because the 601, as any sophisticated processor, needs to flush its read and write queues. MP/UP overhead Locking and unlocking brings an overhead compared to a UP kernel. We took the goal to limit this overhead to 5% on the throughput of macro benchmarks like the TPC family or SPEC SDM and LADDIS. The core of the AIX kernel (i.e. not counting the drivers and loadable kernel extensions like NFS) actually exists in two versions derived from a single source tree. In the MP version, all locks are enabled. In the UP version, only locks mandatory to allow pageability and preemptability are enabled using #ifdef. So, the UP systems do not incur the MP performance penalty. The appropriate kernel is automatically selected at bootstrap time. Thundering herd problem This occurs when several threads are queued waiting for a resource, the resource is freed and several waiting threads are awakened at the same time. For simple locks, this is avoided by selectively waking only the highest priority sleeping thread. For complex RW locks, the highest priority writer is awakened, or all the readers if no writer is waiting. Interrupts and locks On an MP system, to protect thread- interrupt critical sections, it is necessary to take both a lock (for protection against other CPUs) and to mask interrupts appropriately (for protection against your own CPU). For performance and readability reasons, we packaged these operations together into two primitives: disable_lock() and unlock_enable(). AIX Version 3 already had two levels for interrupt handling. Part of the interrupt code is executed at the HW interruption level. When HW critical management has been completed, the bulk of the interrupt processing, if necessary, is handled "off-level" at a lower SW interrupt level. Since the AIX Version 4 kernel is thread-based, there was an opportunity to transform interrupt handling into normal thread-level code [Pow91]. We decided not to do so because of the performance overhead of thread level handling versus off-level interrupt handling. 4.11 Tuning the lock system Because we had a previous experience in MP UNIXes, we decided to shoot directly for a 4-way MP-efficient for the first release. A number of guidelines were given to developers. The collision section of a lock is the % of the total CPU time spent under this lock in the uniprocessor case. The objective was that no collision section should be greater than 2.5%, for significant benchmarks (e.g. TPC-C, SDET, LADDIS). This can be measured with the LCA tool on a UP system. So even without any SMP HW available, we were able to determine early during the development process all the potential granularity problems and established a list of "locks to be broken". We also identified too fine-grained locks which had to be coalesced. We believe that the scheme put in place can scale up to 8- way without major redesign, requiring only a series of measure-and-tune cycles that we will implement between the first and the second MP AIX releases. 4.12 User mode locks Some applications and most importantly database managers implement locks in user-mode because they cannot afford the performance overhead of system provided semaphores (System V IPC) and started implementing multithreaded applications before a standardized API was available. These locks are often called latches. On a UP AIX 3.2, these locks are implemented with a compare_and_swap() library routine which uses the load_and_reserve and store_conditional instructions on PowerPC machines or a system call on POWER machines, where atomic instructions are not available. On MP machines, this scheme does not work because of the weakly ordered memory model described in the HW section. As the insertion of the fence instructions is a delicate process and moreover because these fence instructions depend on the type of PowerPC processor (601, 604 or 620) for optimal performance, we provided a simple API to hide this complexity: _check_lock() same semantics as compare_and_swap() and issues the appropriate import fence. _clear_lock() issues the appropriate export fence and clears the lock-word. Programmers can then combine these to implement the appropriate type of lock for their application. An alternative is to use the POSIX1003.1c mutexes and condition variables provided with the pthreads library. 5. Other atomic operations The availability of load_and_reserve and store_conditional instructions enables the construction of other atomic operations. They provide better performance when compared to locks for certain operations. They usually do not need fences because no data other than the parameter of the operation is implicitly involved. We use mainly: _fetch_and_add to implement counters, statistics _fetch_and_and|or to implement bit masks _compare_and_swap to implement some carefully chosen list insertion/deletions primitives (singly- linked lists) 6. Other SW issues on an SMP system 6.1 Scheduling and dispatching: affinity The architecture being what it is, the natural tendency is to let the scheduler dispatch threads on processors from a single run-queue and according to priorities. However, the presence of the large Level-2 caches creates some affinity between threads and processors because the working set of threads tends to accumulate in the caches [Tor92]. So if the thread is migrated to another CPU and therefore another pair of caches, a warm-up of the cache is unavoidable, with a burst of lateral misses (i.e. miss where data comes from other caches, as opposed to vertical misses where data comes from main memory). This temporarily high miss rate decreases the performance. So we implemented an affinity scheduler with a simple algorithm to ensure that the pathlength increase in the scheduler code is minimal, so that the global outcome is positive. Each thread remembers its level of affinity with processors (actually, it is simply the last processor where the thread executed). When a processor becomes idle, the scheduler scans the runqueue and dispatches preferably a thread with affinity for it. The algorithm must however ensure fair access to CPU time. Several parameters (e.g. length and depth of the runqueue scan) were tuned to insure this. 6.2 Binding threads to processors An API allows the programmer to explicitly bind a thread or a process to a specific processor, preventing the scheduler to migrate it when possible. Caution is recommended with this user-driven scheduling, because the scheduler generally does a better job in terms of global performance optimization. It should therefore be used only for special dedicated applications (e.g. real-time). 6.3 Managing interrupts Masking against interrupts is done by writing some mask value into a HW register associated with the CPU. When the CPU is running at thread level, i.e with no interrupts masked, it writes a code in this register to distinguish the idle task from the normal processing. This register is used by HW to dispatch interrupts to the "least loaded processor", so all interrupts will go to one of the idle processors or, if none, to one of the processors not masked against interrupts. As interrupt handling involves significant overhead, this "interrupt steering" improves response time. Maximum throughput is not much improved since, in that case, CPUs are seldom idle. 6.4 Clock management Ideally, only one clock interrupt per system (as opposed to one clock interrupt per CPU) is needed on an SMP since almost all clock related actions are independent of the CPU. However, actions such as profiling a CPU i.e. determining which piece of code it is currently executing, need interruption of the CPU. So we ended up with a scheme where each CPU is interrupted at each clock tick but only one of them implements the system-wide time-related tasks. 6.5 Multi Processor communication (MPC) SMP kernel implementations require some sort of basic inter-CPU communication. In AIX Version 4, it is called MPC. The MPC allows a processor to send an interrupt to another processor and have it perform a specific action. We tried to maintain this layer as simple as possible. The MPC services allow: + To register a service routine, which will be called by the target processor upon reception of an inter CPU interrupt. + To trigger an inter CPU interrupt on a specific processor or on all the other processors (broadcast). The MPC services are used for various functions such as funneling, timer management, dump and kernel debugger (to stop all the other processors). 7. Overall results The results of all these technical choices can be summarized in scalability figures for various types of workloads. These figures being highly sensitive, you will have to refer to the marketing brochures of both companies to get them. Suffice it to say that from a technical standpoint we are very happy with the overall scalability. 8. How we did it AIX 4.1 is the result of a joint development program between Bull (Grenoble, France) and IBM (Austin, TX). Bull brought its SMP experience in DPX/2 and DPS7000 and IBM its knowledge of AIX and the PowerPC architecture. The architectural design was done by a joint team of SW architects. The work sharing was done by a joint team of technical managers. The component level design was done by the teams where each component responsibility had been assigned. A common team in a single location did the locking of the kernel core because these parts interact so heavily that a single team was necessary. The team was composed of engineers from both companies. The rest of the work in kernel, libraries and commands was separable and so could be dispatched between Austin and Grenoble with the help of a fast transatlantic communication link allowing us to work on the same source tree with a unique configuration management system in almost real time. The whole project lasted all in all more than 2 years, from early Bull-IBM technical contacts to shipment. However, this involves the SMP HW development and the entirety of the AIX Version 4 development cycle. This latter decision must be drawn to the attention of those who may have heard of shorter development cycles with respect to MP-enabling a UNIX OS implementation. Instead of freezing the functionality of AIX 3.2 to make it support SMP HW and let AIX evolve functionally in parallel to meet market requirements on UP only, we commonly decided to proceed with a single development path. This by no doubt made the MP enablement more complex, but avoided the confusion and costs resulting from having 2 versions of AIX. So it is difficult to size the SW SMP effort per se, but it lasted around 18 months (design to General Availability) and involved more than 100 people. With a lot of good will on both sides, this cooperation worked and is still working incredibly well, probably due to our common technical culture which is a mix of UNIX background and mainframe experience. 9. References This is not a complete bibliography on MP aspects of the UNIX kernel, because it would be too big. [Cam91] Lock Granularity Tuning Mechanisms in SVR4/MP, Mark D. Campbell, Russ Holt, John Slice, USENIX SEDMS II, March 1991, p. 221 [Car93] Measuring Lock Performance in Multiprocessor Operating System Kernels Joseph P. CaraDonna, Noemi Paciorek, Craig E. Wills, USENIX SEDMS IV, Sept 1993, p. 37 [IBM94] PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000, Prentice-Hall 1994 [Kor89] Sema: a LINT-like tool for analyzing semaphore usage in a multithreaded UNIX kernel, Joe Korty, 1989 Winter USENIX [Pac91] Debugging Multiprocessor Operating System Kernels, N.Paciorek, S.LoVerso, A.Langerman, USENIX SEDMS II, March 1991, p. 185 [Pow91] SunOS Multi-thread Architecture, M.L. Powell, S.R. Kleiman, S. Barton, D. Shah, D. Stein, M. Weeks, 1991 Winter USENIX, p. 65 [Rit84] A Stream Input-Output System, D.Ritchie, AT&T Bell laboratories Technical Journal, Vol. 63 no. 8 (October 1984) [Ste93] WARLOCK - A Static Data Race Analysis Tool, N.Sterling, 1993 Winter USENIX, p. 97 [Tor92] Evaluating the benefits of cache affinity scheduling in shared memory multiprocessors, J.Torellas, A.Tucker, A.Gupta Technical report: CSL-TR-92-536 (Stanford Univ.) August 1992 10. Trademarks Escala, PowerScale, BOS, DPS and DPX are trademarks of Bull S.A. AIX, RS/6000 and PowerPC are registered trademarks of IBM UNIX is a registered trademark licensed exclusively through X/Open OSF/1 is a registered trademark of OSF 11. Author information Jacques Talbot is responsible of SW Architecture in the Open Systems department at Bull Grenoble since 1989. He was previously in charge of the Bull BOS UNIX kernel development, and then project manager for the Bull DPX/2 200 UNIX platform. He graduated in 1972 from the Ecole Superieure d'Electricite. Address: 1, rue de Provence 38130 Echirolles FRANCE Email: J.Talbot@frec.bull.fr This paper has been written by Jacques Talbot from Bull. However, it is the result of the work of many people in both Bull and IBM. Special thanks to Jack O'Quin, John O'Quin, Jeff Peek from IBM and Andre Albot, Corrado Clementi, Jean- Jacques Guillemaud, Steve Hinde (*), Michel Sanchez from Bull. (*) now with another company