Check out the new USENIX Web site.
2001 USENIX Annual Technical Conference, June 25-30, 2001, Boston, MA

Pp. 155–164 of the Proceedings

Improving the FreeBSD SMP implementation

Greg Lehey
IBM LTC Ozlabs

ABSTRACT

UNIX-derived operating systems have traditionally have a simplistic approach to process synchronization which is unsuited to multiprocessor application. Initial FreeBSD SMP support kept this approach by allowing only one process to run in kernel mode at any time, and also blocked interrupts across multiple processors, causing seriously suboptimal performance of I/O bound systems. This paper describes work done to remove this bottleneck, replacing it with fine-grained locking. It derives from work done on BSD/OS and has many similarities with the approach taken in SunOS 5. Synchronization is performed primarily by a locking construct intermediate between a spin lock and a binary semaphore, termed mutexes. In general, mutexes attempt to block rather than to spin in cases where the likely wait time is long enough to warrant a process switch. The issue of blocking interrupt handlers is addressed by attaching a process context to the interrupt handlers. Despite this process context, an interrupt handler normally runs in the context of the interrupted process and is scheduled only when blocking is required.

Introduction

A crucial issue in the design of an operating system is the manner in which it shares resources such as memory, data structures and processor time. In the UNIX model, the main clients for resources are processes and interrupt handlers. Interrupt handlers operate completely in kernel space, primarily on behalf of the system. Processes normally run in one of two different modes, user mode and kernel mode. User mode code is the code of the program from which the process is derived, and kernel mode code is part of the kernel. This structure gives rise to multiple potential conflicts.

Use of processor time

The most obvious demand a process or interrupt routine places on the system is that it wants to run: it must execute instructions. In traditional UNIX, the rules governing this sharing are: This method works acceptably for the single processor machines for which it was designed. In the following section, we'll see the reasoning behind the last decision.

Kernel data objects

The most obvious problem is access to memory. Modern UNIX systems run with memory protection, which prevents processes in user mode from accessing the address space of other processes. This protection no longer applies in kernel mode: all processes share the kernel address space, and they need to access data shared between all processes. For example, the fork() system call needs to allocate a proc structure for the new process. The file sys/kern_fork.c contains the following code:
int
fork1(p1, flags, procp)
	struct proc *p1;
	int flags;
	struct proc **procp;
{
	struct proc *p2, *pptr;

...
	/* Allocate new proc. */
	newproc = zalloc(proc_zone);

The function zalloc takes a struct proc entry off a freelist and returns its address:
	item = z->zitems;
	z->zitems = ((void **) item)[0];
...
	return item;

What happens if the currently executing process is interrupted exactly between the first two lines of the code above, maybe because a higher priority process wants to run? item contains the pointer to the process structure, but z->z_items still points to it. If the interrupting code also allocates a process structure, it will go through the same code and return a pointer to the same memory area, creating the process equivalent of Siamese twins.

UNIX solves this issue with the rule ``The UNIX kernel is non-preemptive''. This means that when a process is running in kernel mode, no other process can execute kernel code until the first process relinquishes the kernel voluntarily, either by returning to user mode, or by sleeping.

Synchronizing processes and interrupts

The non-preemption rule only applies to processes. Interrupts happen independently of process context, so a different method is needed. In device drivers, the process context (``top half'') and the interrupt context (``bottom half'') must share data. Two separate issues arise here: each half must ensure that any changes to shared data structures occur in a consistent manner, and they must find a way to synchronize with each other.

Protection

Each half must protect its data against change by the other half. For example, the buffer header structure contains a flags word with 32 flags, some set and reset by both halves. Setting and resetting bits requires multiple instructions on most architectures, so the potential for data corruption exists. UNIX solves this problem by locking out interrupts during critical sections. Top half code must explicitly lock out interrupts with the spl functions.
The naming goes back to the early days of UNIX on the PDP-11. The PDP-11 had a relatively simplistic level-based interrupt structure. When running at a specific level, only higher priority interrupts were allowed. UNIX named functions for setting the interrupt priority level after the PDP-11 SPL instruction, so initially the functions had names like spl4 and spl7. Later machines came out with interrupt masks, and BSD changed the names to more descriptive names such as splbio (for block I/O) and splhigh (block out all interrupts).
One of the most significant sources of bugs in drivers is inadequate synchronization with the bottom half.

Interrupt code does not need to perform any special synchronization: by definition, processes don't run when interrupt code is active.

Blocking interrupts has a potential danger that an interrupt will not be serviced in a timely fashion. On PC hardware, this is particularly evident with serial I/O, which frequently generates an interrupt for every character. At 115200 bps, this equates to an interrupt every 85 µs. In the past, this has given rise to the dreaded silo overflows; even on fast modern hardware it can be a problem. It's also not easy to decide interrupt priorities: in the early days, disk I/O was given a high priority in order to avoid overruns, while serial I/O had a low priority. Nowadays disk controllers can handle transfers by themselves, but overruns are still a problem with serial I/O.

Waiting for the other half

In other cases, a process will need to wait for some event to complete. The most obvious example is I/O: a process issues an I/O request, and the driver initiates the transfer. It can be a long time before the transfer completes: if it's reading keyboard input, for example, it could be weeks before the I/O completes. When the transfer completes, it causes an interrupt, so it's the interrupt handler which finally determines that the transfer is complete and notifies the process. Traditional UNIX performs this synchronization with the functions sleep and wakeup, though current BSD no longer uses sleep: it has been replaced with tsleep, which offers additional functionality.

The top half of a driver calls sleep or tsleep when it wants to wait for an event, and the bottom half calls wakeup when the event occurs. In more detail,

This method has problems even on single processors: the time to wake processes depends on the number of sleeping processes, which is usually only slightly less than the number of processes in the system. FreeBSD addresses this problem with 128 hashed sleep queues, effectively diminishing the search time by a factor of 128. A large system might have 10,000 processes running at the same time, so this is only a partial solution.

In addition, it is permissible for more than one process to wait on a specific address. In extreme cases dozens of processes wait on a specific address, but only one will be able to run when the resource becomes available; the rest call tsleep again. The term thundering horde has been devised to describe this situation. FreeBSD has partially solved this issue with the wakeup_one function, which only wakes the first process it finds. This still involves a linear search through a possibly large number of process structures, and it has the potential to deadlock if two unrelated events map to the same address.

Adapting the UNIX model to SMP

A number of the basic assumptions of this model no longer apply to SMP, and others become more of a problem:

The initial FreeBSD model

The original version of FreeBSD SMP support solved these problems in a manner designed for reliability rather than performance: effectively it found a method to simulate the single-processor paradigm on multiple processors. Specifically, only one process could run in the kernel at any one time. The system ensured this with a spinlock, the so-called Big Kernel Lock (BKL), which ensured that only one processor could be in the kernel at a time. On entry to the kernel, each processor attempted to get the BKL. If another processor was executing in kernel mode, the other processor performed a busy wait until the lock became free:
MPgetlock_edx:
1:
	movl	(%edx), %eax
	movl	%eax, %ecx
	andl	$CPU_FIELD,%ecx
	cmpl	_cpu_lockid, %ecx
	jne	2f
	incl	%eax
	movl	%eax, (%edx)
	ret
2:
	movl	$FREE_LOCK, %eax
	movl	_cpu_lockid, %ecx
	incl	%ecx
	lock
	cmpxchg	%ecx, (%edx)
	jne	1b
	GRAB_HWI
	ret
In an extreme case, this waiting could degrade SMP performance to below that of a single processor machine.

How to solve the dilemma

Multiple processor machines have been around for a long time, since before UNIX was written. During this time, a number of solutions to this kind of problem have been devised. The problem was less to find a solution than to find a solution which would fit in the UNIX environment. At least the following synchronization primitives have been used in the past: There is some confusion in terminology with these locking primitives. In particular, the term mutex has been applied to nearly all of them at different times. We'll look at how FreeBSD uses the term in the next section.

One big problem with all locking primitives with the exception of spin locks is that they can block. This requires a process context: a UNIX interrupt handler can't block. This is one of the reasons that the old BKL was a spinlock, even though it could potentially use up most of processor time spinning.

The new FreeBSD implementation

The new implementation of SMP on FreeBSD bases heavily on the implementation in BSD/OS 5.0, which has not yet been released. Even the name SMPng (``new generation'') was taken from BSD/OS. Due to the open source nature of FreeBSD, SMPng is available on FreeBSD before on BSD/OS.

The most radical difference in SMPng are:

Interrupt threads

The single most important aspect of the implementation is the introduction of a process or ``thread'' context for interrupt handlers. This change involves a number of tradeoffs: SMPng solves the latency and scheduling issues with a technique known as lazy scheduling: on receiving an interrupt, the interrupt stubs note the PID of the interrupt thread, but they do not schedule the thread. Instead, it continues execution in the context of the interrupted process. The thread will be scheduled only in the following circumstances: We expect this method to offer negligible overhead for the majority of interrupts.

From a scheduling viewpoint, the threads differ from normal processes in the following ways:

Experience with the BSD/OS implementation showed that the initial implementation of interrupt threads was a particularly error-prone process, and that the debugging tools were inadequate. Due to the nature of the FreeBSD project, we considered it imperative to have the system relatively functional at all times during the transition, so we decided to implement interrupt threads in two stages. The initial implementation was very similar to that of normal processes. This offered the benefits of relatively easy debugging and of stability, and the disadvantage of a significant drop in performance: each interrupt could potentially cause two context switches, and the interrupt would not be handled while another process, even a user process, was in the kernel.

Experience with the initial implementation met expectations: we have seen no stability problems with the implementation, and the performance, though significantly worse, was not as bad as we had expected.

At the time of writing, we have improved the implementation somewhat by allowing limited kernel preemption, allowing interrupt threads to be scheduled immediately rather than having to wait for the current process to leave kernel mode. The potential exists for complete kernel preemption, where any higher priority process can preempt a lower priority process running in the kernel, but we are not sure that the benefits will outweigh the potential bug sources.

The final lazy scheduling implementation has been tested, but it is not currently in the -CURRENT kernel. Due to the current kernel lock implementation, it would not show any significant performance increase, and problems can be expected as additional kernel components are migrated from under Giant.

Not all interrupts have been changed to threaded interrupts. In particular, the old fast interrupts remain relatively unchanged, with the restriction that they may not use any blocking mutexes. Fast interrupts have typically been used for the serial drivers, and are specific to FreeBSD: BSD/OS has no corresponding functionality.

Locking constructs

The initial BSD/OS implementation defined two basic types of lock, called mutex: The implementation of these locks was derived almost directly from BSD/OS, but has since been modified significantly.

In addition to these locks, the FreeBSD project has included two further locking constructs:

Condition variables are built on top of mutexes. They consist of a mutex and a wait queue. The following operations are supported:

Shared/exclusive locks, or sx locks, are effectively read-write locks. The difference in terminology came from an intention to add additional functionality to these locks. This functionality has not been implemented, so currently sx locks are the same thing as read-write locks: they allow access by multiple readers or a single writer.

The implementation of sx locks is relatively expensive:

struct sx {
	struct lock_object sx_object;
	struct mtx	sx_lock;
	int		sx_cnt;
	struct cv	sx_shrd_cv;
	int		sx_shrd_wcnt;
	struct cv	sx_excl_cv;
	int		sx_excl_wcnt;
	struct proc	*sx_xholder;
};
They should be only used where the vast majority of accesses is shared.

Removing the Big Kernel Lock

These modifications made it possible to remove the Big Kernel Lock. The initial implementation replaced it with two mutexes: This combination of locks supplied the bare minimum of locks necessary to build the new framework. In itself, it does not improve the performance of the system, since processes still block on Giant.

Idle processes

The planned light-weight interrupt threads need a process context in order to work. In the traditional UNIX kernel, there is not always a process context: the pointer curproc can be NULL. SMPng solves this problem by having an idle process which runs when no other process is active.

Recursive locking

Normally, if a lock is locked, it cannot be locked again. On occasions, however, it is possible that a process tries to acquire a lock which it already holds. Without special checks, this would cause a deadlock. Many implementations allow this so-called recursive locking. The locking code checks for the owner of the lock. If the owner is the current process, it increments a recursion counter. Releasing the lock decrements the recursion counter and only releases the lock when the count goes to zero.

There is much discussion both in the literature and in the FreeBSD SMP project as to whether recursive locking should be allowed at all. In general, we have the feeling that recursive locks are evidence of untidy programming. Unfortunately, the code base was never designed for this kind of locking, and in particular library functions may attempt to reacquire locks already held. We have come to a compromise: in general, they are discouraged, and recursion must be specifically enabled for each mutex, thus avoiding recursion where it was not intended.

Migrating to fine-grained locking

Implementing the interrupt threads and replacing the Big Kernel Lock with Giant and schedlock did not result in any performance improvements, but it provided a framework in which the transition to fine-grained locking could be performed. The next step was to choose a locking strategy and migrate individual portions of the kernel from under the protection of Giant.

One of the dangers of this approach is that locking conflicts might not be recognized until very late. In particular, the FreeBSD project has different people working on different kernel components, and it does not have a strong centralized architectural committee to determine locking strategy. As a result, we developed the following guidelines for locking:

One of the weaknesses of the project structure is that there is no overall strategy for locking. In many cases, the choice of locking construct and granularity is left to the individual developer. In almost every case, locks are leaf node locks: very little code locks more than one lock at a time, and when it does, it is in a very tight context. This results in relatively reliable code, but it may not be result in optimum performance.

There are a number of reasons why we persist with this approach:

Migrating interrupt handlers

This new basic structure is now in place, and implementation of finer grained locking is proceeding. Giant will remain as a legacy locking mechanism for code which has not been converted to the new locking mechanism. For example, the main loop of the function ithread_loop, which runs an interrupt handler, contains the following code:
if ((ih->ih_flags & IH_MPSAFE) == 0)
        mtx_lock(&Giant);
....
ih->ih_handler(ih->ih_argument);
if ((ih->ih_flags & IH_MPSAFE) == 0)
        mtx_unlock(&Giant);
The flag INTR_MPSAFE indicates that the interrupt handler has its own synchronization primitives.

A typical strategy planned for migrating device drivers involves the following steps:

Probably the most difficult part of the process will involve larger components of the system, such as the file system and the networking stack. We have the example of the BSD/OS code, but it's currently not clear that this is the best path to follow.

Kernel trace facility

The ktr package provides a method of tracing kernel events for debugging purposes. It is not intended for use during normal operation, and should not be confused with the kernel call trace facility ktrace.

For example, the function sched_ithd, which schedules the interrupt threads, contains the following code:

CTR3(KTR_INTR, 
     "sched_ithd pid %d(%s) need=%d",
     ir->it_proc->p_pid, 
     ir->it_proc->p_comm, 
     ir->it_need);
...
if (ir->it_proc->p_stat == SWAIT) {
	CTR1(KTR_INTR, 
            "sched_ithd: setrunqueue %d",
	    ir->it_proc->p_pid);
The function ithd_loop, which runs the interrupt in process context, contains the following code at the beginning and end of the main loop:
for (;;) {
	CTR3(KTR_INTR, 
             "ithd_loop pid %d(%s) need=%d",
	     me->it_proc->p_pid, 
             me->it_proc->p_comm, 
             me->it_need);
...
	CTR1(KTR_INTR, 
            "ithd_loop pid %d: done",
             me->it_proc->p_pid);
	mi_switch();
	CTR1(KTR_INTR, 
             "ithd_loop pid %d: resumed",
             me->it_proc->p_pid);
The calls CTR1 and CTR3 are two macros which only compile any kind of code when the kernel is built with the KTR kernel option. If the kernel contains this option and the bit KTR_INTR is set in the variable ktr_mask, then these events will be masked to a circular buffer in the kernel. The ddb debugger has a command show ktr which dumps the buffer one page at a time, and gdb macros are also available. This gives a relatively useful means of tracing the interaction between processes:
2791 968643993:219224100 cpu1 ../../i386/isa/ithread.c:214 ithd_loop pid 21 ih=0xc235f200: 0xc0324d98(0) flg=100
2790 968643993:219214043 cpu1 ../../i386/isa/ithread.c:197 ithd_loop pid 21(irq0: clk) need=1 
2789 968643993:219205383 cpu1 ../../i386/isa/ithread.c:243 ithd_loop pid 21: resumed
2788 968643993:219190856 cpu1 ../../i386/isa/ithread.c:158 sched_ithd: setrunqueue 21
2787 968643993:219179402 cpu1 ../../i386/isa/ithread.c:120 sched_ithd pid 21(irq0: clk) need=0
This example traces the arrival and processing of a clock interrupt on the i386 platform, in reverse chronological order. The number at the beginning of the line is the trace entry number.

Witness facility

The witness code was designed specifically to debug mutex code. It keeps track of the locks acquired and released by each thread. It also keeps track of the order in which locks are acquired with respect to each other. Each time a lock is acquired, witness uses these two lists to verify that a lock is not being acquired in the wrong order. If a lock order violation is detected, then a message is output to the kernel console detailing the locks involved and the locations in question. Witness can also be configured to drop into the kernel debugger when an order violation occurs.

The witness code also checks various other conditions such as verifying that one does not recurse on a non-recursive lock. For sleep locks, witness verifies that a new process would not be switched to when a lock is released or a lock is blocked on during an acquire while any spin locks are held. If any of these checks fail, the kernel will panic.

Project status

The project started in June 2000. The major milestones in the development are: The main issue in the immediate future is to migrate more and more code out from under Giant. In more detail, we have identified the following major tasks, some of which are in an advanced state of implementation:

Performance

The implementation has not progressed far enough to make any firm statements about performance, but we are expecting reasonable scalability to beyond 32 processor systems.

Acknowledgements

The FreeBSD SMPng project was made possible by BSDi's generous donation of code from the development version 5.0 of BSD/OS. The main contributors were: Further contributors were Tor Egge, Seth Kingsley, Jonathan Lemon, Mark Murray, Chuck Paterson, Bill Paul, Alfred Perlstein, Dag-Erling Sm¿rgrav and Peter Wemm.

Bibliography

Per Brinch Hansen, Operating System Principles. Prentice-Hall, 1973.

Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley 1996.

Curt Schimmel, UNIX Systems for Modern Architectures, Addison-Wesley 1994.

Uresh Vahalia, UNIX Internals. Prentice-Hall, 1996.

Further reference

FreeBSD SMP home page.



This paper was originally published in the Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, June 25-30, 2001, Boston, Masssachusetts, USA

Last changed: 21 June 2001 bleu

Technical Program
Conference index
USENIX Home