The following paper was originally published in the Proceedings of the USENIX MACH III Symposium, Santa Fe, New Mexico, April 1993.

Using Continuations to Build a User-Level Threads Library

Randall W. Dean
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
(412) 268-7654
Internet: rwd@cs.cmu.edu
Fax: (412) 681-5739

Abstract

We have designed and built a user-level threads library that uses continuations for transfers of control. The use of continuations reduces the amount of state that needs to be saved and restored at context switch time, thereby reducing the instruction count in the critical sections. Our multiprocessor contention benchmarks indicate that this reduction and the use of Busy Spinning, Busy Waiting, and Spin Polling increase throughput by as much as 75% on a multiprocessor. In addition, flattening the locking hierarchy reduces context switch latency by 5% to 49% on both uniprocessors and multiprocessors. This paper describes the library's design and compares its overall performance characteristics to the existing implementation.

1. Introduction

Mach 3.0 [8] has been available since 1990 for anonymous FTP as a platform for building operating systems and embedded applications. To support this development environment, a library that implements multiple threads of control and mutual exclusion is provided. In Mach 2.5 [1], C-Threads [6] is the library that supplies these needs. The Mach 2.5 implementation maps user threads one-to-one with Mach kernel threads, resulting in extensive usage of kernel data structures and expensive kernel context switches. The old Mach 3.0 implementation solves the one-to-one mapping problem by multiplexing user threads onto equal or fewer kernel threads.

The old implementation uses a complex locking hierarchy suitable for multiprocessors that requires a small state machine inside the context switch critical section. Combined with the need to acquire multiple locks and save full user register state, this approach results in a relatively high latency for user context switching. The need to save full register state at context switch time also affects throughput on multiprocessors. Regardless of how finely grained the locking hierarchy is, it is necessary to hold a single central lock during a context switch to protect the blocking computation from being resumed by another thread. The time needed to save and restore the user registers becomes the lower bound on critical section size.

We have applied continuations to C-Threads to address the effects on latency and throughput caused by long critical sections. Continuations have existed in the programming language community as an abstraction for describing a future computation for many years [14]. They have also been used in the Mach 3.0 kernel [7] as a basis for control transfers between execution contexts. Their use in describing control transfers in the kernel reduced context switching times and kernel stack utilization.
In C-Threads, continuations have not only directly reduced critical section size, but have enabled further optimizations. The direct effect of applying continuations is a reduction of the number of words saved and restored at context switch time. By building a continuation when a potentially blocking C-Thread call is made, the thread state needed at context switch time can be described with only 2 words. The building of the user continuation can be done upon entering the C-Threads library, so no lock needs to be held. This reduction in critical section size makes it possible to flatten the locking hierarchy, which further reduces the size of the critical section. The net result is a reduction in latency of 5% to 49% on our uniprocessor context switch benchmark. These changes, along with Busy Waiting, Busy Spinning, and Spin Polling, increase throughput by as much as 75% as measured by our multiprocessor contention benchmark.

Considerable research has been done in the area of user-level thread packages. Of particular relevance to the work described in this paper is the work on scheduler activations [2] [3] and adaptive spinning strategies [12]. We believe there would be synergistic benefits from the integration of these efforts.

Section 2 gives an overview of the C-Threads library interface and describes the relevant internal interfaces and implementations. Section 3 describes how continuations were implemented and used with the new version of C-Threads. Section 4 describes the optimizations used in implementing the new version. Section 5 compares latency and throughput in the old and new versions of C-Threads and shows how certain optimizations affected throughput.

2. C-Thread Internals

C-Threads [6] was originally designed as a threading library that transparently used Mach features to supply parallelism. Implementations included a version that used a Mach task for each C-Thread, a version that used a single thread in a single task to supply coroutines, and a version that used one Mach task with a Mach thread per C-Thread. The old Mach 3.0 version of the library multiplexes C-Threads onto equal or fewer kernel threads because of concerns about kernel data structure use such as kernel stacks and kernel context switching times.

The remainder of this section outlines the C-Thread external interface. It also describes enough of the old Mach 3.0 implementation's internal interfaces and structure to provide a basis for understanding the changes made in the new version.

2.1. Interface

A detailed description of the C-Threads interface can be found in the CMU technical report by Cooper and Draves [6]. This section briefly reviews the interface described in that document. Where additions to the interface were made, motivation and a more detailed description of the calls is given.

2.1.1. Threads

C-Threads are created and managed with the following calls:

* cthread_t cthread_fork(cthread_fn_t func, void *arg) creates a C-Thread that executes function func with argument arg.
* void cthread_exit(void *status) exits the current C-Thread returning status.
* void *cthread_join(cthread_t thread) waits for C-Thread thread to terminate and returns the exit status of that thread.

2.1.2. Mutexes

C-Threads provides a mutual exclusion primitive called a mutex that guarantees at most one holder. Mutexes are manipulated by the following calls:

* void mutex_lock(mutex_t m) acquires the mutex m.
* void mutex_unlock(mutex_t m) releases the mutex m.
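As a short illustration of how these calls fit together, the following sketch forks several C-Threads that increment a shared counter under a mutex and then joins them. It follows the interface listed above; the <cthreads.h> header name and the mutex_alloc() allocation call are assumptions drawn from the fuller C-Threads interface [6] rather than from the calls listed in this section, and the thread and iteration counts are arbitrary.

    /* Sketch: fork NTHREADS C-Threads that increment a shared counter
     * under a mutex, then join them all.  Not taken from the paper;
     * written against the interface described in Section 2.1. */
    #include <stdio.h>
    #include <cthreads.h>          /* assumed header name for the library */

    #define NTHREADS 4

    static mutex_t counter_lock;   /* protects counter */
    static int counter;

    static void *worker(void *arg)
    {
        int i;
        for (i = 0; i < 1000; i++) {
            mutex_lock(counter_lock);
            counter++;
            mutex_unlock(counter_lock);
        }
        return arg;                /* value returned by cthread_join() */
    }

    int main(void)
    {
        cthread_t threads[NTHREADS];
        int i;

        counter_lock = mutex_alloc();     /* allocation call assumed from [6] */
        for (i = 0; i < NTHREADS; i++)
            threads[i] = cthread_fork(worker, (void *)0);
        for (i = 0; i < NTHREADS; i++)
            cthread_join(threads[i]);
        printf("counter = %d\n", counter);
        return 0;
    }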
2.1.3. Condition Variables

Condition variables allow one thread to wait for another thread to signal that an event or change in state has occurred. The following calls handle condition variables:

* void condition_wait(condition_t c, mutex_t m) waits for condition c, releasing mutex m while waiting. When the condition is signaled, the mutex is reacquired before condition_wait() returns.
* void condition_signal(condition_t c) signals condition c, waking up at least one waiting thread.

2.1.4. Kernel Threads

The old C-Thread library multiplexes multiple C-Threads onto equal or fewer kernel threads. The best performance is achieved when pre-emptive kernel context switching is minimized without losing parallelism. Two new interface calls were added to allow the application designer to control the number of kernel threads used:

* void cthread_set_kernel_limit(int limit) sets the maximum number of kernel threads that can be created. A value of zero indicates that there is no limit.
* int cthread_kernel_limit(void) returns the current maximum number of kernel threads.

2.1.5. Thread Wiring

The C-Threads library includes a mechanism to allow a user to disable C-Threads multiplexing on a per C-Thread basis:

* void cthread_wire(void) binds a C-Thread to its underlying kernel thread.
* void cthread_unwire(void) removes the binding, making it possible for multiplexing to occur.

Once a thread is wired, it becomes possible for the user to make Mach thread calls such as thread_priority(), thread_assign(), or thread_info() on the kernel thread and have the calls only affect the associated C-Thread. Programs can use this facility to run specialized threads at elevated priority; for example, the BSD 4.3 Single Server [8] uses this for threads that interact with kernel devices.

2.2. Old Implementations

The old Mach 3.0 implementation draws from the Mach 2.5 coroutine and threads versions. The coroutine version multiplexes multiple C-Threads onto a single Mach kernel thread. This requires machine dependent code for user context switching, per-object queues, and a global queue of runnable threads called the run_queue. With only one kernel thread there can be no contention for the internal data structures, so no locking is required in the coroutine version. The threads version binds C-Threads to kernel threads and has as many kernel threads as there are C-Threads. Per-object queues must now be locked because of pre-emptive scheduling, but no machine dependent context switching code or global run_queue is needed since this is all handled by the kernel. From these Mach 2.5 implementations were taken all of the queues, all of the locking, the multiple kernel threads, and the machine dependent context switching code.

With these two sources also came an internal programming layer called cprocs. Cprocs provides an implementation independent virtual processor layer that is used to implement the higher level C-Threads primitives. The differences between the various C-Threads implementations are hidden within this layer. The remainder of this section will describe interfaces and implementation details that reside below the cproc layer.

2.2.1. Locking

Locking of internal C-Threads data structures cannot be handled with mutexes, since mutexes are themselves implemented by the library. Instead, C-Threads does its locking with spin_locks. Spin_locks are implemented using the hardware's native test-and-set, compare-and-swap, or exchange-memory instruction.
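As an illustration, a minimal sketch of such a spin_lock follows, with C11 atomics standing in for the hardware's test-and-set or exchange-memory instruction; the names and the C11 notation are illustrative, not the library's code. The ordinary load before the atomic exchange is the contention-reducing test described in the next paragraph.

    /* Minimal spin_lock sketch.  C11 atomics stand in for the native
     * test-and-set or exchange-memory instruction; names are
     * illustrative, not C-Threads' actual code. */
    #include <stdatomic.h>

    typedef atomic_int spin_lock_t;      /* 0 = free, 1 = held */

    static void spin_lock(spin_lock_t *s)
    {
        for (;;) {
            /* Spin on an ordinary load so the waiting stays in the cache. */
            while (atomic_load_explicit(s, memory_order_relaxed) != 0)
                ;
            /* Try the atomic exchange only once the lock looks free. */
            if (atomic_exchange_explicit(s, 1, memory_order_acquire) == 0)
                return;
        }
    }

    static void spin_unlock(spin_lock_t *s)
    {
        atomic_store_explicit(s, 0, memory_order_release);
    }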
On uniprocessors that do not supply such an instruction, a kernel-supported software restartable-atomic-sequence is used [4]. In order to minimize bus contention when locking a spin_lock, we implement them as a test followed by the atomic operation. This keeps most of the spinning in the cache.

For each mutex and condition variable there is a queue for C-Threads that are blocked waiting on that object. There is also a global run_queue on which a C-Thread is enqueued when it becomes runnable. Each of these queues is protected by a different spin_lock. The spin_lock guarding the run_queue is called the run_lock.

2.2.2. Blocking and Resuming

The routine invoked when a C-Thread blocks is called cproc_block(). This routine does one of two things depending on whether or not the blocking thread is wired. If the thread is wired, it blocks by calling mach_msg(), waiting for a message to wake it up indicating that it can run again. If the thread is not wired, it checks the run_queue for another thread to run. If there is a C-Thread present, it context switches to that thread. If no C-Thread is present, then the underlying kernel thread must disassociate itself from the blocking C-Thread and become idle.

Since the stack that the kernel thread uses in user space is associated with the C-Thread and not the kernel thread, the first thing necessary is to acquire a stack to use. It is not possible to use the blocking C-Thread's stack, since the blocking thread could become runnable and start running on another kernel thread. A pool of idle C-Threads with small stacks is kept for this purpose. A C-Thread is pulled from this pool or created if necessary. The blocking C-Thread then context switches to this idle C-Thread, invoking the routine cproc_waiting(). Cproc_waiting() simply sits in a loop calling mach_msg(), waiting for a message to wake it up. When it receives a message, it acquires the run_lock and checks for a runnable C-Thread. If a C-Thread is found, its execution will be resumed. If not, the idle thread loops back to mach_msg().

When an operation such as condition_signal() or mutex_unlock() makes a C-Thread runnable again, cproc_ready() is invoked. This routine inserts the newly runnable thread on the run_queue and wakes up any idle kernel threads.

2.2.3. The State Machine

A small state machine is used to keep track of each C-Thread. A thread can be either RUNNING, BLOCKED, or SWITCHING. The first two states have the obvious meaning and are easy to deal with in cproc_ready() and cproc_block(). The third state, SWITCHING, occurs when a thread is in the process of blocking and has released the lock on the queue of the object for which it is waiting.

2.3. Evaluation of Old Implementation

While the old Mach 3.0 implementation is a great improvement over the Mach 2.5 implementations because it enables user-level context switching, we felt that further enhancements could be made. The old implementation is very aggressive about blocking. Spinning before blocking has been shown to be superior to just spinning or just blocking [12]. Mach 3.0 has better scheduling primitives for synchronization than mach_msg() [5]. Finally, the old Mach 3.0 code is complex and hard to maintain. We created the new implementation to address our concerns with the shortcomings of the old implementation.

3. Continuations

A continuation is an object that fully describes a future computation. A continuation can be built, invoked, and passed around as an argument like any other object. The Mach 3.0 kernel has two types of continuations.
The first type of continuation describes the user's state and is built when the thread crosses the protection boundary from user space to kernel mode. The second type describes the state that has accumulated since entering the kernel and is built when the thread blocks inside the kernel.

As with the kernel implementation, there are two distinct continuations in the new version of the C-Threads library. A continuation that describes the user's computation at the time a C-Thread routine is called is built upon entrance to the library. We call this the User Continuation. The other continuation present in this implementation is used internally to describe the computation that should occur when a C-Thread is resumed after blocking. We call this the Internal Continuation.

We have two different continuations because, like the kernel, the computations they describe are so different. The User Continuation describes the user state that has accumulated up to the point of a C-Threads operation being invoked. This is potentially large and difficult to examine, because it resides on the user's stack. Furthermore, the user's state is machine dependent. We build the User Continuation to encapsulate this machine dependent state as a machine independent object. The Internal Continuation is a small, easily examined, easily manipulated, machine independent object consisting of a function pointer and an argument.

3.1. Building and Invoking Continuations

The kernel has a distinct protection boundary that makes it easy to build and invoke user continuations when the boundary is crossed. To make it possible for C-Threads to easily build user continuations, we supply a wrapper to potentially blocking C-Thread calls that acts much like the user-to-kernel protection boundary crossover. Building the User Continuation is done without holding any locks. The continuation is built by saving all the callee-saved user registers on the current stack and saving the resulting stack pointer. Upon exit from the C-Threads library, this continuation is invoked. This entails setting the stack pointer to that saved at continuation creation, restoring the user's registers, and returning to the user routine which originally called C-Threads.

With the user's state encapsulated in the User Continuation, all that is needed to describe the state of the library is a single function pointer and argument. The Internal Continuation's two words are saved inside the User Continuation in space reserved for this purpose. When this thread is resumed and its continuation is invoked, the function is called with the specified argument. The thread's stack pointer is derived from the address of the previously saved User Continuation.
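To make the two-word description concrete, the following schematic C sketch shows one possible shape for the two objects. The names, types, and layout are hypothetical, not the library's actual definitions; the register-save area is machine dependent and is shown only as an opaque array.

    /* Schematic sketch of the two continuation objects described above.
     * Names, types, and layout are illustrative only. */

    #define CALLEE_SAVED_WORDS 12          /* machine dependent; illustrative */

    typedef void (*cont_fn_t)(void *);

    /* Internal Continuation: the two words that describe the library's
     * computation while a C-Thread is blocked. */
    struct internal_continuation {
        cont_fn_t func;                    /* routine to run on resume */
        void     *arg;                     /* its argument             */
    };

    /* User Continuation: built on the user's stack upon entry to the
     * library, before any lock is taken.  It encapsulates the machine
     * dependent callee-saved registers and reserves space for the
     * Internal Continuation to be filled in if the thread blocks. */
    struct user_continuation {
        unsigned long saved_regs[CALLEE_SAVED_WORDS];
        void         *saved_sp;            /* stack pointer at entry   */
        struct internal_continuation internal;
    };

    /* Resuming a blocked C-Thread: the stack pointer is derived from
     * the address of its User Continuation, and the saved function is
     * called with the saved argument. */
    static void invoke_internal(struct user_continuation *uc)
    {
        uc->internal.func(uc->internal.arg);
    }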
3.2. The filter Mechanism

To centralize the manipulation of continuations, a single object called a filter is used. All manipulations of continuations within C-Threads are done through this object. A filter is a per-thread specified function that is invoked each time the manipulation of a continuation is necessary. A filter takes as a parameter the type of manipulation necessary. These are:

* FILTER_USER_BUILD: This is invoked by the wrapper of a potentially blocking C-Thread routine. It calls the real function, which is passed to it as an argument.
* FILTER_INTERNAL_BUILD: This is invoked when a C-Thread actually blocks. It is passed the function and argument that must be invoked when the C-Thread continues. It calls the FILTER_INTERNAL_INVOKE of the C-Thread that is being resumed.
* FILTER_INTERNAL_INVOKE: This is invoked by a blocking C-Thread from FILTER_INTERNAL_BUILD. It invokes the routine and argument saved in the Internal Continuation.
* FILTER_USER_INVOKE: This is invoked upon leaving C-Threads after entering with FILTER_USER_BUILD. It invokes the User Continuation.
* FILTER_PREPARE: This is invoked when C-Threads creates a new C-Thread. It is passed a routine and argument, like FILTER_INTERNAL_BUILD, that will be invoked when the created C-Thread begins running. This is essentially the same as FILTER_INTERNAL_BUILD without the invocation of FILTER_INTERNAL_INVOKE upon exit.

Filters can be defined outside of C-Threads. The filter described in this section, the default_filter, is the filter specified for a thread when it is created. The following two calls allow the filter for a C-Thread to be examined and changed:

* cthread_fn_t cthread_filter(cthread_t thread) returns the current filter for C-Thread thread.
* void cthread_set_filter(cthread_t thread, cthread_fn_t func) sets the current filter for C-Thread thread to func.

    Stack
      User Computation
      . . .
      condition_wait()
      filter(USER_BUILD)
      User Continuation
      Internal Continuation
      condition_wait_real()
      cthread_block()
      filter(INTERNAL_BUILD)
      filter(INTERNAL_INVOKE)
      mutex_lock_real()
      filter(USER_INVOKE)
      User Computation
      . . .

Figure 3-1: Example of user blocking with condition_wait()

3.3. Example of Continuation Use

Figure 3-1 shows the path taken as a user computation blocks with condition_wait(). The user calls condition_wait(), which is a wrapper that invokes the filter to build the User Continuation. The User Continuation and space for the Internal Continuation appear next on the stack as built by the filter. The real condition_wait() is then called and the C-Thread blocks. When it blocks, it invokes the filter again to build the Internal Continuation. This is saved in the space previously allocated for it when the User Continuation was built. The blocking thread then invokes the Internal Continuation of the C-Thread to which it is switching, removing the kernel thread from the blocking C-Thread and leaving it blocked. When the blocked thread is resumed by another thread invoking the blocked thread's Internal Continuation, mutex_lock() is called to reacquire the mutex released when it blocked. Assuming this succeeds, the User Continuation is invoked and the user's computation continues with the stack back where it originally was when condition_wait() was first called.

4. Optimizations

Continuations enabled a number of changes to the new implementation of C-Threads. These include a flattening of the locking hierarchy and various sorts of spinning when the computation is no longer able to make progress. The optimizations that require spinning occur only on multiprocessors and assume either sequentially consistent [13] or processor consistent [9] memory. This section describes these changes and the motivations behind them.

4.1. Locking Techniques

The old implementation uses separate spin_locks to protect the run_queue and the internal queue of each mutex and condition. The old implementation also requires a state machine to keep track of C-Thread state as threads are enqueued and dequeued and locks are acquired, released, and reacquired. By using only a single spin_lock to protect all the queues and the internal state, we eliminate the need for the SWITCHING state of the state machine. Once the SWITCHING state is removed, there is no longer a reason to have the state machine at all.
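The following hypothetical sketch illustrates the flattened locking on the blocking path; it is not the library's code. A single run_lock covers the run_queue and the per-object wait queues, so the entire enqueue-and-switch step happens under one lock; all names and the queue code are illustrative, and the machine-dependent context switch itself is only indicated by a comment.

    /* Hypothetical sketch of blocking under the flattened locking: one
     * run_lock protects the run_queue, every per-object wait queue, and
     * the rest of the internal state, so no SWITCHING state is needed. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef atomic_int spin_lock_t;              /* 0 = free, 1 = held */

    typedef struct cproc {
        struct cproc *next;
        /* ... its User and Internal Continuations live on its stack ... */
    } cproc_t;

    typedef struct { cproc_t *head, *tail; } cthread_queue_t;

    static spin_lock_t     run_lock;             /* the single internal lock  */
    static cthread_queue_t run_queue;            /* global queue of runnables */

    static void lock(spin_lock_t *s)   { while (atomic_exchange(s, 1)) ; }
    static void unlock(spin_lock_t *s) { atomic_store(s, 0); }

    static void enqueue(cthread_queue_t *q, cproc_t *p)
    {
        p->next = NULL;
        if (q->tail) q->tail->next = p; else q->head = p;
        q->tail = p;
    }

    static cproc_t *dequeue(cthread_queue_t *q)
    {
        cproc_t *p = q->head;
        if (p) { q->head = p->next; if (q->head == NULL) q->tail = NULL; }
        return p;
    }

    /* Block the current cproc on an object's wait queue. */
    static void cproc_block_on(cthread_queue_t *wait_queue, cproc_t *self)
    {
        lock(&run_lock);                 /* one lock covers everything     */
        enqueue(wait_queue, self);       /* park on the mutex or condition */
        cproc_t *next = dequeue(&run_queue);
        /* In the real library, self's two-word Internal Continuation is
         * recorded here and control switches to next (or to an idle
         * thread if next is NULL) with run_lock still held; the lock is
         * released once the switch completes.  That step is machine
         * dependent and omitted from this sketch. */
        (void)next;
        unlock(&run_lock);
    }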
While the use of only one spin_lock creates a single point of contention, it also makes it easier to add contention reducing optimizations. With this change and the use of continuations, the critical section of the context switch path actually becomes shorter than in the old implementation.

4.2. Busy Spinning and Busy Waiting

Busy Spinning is defined as spinning until some global state changes when there may be other work that the processor could accomplish. Spinning before blocking has been shown to be superior to only spinning or only blocking [12]. The new implementation spins for a constant number of iterations before taking action to yield the processor. The number of iterations to spin is currently a static, machine-dependent value determined at compile time. Busy Spinning currently occurs whenever a spin_lock must be taken, in mutex_lock(), and in idle threads. We only Busy Spin on multiprocessors, since spinning would guarantee the obstruction of work on uniprocessors.

In mutex_lock() the thread spins attempting to acquire the mutex. If the mutex is acquired, the thread returns by invoking the user continuation. Otherwise, after spinning for the maximum number of iterations and not acquiring the mutex, the thread then attempts to acquire the run_lock, queue on the mutex, and block.

An idle thread spins on the run_queue without holding the run_lock, waiting for a C-Thread to appear that can be run. If this occurs, the idle thread attempts to acquire the run_lock, dequeue a C-Thread, and switch to it. After spinning for the maximum number of iterations and not having a C-Thread appear, the idle thread depresses its priority and begins again [5]. This priority depression yields the processor to any runnable kernel thread. When there are no runnable kernel threads except the depressed thread, the system automatically aborts the priority depression and resumes the idle thread. We call this depressed priority spinning or Busy Waiting. C-Threads that are wired also use Busy Waiting.

Unfortunately, Busy Waiting will not work perfectly on a machine with other tasks running if these tasks have compute bound threads. Once an idle thread depresses its priority, it would never run again if there were another thread running on the system. To address this, we use a timeout when we depress priority in the general spin_lock case so that starvation does not occur. For the idle thread we keep a queue of the kernel threads currently idle, and when we insert a thread on the run_queue, we call thread_depress_abort(). This aborts the priority depression and returns the thread to its original priority.

The old Mach 3.0 C-Threads implementation uses mach_msg() instead of Busy Waiting in the idle thread. When the run_queue is empty the idle thread waits to receive a message indicating that a C-Thread has been inserted in the queue. The disadvantage of this approach is that it is not possible for the idle thread to act on the newly runnable C-Thread until it receives a message. The C-Thread languishes in the run_queue for the duration of a message send followed by a receive. It is also the case that sending a message and receiving a message are considerably costlier operations than depressing priority and aborting it.
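The following hypothetical sketch shows the shape of this spin-then-block policy in mutex_lock(). The spin bound, names, and C11 atomics are illustrative, and the real blocking path (acquire the run_lock, queue on the mutex, and switch away) is reduced to a placeholder so the sketch stays self-contained.

    /* Sketch of spin-then-block mutex acquisition as described above.
     * Illustrative only; not C-Threads' actual code. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MUTEX_SPIN_COUNT 1000            /* static, machine-dependent bound */

    typedef struct { atomic_int held; } mutex_sketch_t;   /* 0 = free, 1 = held */

    static bool mutex_try(mutex_sketch_t *m)
    {
        /* Test first to keep the spinning in the cache, then exchange. */
        return atomic_load_explicit(&m->held, memory_order_relaxed) == 0 &&
               atomic_exchange_explicit(&m->held, 1, memory_order_acquire) == 0;
    }

    /* Placeholder for the real blocking path: acquire the run_lock,
     * queue the thread on the mutex, and switch away.  Spinning here
     * merely keeps the sketch self-contained. */
    static void mutex_block(mutex_sketch_t *m)
    {
        while (!mutex_try(m))
            ;
    }

    void mutex_lock_sketch(mutex_sketch_t *m)
    {
        int i;
        for (i = 0; i < MUTEX_SPIN_COUNT; i++)
            if (mutex_try(m))
                return;                      /* acquired while Busy Spinning */
        mutex_block(m);                      /* give up and block */
    }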
4.3. Spin Polling

In both of the previously described specific cases of Busy Spinning, one of the termination conditions of spinning results in attempting to acquire the run_lock. Since the run_lock is the only spin_lock protecting state used internally by C-Threads, in a finely grained parallel application contention is likely. With mutex_lock() this occurs after a failure to acquire the mutex. With an idle thread, it occurs when a C-Thread appears in the run_queue. In both of these cases, instead of only spinning attempting to acquire the run_lock, we spin both attempting to acquire the lock and checking that the condition that required its acquisition is still true. If after a fixed number of iterations the thread has neither acquired the run_lock nor determined that it no longer needs to acquire it, the thread yields the processor by depressing its priority. This optimization is called Spin Polling.

4.4. Continuation Recognition

The default_filter uses continuation recognition, an optimization first proposed and used in the Mach 3.0 kernel [7]. When the FILTER_INTERNAL_BUILD call of the default_filter prepares to call FILTER_INTERNAL_INVOKE of the thread it is activating, it checks to see if the filter of the resuming thread is the default_filter. If it is, then the default_filter can manipulate the Internal Continuation of the newly activated thread to be resumed directly, without needing to make the generalized FILTER_INTERNAL_INVOKE call.

Another form of continuation recognition occurs within condition_signal(). The continuation of a thread that is being continued by condition_signal() is mutex_lock_solid(). By examining the Internal Continuation and determining the actual mutex that the blocked thread will be attempting to acquire, condition_signal() can make a better choice about which queue to insert the thread into. If the mutex is currently held, then the thread is inserted into the mutex queue where it will later be resumed when the mutex is released. If the mutex is not held, then the thread is inserted into the run_queue with the expectation that it will succeed in acquiring the mutex when resumed.

5. Performance

This section describes two benchmarks designed to measure throughput and latency. We first compare the old and new implementations and then examine the effects of the previously described implementation decisions and optimizations. Two platforms are used to show these results: a Decstation 5000/20 Workstation with a 20MHz R3000 [11] and a Sequent Symmetry with twenty 20MHz 80386's [10].

5.1. Latency

To show the latency of blocking operations, we use a simple PingPong test that has two C-Threads context switch back and forth via a condition variable. Three versions of this program were created: one that sets the kernel thread limit to two threads, a second that sets the limit to one thread, and a third that does not limit the number of kernel threads, and wires each thread. Figure 5-1 shows the results of these tests.

                 Decstation                 Sequent
  Version      Old      New    Ratio      Old      New    Ratio
  1 Thread    146 µs    75 µs   1.95     151 µs   133 µs   1.14
  2 Threads   578 µs    75 µs   7.71     268 µs   146 µs   1.84
  Wired       503 µs   256 µs   1.96    1225 µs   447 µs   2.74

Figure 5-1: PingPong benchmark results

Running the PingPong test with one kernel thread measures the latency of context switching. This is labeled "1 Thread" in Figure 5-1. As can be seen, nearly a two-fold improvement is seen on the Decstation. The latency on the Sequent improves also, but not by as much. This relative difference in improvement might be attributable to differences in the respective memory systems. Running the PingPong test with the threads wired measures the cost of the synchronization primitive used by the underlying kernel.
As described earlier, the old implementation uses mach_msg() for synchronization, while the new implementation uses thread priority depression. On the uniprocessor Decstation, the differences in synchronization costs result in nearly a factor of two performance improvement. The multiprocessor Sequent benefits more since explicit synchronization is not actually necessary to resume the second thread. With the old implementation, the second thread will not run until it receives a message from the first. This results in dead time during each iteration equivalent to the cost of a message send and message receive. Since the new implementation uses Busy Waiting in wired threads, the second thread will begin running immediately upon the condition being signaled. On a multiprocessor, the execution of the first thread's thread_depress_abort() overlaps the progress being made in the second thread. There is no equivalent cost to the message receive.

Running the PingPong test with two unwired kernel threads adds a way for progress to be made by allowing user-level context switching to occur. On the Decstation, a quantum, the amount of time before a pre-emptive kernel context switch will occur, is 15625 µs. It is possible for a kernel thread to execute as many as 200 context switches before another kernel thread will be pre-emptively scheduled. In the PingPong test, when the second kernel thread does run, if it cannot make progress, it yields the processor back to the first. If it can make progress, it will continue processing user-level context switches until another kernel context switch occurs. In contrast, the old implementation, with its higher latency and higher synchronization costs, can execute at most 100 user context switches between kernel context switches.

5.2. Contention Comparisons

To measure throughput, we ran a benchmark that generates lock contention and measures how long it takes multiple threads to make progress while contending for these locks. The contention benchmark creates a set of locks and associates a pool of threads with each lock. Each thread does some number of units of simulated work, then acquires its lock, increments a common counter, and releases the lock. On the Sequent one unit of work takes 5.2 µs.

Figure 5-2: Comparisons of old and new thread packages on the Sequent
(Two panels plot time in µs against units of work, for 2 locks with 10 threads each and for 5 locks with 4 threads each, each comparing Old Threads and New Threads.)

Figure 5-2 shows the contention benchmark on the Sequent comparing the old and new implementations with two different lock and thread parameters. If there were no lock contention, these graphs would be straight lines with slopes equal to the duration of one unit of work and intercepts equal to the overhead associated with creating and deleting the threads. As can be seen, the graphs are not straight lines and contention does exist as the amount of work done between acquisitions approaches zero. The results of this benchmark show the throughput of the new implementation is significantly greater than that of the old implementation. The overhead in the new implementation is lower because its primitives for creating and deleting threads are faster. In the old implementation there is a higher threshold of work above which contention is insignificant, which further widens the performance gap between the two implementations. When work approaches zero, the new and old implementations perform about equally well.
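For concreteness, the per-thread loop of such a contention benchmark might look like the following hypothetical reconstruction, written against the interface from Section 2.1; it is not the benchmark actually measured, and the iteration count and delay-loop calibration are arbitrary.

    /* Hypothetical per-thread loop of a lock-contention benchmark:
     * do some simulated work, take the lock, bump the shared counter,
     * release the lock.  Not the paper's actual benchmark code. */
    #include <cthreads.h>          /* assumed header name */

    #define ITERATIONS 1000        /* arbitrary */

    struct contender_arg {
        mutex_t lock;              /* the lock this thread contends for   */
        int    *counter;           /* common counter protected by lock    */
        int     work_units;        /* simulated work between acquisitions */
    };

    static volatile int sink;      /* keeps the delay loop from being optimized away */

    static void simulated_work(int units)
    {
        /* One "unit" stands for a fixed short delay (5.2 µs per unit on
         * the Sequent in the paper); the inner bound here is arbitrary. */
        int u, i;
        for (u = 0; u < units; u++)
            for (i = 0; i < 50; i++)
                sink++;
    }

    void *contender(void *a)
    {
        struct contender_arg *w = a;
        int i;

        for (i = 0; i < ITERATIONS; i++) {
            simulated_work(w->work_units);
            mutex_lock(w->lock);
            (*w->counter)++;
            mutex_unlock(w->lock);
        }
        return 0;
    }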
Figure 5-3: Old and New versions varying processors and amount of work on the Sequent
(Two panels, one for the Old Threads Package and one for the New Threads Package, plot time in µs against the number of processors for 0, 20, 40, 60, and 80 units of work.)

To better show the effect of adding processors to a finely grained application, we ran the contention benchmark with one lock and a varying number of threads, each doing a fixed amount of work. Figure 5-3 shows these results for a number of different amounts of work. As can be seen, the new implementation scales considerably better than the old one. If there is no contention, adding a thread that can run on an available processor should have no effect on the execution time of the benchmark. At some number of processors, the amount of time to execute this test begins increasing. With the old implementation and 40 units of work, contention begins to affect the execution time at approximately 11 threads. The new implementation is not affected until approximately 13 threads. In general, it appears the new implementation can support two more threads without contention than the old one.

5.3. Effects of Different Techniques

To examine the effects of Busy Spinning and Spin Polling, three versions of the new C-Threads implementation were built with one or both of these features disabled. Figure 5-4 shows the contention benchmark run on the Sequent with 40 threads contending for one lock.

Figure 5-4: Effects of Busy Spinning and Spin Polling on throughput on the Sequent
(Two panels plot time in µs against units of work for 1 lock with 40 threads: New Threads versus New Without Busy Spinning, and New Threads versus New Without Spin Polling.)

The left graph in Figure 5-4 shows the effect of disabling Busy Spinning. The use of Busy Spinning provides an order of magnitude performance improvement in throughput under contention. The right graph shows the effect of disabling Spin Polling. While disabling Spin Polling alone does not make a significant difference, disabling both Busy Spinning and Spin Polling degrades performance by two orders of magnitude.

To better quantify the effects of these optimizations, measurements were made to examine the relative frequency of events, such as acquiring the mutex while Busy Spinning or the run_queue becoming empty while attempting to acquire the run_lock in the idle thread. The contention benchmark was run on the Sequent using one lock, 40 threads, and 80 units of work. As seen in Figure 5-4, 80 units of work is a point of prime contention. The total number of iterations for this test was 100,000, yielding 4,000,000 total attempts to acquire the mutex lock. Figure 5-5 shows these statistics.
  Event                                   Count    Percent
  Mutex Missed                        3,913,671    97.8% of total
  Acquired while Busy Spinning        3,761,246    96.1% of misses
  Acquired while Spin Polling            28,007     0.7% of misses
  Mutex Blocked                         124,418     3.2% of misses
  Threads Idled                          98,754    79.3% of blocked
  C-Thread lost while Spin Polling       33,223    25.2% of waiters

Figure 5-5: Statistics for the Sequent with 1 lock and 40 threads

As expected from Figure 5-4, Figure 5-5 shows high contention on the mutex with a 97.8% chance of failing to acquire the mutex on the first attempt with 80 units of work. Busy Spinning accounts for 96.1% of the cases where the mutex was missed. This is consistent with the effects shown in the first graph in Figure 5-4; removing Busy Spinning decreases performance by almost an order of magnitude. Spin Polling is only beneficial in about 1% of the misses, but this translates to 18.4% of the cases where the thread was preparing to block. The remaining 3.2% of the misses result in the thread actually blocking and calling cthread_block(). Of the calls made to cthread_block(), almost 80% result in the kernel thread becoming idle. Spin Polling detects the emptying of the run_queue before acquiring the run_lock in 25% of the idle threads. These results show that Busy Spinning is extremely important when there is contention on a mutex in order to avoid passing that contention to the run_lock. When contention is passed on to the run_lock, Spin Polling removes the need to acquire the lock in about 1 in 5 cases.

6. Conclusions

We have built a C-Threads library with lower latency and higher throughput than our previous implementation. Continuations decreased critical section size and made a number of important performance optimizations possible. The new C-Threads library allows applications to be built that will run well on both uniprocessors and multiprocessors without the need to recompile or relink. In addition, the use of continuations allows the removal of the locking hierarchy and the state machine, which both makes the system faster and greatly simplifies the code.

7. Acknowledgments

Thanks go to everyone who helped in the work and commented on the paper. These include Brian Bershad, Rich Draves, David Golub, Joanne Karohl, Daniel Stodolsky, and Peter Stout.

References

[1] Accetta, M.J., Baron, R.V., Bolosky, W., Golub, D.B., Rashid, R.F., Tevanian, A., and Young, M.W. Mach: A New Kernel Foundation for UNIX Development. In Proceedings of Summer USENIX. July, 1986.

[2] Anderson, T.E., Bershad, B.N., Lazowska, E.D., and Levy, H.M. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. In Proceedings of the 13th ACM Symposium on Operating Systems Principles. October, 1991.

[3] Barton-Davis, P., McNamee, D., Vaswani, R., and Lazowska, E.D. Adding Scheduler Activations to Mach 3.0. In Proceedings of the 3rd USENIX MACH Symposium. April, 1993. To appear.

[4] Bershad, B.N., Redell, D., and Ellis, J. Mutual Exclusion for Uniprocessors. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. October, 1992.

[5] Black, D.L. Scheduling and Resource Management Techniques for Multiprocessors. PhD thesis, Department of Computer Science, Carnegie Mellon University, July, 1990.

[6] Cooper, E.C. and Draves, R.P. C Threads. Technical Report CMU-CS-88-154, Department of Computer Science, Carnegie Mellon University, February, 1988.
[7] Draves, R.P., Bershad, B., Rashid, R.F., and Dean, R.W. Using Continuations to Implement Thread Management and Communication in Operating Systems. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles. 1991.

[8] Golub, D., Dean, R.W., Forin, A., and Rashid, R.F. Unix as an Application Program. In Proceedings of the Summer 1990 USENIX Conference. June, 1990.

[9] Goodman, J.R. and Woest, P. The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor. In Proceedings of the 15th Annual Symposium on Computer Architecture, pages 422-431. Honolulu, Hawaii, June, 1988.

[10] Intel. 386 Programmer's Reference Manual. Intel, Mt. Prospect, IL, 1990.

[11] Kane, G. MIPS RISC Architecture. Prentice-Hall, Englewood Cliffs, NJ, 1988.

[12] Karlin, A.R., Li, K., Manasse, M.S., and Owicki, S. Empirical Studies of Competitive Spinning for Shared-Memory Multiprocessors. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles. 1991.

[13] Lamport, L. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers C-28(9):690-691, September, 1979.

[14] Milne, R. and Strachey, C. A Theory of Programming Language Semantics. Halsted Press, New York, 1976.

Table of Contents

1. Introduction
2. C-Thread Internals
2.1. Interface
2.1.1. Threads
2.1.2. Mutexes
2.1.3. Condition Variables
2.1.4. Kernel Threads
2.1.5. Thread Wiring
2.2. Old Implementations
2.2.1. Locking
2.2.2. Blocking and Resuming
2.2.3. The State Machine
2.3. Evaluation of Old Implementation
3. Continuations
3.1. Building and Invoking Continuations
3.2. The filter Mechanism
3.3. Example of Continuation Use
4. Optimizations
4.1. Locking Techniques
4.2. Busy Spinning and Busy Waiting
4.3. Spin Polling
4.4. Continuation Recognition
5. Performance
5.1. Latency
5.2. Contention Comparisons
5.3. Effects of Different Techniques
6. Conclusions
7. Acknowledgments

List of Figures

Figure 3-1: Example of user blocking with condition_wait()
Figure 5-1: PingPong benchmark results
Figure 5-2: Comparisons of old and new thread packages on the Sequent
Figure 5-3: Old and New versions varying processors and amount of work on the Sequent
Figure 5-4: Effects of Busy Spinning and Spin Polling on throughput on the Sequent
Figure 5-5: Statistics for the Sequent with 1 lock and 40 threads