An Efficient Kernel-Based Implementation of POSIX Threads
		 
			    Robert A. Alfieri
		        Data General Corporation

Abstract

This paper describes the kernel-based imple-
mentation of POSIX Threads (Pthreads) in the 
DG/UX operating system.  The implementation 
achieves time efficiency by using a general-purpose 
trap mechanism, known as a Kernel Function Call 
(KFC), that carries an order of magnitude less over-
head than a traditional system call. On a 50 MHz 
Motorola MC88110, the implementation can create 
and exit a thread (with the associated context switch) 
in 8.1 microseconds and yield to another thread in 4.0 
microseconds. The implementation also achieves 
space efficiency by paging and decoupling bulky data 
structures.

The advantages of a kernel-based implementa-
tion include design simplicity, less code redundancy, 
optimization of global (interprocess) operations, 
avoidance of inopportune preemption, and global 
semantic flexibility. The disadvantage is a monolithic 
design that lacks user-level flexibility.


1.  Introduction

Threads provide an efficient and convenient 
concurrent programming paradigm for applications 
running on shared-memory multiprocessors. As 
industry-standard thread interfaces such as POSIX 
Threads (Pthreads) [Pos93] find their way into open 
systems, an increasing number of portable applica-
tions are being written (or rewritten) to exploit 
threads.

Support for Draft 6 of Pthreads was shipped in 
version 5.4R3.00 of DG/UX, Data General's com-
mercial UNIX, operating system. DG/UX originated 
in 1984 as a rewrite of the UNIX kernel in order to 
support Symmetric Multiprocessing (SMP), full pre-
emption, and better modularity [Kel89].

Though this paper focuses on the techniques 
that DG/UX uses to implement threads efficiently in 
the kernel, any operating system could apply the same 
techniques to other thread packages or performance-
critical system calls. Most importantly, none of these 
techniques are specific to DG/UX or POSIX Threads.


2.  Design Overview

The overriding design goal is to allow standard 
multithreaded applications to map as efficiently as 
possible (both in terms of time and space) onto cur-
rent and future multiprocessor architectures. In order 
to realize this goal, the implementation employs three 
main features:

Kernel Function Calls (KFCs) 

KFCs are fast, general-purpose kernel traps 
that allow Pthreads to be implemented simply and 
efficiently in the kernel. 

The implementation uses the same set of KFCs 
to optimize both local (intraprocess) and global 
(interprocess) thread operations. Global operations 
are important today and will become increasingly 
important as applications and machines move towards 
a paradigm of distributed computing.

Low Memory Overhead

By keeping the per-thread physical and virtual 
memory consumption to a minimum, the implementa-
tion can easily support thousands of threads in a 
single process.

All per-thread data structures are pageable, 
except a 128-byte kernel-level structure. Moreover, 
space-consuming kernel stacks and other transient 
data are tied only to threads that have entered the ker-
nel for a traditional system call, page fault, or other 
type of full kernel entry.

Pthread Groups and Hierarchical CPU Affinity

These extensions give applications explicit 
control over the scheduling of threads onto the under-
lying multiprocessor machine. 

The affinity mechanism is hierarchical in that it 
parallels the CPU/cache/memory hierarchy of the 
underlying machine. Applications may group together 
threads working on the same data set, then affix that 
thread group to a set of CPUs sharing the same cache 
or local memory. This feature is particularly important 
for improving cache locality on large SMP or NUMA 
(Non-Uniform Memory Access) machines. If an 
application does not specify its thread grouping or 
affinity, the operating system performs automatic 
thread grouping and affinity assignments. 

The rest of this paper is devoted to describing 
the motivations and details of the implementation 
with respect to the first two features. The last feature 
is described separately [Alf94]. 


3.  Application Requirements

There are two broad classes of thread operations:

1) Local Operations are those operations that occur 
   among threads in the same process. Local opera-
   tions do not involve communication with other 
   processes and are often referred to as intrapro-
   cess or intra-address-space operations.

2) Global Operations are those operations that 
   occur between threads in two different processes 
   or between a thread and the kernel. Global opera-
   tions are often referred to as interprocess or inter-
   address-space operations.

A thread implementation should optimize both 
types of operations. Because global operations require 
more work, they tend to be somewhat slower than 
local operations. However, a request should only 
incur a cost that is proportional to the service that 
must be delivered.


3.1. Local Operations

Ideally, a threaded application spends all of its 
time in user space performing local operations and 
rarely talks to the kernel or to other processes. Exam-
ples of using local-only operations include situations 
in which

o Database back-end threads perform a parallel 
  sort of a database table stored in process memory.

o Real-time threads are used to prioritize certain 
  events that are handled entirely in user space.

o Simulation software breaks up a problem into 
  smaller simulations that run as threads on separate
  CPUs.


3.2. Global Operations 

Commercial applications also make extensive 
use of global operations. For data integrity reasons, 
applications are split into multiple processes, and 
shared system services reside in the kernel or in privi-
leged server processes. The kernel must always 
participate in (safe) global thread operations, as illus-
trated by the following examples in which

o Database front-end threads go through the kernel
  to talk to database back-end threads which, 
  in turn, make kernel system calls to transfer 
  data to or from permanent storage. 

o Real-time threads require kernel-based global 
  scheduling in order to minimize response time 
  to global asynchronous events.

o Distributed computing services are divided into 
  replicated multithreaded processes, which com-
  municate using a Local or Remote Procedure 
  Call (RPC) mechanism. 

Clearly, global interactions play an important 
role in current and future multithreaded applications. 
Implementations should optimize both local and glo-
bal cases.


4.  Previous Work

This section introduces library-based imple-
mentations of threads and explains why DG/UX has 
not adopted the same approach. Next, the discussion 
focuses on previous work within DG/UX that ulti-
mately led to an efficient kernel-based 
implementation. 


4.1. Multiplexed Libraries

In pursuit of optimal performance, numerous 
multiplexed thread libraries have been developed to 
avoid kernel system calls, especially during perfor-
mance critical operations such as thread creation and 
context switching [Gol90, Pow91, Ste92, Mue93]. 

Multiplexed libraries employ two levels of 
scheduling. User-level threads are multiplexed onto a 
typically smaller number of kernel-level entities (e.g., 
processes or kernel-level threads). In the simplest 
implementations, all user-level threads are multi-
plexed onto a single process. In more sophisticated 
implementations, the number of kernel-level entities 
varies with the number of CPUs that are assigned to 
the process. Context switches among kernel-level 
entities involve slower system calls. Context switches 
among user-level threads occur entirely in user space, 
thus optimizing performance. In addition, flexible 
multiplexed libraries [And91b, Mar91] allow applica-
tions to integrate their own thread schedulers.

Because multiplexed libraries reside in user 
space, highly tuned library primitives can be used 
only for local thread operations. Global thread opera-
tions, such as interprocess synchronization and RPCs, 
typically follow significantly slower system call paths 
or use dedicated kernel-entry mechanisms.

In addition, well-integrated multiplexed librar-
ies require complex algorithms to bridge the wide gap 
between library and kernel databases. In particular, 
the library and kernel must continuously inform each 
other of the number of runnable threads in user space 
and the number of available CPUs in the kernel. Refer 
to the section on "Kernel vs. User Threads" for a more 
detailed discussion of this and other issues.

Despite their complexity, multiplexed libraries 
prevail due to the perception that system call overhead 
would dominate the cost of a kernel-based thread 
primitive.


4.2. System Call Overhead

This section explains in greater detail why mul-
tiplexed library-based systems and DG/UX have 
avoided using traditional system calls to implement 
threads.

Have System Calls Gotten Relatively Slower? 

Recent research in the area of RISC processor 
performance [Ous90, And91a] asserts that operating 
system primitives such as system calls, exceptions, 
and process context switches have not experienced 
the same accelerated speedup as application integer 
and floating-point operations. There are two reasons 
why this observation may hold true for system calls in 
many commercial operating systems:

1) System calls are typically more memory-inten-
   sive than arithmetic operations, thereby 
   introducing more processor stalls in the form of 
   cache misses and degraded instruction-level par-
   allelism.

2) System call entry has grown more complex due 
   to the introduction of additional functionality and 
   complexities in the kernel.

As an example, DG/UX system call overhead 
has increased mainly for the following reasons:

1) RISC processors have required the kernel to save 
   and restore more registers and to do more work 
   during system calls than previous CISC proces-
   sors. 

2) The desire to provide more precise execution 
   time accounting has resulted in reading hardware 
   timers during system call entry and exit. Histori-
   cally, time accounting has been implemented 
   using a coarse ten-millisecond clock tick.

3) An effort to make the kernel more modular and 
   portable has introduced more inter-subsystem 
   calls during the system call path.

Reasons (1) and (3) likely apply to many other 
operating systems, while reason (2) may be specific 
only to DG/UX and a few other commercial systems.

Regardless of the actual implementation, sys-
tem call overhead is inherently significant due to user 
state saving and restoring, and preparing the caller for 
execution in a fully preemptive SMP kernel.

Impact on Thread Primitives

Most local thread primitives execute in micro-
seconds or tens of microseconds. System call 
overhead, which is on the order of tens of microsec-
onds, would contribute significantly to the overall cost 
of a kernel-based thread primitive. Roughly speaking, 
a system call imposes the equivalent of (at least) an 
additional thread-to-thread context switch on each 
kernel entry.


4.3. Down-Sizing System Calls

Fortunately, system calls are not the only way 
to enter the kernel. Modern RISC and CISC proces-
sors provide fast trap instructions, which can be used 
to invoke simpler kernel primitives that are an order of 
magnitude faster than traditional system calls.

For example, on a (relatively slow) 25 MHz 
Motorola MC88100 [Mot91], the cost of trapping into 
the kernel and returning from the trap, including 
proper Processor Status Register (PSR) and Program 
Counter (PC) setup and restoration, is less than one 
microsecond. A null system call on the same machine 
takes approximately 20 microseconds. 

The actual trap into the kernel is inexpensive. 
The real issue is the costly steps that must be per-
formed in order to set up for a traditional system call. 
In fact, most of these steps can be avoided by using a 
simpler primitive.


4.4. DG/UX Extended Operations (XOPs)

For a number of years, DG/UX has used fast 
kernel traps to implement certain extended operations 
(XOPs) that are not provided directly by the Motorola 
88000 instruction set, such as atomic-increment and 
conditional-store to user memory locations. Though 
these atomic operations could have been implemented 
completely in user space using a lock based on the 
88000's test-and-set instruction, there was a desire to 
avoid any unbounded delay associated with inoppor-
tune preemption (e.g., hardware interrupt, page fault, 
process abort) while holding a user-level spin lock or 
mutex. 

XOPs prevent this inopportune preemption by 
using simple traps across the user-kernel protection 
boundary. A null XOP takes less than one microsec-
ond and scales directly with processor speed. 
Atomicity among multiple CPUs is provided using a 
spin lock inside the kernel. In the 88000 family, there 
is no need to disable CPU interrupts explicitly 
because they are implicitly disabled by the trap. Thus, 
preemption is avoided while executing an XOP and 
holding a spin lock. 

Once the XOP has acquired the spin lock, the 
XOP is free to perform the atomic increment or condi-
tional store on the user memory location. However, 
this memory location could be paged out or invalid. 
Because the XOP holds a critical spin lock and has 
not fully entered the kernel, it must be prepared to 
deal with a fault when touching the user memory 
location. To this end, a flag is set in per-CPU data 
indicating that an XOP is in progress. If a fault occurs 
while touching user memory, the fault prehandler 
ignores the fault and backs out of the XOP. The failure 
is reported back to the user-space library routine, 
which is then responsible for touching the memory 
location. Touching the memory location causes a nor-
mal page fault to bring in the memory page. Then the 
library routine retries the XOP; the XOP almost 
always succeeds on the second try.


4.5. Fast Traps in Other Systems

Fast traps have been used in numerous other 
operating systems. In particular, fast traps have been 
successfully exploited for the purpose of speeding up 
Interprocess Communication (IPC) [Ber90, Lie92, 
Wal92, Lie93]. Like XOPs, these traps are specific to 
the primitive that they implement.


5.  Kernel Function Calls (KFCs)

The powerful combination of fast kernel trap 
and fault interception used in XOPs is at the heart of 
the kernel-based thread implementation. However, 
XOPs do not provide the level of semantic flexibility 
and ease-of-use that thread primitives require. In par-
ticular, XOPs must be implemented in assembly 
language, the XOP back-out mechanism is cumber-
some, and XOPs do not check for preemptions before 
leaving the kernel. For these reasons, XOPs were 
abandoned in favor of a more general and usable fla-
vor of fast kernel trap, known as a Kernel Function 
Call (KFC). 


5.1. KFC Semantics

As the name implies, a KFC is a function that 
resides in the kernel and is callable from user space. 
Calls to KFCs follow the same parameter passing and 
register saving conventions of the underlying proces-
sor architecture. Returns from KFCs follow the same 
conventions as UNIX system calls on the underlying 
processor architecture. KFCs carry more overhead 
than an XOP, but still an order of magnitude less over-
head than a traditional system call. Actual KFC 
overhead is on the order of 20-30 machine instruc-
tions and scales directly with processor speed. 

KFCs operate in a restricted environment 
where CPU interrupts are disabled, time is still 
charged to user space, and blocking with a kernel 
stack is not permitted. Although these restrictions 
could be lifted in order to implement all system calls 
using KFCs, only the shortest system calls would ben-
efit noticeably. Also, complete generality would bloat 
KFC overhead and introduce undesirable complexi-
ties, such as the need to unwind a KFC so that a user 
debugger could manipulate a thread's user registers. 
For these reasons, as well as the desire to avoid other 
likely additions to the KFC entry and exit paths, 
DG/UX limits KFC usage to performance-critical 
primitives (e.g. Pthreads) that can operate in this 
restricted environment. Other potential uses include 
getpid(), gettimeofday(), and RPCs.

Appendix A gives a complete list of thread-
related KFC interfaces. Table 1 illustrates the seman-
tic differences among XOPs, KFCs, and system calls. 
Note that numerous costly steps are avoided during 
KFCs and XOPs. 


Aspect of Kernel Entry/Exit		XOP    KFC    SYSCALL
-------------------------------------------------------------

Enters the kernel using a fast trap
instruction and leaves the kernel 	 X      X        X
using a fast return-from-trap 
instruction

Uses common entry/exit code			X	 X

Performs a full register state save on 		         X
entry and full register restore on exit

Switches to system time accounting 		
on kernel entry, then back to user 			 X	
time accounting on exit

Prepares the environment for full 			 X
kernel entry and enables interrupts

Executes with CPU interrupts 		 X      X
disabled

Can be (and is typically) written in C 		X	 X

Can block the calling thread and 		X        X
switch to another thread

Frees up the kernel stack for reuse 		X
when the calling thread is blocked

Can unblock other threads and check 
for preemption before leaving the 		X	 X
kernel

Handles faults directly without 			 X
explicit detection

Backs out of faults that occur while 
touching pageable user or kernel 	 X	X
memory

Backs out of faults and promotes to 
an internal system call to handle the 		X
fault

Follows system call return 			X	 X
conventions

			Table 1: Semantic Differences


5.2. KFC Details

The following sections describe the general 
KFC mechanism in more detail, with emphasis on 
using KFCs to implement thread primitives. The 
reader may wish to browse or skip these details, then 
return to them after reading the rest of this paper. 


5.2.1. KFC Entry

The steps in KFC entry proceed as follows for 
all KFCs:

1) A thread makes a call to a library routine.

2) The library routine typically performs a few 
   steps, then decides to invoke a KFC. The routine 
   packages some arguments and traps into the ker-
   nel through a common vector.

   How the arguments are packaged and how the 
   trap is performed are architecture-dependent. For 
   example, on RISC machines, the arguments are 
   typically passed into the kernel via registers. 
   
   On CISC machines, the processor may automati-
   cally copy the arguments into the kernel, or it 
   may provide enough registers to hold all of the 
   arguments.
   
   The amount of work that the trap instruction does 
   depends on the underlying processor architecture. 
   RISC traps are typically faster than CISC traps, 
   but CISC traps may perform more steps, such as 
   automatically switching to a kernel stack. On 
   modern RISC and CISC processors with the same 
   clock speed, KFC entry overhead tends to be sim-
   ilar.

3) Once in the kernel, common KFC entry code 
   saves the thread's user-level return address and 
   PSR in per-CPU data. 

   Assuming that all KFCs are written in C, the 
   entry code saves the user-level stack pointer and 
   switches to a kernel-level stack. On CISC archi-
   tectures, these operations may be performed 
   automatically by the trap instruction. 

4) Next, the entry code transfers control to the 
   actual KFC. On RISC architectures, the argu-
   ments to the KFC are typically already in the 
   appropriate registers. On CISC architectures, the 
   arguments may need to be pushed onto the kernel 
   stack before the call. However, argument pushing 
   comes cheaply on modern CISC processors.

5) The actual KFC looks like a traditional system 
   call written in C.


5.2.2. KFC Exit

When the thread has completed its KFC, the 
steps in KFC exit proceed as follows for all KFCs:

1) From within the KFC, the thread temporarily 
   stores its primary and secondary return values to 
   per-CPU data variables. Note that the use of per-
   CPU data is allowed because CPU interrupts are 
   disabled.

2) The thread returns naturally from the KFC. The 
   return status indicates whether the KFC com-
   pleted normally. 

3) The common KFC exit code reads a different per-
   CPU variable to determine if the calling thread 
   should check for priority preemption. This occurs 
   if the thread had awakened higher priority threads 
   during the KFC. In this case, the calling thread 
   yields to another thread and will eventually return 
   to this exit code.

4) The KFC exit code interrogates the return status 
   from the KFC. For a normal return, the saved pri-
   mary and secondary return values are returned to 
   the library routine. The conventions for returning 
   these values are the same as those for system 
   calls. For a failure return, the errno value is 
   extracted from the return status, and the primary 
   and secondary return values are ignored. 

5) The KFC exit code restores the partially saved 
   user state, and executes the appropriate instruc-
   tion(s) for returning to user space.

6) The library routine interrogates the results of the 
   KFC in the same way that a library routine would 
   interrogate the results of a system call.

7) The library routine may perform other steps 
   before returning to its caller.


5.2.3. Thread Reschedule

Some KFCs do more than enter the kernel, per-
form a few steps, and return. In numerous cases, the 
calling thread must suspend its execution. Examples 
of thread suspension include the following:

o The calling thread attempts to join (wait for) a 
  thread that has not yet terminated.

o The calling thread explicitly yields to another 
  thread of equal or better priority.

o The calling thread awaits the release of a mutex 
  or the signaling of a condition variable.

o The calling thread sleeps for a specific amount 
  of time.

When a thread suspends, it unwinds the KFC 
that it is executing so that the thread's register state is 
readily accessible. Then common KFC code saves the 
full register state in a standard mcontext_t struc-
ture that resides inside the kernel. Next, the thread 
calls into the dispatcher, which switches control to 
another thread.

Eventually, the original thread will be awak-
ened and a CPU will continue the thread's execution. 
The thread's continuation function [Dra91] restores 
the thread's register state from its mcontext_t 
structure and resumes the KFC where it left off.


5.2.4. KFC Fault Detection

Like XOPs, KFCs must back out of faults that 
occur while referencing pageable kernel or user mem-
ory. There are two main reasons for backing out:

1) The KFC operates in a restricted kernel environ-
   ment with CPU interrupts disabled. Full fault 
   handling is not permitted in this environment.

2) The KFC typically holds a critical spin lock. If 
   the KFC were to allow the fault to be processed, 
   the spin lock could remain held for an indefinite 
   period of time. This could lead to deadlock or 
   extremely long latencies for other threads in the 
   same process. Fault detection actually eliminates 
   long latencies associated with critical locks that 
   are used to implement threads.

Faulting on Kernel and User Memory

Because mcontext_t structures can con-
sume several hundred bytes, depending on the 
processor architecture, the implementation allows 
these structures to be paged. KFCs must be prepared 
to back out of page faults that occur while saving/re-
storing state to/from the per-thread mcontext_t 
structure located in the kernel.

In addition, a few thread KFCs need to refer-
ence user space from inside the kernel. For example, 
when a thread cannot immediately obtain a Pthread 
mutex in user space (i.e., failing the uncontested 
case), the thread invokes a KFC to await the release of 
the mutex. In this contested case, the KFC saves the 
thread's state and adds the thread to a kernel-level 
synchronization queue associated with the mutex. 
However, due to the potential for certain race condi-
tions, the KFC must re-check the user-level mutex to 
see if the mutex has been released before actually put-
ting the thread to sleep. 

The KFC, while attempting to touch the mutex, 
must be prepared to back out of two types of faults:

1) The user page holding the mutex is paged out.

2) The user mutex address passed into the KFC no 
   longer refers to valid user memory. Although the 
   Pthread library dereferences the mutex address 
   and checks the validity of the corresponding 
   mutex structure, another thread could errone-
   ously unmap the mutex memory before the 
   library invokes the KFC.

Detecting Faults 

Whenever a KFC needs to touch kernel or user 
memory that could cause a fault, the KFC calls a triv-
ial assembly language routine to attempt the memory 
operation. Before performing the operation, this com-
mon routine sets a per-CPU variable to the address of 
back-out code at the bottom of the routine. Under nor-
mal operation, this per-CPU variable holds a value of 
zero, indicating that faults are not being intercepted.

As with XOPs, if a fault occurs while touching 
the memory location, the appropriate fault prehandler 
checks the per-CPU variable. Since the variable has a 
nonzero value, the prehandler knows not to proceed 
normally to the full fault handler. Instead, the prehan-
dler ignores the fault and branches to the back-out 
code at the bottom of the common assembly language 
routine. The back-out code returns to the KFC, indi-
cating that a failure occurred while touching the 
memory location. If no fault had occurred, the assem-
bly language routine would have simply returned 
normally after successfully manipulating the memory 
location.

Unlike XOPs, whenever a KFC access to ker-
nel or user memory fails, the KFC is promoted 
directly to an internal system call without returning to 
user space, as discussed in the next section. 


5.2.5. KFC Promotion and Demotion

In most cases, thread operations are completed 
entirely at the KFC level. However, whenever some-
thing complicated occurs that cannot be handled at the 
KFC level, the KFC must be promoted to an internal 
system call. These rare complications include the fol-
lowing:

1) Faulting on user memory, such as a user-level 
   mutex.

2) Page faulting on kernel memory, such as the per-
   thread mcontext_t during register state save 
   or restore.

3) Failing to allocate memory resources without 
   blocking, for example, during thread creation 
   when no dead thread is available for "reincarna-
   tion."

4) Detecting the presence of a software interrupt 
   sent to the calling thread for such events as a sig-
   nal, Pthread cancellation, abort, or stop.

In each of these cases, the cost of completing 
the operation is high relative to system call overhead. 
Therefore, neither internal system call performance 
nor promotion overhead is critical.

Implementation

The basic idea behind KFC promotion is to 
make the environment appear as if the calling thread 
invoked a system call instead of a KFC. The promo-
tion proceeds as follows:

1) The detecting KFC stores in per-CPU data the 
   address of its associated internal system call and 
   parameters for the call.

   Typically, most of the parameters that were 
   passed to the KFC are also passed to the internal 
   system call. Furthermore, because complications 
   can occur during various phases of the KFC, an 
   indication of the phase is also passed as an extra 
   argument to the internal system call. This indica-
   tor tells the internal system call where to resume 
   the operation, for previous phases have already 
   been completed successfully.
   
2) The KFC returns a special status code, indicating 
   that the KFC is being promoted to a system call. 
   The common KFC exit code recognizes this spe-
   cial return status and proceeds with the 
   promotion instead of returning from the KFC. 
   Essentially, the KFC exit code arranges the envi-
   ronment so that it appears as if the calling thread 
   has just trapped into the kernel to execute a sys-
   tem call. However, instead of having passed in a 
   system call number, the thread has "passed in" 
   the kernel address of the internal system call 
   function. Also, unlike a failed XOP, the promoted 
   KFC does not force the thread to return to user 
   space; the promotion takes place completely 
   within the kernel.

3) The calling thread follows the normal system call 
   entry path, except that the system call entry code 
   understands that this is an internal system call. In 
   particular, the handler knows that the internal 
   system call address has been "passed in" rather 
   than a system call number.

4) The internal system call interrogates its argu-
   ments to determine the next phase that needs to 
   be completed. Because the thread is running in a 
   full-fledged system call, the thread is free to page 
   fault on user memory, handle signals, etc. 

5) Once the thread completes the slow phase, it may 
   decide to complete other phases, suspend itself, 
   or return from the system call. 

If the thread decides to suspend itself (in the same 
way that it would have in the KFC), it demotes 
the system call to the KFC so that it can free up 
its kernel stack for reuse (refer to the section on 
"Transient Data"). When the thread is continued, 
it resumes at the KFC level as if it had never been 
promoted. Note that the KFC may get promoted 
again if it encounters a complication during a 
later phase.

If, on the other hand, the KFC had been promoted 
to handle one of the last phases of the KFC, the 
thread simply returns from the internal system 
call after completing these phases. When the 
thread gets back to user space, the library routine 
recognizes no difference because KFCs follow 
the same return conventions as system calls. 

Note the following about KFC promotion and 
demotion:

o Only those KFCs that can encounter complica-
  tions are promotable.

o Each promotable KFC is associated with an 
  internal system call that mimics some of the 
  phases of the KFC.

o Internal system calls tend to be significantly 
  larger than their corresponding KFCs because 
  the KFCs handle only the common, perfor-
  mance-critical cases. In order to reduce code 
  redundancy, internal system calls share most of 
  the same support code as their corresponding 
  KFCs. With minor effort, it should be possible 
  to condense the KFC and the internal system 
  call into one routine.

o The system call entry code remains nearly the 
  same for both normal system calls and internal 
  system calls. The only differences are the speci-
  fication of the actual call routine and, on some 
  CISC architectures, where the arguments are 
  located. The system call exit code is exactly the 
  same for both types of system call.

o Given that promotion is rare, the only good rea-
  son for demotion is kernel stack reuse.


5.3. Performance Measurements

The following tables demonstrate that thread 
operations can be implemented efficiently using 
KFCs. All measurements were taken on a dedicated 
Motorola 50MHz MC88110 uniprocessor [Mot91] 
using Draft 6 of Pthreads as shipped in the DG/UX 
5.4R3.00 operating system. Each reported measure-
ment reflects the average elapsed time per iteration of 
a loop that repeatedly invokes the associated primi-
tive(s). 

Kernel Entry Overhead

Table 2 illustrates the costs of XOPs, KFCs, 
and system calls. All calls include the overhead of a 
user-level "wrapper" function in a shared library. 

Note that the null system call time is accurate 
for the DG/UX kernel even though the time exceeds 
the number reported by research projects such as 
[And91a]. This discrepancy is explained by the addi-
tional steps that DG/UX takes in order to support 
SMP, precise time accounting, multiple APIs, and 
greater internal code modularity.


Kernel Entry Type 		 Time 
				(usec)

Null XOP (basic trap overhead)    0.5

Null KFC (written in C) 	  1.4

Null System Call 		 19.5

	Table 2: Kernel Entry Overhead


Local Operation Overhead

Table 3 gives times for local (intraprocess) 
thread operations. As in library-based systems, local 
threads are the fastest because they share the same 
CPU time accounting (and timeslice) and get sched-
uled onto process-local queues. During local context 
switches, the implementation can thus avoid time 
bookkeeping and global scheduling decisions. 

In all cases, local thread operations outperform 
the null system call. The breakdown for thread cre-
ate/exit was determined using a hardware logic 
analyzer


Local Operation 			Time 
			       	       (usec)

Null thread yield (no context switch)    1.8

Thread Yield (with context switch) 	 4.0

Thread create/exit (with context switch) 8.1
Breakdown:
o create (with KFC entry/exit)		 4.0
o exit (with KFC entry)			 1.6
o context switch (with KFC exit)	 2.1
o invoke new thread's start routine	 0.4

Contested mutex lock/unlock (with 	12.3
context switch)

Condition wait/signal (with context 	13.2
switch)

Thread create/exit/join (with both 	14.1
context switches)

context switch and one-second time-out  17.7
overhead)

	Table 3: Local Thread Operations


Task Creation Overhead 

Table 4 gives the time required to create and 
wait for a thread or process to exit. Times are given 
for both locally and globally scheduled Pthreads. 
Unlike local threads, global threads have dedicated 
CPU time accounting and get scheduled onto system-
wide queues. The local time has been copied from 
Table 3 for comparison. Total time includes the con-
text switches to and from the new thread or process.


Create/Exit/Wait 	        Time
(with both context switches)   (usec)

Locally scheduled thread 	14.1

Globally scheduled thread 	57.1

Process (using fork) 	      7738.9

		Table 4: Task Creation Overhead


Context Switch Overhead

Table 5 gives times for various types of context 
switches involving Pthread condition variables. The 
overhead for obtaining and releasing the associated 
mutex is included in the condition-variable times. In 
Table 3, a time was given for locally scheduled 
threads using local (intraprocess) condition variables. 
Table 5 repeats the same value and includes those for 
other combinations of locally vs. globally scheduled 
threads and local vs. global condition variables. Table 
5 also includes a time for two processes using a global 
condition variable in shared memory. Note that two 
processes cannot communicate via local condition 
variables because local condition variables are, by 
definition, only available within a single process. 
Lastly, times are included for traditional UNIX (Sys-
tem V IPC) semaphores in order to illustrate the 
advantage of using condition variables in multitasking 
applications.


Sleep/Wakeup Pair 			Time
(with context switch)	       	       (usec)

Local threads, local condition  	13.2

Local threads, global condition 	13.7

Global threads, local condition 	33.6

Global threads, global condition 	38.4

Two processes, global condition 	38.9

Local threads, UNIX semaphore 		99.7

Global threads, UNIX semaphore         107.9

Two processes, UNIX semaphore 	       116.1

	Table 5: Context Switch Overhead


Comparison with Library Implementations

Table 6 compares the local-operation perfor-
mance of DG/UX threads with two library 
implementations that were presented at previous 
USENIX conferences. The first is a pure library 
implementation from Florida State University 
[Mue93], which reported best numbers for a 40MHz 
SPARC IPX. The second is the SunOS multi-
plexed implementation running on a 25MHz SPARC 
1+ [Pow91]. Given the disparity in processor types 
and the fact that thread implementations undergo con-
tinual tuning, Table 6 does not attempt to declare one 
implementation superior to the others. Nonetheless, if 
one scales the measurements in Table 6 by processor 
speed, one sees that kernel-based implementations 
can perform local Pthread operations in roughly the 
same time as library-based implementations.


Local Operation 	DG/UX	FSU     SunOS
		        50MHz   40MHz   25MHz
		        Moto    SPARC   SPARC
			88110    IPX      1+
			(usec)  (usec)  (usec)

Thread Create		  4.0    12.0    56.0
(no context switch)

Mutex Lock/Unlock	 12.3    51.0    not
(with context switch)			given

Condition wait/signal	 13.2    55.0   158.0
(with context switch)

   Table 6: Comparative Measurements (not scaled)


Summary

KFCs make both local and global thread opera-
tions efficient. In particular, the performance of local 
and global operations roughly matches the perfor-
mance of local operations cited by previous library 
implementations. 


6.  Minimizing Memory Consumption

In order to meet the design goal of minimizing 
per-thread memory consumption, the implementation 
pages and decouples data structures wherever possi-
ble. Since DG/UX already pages kernel data, the 
paging of thread structures did not require special 
work in the virtual memory subsystem.


6.1. Kernel-Level Thread Structure

The only per-thread data that cannot be paged 
is the 128-byte main control block structure that 
resides in the kernel address space. An application 
that uses as many as 1,024 threads consumes only 
128KB. Given current trends in memory capacities, 
this amount of overhead is insignificant. Even on a 
16MB workstation, swap space for 1,024 threads (and 
their associated user stacks) is a more limited resource 
than 128KB of physical memory. On a medium-sized 
server (the primary design target), the impact of this 
128KB is even more negligible.


6.2. User-Level Thread Structure

Every thread has a small user-level structure 
that resides at the base of the thread's stack. This 
structure and its enclosing thread stack are entirely 
pageable. The following information is stored in this 
structure:

o a cached copy of the thread's ID 

o the per-thread errno variable

o a pointer to the thread's most recently pushed 
  Pthread cleanup handler

o the values for the Pthread thread-specific data 
  variables


6.3. Kernel-Level mcontext_t 

Every thread has a kernel-level mcontext_t 
structure that is used to store the thread's register state 
when it is suspended in a KFC. This standard UNIX 
structure can consume several hundred bytes and is 
entirely pageable. As discussed earlier, KFCs must be 
prepared to back out of page faults on this structure 
while saving or restoring register state.


6.4. Transient Data

When a thread enters the kernel for a tradi-
tional system call, hardware interrupt, page fault, or 
other type of exception, the thread must run on a ker-
nel stack. The kernel stack, along with other variables 
that are needed for one trip inside the kernel, make up 
transient data. The size of transient data is approxi-
mately 8KB; thus, it is too expensive to provide each 
thread with a dedicated transient data area. In addi-
tion, transient data must be made temporarily non-
pageable while a thread is running on a CPU.

In order to reduce the amount of pageable and 
non-pageable transient data, the implementation 
decouples transient data from threads that are no 
longer using it. As the word "transient" implies, tran-
sient data are needed for only one trip inside the 
kernel and can migrate from thread to thread. The fol-
lowing rules summarize transient data allocation 
requirements:

o When a thread is actually running on a CPU, 
  the thread must have non-pageable transient 
  data because the thread must be able to enter 
  fully into the kernel for any reason (e.g. to pro-
  cess an exception).

o Once a thread enters fully into the kernel, its 
  transient data remain assigned to the thread 
  until the thread leaves the kernel or exits. How-
  ever, once the thread is descheduled (e.g. while 
  sleeping for a long time in a system call or 
  when the system is loaded), the kernel may 
  decide to page out the thread's transient data.

o In contrast, when a thread suspends in a KFC, it 
  relinquishes its transient data for reuse by other 
  threads in the process. This is possible because 
  the thread has not entered fully into the kernel, 
  has saved its state in its kernel-level 
  mcontext_t structure, and no longer needs 
  its kernel stack. This rule applies also to global 
  (interprocess) operations that are implemented 
  using KFCs.

o When a suspending thread demotes an internal 
  system call to a KFC, the suspending thread 
  relinquishes its transient data. Similarly, when a 
  thread is preempted directly out of user space as 
  a result of taking a timeslice or other hardware 
  interrupt, the suspending thread demotes to the 
  KFC level and frees up its transient data.

o The kernel regulates the number of transient 
  data areas in the process based on the current 
  number of threads that are making system calls, 
  page faults, etc. The kernel regulates the num-
  ber of non-pageable transient data areas based 
  on system-wide load and process priorities.

o In a typical application, it is common to have 
  many threads sharing a small number of tran-
  sient data areas.


7.  Kernel vs. User Threads

Given that threads can be implemented effi-
ciently in the kernel, what are the advantages and 
disadvantages of kernel-based threads vs. library-
based threads?


7.1. Design Simplicity

Kernel-based implementations are inherently 
simpler than multiplexed implementations because 
they employ only one level of thread scheduling; they 
need not treat both user-level threads and kernel-level 
entities. This eliminates the communication gap 
between library and kernel databases that is common 
in multiplexed implementations. The kernel sees all 
threads in the system and can simply schedule runna-
ble threads directly onto available CPUs with 
minimum latency. Furthermore, because critical data 
structures (including sleep queues) reside in the ker-
nel address space, an errant application cannot corrupt 
them.

A kernel-based implementation also reduces 
the amount of code redundancy found in multiplexed 
implementations. For example, there is no need to 
have two sets of user synchronization primitives, one 
for user-level threads and one for kernel-level entities. 
The same holds true for thread creation, signal han-
dling, thread timeslicing, user debugging support, etc.

On the other hand, many Pthread semantics 
creep into the kernel. If the thread library does not 
provide support for a particular thread package or 
function, the application writer may need to wait until 
the kernel provides the desired functionality. Natu-
rally, many thread packages can be emulated 
efficiently on top of existing Pthread or KFC seman-
tics. Nonetheless, as a kernel-based implementation 
accumulates more semantics, its monolithic design 
increases in complexity.


7.2. Performance Focus

Library-based systems focus their performance 
effort in the user-space thread library. Thus, highly-
tuned thread primitives can be exploited only for local 
thread operations. 

Kernel-based systems focus their performance 
effort in the kernel address space. Highly-tuned KFCs 
can be used for both local and global thread opera-
tions. For example, one KFC handles thread 
suspensions for both local and global condition vari-
ables. As a result, when the same type of thread is 
involved, the overhead for a global condition variable 
is only slightly greater than that for a local condition 
variable.


7.3. Avoiding Inopportune Preemption

Library-based implementations protect thread 
data structures with critical user-level locks. If a pre-
emption or page fault occurs while a critical lock is 
held, other thread operations could be delayed for tens 
of milliseconds (or more) until the critical lock is 
finally released. Such latencies hurt 	performance, 
scale badly with increased numbers of CPUs, and can 	
cause real-time applications to miss deadlines.

While complex schemes exist to work around 
this problem, a kernel-based implementation provides 
a natural solution. During KFCs, interrupts are dis-
abled, so no timeslice or other preemption can occur. 
Furthermore, if a page fault occurs while touching 
user space (e.g., retrying a mutex before going to 
sleep), the page fault is intercepted, any critical kernel 
spin lock is released, and the KFC is promoted to an 
internal system call to handle the page fault as if it 	had 
occurred from user space. 


7.4. Semantic Flexibility

Generally speaking, library-based implementa-
tions provide more local semantic flexibility, while 
kernel-based implementations provide more global 
semantic flexibility.

Unlike the most advanced multiplexed imple-
mentations [And91b, Mar91], kernel-based 
implementations do not allow applications to directly 
manipulate the internal scheduling mechanism or 
integrate their own scheduling policies. Though 
POSIX provides a rich set of basic mechanisms and 
policies on which more complex constructs can be 
built, a flexible library implementation will allow the 
same constructs to be expressed, while maintaining 
excellent integration with the rest of the library.

On the other hand, a kernel-based implementa-
tion can efficiently accommodate additional types of 
global semantics. For example, DG/UX allows appli-
cations to establish affinity relationships between 
groups of threads and sets of CPUs sharing the same 
cache or local memory [Alf94]. The implementation 
maintains local-operation performance within Pthread 
groups, while minimizing penalties for intergroup 
operations. Library-based implementations typically 
inflict greater penalty for CPU affinity because they 
require that user threads be "bound" to slower kernel-
level entities.


7.5. User Tools

In a kernel-based implementation, the kernel 
sees all threads in the system. This simplifies and 
enhances the development of user tools that support 
threads. For example, user debuggers need only con-
verse with the kernel, and the kernel always supplies a 
consistent set of information. In library-based imple-
mentations, the debugger must talk to both the kernel 
and the user-level library, and the library could be in 
an inconsistent state.

DG/UX provides a ps command option that 
displays the detailed status of all threads in the sys-
tem. This status information includes, for example, 
the user address of the mutex or condition variable on 
which a thread is blocked. Most library-based imple-
mentations show information only for the underlying 
kernel entities, which have little correlation with 
library-level threads.


7.6. The Best of Both Approaches

Library-based systems could implement their 
kernel-level entities more efficiently using KFCs. 
Such a hybrid would improve the performance of glo-
bal operations, while retaining flexibility in user 
space.

Conversely, kernel-based systems could pro-
vide additional mechanisms found in flexible 
multiplexed implementations. For example, a kernel-
based implementation could provide Pthread exten-
sions that allow a parallel language runtime library to 
better regulate the number of active worker threads 
based on the current number of assigned processors.


8. Conclusions

This work demonstrates how challenging basic 
assumptions (e.g., that threads cannot be implemented 
efficiently using system calls) can favorably redirect 
the course of a design. This work also illustrates the 
trade-offs involved in implementing operating system 
primitives in kernel space vs. user space.

For example, when KFCs are combined with 
the conservative use of memory, the result is a simple 
and efficient kernel-based implementation of POSIX 
Threads. While such an implementation forfeits some 
semantic flexibility in user space, it optimizes both 
local and global operations, and can easily accommo-
date additional global semantics.

Finally, any operating system could use the 
same general KFC mechanism to implement its own 
thread package or other types of fast kernel primi-
tives, including the following global operations:

o Local or remote procedure call

o Global asynchronous event notification in the 
  form of a newly created thread

o Interprocess thread creation or migration

o Quick gettimeofday(), getpid(), etc.


9. Acknowledgments

Eric Hamilton and Michael Kelley developed 
XOPs. Steve Daniel suggested using fast kernel traps 
to implement thread primitives. Jeff Kimmel provided 
a number of other useful design suggestions. Many 
thanks to the following persons who contributed to 
the DG/UX implementation of POSIX Threads: Rich-
ard Barnette, Jeff Beneker, Gina Couch, Ming 
Hwang, David Lakey, Etta LeBlanc, Ed McAdams, 
Earle MacHardy, Bill McGrath, Mark O'Connell, 
Cyril Sagan, Lee Sanders, Ed Savage, Guy Simpson, 
Eric Vook. Special thanks to Tom Ash, Dean Hering-
ton, and Andy Huber for their careful review of this 
paper.


10. References

[Alf94]	 R.A. Alfieri, "Pthread Groups and Hierar-
         chical CPU Affinity," Data General Corpo-
         ration, in progress.

[And91a] T.E. Anderson, H.M. Levy, B.N. Bershad, 
         and E.D. Lazowska, "The Interaction of 
         Architecture and Operating System 
         Design," Proceedings of the Fourth Inter-
         national Conference on Architectural Sup-
         port for Programming Languages and 
         Operating Systems, pp. 108-120, April 
         1991.

[And91b] T.E. Anderson, B.N. Bershad, E.D. 
         Lazowska, and H.M. Levy, "Scheduler 
         Activations: Effective Kernel Support for 
         the 	 User-Level Management of Parallel-
         ism," Proceedings of the Thirteenth ACM 
         Symposium on Operating Systems Princi-
         ples, pp. 95-109, October 1991.

[Ber90]	 B.N. Bershad, T.E. Anderson, E.D. 
         Lazowska, H.M. Levy, "Lightweight 
         Remote Procedure Call," ACM Transac-
         tions on Computer Systems, volume 8, 
         number 1, pp. 37-55, February 1990.

[Dra91]	 R.P. Draves, B.N. Bershad, R.F. Rashid, 
         R.W. Dean, "Using Continuations to 
         Implement Thread Management and Com-
         munication in Operating Systems," Pro-
         ceedings of the Thirteenth ACM 
         Symposium on Operating Systems Princi-
         ples, pp. 122-136, October 1991.

[Gol90]  D. Golub, R. Dean, A. Florin, R. Rashid, 
         "UNIX as an Application Program," Pro-
         ceedings of the Summer 1990 USENIX 
         Conference, pp. 87-95, June 1990.	 	

[Kel89]  M.H. Kelley, "Multiprocessor Aspects of 
         the DG/UX Kernel," Proceedings of the 
         Winter 1989 USENIX Conference, pp. 85-
         99, January 1989.

[Lie92]	 J. Liedtke, "Fast Thread Management and 
         Communication Without Continuations," 
         Proceedings of Microkernel and Other 
         Kernel Architectures, pp. 213-221, April 
         1992.

[Lie93]	 J. Liedtke, "Improving IPC by Kernel 
         Design," Proceedings of the Fourteenth 
         ACM Symposium on Operating Systems 
         Principles, pp. 175-188, December 1993.

[Mar91]  B.D. Marsh, T.J. LeBlanc, M.L. Scott, and 
         E.P. Markatos, "First-Class User-Level 
         Threads," Proceedings of the Thirteenth 
         ACM Symposium on Operating Systems 
         Principles, pp. 110-121, October 1991. 

[Mot91]	 MC88110: Second Generation RISC 
         Microprocessor User's Manual, Motorola 
         Inc., 1991.		

[Mue93]	 F. Mueller, "A Library Implementation of 
         POSIX Threads under UNIX," Proceed-
         ings of the 1993 Winter USENIX Confer-
         ence, pp. 29-41, January 1993.

[Ous90]	 J.K. Ousterhout, "Why Aren't Operating 
         Systems Getting Faster as Fast as Hard-
         ware?" Proceedings of the Summer 1990 
         USENIX Conference, pp. 247-256, June 
         1990.

[Pos93]  POSIX P1003.4a/D8, Threads Extension 
         for Portable Operating Systems, IEEE 
         Computer Society, October 1993.

[Pow91]	 M.L. Powell, S.R. Kleiman, S. Barton, D. 
         Shah, D. Stein, and M. Weeks, "SunOS 
         Multi-thread Architecture," Proceedings of 	
         the 1991 Winter USENIX Conference, pp. 
         65-80, January 1991.

[Ste92]  D. Stein, D. Shah, "Implementing Light-
         weight Threads," Proceedings of the 1992 
         Summer USENIX Conference, pp. 1-9, 
         June 1992.

[Wal92]	 J. Walpole, J. Inouye, R. Konuru, "Modu-
         larity and Interfaces in Micro-kernel 
         Design and Implementation: A Case Study 
         of Chorus on the HP PA-Risc," Proceed-
         ings of the 1992 Workshop on Micro-kernel 
         and Other Kernel Architectures, pp. 71-82, 
         April 1992.


Robert A. Alfieri works for Data General's 
UNIX kernel group in Research Triangle Park, NC, 
where he has developed threads, hierarchical affinity 
scheduling, virtual memory, and other kernel func-
tionality. He received a B.S. degree in Computer 
Science from Northwestern University in 1985. His 
technical interests include operating systems, molecu-
lar nanotechnology, and Italian cooking. He can be 
reached at Data General Corporation, 62 T.W. Alex-
ander Drive, Research Triangle Park, NC 27709, or 
electronically at alfieri@dg-rtp.dg.com.


Appendix A: Thread KFC Interfaces

This section lists all KFC interfaces for 
Pthreads and Pthread Groups. Like other UNIX sys-
tems [Pow91], DG/UX refers to kernel-level threads 
as LWPs (Light Weight Processes). However, DG/UX 
LWP interfaces differ and, as discussed earlier, 
DG/UX dedicates a very efficient LWP to each thread. 

In all cases, a trivial "wrapper" function in the 
Pthread library invokes the appropriate KFC. Upon 
completion, the KFC returns values and status in the 
same way as a traditional system call. Because 
DG/UX ships the Pthread library only in shared form 
and does not publish the KFC interfaces, these inter-
faces can change from revision to revision without 
requiring that users relink their applications.

The following KFCs create, terminate, join, 
detach, and manipulate the scheduling of threads. In 
the case of __lwp_create(), the new thread 
begins execution at a pre-registered library function 
that invokes the thread's start routine on the thread's 
user stack.

int __lwp_create(
   lwp_sched_info_t   sched_info,
   lwp_stack_info_t   stack_info,
   void             (*start_rtn)(),
   void *	      start_arg,
   lwp_group_id_t     group_id);

int __lwp_join(
   lwpid_t            lwpid);

int __lwp_exit(
   void *             exit_status);

int __lwp_detach(
   lwpid_t            lwpid);

int __lwp_set_sched(
   lwpid_t            lwpid,
   lwp_sched_info_t   sched_info);

int __lwp_get_sched(
   lwpid_t            lwpid);

int __lwp_yield(
   int                to_peers,
   int                is_global);

int __lwp_sleep(
   unsigned           seconds,
   unsigned           nseconds);


The following KFCs manipulate synchroniza-
tion queues that are used to implement the contested 
cases for mutexes and condition variables. The 
Pthread library does not invoke KFCs in the uncon-
tested cases.

int __lwp_sq_alloc(
   lwp_sq_info_t     creation_info);

int __lwp_sq_dealloc(
   lwp_sqid_t	     sq_id);

int __lwp_sq_sleep(
   lwp_cond_t *      cond_ptr,
   lwp_mutex_t *     mutex_ptr,
   struct timespec * abstime_ptr,
   lwp_sqid_t        sq_id);

int __lwp_sq_wakeup(
   lwp_sqid_t        sq_id,
   int               is_broadcast);

The following KFCs create, destroy, and 
manipulate DG/UX Pthread Groups, which are dis-
cussed separately [Alf94].

int __lwp_group_create(
   lwp_sched_info_t  sched_info);

int __lwp_group_destroy(
   lwp_group_id_t    lwp_group_id);

int __lwp_group_set_sched(
   lwp_group_id_t    lwp_group_id,
   lwp_sched_info_t  sched_info);

int __lwp_group_get_sched(
   lwp_group_id_t    lwp_group_id);

int __lwp_group_get_times(
   lwp_group_id_t    lwp_group_id,
   struct timespec * user_time_ptr,
   struct timespec * sys_time_ptr);


UNIX is a registered trademark of Unix Systems 
Laboratories, Inc.
DG/UX is a trademark of Data General Corporation. 	
SunOS is a trademark of Sun Microsystems, Inc.
SPARC is a trademark of SPARC International, Inc.