################################################
	   #                                              #
	   # ##   ## ###### ####### ##    ## ## ##     ## #
	   # ##   ## ##  ## ##      ###   ## ##  ##   ##  #
	   # ##   ## ##     ##      ####  ## ##   ## ##   #
	   # ##   ## ###### ######  ## ## ## ##    ###    #
	   # ##   ##     ## ##      ##  #### ##   ## ##   #
	   # ##   ## ##  ## ##      ##   ### ##  ##   ##  #
	   # ####### ###### ####### ##    ## ## ##     ## #
	   #                                              #
	   ################################################


	 The following paper was originally published in the
	      Proceedings of the USENIX 2nd Symposium on
	Operating Systems Design and Implementation (OSDI '96)
	       Seattle, Washington, October 28-31, 1996


	For more information about USENIX Association contact:

		   1. Phone:	510 528-8649
		   2. FAX:	510 548-5738
		   3. Email:	office@usenix.org
		   4. WWW URL:  https://www.usenix.org


                CPU Inheritance Scheduling


                Bryan Ford       Sai Susarla


              Department of Computer Science
                    University of Utah
                 Salt Lake City, UT 84112
                     flux@cs.utah.edu
          https://www.cs.utah.edu/projects/flux/


                          Abstract

   Traditional processor scheduling mechanisms in
operating systems are fairly rigid, often supporting
only one fixed scheduling policy, or, at most, a few
``scheduling classes'' whose implementations are closely
tied together in the OS kernel.  This paper presents CPU
inheritance scheduling, a novel processor scheduling
framework in which arbitrary threads can act as
schedulers for other threads.  Widely different
scheduling policies can be implemented under the
framework, and many different policies can coexist in a
single system, providing much greater scheduling
flexibility.  Modular, hierarchical control can be
provided over the processor utilization of arbitrary
administrative domains, such as processes, jobs, users,
and groups, and the CPU resources consumed can be
accounted for and attributed accurately.  Applications,
as well as the OS, can implement customized local
scheduling policies; the framework ensures that all the
different policies work together logically and
predictably.  As a side effect, the framework also
cleanly addresses priority inversion by providing a
generalized form of priority inheritance that
automatically works within and among diverse scheduling
policies.  CPU inheritance scheduling extends naturally
to multiprocessors, and supports processor management
techniques such as processor affinity[29] and scheduler
activations[3].  We show that this flexibility can be
provided with acceptable overhead in typical
environments, depending on factors such as context
switch speed and frequency.


1  Introduction

Traditional operating systems control the sharing of the
machine's CPU resources among threads using a fixed scheduling
scheme, typically based on priorities.  Sometimes a few variants
on the basic policy are provided, such as support for
fixed-priority threads[17], or several ``scheduling classes'' to
which threads with different purposes can be assigned (e.g.,
real-time, interactive, or background)[25].  However, even these
variants are generally hard-coded into the system implementation
and cannot easily be adapted to the needs of individual
applications.

In this paper we develop a novel processor scheduling framework
based on a generalized notion of priority inheritance.  In this
framework, known as CPU inheritance scheduling, arbitrary threads
can act as schedulers for other threads by temporarily donating
their CPU time to selected threads while waiting on events of
interest such as clock/timer interrupts.  The receiving threads
can further donate their CPU time to other threads, and so on,
forming a logical hierarchy of schedulers as illustrated in
Figure 1.  A scheduler thread can be notified when the thread to
which it donated its CPU time no longer needs it (e.g., because
the target thread has blocked), so that the CPU can be reassigned
to another target.  The basic thread dispatching mechanism
necessary to implement this framework does not have any notion of
thread priority, CPU usage, or clocks and timers; all of these
functions, when needed, are implemented by threads acting as
schedulers.

Under this framework, arbitrary scheduling policies can be
implemented by ordinary threads cooperating with each other
through well-defined interfaces that may cross protection
boundaries.  For example, a fixed-priority multiprocessor
scheduling policy can be implemented by maintaining, among a
group of scheduler threads (one for each available CPU), a
prioritized queue of ``client'' threads to be scheduled; each
scheduler thread successively picks a thread to run and donates
its CPU time to the selected target thread while waiting for an
interesting event such as quantum expiration (e.g., a clock
____________________________
    This research was supported in part by the Defense Advanced Research
Projects Agency, monitored by the Department of the Army, under contract
number DABT63-94-C-0058. The opinions and conclusions contained in this
document are those of the authors and should not be interpreted as
representing official views or policies of the U.S. Government.


Figure 1:  Example scheduling hierarchy.  The dark circles
represent threads acting as schedulers, while the light circles
represent ``ordinary'' threads.


interrupt).  See Figure 2 for an illustration.  If the selected
thread blocks, its scheduler thread is notified and the CPU is
reassigned.  On the other hand, if a different event causes the
scheduler thread to wake up, the running thread is preempted and
the CPU is given back to the scheduler immediately.  Other
scheduling policies, such as timesharing[23],
fixed-priority[12, 17], rate monotonic[22], fair
share[9, 16, 19], and lottery/stride scheduling[30, 31, 32], can
be implemented in the same way.

This scheduling framework has the following features:


   o It supports multiple arbitrary scheduling policies on the
     same or different processors.

   o Since scheduler threads may run either in the OS kernel or


            Figure 2:  Example fixed-priority scheduler


     in user mode, applications can easily extend or replace the
     scheduling policies built into the OS.

   o It provides hierarchical control over the processor resource
     usage of different logical or administrative domains in a
     system, such as users, groups, individual processes, and
     threads within a process.

   o CPU usage accounting can be provided to various degrees of
     accuracy depending on the resources one is willing to
     invest.

   o Priority inversion is addressed naturally in the presence of
     resource contention, without the need for explicit priority
     inheritance/ceiling protocols.

   o CPU use is attributed properly even in the presence of
     priority inheritance.

   o The framework naturally extends to multiprocessors.

   o Processor affinity scheduling is supported.

   o Scheduler activations can be implemented easily.


Based on a prototype implementation in a user-level threads
package, we demonstrate that this framework behaves identically
to traditional multi-class schedulers in equivalent

configurations, and that it provides modular resource control,
load insulation, and priority inversion protection automatically
across multiple scheduling policies.  The additional scheduling
overhead caused by this framework is negligible in the test
environment, and we show that it should also be acceptable in
many kernel environments as well despite greater context switch
costs.

The rest of this paper is organized as follows.  Section 2
describes motivation and related work, Section 3 presents the CPU
inheritance scheduling algorithm in detail, and Section 4 shows
how traditional scheduling algorithms can be implemented on this
framework.  Section 5 demonstrates the behavior and performance
properties of our framework through experimental results, and
Section 6 concludes with a short reflective summary.


2  Motivation

Traditional operating systems divide a machine's CPU resources
among threads using a fixed scheduling scheme, typically based on
priorities.  However, the requirements imposed on an operating
system's scheduler often vary from application to application.
For example, for interactive applications, response time is
usually the most critical factor---i.e., how quickly the program
responds to the user's commands.  For batch jobs, throughput is
of primary importance but latency is a minor issue.  Hard
real-time applications require meeting application-specific
deadlines, while for soft real-time applications, missing a
deadline is unfortunate but not catastrophic.  There is no single
scheduling scheme that works well for all applications.

Over the years, the importance of providing a variety of
scheduling policies on a single machine has waxed and waned,
following hardware and application trends.  In the early years of
computing, use of the entire machine was limited to a single user
thread; this evolved into multiprogrammed machines in which a
single scheduling policy managed batch jobs effectively.  The
advent of timesharing on machines still used for batch jobs
caused a need for two scheduling policies.  As timesharing
gradually gave way to single-user workstations and PCs, a single
scheduling policy was again adequate for a while.

Supporting multiple scheduling policies is again becoming
important, not because of multiple users on a system, but because
of the variety of concurrent uses to which modern systems are
put.  Multimedia drives the need for soft real-time scheduling
policies on general purpose workstations.  Untrusted executable
content (e.g., Java applets) requires secure control of resource

usage while also providing soft real-time guarantees.  The hard
real-time domain is also making inroads onto general purpose
machines as processors and instruments supporting embedded
applications become networked, and some customers (e.g., the
military) need the ability to shift processing power dynamically
to the problem of the moment.  All of these additional policies
must still work alongside traditional interactive and batch
scheduling:  though multimedia and video conferencing may be the
current rage, this does not mean that users no longer care about
getting good interactive response from the word processors,
spreadsheets, and other applications that they still rely on
daily.  Therefore, as the diversity of applications increases,
operating systems need to support multiple coexisting processor
scheduling policies, in order to meet individual applications'
needs as well as to utilize the system's processor resources more
efficiently.


2.1   Related Work


One simple approach to providing real-time support in systems
with traditional timesharing schedulers, which has been adopted
by many commonly-used systems such as Unix, Mach[1, 4], and
Windows NT[25], and has even become part of the posix
standard[17], is support for fixed-priority threads.  Although
these systems generally still use conventional priority-based
timesharing schedulers, they allow real-time applications to
disable the normal dynamic priority adjustments on threads
specifically designated as ``real-time threads,'' so that those
threads always run at a programmer-defined priority.  By
carefully assigning priorities to the real-time threads in the
system and ensuring that all non-real-time threads execute at
lower priority, it is possible to obtain the correct behavior for
some applications.  However, this approach has serious,
well-known limitations; in many cases, entirely different
non-priority-based scheduling policies are needed[18].

Even in normal interactive and batch-mode computing, traditional
priority-based scheduling algorithms are showing their age.  For
example, these algorithms do not provide a clean way to
encapsulate a set of processes/threads as a single unit to
isolate and control their processor usage relative to the rest of
the system.  This shortcoming opens the system to various
denial-of-service attacks, the most well-known being the creation
of a large number of threads which overwhelm processor resources
and crowd out other activity.  These vulnerabilities generally
don't cause serious problems for machines only used by one

person, or when the users of the system fall into one
administrative domain and can ``complain to the boss'' if someone
is abusing the system.  However, as distributed and mobile
computing becomes more prevalent and administrative boundaries
become increasingly blurred, this form of system security is
becoming more important.  This is especially true when completely
unknown, untrusted code is downloaded and run in a ``secure''
environment such as that provided by Java[13] or OmniWare[2].
Schedulers have been designed that address this problem by
providing flexible hierarchical control over CPU usage at
different administrative boundaries[5, 14, 15, 30, 31, 32].
However, it is not yet clear how these algorithms will address
other needs, such as those of hard real-time applications;
certainly it seems unlikely that a single ``holy grail'' of
scheduling policies will be found.

With the growing diversity of application needs and scheduling
policies, it is increasingly desirable for an operating system to
support multiple independent policies.  On multiprocessor
systems, one simple but limited way of doing this is to run a
different scheduling policy on each processor.  A more general
approach is to allow multiple ``scheduling classes'' to run on a
single processor, with a specific scheduling policy associated
with each class.  The classes have a strictly ordered priority
relationship to each other, so the highest-priority class gets
all the CPU time it wants, the next class gets any CPU time left
unused by the first class, etc.  Although this approach shows
promise, one drawback is that since the schedulers for the
different classes generally don't communicate or cooperate with
each other closely, only the highest-priority scheduling class on
a given processor can make any assumptions about how much CPU
time it will have to dispense to the threads under its control.

Additionally, most existing multi-policy scheduling mechanisms
still require every scheduling policy to be implemented in the
kernel and to be closely tied with other kernel mechanisms such
as threads, context switching, clocks, and timers.  The only
existing system we know of that allows different scheduling
policies to be implemented in separate, unprivileged protection
domains is the Aegis Exokernel[8].  However, the Aegis scheduling
mechanism was not described at length and does not address
important issues such as timing, CPU usage accounting, and
multiprocessor scheduling.  Both Aegis's scheduling mechanism and
our framework are based on the use of a ``directed yield''
primitive which allows application-level threads to schedule each
other; this core concept originally comes from coroutines[6], in
which directed yield is the only way thread switching is done.

We believe our scheduling framework could be implemented in an
Exokernel environment through the use of suitable kernel
primitives, application-level support code, and standardized
inter-process scheduling protocols.

Finally, most existing systems still suffer from various priority
inversion problems.  Priority inversion occurs when a
high-priority thread requesting a service has to wait arbitrarily
long for a low-priority thread to finish being serviced.  This
problem can be addressed in priority-based scheduling algorithms
by supporting priority inheritance[7, 26], wherein the thread
holding up the service inherits the priority of the highest
priority thread waiting for service.  In some cases this approach
can be adapted to other scheduling policies, such as with ticket
transfer in lottery scheduling[31].  However, the problem of
resolving priority inversion between threads of different
scheduling classes using policies with different and incomparable
notions of ``priority'' has not been addressed so far.


3  CPU Inheritance Scheduling

In our scheduling model, as in traditional systems, a thread is a
virtual CPU whose purpose is to execute instructions.  A thread
may or may not have a real CPU assigned to it at any given
instant; a running thread may be preempted and its CPU reassigned
to another thread at any time.  (For the purposes of this
framework, it is not important whether these threads are
kernel-managed or user-managed threads, or whether they run in
supervisor or user mode.)

In traditional systems, threads are generally scheduled by some
lower-level entity, such as a scheduler in the OS kernel or a
user-level threads package.  In contrast, the basic idea of CPU
inheritance scheduling is that threads are scheduled by other
threads.  Any thread that has a real CPU available to it at a
given instant can donate its CPU temporarily to another specific
thread instead of using the CPU itself.  This operation is
similar to priority inheritance in conventional systems, except
that it is done explicitly by the donating thread, and no notion
of ``priority'' is involved, only a direct transfer of the CPU
from one thread to another; hence the name ``CPU inheritance.''

A scheduler thread is a thread that spends most of its time
donating its CPU resources to client threads; the client threads
thus inherit some portion of the scheduler thread's CPU
resources, and treat that portion as their virtual CPU. Client
threads can in turn act as scheduler threads, distributing their
CPU time among their own clients, and so on, forming a scheduling

hierarchy.

The only threads in the system that inherently have actual CPU
time available to them are the set of root scheduler threads;
other threads can only run if CPU time is donated to them.  There
is one root scheduler thread for each real CPU in the system;
each CPU is permanently dedicated to supplying CPU time to its
associated root scheduler thread.  The actions of the root
scheduler thread on a given CPU determine the base-level
scheduling policy for that CPU.


3.1   The Dispatcher


Even though all high-level scheduling decisions are performed by
threads, a small low-level mechanism is still needed to implement
primitive thread management functions.  We call this low-level
mechanism the dispatcher to distinguish it clearly from
high-level schedulers.

The role of the dispatcher is to to implement thread blocking,
unblocking, and CPU donation.  The dispatcher fields events and
directs them to threads waiting on those events, without actually
making any real scheduling decisions.  Events can be synchronous,
such as an explicit wake-up of a sleeping thread by a running
thread, or asynchronous, such an external interrupt (e.g., I/O or
timer).  The dispatcher itself is not a thread; it merely runs in
the context of whatever thread owns the CPU at the time an event
of interest occurs.

The dispatcher inherently contains no notion of thread
priorities, CPU usage, or even clocks and timers.  In a kernel
supporting CPU inheritance scheduling, the dispatcher is the only
scheduling component that must be in the kernel; all other
scheduling code could in theory run in user-mode threads outside
of the kernel (although this ``purist'' approach may be
impractical for performance reasons).


3.2   Requesting CPU Time


Because no thread (except a root scheduler thread) can ever run
unless some other thread donates CPU time to it, a newly-created
or newly-woken thread must request CPU time from some scheduler
before it can run.  Each thread has an associated scheduler which
has primary responsibility for providing CPU time to the thread.
When the thread becomes ready, the dispatcher notifies the
thread's scheduler that the thread needs CPU time.  The exact
form such a notification takes is not important; in our

implementation, notifications are simply IPC messages sent by the
dispatcher to Mach-like message ports.

When a thread wakes up, the notification it produces may in turn
wake up a scheduler thread waiting to receive such messages on
its port.  Waking up that scheduler thread will cause another
notification to be sent to its scheduler, which may wake up still
another thread, and so on.  Thus, waking up an arbitrary thread
can cause a chain of wakeups to propagate back through the
scheduler hierarchy.  Eventually, this propagation may wake up a
scheduler thread that is currently being supplied with CPU time
but is donating it to some other thread.  In that case, the
thread currently running on that CPU is preempted and control is
given back to the woken scheduler thread immediately; the
scheduler can then make a decision to re-run the preempted
client, switch to the newly-woken client, or even run some other
thread.  Alternatively, the propagation of wake-up events may
terminate at some point, for example because a notified scheduler
is already awake (not waiting for messages) but has been
preempted.  In that case, the dispatcher knows that the wake-up
event is irrelevant for scheduling purposes at the moment, so the
currently running thread is resumed immediately.


3.3   Relinquishing the CPU


At any time, a running thread may block to wait for one or more
events to occur, such as I/O completion or timer expiration.
When a thread blocks, the dispatcher returns control of the CPU
to the scheduler thread that provided it to the running thread.
That scheduler may then choose another thread to run, or it may
relinquish the CPU to its scheduler, and so on up the line until
some scheduler finds work to do.


3.4   Voluntary Donation


Instead of simply blocking, a running thread can voluntarily
donate its CPU to another thread while waiting on an event of
interest; this is done in situations where priority inheritance
would traditionally be used.  For example, when a thread attempts
to obtain a lock that is already held, it may donate its CPU time
to the thread holding the lock; similarly, when a thread makes an
RPC to a server thread, the client thread may donate its CPU time
to the server for the duration of the request.  When the event of
interest occurs, the donation ends and the CPU is given back to
the original thread.  In our implementation of this framework,
the basic synchronization and IPC primitives automatically invoke

the dispatcher to perform voluntary donation appropriately when
the thread blocks; however, voluntary donation could also be done
optionally or through explicit dispatcher calls.

It is possible for a single thread to inherit CPU time from more
than one source at a given time:  for example, a thread holding a
lock may inherit CPU time from several threads waiting on that
lock in addition to its own scheduler.  In this case, the effect
is that the thread has the opportunity to run at any time any of
its donor threads would have been able to run.  A thread only
``uses'' one CPU source at a time; however, if its current CPU
source runs out (e.g., due to quantum expiration), the dispatcher
will automatically send scheduling request notifications on
behalf of all the threads depending on (donating to) to the
preempted thread, effectively switching the thread automatically
to another available CPU source.  One potential worry is that a
thread consuming CPU time from many sources will cause an
``avalanche effect'' every time it is preempted or woken as the
dispatcher fires off several scheduling requests, each of which
may cause more scheduling requests as intermediate-level
schedulers are woken up.  We believe that in practice it should
be uncommon for a thread to inherit from more than one or two
other threads at once, so this should not be a major problem;
however, we have not yet examined this issue in detail.


3.5   The schedule operation


The call a scheduler thread makes to donate CPU time to a client
thread is simply a special form of voluntary CPU donation, in
which the thread to donate to and the event to wait for can be
specified explicitly.  In our implementation, the schedule
operation takes as parameters a thread to donate to, a port on
which to wait for messages from other client threads, and a
wakeup sensitivity parameter indicating in what situations the
scheduler should be woken.  The operation donates the CPU to the
specified target thread and puts the scheduler thread to sleep on
the specified port; if a message arrives on that port, such as a
notification that another client thread has been woken or a
message indicating that a timer has expired, then the schedule
operation terminates and control is returned to the scheduler
thread.

In addition, the schedule operation may be interrupted before a
message arrives, depending on the behavior of the thread to which
the CPU is being donated and the value of the wakeup sensitivity
parameter.  The wakeup sensitivity level acts as a hint to the
dispatcher allowing it to avoid waking up the scheduler thread


                    Figure 3:  CPU donation chain


except when necessary; it is only an optimization and is not in
theory required for the system to work.  Our system supports the
following three sensitivity levels:


   o WAKEUP_ON_BLOCK: If the target of the schedule operation
     blocks without further donating the CPU, then the schedule
     operation terminates and control is returned to the
     scheduler immediately.  For example, in Figure 3, if
     scheduler thread S1 has donated the CPU to thread T2 using
     this wakeup sensitivity setting, but T2 blocks and can no
     longer use the CPU, then S1 will receive control again.
     WAKEUP_ON_BLOCK is the ``most sensitive'' setting, and is
     typically used when the scheduler has other client threads
     waiting to run.

   o WAKEUP_ON_SWITCH: If the client thread using the CPU (T2)
     blocks, control is not immediately returned to its scheduler
     (S1):  the dispatcher behaves instead as if S1 itself
     blocked, and passes control on back to its scheduler, S0.
     If T2 is subsequently woken up, then when S0 again provides
     the CPU to S1, the dispatcher passes control directly back
     to T2 without actually running S1.  However, if a different
     client of S1, such as T1, wakes up and sends a notification
     to S1's message port, then S1's schedule operation will be
     interrupted.  This sensitivity level is typically used when
     a scheduler has only one thread to run at the moment and
     doesn't care when that thread blocks or unblocks, but it
     still wants to switch between different client threads
     manually:  for example, the scheduler may need to start and
     stop timers when switching between client threads.

   o WAKEUP_ON_CONFLICT: The scheduler is only awakened if a
     second client thread wakes up while the scheduler is already
     donating CPU to a client (e.g., if T1 wakes up while T2 is

     running).  If T2 blocks, the scheduler blocks too; then, if
     any single client of scheduler S1 is subsequently woken,
     such as T1, the dispatcher passes control directly to the
     woken client thread without waking up the scheduler.  At
     this weakest sensitivity level, the dispatcher is allowed to
     switch among client threads freely; the scheduler only acts
     as a ``conflict resolver,'' making decisions when two client
     threads are runnable at the same time.


4  Implementing High-level Schedulers

This section describes how the basic CPU inheritance mechanism
can be used to implement high-level scheduling policies and
related features such as CPU usage accounting, processor
affinity, and scheduler activations.


4.1   Single-CPU Schedulers


Figure 4 shows a simplified code fragment from a basic
non-prioritized FIFO scheduler in our system.  The scheduler
keeps a queue of client threads waiting for CPU time, and
successively runs each one using the schedule operation while
waiting for messages to arrive on its port (e.g., notifications
from newly-woken client threads).  When there are no client
threads waiting to be run, the scheduler uses the ordinary
non-donating msg_rcv operation instead of the schedule operation
in order to relinquish the CPU while waiting for messages.  If
there is only one client thread in the scheduler's queue, the
scheduler uses the weaker WAKEUP_ON_CONFLICT sensitivity level
when running it to indicate that the dispatcher may switch among
client threads arbitrarily as long as only one client thread
attempts to use the CPU at a time.


4.2   Timekeeping and Preemption


The simple FIFO scheduler above can be converted to a round-robin
scheduler by introducing some form of clock or timer.  For
example, if the scheduler is the root scheduler on a CPU, then
the scheduler might be directly responsible for servicing clock
interrupts.  Alternatively, the scheduler may rely on a separate
``timer thread'' to notify it when a periodic timer expires.  In
any case, a timer expiration or clock interrupt is indicated to
the scheduler by a message being sent to the scheduler's port.
This message causes the scheduler to break out of its schedule
operation and preempt the CPU from whatever client thread was

using it.  The scheduler can then move that client to the tail of
the ready queue for its priority and give control to the next
client thread at the same priority.


4.3   Multiprocessor Support


Since the example scheduler above only contains a single
scheduler thread, it can only schedule a single client thread at
once.  Therefore, although it can be run on a multiprocessor
system, it cannot take advantage of multiple processors
simultaneously.  For example, a separate instance of the FIFO
scheduler could be run as the root scheduler on each processor; a
client thread assigned to a given scheduler is effectively bound
to its scheduler's CPU. Although in some situations this
arrangement can be useful, e.g., when each processor is to be
dedicated to a particular purpose, in most cases it is not what
is needed.

In order for a scheduler to provide ``real'' multiprocessor
scheduling to its clients, where different client threads can be
dynamically assigned to different processors on demand, the
scheduler itself must be multi-threaded.  Assume for now that the
scheduler knows how many processors are available, and can bind
threads to processors.  (This is trivial if the scheduler is run
as the root scheduler on some or all processors; we will show
later how this requirement can be met for non-root schedulers.)
The scheduler creates a separate thread bound to each processor;
each of these scheduler threads then selects and runs client
threads on that processor.  The scheduler threads cooperate with
each other using shared variables, e.g., shared run queues in the
case of a multiprocessor FIFO scheduler.

Since a scheduler's client threads are supposed to be unaware
that they are being scheduled on multiple processors, the
scheduler exports only a single port representing the scheduler
as a whole to all of its clients.  When a client thread wakes up
and sends a notification to the scheduler port, the dispatcher
arbitrarily wakes up one of the scheduler threads waiting on that
port.  A good policy is for the dispatcher to wake up the
scheduler thread associated with the CPU on which the wakeup is
being done; this allows the scheduler to be invoked on the local
processor without interfering with other processors
unnecessarily.  If the woken scheduler thread discovers that the
newly woken client should be run on a different processor (e.g.,
because it is already running a high-priority client but another
scheduler thread is running a low-priority client), it can
interrupt the other scheduler thread's schedule operation by

sending it a message or ``signal''; this corresponds to sending
inter-processor interrupts in traditional systems.


4.3.1   Processor Affinity

Scheduling policies that take processor affinity into
consideration[27, 28, 29], can be implemented by treating each
scheduler thread as a processor and attempting to schedule a
client thread from the same scheduler thread that previously
donated CPU time to that client thread.  Of course, this will
only work if the scheduler threads themselves are consistently
run on the same processor.  Any processor affinity support in one
scheduling layer will only work well if all the layers between it
and the root scheduler also consider processor affinity; a
mechanism to ensure this is described in the next section.


4.3.2   Scheduler Activations

In the common case, client threads ``communicate'' with their
schedulers implicitly through notifications sent by the
dispatcher on behalf of the client threads.  However, there is
nothing to prevent client threads from explicitly communicating
with their schedulers through some additional interface.  One
particularly useful explicit interface is scheduler
activations[3], which allows clients to determine initially and
later track the number of actual processors available to them.
Clients can then create or destroy threads as appropriate in
order to make use of all available processors without creating
superfluous threads that compete with each other uselessly on a
single processor.

Since scheduler threads are notified by the dispatcher when a
client thread blocks and temporarily cannot use the CPU (e.g.,
because the thread is waiting for an I/O request or a page fault
to be serviced), the scheduler can notify the client in such a
situation and give the client an opportunity to create a new
thread to make use of the CPU while the original thread is
blocked.  For example, a client can create a pool of ``dormant''
threads, or ``activations,'' which the scheduler knows about but
normally never runs.  If a CPU becomes available, e.g., because
of another client thread blocking, the scheduler ``activates''
one of these dormant threads on the CPU vacated by the blocked
client thread.  Later, when the blocked thread eventually
unblocks and requests CPU time again, the scheduler preempts one
of the currently running client threads and notifies the client
that it should make one of the active threads dormant again.

Scheduler activations were originally devised to provide better
support for application-specific thread packages running in a

single user mode process.  In an OS kernel that implements CPU
inheritance scheduling, extending a scheduler to provide this
support should be quite straightforward.  However, in a
multiprocessor system based on CPU inheritance scheduling,
scheduler activations are also useful in stacking of first-class
schedulers.  As mentioned previously, multiprocessor schedulers
need to know the number of processors available in order to use
the processors efficiently.  As long as a base-level scheduler
(e.g., the root scheduler on a set of CPUs) supports scheduler
activations, a higher-level multiprocessor scheduler running as a
client of the base-level scheduler can use the scheduler
activations interface to track the number of processors available
and schedule its clients effectively.  (Simple single-threaded
schedulers that only make use of one CPU at a time don't need
scheduler activations and can be stacked on top of any
scheduler.)


4.4   Timing


Most scheduling algorithms require a measurable notion of time in
order to implement preemptive scheduling.  For most schedulers, a
periodic interrupt is sufficient, although some real-time
schedulers may need finer-grained timers whose periods can be
changed at each quantum.  With CPU inheritance scheduling, the
precise nature of the timing mechanism is not important to the
general framework; all that is needed is some way for a scheduler
thread to be woken up after some amount of time has elapsed.  In
our implementation, schedulers can register timeouts with a
central clock interrupt handler; when a timeout occurs, a message
is sent to the appropriate scheduler's port, waking up the
scheduler.  The dispatcher automatically preempts the running
thread if necessary and passes control to the scheduler so that
it can account for the elapsed time and potentially switch to a
different client thread.


4.4.1   CPU Usage Accounting

Besides simply deciding which thread to run next, schedulers
often must account for CPU resources consumed.  CPU accounting
information is used for a variety of purposes, such as reporting
usage statistics to the user on demand, modifying scheduling
policy based on CPU usage (e.g., dynamically adjusting thread
priority), or billing a customer for CPU time consumed for a
particular job.  As with scheduling policies, there are many
possible CPU accounting mechanisms, each with different
cost/benefit tradeoffs.  The CPU inheritance scheduling framework

allows a variety of accounting policies to be implemented by
scheduler threads.

There are two well-known approaches to CPU usage accounting:
statistical and time stamp-based[4].  With statistical
accounting, the scheduler wakes up on every clock tick, checks
the currently running thread, and charges the entire time quantum
to that thread.  This method is quite efficient, since the
scheduler generally wakes up on every clock tick anyway; however,
it provides limited accuracy.  A variation that provides better
accuracy at slightly higher cost is to sample the current thread
at random points between clock ticks[21].  Alternatively, with
time stamp-based accounting, the scheduler reads the current time
at every context switch and charges the appropriate thread for
the time since the last context switch.  This method provides
extremely high accuracy, but also imposes a high cost due to
lengthened context switch times, especially on systems on which
reading the current time is expensive.

In the root scheduler on a processor, these methods can be
applied directly.  To implement statistical accounting, the
scheduler simply checks what thread it ran last upon being woken
up by the arrival of a timeout message.  To implement time
stamp-based accounting, the scheduler reads the current time each
time it schedules a different client thread.  The scheduler must
use the WAKEUP_ON_BLOCK sensitivity level in order to ensure that
it can check the time at each thread switch and to ensure that
idle time is not charged to any thread.

For schedulers stacked on top of other schedulers, CPU accounting
becomes a little more complicated because the CPU time supplied
to such a scheduler is already ``virtual'' and cannot be measured
accurately by a wall-clock timer.  For example, in Figure 3, if
scheduler S1 measures T2's CPU usage using a wall-clock timer,
then it may mistakenly charge against T2 time actually used by
the high-priority thread T0, which S1 has no knowledge of because
it is scheduled by the root scheduler S0.  In many cases, this
inaccuracy caused by stacked schedulers may be ignored in
practice on the assumption that high-priority threads and
schedulers will consume relatively little CPU time; this
assumption is similar to the one made in many existing kernels
that hardware interrupt handlers consume little enough CPU time
that they may be ignored for accounting purposes.  In situations
in which this assumption is not valid and accurate CPU accounting
is needed for stacked schedulers, virtual CPU time information
provided by base-level schedulers can be used instead of
wall-clock time, at the cost of additional communication between
schedulers.  For example, in Figure 3, at each clock tick (for

statistical accounting) or each context switch (for time
stamp-based accounting), scheduler S1 could request its own
virtual CPU time usage from S0 instead of checking the current
wall-clock time.  It then uses this virtual time information to
maintain usage statistics for its clients, T1 and T2.


4.4.2   Effects of CPU Donation on Timing

As mentioned earlier, CPU donation can occur implicitly as well
as explicitly, e.g., to avoid priority inversion when a
high-priority thread attempts to lock a resource already held by
a low-priority thread.  For example, in Figure 5, scheduler S0
has donated the CPU to high-priority thread T0 in preference over
low-priority thread T1.  However, it turns out that T1 is holding
a resource needed by T0, so T0 implicitly donates its CPU time to
T1.  Since this donation merely extends the scheduling chain, S0
is unaware that the switch occurred, and it continues to charge
CPU time used to T0 instead of T1 which is the thread that is
actually using the CPU.

While it may seem non-intuitive at first, in practice this is
often precisely the desired behavior; it stems from the basic
rule that with privilege comes responsibility.  While T0 is
donating CPU to T1, T1 is effectively doing work on behalf of T0,
even if T1 is unaware that it is receiving CPU time from T0.
Since T0's share of the CPU is being used to perform this work,
the CPU time consumed must also be charged to T0, even if the
work is actually being done by another thread.  Demonstrated
another way, charging T1 rather than T0 would allow the
accounting system to be subverted.  For example, if high-priority
CPU time is ``expensive'' and low-priority CPU time is ``cheap,''
then T0 could collude with T1 to use high-priority CPU time while
being charged the low-priority ``rate'' simply by arranging for
T1 to do all the actual work while T0 blocks on a lock
perpetually held by T1.  This ability to charge the ``proper''
thread for CPU usage even in the presence of priority inheritance
is unnatural and difficult to implement in traditional systems,
and therefore is generally not implemented by them; on the other
hand, this feature falls out of the CPU inheritance framework
automatically.


4.5   Threads with Multiple Scheduling Policies


Sometimes it is desirable for a single thread to be associated
with two or more scheduling policies at once.  For example, a
thread may normally run in a real-time rate-monotonic scheduling
class; however, if the thread's quantum expires before its work

is done, it may be desirable for the thread to drop down to the
normal timesharing class instead of simply stopping dead in its
tracks.

Support for multiple scheduling policies per thread can be
achieved in our framework even under a dispatcher that supports
only a single scheduler association per thread, by creating one
or more additional ``dummy'' threads which perpetually donate
their CPU time to the ``primary'' thread.  Each of these threads
can have a different scheduler association, and the dispatcher
automatically ensures that the primary thread always uses the
highest priority scheduler available, as described in
Section 3.4.  In situations in which this solution is not
acceptable due to performance or memory overhead, the dispatcher
could easily be extended to allow multiple schedulers to be
associated with a single thread, so that when such a thread
becomes runnable the dispatcher automatically notifies all of the
appropriate schedulers.

Although it may at first seem inefficient to notify two or more
schedulers when a single thread awakes, in practice many of these
notifications never actually need to be delivered.  For example,
if a real-time/timesharing thread wakes up, finishes all of its
work and goes back to sleep again before its real-time scheduling
quantum is expired (presumably the common case), then the
notification posted to the low-priority timesharing scheduler at
wakeup time will be canceled (removed from the queue) when the
thread goes to sleep again, so the timesharing scheduler
effectively never sees it.


5  Analysis and Experimental Results

We have created a prototype implementation of this scheduling
framework and devised a number of tests to evaluate its
flexibility, performance, and practicality.


5.1   Test Environment


In order to provide a clean, easily controllable environment, we
implemented an initial prototype of this framework in a
user-level threads package.  The threads package supports common
abstractions such as mutexes, condition variables, and message
ports for inter-thread communication and synchronization.  The
package implements separate thread stacks with setjmp/longjmp,
and the virtual CPU timer alarm signal (SIGVTALRM) is used to
provide preemption and simulate clock interrupts.  We used the
virtual CPU timer instead of the wall-clock timer in order to

minimize distortion of the results due to other activity in the
host Unix system; in a ``real'' user-level threads package based
on this scheduling framework, the normal wall-clock timer would
probably be used instead.  Our prototype allows threads to wait
on only one event at a time; however, there is nothing about the
framework that makes it incompatible with thread models in which
threads can wait on multiple events at once[25].

Although implemented in user space, our prototype is designed to
reflect the structure and execution environment of an actual OS
kernel running in privileged mode.  For example, the dispatcher
itself is passive, nonpreemptible code executed in the context of
the currently running thread, an environment similar to that of
BSD and other traditional kernels.  The dispatcher is cleanly
isolated from the rest of the system, and supports scheduling
hierarchies of unlimited depth and complexity.  Our prototype
schedulers are also isolated from each other and from their
clients; the various components communicate with each other
through message-based protocols that could easily be adapted to
operate across protection domains using IPC.

Except where otherwise noted, all measurements were taken on a
100MHz Pentium PC with 32 megabytes of RAM, running FreeBSD
2.1.5.


5.2   Scheduler Configuration


The following experiments are based on scheduling hierarchy shown
in Figure 6, which is designed to reflect the activity present in
a general-purpose environment.  In this environment, the root
scheduler is a nonpreemptive fixed-priority scheduler with a
first-come-first-served policy among threads of same priority.
This scheduler arbitrates between three scheduling classes:  a
real-time rate-monotonic scheduler at the highest priority, a
lottery scheduler providing a timesharing class, and a simple
round-robin scheduler for background jobs.  On top of the lottery
scheduler managing the timesharing class, an application-specific
third-level scheduler manages two threads using lottery
scheduling (e.g., a Web browser managing Java applet threads).
Finally, a second application-specific scheduler in the
timesharing class schedules two cooperating threads using
nonpreemptive FIFO scheduling.


5.3   Scheduling Behavior


The purpose of our first experiment is simply to verify that our
framework works as expected, producing the same scheduling

behavior as traditional single- and multi-class schedulers do in
equivalent configurations.  Figure 7 shows the scheduling
behavior of the threads simulated in our scheduling hierarchy
over a time period covering the following sequence of events:


  1. A rate-monotonic thread, RM1, becomes runnable and starts
     consuming CPU time periodically using a fixed reservation of
     50%.

  2. A round-robin background thread, RR1, begins a long
     computation in which it consumes all the CPU time it can
     get.  The rate-monotonic thread, RM1, is unaffected because
     the fixed-priority root scheduler always schedules the
     real-time scheduler in preference to the background
     scheduler.

  3. A second round-robin background thread, RR2, becomes
     runnable at the same priority as RR1.

  4. An application thread in the timesharing class becomes
     runnable, stealing all available CPU time from the
     background jobs for a short burst of time (e.g., a
     spreadsheet recalculation).

  5. A second rate-monotonic thread, RM2, becomes runnable,
     consuming 25% of the total CPU time.  The time available to
     the timesharing class is reduced accordingly, but RM1
     remains unaffected.

  6. The timesharing thread, LS1, finishes its burst of activity,
     allowing the background jobs to continue once again.


5.4   Modular Control


In order to demonstrate the modular control provided by the
framework, we now consider the two third-level schedulers in our
example configuration, which represent application-specific
schedulers.  The first, implementing lottery scheduling,
simulates a web browser arbitrating between two downloaded Java
applets.  The browser can vary the ticket ratio allocated to each
applet according to some policy, e.g., giving more CPU time to
applets whose output windows are currently on-screen.  The second
application-level scheduler, a simple non-preemptive FIFO
scheduler, represents an application that uses threads merely for
programmatic purposes, and does not need to maintain any notion
of priority or timeliness between its threads.  Using a
non-preemptive scheduler in such an application eliminates

unnecessary contention and context switches between application
threads without giving up the benefits of preemptive scheduling
in other parts of the system.

In this example, the two Java applet threads perpetually consume
as much CPU time as possible, and each of the cooperating FIFO
threads alternately run for some time and then yield control to
the other.  Initially, the ticket ratio between the web browser
and the cooperative application is 4:1, the ratio between the
applet threads is 1:4, and each of the FIFO threads compute for
approximately the same time before yielding to the other.
Figure 8 shows the behavior of the system across four events:


  1. At time 2000, the web browser changes the relative ticket
     ratio between the Java applet threads to be 1:1, e.g.,
     because the first has come on-screen.  The amount of CPU
     time allocated to the cooperative application remains
     unchanged, however, demonstrating load insulation.  Since
     the timesharing-class scheduler and the web browser's
     scheduler are both lottery schedulers, this example is
     equivalent to the use of two currencies in a single lottery
     scheduler[31].

  2. The cooperative thread FIFO1 changes its computation so that
     it now consumes four times the amount of CPU time before
     yielding to FIFO2.  The effective distribution of CPU time
     changes to 4:1, reflecting the fact that the FIFO scheduling
     policy makes no attempt at fairness.  However, since the
     timesharing-class scheduler uses a proportional-share
     policy, the applications are insulated from each other and
     the web browser is unaffected.  This example demonstrates
     load insulation between different scheduling policies.

  3. The user changes the ticket allocation between the two
     applications to 1:1.  Both sub-schedulers automatically
     adjust to the new allocation according to their individual
     scheduling policies.

  4. Finally, the priority of RR1 is raised above that of RR2,
     causing it to receive all of the CPU time allocated to the
     round-robin scheduler while leaving the other scheduling
     domains unaffected.


5.5   Avoidance of Priority Inversion


In the next experiment we demonstrate how priority inversion can
be avoided in our framework even among threads running under

different scheduling policies.  For this experiment, we use
rate-monotonic thread RM1, lottery scheduled thread LS1, and
background thread RR1.  The rate-monotonic thread and the
background thread together simulate a data acquisition
application in which a high-priority real-time thread
periodically receives data and stores it in a memory buffer, and
a low-priority computational thread inspects that data in the
background using spare CPU cycles.  In our simulated environment,
the high-priority thread wakes up every five clock ticks,
acquires a shared mutex lock, and then releases it one clock tick
later.  The low-priority thread alternately acquires the lock and
computes for up to three clock ticks, and then releases it for a
varying amount of time.  The lottery scheduled thread, LS1, is a
medium-priority thread representing other priority
inversion-producing activity in the system; it alternately runs
and sleeps for random amounts of time up to 20 ticks, but
otherwise does not interact with the other threads.

Figure 9 shows a plot of the distribution of latencies
experienced by the high-priority thread in attempting to acquire
the shared mutex, with and without the voluntary donation
protocol described in Section 3.4.  When the inheritance protocol
is in effect, maximum latency is bounded by the maximum amount of
time that the low-priority thread runs with the lock held (three
clock ticks), and does not reflect the much larger delays that
would be imposed if LS1 could prevent RR1 from running while RM1
is waiting for it to release the shared lock.  Thus, the
framework provides correct real-time behavior even though all
three threads run under different scheduling policies:  it is not
priority inheritance, but the more general CPU inheritance
protocol, that makes this work.


5.6   Multiprocessor Scheduling


To test the ability of our framework to support multiprocessor
scheduling, we extended the test environment to simulate a
multiprocessor.  Since a real multiprocessor machine was not
available for this test, we emulated one by time slicing the BSD
process at a fine granularity among several ``virtual
processors.''  Each processor has its own dispatcher and root
scheduler thread, but the root scheduler threads cooperate as
described in Section 4.3 to implement a single fixed-priority
multiprocessor scheduler with a shared ready queue for each
priority level.  We tested the multiprocessor scheduler by
implementing a simple parallel database lookup application, in
which multiple threads repeatedly search for a randomly chosen

         ____________________________________________________
         |_Scheduling_Hierarchy_Depth_|_Dispatch_Time_(~s)__|
         |_Root_scheduler_only________|_________________8.0_|
         |_2-level_scheduling_________|________________11.2_|
         |_3-level_scheduling_________|________________14.0_|
         |_4-level_scheduling_________|________________16.2_|
         |_8-level_scheduling_________|________________24.4_|


                 Table 1:  Basic dispatching costs.


item in a shared binary tree, locking and unlocking the nodes of
the tree as they descend.  We ran this application on 1, 2, 4,
and 8-processor configurations, with varying numbers of worker
threads; in each case, the behavior and performance was exactly
as would be expected from a comparable traditional scheduler.
Although this experiment demonstrates that the framework extends
to multiprocessors, we have not yet explored the effects of
stacking multiprocessor scheduling policies, nor have we
implemented scheduler activations, which we suspect will be
critical to making CPU inheritance scheduling work well in many
practical multiprocessor situations.


5.7   Scheduling Overhead


While the preceding experiments demonstrate the basic properties
and potential usefulness of our scheduling framework, it remains
to be seen whether it is efficient enough in practice.  In
comparison to traditional multi-class scheduling mechanisms, our
framework introduces two additional sources of overhead:  first,
the overhead caused by the dispatcher itself while computing the
thread to switch to after a given event, and second, the overhead
resulting from additional context switches to and from scheduler
threads.  The computation overhead caused by a given scheduling
algorithm is not an issue, since the same algorithms can be used
in each case.


5.7.1   Dispatcher Costs

To address the first issue, Table 1 shows the basic costs of the
dispatcher's computation.  These costs are dependent on the depth
of the scheduling hierarchy since the dispatcher must often
iterate through trees and linked lists whose length is
proportional to the depth of the hierarchy.

One concern raised by the dependence of dispatch time on
scheduling hierarchy depth is that, since in theory the framework

imposes no depth limit and the dispatcher itself is always
effectively the ``highest priority activity in the system,'' this
creates a source of unbounded priority inversion that would be
unacceptable in hard real-time systems.  However, in practice
there is no reason a particular dispatcher must support unlimited
depth; the dispatcher in a hard real-time system could limit the
depth to four or eight levels, which should be sufficient for all
practical purposes and still impose little computational
overhead.


5.7.2   Context Switch Costs

The second type of overhead in our framework is the cost of
additional context switches to and from scheduler threads.
Unfortunately, the cost of a context switch varies widely between
different environments, from user-level threads packages in which
context switches are almost free, to monolithic kernels in which
context switches are several orders of magnitude more expensive.
An obvious inference is that our framework is likely to be
practical in some environments but not in others.  Therefore, in
order to gain a meaningful idea of how expensive our framework is
likely to be in a given environment, we first measure the number
of additional context switches produced, which varies with
different applications and system loads but is not dependent on
thread and protection boundary-crossing costs.

Table 2 shows the context switch statistics observed for each of
several example applications:  ``Client/Server'' is an
application in which a number of client threads repeatedly invoke
services on a smaller set of server threads; ``Parallel
Database'' is the multiprocessor binary search application
described in Section 5.6; ``Real-time'' is the data acquisition
application from Section 5.5; finally, ``General'' is the test in
Section 5.3 representing general-purpose computing activity.  The
chart shows the number of times the dispatcher switched to each
thread over the course of the test; user threads are shown
separately from scheduler threads.  The last row in the table
shows the percentage of total context switches that were
invocations of scheduler threads.  This percentage is
consistently near 50% regardless of the application; this
indicates that, on average, approximately one additional
scheduler thread invocation can be expected in our framework for
each ordinary context switch.  However, note that this chart is
overly pessimistic because the ``scheduler invocations'' figures
include all context switches to scheduler threads, not only those
directly related to scheduling:  in particular, it includes
context switches caused by explicit RPCs, which greatly inflate


________________________________________________________________
|                          | Client/| Parallel| Real- | General |
|__________________________|_Server_|Database_|__time_|_________|

| RM1                      |     57 |         |   322 |     101 |
| RM2                      |     19 |         |       |      26 |
| RM3                      |     19 |         |       |         |
| LS1                      |     25 |         |   622 |      17 |
| JAVA1                    |     46 |         |       |         |
| FIFO1                    |      9 |         |       |         |
| RR1                      |    114 |     238 |   249 |       7 |
| RR2                      |      3 |     242 |       |      14 |
| RR3                      |        |     234 |       |         |
|_RR4______________________|________|_____243_|_______|_________|
|_User_invocations_________|____492_|_____957_|__1193_|_____165_|

| Root scheduler           |    262 |     956 |  1237 |     142 |
| Rate monotonic           |     43 |         |     1 |      65 |
| Lottery scheduler        |     30 |         |    57 |       3 |
| Java thread scheduler    |      2 |         |       |         |
| FIFO scheduler           |      1 |         |       |         |
|_Round-robin_scheduler____|______8_|_________|_____8_|_______8_|
|_Scheduler_invocations____|____346_|_____956_|__1303_|_____218_|

|_Total_context_switches___|____838_|____1913_|__2496_|_____383_|

|_Scheduler_invoc.__rate___|____41%_|_____50%_|___52%_|_____56%_|

          Table 2:  Switch Costs for Simple Applications.


the counts for the root scheduler in our system because our root
scheduler thread also provides message-based timers and other
facilities to the rest of the system.


5.7.3   Overhead in Kernel Environments

Although we have not yet implemented our framework in a kernel
environment, we can get some idea of how it would perform in such
an environment based on the results shown above and a knowledge
of how real-world applications exercise the scheduler.  Table 3
shows system-wide statistics we gathered for several familiar
applications running on FreeBSD 2.1.5, measured using BSD's
vmstat command.  gzip is a compute-intensive application
compressing an 8MB file; gcc is a build of a 20,000-line program
using the GNU C compiler; tar represents an I/O-intensive
application copying 8MB of source files; and configure is an
extremely I/O and fork-intensive 3000-line Unix shell script.
All tests were performed on a networked machine running in
multi-user mode with a full complement of daemons but no other
user activity, representative of a typical single-user


_________________________________________________________
|________________________|_gzip_|_gcc_|__tar_|configure_|
|_Run_time_(sec)_________|_26.4_|35.3_|__9.6_|_____26.0_|
| Context switches/sec   |  11  |  32 |   81 |      202 |
| Traps/sec              |  10  | 562 |   22 |     3470 |
| System calls/sec       |  23  | 651 |  517 |     1807 |
|_Device_interrupts/sec__|_427__|_509_|_3337_|_____1055_|

Table 3:  Scheduling-related statistics for test applications
measured under FreeBSD-2.1.5 on a 100MHz Pentium PC.


workstation.

For three of the applications in Table 3, gzip (best-case), gcc
(average), and configure (worst-case), Figure 10 shows the
overall application slowdown that would result if each normal
process-to-process context switch was n microseconds more
expensive, where n varies from 1 to 1000 (1 ms).  This graph
effectively shows the ``tolerance'' of a system to scheduling
overhead in different situations:  for example, supposing that
typical users would tolerate no more than about 2% slowdown for
typical applications such as gcc, our framework (or any other
scheduling mechanism) would have to add no more than about 300~s
to the process-to-process context switch time if implemented in
FreeBSD. Suppose FreeBSD was changed so that all scheduling was
done in user mode, adding approximately one additional context
switch due to a scheduler invocation for each existing
process-to-process context switch.  Based on context switch times
we measured using the lmbench benchmark suite[24], which are
approximately 39~s on this machine, plus an additional 11~s to
reflect the dispatcher's cost (Section 5.7.1), the overall
overhead should still be negligible simply because FreeBSD does
not context switch all that often (see arrow A on the graph).
Furthermore, given BSD's monolithic design, we would in practice
expect at least the root scheduler to be implemented in the
kernel and only application-specific schedulers to be in user
mode; this would reduce the cost even further.

Of course, microkernels have much less tolerance for scheduling
overhead simply because they perform more context switches.  For
example, in Mach, traps and system calls can be expected to
produce about two additional context switches each; for an
extremely ``purist'' microkernel such as L4, in which even device
drivers are in user mode, device interrupts would also add an
additional two context switches each.  Figure 10 also shows a
plot of scheduling overhead tolerance for a hypothetical L4-like
microkernel, based on the numbers in Table 3 except with traps,
system calls, and interrupts each counting as two additional

context switches.  For example, to keep the overhead under 2% for
gcc in this system, the additional per-context switch cost must
be no more than about 6~s (see arrow B), which would be difficult
even with L4's phenomenal 3.2~s round-trip RPC time[20].  On the
other hand, these ``back of the envelope'' calculations are
pessimistic in several ways:  first, an additional scheduler
invocation should not be needed for every normal context switch,
as explained in Section 5.7.2; second, in a microkernel that
dispatches hardware interrupts to threads, we would expect at
least a minimal fixed-priority root scheduler to be implemented
in the kernel so that device driver threads can be scheduled
without the help of user-level schedulers; and third, our
prototype dispatcher is extremely unoptimized in terms of both
its algorithm (it still causes many avoidable context switches)
and its implementation (it takes much longer than necessary to
compute the next thread to run).  There should certainly be many
points in the design space at which CPU inheritance scheduling is
practical even for microkernels; we are currently working on a
second prototype of the framework in our Fluke
microkernel[10, 11] in order to evaluate the framework further in
this light.


5.8   Code Complexity


As a final useful metric of the practicality of our framework, we
measured our prototype's code size and complexity in terms of
both raw line count and lines containing semicolons.  The entire
dispatcher is contained in a single well-commented file of 550
(raw) or 158 (semicolons) lines.  Each of our example schedulers
is around 200/100 lines long; of course, production-quality
schedulers that handle processor affinity, scheduler activations,
dynamic priority adjustments, and so on, would probably be
significantly bigger.


6  Conclusion

In this paper we have presented a novel processor scheduling
framework in which threads are scheduled by other threads.
Widely different scheduling policies can coexist in a single
system, providing modular, hierarchical control over CPU usage.
Applications as well as the OS can implement customized local
scheduling policies, and CPU resources consumed are accounted for
and attributed accurately.  The framework also cleanly addresses
priority inversion by providing a generalized form of priority
inheritance that automatically works within and among multiple

diverse scheduling policies.  CPU inheritance scheduling extends
naturally to multiprocessors, and supports processor management
techniques such as processor affinity and scheduler activations.
We have shown that this flexibility can be provided with
negligible overhead in environments in which context switches are
fast, and that the framework should be practical even in some
kernel environments in which context switches are more expensive.


Acknowledgements

For their many thoughtful and detailed comments on earlier drafts
we thank the anonymous reviewers and Kevin Jeffay, our shepherd,
as well as the members of the Flux project.  We are especially
grateful to Jay Lepreau for his support and advice, to Kevin Van
Maren for considerable help on the results and bibliography, and
to Mike Hibler for spending an entire day reviving Bryan's
machine after a complete disk failure on Friday the 13th just
before the final paper deadline.


References

 [1] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid,
     A. Tevanian, and M. Young.  Mach:  A New Kernel Foundation
     for UNIX Development.  In Proc. of the Summer 1986 USENIX
     Conf., pages 93--112, June 1986.

 [2] A. Adl-Tabatabai, G. Langdale, S. Lucco, and R. Wahbe.
     Efficient and Language-Independent Mobile Programs.  In Proc.
     ACM SIGPLAN Symp. on Programming Language Design and
     Implementation, pages 127--136, May 1996.

 [3] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M.
     Levy.  Scheduler Activations:  Effective Kernel Support for
     the User-Level Management of Parallelism.  ACM Trans. Comput.
     Syst., 10(1):53--79, Feb. 1992.

 [4] D. L. Black.  Scheduling and Resource Management Techniques
     for Multiprocessors.  PhD thesis, Carnegie Mellon University,
     July 1990.

 [5] A. C. Bomberger and N. Hardy.  The KeyKOS Nanokernel
     Architecture.  In Proc. of the USENIX Workshop on
     Micro-kernels and Other Kernel Architectures, pages 95--112,
     Seattle, WA, Apr. 1992.

 [6] O.-J. Dahl.  Hierarchical Program Structures.  Structured
     Programming, pages 175--220, 1972.

 [7] S. Davari and L. Sha.  Sources of Unbounded Priority
     Inversions in Real-time Systems and a Comparative Study of
     Possible Solutions.  ACM Operating Systems Review,
     23(2):110--120, Apr. 1992.

 [8] D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr.  Exokernel:
     An Operating System Architecture for Application-level
     Resource Management.  In Proc. of the 15th ACM Symp. on
     Operating Systems Principles, pages 251--266, Copper
     Mountain, CO, Dec. 1995.

 [9] R. B. Essick.  An Event-based Fair Share Scheduler.  In Proc.
     of the Winter 1990 USENIX Conf., pages 147--161, Washington,
     D.C., Jan. 1990.

[10] B. Ford, M. Hibler, and Flux Project Members.  Fluke:
     Flexible ~-kernel Environment (draft documents).  University
     of Utah. Postscript and HTML available under
     https://www.cs.utah.edu/projects/flux/fluke/html/, 1996.

[11] B. Ford, M. Hibler, J. Lepreau, P. Tullmann, G. Back, and
     S. Clawson.  Microkernels Meet Recursive Virtual Machines.  In
     Proc. of the Second Symp. on Operating Systems Design and
     Implementation, Seattle, WA, Oct. 1996. USENIX Assoc.

[12] D. B. Golub.  Adding Real-Time Scheduling to the Mach Kernel.
     Master's thesis, University of Pittsburg, 1993.

[13] J. Gosling and H. McGilton.  The Java Language Environment:
     A White Paper.  Technical report, Sun Microsystems Computer
     Company, 1996.  Available as
     https://java.sun.com/doc/language_environment/.

[14] P. Goyal, X. Guo, and H. M. Vin.  A Hierarchical CPU
     Scheduler For Multimedia Operations.  In Proc. of the Second
     Symp. on Operating Systems Design and Implementation,
     Seattle, WA, Oct. 1996. USENIX Assoc.

[15] N. Hardy.  The KeyKos Architecture.  Operating Systems Review,
     Sept. 1985.

[16] G. J. Henry.  The Fair Share Scheduler.  AT&T Bell
     Laboratories Technical Journal, 63(8), Oct. 1984.

[17] Institute of Electrical and Electronics, Inc.  IEEE Standard
     for Information Technology --- Portable Operating System
     Interface (POSIX) --- Part 1:  System Application
     Programming Interface (API) --- Amendment 1:  Realtime
     Extension [C Language], 1994.  Std 1003.1b-1993.

[18] E. D. Jensen.  A Benefit Accrual Model of Real-Time.  In Proc.
     of the 10th IFAC Workshop on Distributed Computer Control
     Systems, Sept. 1991.

[19] J. Kay and P. Lauder.  A Fair Share Scheduler.  Communications
     of the ACM, 31(1), Jan. 1988.

[20] J. Liedtke.  On Micro-Kernel Construction.  In Proc. of the
     15th ACM Symp. on Operating Systems Principles, pages
     237--250, Copper Mountain, CO, Dec. 1995.

[21] J. Liedtke.  A Short Note on Cheap Fine-grained Time
     Measurement.  ACM Operating Systems Review, 30(2):92--94,
     Apr. 1996.

[22] C. L. Liu and J. W. Layland.  Scheduling Algorithms for
     Multiprogramming in a Hard-Real-Time Environment.  Journal of
     the ACM, 20(1):46--61, Jan. 1973.

[23] J. M. McKinney.  A survey of analytical time-sharing models.
     Computing Surveys, page 105''116, jun 1969.

[24] L. McVoy and C. Staelin.  lmbench:  Portable Tools for
     Performance Analysis.  In Proc. of 1996 USENIX Conf., Jan.
     1996.

[25] Microsoft Corporation.  Win32 Programmer's Reference, 1993.
     999 pp.

[26] L. Sha, R. Rajkumar, and J. P. Lehoczky.  Priority
     Inheritance Protocols:  An Approach to Real-time
     Synchronization.  IEEE Transactions on Computers,
     39(9):1175--1185, 1990.

[27] M. S. Squillante and E. D. Lazowska.  Using Processor-Cache
     Affinity Information in Shared-Memory Multiprocessor
     Scheduling.  IEEE Transactions on Parallel and Distributed
     Systems, 4(2):131--143, 1993.

[28] J. Torrellas, A. Tucker, and A. Gupta.  Evaluating the
     Performance of Cache-Affinity Scheduling in Shared-Memory
     Multiprocessors.  Journal of Parallel and Distributed
     Computing, 24:139--151, 1995.

[29] R. Vaswani and J. Zahorjan.  The Implications of Cache
     Affinity on Processor Scheduling for Multiprogrammed, Shared
     Memory Multiprocessors.  In Proc. of the 13th ACM Symp. on
     Operating Systems Principles, pages 26--40, Oct. 1991.

[30] C. A. Waldspurger.  Lottery and Stride Scheduling:  Flexible
     Proportional-Share Resource Management.  PhD thesis,
     Massachusetts Institute of Technology, Sept. 1995.

[31] C. A. Waldspurger and W. E. Weihl.  Lottery Scheduling:
     Flexible Proportional-Share Resource Management.  In Proc. of
     the First Symp. on Operating Systems Design and
     Implementation, pages 1--11, Monterey, CA, Nov. 1994. USENIX
     Assoc.

[32] C. A. Waldspurger and W. E. Weihl.  Stride Scheduling:
     Deterministic Proportional-Share Resource Management.
     Technical Report MIT/LCS/TM-528, MIT Laboratory for Computer
     Science, June 1995.

void fifo_scheduling_loop()
{
  cur_thread = NULL;
  more_msgs = 1;
  for (;;) {
    if (more_msgs) {
       more_msgs = msg_rcv(&fifo_pset, &msg);
    } else {

       /* Select the thread to run next. */
       if ((cur_thread == 0) &&
           !queue_empty(&fifo_runq))
         cur_thread = q_remove(&fifo_runq);

       /* Select wakeup sensitivity level. */
       cond = q_is_empty(&fifo_runq) ?
         WAKEUP_ON_CONFLICT : WAKEUP_ON_BLOCK;

       /* Schedule and wait for messages. */
       if (cur_thread != NULL)
         more_msgs = schedule(fifo_pset, &msg,
                              cur_thread, cond);
       else
         more_msgs = msg_rcv(fifo_pset, &msg);
    }

    /* Process the received message. */
    switch (msg.request_code) {
       case MSG_SCHED_REQUEST:
         /* A client thread wants to run. */
         q_enter(&fifo_runq, msg.thread_id);
         break;
       case MSG_SCHED_BLOCKED:
         /* Last thread gave up the CPU. */
         cur_thread = 0;
         break;
    }
  }
}


        Figure 4:  Example single-processor FIFO scheduler.

Figure 5:  Implicit CPU donation from high-priority thread T0
to low-priority thread T1 to avoid priority inversion during a
resource conflict.

    Figure 6:  Multilevel scheduling hierarchy used for tests.


Figure 7:    Behavior of a multilevel scheduling hierarchy.
Events are numbered according to the description in Section 5.3.
Figure 8:    Demonstration of modular control of CPU allocation.
Space between the lines represents the percentage of CPU
time used by a particular thread.  Events occur at 2000-tick
intervals, and correspond to the description in Section 5.4.


         Figure 9:  Priority inheritance between schedulers
Figure 10:    Overall application slowdown as a function of
additional overhead per process-to-process context switch.  In
BSD, only switches between Unix processes count; in Mach, traps
and system calls count as RPCs (two context switches each); in
L4, device interrupts also count as RPCs.