Check out the new USENIX Web site. next up previous
Next: Multi Queue Scheduler (MQS) Up: PMQS : Scalable Linux Previous: Introduction


Default SMP Scheduler (DSS)

The default SMP scheduler (DSS) in Linux 2.4.x treats processes and threads the same way, referring to them as tasks. Each task has a corresponding data structure which maintains state related to address space, memory management, signal management, open files and privileges. Traditional threading models and light-weight processes are supported through the clone system call.

For the purpose of scheduling, time is measured in architecture-dependent units called ticks. On x86 systems, timer ticks are generated at a 10ms resolution. Each task maintains a counter (tsk->counter) which expresses the time quantum for which it can execute before it can be preempted. By decrementing this counter on timer tick interrupts, DSS implements a priority-decay mechanism for non-realtime tasks. The priority of a task is determined by a goodness() value that depends on its remaining time quantum, nice value and the affinity towards the last CPU on which it ran. DSS supports preemption of tasks only when they run in user mode. Priority preemption can occur any time the scheduler runs.

The kernel scheduler consists of two primary functions :

  1. schedule(void) : This function is called synchronously by a processor to select the next task to run e.g. at the end of sleep(), wait_for_IO() or schedule_timeout(). It is also called preemptively on the return path from an interrupt e.g. a reschedule-IPI (interprocessor interrupt) from another processor, I/O completion or system call.
  2. reschedule_idle(task_struct *tsk) : This function is called in wake_up_process() to find a suitable processor on which the parameter task can be dispatched. wake_up_process() is called when a task is first created or when it has to be re-entered into the runqueue after an I/O or sleep operation. reschedule_idle() tries to find either an idle processor or one which is running a task with a lower goodness value. If successful, it sends an IPI to the target CPU, forcing it to invoke schedule() and preempt its currently running task.

Internally, the scheduler maintains a single runqueue protected by a spinlock. The queue is unordered, which allows tasks to be inserted and deleted efficiently. However, in order to select a new task to run, the scheduler has to lock and traverse the entire runqueue, comparing the goodness value of each schedulable task. A task is considered schedulable if it is not already running and it is enabled for dispatch on the target CPU. The goodness value, determined by the goodness() function, distinguishes between three types of tasks : realtime tasks (values 1000+), regular tasks (values between 0 and 1000) and tasks which have yielded the processor (value -1). For regular tasks, the goodness value consists of a static or non-affinity part and a dynamic or affinity part. The non-affinity goodness depends on the task's counter and nice values. The affinity part accounts for the anticipated overheads of cache misses and page table switches incurred as a result of migrating tasks across CPUs. If the invoking CPU is the same as the one the task last ran on, the goodness value is boosted by an architecture dependent value called PROC_CHANGE_PENALTY. If the memory management object (tsk->mm) is the same, goodness values are boosted by 1. The counter values of all tasks are recalculated when all schedulable tasks on the runqueue have expired their time quanta. Due to space limitations, we refer the reader to detailed descriptions of DSS in  [6,1].

The implications of the scheduling algorithm for large SMP and NUMA machines is discussed in Section 4.


next up previous
Next: Multi Queue Scheduler (MQS) Up: PMQS : Scalable Linux Previous: Introduction
2001-09-18