Home About USENIX Events Membership Publications Students
USENIX '05 Paper    [USENIX '05 Technical Program]

Group Ratio Round-Robin: O(1) Proportional Share Scheduling
for Uniprocessor and Multiprocessor Systems

Bogdan Caprita, Wong Chun Chan, Jason Nieh, Clifford Stein*, and Haoqiang Zheng

Department of Computer Science

Columbia University

*also in Department of IEOR


Abstract

We present Group Ratio Round-Robin ($GR^3$), the first proportional share scheduler that combines accurate proportional fairness scheduling behavior with $O(1)$ scheduling overhead on both uniprocessor and multiprocessor systems. $GR^3$ uses a simple grouping strategy to organize clients into groups of similar processor allocations which can be more easily scheduled. Using this strategy, $GR^3$ combines the benefits of low overhead round-robin execution with a novel ratio-based scheduling algorithm. $GR^3$ introduces a novel frontlog mechanism and weight readjustment algorithm to operate effectively on multiprocessors. $GR^3$ provides fairness within a constant factor of the ideal generalized processor sharing model for client weights with a fixed upper bound and preserves its fairness properties on multiprocessor systems. We have implemented $GR^3$ in Linux and measured its performance. Our experimental results show that $GR^3$ provides much lower scheduling overhead and much better scheduling accuracy than other schedulers commonly used in research and practice.

1 Introduction

Proportional share resource management provides a flexible and useful abstraction for multiplexing processor resources among a set of clients. Proportional share scheduling has a clear colloquial meaning: given a set of clients with associated weights, a proportional share scheduler should allocate resources to each client in proportion to its respective weight. However, developing processor scheduling mechanisms that combine good proportional fairness scheduling behavior with low scheduling overhead has been difficult to achieve in practice. For many proportional share scheduling mechanisms, the time to select a client for execution grows with the number of clients. For server systems which may service large numbers of clients, the scheduling overhead of algorithms whose complexity grows linearly with the number of clients can waste more than 20 percent of system resources [3] for large numbers of clients. Furthermore, little work has been done to provide proportional share scheduling on multiprocessor systems, which are increasingly common especially in small-scale configurations with two or four processors. Over the years, a number of scheduling mechanisms have been proposed, and a significant amount of progress has been made. However, previous mechanisms have either superconstant overhead or less-than-ideal fairness properties.

We introduce Group Ratio Round-Robin ($GR^3$), the first proportional share scheduler that provides constant fairness bounds on proportional sharing accuracy with $O(1)$ scheduling overhead for both uniprocessor and small-scale multiprocessor systems. In designing $GR^3$, we observed that accurate, low-overhead proportional sharing is easy to achieve when scheduling a set of clients with equal processor allocations, but is harder to do when clients require very different allocations. Based on this observation, $GR^3$ uses a simple client grouping strategy to organize clients into groups of similar processor allocations which can be more easily scheduled. Using this grouping strategy, $GR^3$ combines the benefits of low overhead round-robin execution with a novel ratio-based scheduling algorithm.

$GR^3$ uses the same basic uniprocessor scheduling algorithm for multiprocessor scheduling by introducing the notion of a frontlog. On a multiprocessor system, a client may not be able to be scheduled to run on a processor because it is currently running on another processor. To preserve its fairness properties, $GR^3$ keeps track of a frontlog for each client to indicate when the client was already running but could have been scheduled to run on another processor. It then assigns the client a time quantum that is added to its current allocation on the processor it is currently running on. The frontlog ensures that a client receives its proportional share allocation while also taking advantage of any cache affinity by continuing to run the client on the same processor.

$GR^3$ provides a simple weight readjustment algorithm that takes advantage of its grouping strategy. On a multiprocessor system, proportional sharing is not feasible for some client weight assignments, such as having one client with weight 1 and another with weight 2 on a two-processor system. By organizing clients with similar weights into groups, $GR^3$ adjusts for infeasible weight assignments without the need to order clients, resulting in lower scheduling complexity than previous approaches [7].

We have analyzed $GR^3$ and show that with only $O(1)$ overhead, $GR^3$ provides fairness within $O(g^2)$ of the ideal Generalized Processor Sharing (GPS) model [16], where $g$, the number of groups, grows at worst logarithmically with the largest client weight. Since $g$ is in practice a small constant, $GR^3$ effectively provides constant fairness bounds with only $O(1)$ overhead. Moreover, we show that $GR^3$ uniquely preserves its worst-case time complexity and fairness properties for multiprocessor systems.

We have implemented a prototype $GR^3$ processor scheduler in Linux, and compared it against uniprocessor and multiprocessor schedulers commonly used in practice and research, including the standard Linux scheduler [2], Weighted Fair Queueing (WFQ) [11], Virtual-Time Round-Robin (VTRR) [17], and Smoothed Round-Robin (SRR) [9]. We have conducted both simulation studies and kernel measurements on micro-benchmarks and real applications. Our results show that $GR^3$ can provide more than an order of magnitude better proportional sharing accuracy than these other schedulers, in some cases with more than an order of magnitude less overhead. These results demonstrate that $GR^3$ can in practice deliver better proportional share control with lower scheduling overhead than these other approaches. Furthermore, $GR^3$ is simple to implement and easy to incorporate into existing scheduling frameworks in commodity operating systems.

This paper presents the design, analysis, and evaluation of $GR^3$. Section 2 describes the uniprocessor scheduling algorithm. Section 3 describes extensions for multiprocessor scheduling, which we refer to as $GR^3MP$. Section 4 analyzes the fairness and complexity of $GR^3$. Section 5 presents experimental results. Section 6 discusses related work. Finally, we present some concluding remarks and directions for future work.


2 $GR^3$ Uniprocessor Scheduling

Uniprocessor scheduling, the process of scheduling a time-multiplexed resource among a set of clients, has two basic steps: 1) order the clients in a queue, 2) run the first client in the queue for its time quantum, which is the maximum time interval the client is allowed to run before another scheduling decision is made. We refer to the units of time quanta as time units (tu) rather than an absolute time measure such as seconds. A scheduler can therefore achieve proportional sharing in one of two ways. One way, often called fair queueing [11,18,28,13,24,10] is to adjust the frequency that a client is selected to run by adjusting the position of the client in the queue so that it ends up at the front of the queue more or less often. However, adjusting the client's position in the queue typically requires sorting clients based on some metric of fairness, and has a time complexity that grows with the number of clients. The other way is to adjust the size of a client's time quantum so that it runs longer for a given allocation, as is done in weighted round-robin (WRR). This is fast, providing constant time complexity scheduling overhead. However, allowing a client to monopolize the resource for a long period of time results in extended periods of unfairness to other clients which receive no service during those times. The unfairness is worse with skewed weight distributions.

$GR^3$ is a proportional share scheduler that matches with $O(1)$ time complexity of round-robin scheduling but provides much better proportional fairness guarantees in practice. At a high-level, the $GR^3$ scheduling algorithm can be briefly described in three parts:

  1. Client grouping strategy: Clients are separated into groups of clients with similar weight values. The group of order $k$ is assigned all clients with weights between $2^k$ to $2^{k+1}-1$, where $k \geq 0$.

  2. Intergroup scheduling: Groups are ordered in a list from largest to smallest group weight, where the group weight of a group is the sum of the weights of all clients in the group. Groups are selected in a round-robin manner based on the ratio of their group weights. If a group has already been selected more than its proportional share of the time, move on to the next group in the list. Otherwise, skip the remaining groups in the group list and start selecting groups from the beginning of the group list again. Since the groups with larger weights are placed first in the list, this allows them to get more service than the lower-weight groups at the end of the list.

  3. Intragroup scheduling: From the selected group, a client is selected to run in a round-robin manner that accounts for its weight and previous execution history.

Using this client grouping strategy, $GR^3$ separates scheduling in a way that reduces the need to schedule entities with skewed weight distributions. The client grouping strategy limits the number of groups that need to be scheduled since the number of groups grows at worst logarithmically with the largest client weight. Even a very large 32-bit client weight would limit the number of groups to no more than 32. The client grouping strategy also ensures that all clients within a group have weight within a factor of two. As a result, the intragroup scheduler never needs to schedule clients with skewed weight distributions. $GR^3$ groups are simple lists that do not need to be balanced; they do not require any use of more complex balanced tree structures.


2.1 $GR^3$ Definitions

We now define the state $GR^3$ associates with each client and group, and describe in detail how $GR^3$ uses that state to schedule clients. Table 1 lists terminology we use. For each client, $GR^3$ maintains the following three values: weight, deficit, and run state. Each client receives a resource allocation that is directly proportional to its weight. A client's deficit tracks the number of remaining time quanta the client has not received from previous allocations. A client's run state indicates whether or not it can be executed. A client is runnable if it can be executed.

For each group, $GR^3$ maintains the following four values: group weight, group order, group work, and current client. The group weight is the sum of the corresponding weights of the clients in the group run queue. A group with group order $k$ contains the clients with weights between $2^k$ to $2^{k+1}-1$. The group work is the total execution time clients in the group have received. The current client is the most recently scheduled client in the group's run queue.


Table 1: $GR^3$ terminology
$C_j$ Client $j$. (also called 'task' $j$)
$\phi_C$ The weight assigned to client $C$.
$\phi_j$ Shorthand notation for $\phi_{C_j}$.
$D_{C}$ The deficit of $C$.
$N$ The number of runnable clients.
$g$ The number of groups.
$G_i$ $i$'th group in the list ordered by weight.
$\vert G\vert$ The number of clients in group $G$.
$G(C)$ The group to which $C$ belongs.
$\Phi_G$ The group weight of $G$: $\sum_{C\in G} \phi_C$.
$\Phi_i$ Shorthand notation for $\Phi_{G_i}$.
$\sigma_G$ The order of group $G$.
$\phi_{\min}^G$ Lower bound for client weights in $G$: $2^{\sigma_G}$.
$w_C$ The work of client $C$.
$w_j$ Shorthand notation for $w_{C_j}$.
$W_G$ The group work of group $G$.
$W_i$ Shorthand notation for $W_{G_i}$.
$\Phi_T$ Total weight: $\sum_{j=1}^{N} \phi_j = \sum_{i=1}^{g}\Phi_i$.
$W_T$ Total work: $\sum_{j=1}^N{w_{j}} =\sum_{i=1}^{g}{W_{i}}
$.
$e_C$ Service error of client $C$: $ w_C-{W_T}{{\phi_C}\over{\Phi_T}}$
$E_G$ Group service error of $G$: $ W_G-{W_T}{{\Phi_G}\over{\Phi_T}}$
$e_{C,G}$ Group-relative service error of client $C$ with
  respect to group $G$: $ w_C-{W_G}{{\phi_C}\over{\Phi_G}}$


$GR^3$ also maintains the following scheduler state: time quantum, group list, total weight, and current group. The group list is a sorted list of all groups containing runnable clients ordered from largest to smallest group weight, with ties broken by group order. The total weight is the sum of the weights of all runnable clients. The current group is the most recently selected group in the group list.


2.2 Basic $GR^3$ Algorithm

We initially only consider runnable clients in our discussion of the basic $GR^3$ scheduling algorithm. We discuss dynamic changes in a client's run state in Section 2.3. We first focus on the $GR^3$ intergroup scheduling algorithm, then discuss the $GR^3$ intragroup scheduling algorithm.

The $GR^3$ intergroup scheduling algorithm uses the ratio of the group weights of successive groups to determine which group to select. The next group to schedule is selected using only the state of successive groups in the group list. Given a group $G_i$ whose weight is $x$ times larger than the group weight of the next group $G_{i+1}$ in the group list, $GR^3$ will select group $G_i$ $x$ times for every time that it selects $G_{i+1}$ in the group list to provide proportional share allocation among groups.

To implement the algorithm, $GR^3$ maintains the total work done by group $G_i$ in a variable $W_i$. An index $i$ to tracks the current group and is initialized to $1$. The scheduling algorithm then executes the following simple routine:


\begin{codebox}
\Procname{$\procdecl{Intergroup-Schedule}()$}
\mi $C \gets \proc...
...n $i \gets i + 1$
\mi \Else $i \gets 1$
\End
\mi \Return $C$
\End
\end{codebox}

Let us negate (1) under the form:

\begin{displaymath}
{{W_i + 1}\over{\Phi_i}} \leq { {W_{i+1} + 1}\over{\Phi_{i+1}} }
\end{displaymath} (2)

We will call this relation the well-ordering condition of two consecutive groups. $GR^3$ works to maintain this condition true at all times. The intuition behind (2) is that we would like the ratio of the work of $G_i$ and $G_{i+1}$ to match the ratio of their respective group weights after $GR^3$ has finished selecting both groups. Recall, $\Phi_i \geq \Phi_{i+1}$. For each time a client from $G_{i+1}$ is run, $GR^3$ would like to have run $\Phi_i / {\Phi_{i+1}}$ worth of clients from $G_i$. (1) says that $GR^3$ should not run a client from $G_i$ and increment $G_i$'s group work if it will make it impossible for $G_{i+1}$ to catch up to its proportional share allocation by running one of its clients once.

To illustrate how intergroup scheduling works, Figure 1 shows an example with three clients $C_1$, $C_2$, and $C_3$, which have weights of 5, 2, and 1, respectively. The $GR^3$ grouping strategy would place each $C_i$ in group $G_i$, ordering the groups by weight: $G_1$, $G_2$, and $G_3$ have orders 2, 1 and 0 and weights of 5, 2, and 1 respectively. In this example, each group has only one client so there is no intragroup scheduling. $GR^3$ would start by selecting group $G_1$, running client $C_1$, and incrementing $W_1$. Based on (1), ${{W_1+1}\over{W_2+1}} = 2 < {{\Phi_1}\over{\Phi_2}} = 2.5$, so $GR^3$ would select $G_1$ again and run client $C_1$. After running $C_1$, $G_1$'s work would be 2 so that the inequality in (1) would hold and $GR^3$ would then move on to the next group $G_2$ and run client $C_2$. Based on (1), ${{W_2+1}\over{W_3+1}} = 2 \leq {{\Phi_2}\over{\Phi_3}} = 2$, so $GR^3$ would reset the current group to the largest weight group $G_1$ and run client $C_1$. Based on (1), $C_1$ would be run for three time quanta before selecting $G_2$ again to run client $C_2$. After running $C_2$ the second time, $W_2$ would increase such that ${{W_2+1}\over{W_3+1}} = 3 > {{\Phi_2}\over{\Phi_3}} = 2$, so $GR^3$ would then move on to the last group $G_3$ and run client $C_3$. The resulting schedule would then be: $G_1$, $G_1$, $G_2$, $G_1$, $G_1$, $G_1$, $G_2$, $G_3$. Each group therefore receives its proportional allocation in accordance with its respective group weight.

Figure 1: $GR^3$ intergroup scheduling. At each time step, the shaded box contains the pair $\Phi _G\ \vert\ W_G + 1$ for the group $G$ before it is selected.
\scalebox{0.85}{\includegraphics{gr3_intergroup.eps}}

The $GR^3$ intragroup scheduling algorithm selects a client from the selected group. All clients within a group have weights within a factor of two, and all client weights in a group $G$ are normalized with respect to the minimum possible weight, $\phi_{\min}^{G} =2^{\sigma_G}$, for any client in the group. $GR^3$ then effectively traverses through a group's queue in round-robin order, allocating each client its normalized weight worth of time quanta. $GR^3$ keeps track of subunitary fractional time quanta that cannot be used and accumulates them in a deficit value for each client. Hence, each client is assigned either one or two time quanta, based on the client's normalized weight and its previous allocation.

More specifically, the $GR^3$ intragroup scheduler considers the scheduling of clients in rounds. A round is one pass through a group $G$'s run queue of clients from beginning to end. The group run queue does not need to be sorted in any manner. During each round, the $GR^3$ intragroup algorithm considers the clients in round-robin order and executes the following simple routine:


\begin{codebox}
\Procname{$\procdecl{Intragroup-Schedule}(G)$}
\mi $C \gets G[k]...
...min}^{G}}$
\End
\mi $D_{C} \gets D_{C} - 1$
\mi \Return $C$
\End
\end{codebox}

For each runnable client $C$, the scheduler determines the maximum number of time quanta that the client can be selected to run in this round as $\lfloor{\phi_C\over\phi_{\min}^G} + D_{C}(r-1)\rfloor$. $D_{C}(r)$, the deficit of client $C$ after round $r$, is the time quantum fraction left over after round $r$: $D_{C}(r) = {\phi_C\over\phi_{\min}^G} + D_{C}(r-1) - \lfloor{\phi_C\over\phi_{\min}^G} + D_{C}(r-1)\rfloor$, with $D_{C}(0)={\phi_C\over\phi_{\min}^G}$. Thus, in each round, $C$ is allotted one time quantum plus any additional leftover from the previous round, and $D_{C}(r)$ keeps track of the amount of service that $C$ missed because of rounding down its allocation to whole time quanta. We observe that $0 \leq D_{C}(r) < 1$ after any round $r$ so that any client $C$ will be allotted one or two time quanta. Note that if a client is allotted two time quanta, it first executes for one time quantum and then executes for the second time quantum the next time the intergroup scheduler selects its respective group again (in general, following a timespan when clients belonging to other groups get to run).

To illustrate how $GR^3$ works with intragroup scheduling, Figure 2 shows an example with six clients $C_1$ through $C_6$ with weights 12, 3, 3, 2, 2, and 2, respectively. The six clients will be put in two groups $G_1$ and $G_2$ with respective group order 1 and 3 as follows: $G_1=\{C_2, C_3, C_4, C_5, C_6\}$ and $G_2=\{C_1\}$. The weight of the groups are $\Phi_1=\Phi_2=12$. $GR^3$ intergroup scheduling will consider the groups in this order: $G_1$, $G_2$, $G_1$, $G_2$, $G_1$, $G_2$, $G_1$, $G_2$, $G_1$, $G_2$, $G_1$, $G_2$. $G_2$ will schedule client $C_1$ every time $G_2$ is considered for service since it has only one client. Since $\phi_{\min}^{G_1} = 2$, the normalized weights of clients $C_2$, $C_3$, $C_4$, $C_5$, and $C_6$ are 1.5, 1.5, 1, 1, and 1, respectively. In the beginning of round 1 in $G_1$, each client starts with 0 deficit. As a result, the intragroup scheduler will run each client in $G_1$ for one time quantum during round 1. After the first round, the deficit for $C_2$, $C_3$, $C_4$, $C_5$, and $C_6$ are 0.5, 0.5, 0, 0, and 0. In the beginning of round 2, each client gets another ${\phi_i / \phi_{\min}^{G_1}}$ allocation, plus any deficit from the first round. As a result, the intragroup scheduler will select clients $C_2$, $C_3$, $C_4$, $C_5$, and $C_6$ to run in order for 2, 2, 1, 1, and 1 time quanta, respectively, during round 2. The resulting schedule would then be: $C_2$, $C_1$, $C_3$, $C_1$, $C_4$, $C_1$, $C_5$, $C_1$, $C_6$, $C_1$, $C_2$, $C_1$, $C_2$, $C_1$, $C_3$, $C_1$, $C_3$, $C_1$, $C_4$, $C_1$, $C_5$, $C_1$, $C_6$, $C_1$.

Figure 2: $GR^3$ intragroup scheduling. At each time step, the shaded box contains the deficit of the client before it is run.
\scalebox{0.85}{\includegraphics{gr3_intragroup.eps}}


2.3 $GR^3$ Dynamic Considerations

In the previous section, we presented the basic $GR^3$ scheduling algorithm, but we did not discuss how $GR^3$ deals with dynamic considerations that are a necessary part of any on-line scheduling algorithm. We now discuss how $GR^3$ allows clients to be dynamically created, terminated, or change run state.

Runnable clients can be selected for execution by the scheduler, while clients that are not runnable cannot. With no loss of generality, we assume that a client is created before it can become runnable, and a client becomes not runnable before it is terminated. As a result, client creation and termination have no effect on the $GR^3$ run queues.

When a client $C$ with weight $\phi_C$ becomes runnable, it is inserted into group $G = G(C)$ such that $\phi_C$ is between $2^{\sigma_{G}}$ and $2^{{\sigma_{G}}+1}-1$. If the group was previously empty, a new group is created, the client becomes the current client of the group, and $g$, the number of groups, is incremented. If the group was not previously empty, $GR^3$ inserts the client into the respective group's run queue right before the current client; it will be serviced after all of the other clients in the group have first been considered for scheduling. The initial deficit $D_{C}$ will be initialized to 0.

When a newly runnable client $C$ is inserted into its respective group $G$, the group needs to be moved to its new position on the ordered group list based on its new group weight. Let this new position be $k$. The corresponding group work and group weight of $G$ need to be updated and the client's deficit needs to be initialized. The group weight is simply incremented by the client's weight. We also want to scale the group work of $G$ such that the work ratio of consecutive groups will continue to be proportional to their weight ratio:


\begin{displaymath}
W_{G} = \left\{
\begin{array}{ll}
\left\lfloor (W_{k+1}+...
... \right\rceil - 1& \mbox{ if\ \ $k=g$} \\
\end{array} \right.
\end{displaymath}

We will motivate these equations when analyzing the fairness of the algorithm in Section 4, but intuitively, we want to preserve the invariants that result from (2).

When a client $C$ with weight $\phi_C$ becomes not runnable, we need to remove it from the group's run queue. This requires updating the group's weight, which potentially includes moving the group in the ordered group list, as well as adjusting the measure of work received according to the new processor share of the group. This can be achieved in several ways. $GR^3$ is optimized to efficiently deal with the common situation when a blocked client may rapidly switch back to the runnable state again. This approach is based on ``lazy'' removal, which minimizes overhead associated with adding and removing a client, while at the same time preserving the service rights and service order of the runnable clients. Since a client blocks when it is running, we know that it will take another full intragroup round before the client will be considered again. The only action when a client blocks is to set a flag on the client, marking it for removal. If the client becomes runnable by the next time it is selected, we reset the flag and run the client as usual. Otherwise, we remove the client from $G(C)$. In the latter situation, as in the case of client arrivals, the group may need to be moved to a new position on the ordered group list based on its new group weight. The corresponding group weight is updated by subtracting the client's weight from the group weight. The corresponding group work is scaled by the same rules as for client insertion, depending on the new position of the group and its next neighbor. After performing these removal operations, $GR^3$ resumes scheduling from the largest weight group in the system.

Whenever a client $C$ blocks during round $r$, we set $D_{C}(r) =
\min(D_{C}(r-1)+{\phi_C/\phi_{\min}^{G(C)}} - \lceil w \rceil,1)$, where $w$ is the service that the client received during round $r$ until it blocked. This preserves the client's credit in case it returns by the next round, while also limiting the deficit to $1$ so that a client cannot gain credit by blocking. However, the group consumes $1$ tu (its work is incremented) no matter how long the client runs. Therefore, the client forfeits its extra credit whenever it is unable to consume its allocation.

If the client fails to return by the next round, we may remove it. Having kept the weight of the group to the old value for an extra round has no adverse effects on fairness, despite the slight increase in service seen by the group during the last round. By scaling the work of the group and rounding up, we determine its future allocation and thus make sure the group will not have received undue service. We also immediately resume the scheduler from the first (largest) group in the readjusted group list, so that any minor discrepancies caused by rounding may be smoothed out by a first pass through the group list.


3 $GR^3$ Multiprocessor Extensions ($GR^3MP$)

We now present extensions to $GR^3$ for scheduling a $P$-way multiprocessor system from a single, centralized queue. This simple scheme, which we refer to as $GR^3MP$, preserves the good fairness and time complexity properties of $GR^3$ in small-scale multiprocessor systems, which are increasingly common today, even in the form of multi-core processors. We first describe the basic $GR^3MP$ scheduling algorithm, then discuss dynamic considerations. Table  2 lists terminology we use. To deal with the problem of infeasible client weights, we then show how $GR^3MP$ uses its grouping strategy in a novel weight readjustment algorithm.


Table 2: $GR^3MP$ terminology
$P$ Number of processors.
$\wp^k$ Processor $k$.
$C(\wp)$ Client running on processor $\wp$.
$F_{C}$ Frontlog for client $C$.


3.1 Basic $GR^3MP$ Algorithm

$GR^3MP$ uses the same $GR^3$ data structure, namely an ordered list of groups, each containing clients whose weights are within a factor of two from each other. When a processor needs to be scheduled, $GR^3MP$ selects the client that would run next under $GR^3$, essentially scheduling multiple processors from its central run queue as $GR^3$ schedules a single processor. However, there is one obstacle to simply applying a uniprocessor algorithm on a multiprocessor system. Each client can only run on one processor at any given time. As a result, $GR^3MP$ cannot select a client to run that is already running on another processor even if $GR^3$ would schedule that client in the uniprocessor case. For example, if $GR^3$ would schedule the same client consecutively, $GR^3MP$ cannot schedule that client consecutively on another processor if it is still running.

To handle this situation while maintaining fairness, $GR^3MP$ introduces the notion of a frontlog. The frontlog $F_C$ for some client $C$ running on a processor $\wp^k$ ($C = C(\wp^k)$) is defined as the number of time quanta for $C$ accumulated as $C$ gets selected by $GR^3$ and cannot run because it is already running on $\wp^k$. The frontlog $F_C$ is then queued up on $\wp^k$.

Given a client that would be scheduled by $GR^3$ but is already running on another processor, $GR^3MP$ uses the frontlog to assign the client a time quantum now but defer the client's use of it until later. Whenever a processor finishes running a client for a time quantum, $GR^3MP$ checks whether the client has a non-zero frontlog, and, if so, continues running the client for another time quantum and decrements its frontlog by one, without consulting the central queue. The frontlog mechanism not only ensures that a client receives its proportional share allocation, it also takes advantage of any cache affinity by continuing to run the client on the same processor.

When a processor finishes running a client for a time quantum and its frontlog is zero, we call the processor idle. $GR^3MP$ schedules a client to run on the idle processor by performing a $GR^3$ scheduling decision on the central queue. If the selected client is already running on some other processor, we increase its frontlog and repeat the $GR^3$ scheduling, each time incrementing the frontlog of the selected client, until we find a client that is not currently running. We assign this client to the idle processor for one time quantum. This description assumes that there are least $P+1$ clients in the system. Otherwise, scheduling is easy: an idle processor will either run the client it just ran, or idles until more clients arrive. In effect, each client will simply be assigned its own processor. Whenever a processor needs to perform a scheduling decision, it thus executes the following routine:


\begin{codebox}
\Procname{$\procdecl{MP-Schedule}(\wp^k)$}
\mi $C \gets C(\wp^k)...
...C \gets \proc{Intergroup-Schedule}()$
\End
\mi \Return $C$
\End
\end{codebox}

To illustrate $GR^3MP$ scheduling, Figure 3 shows an example on a dual-processor system with three clients $C_1$, $C_2$, and $C_3$ of weights 3, 2, and 1, respectively. $C_1$ and $C_2$ will then be part of the order 1 group (assume $C_2$ is before $C_1$ in the round-robin queue of this group), whereas $C_3$ is part of the order 0 group. The $GR^3$ schedule is $C_2$, $C_1$, $C_2$, $C_1$, $C_1$, $C_3$. $\wp^1$ will then select $C_2$ to run, and $\wp^2$ selects $C_1$. When $\wp^1$ finishes, according to $GR^3$, it will select $C_2$ once more, whereas $\wp^2$ selects $C_1$ again. When $\wp^1$ again selects the next $GR^3$ client, which is $C_1$, it finds that it is already running on $\wp^2$ and thus we set $F_{C_1} = 1$ and select the next client, which is $C_3$, to run on $\wp^1$. When $\wp^2$ finishes running $C_1$ for its second time quantum, it finds $F_{C_1} = 1$, sets $F_{C_1} = 0$ and continues running $C_1$ without any scheduling decision on the $GR^3$ queue.

Figure 3: $GR^3$ multiprocessor scheduling. The two processors schedule either from the central queue, or use the frontlog mechanism when the task is already running.
\scalebox{0.85}{\includegraphics{gr3_frontlog.eps}}

3.2 $GR^3MP$ Dynamic Considerations

$GR^3MP$ basically does the same thing as the $GR^3$ algorithm under dynamic considerations. However, the frontlogs used in $GR^3MP$ need to be accounted for appropriately. If some processors have long frontlogs for their currently running clients, newly arriving clients may not be run by those processors until their frontlogs are processed, resulting in bad responsiveness for the new clients. Although in between any two client arrivals or departures, some processors must have no frontlog, the set of such processors can be as small as a single processor. In this case, newly arrived clients will end up competing with other clients already in the run queue only for those few processors, until the frontlog on the other processors is exhausted.

$GR^3MP$ provides fair and responsive allocations by creating frontlogs for newly arriving clients. Each new client is assigned a frontlog equal to a fraction of the total current frontlog in the system based on its proportional share. Each processor now maintains a queue of frontlog clients and a new client with a frontlog is immediately assigned to one of the processor frontlog queues. Rather than running its currently running client until it completes its frontlog, each processor now round robins among clients in its frontlog queue. Given that frontlogs are small in practice, round-robin scheduling is used for frontlog clients for its simplicity and fairness. $GR^3MP$ balances the frontlog load on the processors by placing new frontlog clients on the processor with the smallest frontlog summed across all its frontlog clients.

More precisely, whenever a client $C$ arrives, and it belongs in group $G(C)$, $GR^3MP$ performs the same group operations as in the single processor $GR^3$ algorithm. $GR^3MP$ finds the processor $\wp^k$ with the smallest frontlog, then creates a frontlog for client $C$ on $\wp^k$ of length $F_C = F_T {\phi_C \over \Phi_T }$, where $F_T$ is the total frontlog on all the processors. Let $C' =
C(\wp^k)$. Then, assuming no further clients arrive, $\wp^k$ will round-robin between $C$ and $C'$ and run $C$ for $F_C$ and $C'$ for $F_{C'}$ time quanta.

When a client becomes not runnable, $GR^3MP$ uses the same lazy removal mechanism used in $GR^3$. If it is removed from the run queue and has a frontlog, $GR^3MP$ simply discards it since each client is assigned a frontlog based on the current state of the system when it becomes runnable again.


3.3 $GR^3MP$ Weight Readjustment

Since no client can run on more than one processor at a time, no client can consume more than a $1/P$ fraction of the processing in a multiprocessor system. A client $C$ with weight $\phi_C$ greater than $\Phi_T / P$ is considered infeasible since it cannot receive its proportional share allocation $\phi_C / \Phi_T$ without using more than one processor simultaneously. $GR^3MP$ should then give the client its maximum possible service, and simply assign such a client its own processor to run on. However, since the scheduler uses client weights to determine which client to run, an infeasible client's weight must be adjusted so that it is feasible to ensure that the scheduling algorithm runs correctly to preserve fairness (assuming there are at least $P$ clients). $GR^3MP$ potentially needs to perform weight readjustment whenever a client is inserted or removed from the run queue to make sure that all weights are feasible.

To understand the problem of weight readjustment, consider the sequence of all clients, ordered by weight: $S_{1, N} = C_1, C_2, \ldots,
C_N$ with $\phi_1 \geq \phi_2 \geq \ldots \geq \phi_N$. We call the subsequence $S_{k,N} = C_k, C_{k+1}, \ldots,.C_N$ $Q$-feasible, if $\phi_k \leq \frac{1}{Q}\sum_{j=k}^N{\phi_j}$.

Lemma 1   The client mix in the system is feasible if and only if $S_{1,N}$ is $P$-feasible.


\begin{proof}
If $\phi_1 > \frac{\Phi_T}{P}$, $C_1$\ is infeasible, so the mix i...
...\sum_{j=1}^{N}\phi_j$, or, equivalently, $S_{1,N}$\ is $P$-feasible.
\end{proof}

Lemma 2   $S_{k,N}$ is $Q$-feasible $\Rightarrow$ $S_{k+1, N}$ is $(Q-1)$-feasible.


\begin{proof}
% latex2html id marker 312$\phi_k \leq \frac{1}{Q}{\sum_{j=k}^N{...
...k+1}^N{\phi_j}}$. Since $\phi_{k+1} \leq \phi_k$, the lemma follows.
\end{proof}

The feasibility problem is then to identify the least $k$ (denoted the feasibility threshold, $f$) such that $S_{k,N}$ is $(P-k +
1)$-feasible. If $f=1$, then the client mix is feasible. Otherwise, the infeasible set $S_{1,f-1} = C_1,\ldots, C_{f-1}$ contains the infeasible clients, whose weight needs to be scaled down to $1/P$ of the resulting total weight. The cardinality $f-1$ of the infeasible set is less than $P$. However, the sorted sequence $S_{1,N}$ is expensive to maintain, such that traversing it and identifying the feasibility threshold is not an efficient solution.

$GR^3MP$ leverages its grouping strategy to perform fast weight readjustment. $GR^3MP$ starts with the unmodified client weights, finds the set $I$ of infeasible clients, and adjust their weights to be feasible. To construct $I$, the algorithm traverses the list of groups in decreasing order of their group order $\sigma_G$, until it finds a group not all of whose clients are infeasible. We denote by $\vert I\vert$ the cardinality of $I$ and by $\Phi_I$ the sum of weights of the clients in $I$, $\sum_{C \in I}{\phi_{C}}$. The $GR^3MP$ weight readjustment algorithm is as follows:
\begin{codebox}
\Procname{$\procdecl{Weight-Readjustment}()$}
\mi \proc{Restore-...
... \For each $C \in I$
\mi \Do $\phi_C \gets \frac{\Phi_T}{P}$
\End
\end{codebox}

The correctness of the algorithm is based on Lemma 2. Let some group $G$ span the subsequence $S_{i,j}$ of the sequence of ordered clients $S_{1,N}$. Then $2^{\sigma_G + 1} -1 \geq \phi_i \geq \ldots \geq \phi_j \geq 2^{\sigma_G}$ and it is easy to show:

  • $2^{\sigma_G} > \frac{\Phi_T - \Phi_I - \Phi_G}{P - \vert I\vert - \vert G\vert} \Rightarrow$ $j < f$ (all clients in $S_{1,j}$ are infeasible).
  • $2^{\sigma_G} \leq \frac{\Phi_T - \Phi_I - \Phi_G}{P - \vert I\vert - \vert G\vert} \Rightarrow$ $j+1 \geq f$ (all clients in $S_{j+1, N}$ are feasible).
Once we reach line 7, we know $S_{j+1, N}$ is $(P-j)$-feasible, and $i \leq f \leq j+1$. If $\vert G\vert \geq 2(P-\vert I\vert)$, $GR^3MP$ can stop searching for infeasible clients since all clients $C \in G$ are feasible, and $f = i$ (equivalently, $S_{i, N}$ is $(P-\vert I\vert)$-feasible): $\phi_C
< 2^{\sigma_{G}+1} \leq 2{1\over \vert G\vert}\Phi_G \leq {1 \over P - \vert I\vert}\Phi_G
\leq {1\over P-\vert I\vert}(\Phi_T - \Phi_I)$. Otherwise, if $\vert G\vert < 2(P-\vert I\vert)$, then $i < f \leq j+1$ and $GR^3MP$ needs to search through $G$ to determine which clients are infeasible (equivalently, find $f$). Since the number of clients in $G$ is small, we can sort all clients in $G$ by weight. Then, starting from the largest weight client in $G$, find the first feasible client. A simple algorithm is then the following:

\begin{codebox}
\Procname{$\procdecl{Infeasible}(G, Q, \Phi)$}
\mi $I \gets \emp...
...ts I \cup \{C\}$
\mi \Else \Return $I$
\End
\End
\mi \Return $I$
\end{codebox}

$GR^3MP$ can alternatively use a more complicated but lower time complexity divide-and-conquer algorithm to find the infeasible clients in $G$. In this case, $GR^3MP$ partitions $G$ around its median $\bar{C}$ into $G_S$, the set of $G$ clients that have weight less than $\phi_{\bar{C}}$ and $G_B$, the set of $G$ clients that have weight larger than $\phi_{\bar{C}}$. By Lemma 2, if $\bar{C}$ is feasible, $G_S \cup \{\bar{C}\}$ is feasible, and we recurse on $G_B$. Otherwise, all clients in $G_B \cup \{\bar{C}\}$ are infeasible, and we recurse on $G_S$ to find all infeasible clients. The algorithm finishes when the set we need to recurse on is empty:


\begin{codebox}
\Procname{$\procdecl{Infeasible}(G, Q, \Phi)$}
\mi \If $G = \emp...
...ar{C}})$
\mi \Else \Return $\proc{Infeasible}(G_B, Q, \Phi)$
\End
\end{codebox}

Once all infeasible clients have been identified, $\proc{Weight-Readjustment}()$ determines the sum of the weights of all feasible clients, $\Phi_T^f = \Phi_T -
\Phi_I$. We can now compute the new total weight in the system as $\Phi_T = {P \over P-\vert I\vert}\Phi_T^f$, namely the solution to the equation $\Phi_T^f + \vert I\vert{x\over P} = x$. Once we have the adjusted $\Phi_T$, we change all the weights for the infeasible clients in $I$ to $\Phi_T \over P$. Lemma 6 in Section 4.2 shows the readjustment algorithm runs in time $O(P)$ and is thus asymptotically optimal, since there can be $\Theta(P)$ infeasible clients.


4 $GR^3$ Fairness and Complexity

We analyze the fairness and complexity of $GR^3$ and $GR^3MP$. To analyze fairness, we use a more formal notion of proportional fairness defined as service error, a measure widely used  [1,7,9,17,18,19,25,27] in the analysis of scheduling algorithms. To simplify the analysis, we will assume that clients are always runnable and derive fairness bounds for such a case. Subsequently, we address the impact of arrivals and departures.

We use a strict measure of service error (equivalent in this context to the Normalized Worst-case Fair Index [1]) relative to Generalized Processor Sharing (GPS) [16], an idealized model that achieves perfect fairness: $w_C = W_T{\phi_C\over \Phi_T}$, an ideal state in which each client $C$ always receives service exactly proportional to its weight. Although all real-world schedulers must time-multiplex resources in time units of finite size and thus cannot maintain perfect fairness, some algorithms stay closer to perfect fairness than others and therefore have less service error. We quantify how close an algorithm gets to perfect fairness using the client service time error, which is the difference between the service received by client $C$ and its share of the total work done by the processor: $e_C = w_C-{W_T}{{\phi_C}\over{\Phi_T}}$. A positive service time error indicates that a client has received more than its ideal share over a time interval; a negative error indicates that it has received less. To be precise, the error $e_C$ measures how much time a client $C$ has received beyond its ideal allocation. A proportional share scheduler should minimize the absolute value of the allocation error of all clients with minimal scheduling overhead.

We provide bounds on the service error of $GR^3$ and $GR^3MP$. To do this, we define two other measures of service error. The group service time error is a similar measure for groups that quantifies the fairness of allocating the processor among groups: $E_G = W_G-{W_T}{{\Phi_G}\over{\Phi_T}}$. The group-relative service time error represents the service time error of client $C$ if there were only a single group $G = G(C)$ in the scheduler and is a measure of the service error of a client with respect to the work done on behalf of its group: $e_{C,G} =
w_C-{W_G}{{\phi_C}\over{\Phi_G}}$. We first show bounds on the group service error of the intergroup scheduling algorithm. We then show bounds on the group-relative service error of the intragroup scheduling algorithm. We combine these results to obtain the overall client service error bounds. We also discuss the scheduling overhead of $GR^3$ and $GR^3MP$ in terms of their time complexity. We show that both algorithms can make scheduling decisions in $O(1)$ time with $O(1)$ service error given a constant number of groups. Due to space constraints, most of the proofs are omitted. Further proof details are available in [5].

4.1 Analysis of $GR^3$


4.1.0.1 Intergroup Fairness

For the case when the weight ratios of consecutive groups in the group list are integers, we get the following:

Lemma 3   If ${\Phi_j\over{\Phi_{j+1}}} \in \mathbb{N},\ 1\leq j<g$, then $-1 < E_{G_k} \leq (g-k){\Phi_k\over{\Phi_T}}$ for any group $G_k$.

Proof sketch: If the group currently scheduled is $G_k$, then the work to weight ratio of all groups $G_j$, $j<k$, is the same. For $j>k$, ${{W_{j+1}}\over{\Phi_{j+1}}} \leq {{ W_j \over{\Phi_j}}} \leq { {W_{j+1} + 1}\over{\Phi_{j+1}}}-{1\over {\Phi_j}}$ as a consequence of the well-ordering condition (2). After some rearrangements, we can sum over all $j$ and bound $W_k$, and thus $E_{G_k}$ above and below. The additive ${1\over {\Phi_j}}$ will cause the $g-1$ upper bound.

In the general case, we get similar, but slightly weaker bounds.

Lemma 4   For any group $G_k$, $-{{(g-k)(g-k-1)}\over 2}{\Phi_k \over \Phi_T}-1 < E_{G_k} < g-1$.

The proof for this case (omitted) follows reasoning similar to that of the previous lemma, but with several additional complications.

It is clear that the lower bound is minimized when setting $k=1$. Thus, we have

Corollary 1   $-{(g-1)(g-2)\over{2}}{\Phi_G \over \Phi_T} - 1 < E_G < g-1$ for any group $G$.


4.1.0.2 Intragroup Fairness

Within a group, all weights are within a factor of two and the group-relative error is bound by a small constant. The only slightly subtle point is to deal with fractional rounds.

Lemma 5   $-3 < e_{C,G} < 4$ for any client $C \in G$.


4.1.0.3 Overall Fairness of $GR^3$

Based on the identity $e_C = e_{C,G} + {\phi_C\over{\Phi_G}}E_G$ which holds for any group $G$ and any client $C \in G$, we can combine the inter- and intragroup analyses to bound the overall fairness of $GR^3$.

Theorem 1   $-{(g-1)(g-2)\over{2}}{\phi_C\over \Phi_T} - 4 < e_C < g+3$ for any client $C$.

The negative error of $GR^3$ is thus bounded by $O(g^2)$ and the positive error by $O(g)$. Recall, $g$, the number of groups, does not depend on the number of clients in the system.


4.1.0.4 Dynamic Fairness of $GR^3$

We can consider a client arrival or removal as an operation where a group is first removed from the group list and added in a different place with a different weight. We argue that fairness is preserved by these operations: when group $G_k$ is removed, then $G_{k-1}$, $G_k$, and $G_{k+1}$ were well-ordered as defined in (2), so after the removal, $G_{k-1}$ and $G_{k+1}$, now neighbors, will be well-ordered by transitivity. When a group, call it $G_{i+(1/2)}$, is inserted between $G_i$ and $G_{i+1}$, it can be proven that the work readjustment formula in Section 2.3 ensures $G_{i+(1/2)}$ and $G_{i+1}$ are well-ordered. In the case of $G_{i}$ and $G_{i+(1/2)}$, we can show that we can achieve well-ordering by running $G_{i+(1/2)}$ at most one extra time. Thus, modulo this readjustment, the intragroup algorithm's fairness bounds are preserved. An important property of our algorithm that follows is that the pairwise ratios of work of clients not part of the readjusted group will be unaffected. Since the intragroup algorithm has constant fairness bounds, the disruption for the work received by clients inside the adjusted group is only $O(1)$.


4.1.0.5 Time Complexity

$GR^3$ manages to bound its service error by $O(g^2)$ while maintaining a strict $O(1)$ scheduling overhead. The intergroup scheduler either selects the next group in the list, or reverts to the first one, which takes constant time. The intragroup scheduler is even simpler, as it just picks the next client to run from the unordered round robin list of the group. Adding and removing a client is worst-case $O(g)$ when a group needs to be relocated in the ordered list of groups. This could of course be done in $O(\log g)$ time (using binary search, for example), but the small value of $g$ in practice does not justify a more complicated algorithm.

The space complexity of $GR^3$ is $O(g) + O(N) = O(N)$. The only additional data structure beyond the unordered lists of clients is an ordered list of length $g$ to organize the groups.


4.2 Analysis of $GR^3MP$

4.2.0.1 Overall Fairness of $GR^3MP$

Given feasible client weights after weight readjustment, the service error for $GR^3MP$ is bounded below by the $GR^3$ error, and above by a bound which improves with more processors.

Theorem 2   $-{(g-1)(g-2)\over{2}}{\phi_C \over \Phi_T} - 4 < e_C < 2g+10 +{(g-1)(g-2)\over 2P}$ for any client $C$.

4.2.0.2 Time Complexity of $GR^3MP$

The frontlogs create an additional complication when analyzing the time complexity of $GR^3MP$. When an idle processor looks for its next client, it runs the simple $O(1)$ $GR^3$ algorithm to find a client $C$. If $C$ is not running on any other processor, we are done, but otherwise we place it on the frontlog and then we must rerun the $GR^3$ algorithm until we find a client that is not running on any other processor. Since for each such client, we increase its allocation on the processor it runs, the amortized time complexity remains $O(1)$. The upper bound on the time that any single scheduling decision takes is given by the maximum length of any scheduling sequence of $GR^3$ consisting of only some fixed subset of $P-1$ clients.

Theorem 3   The time complexity per scheduling decision in $GR^3MP$ is bounded above by ${{(g-k)(g-k+1)}\over 2} +(k+1){(g-k+1)P} +1$ where $1\leq k \leq g$.

Thus, the length of any schedule consisting of at most $P-1$ clients is $O(g^2P)$. Even when a processor has frontlogs for several clients queued up on it, it will schedule in $O(1)$ time, since it performs round-robin among the frontlogged clients. Client arrivals and departures take $O(g)$ time because of the need to readjust group weights in the saved list of groups. Moreover, if we also need to use the weight readjustment algorithm, we incur an additional $O(P)$ overhead on client arrivals and departures.

Lemma 6   The complexity of the weight readjustment algorithm is $O(P)$.


\begin{proof}
% latex2html id marker 506Restoring the original weights will wo...
...ecurse on a
subset half the size, this operation is $O(P)$\ as well.
\end{proof}

For small $P$, the $O(P \log(P))$ sorting approach to determine infeasible clients in the last group considered is simpler and in practice performs better than the $O(P)$ recursive partitioning. Finally, altering the active group structure to reflect the new weights is a $O(P+g)$ operation, as two groups may need to be re-inserted in the ordered list of groups.


5 Measurements and Results

We have implemented $GR^3$ uniprocessor and multiprocessor schedulers in the Linux operating system and measured their performance. We present some experimental data quantitatively comparing $GR^3$ performance against other popular scheduling approaches from both industrial practice and research. We have conducted both extensive simulation studies and detailed measurements of real kernel scheduler performance on real applications.

Section 5.1 presents simulation results comparing the proportional sharing accuracy of $GR^3$ and $GR^3MP$ against WRR, WFQ [18], SFQ [13], VTRR [17], and SRR [9]. The simulator enabled us to isolate the impact of the scheduling algorithms themselves and examine the scheduling behavior of these different algorithms across hundreds of thousands of different combinations of clients with different weight values.

Section 5.2 presents detailed measurements of real kernel scheduler performance by comparing our prototype $GR^3$ Linux implementation against the standard Linux scheduler, a WFQ scheduler, and a VTRR scheduler. The experiments we have done quantify the scheduling overhead and proportional share allocation accuracy of these schedulers in a real operating system environment under a number of different workloads.

All our kernel scheduler measurements were performed on an IBM Netfinity 4500 system with one or two 933 MHz Intel Pentium III CPUs, 512 MB RAM, and 9 GB hard drive. The system was installed with the Debian GNU/Linux distribution version 3.0 and all schedulers were implemented using Linux kernel version 2.4.19. The measurements were done by using a minimally intrusive tracing facility that writes timestamped event identifiers into a memory log and takes advantage of the high-resolution clock cycle counter available with the Intel CPU, providing measurement resolution at the granularity of a few nanoseconds. Getting a timestamp simply involved reading the hardware cycle counter register. We measured the timestamp overhead to be roughly 35 ns per event.

The kernel scheduler measurements were performed on a fully functional system. All experiments were performed with all system functions running and the system connected to the network. At the same time, an effort was made to eliminate variations in the test environment to make the experiments repeatable.


5.1 Simulation Studies

Figure 4: WRR error
\scalebox{0.38}{\includegraphics{sim0.1/wrr-0.1.eps}}
Figure 5: WFQ error
\scalebox{0.38}{\includegraphics{sim0.1/wfq-0.1.eps}}
Figure 6: SFQ error
\scalebox{0.38}{\includegraphics{sim0.1/sfq-0.1.eps}}
Figure 7: VTRR error
\scalebox{0.38}{\includegraphics{sim0.1/vtrr-0.1.eps}}
Figure 8: SRR error
\scalebox{0.38}{\includegraphics{sim0.1/srr-0.1.eps}}
Figure 9: GR$^3$ error
\scalebox{0.38}{\includegraphics{sim0.1/grrr-0.8-uniproc-lim.eps}}
Figure 10: $GR^3MP$ error
\scalebox{0.38}{\includegraphics{sim0.1/grrr-0.8-8-lim.eps}}
Figure 11: $GR^3MP$ overhead
\scalebox{0.38}{\includegraphics{sim0.1/grrr-0.8-8-oh-lim.eps}}

We built a scheduling simulator that measures the service time error, described in Section 4, of a scheduler on a set of clients. The simulator takes four inputs, the scheduling algorithm, the number of clients $N$, the total sum of weights $\Phi_T$, and the number of client-weight combinations. The simulator randomly assigns weights to clients and scales the weights to ensure that they add up to $\Phi_T$. It then schedules the clients using the specified algorithm as a real scheduler would, assuming no client blocks, and tracks the resulting service time error. The simulator runs the scheduler until the resulting schedule repeats, then computes the maximum (most positive) and minimum (most negative) service time error across the nonrepeating portion of the schedule for the given set of clients and weight assignments. This process is repeated for the specified number of client-weight combinations. We then compute the maximum service time error and minimum service time error for the specified number of client-weight combinations to obtain a ``worst-case'' error range.

To measure proportional fairness accuracy, we ran simulations for each scheduling algorithm on 45 different combinations of $N$ and $\Phi_T$ (32 up to 8192 clients and 16384 up to 262144 total weight, respectively). Since the proportional sharing accuracy of a scheduler is often most clearly illustrated with skewed weight distributions, one of the clients was given a weight equal to 10 percent of $\Phi_T$. All of the other clients were then randomly assigned weights to sum to the remaining 90 percent of $\Phi_T$. For each pair $(N, \Phi_T)$, we ran 2500 client-weight combinations and determined the resulting worst-case error range.

The worst-case service time error ranges for WRR, WFQ, SFQ, VTRR, SRR, and $GR^3$ with these skewed weight distributions are in Figures 4 to 9. Due to space constraints, WF$^2$Q error is not shown since the results simply verify its known mathematical error bounds of $-1$ and $1$ tu. Each figure consists of a graph of the error range for the respective scheduling algorithm. Each graph shows two surfaces representing the maximum and minimum service time error as a function of $N$ and $\Phi_T$ for the same range of values of $N$ and $\Phi_T$. Figure 4 shows WRR's service time error is between $-12067$ tu and $23593$ tu. Figure 5 shows WFQ's service time error is between $-1$ tu and $819$ tu, which is much less than WRR. Figure 6 shows SFQ's service time error is between $-819$ tu and $1$ tu, which is almost a mirror image of WFQ. Figure 7 shows VTRR's service error is between $-2129$ tu and $10079$ tu. Figure 8 shows SRR's service error is between $-369$ tu and $369$ tu.

In comparison, Figure 9 shows the service time error for $GR^3$ only ranges from $-2.5$ to $3.0$ tu. $GR^3$ has a smaller error range than all of the other schedulers measured except WF$^2$Q. $GR^3$ has both a smaller negative and smaller positive service time error than WRR, VTRR, and SRR. While $GR^3$ has a much smaller positive service error than WFQ, WFQ does have a smaller negative service time error since it is bounded below at $-1$. Similarly, $GR^3$ has a much smaller negative service error than SFQ, though SFQ's positive error is less since it is bounded above at $1$. Considering the total service error range of each scheduler, $GR^3$ provides well over two orders of magnitude better proportional sharing accuracy than WRR, WFQ, SFQ, VTRR, and SRR. Unlike the other schedulers, these results show that $GR^3$ combines the benefits of low service time errors with its ability to schedule in