Check out the new USENIX Web site. next up previous
Next: WFQ Up: Schedulers Previous: MTR-LS


Eclipse/BSD's I/O schedulers use approximations to the GPS (Generalized Processor Sharing) [18] model. GPS assumes an ideal ``fluid'' system where each backlogged ``flow'' in the system instantaneously receives service in proportion to the flow's share and inversely proportionally to the sum of the shares of all backlogged flows (where a backlogged flow is analogous to a busy queue). GPS cannot be precisely implemented for I/O because typically (1) I/O servers can only service one request at a time and (2) an I/O request cannot be preempted once service on it begins. GPS approximations estimate the time necessary for servicing each request and interleave requests from different queues so that each queue receives service proportionally to its share (although not instantaneously). However, the necessary time estimates may be difficult to compute precisely because GPS's rate of service for each flow depends on what flows are backlogged at each instant [3].

Eclipse/BSD's disk scheduler uses a new GPS approximation, the YFQ (Yet another Fair Queueing) algorithm [5], which can be implemented very efficiently. A resource is called busy if it has at least one busy queue, or idle otherwise. YFQ associates a start tag, Si, and a finish tag, Fi, with each queue qi. Si and Fi are initially zero. YFQ defines a virtual work function, v(t), such that: (1) v(0) = 0; (2) While the resource is busy, v(t) is the minimum of the start tags of its busy queues at time t; and (3) When the resource becomes idle, v(t) is set to the maximum of all finish tags of the resource.

When a new request ri that uses queue qi arrives: (1) If qi was previously empty, YFQ makes $S_i = \max(v(t), F_i)$ followed by $F_i = S_i + \frac{l_i}{w_i}$, where li is the data length of request ri; and (2) YFQ appends ri to qi.

YFQ selects for servicing the request ri at the head of the busy queue qi with the smallest finish tag Fi. ri remains at the head of qi while ri is being serviced. When ri completes, YFQ dequeues it; if queue qi is still non-empty, YFQ makes Si = Fi followed by $F_i = S_i + \frac{l'_i}{w_i}$, where l'i is the data length of the request r'i now at the head of qi.

Selecting one request at a time, as described above, allows YFQ to approximate GPS quite well, providing good cumulative service, delay, and fairness guarantees. However, such guarantees may come at the cost of excessive disk latency and seek overheads, harming aggregate disk throughput. Therefore, YFQ can be configured to select up to b requests (a batch) at a time and place them in a sort queue, as shown in Figure 2. The disk driver or the disk itself may reorder requests within a batch so as to minimize disk latency and seek overheads.

Figure 2: The sort queue allows the disk driver or disk to reorder requests and minimize disk latency and seek overheads.
\centerline{\epsfxsize=\columnwidth \epsfbox{figs/fs1.eps}}\end{figure}

next up previous
Next: WFQ Up: Schedulers Previous: MTR-LS
Jose Brustoloni