YFQ

Eclipse/BSD's I/O schedulers use approximations to the GPS (Generalized Processor Sharing) [18] model. GPS assumes an ideal ``fluid'' system where each backlogged ``flow'' in the system instantaneously receives service in proportion to the flow's share and inversely proportionally to the sum of the shares of all backlogged flows (where a backlogged flow is analogous to a busy queue). GPS cannot be precisely implemented for I/O because typically (1) I/O servers can only service one request at a time and (2) an I/O request cannot be preempted once service on it begins. GPS approximations estimate the time necessary for servicing each request and interleave requests from different queues so that each queue receives service proportionally to its share (although not instantaneously). However, the necessary time estimates may be difficult to compute precisely because GPS's rate of service for each flow depends on what flows are backlogged at each instant [3].

Eclipse/BSD's disk scheduler uses a new GPS approximation, the YFQ (Yet another Fair Queueing) algorithm [5], which can be implemented very efficiently. A resource is called busy if it has at least one busy queue, or idle otherwise. YFQ associates a start tag, S_i, and a finish tag, F_i, with each queue q_i. S_i and F_i are initially zero. YFQ defines a virtual work function, v(t), such that: (1) v(0) = 0; (2) While the resource is busy, v(t) is the minimum of the start tags of its busy queues at time t; and (3) When the resource becomes idle, v(t) is set to the maximum of all finish tags of the resource.

When a new request r_i that uses queue q_i arrives: (1) If q_i was previously empty, YFQ makes $S_i = \max(v(t), F_i)$ followed by $F_i = S_i + \frac{l_i}{w_i}$ , where l_i is the data length of request r_i; and (2) YFQ appends r_i to q_i.

YFQ selects for servicing the request r_i at the head of the busy queue q_i with the smallest finish tag F_i. r_i remains at the head of q_i while r_i is being serviced. When r_i completes, YFQ dequeues it; if queue q_i is still non-empty, YFQ makes S_i = F_i followed by $F_i = S_i + \frac{l'_i}{w_i}$ , where l'_i is the data length of the request r'_i now at the head of q_i.

Selecting one request at a time, as described above, allows YFQ to approximate GPS quite well, providing good cumulative service, delay, and fairness guarantees. However, such guarantees may come at the cost of excessive disk latency and seek overheads, harming aggregate disk throughput. Therefore, YFQ can be configured to select up to b requests (a batch) at a time and place them in a sort queue, as shown in Figure 2. The disk driver or the disk itself may reorder requests within a batch so as to minimize disk latency and seek overheads.

**Figure 2:** The sort queue allows the disk driver or disk to reorder requests and minimize disk latency and seek overheads.
$\begin{figure} \centerline{\epsfxsize=\columnwidth \epsfbox{figs/fs1.eps}}\end{figure}$