
3. Opportunistic Block Reordering

To support large client populations inexpensively, we need to minimize the number of disk fetches and maximize the number of clients served with each fetch. Ideally, when multiple clients request the same file, the server retrieves each block from disk only once and sends it to every client. Then, using the notation introduced previously, serving a file to $N$ clients at network throughput $N R_c$ requires disk throughput $R_c$ and buffer space $B/2 \leq M \leq 2B$, all of which are independent of $N$.
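As a rough numerical illustration (hypothetical values, not taken from the paper), suppose each client downloads at $R_c = 1$ Mbit/s and $N = 1000$ clients request the same file:

\[ N R_c = 1000 \times 1\ \mathrm{Mbit/s} = 1\ \mathrm{Gbit/s}\ \text{of network throughput}, \qquad \text{disk throughput} = R_c = 1\ \mathrm{Mbit/s}, \]

while the buffer requirement stays within the constant bounds $B/2 \leq M \leq 2B$. Only the network capacity has to scale with the client population.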

A file download server must transfer all the blocks of a file to satisfy each file request; thus an arriving file request from client $i$ reveals information about block accesses on behalf of client $i$ for at least the next $L/R_{c_i}$ time units. The system can use these revealed block sequences to improve performance and/or reduce cost. In particular, it can improve cache performance by opportunistically reordering the accesses, e.g., by sending each fetched block to all clients known to need it in the future before evicting it from memory. The server can send blocks out of order by attaching an application-layer header [12] to each transmitted block that specifies its file offset; clients resequence the data according to these block headers. This section presents the block reordering algorithm in the Circus content server.
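As a concrete sketch of this resequencing step, the following Python fragment prepends a fixed 8-byte offset header to each block on the server and writes each received block at its stated offset on the client. The header layout and function names are illustrative assumptions, not the format specified in [12].

    import struct

    HEADER_FMT = "!Q"                      # assumed header: 8-byte big-endian file offset
    HEADER_LEN = struct.calcsize(HEADER_FMT)

    def pack_block(offset, payload):
        # Server side: tag the block with its file offset so it can be sent out of order.
        return struct.pack(HEADER_FMT, offset) + payload

    def unpack_block(message):
        # Client side: recover the offset and the block payload from one message.
        (offset,) = struct.unpack(HEADER_FMT, message[:HEADER_LEN])
        return offset, message[HEADER_LEN:]

    def resequence(messages, file_size):
        # Client side: place every received block at its offset to rebuild the file.
        data = bytearray(file_size)
        for msg in messages:
            offset, payload = unpack_block(msg)
            data[offset:offset + len(payload)] = payload
        return bytes(data)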

  
[Figure 4 image: eps/algorithm.eps]
Figure 4: We call the part of a file that is resident in the server's main memory the active region of the file. The client downloading the file at the highest rate (the frontrunner) advances the active region, while lower-rate clients (the followers) are kept within the active region. Some clients (independent clients) are allowed to move outside the active region.



[Figure 5 pseudocode listing not reproduced]

Figure 5: Pseudocode of the Circus block selection algorithm.


The Circus server maintains a list of files with downloads currently in progress (active files), and for each active file a list of clients currently downloading that file (active clients). For each active client the server creates a FIFO queue of references to the file blocks selected to send next to that client. The server transmits blocks from the head of the queue when the client's network transport send window opens. When a queue drains below a threshold, the server selects another block to transmit to that client, schedules a disk fetch if necessary, and adds the block reference to the queue tail. Circus strives to guarantee forward progress and fairness by serving all active clients at their maximum network transfer rates.
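One plausible shape for this bookkeeping, sketched in Python under assumed names (ActiveFile, ActiveClient, and queue_threshold are hypothetical, not taken from the Circus sources):

    from collections import deque

    class ActiveClient:
        # Per-client state for a download in progress.
        def __init__(self, client_id, queue_threshold=4):
            self.client_id = client_id
            self.send_queue = deque()    # FIFO of block references selected to send next
            self.queue_threshold = queue_threshold
            self.received = set()        # bitmap of blocks already queued/sent to this client
            self.cursor = -1             # client cursor: offset of the block last queued

    class ActiveFile:
        # Per-file state while at least one download of the file is active.
        def __init__(self, file_id, num_blocks):
            self.file_id = file_id
            self.num_blocks = num_blocks
            self.clients = {}            # client_id -> ActiveClient
            self.file_cursor = 0         # front edge of the active region

    def refill_queue(afile, client, select_block, fetch_block):
        # When the client's queue drains below its threshold, pick another block,
        # schedule a disk fetch if it is not resident, and append it to the queue tail.
        while len(client.send_queue) < client.queue_threshold:
            block = select_block(afile, client)
            if block is None:            # the client already has every block
                break
            fetch_block(afile.file_id, block)
            client.send_queue.append(block)
            client.received.add(block)
            client.cursor = block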

The block selection policy is the key to the Circus algorithm. For each active client, Circus also maintains a set data structure (bitmap) to record the set of blocks already sent to the client. Thus, for each block of an active file, it knows the set of active clients that have yet to receive the block. The selection policy could select blocks to greedily maximize the number of clients that benefit from each block, or minimize the number of block fetches needed to fill up all block queues at any given time. However, these policies are computationally complex, and their cost scales with the file size.
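A small helper (again a sketch built on the hypothetical structures above) shows how the per-client bitmaps answer the reverse question of which active clients still need a given block:

    def clients_needing(afile, block):
        # Active clients of this file that have not yet received the given block.
        return [c for c in afile.clients.values() if block not in c.received]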

We propose a simple block selection policy that is fast and has modest memory requirements. Figure 5 outlines the algorithm. Its complexity scales linearly with the number of active clients for each file. To preserve locality of reference among the active clients, Circus designates an active region of each file as preferred for block selection and caching. The active region is a moving contiguous region within a circular block space. The file cursor specifies the front edge of a file's active region. The length of a file's active region (the active length, a tunable configuration parameter) controls the amount of memory devoted to caching the file.

Each active client has its own client cursor that tracks the file offset of the block most recently queued to send to that client. Based on the client cursor positions, we call the leading active client for a file (often the highest-rate client) the frontrunner of the file; the trailing clients active on the same file are the followers. The frontrunner advances the file cursor according to its current client cursor. This operation refreshes the system cache with new data blocks that are likely also needed by the followers. When the cursor of a follower lags behind the frontrunner by a distance that exceeds the active length, the follower's cursor advances to the middle of the active region. Moving the cursor to the middle rather than the front of the active region prevents the follower from immediately becoming a frontrunner. We can summarize these operations with the following two rules:

1. The frontrunner advances the file cursor (and hence the active region) to its own client cursor position as it progresses through the file.
2. When a follower's cursor falls more than the active length behind the file cursor, the follower's cursor jumps forward to the middle of the active region.

To avoid stalling clients that are missing only a small number of blocks, Circus tracks the fraction of file blocks each client has already received. When this fraction exceeds a configured maximum value (the client threshold), the client becomes independent; the algorithm selects blocks for independent clients by scanning their block maps for needed blocks, even if they lie outside the active region. A client also becomes independent if it is the frontrunner and its next missing block is more than a configured leapfront distance ahead of the current file cursor; this avoids damaging the reference locality of the other active clients. In our experiments, we found that a leapfront distance equal to the file's active length achieves stable operation. We investigate the sensitivity of the system to several configuration parameters in Section 5.5.
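The sketch below reconstructs a block selection routine in the spirit of Figure 5 from the prose above; it is not the authors' pseudocode. It reuses the hypothetical ActiveFile/ActiveClient structures from the earlier fragment, and active_length, client_threshold, and leapfront stand in for the configuration parameters just described.

    def next_missing_after(client, start, num_blocks):
        # First block at or after 'start' (wrapping around) not yet sent to the client.
        for i in range(num_blocks):
            block = (start + i) % num_blocks
            if block not in client.received:
                return block
        return None

    def select_block(afile, client, active_length, client_threshold, leapfront):
        n = afile.num_blocks

        # Nearly-complete clients become independent and may leave the active region.
        if len(client.received) / n > client_threshold:
            return next_missing_after(client, 0, n)

        candidate = next_missing_after(client, (client.cursor + 1) % n, n)
        if candidate is None:
            return None

        # Frontrunner test; the cursor comparison ignores wrap-around for brevity.
        is_frontrunner = all(client.cursor >= c.cursor for c in afile.clients.values())

        if is_frontrunner:
            # A frontrunner whose next missing block lies more than 'leapfront'
            # ahead of the file cursor becomes independent instead of dragging
            # the active region away from the followers.
            if (candidate - afile.file_cursor) % n > leapfront:
                return next_missing_after(client, 0, n)
            # Rule 1: the frontrunner advances the file cursor (active region).
            afile.file_cursor = candidate
            return candidate

        # Rule 2: a follower lagging more than the active length behind the file
        # cursor jumps forward to the middle of the active region.
        if (afile.file_cursor - candidate) % n > active_length:
            middle = (afile.file_cursor - active_length // 2) % n
            candidate = next_missing_after(client, middle, n)
        return candidate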

  
[Figure 6 image: eps/prototype.eps]

Figure 6: Simplified diagram outlining the prototype implementation of the Circus download server. Solid arrows show data transfers, while dotted arrows illustrate transfer of control.



[Figure 7 image: eps/setting.eps]

Figure 7: Multiple client instances corresponding to different download requests are partitioned across several workstations according to their network link capacity. The server is connected to the client workstations through a gigabit Ethernet switch. Dummynet is used to control the per-flow rate from the server to each client workstation.


