Check out the new USENIX Web site. next up previous
Next: Beyond Writes: Range Reads Up: The Basic Interface Previous: Range Specification

Overlapping Ranges

Modern disks allow multiple outstanding writes. While improving performance, the presence of multiple outstanding requests complicates range writes. In particular, a file system may wish to issue two requests to the disk, R1 and R2. Assume both requests should end up near one another on disk (e.g., they are to files that live in the same cylinder group). Assume also that the file system has a free block range in that disk region, B1 through Bn.

Thus, the file system would like to issue both requests R1 and R2 simultaneously, giving each the range B1 through Bn. However, the disk is thus posed with a problem: how can it ensure it does not write the two requests to the same location?

The simplest solution would be to disallow overlapped writes, much like many of today's disks do not allow multiple outstanding writes to the same address (``overlapped commands'' in SCSI parlance [35]). In this case, the file system would have two choices. First, it could serialize the two requests, first issuing R1, observing which block it was written to (say Bk), and then submitting request R2 with two ranges (B1 to Bk-1 and Bk+1 to Bn). Alternately, the file system could pick subsets of each range (e.g., B1, B3, ..., Bk-1 in one range, and B2, B4, ..., Bk in the other), and issue the requests in parallel.


Table: Classic vs. Range Writes. The table shows the differences between the classic idealized disk write and a range write. The range descriptor can be specified as a list of free ranges or a (begin, end) pair plus bitmap describing the free blocks within the range.
Classic Write
$\;$ in: address, data, length
$\;$ out: status
Range Write
$\;$ in: range descriptor, alignment,
data, length
$\;$ out: status, resulting target address
0.9


However, the non-overlapped approach was too restrictive; it forces the file system to reduce the number of targets per write request in anticipation of their use and thus reduces performance. Further, non-overlapped ranges complicate the use of range writes, as a client must then make decisions on which portions of the range to give to which requests; this likely decreases the disk's control over low-level placement and thus decreases performance. For these reasons, we chose to allow clients to issue multiple outstanding range writes with overlapping ranges.

Overlapping writes complicate the implementation of range writes within the disk. Consider our example above, where two writes R1 and R2 are issued concurrently, each with the range B1 through Bn. In this example, assume that the disk schedules R1 first, and places it on block B1. It is now the responsibility of the disk to ensure that R2 is written to any block except B1. Thus, the disk must (temporarily) note that B1 has been used.

However, this action raises a new question: how long does the disk have to remember the fact that B1 was written to and thus should not be considered as a possible write target? One might think that the disk could forget this knowledge once it has reported that R1 has completed (and thus indicated that B1 was chosen). However, because the file system may be concurrently issuing another request R3 to the same range, a race condition could ensue and block B1 could be (erroneously) overwritten.

Thus, we chose to add a new kind of write barrier to the protocol. A file system uses this feature as follows. First, the file system issues a number of outstanding writes, potentially to overlapping ranges. The disk starts servicing them, and in doing so, tracks which blocks in each range are written. At some point, the file system issues a barrier. The barrier guarantees to the disk that all writes following the barrier take into account the allocation decisions of the disk for the writes before the barrier. Thus, once the disk completes the pre-barrier writes, it can safely forget which blocks it wrote to during that time.


next up previous
Next: Beyond Writes: Range Reads Up: The Basic Interface Previous: Range Specification
Remzi Arpaci-Dusseau 2008-10-08