################################################ # # # ## ## ###### ####### ## ## ## ## ## # # ## ## ## ## ## ### ## ## ## ## # # ## ## ## ## #### ## ## ## ## # # ## ## ###### ###### ## ## ## ## ### # # ## ## ## ## ## #### ## ## ## # # ## ## ## ## ## ## ### ## ## ## # # ####### ###### ####### ## ## ## ## ## # # # ################################################ The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96) Seattle, Washington, October 28-31, 1996 For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: office@usenix.org 4. WWW URL: https://www.usenix.org A Trace-Driven Comparison of Algorithms for Parallel Prefetching and Caching Tracy Kimbrel Dept. of Computer Science and Engr. University of Washington Seattle WA tracyk@cs.washington.edu Andrew Tomkins School of Computer Science Carnegie Mellon University Pittsburgh PA andrewt@cs.cmu.edu R. Hugo Patterson Dept. of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA rhp@cs.cmu.edu Brian Bershad Dept. of Computer Science and Engr. University of Washington Seattle WA bershad@cs.washington.edu Pei Cao Computer Sciences Dept. University of Wisconsin - Madison Madison WI cao@cs.wisc.edu Edward W. Felten Dept. of Computer Science Princeton University Princeton NJ felten@cs.princeton.edu Garth A. Gibson School of Computer Science Carnegie Mellon University Pittsburgh PA garth@cs.cmu.edu Anna R. Karlin Dept. of Computer Science and Engr. University of Washington Seattle WA karlin@cs.washington.edu Kai Li Dept. of Computer Science Princeton University Princeton NJ li@cs.princeton.edu Abstract High-performance I/O systems depend on prefetching and caching in order to deliver good performance to applications. These two techniques have generally been considered in isolation, even though there are significant interactions between them; a block prefetched too early reduces the effectiveness of the cache, while a block cached too long reduces the effectiveness of prefetching. In this paper we study the effects of several combined prefetching and caching strategies for systems with multiple disks. Using disk-accurate trace-driven simulation, we explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints. Some of the strategies have been published with theoretical performance bounds, and some are components of systems that have been built. One is a new algorithm that combines the desirable characteristics of the others. We find that when performance is limited by I/O stalls, aggressive prefetching helps to alleviate the problem; that more conservative prefetching is appropriate when significant I/O stalls are not present; and that a single, simple strategy is capable of doing both. 1 Introduction Recently there has been a great deal of interest in prefetching from parallel disks as a technique for improving the I/O performance of sequential applications. In this paper, we study prefetching and caching strategies for multiple disks in the presence of application-provided knowledge of future accesses. We compare the performance of four algorithms: Fixed horizon is simple to implement, and has near-optimal performance when sufficient I/O parallelism is available, but can be suboptimal in I/O-bound situations. 
Aggressive is also simple to implement, is close to optimal for a single disk and for well-laid-out data on multiple disks, but can be suboptimal for multiple disks when the load on the disks is unbalanced. Reverse aggressive is substantially more complex, but is provably close to optimal for all configurations in a uniform fetch-time model of disk accesses. Forestall is a new algorithm, representing an attempt to combine the desirable characteristics of the other three algorithms. Using trace-driven simulation on a collection of file access traces, we compare the performance of these algorithms assuming an environment in which a single process is running and full advance knowledge is available. 1.1 Motivation Our work is motivated by recent advances in technology that have made magnetic disks both cheaper and smaller. As a result, parallel disk arrays have become an attractive means for achieving high performance from storage devices at low cost [24][28][15]. Independently accessible multiple disks offer the advantage of both increased bandwidth and reduced contention on individual disk arms. However, many applications do not benefit from this parallelism because their I/O accesses are serial. This problem is particularly severe for read-intensive applications. Write performance is less important as write behind strategies can mask update latency. Read-intensive applications that stall for I/O a significant fraction of their running time include text search, 3D scientific visualization, relational database queries, multimedia servers and object code linkers. Many of these applications have predictable access patterns [18][1][25]. The ability to provide the file system with hints about future references has motivated research into the design of policies that use this information to reduce I/O overhead [6][7][26][25]. The two key techniques that are enabled by detailed information about future accesses are deep prefetching and better-than-LRU cache replacement. This paper explores the tradeoff between aggressive prefetching and optimal cache replacement. The decision to prefetch requires that a buffer be reserved immediately, usually precipitating the replacement of cached data. This earlier replacement may result in an inferior replacement choice, which may actually increase the total number of fetches and degrade performance. Furthermore, in the multiple disk case, poor replacement choices can lead to load imbalance between the disks. The algorithms we study choose different points on the spectrum between aggressive prefetching (with possibly poorer cache replacements) and more conservative prefetching (with closer to optimal cache replacement). The algorithms explored in this paper also differ in the degree to which they require advance knowledge. Fixed horizon exploits global knowledge the least while reverse aggressive exploits it the most. Using disk-accurate, trace-driven simulation, this paper's results provide a measure of the potential benefit of using global knowledge. 1.2 Parallel prefetching and caching Prefetching and caching are even more complicated in a system with multiple disks, not only because it is possible to do multiple prefetches in parallel, but also because appropriate cache replacement strategies can alleviate load imbalance among the disks. Since a disk can serve only one prefetch at a time, a set of blocks can be prefetched in parallel only if they reside on different disks. 
Figure 1: An example of prefetching and caching with advance knowledge of accesses for two disks. One disk holds blocks A, C, E, and F, and another disk holds blocks b and d. Cache size K = 4 blocks and fetch time F = 2 computation steps.

In order to develop intuition for why cache replacement strategies can affect parallel prefetching performance, consider the example shown in figure 1. In this example, the cache holds four blocks and the application references one block per time unit. If the application wants to reference a block that is not present in the cache, the application must wait or stall until the block is present. Suppose that it takes two time units to fetch a block from disk, and that the fetches on each disk are serialized. Every fetch evicts some block from the cache; the evicted block becomes unavailable at the moment the fetch starts. The goal is to minimize the total time spent by the application. The application references blocks in the sequence (A,b,C,d,E,F), and the cache initially holds blocks A, b, d, and F. Blocks A, C, E and F are on one disk, and blocks b and d are on the other disk.

The straightforward approach is to prefetch aggressively: always fetch the missing block that will be referenced soonest; evict the block whose next reference is furthest in the future; but do not fetch if the evicted block will be referenced before the fetched block. For small caches such as in this figure, the fixed horizon and aggressive algorithms both behave in this way. Figure 1(a) shows the cache block changes using this approach. The total elapsed time for the sequence is 7 time units.

Figure 1(b) shows another prefetching schedule on the same reference pattern that is faster by one time unit. On the first fetch, d is evicted rather than F, even though d is referenced earlier. This has the advantage of offloading one fetch from the heavily loaded disk (the one holding A, C, E, and F) to the otherwise idle disk (the one holding b and d). This change allows two fetches to proceed in parallel later, thus saving one time unit. The example shows that the achievable I/O parallelism of multiple disks can be affected by cache replacement and data placement policies. These are the factors that are addressed by the reverse aggressive algorithm.

1.3 Comparing approaches

The fixed horizon algorithm is based on the second Informed Prefetching (TIP2) system of Patterson, Gibson et al. [26], which manages allocation of cache space and I/O bandwidth between multiple processes, only some of which are disclosing some or all of their future accesses. TIP2 is designed for the case in which sufficient I/O bandwidth exists to service the request stream without stalling on I/O. The fixed horizon algorithm, a restriction of TIP2 to a single hinting process, initiates a fetch for a missing block H references ahead of its reference, where H is the ratio between the average time it takes to read a block from disk and the minimum time it takes to consume a block of data. Patterson et al. showed that under the assumption of sufficient bandwidth, this strategy eliminates stalls while placing little stress on system resources. However, fixed horizon will not look further than H references into the future for fetches to perform. This can cause it to stall on I/O when there is insufficient disk bandwidth. In contrast, the aggressive and reverse aggressive algorithms are designed to take maximum advantage of any amount of I/O parallelism.
They use knowledge of future accesses to minimize application elapsed time for both small and large numbers of disks. Aggressive prefetches as early as possible, provided that the prefetched block is needed by the application sooner than the block that it will replace. When insufficient bandwidth is available, in particular, it becomes more important to schedule prefetch requests to ensure that no bandwidth is wasted. Reverse aggressive goes beyond aggressive's use of future knowledge by attempting to balance disk workload through carefully selected replacement decisions. Previously, it was shown theoretically that on any access pattern known in advance, reverse aggressive's elapsed time is close to optimal [16]. It is the only one of these algorithms with this theoretical performance guarantee. However, it is not a practical algorithm. First, it is much more complex than the other algorithms, and second, its decisions depend on information farther in the future than the other algorithms. Nonetheless, its relative performance characteristics are of interest: we would like to understand whether or not the theoretical model we have defined actually gives insight into real system performance. If so, the theoretically near-optimal reverse aggressive can be used as a benchmark against which to compare other algorithms. The performance of reverse aggressive is our best a priori estimate of optimal performance. The new algorithm forestall attempts to combine the best features of the other three algorithms: the good performance of reverse aggressive and the simplicity and implementability of fixed horizon and aggressive. Forestall avoids stalling while still making good (late) replacement decisions by estimating the point at which it needs to begin prefetching in order to prevent stalling. 1.4 Summary of results In this paper we describe the results of a performance evaluation of the different policies for the d-disk integrated prefetching and caching problem. Our results from trace driven simulation demonstrate the practical performance characteristics of these algorithms. On our traces, we found that: All four algorithms significantly outperform demand fetching, even when advance knowledge of the access sequence is used to make optimal replacement decisions in conjunction with demand fetching. In compute-bound situations, fixed horizon and forestall have the best performance (which is usually matched by reverse aggressive's). In I/O-bound situations, aggressive and forestall have the best performance (which is usually matched by reverse aggressive's). In any given situation, one of fixed horizon or aggressive performs close to the theoretically near-optimal reverse aggressive. In all situations, forestall performs close to reverse aggressive. When data is well-laid out on the disks (e.g., striped), disk loads are balanced even without careful replacement choices. For this reason, reverse aggressive does not significantly outperform the other algorithms. Fixed horizon consistently places the least I/O load on the disks, due to its conservative fetching and near-optimal replacement choices. Reverse aggressive and forestall are intermediate between aggressive and fixed horizon. Batching of prefetch requests and disk head scheduling are crucial to the performance of prefetching and caching strategies. Forestall is a promising new approach that combines the best features of the other three algorithms: good performance regardless of I/O- or compute-boundedness, simplicity, and practicality. 
We have focused on a rather narrow range of the input space: the single process, fully-hinted case. Clearly, prefetching and caching algorithms must deal effectively with missing or incorrect hints, as well as multiple simultaneously executing processes. Fixed horizon, aggressive and forestall can all be adapted to deal with these more general situations [26][5]. 1.5 Related work Caching and prefetching have been known techniques to improve storage hierarchies for many years [12][2]. In architectures, the work on caching and prefetching has focused on bridging the performance gap between CPU and main memory [29]. Research using caching and prefetching in database systems [10][23][9] showed that it is important to use applications' knowledge to perform caching and prefetching. File caching and prefetching have become standard techniques for sequential file systems [26][7][13][4][30][22][14][20][12]. The most common prefetching approach is to perform sequential readahead [21][20][12]. The limitation of this approach is that it only benefits applications that make sequential references to large files. Another large body of work has been on predicting future access patterns [13][10][23][30][11]. Recently, caching and prefetching have also been studied for parallel file systems [25][18][11]. Although much work has been done in file caching and prefetching, most of it has considered one or the other in isolation. Recent studies for the single disk case showed [5][6] that it is important to integrate prefetching, caching and disk scheduling together and that a properly integrated strategy can perform much better than a naive strategy. For the multi-disk case, a theoretical study [16] presented and analyzed aggressive and reverse aggressive. Other parallel prefetching strategies include one stripe lookahead prefetching on RAID arrays, and Patterson et al. [26]'s TIP2 system. The one stripe lookahead benefits only applications that use large files, and would perform little prefetching for other applications. TIP2 uses the fixed horizon algorithm we have studied here. Patterson et al. [26] also present a cost-benefit technique for controlling buffer allocations for both hinting and non-hinting applications in a multi-process environment. Previous studies of the algorithms considered here have been incomparable. Differences in hardware, both in the processor and the I/O system, as well as in the benchmarks used to evaluate the algorithms, have made it difficult to understand the differences between them. This paper represents the first direct comparison of these approaches. Using the knowledge learned from this comparison, we have designed a new algorithm that attempts to combine the best features of the previous efforts. 1.6 Organization of the paper In the next section we describe the first three algorithms and their theoretical basis. In section 3, we describe our simulation framework. In section 4, we present the results of our simulations using the first three algorithms. In section 5 we describe the new algorithm forestall and present simulation results on its performance. We present our conclusions in section 6. 2 The algorithms We begin by introducing the framework used to study this problem and the terminology used in the rest of the paper. 2.1 Theoretical model Our theoretical model consists of two levels of memory hierarchy: a cache of K data blocks, and d (disk) storage devices. The execution of a program makes a known request sequence of references r_1, r_2, ... 
r_n to a set of m data blocks. If a reference hits in the cache, it takes one time unit. Otherwise, the missed block must be fetched from a storage device. The system can either fetch a block on a miss (demand driven) or fetch the block before it is referenced in anticipation of a miss (prefetch). Either case takes F time units. If the cache is full, a cache block must be evicted before the fetch is issued to make room for the requested data block. While the fetch is in progress, neither the incoming block nor the discarded block is available for access. We assume that each block resides on a single disk. Fetches to a single disk are serialized, but fetches on different disks can be executed concurrently. When the program tries to access a block that is not in the cache, it stalls until the block arrives in the cache. The stall time is either F if the block is fetched on demand or F - i if the block is prefetched i time units before the reference. The measure of performance is the elapsed time required to serve the entire request sequence; this is equal to the number of references plus the total stall time. The goal is to minimize application elapsed time, by deciding when to fetch a block from a disk, which disk to fetch from, which block to fetch, and which cache block to evict (when the cache is full). The time unit models the CPU time spent between two consecutive file references --- the CPU time includes the time to copy the accessed file data from kernel address space to a user address space buffer, and the time for the application to consume the file data. The model simplifies the real situation by assuming that the CPU time between every two file references is the same, that all disk accesses take the same amount of time, and that there is no CPU overhead incurred by issuing an I/O request. These simplifications were made in order to make the problem theoretically tractable. Our simulations use actual CPU times collected in our traces and an accurate simulation model of modern disk drives, and charge a driver overhead for each request made to a disk. 2.2 Optimal prefetching rules The following simple rules can be assumed of any optimal strategy in the single-disk case [6]. (These rules are optimal in the sense that any schedule that does not follow them can be transformed into one that does, with performance at least as good.) Optimal fetching: when fetching, always fetch the missing block that will be referenced soonest; Optimal replacement: when fetching, always evict the block in the cache whose next reference is furthest in the future; Do no harm: never evict block A to fetch block B when A's next reference is before B's next reference; First opportunity: never evict A to fetch B when the same thing could have been done one time unit earlier. Unfortunately, as exhibited in the example in section 1, some of these rules no longer hold in the multiple-disk case. It may be necessary to violate all of the rules except first opportunity to produce an optimal schedule. 2.3 The fixed horizon algorithm As described earlier, the fixed horizon algorithm is based on the TIP2 system running a single hinting process [26]. Fixed horizon: Whenever there is a missing block at most H references in the future, issue a fetch for that block, replacing the cached block whose next reference is furthest in the future, provided that reference is further than H accesses in the future (which will certainly hold if H < K). 
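To make the rule concrete, here is a minimal sketch of a single decision step of fixed horizon in the theoretical model (our illustration in Python; the names cache, future, in_flight and the helper next_use are ours, not part of TIP2):

    def next_use(block, future):
        """Index of the next reference to block, or infinity if never reused."""
        try:
            return future.index(block)
        except ValueError:
            return float("inf")

    def fixed_horizon_step(cache, future, in_flight, H):
        """One decision step of fixed horizon; assumes a full cache.
        cache: set of resident blocks; future: upcoming references in order;
        in_flight: blocks whose fetches have already been issued.
        Returns a (fetch, evict) pair to issue now, or None."""
        horizon = future[:H]
        missing = [b for b in horizon if b not in cache and b not in in_flight]
        if not missing:
            return None                      # nothing missing within H references
        fetch = missing[0]                   # earliest missing block in the horizon
        evict = max(cache, key=lambda b: next_use(b, future))
        if next_use(evict, future) < H:      # replacement must lie beyond the horizon
            return None
        return fetch, evict

The replacement test mirrors the proviso above: the victim's next reference must lie beyond the horizon, which always holds when H < K.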
Fixed horizon is consistent with the first three rules of optimal prefetching for a single disk. An advantage of not following the fourth rule is that fixed horizon needs less information about references beyond the prefetch horizon than the other algorithms. A disadvantage is that when additional information is available, fixed horizon can have elapsed time nearly twice optimal. The prefetch horizon H is computed as the ratio of the average time it takes to read a block from disk and the minimum time it takes to consume a single block of data. In the theoretical model, H = F. Fixed horizon tries to fetch as late as possible without stalling in order to make the best possible replacement decision. Each fetch is issued so that it will complete just in time for the reference. If parallelism increases to the point that each request is made to an idle disk, this algorithm is optimal. However, in practice, a sufficient number of disks may not be available. In this case, fixed horizon may initiate fetches too late to avoid stalling. In fact, because it never initiates a fetch more than H references ahead of the missing block, fixed horizon may allow a disk to become idle even though the future requests beyond the prefetch horizon contain many missing blocks. On the other hand, if the missing blocks in the sequence tend to be separated by many intervening references to blocks that are present in the cache, we'd expect fixed horizon to have performance much closer to optimal than its worst case. 2.4 The aggressive algorithm The (multi-disk) aggressive algorithm is based on the Cao et al. (single-disk) aggressive algorithm [6], which is provably near-optimal in the single-disk case. (Multi-disk) aggressive: Whenever a disk is free, prefetch the first missing block on that disk, replacing the block whose next reference is furthest in the future, under the condition that the next access to the evicted block is after the next access to the block being fetched. Aggressive is the most aggressive prefetching strategy that is consistent with the four optimal prefetching rules described in section 2.2. As mentioned, some of these rules are no longer valid in the multiple disk case. This provides some of the intuition for the following theorem. Theorem 1 [16] For any access pattern, and any layout of data on disks, the elapsed time of aggressive is at most d(1 + epsilon1) times that of the optimal elapsed time (the minimum possible), where d is the number of disks, and epsilon1 is a small constant that depends on system parameters. (epsilon1 here is F/K where F is the fetch time/compute time ratio and K is the cache size measured in blocks. For typical system parameters epsilon1 is less than 0.02.) There are worst case access patterns/data layouts for which the elapsed time of aggressive is at least d times the minimum possible. It is important to note that this worst case result depends on access patterns and data layouts in which the load is heavily unbalanced between the disks. If the request sequence is balanced, aggressive has near-optimal performance. 2.5 The reverse aggressive algorithm The reverse aggressive algorithm exploits global knowledge in order to produce a prefetching schedule that achieves near-optimal elapsed time. It does this by balancing disk workload through carefully selected replacement decisions. 
Reverse aggressive: Construct a prefetching schedule for the reversed sequence that replaces at most one block on each disk in parallel as follows: Whenever a disk is free, determine the block B not needed for the longest time on that disk. If the next request to B is after the first missing block, issue a fetch for the missing block, replacing B. Transform this prefetching schedule back to a schedule for the original sequence by treating each fetch on the reverse sequence as an eviction on the forward sequence and vice versa.

For a proof of correctness, more details on how and why this algorithm works well, and a proof of the following theorem, see [16].

Theorem 2 [16] For any request sequence, and for any layout of the data on the disks, the elapsed time of reverse aggressive is at most (1 + epsilon2) times the optimal elapsed time. (epsilon2 here is less than Fd/K, where F is the fetch time/compute time ratio, d is the number of disks, and K is the cache size in blocks. For typical system parameters, epsilon2 is less than 0.1, and sometimes significantly less.)

There are two key properties of reverse aggressive that result in this theorem. First, whereas aggressive chooses evictions without considering the relative loads on the disks, reverse aggressive greedily evicts to as many disks as possible on the reverse sequence. In the forward direction, this translates to performing a maximal set of fetches in parallel. The fact that these are fetches in the forward direction means that at some point earlier in the sequence, corresponding blocks were evicted. Thus the eviction decisions of reverse aggressive on the forward sequence are based on the ability to prefetch the evicted blocks later on in parallel. Second, whereas aggressive can wastefully prefetch ahead on some of its disks, reverse aggressive is greedy in the reverse direction. Consequently, it is fetching blocks in the forward direction just in time (to the extent possible) for them to be used. This results in performing close to the best evictions possible for those fetches.

2.6 Practical considerations

Several important features of real systems are not captured by our theoretical model. Disk response times and CPU times between I/O requests are not constant. We use average values for each and expect that variation in event times does not substantially invalidate the algorithm's decisions. In our experimentation, this does not appear to be a major effect, with one exception (see section 4.3). (The systematic effects of disk scheduling on disk response time are considered separately.) Access patterns exhibit locality of reference, and loads are balanced across the multiple disks when data is striped. In practice, this allows both fixed horizon and aggressive to effectively utilize multiple disks, and achieve elapsed times comparable to the theoretically superior reverse aggressive. Disk accesses require significant CPU overhead to form the request, communicate with the disk, and service the resulting interrupt(s). Thus, avoidable data fetches may add elapsed time even if they do not cause stalls. Because the theory assumes that fetches entail no CPU overhead, this penalty punishes overly aggressive fetching. In practice, this effect favors the fixed horizon algorithm since its late replacement decisions tend to lead to the fewest fetches. Disk response time is sensitive to the order in which requests are serviced.
In particular, disk scheduling reduces average disk response time as more accesses are presented and allowed to be reordered by the disk (driver). Although fixed horizon implicitly allows multiple outstanding requests at each disk, aggressive and reverse aggressive were defined to submit only one request at a time, since in the theoretical model there is no advantage to batching. Because of the significance of the disk scheduling effect, we modify the definitions of aggressive and reverse aggressive to submit disk requests in batches. We have found that the performance of all three algorithms benefits from the CSCAN disk scheduling algorithm. Reverse aggressive also benefits from batching of requests during its construction of its prefetching schedule (the reverse pass over the request sequence). This is because typical request sequences exhibit spatial locality; by batching requests on the reverse pass, reverse aggressive generates missing blocks to be fetched on the forward sequence in groups that exhibit locality of reference. The inter-request CPU time is actually composed of two components, a fixed amount of time to read a block out of the cache, and a variable amount of time to process the data. Our implementation of fixed horizon assumes the data processing time to be zero, and uses the ratio of the average disk response time to the time to read a block from the cache as the prefetch horizon H. This ensures that any prefetch to an idle disk will complete in time for the reference. Assuming an average disk response time of 15ms (which is usually an overestimate in our simulations) and 243 microsec to read a block from the cache (which was measured on the implemented TIP2 system) yields a value of H=62; we used this value in all our simulations, except where noted otherwise. 2.7 Implementations of the algorithms In the context of the considerations of the previous section, we summarize the implementations we compared. Fixed horizon: Whenever there is a missing block at most H references away, issue a fetch for that block, replacing the block whose next reference is furthest in the future. Note that this algorithm may at any time have up to H outstanding references to a disk yielding opportunities for disk scheduling. Aggressive: Whenever a disk D is free, construct a batch of at most batch-size fetches to initiate on D as follows: As long as the first missing block B on disk D precedes the block B' whose next request is furthest in the future, add the fetch/eviction pair B/B' to the batch. Issue the batch. (The batch sizes used are listed in table 6.) If two or more disks are free at the same time, we consider all their missing blocks together, in order of increasing request index. Each next missing block is issued to the appropriate disk (and the best possible choice of evictions is made), if the disk's batch is not full and the do no harm rule allows it. At some point, either the last free disk's batch becomes full or the do no harm rule disallows issuing further requests. Reverse aggressive: Assuming a fixed ratio F between the time for a disk access and the inter-reference CPU time, consider the reversed sequence, and use it to derive a prefetching schedule as described in section 2.5, but construct the schedule in batches as done by aggressive. This prefetching schedule is then transformed into a schedule of fetch/eviction pairs for the forward sequence. 
Associated with each eviction is a release time, the earliest index in the request sequence at which the block can be evicted (i.e., one greater than the index of the last request to the block before it is possibly fetched back into the cache at some later time). The eviction choices are naturally ordered by increasing release point due to the method used by reverse aggressive to construct its schedule. Fetches may need to be re-ordered according to increasing request index; they are then matched to eviction choices according to these orderings. This schedule is used to drive the disk model as follows. Whenever a disk D is free, add the first up to batch-size fetch/eviction pairs B_i/B_i' that have been released, and for which B_i resides on disk D, to the batch. Issue the batch. (The batch sizes and estimate of F used by reverse aggressive are discussed in section 4.4.) Notice that aggressive and fixed horizon use less lookahead information than reverse aggressive, in that for both of them, the only future information needed is the identities of the next missing blocks (up to H missing blocks for fixed horizon, and up to d times batch-size for aggressive), and their positions in the sequence relative to the next references to blocks currently in the cache.

3 Simulation framework

We used trace-driven simulation to evaluate the performance of the algorithms. We believe our simulation model to be an accurate reflection of the practical performance characteristics of the algorithms. The reference streams are taken from traces of real applications' behavior; the trace information we use is unaffected by prefetching and caching activity. The accurate modelling of disk fetch times, I/O driver overhead costs, and application process compute times in the simulations is a key difference relative to the theoretical framework. However, our simulators do not model serialization of DMA transactions. (We do not expect this to have a significant effect on the results since the DMA time is much less than the disk access time.)

Two separate simulators were developed, one at Washington (UW) and one at Carnegie Mellon (CMU). The UW simulator uses the disk drive simulation of Kotz et al. [19] (which is based on that of Ruemmler and Wilkes [27]) to accurately model I/O costs. This simulation models fine architectural details to provide a very accurate simulation of the HP 97560 disk drive. Table 1 lists several characteristics of the HP 97560 (taken from [27]). The CMU simulator uses the Berkeley RaidSim [8] simulator, as modified at CMU, to simulate IBM 0661 Lightning disk drives.

    Sector size                  512 bytes
    Sectors per track            72
    Tracks per cylinder          19
    Cylinders                    1962
    Rotational speed             4002 rpm
    Disk cache size              128 Kbytes
    Ave. access time (8 Kbyte)   22.8 ms
    Controller interface         SCSI-II
    Transfer rate                10 MB/sec

Table 1: HP 97560 characteristics.

The simulators were cross-validated on a common set of traces. The CMU simulator does not implement reverse aggressive. We obtained good agreement between the simulators on the results for aggressive and fixed horizon for several traces. Table 2 shows the elapsed times measured by the simulators for the xds and synth traces described below. Remaining differences between the simulators are consistent with the differences in the disk models. We report here results for all algorithms obtained using the UW simulator.

    xds elapsed times (secs)
             CMU simulator       UW simulator
    disks    F.H.     Agg.       F.H.     Agg.
    1        63.3     61.6       65.6     63.7
    2        36.9     34.1       38.0     34.3
    3        33.6     33.9       36.2     33.7
    4        33.8     35.1       34.2     35.1
    5        33.0     34.2       33.5     34.4

    synth elapsed times (secs)
             CMU simulator       UW simulator
    disks    F.H.     Agg.       F.H.     Agg.
    1        213.0    168.5      201.4    155.8
    2        136.3    126.9      130.9    121.7
    3        118.9    149.5      118.9    150.4
    4        118.9    150.4      118.9    150.1

Table 2: Comparison of the simulators on the xds and synth traces.

In our simulations, we ignore write operations. Write performance is less critical to I/O performance since the application generally does not have to wait for the disk to be written. Moreover, the impact this has on the results is small since most of the references in our traces are reads. We simulated disk arrays of sizes 1-8, 10, 12, and 16. Most of our figures show a smaller range of sizes, however. In each case, the performance with a larger number of disks is the same as that with the largest number of disks shown.

3.1 File access traces

We used a set of traces collected on a DECstation 5000/200. The running time of all the applications is dominated by disk read accesses. Each trace consists of a sequence of file block read requests in the order they were issued, and the sequence of measured process compute times between read requests, of a single execution thread. We used an I/O driver overhead of .5ms per I/O operation, which is typical of the 5000/200. The applications are:

cscope[1-3]: an interactive C-source examination tool written by Joe Steffen, searching for eight symbols (cscope1) in an 18MB software package, searching for four text strings (cscope2) in the same 18MB software package, and searching for four text strings (cscope3) in a 10MB software package. With multiple queries, cscope will read multiple files sequentially multiple times.

dinero: a cache simulator written by Mark Hill. This application reads one file sequentially multiple times.

glimpse: a text information retrieval system from the University of Arizona, searching for four keywords in a 40MB snapshot of news articles. It builds approximate indexes for words to allow both relatively fast search and small index files. The result is that the index files are accessed repeatedly, whereas the data files are accessed infrequently.

postgres-join: the Postgres relational database system developed at the University of California at Berkeley, performing a join between an indexed 32MB relation and a non-indexed 3.2MB relation. The relations are those used in the Wisconsin Benchmark [3]. Since the result relation is small, most of the file accesses are reads. Here, the index blocks are accessed much more frequently than the data blocks.

postgres-select: the Postgres relational database system executing a selection query of choosing 2% of the tuples from an indexed 32MB relation. The selection query is part of the Wisconsin Benchmark suite [3] and uses indexed search.

ld: the Ultrix link-editor, building the Ultrix 4.3 kernel from about 25MB of object files.

xds: a 3-D data visualization program, XDataSlice, generating 25 planar slice images at random orientations from a 64MB data file.

Finally, we used a synthetic trace synth containing 50 passes through a loop of 2000 sequential blocks. Compute times between read requests were generated according to a Poisson distribution with a 1 ms mean.
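For illustration, a trace with the same shape as synth could be generated roughly as follows (a Python sketch under our own assumptions about representation; the paper does not describe its trace format or generator beyond the description above):

    import math
    import random

    def poisson(rng, lam):
        """Draw a Poisson(lam) sample by inversion (Knuth's method)."""
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    def make_synth_trace(passes=50, loop_blocks=2000, mean_compute_ms=1.0, seed=0):
        """Return (block_id, compute_time_ms) pairs: `passes` sequential sweeps
        over the same `loop_blocks` blocks, with compute times between reads
        drawn from a Poisson distribution with the given mean (here 1 ms)."""
        rng = random.Random(seed)
        return [(block, poisson(rng, mean_compute_ms))
                for _ in range(passes) for block in range(loop_blocks)]

With the default parameters this yields the 100,000 references listed for synth in table 3.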
    trace             reads    distinct blocks   compute time (sec)
    dinero             8867       986                103.5
    cscope1            8673      1073                 24.9
    cscope2           20206      2462                 37.1
    cscope3           30200      3910                 74.1
    glimpse           27981      5247                 38.7
    ld                 5881      2882                  8.2
    postgres-join      8896      3793                 11.5
    postgres-select    5044      3085                 79.2
    xds               10435      5392                 30.8
    synth            100000      2000                 99.9

Table 3: Trace summary data.

Table 3 shows the length (number of read requests), number of distinct blocks requested, and total application compute times for each of the traces. The cache size was set to be 10MB (or K=1280 blocks of 8 kbytes each) for all traces except dinero and cscope1. These traces contain references to fewer than 1280 distinct blocks. For these traces, the cache size was reduced to 4MB (512 blocks). We assume the cache to be empty (or to contain some other application's data) when the traced application starts. The entire cache is available to the traced application.

3.2 Data placement and disk head scheduling

The data was striped across the array using a one-block stripe unit. Some of our traces represented block numbers by (file,offset) pairs; for these we chose a random starting point within a group of 8550 8kbyte blocks (which occupy 100 cylinders on the HP 97560) for each file, corresponding to typical file system clustering mechanisms. The maximum seek time within a group of 100 cylinders is 7.24ms. Thus, in our simulations the average response time is typically lower than the 22.8ms listed in table 1. Other traces referred to logical filesystem block numbers; for these traces we used the actual block number for each access. Except where noted, we use CSCAN disk head scheduling.

4 Results

In the following sections, we examine the behaviors of the algorithms in detail. We begin by comparing the performance of the algorithms with that of demand fetching. We then examine the algorithms' performance on the synthetic trace, an easily understood access pattern that illustrates the key differences in behavior between the algorithms. Next we examine performance on the application traces, and explore the effects on the results of changes in various simulation parameters.

4.1 Comparison with demand fetching

Figure 2: Performance on the postgres-select trace. Each group of bars represents the performance of the four algorithms: optimal demand fetching, fixed horizon, aggressive, and reverse aggressive, in left-to-right order.

In order to make this comparison as favorable as possible to demand fetching, we use the optimal offline replacement policy: whenever a block is fetched, the block in the cache whose next reference is furthest in the future is replaced. Figure 2 shows the elapsed times of the three prefetching algorithms and of optimal demand fetching on the postgres-select trace for varying numbers of disks between one and sixteen. The elapsed times are divided into three components: process compute time, I/O driver overhead (processor) time, and the time the processor spends idle, stalling on I/O. From this figure we see that (1) all three prefetching algorithms significantly outperform optimal demand fetching, and (2) the three prefetching algorithms achieve near linear reduction in I/O overhead until the applications become compute-bound. These two behaviors are consistent across all the applications we have studied.

4.2 Fundamental differences

Figure 3: Performance on the synth (left) and cscope1 (right) traces. Each group of bars represents the performance of the three algorithms: fixed horizon, aggressive, and reverse aggressive, in left-to-right order.
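As a sanity check on the fetch counts discussed next for the synth trace (a 2000-block loop, 50 passes, K = 1280 cache blocks), a couple of lines of arithmetic reproduce the minimum fetch count quoted below; the per-pass accounting (720 = 2000 - 1280 blocks that cannot stay cached after the first pass) is our reading of the numbers in the text, not a statement from the paper:

    # Minimum fetches for synth: every block misses on the first pass, and each
    # of the remaining 49 passes misses the 2000 - 1280 = 720 blocks that even
    # optimal replacement cannot keep resident.
    loop_blocks, passes, cache_blocks = 2000, 50, 1280
    min_fetches = loop_blocks + (passes - 1) * (loop_blocks - cache_blocks)
    assert min_fetches == 37280   # matches the figure quoted for optimal demand fetching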
The synthetic trace is used to examine the algorithms' behavior on a simple, known sequence in order to gain insight into the algorithms' performance. This trace shows the relative behaviors typical of the three algorithms in exaggerated form. Figure 3 summarizes the results for one to four disks. The sequential accesses allow excellent performance from the disks; average response times are between 3 and 4 ms. In each case, fixed horizon performs 38000 fetches, 720 more than the minimum possible 37280 performed by optimal demand fetching. (The total sequence length is 100,000). With a single disk, the synthetic application is I/O bound. Fixed horizon's conservative prefetching strategy reduces I/O stalling relative to demand fetching, but not as much as aggressive's and reverse aggressive's more aggressive strategies. After each pass through the loop under fixed horizon, the cache contains 1280 sequential blocks and the other 720 blocks in the sequence are not cached. The clustering of the 720 missing blocks allows good disk performance; however, the clustering of the 1280 cached blocks causes fixed horizon to leave the disk idle until the last H cached blocks are being read. Aggressive and reverse aggressive perform 39240 and 39265 fetches, respectively, slightly more fetches than fixed horizon's 38000, resulting in a small difference in driver overhead. However, they are able to eliminate much of the I/O stall time by prefetching distant blocks and thus not idling the disk appreciably. With two disks, fixed horizon is able to eliminate most of the stall time, without increasing the total number of fetches. Aggressive has nearly eliminated stall time completely, but at a higher driver cost due to its increased number (41902) of fetches. Reverse aggressive is between fixed horizon and aggressive in stall time; it performs 42000 fetches. Elapsed times are similar under all three algorithms. This case marks the transition from I/O-boundedness to compute-boundedness. With three disks, stall time has been eliminated completely by all three algorithms. Aggressive uses the excess I/O bandwidth to prefetch and subsequently evict every block for every reference. In fact, because aggressive is willing to prefetch significantly ahead on one disk relative to others, it wastes 994 fetches, replacing a prefetched block from the cache before it is used in order to fetch a block on a different disk that will be needed sooner. Fortunately, this effect does not increase as the number of disks increases since with increasing I/O bandwidth, aggressive's prefetching becomes so successful that every fetch is to the first missing block in the future. Such a block can never be replaced before it is used, since that would violate the do-no-harm rule. Nonetheless, the elimination of stall time by aggressive comes at a high cost: the driver overhead for the extra fetches pushes aggressive's elapsed time higher than the two-disk case. In contrast, fixed horizon prefetches far enough ahead to serve all requests without stall, but no farther. Dedicating at most H buffers to prefetching, fixed horizon is able to eliminate stalling altogether without any additional fetches. Reverse aggressive performs 37907 fetches, fewer than fixed horizon, also eliminating stall time. 4.3 Application traces The application traces show differences among the three algorithms similar to those shown by the synthetic trace, but less pronounced. 
The right portion of figure 3 shows the performance of the three algorithms on the CPU-bound cscope1 trace. The behavior here is similar to that for the synthetic trace: aggressive eliminates stalling but issues too many fetches, resulting in greater driver overhead.

Figure 4: Performance on the ld trace.

At the I/O-bound end of the spectrum, figure 4 shows a detailed breakdown of the performance of the three algorithms on the ld trace, from one to sixteen disks. With one disk, all three algorithms are I/O bound and have comparable performance. From two to eight disks, the more aggressive prefetching of aggressive and reverse aggressive results in somewhat less stalling than fixed horizon. At ten disks, fixed horizon's performance matches aggressive's. Beyond this point, the tradeoff between excessive stalling caused by leaving disks idle, and excessive driver overhead caused by prefetching aggressively, favors fixed horizon over aggressive. The other traces reflect similar trends, with different points of crossover: above five disks for postgres-select, glimpse, and cscope2, and below five disks for postgres-join, dinero, cscope1, and xds.

Figure 5: Performance on the cscope3 trace.

An exception to the generally best performance of reverse aggressive is the cscope3 trace, shown in figure 5. Note that reverse aggressive's performance is much worse than aggressive's with one disk. This is a case in which the differences between the theoretical model and the simulation model affect the performance of reverse aggressive. Recall that since reverse aggressive is offline, it generates a complete schedule based on its estimate of F. When it uses a smaller estimate of F, each fetch is assumed to complete earlier (relative to the inter-reference compute time) and therefore reverse aggressive generates a more aggressive prefetching schedule that keeps the disk(s) busier. When it uses a larger estimate of F, each fetch is assumed to take longer, and therefore reverse aggressive must delay the scheduling of subsequent fetches in the sequence, thus generating a more conservative prefetching schedule. In our implementation of reverse aggressive, the single best estimate of F is used for each trace. On traces with large variation in inter-reference compute times, any single estimate of F will be either too small or too large for some parts of the trace. This is the case for cscope3 -- examination of the trace reveals that the inter-reference compute times are bursty. Runs of compute times near 1ms are interspersed with runs of times around 7ms. Since the average fetch time on this trace with one disk is about 8ms, the ratio of fetch time to compute time (the "true" value of F) varies from about 1 to about 8. In fact, with a single disk, aggressive has the same theoretical performance bounds as reverse aggressive. It is not surprising that aggressive's inherent adaptivity to varying fetch times and compute times should give it an advantage over reverse aggressive in this case. This effect is noticeable, but less pronounced, on the synth trace as well. On the remaining traces, reverse aggressive's elapsed time varies from 3.6% worse to 10.7% better than the superior of fixed horizon and aggressive in any given configuration. For the full data, see [17].

Table 4 shows the utilization of the disks (averaged over the disks when there are more than one) for demand fetching and the three prefetching algorithms on the postgres-select trace.
For moderate numbers of disks, aggressive places the greatest load on the disks, followed by reverse aggressive and then fixed horizon; demand fetching places the least load on the disks. With a very high degree of disk parallelism, reverse aggressive's offline schedule places even less load on the disks than fixed horizon's conservative strategy.

    disks   demand fetching   fixed horizon   aggressive   reverse aggressive
    1            .82               .98            .99             .98
    2            .41               .90            .92             .92
    3            .27               .82            .87             .85
    4            .20               .72            .81             .80
    5            .16               .66            .70             .69
    6            .13               .58            .63             .60
    7            .12               .50            .62             .50
    8            .10               .45            .56             .42
    10           .08               .36            .43             .35
    12           .07               .30            .35             .28
    16           .05               .22            .26             .21

Table 4: Disk utilization on the postgres-select trace.

4.4 Varying parameters

The performance of the algorithms depends on a set of parameters which interact in complicated ways with the applications' access patterns and inter-reference compute times, the layout of data on disks, the disk-scheduling discipline, and the characteristics of the disks. In this section, we explore the behavior of the algorithms when some of these parameters are varied. For lack of space, we present general observations and only a small portion of the data. For the full data, see [17]. We have already described most of the primary effects that explain what we see. These are:

scheduling: an increase in the number of outstanding fetches issued by a prefetching algorithm results in increased latitude to reorder fetches and thus reduced disk response times. This effect is strongest in I/O-bound situations.

out-of-order fetching: reordering of fetches can increase stall penalties when early missing blocks are fetched after later missing blocks. This effect is strongest in CPU-bound situations where any stall penalty is costly. When there is significant stalling, this effect is masked by other stalls and compensated for by the reduced average response time.

early replacement: as prefetching becomes more aggressive, inferior replacement choices are made, leading to more fetches and in many cases, an increase in elapsed time.

limited aggressiveness: the extent to which an algorithm can prefetch is limited by the do no harm rule.

Disk-head scheduling

The results shown in the previous section were obtained using CSCAN disk-head scheduling. CSCAN was used rather than SCAN since the HP 97560 contains a readahead buffer; CSCAN always scans in the same direction that the disk reads, improving the hit rate in the readahead buffer. We compared the performance impact of CSCAN disk-head scheduling versus FCFS scheduling. Relative to FCFS, CSCAN improves the performance of reverse aggressive the most, up to 24%, and that of fixed horizon the least, up to 15%. For aggressive, the greatest benefit was 19%. Because of out-of-order fetching, CSCAN sometimes degrades performance slightly relative to FCFS in compute-bound situations. This effect is strongest for fixed horizon since it issues fetches later than they are issued by the other algorithms. The maximum degradation we observed is 3.6% (for fixed horizon with six disks on the glimpse trace). Table 5 shows the performance benefit of CSCAN scheduling relative to FCFS on the postgres-select trace for all three algorithms with 1-16 disks.

    disks   fixed horizon   aggressive   reverse aggressive
    1           14.9           19.2            24.0
    2            4.85          11.3            22.1
    3            2.59           8.36           19.9
    4            0.53           3.59            6.71
    5           -0.62          -0.77            0.0
    6           -0.68          -0.31            0.0
    7           -2.15          -0.45            0.0
    8           -0.42          -0.17            0.0
    10          -0.05           0.09            0.0
    12           0.0            0.11            0.0
    16           0.0            0.0             0.0

Table 5: Percentage improvement of CSCAN over FCFS on the postgres-select trace.
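For reference, CSCAN service order over a queue of pending requests can be sketched as follows (our simplified illustration; a real driver schedules by cylinder within the disk or driver and must cope with requests that arrive while a sweep is in progress):

    def cscan_order(pending, head_pos):
        """Return pending requests in CSCAN service order.
        pending: list of (cylinder, request) pairs; head_pos: current cylinder.
        The head sweeps in one direction (increasing cylinders), then wraps
        around to the lowest outstanding cylinder and sweeps again."""
        ahead  = sorted((r for r in pending if r[0] >= head_pos), key=lambda r: r[0])
        behind = sorted((r for r in pending if r[0] <  head_pos), key=lambda r: r[0])
        return ahead + behind

    # Example: cscan_order([(70, "a"), (10, "b"), (55, "c")], head_pos=50)
    # returns [(55, "c"), (70, "a"), (10, "b")]

Because the sweep direction never reverses, requests can be serviced in an order quite different from the order in which the algorithms issued them, which is the out-of-order fetching effect discussed above.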
The batch size used by aggressive

Figure 6: Performance of aggressive on the cscope2 trace, as a function of the batch size.

Figure 6 shows the effect of varying aggressive's batch size on the cscope2 trace. For each number of disks, performance initially improves with increasing batch size due to improved scheduling. For example, for one disk, the average fetch time drops from 10.4ms to 8.4ms as the batch size increases from 4 to 160. Eventually, out-of-order fetching and early replacement become more important and performance drops off again. For example, for one disk the number of fetches increases from 6771 to 9806 as the batch size increases from 160 to 1280. As the number of disks increases, the variation in performance with batch size diminishes, and the best batch size shifts to a smaller value. This is because in more compute-bound situations, out-of-order fetching and limited aggressiveness are the dominant factors. Because of limited aggressiveness, the number of fetches increases only from 11325 to 11399 as batch size increases from 160 to 1280 with 5 disks. Although the optimal batch size decreases with the number of disks for all the traces, it varies significantly from trace to trace. For example, for the xds trace, the optimal batch size for one to three disks was 16, and for four or more was 4. All the results for aggressive presented in section 4.3 were obtained using the batch sizes given in table 6. The performance of aggressive with these fixed batch sizes is on average 0.7% worse (and at most 11% worse) than its performance with the best batch size for the configuration.

    disks        1    2    3    4    5    6    7    >7
    batch size   80   40   40   16   16   8    8    4

Table 6: Batch sizes used for aggressive.

Prefetch horizon

Figure 7: Performance of fixed horizon as a function of the prefetch horizon H on the cscope1 (left) and cscope2 (right) traces.

The left side of figure 7 shows the effect of varying fixed horizon's prefetch horizon H on the cscope1 trace. We see that for each number of disks, performance deteriorates with increasing H (except on one disk, where it improves slightly until H=64 is reached). This is due to out-of-order fetching and early replacement. For example, with 1 disk, earlier replacements cause the number of fetches to increase from 4959 with H=64 to 8535 with H=2048. Out-of-order fetching accounts for all the stall time with 2 and 3 disks when H >= 512; using FCFS scheduling this stall time is eliminated. On the more I/O-bound traces such as cscope2, also shown in figure 7, we find a significant initial performance improvement with increasing H because the more aggressive prefetching eliminates stalling. Only at very large values of H does performance decline again.

The parameters used by reverse aggressive

We experimented with the batch size and fixed value of F used by reverse aggressive to construct its schedule on its reverse pass over the request sequence, as well as the batch size used on the forward pass. Since we use reverse aggressive only as a benchmark against which to compare the other algorithms, the main purpose of these experiments was to determine the optimal configuration (choice of F and batch sizes) for each trace, for each number of disks. These experiments show that, as with aggressive, a smaller (resp. larger) batch size benefits a more compute-bound (resp. I/O-bound) application.
Recalling that as reverse aggressive's estimate of F decreases, it becomes increasingly aggressive, we similarly find that a smaller (resp. larger) value of F benefits a more I/O-bound (resp. compute-bound) application.

Processor speed and cache size

In order to assess the impact of improved CPU performance relative to disk performance, we ran our trace-driven simulations assuming a processor twice as fast. For these tests, fixed horizon's prefetch horizon H was doubled to 124. The results are entirely unsurprising: faster processors are more dependent on I/O performance so that the payoff of using multiple disks and prefetching is increased. In addition, since a larger number of disks is needed to eliminate I/O overhead, the point at which the tradeoffs begin to favor fixed horizon over aggressive is shifted to a larger number of disks. This behavior was consistent across the applications. In order to assess the impact of cache size on performance, we ran our trace-driven simulations with varying cache sizes: 640, 1280, and 1920 blocks. As cache size increases, the performance of all the algorithms improves. In I/O-bound cases, a larger cache improves aggressive's and reverse aggressive's performance more than fixed horizon's since they prefetch more aggressively. In more compute-bound cases, aggressive's excessive driver overhead penalizes it even more with a larger cache, so that fixed horizon's performance relative to aggressive improves slightly as cache size increases. This is illustrated in table 7, which shows the performance of fixed horizon relative to aggressive as a percentage difference, as a function of the cache size and the number of disks, on the glimpse trace.

    cache size   1 disk   2 disks   4 disks   8 disks   16 disks
    640            6.0      14.7      24.8       7.3      -2.6
    1280          11.3      20.2      24.5       5.7      -3.8
    1920          13.8      25.0      21.7       5.7      -3.8

Table 7: Elapsed time of fixed horizon relative to aggressive (percentage difference), as a function of the cache size and number of disks, on the glimpse trace.

5 A new approach

We have designed a new algorithm, forestall, attempting to combine the best features of all three previously described algorithms: the good performance of reverse aggressive regardless of I/O-boundedness or compute-boundedness, and the simplicity and implementability of fixed horizon and aggressive. Forestall tries to avoid stalling while still making good (late) replacement decisions by estimating the point at which it needs to begin prefetching in order to prevent stalling. It makes this estimate based on its current cache state. Returning to the theoretical model, suppose that there is a single disk, and that at some point during the servicing of the request sequence, the cache contains the next 2F-1 blocks requested. (Recall that in the theoretical model, the inter-reference CPU time is taken to be 1 time unit, and the time to fetch a block from disk is F time units.) Further suppose that the subsequent two requests are missing from the cache. Aggressive immediately starts fetching and avoids stalling on the missing blocks, bringing the second missing block into the cache at time 2F -- just in time to serve the request without stalling. Fixed horizon leaves its disk idle until the cursor is within F requests of the first missing block; it stalls F-1 steps on the second missing block. In contrast, suppose there is only one missing block at a distance of 2F-1 from the cursor. In this case, aggressive will fetch immediately and make a possibly inferior replacement choice.
Fixed horizon waits until its cursor is within F steps of the missing block, and prefetches just early enough to avoid stalling; in the intervening time, the application may have finished with a block whose next reference is later in the sequence (or never occurs), making it a better replacement choice than the block aggressive evicted. Forestall behaves like aggressive in the first case and like fixed horizon in the second.

For each i ≥ 1, let d_i denote the distance from the cursor to the i-th missing block in the request sequence. For any i ≥ 1, if iF > d_i, processing will surely stall on the i-th missing block or some earlier missing block: it takes iF time units to fetch the first i missing blocks, while at most the next d_i requests can be served concurrently with those fetches. Forestall initiates a prefetch, according to the optimal fetching and optimal replacement rules, whenever iF ≥ d_i for some i and the do-no-harm rule allows it.

Practical considerations

Like the other algorithms, forestall requires modifications to account for differences between the theoretical model and real systems. Requests need to be issued in batches in order to reduce average disk access times. The ratio F of disk response time to interaccess time is not constant and must be estimated. In our implementation, we estimate F by tracking recent disk response times and compute times: F is computed dynamically on a per-disk basis as the ratio between the sum of the most recent 100 disk access times and the sum of the most recent 100 interreference CPU times.

Just as the prefetch horizon H needed to overestimate F for fixed horizon to perform adequately, forestall's performance depends on overestimating F in certain situations. We denote by F' the overestimate of F used by forestall. We evaluated forestall's performance with different values of the parameter F', and found that the best choice of F' depended on the per-trace average disk access times. For traces with small average disk access times, in the 3-4ms range, it was best to take F' = F. For traces with larger average disk access times, it was best to take F' = 4F. This is not hard to explain. Traces with disk access times in the 3-4ms range must contain a great deal of sequential access, so that most requests hit in the disk's readahead cache and are served by the CSCAN scheduler in the order in which they are received. When this happens, it is not necessary to prefetch aggressively. When disk access times are large, the access pattern is more complicated and the disk access times are more varied; forestall's mechanism for deciding when to prefetch then benefits from overestimating the potential to stall, which smooths out the variation and avoids stalls caused by CSCAN's reordering of requests. Our implementation of forestall adapts to the observed disk access times, using the smaller value of F' when average disk access times are small (less than 5ms) and the larger value when they are large.

Finally, because of the reordering of requests by CSCAN, we found it necessary to add fixed horizon's rule of issuing a fetch whenever the cursor is within H requests of a missing block. This avoids stalling on reordered requests in situations in which the iF' ≥ d_i rule delays fetching until the cursor is very near the first missing block.
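To make the decision rule concrete, the following is a minimal sketch, in C, of forestall's stall-forecasting test for a single disk, combining the dynamic estimate of F, the iF' ≥ d_i rule, and fixed horizon's fallback rule described above. It is not taken from our simulator; the function and variable names (estimate_F, should_prefetch, disk_ms, cpu_ms) and the tiny four-sample history in main are illustrative assumptions, and the sketch uses the estimated F directly as F' rather than scaling it up to 4F for slow disks.

    #include <stddef.h>
    #include <stdio.h>

    /* Estimate F as the ratio of the sums of the most recent n disk access
     * times and the most recent n inter-reference CPU times (the
     * implementation described above uses n = 100, per disk). */
    static double estimate_F(const double *disk_ms, const double *cpu_ms, size_t n)
    {
        double disk = 0.0, cpu = 0.0;
        for (size_t k = 0; k < n; k++) {
            disk += disk_ms[k];
            cpu  += cpu_ms[k];
        }
        return cpu > 0.0 ? disk / cpu : 1.0;
    }

    /* d[i] is the distance d_{i+1} from the cursor to the (i+1)-th missing
     * block, for the missing blocks within the lookahead window.
     * Returns nonzero if a prefetch should be initiated now. */
    static int should_prefetch(const size_t *d, size_t nmissing,
                               double F_prime, size_t H)
    {
        for (size_t i = 0; i < nmissing; i++) {
            /* forestall's rule: fetching the first i+1 missing blocks takes
             * (i+1)*F' time units, but at most d_{i+1} requests can be served
             * concurrently with those fetches, so a stall is forecast. */
            if ((double)(i + 1) * F_prime >= (double)d[i])
                return 1;
            /* fixed horizon's fallback: fetch once the cursor is within H
             * requests of a missing block (guards against CSCAN reordering). */
            if (d[i] <= H)
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        /* hypothetical recent history: disk accesses take 8x the compute time */
        double disk_ms[4] = { 8.0, 9.0, 7.0, 8.0 };
        double cpu_ms[4]  = { 1.0, 1.0, 1.0, 1.0 };
        double F_prime = estimate_F(disk_ms, cpu_ms, 4);   /* = 8.0 */

        /* missing blocks 10 and 14 requests ahead of the cursor; H = 4 */
        size_t d[2] = { 10, 14 };
        /* 2*F' = 16 >= d_2 = 14, so a stall is forecast and we prefetch now */
        printf("F' = %.1f, prefetch now? %d\n", F_prime,
               should_prefetch(d, 2, F_prime, 4));
        return 0;
    }

When this test fires, the block fetched and the block evicted are still chosen by the optimal fetching and optimal replacement rules, and the do-no-harm rule can still veto the fetch, as described above.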
Rather than using complete lookahead information, our implementation of forestall evaluates the condition iF' ≥ d_i only for those missing blocks within distance 2K of the cursor, where K is the cache size. We have not experimented with different values of this parameter, nor with variations of the history length of 100 used to track fetch times and application compute times. Forestall's dependence on batch size is similar to aggressive's; we used for forestall the batch sizes given in table 6.

To compare static and dynamic estimation of forestall's parameter F', we compared the performance of forestall using fixed values of 1, 2, 4, 8, 15, 30, and 60 for F' to its performance using the dynamic estimation just described. Because the actual average inter-reference compute times in our traces vary greatly (from 1.3ms for postgres-join to 15.7ms for postgres-select), no single value can work well for all traces. The values with the least maximum degradation relative to the dynamic algorithm, over all traces and disk array sizes, are 30 and 60; performance is at most 6.8% worse than with the dynamic estimation (on the cscope1 trace with 2-6 disks). We exclude the synth trace from this calculation, since its artificially low disk response times demand a very low value of F'. For each trace, there is a fixed choice of F' that works well across all disk array sizes. This best choice varies from 1 for the dinero trace to 60 for the glimpse trace. The maximum degradation relative to the dynamic estimation when allowing this much flexibility is 1.4% (for the ld trace with 7 disks and F'=30). Finally, if we choose the best fixed value for each trace and each disk array size, the maximum degradation relative to the dynamic estimation is 1.2%, for the cscope1 trace with one disk and F'=2; for all other traces and array sizes, as well as all other array sizes for this trace, the degradation is less than 0.5%.

These results indicate that choosing the right parameter across workloads is more important than tuning the parameter within a particular workload. Furthermore, forestall's performance even with a single fixed parameter over all workloads and array sizes is always within 7% of optimal, and is almost always within 4% of optimal. This suggests that the advantages of forestall are due to its estimation of and adaptivity to upcoming disk load, rather than to the dynamic nature of its fetch-time and compute-time estimates.

5.1 Performance of forestall

Figure 8: Performance on the synth (left) and xds (right) traces. Each group of bars represents the performance of the three algorithms fixed horizon, aggressive, and forestall, in left-to-right order.

Figure 8 shows the performance of the three practical algorithms, fixed horizon, aggressive, and forestall, on the synth and xds traces. Forestall behaves exactly as expected. In I/O-bound situations, it prefetches aggressively enough to perform as well as or even better than aggressive. In CPU-bound situations, it becomes more conservative in its prefetching and has lower driver overhead, matching the performance of fixed horizon.

Figure 9: Performance on the cscope2 trace.

Figure 10: Performance on the glimpse trace.

Figures 9 and 10 show the performance of the three algorithms on the cscope2 and glimpse traces. Once again, forestall has the best performance of the three practical algorithms.
On all remaining traces, over all configurations, forestall's performance was between 2% worse and 5.8% better than the best of aggressive and fixed horizon in that configuration. For the full data, see [17].

Table 8 shows the utilization of the disks by forestall on the postgres-select trace. Its utilization falls between those of aggressive and fixed horizon, as expected. Moreover, in I/O-bound situations, it places a load on the disks similar to aggressive's; in compute-bound situations, it places a lower load on the disks, similar to that of fixed horizon.

disks    1     2     3     4     5     6     7     8     10    12    16
util.   .99   .92   .87   .81   .68   .63   .62   .54   .39   .30   .32

Table 8: Utilization of disks by forestall on the postgres-select trace.

6 Conclusions

This paper presents the results of a trace-driven simulation study of integrated prefetching and caching algorithms on a single read-only access sequence, assuming that all accesses are known in advance. We studied four algorithms: aggressive, fixed horizon, reverse aggressive, and forestall. We found that the theoretically near-optimal reverse aggressive usually has the best performance of the four algorithms, but that, perhaps surprisingly, it was never much better than the best of the other algorithms. This shows that carefully choosing replacements in order to balance the load across the disks is not necessary when the data is well laid out. We found that each of aggressive and fixed horizon performs well under the conditions for which it was designed, and that in any given situation, one or the other performs similarly to reverse aggressive. Clearly, aggressive and fixed horizon are much more practical algorithms than reverse aggressive. These observations led us to the hybrid approach of forestall, which prefetches more aggressively in I/O-bound situations and more conservatively in compute-bound situations, resulting in nearly the best performance of the four in all configurations.

This study leaves several important issues unresolved. The performance of the algorithms depends on a set of parameters that interact in a complicated way with the applications' access patterns and inter-reference compute times, the layout of data on the disks, the disk-scheduling discipline, and the characteristics of the disks. At this time, we have no analytical basis for dynamically determining aggressive's batch size, fixed horizon's prefetch horizon H, reverse aggressive's batch sizes and estimate of F, or forestall's batch size and estimate F' of F. It is a challenging open problem to fully understand the interaction between the algorithmic parameters and the specific application and system characteristics.

Another direction for future research is the treatment of writes, both theoretically and experimentally. We have not considered the effects of incomplete or inaccurate hints, and we have not dealt with the question of how to allocate buffers among competing processes. While the three practical prefetching algorithms can easily be adapted to deal with these situations [5][26], we expect differences in their performance. Aggressive prefetching increases both disk utilization and cache utilization; disks are therefore more likely to be busy when unhinted accesses occur. Moreover, an aggressively prefetching process might consume too large a fraction of the cache relative to a nonhinting process. Since fixed horizon places the least load on the disks and the cache, it is likely to be least affected by unhinted accesses and to have the smallest impact on other executing processes.
Lastly, this work reaffirms that the operating system can effectively take advantage of hints. An important research direction is to determine how applications can easily provide such hints.

Acknowledgements

Tracy Kimbrel and Anna Karlin wish to thank Martin Tompa for his continued encouragement and advice. We could not have managed the production of this document without the help of Dylan McNamee. Karin Petersen, our paper shepherd, was of great help in improving the quality of the presentation. This research is supported in part by NSF grant numbers ECD-8907068, CCR-9301186, GER-9450075, and CCR-9632769, in part by DARPA contract numbers DABT63-94-C-0049 and DABT63-93-C-0054, in part by generous contributions from the member companies of the Parallel Data Consortium, and in part by Intel Corporation and Digital Equipment Corporation. Tracy Kimbrel is supported by an Intel Foundation Graduate Fellowship. Brian Bershad is supported by an NSF Presidential Faculty Fellowship. Ed Felten is supported by an NSF National Young Investigator Award. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of any supporting organization or the U.S. Government.

References

[1] A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz, and A. Sussman. Tuning the Performance of I/O-Intensive Parallel Applications. In Proceedings of the Fourth Annual Workshop on I/O in Parallel and Distributed Systems, pages 15-27, May 1996.

[2] L.A. Belady. A Study of Replacement Algorithms for Virtual Storage Computers. IBM Systems Journal, 5(2):78-101, 1966.

[3] Jim Gray. The Benchmark Handbook. Morgan Kaufmann, San Mateo, CA, 1991.

[4] Pei Cao, Edward Felten, and Kai Li. Application-Controlled File Caching Policies. In USENIX Summer 1994 Technical Conference, pages 171-182, June 1994.

[5] Pei Cao, Edward W. Felten, Anna Karlin, and Kai Li. Implementation and Performance of Integrated Application-Controlled Caching, Prefetching and Disk Scheduling. Technical Report TR-CS95-493, Princeton University, 1995.

[6] Pei Cao, Edward W. Felten, Anna Karlin, and Kai Li. A Study of Integrated Prefetching and Caching Strategies. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1995.

[7] Pei Cao, Edward W. Felten, and Kai Li. Implementation and Performance of Application-Controlled File Caching. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 165-178, November 1994.

[8] P.M. Chen and D.A. Patterson. Maximizing Performance in a Striped Disk Array. In Proceedings of the 17th Annual Symposium on Computer Architecture, pages 322-331, May 1990.

[9] H.T. Chou and D.J. DeWitt. An Evaluation of Buffer Management Strategies for Relational Database Systems. In Proceedings of the 19th International Conference on Very Large Data Bases, pages 127-141, Dublin, Ireland, 1993.

[10] Kenneth M. Curewitz, P. Krishnan, and Jeffrey S. Vitter. Practical Prefetching via Data Compression. In Proceedings of the 1993 ACM Conference on Management of Data (SIGMOD), pages 257-266, Washington, DC, May 1993.

[11] Carla Schlatter Ellis and David Kotz. Prefetching in File Systems for MIMD Multiprocessors. In Proceedings of the 1989 International Conference on Parallel Processing, pages 306-314, August 1989.

[12] R.J. Feiertag and E.I. Organick. The Multics Input/Output System. In Proceedings of the 3rd Symposium on Operating Systems Principles, pages 35-41, 1971.

[13] Jim Griffioen and Randy Appleton. Reducing File System Latency using a Predictive Approach. In USENIX Summer 1994 Technical Conference, pages 197-208, June 1994.

[14] John H. Howard, Michael Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1):51-81, February 1988.

[15] M. Kim. Synchronized Disk Interleaving. IEEE Transactions on Computers, 35(11):978-988, 1986.

[16] Tracy Kimbrel and Anna R. Karlin. Near-optimal Parallel Prefetching and Caching. In Proceedings of the 1996 IEEE Symposium on Foundations of Computer Science, October 1996.

[17] Tracy Kimbrel, Andrew Tomkins, R. Hugo Patterson, Brian Bershad, Pei Cao, Edward W. Felten, Garth A. Gibson, Anna R. Karlin, and Kai Li. A Trace-Driven Comparison of Algorithms for Parallel Prefetching and Caching. Technical Report UW-CSE-96-09-01, University of Washington, 1996.

[18] David Kotz and Carla Schlatter Ellis. Practical Prefetching Techniques for Multiprocessor File Systems. Journal of Distributed and Parallel Databases, 1(1):33-51, January 1993.

[19] David Kotz, Song Bac Toh, and Sriram Radhakrishnan. A Detailed Simulation Model of the HP 97560 Disk Drive. Technical Report PCS-TR94-220, Department of Computer Science, Dartmouth College, July 1994.

[20] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, August 1984.

[21] L.W. McVoy and S.R. Kleiman. Extent-like Performance from a UNIX File System. In Proceedings of the 1991 Winter USENIX Conference, pages 33-43, 1991.

[22] Michael N. Nelson, Brent B. Welch, and John K. Ousterhout. Caching in the Sprite Network File System. ACM Transactions on Computer Systems, 6(1):134-154, February 1988.

[23] Mark Palmer and Stanley B. Zdonik. Fido: A Cache That Learns to Fetch. In Proceedings of the 17th International Conference on Very Large Data Bases, pages 255-264, September 1991.

[24] D.A. Patterson, G. Gibson, and R.H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the ACM SIGMOD Conference, pages 109-116, June 1988.

[25] R. Hugo Patterson and Garth A. Gibson. Exposing I/O Concurrency with Informed Prefetching. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems, pages 7-16, September 1994.

[26] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed Prefetching and Caching. In Proceedings of the 15th Symposium on Operating Systems Principles, pages 79-95, December 1995.

[27] Chris Ruemmler and John Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, 27(3):17-28, March 1994.

[28] K. Salem and H. Garcia-Molina. Disk Striping. In Proceedings of the 2nd IEEE Conference on Data Engineering, pages 336-342, February 1986.

[29] Alan J. Smith. Second Bibliography on Cache Memories. Computer Architecture News, 19(4):154-182, June 1991.

[30] C. Tait and D. Duchamp. Service Interface and Replica Management Algorithm for Mobile File System Clients. In Proceedings of Parallel and Distributed Information Systems, pages 190-196. IEEE, 1991.