Application-Controlled File Caching Policies Pei Cao, Edward W. Felten, and Kai Li Department of Computer Science Princeton University Abstract We consider how to improve the performance of file caching by allowing user-level control over file cache replacement decisions. We use two-level cache management: the kernel allocates physical pages to individual applications (allocation ), and each application is responsible for deciding how to use its physical pages (replacement ). Previous work on two-level memory management has focused on replacement, largely ignoring allocation. The main contribution of this paper is our solution to the allocation problem. Our solution allows processes to manage their own cache blocks, while at the same time maintains the dynamic allocation of cache blocks among processes. Our solution makes sure that good user-level policies can improve the file cache hit ratios of the entire system over the existing replacement approach. We evaluate our scheme by trace-based simulation, demonstrating that it leads to significant improvements in hit ratios for a variety of applications. Introduction File caching is a widely used technique in today's file system implementations. Since CPU speed and memory density have improved dramatically in the last decade while disk access latency has improved slowly, file caching has become increasingly important. One major challenge in file caching is to provide high cache hit ratio. This paper studies an application-controlled file caching approach that allows each user process to use an application-tailored cache replacement policy instead of always using a global Least-Recently-Used (LRU) policy. Some applications have special knowledge about their file access patterns which can be used to make intelligent cache replacement decisions. For example, if an application knows which blocks it needs and which it does not, it can keep the former in cache and reduce its cache miss ratio. Traditionally such applications buffer file data in user address space as a way of controlling replacement. However, since the kernel tries to cache file data as well, this approach leads to double buffering, which wastes space. Furthermore, this approach does not give applications real control because the virtual memory system can still page out data in the user address space. Hence we need another way to let applications control replacement. To reduce the miss ratio, a user-level file cache needs not only an application-tailored replacement policy but also enough available cache blocks. In a multiprocess environment, the allocation of cache blocks to processes will thus affect the file cache hit ratio of the entire system. It is the kernel's job to ensure that the hit ratio of the whole system does not degrade because of the user-level management of cache replacement policies. The challenge is to allow each user process to control its own caching and at the same time to maintain the dynamic allocation of cache blocks among processes in a fair way so that overall system performance improves. This paper describes a scheme that achieves this goal. Our approach, called ``two-level block replacement'', splits the responsibilities of allocation and replacement between kernel and user level. A key element in this scheme is a sound allocation policy for the kernel, which is discussed in section~ allocation:policy . This allocation policy guarantees that an application-tailored replacement policy can improve the overall file system performance and that a foolish replacement policy in one application will not degrade the file cache hit ratios of other processes. We have evaluated our allocation policy using trace-driven simulation. In our simulations, we used several file access traces that we collected on a DEC 5000/200 workstation running the Ultrix operating system, and the Sprite traces from University of California at Berkeley. We have simulated our allocation policy and various replacement policies for individual application processes. The simulations show that an application-tailored replacement policy can reduce an application's file cache miss ratio up to 100 , over the global LRU policy. In addition, in a multiprocess environment, the combination of our allocation policy and application-tailored replacement policies can reduce the overall file cache miss ratios, over the traditional global file caching approach, by up to 50 . User Level File Caching Our goal is to allow user-level control over cache replacement policy. In many cases, the application has better knowledge about its future file accesses than the kernel has. User level control of cache replacement enables the application to use its better knowledge to improve the hit ratio. Despite the advantages of application control, we cannot simply move all responsibility for cache management to the user level. In a multiprogrammed system, the kernel serves a valuable function: managing the allocation of resources between users to guarantee the performance of the entire system. Two-Level Replacement two-level-replacement We propose a scheme for file caching that splits responsibility between the kernel and user levels. The kernel is responsible for allocating cache blocks to processes. Each user process is free to control the replacement strategy on its share of cache blocks; if it chooses not to exercise this choice, the kernel applies a default policy (LRU). We call our approach two-level cache block replacement . To be more precise, each file is assigned to a ``manager'' process, which is responsible for making replacement decisions concerning the file. Usually the process that currently has the file open is its manager; however, this process may designate another process to be the manager of a particular file. If several processes have the same file open simultaneously, then it is up to these processes to agree on a manager; if they cannot agree then the kernel imposes the default LRU policy for that file. Processes that do not want to control their own replacement policy can abdicate their management responsibility; in this case the kernel applies the default LRU policy for the affected files. The interactions between kernel and manager processes are the following: On a cache miss, the kernel finds a candidate block to replace, based on its global replacement policy (step 1 in Figure~ fig:forwarding-decision ). The kernel then identifies the manager process of the candidate. This manager process is given a chance to decide on the replacement (step 2). The candidate block is given as a hint, but the manager process may overrule the kernel's choice by suggesting an alternative block under that manager's control (step 3). Finally, the block suggested by the manager process is replaced by the kernel (step 4). (If the manager process is not cooperative, then the kernel simply replaces the candidate block.) figure=/u/pc/paper/pc/graphs/interaction.ps Interaction between kernel and user processes in two-level replacement: (1) P misses; (2) kernel consults Q for replacement; (3) Q decides to give up page B; (4) kernel reallocates B to P. fig:forwarding-decision figure The kernel's replacement policy is in fact an allocation policy. Suppose that process P's reference misses in the cache and the kernel finds a replacement candidate owned by process Q. Although process Q's user-level replacement policy decides which of its blocks will be replaced, the replacement will cause a deallocation of a block from process Q and an allocation of a block to process P. Kernel Allocation Policy The kernel allocation policy is the most critical part of two-level replacement. To obtain best performance, it is known that allocation should follow the dynamic partition principle~ coffman denning : each process should be allocated a number of cache blocks that varies dynamically in accordance with its working set size. Experience has shown that global LRU or its approximations perform relatively well; they approximate the dynamic partition principle or tend to follow processes' working set sizes. Our goal is to design an allocation policy for the kernel to guarantee that the two-level replacement method indeed improves system performance over the traditional global (or single level) replacement method when user processes are not making bad replacement decisions. To be more precise, let us call a decision to overrule the kernel wise if the alternative block is referenced before the candidate suggested by the kernel, and call the decision foolish if the candidate will be referenced first. (A decision not to overrule the kernel can be viewed as neutral.) The kernel's allocation policy should satisfy three principles: A process that never overrules the kernel does not suffer more misses than it would under global LRU. This ensures that ordinary processes, which are unwilling or unable to predict their own accesses, will perform at least as well as they did before. A foolish decision by one process never causes another process to suffer more misses. Of course, we cannot prevent a process from discarding its valuable pages. However, we must ensure that an errant or malicious process cannot hurt the performance of others. A wise decision by one process never causes any process, including itself, to suffer more misses. This ensures that processes have an incentive to choose wisely. (It goes without saying that wise decisions should actually improve performance whenever possible.) enumerate The main contribution of this paper is to propose and evaluate an allocation policy that satisfies these design principles. c@ 1ex c@ 1ex ccc@ 1ex c@ 1ex cc 2 c First try method 2 c Swapping position Time P's ref Q's ref Cache State Global LRU list Cache state Global LRU list A W X B A W X B Y to 6em to 6em A W X Y A W X Y Z to 6em to 2em Z W X Y A Z X Y A to 2em Z A X Y A Z X Y B to 2em to 2em Z A B Y A Z B Y tabular This example shows what's wrong with the first try and how to fix it with the swapping position mechanism. fig:need-swapping center figure* An Allocation Policy allocation:policy We will describe our allocation policy in an evolutionary fashion. We will start with a simple but flawed policy, and then diagnose and fix two problems with it. The result will be a fully satisfactory allocation policy. First Try To start, we can have an allocation policy that is literally the same as that of global LRU. The kernel simply maintains an LRU list of all blocks currently in the file cache. When a replacement is necessary, the block at the end of the LRU list is suggested as a candidate, and its owner process is asked to give up a block. The problem is that if the owner process overrules the kernel, the candidate block still stays at the end of the LRU list. On the next miss, the same process will again be asked to give up a block. The left side of Figure~ fig:need-swapping shows an example. Two processes, P and Q, share a file cache with four blocks. Process P uses blocks A and B; process Q uses blocks W, X, Y, and Z. The reference stream is Y, Z, A, B. The top line shows the initial LRU list, at time . The first reference, to Y at time , causes a replacement. The kernel consults its LRU list and suggests A for replacement. A's owner, process P, overrules the kernel. It decides to replace B, hoping to save the next miss to A. The LRU list is now as shown at time . The next reference, to Z at time , causes another replacement. Again, A is chosen as a candidate. This time P has no other blocks in the cache and hence must give up A. The LRU list is now as shown at time . At this point, the next two references, to A and B, both miss. There are four misses in this example, two misses by P and two by Q. But note that under global LRU, there would be only three misses, one by P and two by Q. This violates Principle 3: a wise decision by process P causes P to suffer one extra miss. Swapping Position sec:need-swapping The problem in the above scheme arises because the LRU list is maintained in strict reference order. Intuitively, the only use of the LRU list is to decide which process will give up a block upon a miss. To get the same allocation policy as the existing global LRU policy, our policy's LRU list should be in correspondence to the LRU list in the original algorithm. This can be achieved by swapping the blocks' positions in the LRU list. Suppose the kernel suggests A for replacement, but the user-level manager overrules it with B. At this point, the previous policy would simply replace B. The new policy first swaps the positions of A and B in the LRU list, then proceeds to replace B. As a result of this swap, A is no longer at the tail of the LRU list. Compared with the LRU list under global LRU (i.e. if A is replaced), the only difference is that A is in B's position. This fixes the problem with above example as in Figure~ fig:need-swapping . The right side of the figure shows what happens under the new policy. On the first replacement, A moves to the head of the LRU list before B is replaced. The result is that A is still in the cache when it is referenced. Process P is no longer hurt by its wise choice. In general, swapping positions guarantees that if no process makes foolish choices, the global hit ratio is the same as or better than it would be under global LRU. Unfortunately, this scheme does not guard against foolish choices made by user processes. This is illustrated by the left side of Figure~ fig:need-place-holder . Processes P and Q share a three-block cache. The top line shows the initial LRU list at time ; the reference stream is Z, Y, A. The first reference, to Z at time , causes a replacement. The kernel suggests X for replacement. Now suppose process Q makes exactly the wrong choice: it decides to replace Y. After swapping X and Y in the LRU list, the kernel replaces Y, leading to the LRU list as shown at time . The second reference, to Y at time , misses. The kernel suggests A for replacement, and process P cannot overrule because it has no alternative to suggest. Thus A is replaced, leading to the LRU list as shown at time . The third reference, to A at time , misses. cccccccc 2 c No Place-Holder 2 c With Place-Holder Time P's ref Q's ref Cache State Global LRU list Cache state Global LRU list X A Y X A Y Z to 4em to 4em X A Z X A Z Y to 4em X Y Z Y A Z A to 4em A Y Z Y A Z tabular This example shows why place-holders are necessary. fig:need-place-holder center figure* There are three misses in this example, one miss by P and two by Q. Under global LRU, there would be only one miss, by Q. Had Q not foolishly overruled the kernel, the last two references would both have hit in cache. Principle 2 is violated --- process Q's foolish decision causes process P to suffer an extra miss. Place-Holders The problem in the previous example arises because Q's choice enables it to acquire more cache blocks than it would have had under global LRU. As a result, some of P's blocks are pushed out of the cache, which increases P's miss rate. To satisfy Principle 2, we must prevent foolish processes from acquiring extra cache blocks. We achieve this by using place-holders . A place-holder is a record that refers to a page. It records which block would have occupied that page under the global LRU policy. Suppose the kernel suggests A for replacement, and the user process overrules it and decides to replace B instead. In addition to swapping the positions of A and B in the LRU list, the kernel also builds a place-holder for B to point to A's page. If B is later referenced before A, A's page can be confiscated immediately. This allows the cache state to recover from the user process's mistake. The right side of Figure~ fig:need-place-holder illustrates how place-holders work. The top line shows the initial LRU list at time . The first reference, to Z at time , misses. Block X is chosen as a candidate for replacement, but process Q (foolishly) overrules the kernel and chooses Y for replacement. X and Y are swapped in the LRU list, and Y is replaced. At this point, a place-holder is created, denoting the fact that the block occupied by X would have contained Y under global LRU. The second reference, to Y at time , misses. The kernel notices that there is a place-holder for Y --- Y ``should'' have been in the cache, but was not, due to a foolish replacement decision by Y's owner. The kernel responds to this situation by correcting the foolish decision: it loads Y into the page that Y's place-holder pointed to, replacing X. Note that in this case the normal replacement mechanism is bypassed. The LRU list is now as shown at time . The third reference, to A, hits. This example results in two misses, both by process Q. Under global LRU, there would have been only one miss, by Q. Q hurts itself by its foolish decision, but it does not hurt anyone else. Principle 2 is satisfied. Our Allocation Scheme Combining above two fixes, here is our full allocation scheme: If a reference to cache block hits, then is moved to the head of the global LRU list, and the place-holder pointing to (if there is one) is deleted. If the reference misses, then there are two cases. In the first case, there is a place-holder for , pointing to ; in this case is replaced and its page is given to . (If is dirty, it is written to disk.) In the second case, there is no place-holder for . In this case, the kernel finds the block at the end of the LRU list. Say that block , belonging to process , is at the end of the LRU list. The kernel consults to choose a block to replace. (The kernel suggests replacing .) Say that 's choice is to replace block . The kernel then swaps and in the LRU list. If there is place-holder pointing to , it is changed to point to ; otherwise a place-holder is built for , pointing to . Finally, 's page is given to . ( is written to disk if it is dirty.) We can prove that this algorithm satisfies all three of our design principles. (A detailed formal proof appears in placeholder:proof .) Our scheme has the property that it never asks a process to replace a block for another process more often than global LRU. In other words, whenever a process is asked to give up a block for another process, it would have already given up that block under global LRU. To see why, first notice that the place-holder scheme ensures that every process appears , in the view of other processes, never to unwisely overrule the kernel's suggestions. This is because whenever a process makes an unwise decision, only the errant process is punished and the state is restored as if the mistake were never made. Hence, from the allocator's point of view, every process is either doing LRU or is doing something better than LRU. Therefore we need only guarantee that those that are doing better than LRU are not discriminated against. Swapping position serves this purpose. That is, the allocator doesn't care which of a process's pages is holding which data, as long as its pages occupy the same positions on the LRU list that they would have occupied under global LRU. In summary, our framework for incorporating user level control into replacement policy is: consulting user processes at the time of replacement; swapping the positions of the block chosen by the global policy and the block chosen by the user process in the LRU list; and building ``place-holder'' records to detect and recover from user mistakes. This framework is also applicable to various policies that approximate LRU, such as FIFO with second chance, and two-hand clock levy:twohand-clock . Design Issues This section addresses various aspects of our scheme, including possible implementation mechanisms, treatment of shared files and interaction with prefetching. User-Kernel Interaction There are several ways to implement two-level replacement, trading off generality and flexibility versus performance. The simplest implementation is to allow each user process to give the kernel hints. For example, a user process can tell the kernel which blocks it no longer needs, or which blocks are less important than others. Or it can tell the kernel its access pattern for some file (sequential, random, etc). The kernel can then make replacement decisions for the user process using these hints. Alternatively, a fixed set of replacement policies can be implemented in the kernel and the user process can choose from this menu. Examples of such replacement policies include: LRU with relative weights, MRU (most recently used), LRU-K db:lru-k , etc. For full flexibility, the kernel can make an upcall to the manager process every time a replacement decision is needed, as in premo . Similarly, each manager process can maintain a list of ``free'' blocks, and the kernel can take blocks off the list when it needs them. The manager would be awakened both periodically and when its free-list falls below an agreed-upon low-water mark. This is similar to what is implemented in sechrest park . Combinations of these schemes are possible too. For example, the kernel can implement some common policies, and rely on upcalls for applications that do not want to use the common policies. In short, all these implementations are possible for our two-level scheme. We are still investigating which is best. Shared Files As discussed in Section~ two-level-replacement , concurrently shared files are handled in one of two ways. If all of the sharing processes agree to designate a single process as manager for the shared file, then the kernel allows this. However, if the sharing processes fail to agree, management reverts to the kernel and the default global LRU policy is used. Prefetching Under two-level replacement, prefetches could be treated in the same way as in most current file systems: as ordinary asynchronous reads. Most file systems do some kind of prefetching. They either detect sequential access patterns and prefetch the next block unix:ffs , or do cluster I/O mcvoy:cluster . Recent research has explored how to prefetch much more aggressively. In this case, a significant resource allocation problem arises --- how much of the available memory should the system allocate for prefetching? Allocating too little space diminishes the value of prefetching, while allocating too much hurts the performance of non-prefetch accesses. The prefetching system must decide how aggressively to prefetch. Our techniques do not address this problem, nor do they make it worse. The kernel prefetching code would still be responsible for deciding how aggressively to prefetch. We would simply treat the prefetcher as another process competing for memory in the file cache. However, since the prefetcher would be trusted to decide how much memory to use, our allocation code would provide it with a fresh page whenever it wanted one. Recent research on prefetching focuses on obtaining information about future file references hugo:tip . This information might be as valuable to the replacement code as it is to the prefetcher, as we discuss in the next section. Thus, adding prefetching may well make the allocator's job easier rather than harder. To facilitate the use of a sophisticated prefetcher, there can be more interaction between the allocator and the prefetcher. For example, the allocator could inform the prefetcher about the current demand for cache blocks; the prefetcher could voluntarily free cache blocks when it realized some prefetched blocks were no longer useful, etc. The details are beyond the scope of this paper. Simulation We used trace-driven simulation to evaluate two-level replacement. In our simulations the user-level managers used a general replacement strategy that takes advantage of knowledge about applications' file references. Two sets of traces were used to evaluate the scheme. Simulated Application Policies Our two-level block replacement enables each user process to use its own replacement policy. This solves the problem for those sophisticated applications that know exactly what replacement policy they want. However, for less sophisticated applications, is there anything better than local LRU? The answer is yes, because it is often easy to obtain knowledge about an application's file accesses, and such knowledge can be used in replacement policy. Knowledge about file accesses can often be obtained through general heuristics, or from the compiler or application writer. Here are some examples: Files are mostly accessed sequentially; the suffix of a file name can be used to guess the usage pattern of a file: ``.o'' files are mostly accessed in certain sequences, ``.ps'' files are accessed sequentially and probably do not need to be cached, etc. Compilers may be able to detect whether there is any lseek call to a file; if there is none, then it is very likely that the file is accessed sequentially. Compilers can also generate the list of future file accesses in some cases; for example, current work on TIP (Transparent Informed Prefetching) hugo:tip is directly applicable. The programmer can give hints about the access pattern for a file: sequential, with a stride, random, etc. itemize When these techniques give the exact sequence of future references, the manager process can apply the offline optimal policy RMIN: replace the block whose next reference is farthest in the future. Often, however, only incomplete knowledge about the future reference stream is known. For example, it might be known that each file is accessed sequentially, but there might be no information on the relative ordering between accesses to different files. RMIN is not directly applicable in these cases. However, the principle of RMIN still applies. We propose the following replacement policy to exploit partial knowledge of the future file access sequence: when the kernel suggests a candidate replacement block to the manager process, find all blocks whose next references are definitely (or with high probability) after the next reference to the candidate block; if there is no such block, replace the candidate block; else, choose the block whose reference is farthest from the next reference of the candidate block. enumerate Depending on the implementation of two-level replacement, this policy may be implemented in the kernel or in a run-time I/O library. Either way, the programmer or compiler needs to predict future sequences of file references. This strategy can be applied to common file reference patterns. For general applications, common file access patterns include: sequential : Most files are accessed sequentially most of the time; file-specific sequences : some files are mostly accessed in one of a few sequences. For example, object files are associated with two sequences: 1) sequential; 2) first symbol table, then text and data (used in link editing); filter : many applications access files one by one in the order of their names in the command line, and access each file sequentially from beginning to end. General filter utilities such as grep are representative of such applications; same-order : a file or a group of files are repeatedly accessed in the same order. For example, if ``*'' (for file name expansion) appears more than once in a shell script, it usually leads to such an access pattern; and access-once : many programs do not reread or rewrite file data that they have already accessed. itemize Applying our general replacement strategy, we can determine replacement policies for applications with these specific access patterns. Suppose the kernel suggests a block A, of file F, to be replaced. For sequential or file-specific , the block of F that will be referenced farthest in the future is chosen for replacement; for filter , the sequence of future references are known exactly, and RMIN can be applied; for same-order , the most recently accessed block can be replaced; and for access-once , any block of which the process has referenced all the data can be replaced. Simulation Environment We used trace-driven simulations to do a preliminary evaluation of our ideas. Our traces are from two sources. We collected the first set ourselves, tracing various applications running on a DEC 5000/200 workstation. The other set is from the Sprite file system traces from University of California at Berkeley sprite91 . We built a trace-driven simulator to simulate the behavior of the file cache under various replacement policies Our traces and simulator are available via anonymous ftp from ftp.cs.princeton.edu: pub/pc. . In our simulation we only considered accesses to regular files --- accesses to directories were ignored for simplicity, the justification being that file systems often have a separate caching mechanism for directory entries. We also assume that the file system has a fixed size file cache That is, we do not simulate the dynamic reallocation of physical memory between virtual memory and file system that happens in some systems. , with a block size of 8K. We validated our simulator using Ultrix traces. Our machine has a 1.6MB file cache. We can measure the actual number of read misses using the Ultrix ``time'' command. Our simulation results were within 3 of the real result except for link-editing, for which the simulator predicted 7 fewer misses. This is because the simulator ignores directory operations, which are more common in the link-editing application. To evaluate our scheme, we compared it with two policies: existing kernel-level global LRU without application control, and the ideal offline optimal replacement algorithm, RMIN. The former is used by most file systems, while the latter sets an upper bound on how much miss ratio can be reduced by improving the replacement policy. Our performance criterion is miss ratio: the ratio of total file cache misses to total file accesses. Results for Ultrix Traces We instrumented the Ultrix 4.3 kernel to collect our first set of traces. When tracing is turned on, file I/O system calls from every process are recorded in a log, which is later to fed to the simulator. Traces were gathered for three application programs, both when they were running separately and when they were run concurrently. The applications are: Postgres : Postgres is a relational database system developed at University of California at Berkeley postgres:4.1 . We used version 4.1. We traced the system running a benchmark from the University of Wisconsin, which is included in the release package. The benchmark takes about fifteen minutes on our workstation. cscope : cscope is an interactive tool for examining C sources. It first builds a database of all source files, then uses the database to answer the user's queries, such as locating all the places a function is called. We traced cscope when it was being used to examine the source code for our kernel. The trace recorded four queries, taking about two minutes. link-editing : The Ultrix linker is known as being I/O-bound. We collected the I/O traces when link-editing an operating system kernel twice, taking about six minutes. We also collected traces when linking some programs with the X11 library. compiling : we traced the compilation of the NJ Standard ML compiler; xrn : xrn is known for being slow; we collected its I/O traces when a user is reading news with it; simulations: I-(D-) caches; this simulator; VLSI cad tools: magic, etc; scientific applications: anything?; sh, csh, make: large scripts; what else? itemize figure=/u/pc/paper/pc/graphs/new-smallsize-postgres.jgr.ps Performance for Postgres graph:postgres1 figure figure=/u/pc/paper/pc/graphs/new-smallsize-cscope.jgr.ps Performance for cscope graph:cscope1 figure figure=/u/pc/paper/pc/graphs/new-smallsize-vmunix2.jgr.ps Performance for Linking Kernel graph:vmunix figure Single Application Traces First we'd like to see how introducing application control can improve each application's caching performance: Postgres : it is often hard to predict future file accesses in database systems. To see whether user-level heuristics may reduce miss ratio, we tried the policy for sequential access pattern. Indeed, the miss ratio is reduced (Figure~ graph:postgres1 ). We think that the designer of the database system can certainly give a better user-level policy, thus further improving the hit ratio. cscope : cscope actually has a very simple access pattern. It reads the database file sequentially from beginning to end to answer each query. The database file used in our trace is about 10MB. For caches smaller than 10MB, LRU is useless. The reason that the miss ratio is only 12.5 is that the size of file accesses is 1KB, while the file block size is 8KB. However, if we apply the right user-level policy (noticing that the access pattern is same-order ), the miss ratio is reduced significantly (Figure~ graph:cscope1 ). link-editing : the linker in our system makes a lot of small file accesses. It doesn't fit the sequential access pattern. However, it is read-once . Even though the linker is run twice in our traces, during each run its user level policy can still be that of read-once . The result is shown in Figure~ graph:vmunix . For linking with X11 library, we tried both the policy for sequential and the policy for read-once at user-level (Figure~ graph:x11link ). read-once seems to be the right policy. (Note that this trace is small and a 4MB cache is actually enough for it.) itemize figure=/u/pc/paper/pc/graphs/new-smallsize-x11link2.jgr.ps Performance for Linking with X11 graph:x11link figure Multi-Process Traces Having seen that appropriate user-level policies can really improve the cache performance of individual applications, we would like to see how our scheme performs in a multi-process environment. We collected traces when the three applications (Postgres, cscope, linking the kernel) are run concurrently. In this trace, we simulated each application running its own user-level policy as discussed above. The result is shown in Fig. graph:three-app . Since the applications' user-level policies are not optimal, we also simulate the case of each application using an offline optimal algorithm as its user-level policy. This yields the curve directly above RMIN. As can be seen, our scheme, coupled with appropriate user level policies, can improve the hit ratio for multiprocess workloads. figure=/u/pc/paper/pc/graphs/new-smallsize-three-apps.jgr.ps Performance for a Multi-Process Workload graph:three-app figure We also performed an experiment to measure the benefit of using place-holders. We collected a trace of the Postgres and kernel-linking applications running concurrently, and simulated the miss ratio of Postgres when kernel-linking makes the worst possible replacement choices and Postgres simply follows LRU. We simulated our full allocation algorithm and our algorithm without place-holders. Figure~ graph:with-and-without-place-holders shows the result. Without place-holders, Postgres is noticeably hurt by the other application's bad replacement decisions. (With place-holders, Postgres has the same miss ratio as under global LRU.) figure=/u/pc/paper/pc/graphs/placeholder.jgr.ps Benefit of Using Place-Holders graph:with-and-without-place-holders figure Results for Sprite Traces Our second set of traces is from the UC Berkeley Sprite File System Traces sprite91 . There are five sets of traces, recording about 40 clients' file activities over a period of 48 hours (traces 1, 2 and 3) or 24 hours (traces 4 and 5). We focused on the performance of client caching. In a system with a slow network (e.g. ethernet), client caching performance determines the file system performance on each workstation. Furthermore we ignored kernel file activities in these traces, because Sprite's VM system swaps to remote files. In our simulation we set the client cache size to be 7MB, which is the average file cache size reported in sprite91 . These traces do not contain process-ID information, so we cannot simulate application-specific policies as with Ultrix traces. However, since most file accesses are sequential sprite91 , the sequential heuristic can be used. Figure~ graph:sprite1 shows average miss ratios for global LRU, sequential heuristic and optimal replacement. Average cold-start (compulsory) miss ratios are also shown. figure=/u/pc/paper/pc/graphs/allfive.ps Averaged Results from Sprite Traces graph:sprite1 figure As can be seen, two-level replacement with sequential heuristic improves hit ratio for some traces. In fact simulations show that sequential improves hit ratio for about 10 of the clients, and the improvements in these cases are between 10 and over 100 . Overall, these results show that two-level replacement is a promising scheme to enable application control of replacement and to improve the hit ratio of the file buffer cache. We believe that two-level replacement should be implemented in future file systems. Related Work There have been many studies on caching in file systems (e.g. schroeder85,cheriton87,sun:nfs,wilkes:caching,afs:caching,sprite:caching ), but these investigations were not primarily concerned with cache replacement policies. The database community has long studied buffer replacement policies stonebraker:osdbms,dewitt:buffer-policy-evaluation,db:lru-k , but existing file systems do not support them. Although user scripts were introduced in caching in a disconnected environment coda:disconnect , they were used to tell the file system which files should be cached (on disk) when disconnection occurs. In most of these systems, the underlying replacement policy is still LRU or an approximation to it. In the past few years, there has been a stream of research papers on mechanisms to implement virtual memory paging at user level. The external pager in Mach~ Young87 and V~ Cheriton88 allows users to implement paging between local memory and secondary storage, but it does not allow users to control the page replacement policy . Several studies~ premo,sechrest park,harty:appcontrol,anderson:oopsla proposed extensions to the external pager or improved mechanisms to allow users to control page replacement policy. These schemes do not provide resource allocation policies that satisfy our design principles to guarantee replacement performance. Furthermore, they are not concerned with file caching. Previous research on user-level virtual memory page replacement policies~ alonso appel,smlgc:mach,harty:appcontrol,anderson:oopsla,cmu:discardable shows that application-tailored replacement policies can improve performance significantly. With certain modifications, these user-level policies might be used as user-level file caching policies in our two-level replacement. Recent work on prefetching~ hugo:tip,pk:prefetching,chew:kernel can be directly applied in user-level file caching policies using our general replacement strategy. These systems can take advantage of the properties of our allocation policy to guarantee performance of the entire system. Conclusions This paper has proposed a two-level replacement scheme for file cache management, its kernel policy for cache block allocation, and several user-level replacement policies. We evaluated these policies using trace-driven simulation. Our kernel allocation policy for the two-level replacement method guarantees performance improvements over the traditional global LRU file caching approach. Our method guarantees that processes that are unwilling or unable to predict their file access patterns will perform at least as well as they did under the traditional global LRU policy. Our method also guarantees that a process that mis-predicts its file access patterns cannot cause other processes to suffer more misses. Our key contribution is the guarantee that a good user-level policy will improve the file cache hit ratios of the entire system. We proposed several user-level policies for common file access patterns. Our trace-driven simulation shows that they can improve file cache hit ratios significantly. Our simulation of a multiprogrammed workload confirms that two-level replacement indeed improves the file cache hit ratios of the entire system. We believe that the kernel allocation policy proposed in this paper can also be applied to other instances of two-level management of storage resources. For example, with small modifications, it can be applied to user-level virtual memory management. Although the kernel allocation policy guarantees performance improvement over the traditional global LRU replacement policy, there is still room for improvement. We plan to investigate these possible improvements and implement the two-level replacement method to evaluate our approach with various workloads. Acknowledgments We are grateful to our paper shepherd Keith Bostic and the USENIX program committee reviewers for their advice on improving this paper. Chris Maeda, Hugo Patterson and Daniel Stodolsky provided helpful comments. We benefited from discussions with Rafael Alonso, Matt Blaze, and Neal Young. Finally we'd like to thank the Sprite group at the University of California at Berkeley for collecting and distributing their file system traces. plain /u/pc/lib/bib/abbr,/u/pc/lib/bib/db,/u/pc/lib/bib/userlevel Author Information Pei Cao graduated from TsingHua University in 1990 with a BS in Computer Science. She received her MS in Computer Science from Princeton University in 1992 and is currently a Ph.D. candidate. Her interests are in operating systems, storage management and parallel systems. Email address: pc@cs.princeton.edu. Edward W. Felten received his Ph.D. degree from the University of Washington in 1993 and is currently Assistant Professor of Computer Science at Princeton University. His research interests include parallel and distributed computing, operating systems, and scientific computing. Email address: felten@cs.princeton.edu. Kai Li received his Ph.D. degree from Yale University in 1986 and is currently an associate professor of the department of computer science, Princeton University. His research interests are in operating systems, computer architecture, fault tolerance, and parallel computing. He is an editor of the IEEE Transactions on Parallel and Distributed Systems, and a member of the editorial board of International Journal of Parallel Programming. Email address: li@cs.princeton.edu. document