Implementation and Performance of  
Application-Controlled File Caching  

Pei Cao, Edward W. Felten, and Kai Li  
 
Department of Computer Science 
Princeton University 
Princeton, NJ 08544 USA 
pc,felten,li @cs.princeton.edu 

Traditional file system implementations do not allow applications to control file caching replacement decisions.  We have implemented two-level replacement, a scheme that allows applications to control their own cache replacement, while letting the kernel control the allocation of cache space among processes.  We designed an interface to let applications exert control on replacement via a set of directives to the kernel.  This is effective and requires low overhead. 

We demonstrate that for applications that do not perform well under traditional caching policies, the combination of good application-chosen replacement strategies, and our kernel allocation policy LRU-SP, can reduce the number
of block I/Os by up to 80 , and can reduce the elapsed time by up to
45 .  We also show that LRU-SP is crucial to the performance 
improvement for multiple concurrent applications: LRU-SP fairly distributes
cache blocks and offers protection against foolish applications.

  abstract 

  Introduction 

File caching is a widely used technique in file systems.  Cache management policy is normally centrally controlled by the operating system kernel.  Recently, we have shown by simulation that application-controlled file caching can offer higher file cache hit ratios than the traditional approach~  Cao94 .  This paper presents the design and implementation of an application-controlled file cache and reports its performance benefit under a real application workload.

The design of our application-controlled file cache is based on a
two-level replacement scheme proposed in~  Cao94 .  This method
lets the kernel dynamically allocate cache blocks to user processes,
and allows each user process to apply its favorite file caching policy
to its blocks.
A cache block allocation policy, called 
Least-Recently-Used with Swapping and Placeholders(LRU-SP), is used 
to guarantee the fairness of allocation.

We designed an interface of user-to-kernel directives to enable applications
to control file cache replacement.  The interface is designed to be
sufficiently 
flexible for applications to express desired strategies in the common cases,
and yet to have low overhead.

We implemented the application-controlled file cache for the
Ultrix file system on the DEC 5000/240 workstation.  We compared
the performance of our file cache to that of the traditional, global LRU
approach, using several real applications.
First, we showed that a well-chosen caching policy can reduce cache
misses (and hence disk I/Os) significantly.  In our single-application
experiments, good policies can reduce cache misses by between 10  and 80 .
As a result, these policies can reduce the elapsed time of applications by
up to 45 .
Second, we measured the effectiveness and fairness of our LRU-SP
allocation policy.  Using various mixes of concurrent applications, we
showed that our implementation can reduce their elapsed times by up to
30  over the global LRU approach.

We also compared our implementation with our earlier simulation study.  This
confirmed the result from our simulation study that the added features 
of LRU-SP (swapping and placeholders) are important for performance
improvement.  Although LRU-SP does not always provide    perfect 
protection against foolish processes, it significantly reduces their
effect, and it provides an easy way for the kernel to detect foolish or
malicious behavior.


Two-Level Replacement and LRU-SP background 

This section presents some background material about
two level block replacement and the LRU-SP global allocation policy.  For
a full discussion see~  Cao94 .  

The challenge in application-controlled file caching is to allow
each user process to control its own caching    and  at the same time to 
maintain
the dynamic allocation of cache blocks among processes in a fair way so that
overall system performance improves.  

To achieve this goal, we use    two level block replacement. 
Two-level replacement splits the responsibility for allocation and
replacement between the kernel and user level.  The kernel is
responsible for allocating cache blocks 
to processes, while each user process can control the replacement strategy
on its share of cache blocks.  When a user-level process misses in the
cache, the kernel chooses a process to give up a cache block.  The
designated process is free to give up whichever block it likes.  The
kernel's    allocation policy  is used to decide which process will
give up a block.

The interactions between the kernel and the controlling user process (called
the manager process) are as follows: on a cache miss, the kernel first 
finds a candidate block to replace, based on its ``global replacement'' 
policy (step 1 in Figure~  fig:forwarding-decision ).
The kernel then identifies the manager process for the candidate.
This manager process is given a chance to decide on the replacement (step 2).
The candidate block is given as a hint, but the manager
process may overrule the kernel's choice by suggesting an alternative block
under that manager's control (step 3).  Finally, the block suggested by the
manager process is replaced by the kernel (step 4).  (If the manager process
does not exist or is uncooperative, then the kernel simply replaces
the candidate block.)  

  
  figure=/u/pc/papers/OSDI/pc/graphs/interaction3.ps 
  Interaction between kernel and user processes in two-level
replacement: (1) P misses; (2) kernel consults Q for replacement;
(3) Q decides to give up page B; (4) kernel reallocates B to P. 
  fig:forwarding-decision 
  figure 

Clearly, the kernel's ``global replacement'' policy is actually not
a replacement policy at all (as it doesn't really decide which block
to replace) but rather a global    allocation  policy.  A sound global
allocation policy is crucial to the success of two level replacement.
It should satisfy these three criteria: 
  
  
Oblivious processes (those that do not want to manage their share of
the cache) should do no worse than they would have done under the
existing LRU policy.
  
Foolish processes should not hurt other processes.
 
Smart processes should perform better than under the existing LRU policy
whenever possible, and they should never perform worse.
  itemize 


In ~  Cao94  we proposed a kernel policy called LRU-SP that satisfies
these criteria.
LRU-SP operates by keeping an LRU list of all cache blocks and having
two extensions to the ``basic LRU'' scheme:
  
If the kernel suggests block   for replacement and the
user-level manager chooses to replace   instead, the kernel swaps
the positions of   and   in the LRU list (``swapping''), then
builds a record (a ``placeholder'') for  , pointing to  , to remember the
manager's choice.  (See Figure~  fig:LRU-SP-list-change .) 
 
If a user process misses on a block  , and a place-holder for  
exists, then the block pointed to by that placeholder is chosen as the
candidate for replacement.  Otherwise, the block at the end of the LRU
list is picked as the candidate.
  itemize 
If an application never overrules the kernel's suggestions, then no
swapping or placeholders are ever used for that application's blocks,
so the application sees an LRU policy.  If it uses a better policy than LRU
to overrule the kernel's suggestion, then swapping is necessary so
that it won't be penalized by the kernel.  If it uses a worse policy
than LRU, then placeholders are necessary to prevent it from hurting
other processes.  For more details 
please see ~  Cao94 .

  
  figure=/u/pc/papers/OSDI/pc/graphs/list.ps 
  How LRU-SP keeps the ``LRU'' list.  Here the kernel suggests
block A as a candidate for replacement; the application overrules the
kernel, and causes block B to be replaced instead.
The figure shows the list before and after 
the replacement decision. 
  fig:LRU-SP-list-change 
  figure 


  Supporting Application Control 

There are a variety of ways to implement the user-kernel interaction
in two level block replacement, as discussed in ~  Cao94 .  The
main design challenge is to devise an interface that allows
applications to exert the control they need, without introducing the
overhead that would result from a totally general mechanism.  

Our approach was to first consider common file access patterns reported
in the literature 
(e.g.   sprite85,sprite91,dewitt:buffer-policy-evaluation ) as
well as patterns that we observed in the applications discussed below.
We then chose a simple interface that was sufficient to compose
caching strategies to support these access patterns.  


  An Interface for Application Control 

The basic idea in our interface is to allow applications to assign
priorities to files and to specify file cache replacement policies for
each priority level.  The kernel always replaces blocks with the lowest
priority first.  (This rule applies only within the blocks of a single
process.  Inter-process allocation decisions are made using LRU-SP, as
explained above.)  The calls for applications are:
  
     set (file, prio)  and    get (file)  
set and get the long-term cache priority of a file.

     set (prio, policy)   
and    get (prio)  
set and get the file cache replacement policy of a priority level.  At
present, we offer only two policies: least-recently-used (LRU) and
most-recently-used (MRU).  The default policy is LRU.
  itemize 

Since files with the same priority belong to the same caching ``pool''
with the same replacement policy, application or library writers can
use a combination of priorities and policies to deal with various file
access patterns.  They can apply a single policy to all files, can
apply different
policies for different pools of files, and can change priorities to 
tell the kernel to replace some files before others.

Our interface also has a primitive for applications to assign
temporary priorities to file blocks:
  
set temppri(file,startBlock,endBlock,prio)
  verbatim 
This is a mechanism to allow applications to temporarily change the
priority of a set of blocks within a file.  This temporary change
affects only those blocks that are presently in the cache, and a
block's priority change only lasts until the next time the block is
either referenced or replaced.  When the temporary priority ends, the
block reverts to its long-term priority as set by    set  .

These five operations are multiplexed through a single new   
fbehavior  system call, in the same way that the Unix    ioctl 
system call multiplexes several operations.

  Supporting Common Access Patterns 

We now illustrate how our interface can be used to support a variety
of common file access patterns.  Further examples are found in section
  performance , in which we discuss how we used our interface to control
caching for a set of real applications.


  Sequential   Applications often read a file sequentially from beginning
to end.  If the file is accessed sequentially repeatedly, it can be assigned
a low priority and MRU replacement policy.  If the file is accessed only once,
it can be assigned priority -1 to flush its block from cache quickly.


  Cyclic access   Some applications repeat the same sequence
of accesses several times.  These applications 
can assign a low priority, and MRU replacement policy, to the relevant
file or files.


  Access to many small files   Some applications access many
small files.  Since our interface treats all files of the same
priority as being in a single ``pool'' when making replacement
decisions, these applications can control the caching of these blocks.
(An interface that allowed control only within individual files would
be useless for tiny files.)


  Hot and cold files   Some applications have particular files that
are accessed unusually often, or with unusually good temporal
locality.  Such files can be assigned high priority.


  Future access prediction   Some applications can predict
that certain 
blocks are especially likely to be accessed in the near future.  These
blocks can be given a temporary increase in priority, to keep them in
the cache longer.


  Done-with blocks   An application may know that it is finished with 
a block, and will not be accessing it again for a long time.  For
example, this can happen in temporary files, which are often written
once and then read once.  When such a file is being read, the priority
of a block can be decreased temporarily when the file pointer moves
past the end of the block.  This will flush the block quickly from the
cache. 


  Kernel Implementation 

The kernel implementation of two-level replacement is done 
by breaking the existing buffer
cache code into two modules.  The buffer cache module (BUF) handles cache
management and bookkeeping and implements the allocation policy.  The
application control module (ACM) implements the interface calls and
acts as a proxy for the user-level managers.
Both of these modules sit below the VFS interface and communicate via
procedure calls.  The implementation is done on a Ultrix 4.3 kernel.
  Contact the authors for portions of the code that are not
vendor-proprietary. 

When a block must be replaced, BUF picks a candidate block and asks ACM
which block to replace, giving the candidate block as a suggestion.  ACM
then acts on behalf of the appropriate user-level manager to
make the decision.  The structure is shown in Figure~  fig:Proc-Buf-ACM .  

  
  figure=/u/pc/papers/OSDI/pc/graphs/proc-buf-acm.ps 
  Implementation of Application Controlled Caching. 
  fig:Proc-Buf-ACM 
  figure 

ACM implements the calls from user level in a straightforward way.
It allocates a ``manager'' structure for any process that wants to 
control its own caching.  Then for each priority level it allocates a header 
to keep the list of blocks in that level.  It also allocates a file record 
if a file has a non-zero long-term priority.
The implementation imposes a limit on kernel resources consumed by these
data structures and fails the calls if the limit would be exceeded.  

Every block, upon entering the cache, is linked into the appropriate list based
on its file's long-term priority.  
The lists are always kept in LRU order, and
LRU (MRU, respectively) chooses blocks from the head (tail) of the list.  
Blocks may move among lists by    set   or    set  .
The implementation uses the rule that blocks moving into a list will be put
at the end that causes them to be replaced later (the MRU end if the
policy is LRU, or the LRU end if the policy is MRU).  The opposite effect
can be achieved by appropriate use of    set  .

BUF and ACM communicate by using five procedure calls.  
They serve the purpose of notifying ACM about
replacement decisions and mistakes, informing ACM about 
changes in cache state and accesses, and asking ACM for replacement
decisions.  The calls are: 
  
     new (block) : informs ACM that block was loaded into cache buffer; 
     block (block) : informs ACM that block was removed
from the cache; 
     block (block, offset, size) : informs ACM that
block was accessed;   
     replace (candidate, missing ) :   
asks ACM which block to replace; 
     placeholder (block, placeholder) : informs ACM that a
previous decision to replace block was erroneous. 
  itemize 

Changes are needed in the replacement procedure in BUF to implement LRU-SP.
Instead of replacing the LRU (Least-Recently-Used) block, the procedure
first checks if the missing block has a placeholder, then takes the LRU
block or the block pointed to by the placeholder (if there is one) as
the candidate.   BUF calls    replace   if the candidate
block's caching is application-controlled, and finally BUF swaps block
positions and builds a placeholder. 
The interface is well defined, and the procedures are called with no lock 
held.  

Since BUF and ACM sit below the VFS interface, our implementation works for
both UFS~  unix:ffs  and NFS~  sun:nfs .  In all our following 
experiments, however, we put all files on local disks to avoid impact 
on performance from network traffic.

We believe this interface could be used if one wanted to
implement two level replacement using upcalls or dropped-in-kernel 
``user-level handlers''~  SPIN94 .
In particular user-level handlers could know which blocks are in cache by
keeping track of    new   and    block   calls.

Our current implementation adds negligible overhead to file accesses.  
For processes that do not control their own caching, there is no added
overhead on cache hits.
On a cache miss, there are two sources of overhead:
we have to keep track of which file's data are in a cache block 
because Ultrix does not remember this; and if the candidate block belongs 
to a process that manages its own cache,    replace   and
   block   have to be called.  Since these procedures are
cheap and are implemented in the kernel, both overheads are small; indeed
they are negligible compared to the disk latency.

For processes that control their own caching, the extra overhead on a
cache miss is similar to the above case except that    new   is
called as well. The overhead on a cache hit depends on the work done in
   block  , which in our current implementation only involves
updating counters and moving a few blocks around in lists.  These overheads
are often smaller than the potential 
reduction in system CPU time caused by reducing the number of misses.


  Performance Results 
  performance 

Beneficiaries of application controlled caching include
applications that use large file data sets (i.e. they do not fit in cache),
because LRU is often not the right policy for them.  
Examples include: database systems,
including text retrieval software; trace-driven simulations, as the
trace files are often large; graphics and image-processing
applications; and I/O-intensive UNIX utilities.  

We chose a set of such applications for our experiments.  Below we
first describe the applications that we chose and the replacement
strategies they use; then we
present performance results for single and multiple application experiments.

  Applications 

We chose the following applications:


  cscope [cs1-3] 
	Cscope is an interactive C-source examination tool
	written by Joe Steffen.  It builds
        a database of all source files, then uses the database to
        answer queries about the program.
        There are two kinds of queries for cscope: symbol-oriented ones,
        mostly questions about where C-symbols occur; and egrep-like text
        search, mostly asking where patterns of text occur.  
	We used cscope on two software packages (in fact, two kernel
	sources), one about 18MB and one about 10MB, and did two sets of 
	queries: searching for eight symbols, and searching for four text
	patterns.  The three runs are: 
	   cs1  --- symbol search on the 18MB source;	
	   cs2  --- text search on the 18MB source; 	
	   cs3  --- text search on the 10MB source.	

	 Strategy:  
	Symbol-oriented queries always read the database file ``cscope.out''
	sequentially to search for records containing the requested symbols.
	Therefore for this kind of queries, the right policy is to use
	MRU on ``cscope.out'':
	  
	set priority("cscope.out", 0); 
	set policy(0, MRU);
	  verbatim 
	Text searches involve reading all the source files in the same order
	on every query, so the right policy is MRU on all the source files.
	Since all source files have the default priority 0,
	the only necessary call is:
	  
	set policy(0, MRU);
	  verbatim   
	(When there is a mix of these queries, cscope can keep or
	discard ``cscope.out'' in cache when necessary by raising or 
	lowering its priority.)
	

  dinero [din]  
	Dinero is a cache simulator 
	written by Mark Hill and used in Hennessy and Patterson's
        architecture textbook~  H P-Course .  The distribution
        package for the course material includes the simulator and
        several program trace files.  We chose the ``cc'' trace (about 8MB)
        from the distribution package, and ran a set of simulations,
        varying the cache line size from 32 to 128 bytes, and
        set associativity from 1 to 4.  

	 Strategy: 
	Dinero reads the trace file sequentially on each simulation.  
	Hence the right policy is MRU on trace file and the call is: 
	  
	set priority(trace, 0);
	set policy(0, MRU);
	  verbatim 


  glimpse [gli]  
	Glimpse is a text information retrieval system~  manber:glimpse .
        It builds approximate indexes for words to allow both relatively
        fast search and small index files.
        We took a snapshot of news articles in several comp.*
	newsgroups on May 22, 1994, about 40MB of texts.  Then we
           glimpseindex ed it, resulting in about 2MB of indexes.
        The searches are for lines containing these keywords:
        scheduling, scheduling and disk, cluster, rendering and
        volume, DTM.  

	 Strategy: 
	The index files and data files naturally lead to two priority
	levels because index files are always accessed first on every query,
	but data files are not.  Hence glimpse gives the index files long-term
	priority 1, and the articles the default long-term priority 0.  
	Since index files are always accessed in
        the same order, and several groups of articles (called
        partitions~  manber:glimpse ) are accessed in the same order,
        MRU is chosen for both levels.  The calls are: 
	  
	set priority(".glimpse index", 1);
	set priority(".glimpse partitions", 1);
	set priority(".glimpse filenames", 1);
	set priority(".glimpse statistics", 1);
	set policy(1, MRU); 
	set policy(0, MRU);
	  verbatim 
	
	
  link editor [ldk]  
	Ld is the Ultrix link-editor.  We used it to build the Ultrix
        4.3 kernel from about 25 MB of object files.  

	 Strategy:  
	Ld almost never accesses the same file data twice, but it
	does lots of small accesses, so the right thing to do is to free a 
	block whenever its data have all been accessed by calling
	  We can't obtain the source of the DEC MIPS link editor, so
	we implemented this policy in the kernel to simulate what the program 
	would do and called it ``access-once''. :
	  
	set temppri(file, blknum, blknum, -1);
	  verbatim 


  postgres join [pjn] 
	Postgres is a relational database system from the 
        University of California
	at Berkeley.  We used version 4.0.1 for Ultrix 4.3. 
	It uses the file system for I/O operations, and since it
	only has a small internal buffer, it relies heavily on file caching.
	We chose one query operation: a join between an indexed
	and a non-indexed relation, to illustrate how it can
	use application control on file caching.  

	The relations are a 200,000 tuple one,    twohundredk , and a 
	20,000 tuple one,    twentyk , from a scaled-up Wisconsin 
	benchmark  Gray:handbook .  
	The join is on field    unique1 , which is uniquely random within
	1-200,000 in    twohundredk , and uniquely random within 
	1-1,000,020 in    twentyk .
	The size of    twentyk  is roughly 3.2MB,    twohundredk  32MB, 
	and index    twohundredk   5MB.

	 Strategy:  
	Since there is a non-clustered index on    unique1  in
	   twohundredk , and no index in    twentyk , Postgres executes the
	join by using    twentyk  as the outer relation and using
        the index to 
	retrieve tuples from    twohundredk .  The index blocks have a 
	much higher probability of being accessed than data blocks.  Therefore
	postgres uses two priority levels: the index file has long-term
	priority 1, while data files have default priority 0.  LRU is used
	for both priority levels.  The only call necessary is:
	  
	set priority("twohundredk unique1", 1);
	  verbatim 
	

  sort [sort] 
	Sort is the external sorting utility in UNIX.  
	We used a 200,000-line, 17MB text file as input, and sorted 
	numerically on the first field.

	Sort has two phases: it first partitions the input file into sorted 
	chunks that are stored in temporary files, then repeatedly merges
	the sorted files.  Its access pattern
	has the following characteristics: input files are read once; 
	temporary files are written once and then read once; merging
        is done	eight files at a time, and temporary files are merged in the
	order in which they were created.

	 Strategy:  
	Sort has two priority levels, -1 and 0.  Input files have
	priority -1 as they are read once. 
 	Temporary files have default priority 0.  
	MRU is chosen for both level -1 and level 0 (because 
	temporary files created earlier will be merged first). The calls are:
	  
	set policy(-1, MRU);
	set policy(0, MRU);
	set priority(input file, -1);
	  verbatim 
	In addition, the ``readline'' routine is changed to
	keep track of when the end of an 8K block is reached, and at
        the end of reading a block to free the block by calling:
	  
	set temppri(file, blknum, blknum, -1);
	  verbatim 
	We experimented with other strategies, and we found that the ones 
	given here performed the best.


  Single Applications 

  
  figure=/u/pc/papers/OSDI/pc/graphs/latex-single.jgr.ps 
  Normalized elapsed time and block I/Os of 
individual applications under LRU-SP, versus file cache size.
Elapsed times under the original kernel are set to 1.0.  
All numbers are averages of five cold-start runs; variances are less
than 2  with a few exceptions that are less than 5 .   
  single 
  figure* 

We compared the performance of each application
under application-controlled file caching to that under the kernel-controlled
file caching (the original, unmodified kernel).  We used a DEC 5000/240
workstation with one RZ56 disk and one RZ26 disk.  The RZ56 is a 665M
SCSI disk, with average seek time of 16ms, average rotational latency of 8.3ms,
and peak transfer rate of 1.875MB/s; the RZ26 is a 1.05GB SCSI disk,
with average seek time of 10.5ms, average rotational latency of 5.54ms, and peak
transfer rate of 3.3MB/s.  The two disks are connected to one SCSI bus.

Applications    cs[1-3] ,
   din ,    gli  and    ldk  were run on the RZ56 disk, and    pjn  
and    sort  were run on the RZ26 disk.

We measured both the number of block I/Os, reflecting the cache misses, and
the elapsed time, for several configurations of the buffer
cache size (6.4MB  This is 10  of the memory of our workstation,
which is the default cache size under Ultrix. , 8MB, 12MB, 16MB).   
Figure~  single  shows the
elapsed times and the number of block I/Os    normalized  to those under 
the original kernel.  The raw data for this experiment appear in the 
appendix  Raw data 
for all of our experiments are 
available at ftp.cs.princeton.edu: /pub/pc/OSDI/. .

Application-specific policies reduce the number of block I/Os 
by an amount ranging from 9  to over 80 .  This confirms our simulation
result~  Cao94  that miss ratio is reduced by application controlled
caching.  The reduction in elapsed time ranges from 6  to over 45 .
Elapsed time is not directly proportional to block I/Os because
elapsed time is also affected by the amount of CPU computation time,
and because the disk access latency is not uniform.

The performance improvement often depends on the relative size of the cache 
versus the data set.  For example, for    cs1 , the improvement increases 
until the 9MB database file ``cscope.out'' fits in cache, at which
point all policies perform the same.
In some cases LRU makes a bigger cache useless ---
there is no benefit until the entire data set can fit in the cache.
Application specific policies, on the other hand, can appropriately take 
advantage of a bigger cache and improve application performance.  As DRAM
density improves, we will see ever-larger caches, so the benefit of
application control will become more significant.

  Multiple Applications 

In our previous paper we used simulations to show that our global allocation
policy LRU-SP can properly take advantage of application control on replacement
to improve the hit ratio of the whole system.  Now that we've
implemented the system, we'd like to see whether this is true in practice.

We ran several combinations of two or more of the above 
applications, with each application using its appropriate user-level
replacement policy.  
The combinations were chosen to have various mixes of
access patterns.  For this purpose, we put the applications into
categories based on their access patterns.  We put    cs[1-3]  and
   din  into the ``cyclic'' category,    gli  and    pjn  into the
``hot/cold'' category, and    ldk  and    sort  in two separate
categories.  We ran six  
two-application combinations, chosen to 
cover all combinations of the four categories.  We also 
randomly chose two three-application combinations and one
four-application combination. 

  
  figure=/u/pc/papers/OSDI/pc/graphs/latex-concurrent.jgr.ps 
  Normalized total elapsed time and block I/Os of 
multiple concurrent applications.  Values under the original kernel
are normalized to 1.0. 
All numbers are averages of three cold-start runs; variances are less
than 2 .   
  multi 
  figure* 

The combinations are:    cs2+gli ,    cs3+ldk ,  
   gli+sort ,    din+sort ,    sort+ldk ,    pjn+ldk ,  
   din+cs2+ldk ,    cs1+gli+ldk , and    din+cs3+gli+ldk .
Note that    gli+sort ,    din+sort ,    sort+ldk ,    pjn+ldk 
are run using two disks, while all others are run using one disk.

We measured the total elapsed time and the total number of block I/Os
for these concurrent applications.  
Figure~  multi 
shows the    normalized  elapsed time and the number of block I/Os.
As can be seen, LRU-SP indeed improves the performance of the whole system.  
The improvement becomes more significant as the file cache size increases.


  Analysis of LRU-SP 

In our simulation study~  Cao94  we found that both the    swapping 
and    placeholders  
techniques in LRU-SP are necessary, and that LRU-SP satisfies our allocation
criteria as described in Section~  background .  In this section, we 
investigate these questions again with our real implementation.

  Comparison with ALLOC-LRU 

LRU-SP keeps a basic LRU list of cache blocks to choose victim processes,
but makes two variations,    swapping  and    placeholders , to the list,
as described in Section~  background .
In the following we first see whether swapping is necessary, then check whether
placeholders are necessary.

  Is swapping necessary?  
Let's call the basic scheme that simply uses the LRU list to choose
victim processes the ALLOC-LRU (ALLOCation by LRU order) policy.  
With ALLOC-LRU as allocation policy in two level replacement, we randomly
chose the following from our above experiments:
   cs2+gli ,    cs3+ldk ,    din+cs2+ldk ,    cs1+gli+ldk ,
and    din+cs3+gli+ldk , and compared the results to
those of LRU-SP.
The results of this comparison are shown in 
Figure~  concurrent-LRU .

In most cases ALLOC-LRU performs worse.
In fact, in a few cases, under ALLOC-LRU applications are better off not 
using smart policies --- smarter allocation hurts their
performance  For detailed data, see ftp.cs.princeton.edu: /pub/pc/OSDI. !  
The problem is that ALLOC-LRU uses straight LRU order without
swapping, and hence penalizes any  
process that does not replace the LRU block.  These results show that swapping 
positions of candidate and alternative blocks
is necessary.

  Are placeholders necessary? 
To see whether placeholders are also necessary, we need an application that
can have a replacement policy that does much worse than LRU
  We do not use the
previous applications here because LRU works poorly for them. ,
and an oblivious
application whose performance is sensitive to the number of cache
blocks it is given.
We found a program ``ReadN'' that can serve both purposes.

  
  figure=/u/pc/papers/OSDI/pc/graphs/latex-setlru-concurrent.jgr.ps 
  Normalized total elapsed time and block I/Os of 
multiple concurrent applications under    ALLOC-LRU .
Values under two level replacement with
   LRU-SP  are normalized to 1.0. 
All numbers are averages of three cold-start runs; variances are less
than 2 .  
  concurrent-LRU 
  figure* 

  
    |c|c|c|c|c|||c|c|c|c|   
     4  c|||   Elapsed Time of ``ReadN''      4  c|   Block I/Os of ``ReadN''      2-9 
Settings   Read390   Read400   Read490   Read500   Read390   Read400   Read490   Read500    
Oblivious   53   58   59   72   1172   1181   1176   1481   
           (7 )     (14 )         (11 )    
Unprotected   73   89   76   122   1300   1538   1465   2294  
         (23 )   (24 )     (27 )    (19 )   (21 )     (27 )    
Protected   75   75   72   91   1170   1170   1199   1580   
              (12 )         (10 )    
  tabular 
  Results from experiments done to test the effectiveness of
placeholders.  For descriptions of these experiments please see text.
The first four columns show the running time in seconds, and the next
four the number of block I/O's, for ReadN.  
Results are averaged over five runs.  Variances are
shown in parentheses; when omitted they are less than 5 . 
  wkset-times 
  center 
  table* 

ReadN sequentially reads the first   8K-byte blocks from a file in
sequence, repeating this sequence five times,
then reads the next   blocks five times, and so on.
MRU replacement policy is much worse than LRU for ReadN, so 
ReadN with MRU can be used as an example of a foolish application.
ReadN also has the characteristic that under LRU, the cache miss ratio
is low when it has at least N cache blocks, but is high when it has
less than N cache blocks.
Therefore we can compare the number of block I/Os done by
ReadN in different situations as a way of measuring what share of the
cache blocks it is getting.

To answer our question about the effectiveness of placeholders, we ran
two versions of ReadN concurrently.  One version, with N=300, is run
with various replacement policies: the oblivious LRU policy (which is
good but not optimal), and MRU (which is terrible).
The other version of ReadN is used to detect changes in allocation of
cache blocks.  This version is run with various values of N: Read390,
Read400, Read490, and Read500.  These values are chosen because our cache
size is 6.4MB, or 819 blocks, so Read300 and Read390 can comfortably
fit in cache together with plenty of space to spare, while Read300 and Read500 
can barely just fit in cache together.  

We measured the performance (running time and number of I/Os) of
the various ReadN's when the background Read300 uses both oblivious (LRU)
and dumb (MRU) replacement policies.
If processes are protected from the foolishness of others, then the
performance of the ReadN's should not get much worse when the
background Read300's policy is changed from oblivious to foolish.

In the case where the background Read300 is foolish, we actually ran
two experiments.  In the ``protected'' experiment, the kernel uses our
LRU-SP allocation policy.  In the ``unprotected'' experiment, the
kernel uses LRU-SP    without placeholders .  Thus there are three
cases overall:
  
  Oblivious: the background Read300 uses the oblivious (LRU) policy;
  Unprotected: the background Read300 uses a foolish (MRU) policy, and the 
	kernel uses LRU-SP    without placeholders  i.e. LRU-S;
  Protected: the background Read300 uses a foolish (MRU) policy,
and the kernel uses LRU-SP;
  itemize 
The results are shown in Table~  wkset-times .  
The data clearly shows that place-holders are necessary to 
protect the oblivious    readN  from losing its share of cache blocks.
(For a detailed explanation of why placeholders provide this
protection, see ~  Cao94 .)  However, the data also shows that  
placeholders did not prevent the increasing in elapsed times of ReadN.
The next subsection explains why.

  Criteria Test 
  sufficiency 

  
   |c|c|c|c|c||c|c|c|c|      
``Read300''     4  c||   Elapsed Time     4  c|   Block I/Os      2-9 
Policy   din   cs2   gli   ldk   din   cs2   gli   ldk     
Oblivious   155   225   156   112   3067   9760   9086   5201    
Foolish   202   339   261   208   3495   10542   9759   5374    
  tabular 
  Effect of a foolish process on smart applications.
This table shows performance of various processes when run
concurrently with a Read300 process, depending on whether the Read300
is oblivious (using LRU) or is foolish (using MRU).
The first four columns shows elapsed time in seconds, and the next four 
the number of block I/Os.
Numbers are averages of three runs with all variances   3 . 
  foolish-time 
  center 
  table* 

We have seen in Section~  performance  
that smart processes improve their performance.  
The next two questions are: do foolish processes hurt other processes,
and do smart processes hurt oblivious processes?

  Do foolish processes hurt other processes? 

We have that seen placeholders at least limit the harm done to
oblivious processes by
foolish processes.  To see whether smart processes are protected as well,
we ran each of    din ,    cs2 ,    gli  and    ldk , all using smart 
policies, concurrently with ``Read300''.   
The results when ``Read300'' is oblivious and when it is foolish are shown in 
Table~  foolish-time .

The data show that there are degradations in both the number of block I/Os and
the elapsed time.  This contradicts results from our simulation study. 
The reasons include the following:
  
  Foolish processes generate more I/O requests, 
and thus put a heavier load on the disk.  This leads to longer
queueing time for the disk, so non-foolish processes' I/O requests
wait longer to be served.   
  Foolish processes take longer
to finish and therefore occupy cache blocks for a longer time.
Processes that run concurrently with them and finish after them normally 
get an increase in available cache blocks after they are finished.  If they
finish later, this increase comes later.
  itemize 

The best way to provide protection from foolish processes is probably
for the kernel to revoke the cache-control privileges of consistently
foolish applications.  
Placeholders allow the kernel to tell when an application is foolish, so
the kernel can keep track of how good the application's policy is.
If it turns out that a certain percentage of the application's decisions 
are wrong, the kernel can revoke its right to control its caching
  We are adding this in our implementation. .

  
   |l|c|c|c|c|    
Application     4  c|   Elapsed time of ``Read300''      2-5 
Policies   w. din   w. cs2   w. gli   w. ldk    
Oblivious   87   88   60(7 )   78     
Smart   67(8 )   83   64(9 )   76     
  tabular 
  Elapsed time of ``Read300'' when run concurrently with
oblivious and smart versions of other applications on one disk.
Numbers are in seconds, averages of three runs with variances  4 
except those shown in parenthesis.
Read300's numbers of block I/Os are the same in all cases (about 1310) 
as they are all compulsory misses. 
  obli-times:1 
  table 

  
   |l|c|c|c|c|    
Application     4  c|   Elapsed time of ``Read300''      2-5 
Policies   w. din   w. cs2   w. gli   w. ldk    
Oblivious   20   18   19   17     
Smart   20   17.5   18   17     
  tabular 
  Elapsed time of ``Read300'' when run concurrently with
oblivious and smart versions of other applications on two disks.
Numbers are in seconds, averages of three runs with variances  3 .
Read300's number of block I/Os is the same in all cases (about 1310). 

  obli-times:2 
  table 

  Do smart processes hurt oblivious ones? 
To answer this we use ``Read300'' again, this time as an oblivious application.
We run it with each of    din ,    cs2 ,
   gli ,    ldk , both when they are oblivious and when they are smart.


The results are summarized in Table~  obli-times:1 .

In most cases smart processes do not hurt but rather help oblivious processes.
Since these experiments are run on one disk (RZ56), the reduction in the number
of I/Os from smart processes reduces the load on the disk, and speeds up the
requests from ``Read300''.  

The only exception is the experiments with    gli , which have a high
variance in the elpased time.  Since the number of block I/Os from ``Read300''
did not increase, nor did its user and system times, we suspect that this is
due to the RZ56 disk's internal scheduling or buffering.  To test this, we
also run this experiments on two disks, with ``Read300'' using RZ26 and others
using RZ56.  The results are summarized in Table~  obli-times:2 .

As can be seen, the anomaly goes away, which suggests that the previous 
problem was due to the disk contention.  
The improvements in the elapsed times, however,  
are not as noticeable as before because ``Read300'' is using a seperate disk.  


From these results we believe that both swapping and placeholders are 
Status: R

necessary.  Although placeholders do not completely eliminate 
the harm done by foolish processes, they at least help the kernel take
administrative measures to solve the problem.


  Related Work 

There have been many studies on caching in file systems (e.g.
  schroeder85,cheriton87,sun:nfs,afs:caching,sprite:caching,coda:disconnect ),
but these investigations were not primarily 
concerned with the performance impact of different replacement policies.  
Recently several research projects have tried to 
improve file system performance in a number of other ways, including
prefetching  smith-cache,hugo:tip,pk:prefetching,chew:kernel,appleton:prefetching ,
delayed writeback  mogul:writeback  and disk block
clustering  McVoy:cluster,saltzer:lfs .    
Most of these papers still assume global LRU as the basic cache
replacement policy, and they do not address how to use application
control over replacement to improve  
cache hit ratio.  Our work is complementary to these 
approaches; in other words, with application-controlled file caching,
these approaches will improve file system performance even more.

The database community has long studied access patterns and
buffer replacement policies
  stonebraker:osdbms,dewitt:buffer-policy-evaluation,db:lru-k .
In our experiments we used    pjn  (Postgres join) as an example of
how database systems may use our interface to control file caching.
We believe our interface 
is flexible enough to implement most of the policies a database system might
want to use.  


A very similar problem is application controlled physical
memory management in virtual memory systems.  Current operating systems
provide limited facilities that are not adequate for application control.  
For example, the Berkeley UNIX system calls for pinning pages,    mpin  or   
mlock , are often only available to the superuser.   


Advisory calls like    vadvise  or    madvise  provide an
interface for applications to advise the kernel about their access patterns 
over a memory object, including a mapped file.  
However, this interface is much more limited than ours, allowing specification
of only a small number of basic patterns and no priorities.

As a result, in the past few years there has been a stream of research papers 
on application control of caching in the virtual memory context
  premo,sechrest park,harty:appcontrol,anderson:oopsla .  
Our work differs from that described in these papers in three
significant ways: 
  
  
None of these papers (except ~  harty:appcontrol ) addresses the global
allocation policy problem.  By contrast, we discuss this problem in
detail, and provide a solution, LRU-SP, which we have
simulated~  Cao94  and now have implemented.
 

Most of these papers relied on RPC or upcalls for kernel-to-user
communication, and consequently reported overhead as high as 
10  of the total execution time~  premo,sechrest park . 
We provide a flexible interface for applications to issue primitives to
exert control on cache replacement; we found that this is adequate most
of the time and requires low overhead.  
 

Most of these papers do not adequately consider which replacement
policy an application should use.  In   Cao94  we proposed that application
replacement policies be based on the optimal replacement principle~  coffman denning , and in this paper we discussed which policies to use for our
example applications.  
 
  itemize 

On the other hand, our work shares some common purposes with the
existing work on application-controlled virtual memory.  
We believe that, with some minor modifications, our approach applies
to virtual memory cache management as well. 

First, swapping and placeholder techniques apply to the virtual memory
caching problem --- one can swap positions of pages on the two-hand-clock
list, and can build placeholders to catch foolish decisions.  Second, our 
interface can be modified to apply to virtual memory context, i.e. instead of
files, we use a range of virtual addresses (or memory regions).

One important difference is that in the virtual memory
context, the implementation cannot capture the exact reference stream as
it does in the file caching context.
It is still not clear to us what are the common policies applications may want 
to use for virtual memory caching and whether not capturing exact references is a serious problem.

Our implementation is for Ultrix 4.3 where there is a fixed amount of 
DRAM memory allocated for file caching.  In modern operating systems
the virtual memory system and the file system often share a common page
buffer pool.  We believe our 
approach to file caching fully applies to these systems as well, with
only some minor changes in data structures.

Finally, the difference between this paper and the work described in
  Cao94  is that in   Cao94  we proposed and simulated LRU-SP,
and in this paper we proposed a concrete scheme for user-kernel
interaction and implemented both the interaction scheme and
application-controlled caching in a real kernel.


  Conclusion and Future Work 

Our main conclusion is that two level replacement with LRU-SP works.  It can
be implemented with low overhead, and it can improve performance
significantly.

Applications with large data sets often do not perform well under traditional
file caching.  We have identified common access patterns of these applications,
and have designed an interface for them to express control over cache
replacement. 
The allocation policy LRU-SP fairly
distributes cache blocks among applications.  As a result, the number of disk
I/Os can be significantly reduced, and the application's elapsed time and
the system's throughput can improve.

We are working on improving our user-kernel interface, and on
supporting user-level control over caching of 
concurrently shared files  Cao94 .  
In addition, 
our current implementation ignores metadata blocks like inodes, 
 partly because there is a separate caching scheme for them inside the file
system.  We plan to look more into metadata caching performance and 
investigate what should be done.

We also plan to look into the interaction between
caching, prefetching, write-back and disk block clustering.  

The search for better global allocation policies requires a clear
definition of the goal and an examination of the interactions between
caching and scheduling/management
policies in other parts of the system, particularly process scheduling
and disk scheduling  ganger:disk-scheduling .  This is a fruitful
area for future work.

  
  Better global allocation policies? 
LRU-SP is by no means optimal.  However, before we discuss
optimal allocation policies, we first have to address the question:
what are we trying to optimize?

If the goal is the throughput of the system, then allocation policy
alone can not solve the problem.  Job/process scheduling,
disk service scheduling all have to contribute as well.  The reasons
are: it is clear that no two disk bound applications should be run
concurrently, and this can only be achieved by 
job/process scheduling; and when a disk bound application runs with
a more balanced application, CPU, not the disk, is the one that has
to be kept busy, hence the more balanced application's disk I/O
requests should receive higher priority at the disk end (in addition
to get necessary cache blocks from the allocation policy), and this 
is can only be accommodated by the disk scheduler.

Or the goal can be fair share of system resource and equal progress.
Existing UNIX process scheduling and resource management
are more toward this goal.
What LRU-SP does is to try to preserve the current state of 
``fair'' share of resources and 
improve throughput at the same time.  
On the other hand, we also think that with recent technology 
changes it is necessary to reexamine various scheduling 
policy including process scheduling for their proposed goal, 
and better global allocation policies have to be investigated 
in the same context.

In a word, the search for the optimal allocation policy can not be
isolated, it must involve a definition of the goal and
examination of scheduling/management 
policies in other parts of the system.
 

  Acknowledgements 

We are grateful to our paper shepherd Jay Lepreau and the anonymous referees
for their valuable comments.  Keith Bostic and Randy Appleton also provided 
helpful feedback.  

  References 
 
  plain 
  /u/pc/lib/bib/abbr,/u/pc/lib/bib/db,implementation 

 
  Appendix 

These tables show raw performance data for our single application experiments.

can
be implemented with low overhead, and it can improve performance
significantly.

Applications with large data sets often do not perform well under traditional
file caching.  We have identified common access patterns of these applications,
and have designed an interface for them to express control over cache
replacement. 
The allocation policy LRU-SP fairly
distributes cache blocks among applications.  As a result, the number of disk
I/Os can be significantly reduced, and the application's elapsed time and
the system's throughput can improve.

We are working on improving our user-kernel interface, and on
supporting user-level control over caching of 
concurrently shared files  Cao94 .  
In addition, 
our current implementation ignores metadata blocks like inodes, 
 partly because there is a separate caching scheme for them inside the file
system.  We plan to look more into metadata caching performance and 
investigate what should be done.

We also plan to look into the interaction between
caching, prefetching, write-back and disk block clustering.  

The search for better global allocation policies requires a clear
definition of the goal and an examination of the interactions between
caching and scheduling/management
policies in other parts of the system, particularly process scheduling
and disk scheduling  ganger:disk-scheduling .  This is a fruitful
area for future work.

  
  Better global allocation policies? 
LRU-SP is by no means optimal.  However, before we discuss
optimal allocation policies, we first have to address the question:
what are 
e trying to optimize?

If the goal is the throughput of the system, then allocation policy
alone can not solve the problem.  Job/process scheduling,
disk service scheduling all have to contribute as well.  The reasons
are: it is clear that no two disk bound applications should be run
concurrently, and this can only be achieved by 
job/process scheduling; and when a disk bound application runs with
a more balanced application, CPU, not the disk, is the one that has
to be kept busy, hence the more balanced application's disk I/O
requests should receive higher priority at the disk end (in addition
to get necessary cache blocks from the allocation policy), and this 
is can only be accommodated by the disk scheduler.

Or the goal can be fair share of system resource and equal progress.
Existing UNIX process scheduling and resource management
are more toward this goal.
What LRU-SP does is to try to preserve the current state of 
``fair'' share of resources and 
improve throughput at the same time.  
On the other hand, we also think that with recent technology 
changes it is necessary to reexamine various scheduling 
policy including process scheduling for their proposed goal, 
and better global allocation policies have to be investigated 
in the same context.

In a word, the search for the optimal allocation policy can not be
isolated, it must involve a definition of the goal and
examination of scheduling/management 
policies in other parts of the system.
 

  [htb]
   ||l|l||c|c|c|c||   
  2  ||c||   Application      4  c||   Buffer Cache Size      3-6 
  2  ||c||        6.4MB   8MB   12MB   16MB    

   o

  document