A BETTER UPDATE POLICY

                            Jeffrey C. Mogul
       Digital Equipment Corporation Western Research Laboratory

                                ABSTRACT

  Some file systems can delay writing modified data to disk, in order to
reduce disk traffic and overhead.  Prudence dictates that such delays be
bounded, in case the system crashes.  We refer to an algorithm used to
decide when to write delayed data back to disk as an update policy.
Traditional UNIX(R) systems use a periodic update policy, writing back
all delayed-write data once every 30 seconds.  Periodic update is easy
to implement but performs quite badly in some cases.  This paper
describes an approximate implementation of an interval periodic update
policy, in which each individual delayed-write block is written when its
age reaches a threshold.  Interval periodic update adds little code to
the kernel and can perform much better than periodic update.  In
particular, interval periodic update can avoid the huge variances in
read response time caused by using periodic update with a large buffer
cache.

1. Introduction
  File systems usually cache data and meta-data in a main memory buffer
cache, in order to improve performance.  When a modification is made,
the file system may write the new information to stable storage (e.g.,
disk) immediately, or it may delay the write.  This leads to a tradeoff:
delaying writes reduces the load on the disk and system overhead, but
the data could be lost if the system crashes before the write occurs.
In many cases, users can tolerate this vulnerability, and welcome the
performance advantages of delayed writes.

  UNIX(R) systems have traditionally supported delayed writes, from the
earliest C-language version [13, 14] up through 4.3BSD [6] and its
derivatives.  In these systems, a modification of a partially-filled
data block results in a delayed write, while a modification that fills a
block results in an immediate, although asynchronous, write.  The
ULTRIX(TM) operating system can be configured to be even more
aggressive, delaying all writes of modified data.

  Without some bound on the age of a delayed-write block, a system crash
could cause loss of arbitrary data.  Users would not tolerate this, so
the file system does push delayed-write data out to disk, after a while.
We use the term update policy to describe the algorithm that decides
what to write out, and when.

  UNIX systems have traditionally used a simple periodic update (or
``PU'') policy:  once every 30 seconds, all dirty blocks in the file
system's buffer cache are placed on the output queue for the appropriate
disk.  Recent analytical and simulation results, presented by Carson and
Setia [2], showed that the PU policy actually performs worse in many
cases than the write-through (WT) policy (in which all writes are
immediate).  Their analysis showed that PU causes increased mean
response times for read operations; the results presented in this paper
show that PU can also increase the variance in read response time.

  Because so many systems use the now suspect PU policy, it seemed like
a good idea to validate the results of Carson and Setia on actual
systems; their analyses and simulations, while careful, had to use
certain simplifying assumptions.  Carson and Setia suggested that
A BETTER UPDATE POLICY                                                 1


several other update policies would provide better performance, so it
also seemed useful to implement one of these and measure its
performance.

  In this paper, after discussing the theoretical background in some
more detail, I describe an implementation of the interval periodic
update (IPU) policy.  I also present the results of some simple
measurements of PU and IPU made using an actual implementation, rather
than a model.  The results appear to bear out the basic conclusions of
Carson and Setia.  What I found is that although use of delayed writes
can improve mean response time, combining delayed writes with periodic
update increases the variance in response time, but using interval
periodic update both the mean and variance are improved.

2. Theoretical background
  In this section, I discuss the previous simulation study, the possible
alternative update policies, the choice of an update interval, and how
this problem will scale with changes in technology.


2.1. Carson and Setia's results for Periodic Update
  The study done by Carson and Setia showed that the periodic update
(PU) policy, while easily implemented, can perform quite badly.  By
dumping all the dirty blocks into the disk queue at once, PU can cause
lengthy queueing delays.  Latency-sensitive synchronous operations, such
as file reads or synchronous writes, are forced to wait behind
latency-insensitive asynchronous operations in the queue.  If the system
were to use a write-through (WT) policy instead, disk write operations
would normally be spread out more over time, and the queues would be
shorter.

  Carson and Setia show that the relative performance of WT and PU,
measured in terms of mean read-response time, depends on several
parameters:

Read load
   The ratio of read operation arrival rate to the rate that the disk
   can support.  A read load of 1.0 is one that the disk could just
   barely keep up with, if no writes were done.

Write load
   The ratio of write operation arrival rate to the rate that the disk
   can support.

Cache hit ratio for writes
   The fraction of write operations that are satisfied by modifying
   already-dirty blocks in the buffer cache.

They expressed their results in tables showing, for a given read load
and write load, what write-hit ratio PU requires in order to match or
exceed the performance of WT.  (Since WT causes disk writes to occur
almost immediately, it cannot benefit from write hits in the cache.)

  They found that:

   - When the disk is not overloaded, ``the cache must eliminate
     80-90 % of all write accesses before the PU policy pays off.''

   - Under heavy loads, the PU policy gets some benefit from
     write-hits in the cache, and thus reduces the overall disk
A BETTER UPDATE POLICY                                                 2


     load.  In cases where the WT policy would saturate the disk,
     the PU policy gives a lower mean read-response time.

Cache hit rates vary, depending on cache size, application, and
replacement policy, but we are unlikely to achieve average write-hit
rates exceeding 80%.  For example, Baker et al. [1] traced user-level
file access patterns in a distributed system and found that 88% of the
bytes written were written sequentially; if the applications in question
used traditional buffering strategies, most of these sequential writes
would not have hit already-dirty buffer-cache blocks, and so the
write-hit rate must have been quite low.

  Carson and Setia found an analytical model for the mean response time.
Their simulations were used to validate this result, but they apparently
did not investigate other statistics besides the mean.  As section 4
will show, PU is especially bad for worst-case response time, and for
overall variance in response time.  It is not hard to construct a
situation where PU can lead to worst-case read response times of many
seconds.


2.2. Proposed alternative policies
  PU performs poorly because it generates long queues at periodic
intervals, and subsequent synchronous requests get stuck at the ends of
these queues.  Perhaps one could improve read-response times by changing
the queueing mechanism.

  UNIX systems typically maintain a single, unprioritized operation
queue for each disk.  Suppose that read operations were given priority
over asynchronous operations already in the queue. Then, read operations
would not ``see'' a queueing delay caused by the queued delayed writes.
Carson and Setia analyzed this periodic update with read priority (PURP)
policy, and showed that it mostly solves the read-response time problem.
I found PURP unsatisfactory, however, for several reasons:

   - Modification of the existing disk queue mechanism would
     require changes to numerous kernel modules, including all disk
     drivers and many of their clients (file systems, virtual
     memory systems, etc.)

   - Modern disk controllers and drives can queue several
     operations in their internal buffers.  One would either have
     to accept the resulting queueing delays, or somehow modify the
     hardware to support the new queueing mechanism.

   - Carson and Setia point out that fixed-priority schemes such as
     PURP introduce the potential for infinite delays of delayed
     writes, if the read load is enough to saturate the disk.
     Peacock [12] reported that adding PURP to System V Release 4
     does seem to hurt benchmark performance, although it
     substantially increases single-file write throughput.

Although implementation of a prioritized queueing scheme should be
helpful in general, it is neither a complete solution to the
bursty-update problem, nor is it the simplest solution.

  Carson and Setia also proposed the interval periodic update (IPU)
policy, in which each dirty block is written out when its age reaches a
threshold.  If file modifications are nicely spread out in time, this
means that the delayed writes back to the disk will also be spread out.
A BETTER UPDATE POLICY                                                 3


As with PU and PURP, IPU uses the buffer cache to eliminate some disk
writes that would be done by WT.  Unlike PU and PURP, IPU normally
avoids creating large bursts of writes, and so avoids the associated
queueing delays.  Carson and Setia show that IPU never gives worse mean
read response time than WT or PU, although in some situations it may
perform worse than PURP.

  Anna Hac [3] describes algorithms meant for deciding when to move
dirty blocks from the buffer cache to a disk queue.  In essence these
replace the time-driven update policies with dynamic algorithms, which
choose when to schedule disk writes based on the system load and disk
queue length.  Such adaptive algorithms may perform better than any of
the open-loop algorithms described in this paper, but they require more
extensive changes to the operating system.  I do not have anything
useful to say about them, and they merit additional study.


2.3. Choice of update interval
  UNIX systems have traditionally used a 30-second interval between
writes generated by the PU policy.  This means that, ignoring brief
queueing delays, no information will be vulnerable to a crash for longer
than 30 seconds.  (Applications that depend on reliable data storage
should arrange to write their data synchronously, using the fsync()
system call.  Many other applications, such as compilers, can afford to
use delayed writes because their output can easily be reconstructed, or
because if the system crashes during a run, the resulting partial output
is useless anyway.)

  Under the assumption that modifications occur more or less uniformly
over time, the average age of a delayed-write block, when it is written
to disk, is 15 seconds.

  The IPU policy also has a characteristic time scale, the age at which
a dirty buffer is scheduled for writing to the disk.  If we set this to
30 seconds, then (again ignoring queueing delays), by definition, no
information will be vulnerable to a crash for longer than 30 seconds.
Also by definition, the average age of a block, when written to disk, is
30 seconds.

  If we choose to set the update delay for IPU the same as the update
interval for PU, then both policies expose modified data to exactly the
same worst-case vulnerability.  Doing so, however, means that the mean
age of dirty blocks is roughly twice as it would be with the PU policy.
This suggests that IPU should see a higher write-hit ratio, and might
avoid a few more disk write operations.

  Carson and Setia showed that, as the update interval was increased,
the write-hit ratio at which PU began to pay off had to increase as
well.  We do expect this ratio to increase, but does it increase fast
enough?  Carson and Setia cite other work suggesting that it might [10]
(see also a more recent study [1]).  Unfortunately, I know of no actual
test of this hypothesis.  Still, one might suspect that the increased
average lifetime of dirty blocks, when the IPU policy is used, might
account for some performance advantage.  (Note that the experiments
reported in sections 4.1 and 4.2 carefully avoid repeated writes to the
same block during an update interval, and so should encounter abnormally
low cache hit ratios.)
A BETTER UPDATE POLICY                                                 4


2.4. Scaling properties
  One might ask why, if PU performs so badly, has this not been a
                   1
problem in practice .  The answer is that buffer cache sizes and disk
speeds are improving at different rates, which changes the ratio of disk
queue length to disk service latency.

  4.2BSD and related systems only delay partial-block writes.  Since
most files are written sequentially, most blocks are filled quickly, and
pending delayed writes of partial blocks are turned into asynchronous
immediate writes of filled blocks.  If the buffer cache is not large
enough to hold many entire files, then it makes little sense to delay
writes of full blocks, since these blocks are unlikely to be referenced
again quickly.

  Memory chips get larger: this is one of the most reliable laws of
recent history.  One can quibble over whether the doubling time is 18
months or two years, but main memory sizes do increase (at roughly
constant cost) as the years pass, and no other technology trend is quite
so steep [5].

  This trend means that, if the fraction of main memory used as a buffer
cache remains constant, the absolute size of buffer caches is increasing
with time.  (Many systems, including Mach, Sprite, and recent UNIX
implementations, no longer allocate a fixed fraction of main memory for
the buffer cache, so it can grow to fill all of memory.)  Since mean
file sizes do not seem to be increasing as rapidly [1], perhaps as main
memories get larger, using delayed writes would increasingly reduce disk
traffic because of write-hits in the cache.  (Traces do show that a few
large files are getting much larger [1], and so caching algorithms
should perhaps switch to write-through for any file larger than a
certain size.)

  Although increasing DRAM densities lead to larger buffer caches and
perhaps more use of delayed writes, disk technology trends are less
encouraging.  Disk access times have improved by perhaps one-third in
ten years.  Disk densities are increasing more rapidly, doubling every
three years [5].  Disk bandwidths tend to scale as the square root of
disk density (since density improvements come from both higher signal
rates and smaller track spacings), and also benefit from small increases
in rotation rate (from 3600 RPM to 5400 RPM), so over the past decade
they have improved by perhaps a factor of six.

  This means that the time it takes to write the entire buffer cache to
disk is growing, in absolute terms.  This is the key problem for the PU
update policy: the queueing delays caused by its burst of write requests
will get worse in the future.

  Delayed writes typically are queued in no particular order.  If the
disk driver does nothing to schedule the write operations more
carefully, the rate at which the queue can be drained depends mostly on
the disk's average access time.  In table 2-1, I show how long it takes
to write the entire buffer cache (assuming this is 10% of main memory)
for systems typical of 1983 and 1993, and I rashly project current
trends 10 years into the future.

_______________

  1
   Some systems have indeed exhibited poor behavior resulting from disk
queues containing many asynchronous write requests [12].
A BETTER UPDATE POLICY                                                 5


 ----------------------------------------------------------------------
 |      |          |        |           |             |               |
 | Year | RAM size | Buffer | Number of |  Avg. disk  | Time to write |
 |      |          |  size  |  buffers  | access time |  all buffers  |
 |      |          |        |           |             |               |
 |------------------------------------------------------------------  |
 |      |          |        |           |             |               |
 | 1983 |  1 MB    | 512 B  |  205      | 35 msec     |  7.2 sec      |
 |      |          |        |           |             |               |
 |------------------------------------------------------------------  |
 |      |          |        |           |             |               |
 | 1993 | 64 MB    |   4 KB | 1638      | 15 msec     | 26 sec        |
 |      |          |        |           |             |               |
 |------------------------------------------------------------------  |
 |      |          |        |           |             |               |
 | 2003 |  4 GB    |  64 KB | 6554      |  6 msec     | 39 sec        |
 |      |          |        |           |             |               |
 ----------------------------------------------------------------------

      Table 2-1:  Scaling for random write of entire buffer cache

     --------------------------------------------------------------
     |      |          |              |           |               |
     | Year | RAM size | Buffer cache |   Disk    | Time to write |
     |      |          |     size     | Bandwidth |  all buffers  |
     |      |          |              |           |               |
     |----------------------------------------------------------  |
     |      |          |              |           |               |
     | 1983 |  1 MB    |  103 KB      |  1 MB/s   |  0.1 sec      |
     |      |          |              |           |               |
     |----------------------------------------------------------  |
     |      |          |              |           |               |
     | 1993 | 64 MB    | 6554 KB      |  5.5 MB/s |  1.2 sec      |
     |      |          |              |           |               |
     |----------------------------------------------------------  |
     |      |          |              |           |               |
     | 2003 |  4 GB    |  410 MB      | 30 MB/s   | 13.6 sec      |
     |      |          |              |           |               |
     --------------------------------------------------------------

    Table 2-2:  Scaling for sequential write of entire buffer cache

  If the disk drive system can optimize the order of writes in the
queue, in the best case the queue can be drained at full disk bandwidth.
(Many UNIX disk drivers do sort requests to avoid seeks [6], and some
modern disk drives themselves reorder requests.)  In table 2-2, I show
how the delays for this process scale over time.  These numbers are much
better than those in table 2-1, but they are probably unattainable, and
in any case they are also getting worse.

3. Implementation
  In this section, I discuss the implementation of various update
policies, including the original UNIX implementation of PU, my
approximate implementation of IPU, and Sprite's implementation of a
similar policy.
A BETTER UPDATE POLICY                                                 6


3.1. 4.2BSD implementation of the PU policy
  Before describing how I implemented the IPU policy, I will describe
the ULTRIX implementation of the PU policy.  My code is a simple
modification of the ULTRIX implementation.

  Every 30 seconds, a daemon process (/etc/update) wakes up and does a
sync() system call.  This system call schedules writes for certain file
system meta-data (the superblock, for example), and then calls the
bflush() routine to update delayed writes.

  In the original 4.2BSD implementation, bflush() traversed a queue
containing all of the valid blocks in the buffer cache, and scheduled an
immediate write for each delayed-write block.  Once a block was written
and removed from the queue, the algorithm started again from the
beginning of the queue; this is done because the queue could be
manipulated by another process while the bflush() is waiting for the
write to complete.  Thus, in the worst case this required time
proportional almost to the square of the size of the buffer cache, and
as memories grew larger, the /etc/update process started to consume a
large fraction of the CPU.

  Recent versions of ULTRIX solve this problem by keeping a separate
list of delayed-write blocks (i.e., blocks that are dirty and have not
yet been queued for the disk).  The bflush() routine simply traverses
this list once; it need not examine clean blocks, nor does it have to
examine any block more than once.


3.2. Implementation of the IPU policy
  Carson and Setia point out that to implement a pure IPU policy would
require a somewhat complex timer mechanism.  Since the timers in the
UNIX kernel are quantized, if one wants to issue write operations in the
same order as the blocks are dirtied, then one also needs to maintain a
separate ordered queue.  The overhead of the queue and timers may not be
onerous, but it does complicate the kernel.

  But why implement pure IPU?  If a practical implementation must use
quantized time, why not use a relative coarse grain?  If the algorithm
that moves delayed-write blocks onto the disk queues runs, say, once per
second, then the queues will receive bursts of writes, but the bursts
will (on average) be about 3% as large as they would be with the PU
policy and a 30-second update interval.  There will also be a 1-second
quantization error in the maximum vulnerable period for a dirty block,
but who cares?

  I chose to implement this kind of approximate IPU (or ``AIPU'').  I
created an alternative version of the sync() system call, smoothsync(),
that takes as its parameter the age at which a dirty block should be
               2
written to disk .  The smoothsync() system call invokes a modified
version of the bflush() routine, called bflush_smooth().  The main

_______________

  2
   Actually, instead of creating a true system call, I added an ioctl
request, since this involved writing less code.  The net effect should
be identical.  I added one more ioctl, to write file system meta-data to
disk; this can be called once every 30 seconds, to preserve existing
sync() semantics.
A BETTER UPDATE POLICY                                                 7


difference is that bflush_smooth() only schedules a buffer for writing
if it has been dirty for longer than the specified threshold.

  I replaced the usual /etc/update program, which simply calls sync()
once every thirty seconds, with one that calls smoothsync(30) once a
second.  This program also forces the file system meta-data to disk once
every 30 seconds.

  The system must also record the time at which a buffer becomes dirty,
using a timestamp field in the header associated with each buffer.  I
did this by adding a few lines of code to a routine called brelse(),
which is the only place where a buffer is placed on the delayed-write
list.  At this point, if the timestamp field is zero, then it is set to
the current time; otherwise, it is left alone.  The brelse() routine is
also the only place where buffers are placed on the list of clean
buffers; at this point, the timestamp field is set to zero.

  Thus, a clean buffer always has a zero timestamp.  A dirty buffer
always has a timestamp reflecting when it was first dirtied.  Further
modifications of a dirty block do not update the timestamp; otherwise, a
block that was touched more often than once every 30 seconds would never
be written to the disk.

  ULTRIX already includes a timestamp field, busy_time, in the buffer
header.  This is used only when a buffer is busy (i.e., on a disk
operation queue), and so is never used when a buffer is on the
delayed-write list.  Therefore, it can be ``time-shared'' between these
two uses.  Other operating systems, including 4.xBSD, do not have such a
timestamp field, and so the implementation of IPU would require its
addition.  The space overhead is small; using modular arithmetic, a
one-byte field would allow maximum ages under 255 seconds.
                                                      2
  This modification re-introduces the possibility of N  behavior in the
worst case (that is, when about half of the buffer cache is due to be
written during a single interval).  I solved this by using an extra
queue.  The bflush_smooth() routine starts by traversing the
delayed-write queue and moving ready-to-write buffers from there onto a
pending-write queue; this can be done without blocking, and in time
linear in the size of the buffer cache.  In the second phase,
bflush_smooth() writes the blocks on the pending-write queue.  It could
block during this phase, but because it simply pulls the first entry off
of the pending-write queue, the algorithm is linear in the number of
ready-to-write blocks, and so is also linear in the size of the buffer
cache, even in the worst case.


3.3. Sprite's implementation of an IPU policy
  The Sprite operating system [9] implements an approximate IPU policy,
although somewhat different from the one I implemented.  Sprite keeps
track of the first-dirty time for the oldest dirty block of each file.
Every five seconds, it scans all the dirty files in its cache, and if a
file's oldest dirty block is more than 30 seconds old, all of the file's
dirty blocks are written back [4].  Because this policy can in theory
cause write-backs of fairly young blocks, it may perform somewhat
differently from IPU or AIPU.  However, since most files are open for
only brief periods [1], and so normally all writes to a file happen more
or less simultaneously, the average lifetime of a dirty block should be
close to 30 seconds.
A BETTER UPDATE POLICY                                                 8


4. Results
  In this section, I describe some simple measurements comparing my
implementation of IPU against the original PU policy.  All of these
measurements were done using a modified ULTRIX version 4.3 kernel,
running on either of two DECstation systems; the hardware is summarized
in table 4-1.  The SCSI disk drives used apparently do not reorder
requests, and the ULTRIX version 4.3 SCSI device driver does not sort
requests before issuing them.

 Description      CPU type      SPECMark  Disk    Average        Bandwidth    
                                 rating   type  access time

Faster system  DECstation 5000  18.5      RZ58  18 msec      3.8-5.0 Mbyte/sec
                     model 200

Slower system  DECstation 3100  11.3      RZ57  23 msec      2.2 Mbyte/sec

               Table 4-1:  Systems used for measurements

  In all of the experiments in sections 4.1 and 4.2, two processes ran
simultaneously:

   - A write-load generator, configured to dirty half the blocks in
     the buffer cache every 30 seconds, at a rate set so that no
     block would be touched twice in one minute.  This means that
     none of the writes would hit a dirty block in the cache.

   - A read-load generator, which read 10,000 randomly chosen
     blocks from a large file, measured the read-response time for
     each block, and generated a histogram of the delays.

On the faster system, the buffer cache held 1228 blocks, and the
read-load generator used a 34 Mbyte file.  On the slower system, the
buffer cache held 614 blocks, and the read-load generator used a 32
Mbyte file.  Both the generators used files stored on the same disk.

  I varied the system configuration, enabling or disabling the use of
delayed writes for full data blocks, and changing the update policy.
Six trials were done for each configuration.  I did not measure a pure
write-through configuration, since I do not expect any modern system to
use pure WT, given its known poor behavior.


4.1. Local tests
  The first set of tests show how the update policy and use of delayed
writes affects response time for reads when the disk is local to the
generating host.  Figures 4-1 and 4-2 show the results for the faster
and slower systems, respectively.

  The figures show a point for each trial, plotted with mean read
response time on the horizontal axis, and the standard deviation of read
response time on the vertical axis.  Open squares show results for the
default configuration (asynchronous writes, PU policy):  moderately high
mean response time, but low variance.  When I enabled delayed writes
without changing the update policy, the mean response time dropped
somewhat, but the variance increased tremendously (open circles).  I
then switched to the AIPU policy (filled circles), which reduced the
variance without markedly increasing the mean.  One could also use the
AIPU policy without delayed writes (filled squares); this results in
about the same mean as the default configuration, but slightly less
variance.
A BETTER UPDATE POLICY                                                 9


           [For figure, see PostScript version of this paper]

              Figure 4-1:  Local random reads, fast system


           [For figure, see PostScript version of this paper]

              Figure 4-2:  Local random reads, slow system

  At first, it seemed strange that the mean read response time is lower
for delayed writes, since the write-load generator was constructed to
avoid dirty-block cache hits; that is, the total number of write
operations should be the same in either case.  However, the generator
does its writes sequentially, which means that when a group of delayed
writes is sent to the disk, they are likely to be directed at nearby
disk locations, and many end up in the same cylinder group.  This
reduces the average number of disk seeks per write (relative to
non-delayed writes), and so reduces the load on the disk.  Since the
read-load generator issues random-access reads as fast as possible, disk
seeks are probably the rate-limiting bottleneck, and a reduction in the
number of write-related seeks leaves more disk-seek capacity to be used
for reads.  Also, issuing multiple writes to the same region of the disk
may reduce rotational latencies.

  For example, figure 4-1 shows that without delayed writes, the system
can support about 32 reads/sec., and the write-load generator in this
case issues about 20 writes/sec., for a total load of about 52 disk
operations/sec.  Based on the average access time shown in table 4-1,
the disk drive should support about 56 random-access operations/sec,
which corresponds closely.  With delayed writes, the read rate increases
to about 38 reads/sec., which means that the disk should only be doing
about 14-17 random writes/sec.  This suggests that some fraction of the
20 blocks written are being combined with neighbors.

  Figures 4-1 and 4-2 show the standard deviation of the response times,
but this hides how truly awful things can be with the PU policy.
Figures 4-3 and 4-4 show histograms of response time for the faster and
slower systems, respectively.  In these histograms, all six trials for
each configuration have been combined, and the x-axis has been divided
into logarithmic buckets.

  The danger of combining delayed writes and the PU policy now shows up
clearly (open circles on the histograms).  Because the PU policy puts
all the dirty delayed-write blocks on the queue at once, which in these
trials could be as high as about 600 blocks, some reads will be delayed
by up to 8 or 9 seconds.  A significant number are delayed by longer
than one second.

  With the AIPU policy (filled circles), however, only one thirtieth of
the dirty blocks show up on the disk queue at any one time.  The graph
in figure 4-3 shows a small peak in read-response time at about 300
msec.  Since we expect 20 blocks (600/30) to be queued once a second,
this implies a mean write-delay of about 15 msec, which corresponds
closely with the RZ58's specified average access time of 18 msec.  More
A BETTER UPDATE POLICY                                                10


important, this configuration shows a maximum delay of 746 msec, and all
but one of the 60,000 samples are below 450 msec.  AIPU clearly improves
upon PU, in this experiment.

  Even when I disabled delayed writes, the PU policy still lead to
occasional long delays (up to five or six seconds).  In other words,
AIPU performs better than PU even if one does not want to abandon the
improved safety of using asynchronous writes.


           [For figure, see PostScript version of this paper]

Figure 4-3:  Local random reads, fast system (histogram of response time


           [For figure, see PostScript version of this paper]

Figure 4-4:  Local random reads, slow system (histogram of response time


4.2. Remote tests
  Most NFS client implementations, to ensure cache consistency and
detection of write errors, force delayed writes to the server when a
file is closed.  This means that NFS clients get little advantage from
delayed writes, and do not depend much on the update policy.  More
recent file service protocols, however, such as Sprite [9], Spritely
NFS [15], and NQNFS [7], use explicit cache-consistency protocols and so
can benefit from delayed writes.

  I ran a set of experiments using Spritely NFS to access a remote disk,
using the ``slower'' system as the client, and the ``faster'' system as
the server.  Both the read-load and write-load generators ran on the
same client host.  The server host supports PrestoServe(TM) non-volatile
RAM (NVRAM).  I ran six trials in each of six configurations, as shown
in figure 4-5, using random-access reads.  I also ran six trials in each
of four configurations, as shown in figure 4-6, using sequential reads;
in this set of trials, delayed writes were always used.  The
sequential-read generator cycled through the blocks of a file much
larger than the buffer cache on either the client or server, so no reads
were satisfied by the caches.


           [For figure, see PostScript version of this paper]

                    Figure 4-5:  Remote random reads

  These experiments showed less conclusive results than the local-disk
experiments.  For random reads (figure 4-5), delayed write combined with
PU results in slightly poorer read response time than delayed write with
AIPU, although several AIPU trials exhibited much higher variance than
any other trials.  PrestoServe also seems to be generally beneficial.
A BETTER UPDATE POLICY                                                11


           [For figure, see PostScript version of this paper]

                  Figure 4-6:  Remote sequential reads

  For sequential reads (figure 4-6), AIPU seems mostly to reduce the
variation between trials.  The mean response time is slightly worse for
AIPU than for PU without PrestoServe, and slightly better with
PrestoServe.


4.3. Kernel-build benchmark
  To see if the update policy had any effect on a ``real'' application,
I measured the time it took to compile and link the entire ULTRIX V4.3
kernel, under different combinations of update policy and write policy.
These tests were all run on the ``faster'' system, with all kernel
source and object files on the local disk.  This process creates about
43 MB of object files, and a similar amount of temporary file data.  I
ran three trials in each configuration; the means of the results are
shown in table 4-2.

-------------------------------------------------------------------------
|                |                   |                   |              |
|                |     Periodic      | Approx. interval  |   Relative   |
|                |      update       |  periodic update  | elapsed time |
|                |                   |                   |              |
|---------------------------------------------------------------------  |
|                |                   |                   |              |
| Asynch writes  | 3280 sec. elapsed | 3234 sec. elapsed | 0.986        |
|                |      40548 writes |      37854 writes |              |
|                |                   |                   |              |
|---------------------------------------------------------------------  |
|                |                   |                   |              |
| Delayed writes | 3298 sec. elapsed | 3206 sec. elapsed | 0.972        |
|                |      31523 writes |      25966 writes |              |
|                |                   |                   |              |
-------------------------------------------------------------------------


                        Mean values for 3 trials

           Table 4-2:  Elapsed time on kernel-build benchmark

  The AIPU policy shows a small but clear advantage over the PU policy,
especially when using delayed writes.  In fact, when using delayed
writes with the PU policy, the net elapsed time is actually slightly
worse than the normal ULTRIX configuration.  The combination of delayed
writes and AIPU is about 2.1% faster than the asynch write/PU
combination.  Note that this benchmark is rather CPU-bound; on these
trials, the CPU idle time averaged between 12% and 14%.  I would expect
AIPU to show a larger benefit on a more I/O-bound application.

  The table also shows the number of disk writes charged to the
processes involved in the build.  The kernel charges a process the first
time a block is dirtied; subsequent writes to a dirty block are not
counted.  We see that while use of delayed writes substantially
decreases the write count, by increasing the chance that a write will
A BETTER UPDATE POLICY                                                12


hit a dirty block, the AIPU policy provides another big decrease,
probably by increasing the average lifetime of a dirty block.  The
combination of delayed writes and AIPU eliminates over 35% of the disk
write operations; AIPU by itself accounts for less than half of the
improvement.

  The table does not show the number of read I/Os charged, since this
hardly varies at all with the update policy, and only a few per cent
with the use of delayed writes.


4.4. CPU costs of update mechanism
  Does the update policy have any effect on the CPU-time cost of doing
the updates?  The AIPU policy scans the list of dirty blocks 30 times
more often than the PU policy does, so one might expect it to consume
more CPU time.

  I measured the CPU time charged to the /etc/update update process
during the kernel-build benchmark.  This includes both user-mode time
(which should be nearly zero) and kernel-mode time (which accounts for
all CPU time spent in the bflush() routine, as well as other activity).
It does not include kernel-mode time spent as a result of disk
interrupts.  The results are shown in table 4-3; note that the
underlying measurements were done with 1-second resolution, and so small
variations in the results are not significant.

  The table shows that, not surprisingly, aggressive use of delayed
writes does increase the CPU time spent in finding delayed-write blocks
and scheduling them for disk I/O.  Contrary to my expectation, however,
AIPU actually reduces the CPU cost of doing updates (although the total
cost is in either case insignificant).

  I cannot provide a definite explanation, but I suspect that the cause
may be the difference in the number of delayed-write blocks actually
written to disk.  Both algorithms scan every delayed-write block in the
buffer cache, and since the AIPU algorithm does this 30 times more
often, this suggests that the cost of actually scanning blocks does not
                                                                      3
dominate the CPU time; the time is probably spent in the device driver .
I ran additional trials using AIPU with a 30-second interval between
updates (but still using a 30-second age threshold for writing back
dirty blocks). This took more CPU time than AIPU with a 1-second
interval, but still less than PU.

  Table 4-3 shows that AIPU scans far more blocks than PU, because PU
scans each block exactly once, but AIPU may scan a dirty block many
times before deciding that it is old enough to write to disk.  However,
AIPU actually writes fewer blocks than PU, because AIPU allows the
average dirty block to stay in the cache longer.  Recall that with PU,
the average age of a block when written to disk is 15 seconds, but with

_______________

  3
   Note that for the kernel-build benchmark, PU with delayed writes
scans 16749 blocks in 6.0 CPU seconds, placing a lower bound of 6/16.7 =
0.36 msec per block scanned.  This corresponds to several hundred
instruction executions, much more than could be accounted for by the
scanning loop itself, so most of this time must be spent in the disk
driver.
A BETTER UPDATE POLICY                                                13


                          PU,       PU,     AIPU-1,   AIPU-1,   AIPU-30,  AIPU-30,
                         async    delayed    async    delayed    async    delayed 
                         writes    writes    writes    writes    writes    writes 

CPU time used           4.3 sec.  6.0 sec.  2.0 sec.  2.3 sec.  2.7 sec.  2.7 sec.
by /etc/update
(mean of 3 trials)

Blocks scanned          9531      16749     290214    508448    14593     25847
by /etc/update

Blocks written          9531      16749     5375      10023     5266      9519
by /etc/update

Dirty blocks modified   391K      419K      393K      426K      405K      426K
during benchmark

biowait() sleep events  8234      8274      7607      8121      7496      7751
during benchmark

           Table 4-3:  Statistics for kernel-build benchmark

AIPU the average is 30 seconds, assuming a uniform rate of file writes.
(AIPU-30, with a 30-second threshold and a 30-second interval between
updates, yields an average age of 45 seconds, and so does slightly fewer
writes than AIPU-1).

  A dirty block that stays in the cache for a longer time is more likely
to be modified again before being written to disk, which results in
fewer modifications of clean blocks, and hence fewer writes to disk.  I
counted the number of modifications of currently dirty blocks (in the
brelse() routine); the numbers are shown in table 4-3.  AIPU with
delayed writes results in moderately more dirty-block modifications
(426K vs. 419K for PU); the difference in dirty-block modifications is
roughly the same as the difference in delayed-write disk I/Os.

  The reduction in actual disk I/Os, caused by longer cache lifetimes
and more dirty-block modifications, may account for all of the
elapsed-time advantage of AIPU over PU on the kernel-build benchmark.
As table 4-3, with AIPU-1 and especially AIPU-30, the kernel ``sleeps''
less often for disk I/O than it does with PU (although I could not
measure the total sleep time).  However, this effect should not
contribute to the random-access results in sections 4.1 and 4.2, since
these experiments were constructed to avoid any cache hits on file
writes.


4.5. Burstiness of file writes
  The advantage of AIPU over PU is that the latter clumps together all
delayed writes from a 30-second period into a single burst of writes,
while the former preserves the original spacing of the file writes (with
1-second resolution).  If the file writes themselves arrive in bursts,
this eliminates AIPU's advantage.  In the worst case, when all file
writes during a 30-second period occur nearly at once, both policies
should perform the same.

  In the experiments reported in sections 4.1 and 4.2, the write-load
generator distributed file writes uniformly over time, which is the best
case for AIPU.  The kernel-build benchmark should be more representative
of real use; how bursty is its file-write pattern?  (Note that this kind
A BETTER UPDATE POLICY                                                14


of single-user benchmark is more likely to exhibit burstiness than a
multi-user benchmark, because the latter will tend to spread out the
file-system load among several jobs.)

  I modified the bflush() and bflush_smooth() routines to keep a
histogram of the number of blocks they queue for the disk on each
invocation.  I then ran one trial of the kernel-build benchmark for each
of PU, AIPU-1, and AIPU-30.  In all cases, the update period (age at
which a block is queued to the disk) was 30 seconds.

  With AIPU-1, since blocks are queued 30 seconds after the
corresponding file write, the burstiness in queue-batch size directly
mirrors the burstiness in the file-write pattern (with 1-second
granularity).


           [For figure, see PostScript version of this paper]

Figure 4-7:  Burst-size distributions for writes during kernel-build ben

  The results are shown in figure 4-7, in the form of cumulative
distributions for the number of blocks written as a function of the
burst size.  Each policy writes a different number of blocks (see table
4-3), so the curves do not end at the same ordinate.  For AIPU-1 (with 1
second between updates and a 30-second period), 95% of all delayed-write
blocks written were queued by bflush_smooth() in bursts of 40 blocks or
fewer.  The largest burst contained 104 blocks.  In other words, the
application delivered relatively small bursts of writes to the file
system.

  For PU (with 30 seconds between updates), 95% of the delayed-write
blocks queued were in bursts of more than 62 blocks, and 50% were queued
in bursts of more than 182 blocks.  The largest burst contained 344
blocks, which (assuming an 18 msec. mean access time) delayed any
subsequent synchronous operation by over 6 seconds.

  To summarize figure 4-7, the kernel-build benchmark does indeed spread
out its writes over periods longer than a second.  This results in far
smaller disk-queue bursts when AIPU-1 is used than when PU is used.

5. Future work
  Carson and Setia proposed using a periodic update with read priority
(PURP) policy.  Although I resisted implementing PURP, because of its
greater complexity, in the general case one cannot avoid long disk
queues even with IPU or an approximation.  For example, if an
application manages to dirty the entire buffer cache within a second or
so, thirty seconds later an IPU policy will schedule writes for all
those blocks, and the effect will be the same as with the PU policy.  In
other words, IPU depends on a relatively uniform distribution of file
writes (across time) to achieve its more uniform distribution of disk
writes.

  If long disk queues are inevitable, and most of the entries on such
queues are inherently asynchronous, then giving priority to reads and
synchronous writes should improve response time.  I suspect that, even
with a priority scheme, an IPU policy could outperform PU.  Suppose the
buffer cache is entirely filled by dirty blocks; then, a read operation
must wait until the system cleans a block before it can complete.  The
A BETTER UPDATE POLICY                                                15


IPU policy generates clean blocks once a second or so (assuming a
uniform distribution of disk writes, across time), but the PU policy
only does this every 30 seconds.  Thus, with PU, reads would be more
likely to block waiting for a free buffer.  This is speculation; we need
experiments, simulation, or more formal analysis to discover the truth.

  Peacock [11, 12] has described several systems that use PURP, but
apparently did not explore modified update policies.  He found that
adding read priority to System V Release 4, in which the buffer cache
can be quite large, actually reduced benchmark performance by preventing
asynchronous requests from getting a sufficient share of the disk.

  My experiments have all used a 30-second period for PU and for the
lifetime of dirty blocks in IPU, and a one-second granularity for IPU.
I suspect that use of different periods and granularities, within
reasonable limits, will not make a big difference, but this should be
the subject of additional experiments.

  Some UNIX file system implementations attempt to cluster several
blocks together when performing a disk write [8, 11].  That is, if the
cache contains several dirty blocks that are adjacent on disk, the file
system or disk driver attempts to write them all at once, which improves
throughput by eliminating seeks and rotational delays.  Clustering can
be done in several different ways, and may interact with the delayed
write policy (that is, are all data writes delayed, or only
partially-filled blocks?) and with the update policy.  The ULTRIX
systems tested in section 4 use a clustering algorithm; I have not done
experiments to see if this affects the relative performance of update
policies.

6. Summary and conclusions
  The experiments described in this paper show that

   - Use of delayed writes can improve overall file system
     performance, including read response times, on both synthetic
     and actual workloads,

   - But when delayed writes are combined with the traditional
     periodic update policy, variance in read response time
     increases significantly, and benchmark performance may
     decrease,

   - So one should use a better update policy, such as interval
     periodic update or an approximation, whenever one uses a
     delayed write policy.

  I also showed that one can easily implement an approximate interval
periodic update policy, with remarkably limited changes to the kernel of
a traditional operating system.

Acknowledgements
  Scott Carson and Sanjeev Setia helped to encourage me to perform these
experiments.  John Ousterhout, Jim Gray, Kent Peacock, and the anonymous
reviewers provided useful comments during the preparation of this paper.

References

[1] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff,
and John K. Ousterhout.  Measurements of a Distributed File System.  In
Proc. 13th Symposium on Operating Systems Principles, pages 198-212.
Pacific Grove, CA, October, 1991.
A BETTER UPDATE POLICY                                                16


[2] Scott D. Carson and Sanjeev Setia.  Analysis of the Periodic Update
Write Policy For Disk Cache.  IEEE Transactions on Software Engineering
18(1):44-54, January, 1992.

[3] Anna Hac.  Design Algorithms for Asynchronous Write Operations in
Disk-Buffer-Cache Memory.  J. Systems Software 16(3):243-253, November,
1991.

[4] John Hartman.  Private communication. 1993.

[5] John L. Hennessy and David A. Patterson.  Computer Architecture: A
Quantitative Approach.  Morgan Kaufmann, San Mateo, CA, 1990.

[6] Samuel J. Leffler, Marshall Kirk McCusick, Michael J. Karels, and
John S. Quarterman.  The Design and Implementation of the 4.3BSD UNIX
Operating System.  Addison-Wesley, Reading, MA, 1989.

[7] Rick Macklem.  Not Quite NFS, Soft Cache Consistency for NFS.  In
Proc. Winter 1994 USENIX Conference, pages 261-278.  San Francisco, CA,
January, 1994.

[8] L. W. McVoy and S. R. Kleiman.  Extent-like Performance from a UNIX
File System.  In Proc. Winter 1991 USENIX Conference, pages 33-43.
Dallas, TX, January, 1991.

[9] Michael N. Nelson, Brent B. Welch, and John K. Ousterhout.  Caching
in the Sprite Network File System.  ACM Transactions on Computer Systems
6(1):134-154, February, 1988.

[10] John K. Ousterhout, Herve Da Costa, David Harrison, John A. Kunze,
Mike Kupfer, and James G. Thompson.  A Trace-Driven Analysis of the UNIX
4.2 BSD File System.  In Proc. 10th Symposium on Operating Systems
Principles, pages 15-24.  Orcas Island, WA, December, 1985.

[11] J. Kent Peacock.  The Counterpoint Fast File System.  In Proc.
Winter 1988 USENIX Conference, pages 243-249.  Dallas, TX, February,
1988.

[12] J. Kent Peacock.  File System Multithreading in System V Release 4
MP.  In Proc. Winter 1992 USENIX Conference, pages 19-29.  San Antonio,
TX, June, 1992.

[13] Dennis M. Ritchie.  Private communication. 1994.

[14] D.M. Ritchie and K. Thompson.  The UNIX Time-Sharing System.  AT&T
Technical Journal 57(6):1905-1929, July-August, 1978.

[15] V. Srinivasan and Jeffrey C. Mogul.  Spritely NFS: Experiments with
Cache-Consistency Protocols.  In Proc. 12th Symposium on Operating
Systems Principles, pages 45-57.  Litchfield Park, AZ, December, 1989.


UNIX is a registered trademark of X/Open Company, Ltd.  ULTRIX and
DECstation are trademarks of Digital Equipment Corporation.  PrestoServe
is a trademark of Legato, Inc.


  Jeffrey Mogul received an S.B. from the Massachusetts Institute of
Technology in 1979, and his M.S. and Ph.D. degrees from Stanford
University in 1980 and 1986.  Since 1986, he has been a researcher at
the Digital Equipment Corporation Western Research Laboratory, working
A BETTER UPDATE POLICY                                                17


on network and operating systems issues for high-performance computer
systems.  He is a member of ACM, Sigma Xi, ISOC, and CPSR, the author or
co-author of several Internet Standards, an associate editor of
Internetworking: Research and Experience, and was Program Committee
Chair for the Winter 1994 USENIX Technical Conference.

  Address for correspondence: Digital Equipment Corporation Western
Research Laboratory, 250 University Avenue, Palo Alto, California, 94301
(mogul@wrl.dec.com)