An Analysis of Trace Data
         for Predictive File Caching in Mobile Computing

      Geoffrey H. Kuenning, Gerald J. Popek, Peter L. Reiher
              University of California, Los Angeles
                          April 6, 1994

                             Abstract
      One  way to  provide  mobile computers  with access  to
     the resources  of  a network,  even  in  the absence  of
     communication, is to  predict which information  will be
     used during disconnection and cache the appropriate data
     while still connected.  To  determine the feasibility of
     this approach, traces of file-access  activity for three
     diverse application domains  were collected  for periods
     of over two months.  Analysis of these traces shows that
     working sets tend to be small compared to modern disk sizes,
     that users tend to reference the  same files for several
     days or even weeks  at a time, and  that different users
     do not tend to  write to the same file  except in highly
     constrained circumstances.  These  factors encourage the
     conclusion that an automated caching system can be built
     for a wide variety of environments.


1 Motivation

The value  of mobile computers  is that they  allow users to  work
while disconnected from  their normal resources.  However,  mobile
computers  typically have  a  great deal  less disk  storage  than
is available  via remote  mounting on  connected networks.    This
forces  mobile computer  users to  face a  challenging problem  of
----------------------------
 *
  This  work was  partially  supported by  the  Advanced  Research
Projects Agency under contract N00174-91-C-0107.


making sure their limited disks always store  the information they
will  need while  disconnected  from other  machines.    Requiring
users to deal  explicitly with this issue  puts a heavy burden  on
them, and the realities of modern software  methods make it nearly
impossible for users to identify all the files they actually
need.(1)  A fully automated caching mechanism that predictively
stored  all files  a user  needs on  his mobile  machine would  be
very valuable.   Such a mechanism  is only practical, however,  if
information that can be gathered automatically  fully captures the
typical user's working set of files.
   A prototype system of this sort was developed  under CMU's Coda
system [6,  14] and  proved successful,  but was inconvenient  for
the user and was tested only in one application environment.
   We  undertook this  research  to investigate  the  practicality
of  automatic  file  caching  for  mobility  in  a  wider  set  of
application  domains,  and  to discover  new  and  less-burdensome
ways of  identifying files  to be  cached.   Our  approach was  to
collect  traces of  file-access activity  in several  environments
over a long period  of time, and analyze them for  feasibility and
predictability of caching.
   We chose to collect our own traces, rather  than using existing
traces,  for  three reasons.     First,  few existing  traces  are
long  enough.   Because  most existing  traces collect  read/write
activity,  a few  weeks  of data  is  sufficient to  tax  resource
limits.    We were  interested in  observing longer-term  periodic
behaviors such as  end-of-the-month billing work in  an accounting
department,  which therefore  required  a several-month  trace  to
establish a pattern.
   Second,  existing  traces  have tended  to  be  limited  to  an
engineering application  domain, usually programming.   We  wanted
to investigate  the behavior  of non-programmers as  well, in  the
twin  beliefs  that this  type  of  user will  eventually  be  the
largest population  of portable  users, and  that these users  may
behave quite differently from programmers.
   Third, most previous studies have been limited to
analysis  of working-set  sizes and  file-system performance  data
[1, 2, 6, 11, 14].   The latter is not relevant to  this research,
and the former, while very important, is  not in itself sufficient
----------------------------
 1
  For example,  starting the  X Window System  requires access  to
10--30  files or  more.    The  identities of  many of  these  are
surprising even to expert systems programmers [6].


to characterize the  user behaviors critical to  successful mobile
caching.
   Successful automated  caching requires  two characteristics  in
user behavior:

   * The working set of files,  as observed over a period  of days
     or weeks, must be small enough to fit on a portable's disk.

   * It must be  possible to predict  the working set in  advance,
     using hints such as the current working set,  historical file
     access patterns [15], or known patterns in user behavior.

   Analysis  of  the data  we  have  collected  shows  that  these
characteristics are present  in a number of  different application
domains.

2 Methodology

We  collected  our  traces  at  Locus  Computing  Corporation,   a
software development  and consulting  firm, during  the summer  of
1993.    One of  Locus'  products, PC/Interface  (PCI) [8],  is  a
DOS-to-Unix file system  implemented as a pseudo-disk driver  on a
DOS machine  which communicates via Ethernet  to a file server  on
the Unix  system, making  the Unix  file system  available to  the
DOS users  as native  PC files.   In  the environments  monitored,
the  local DOS  filesystem  was used  to store  some  applications
software,  but all  shared corporate  data was  accessed via  PCI.
The Unix  server for PCI  was modified to  log opens, closes,  and
deletes of  files.  By  avoiding read/write logging, we  minimized
the performance impact and kept the log files small.   Log entries
contain an operation type  and subtype (e.g., open for  read), the
Unix  timestamp in  seconds,  the Unix  UID  of the  invoker,  the
process ID,  the absolute pathname  of the file,  and the size  of
the file.
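
   As an illustration only, a log entry with the fields listed
above could be modeled as follows.  The paper does not give the
on-disk log format, so the tab-separated layout, the field order,
and the names TraceRecord and parse_record are assumptions made
for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """One logged file operation (hypothetical in-memory layout)."""
    op: str         # operation type, e.g. "open", "close", "delete"
    subtype: str    # operation subtype, e.g. "read" or "write"
    timestamp: int  # Unix timestamp in seconds
    uid: int        # Unix UID of the invoker
    pid: int        # process ID
    path: str       # absolute pathname of the file
    size: int       # size of the file in bytes (unit is an assumption)


def parse_record(line: str) -> TraceRecord:
    """Parse one tab-separated log line (assumed format, not the real one)."""
    op, subtype, ts, uid, pid, path, size = line.rstrip("\n").split("\t")
    return TraceRecord(op, subtype, int(ts), int(uid), int(pid), path, int(size))
```

Tabs are used as the separator here only so that pathnames
containing spaces survive parsing.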
   Three  different user  environments were  monitored.    In  the
first, referred  to as ``personal  productivity,'' the server  was
a  machine that  acted  as the  network  filesystem for  47  users
running  business-oriented applications  such as  e-mail,  project
and calendar  scheduling, and  word processing.   These users  did
not tend to store  important files on their own machines,  so they
generated high activity at the server.  This server was traced
for 1563 hours (65.1 days, or 9.3 weeks),(2) recording 4,637,924
accesses.
   In the second environment, referred to as  ``programming,'' the
server  was a  cluster of  10 machines  running IBM's  Transparent
Computing Facility, an adaptation of the Locus distributed
operating system  [12], which  provides a  single-system image  to
users  of multiple  machines.   Each  machine ran  a separate  PCI
server,  and  logs from  these  servers  were later  combined  for
analysis.    Most of  the users  of this  server were  programmers
working  on DOS-based  software.    Because  they  performed  much
of  their work  locally,  accessing the  shared server  mostly  to
retrieve or update shared source files,  they generated relatively
little server  activity.   The traces  on this server  essentially
reflect  commits  to  a  shared  database,   while  omitting  most
localized file  activity.   This server was  accessed by 64  users
and  was  traced  for 1693  hours  (70.5  days,  or  10.1  weeks),
recording 93,719 accesses.
   In the  third environment, referred  to as ``commercial,''  the
server was a  single machine used by the accounting  department to
run a  commercial accounting  application.   The master  corporate
accounting database  was kept on the  Unix server, but all  access
to this  (shared) database  was via DOS  workstations running  the
commercial  package.   This  server was  accessed by  7 users  and
was traced  for 1257 hours  (52.4 days,  or 7.5 weeks),  recording
371,830 accesses.
   The nature  of the  traced environment (local  files stored  on
PCs, with shared files stored remotely) parallels the expected
behavior of mobile users, who will probably store heavily-used
applications locally(3) but make extensive use of shared resources
when they  are network-connected.   However, based on  preliminary
analysis of  these traces, we  also generated two modified  traces
that omitted  certain characteristics we  felt might be absent  on
portable platforms due  to different software and  user behaviors.
For  the commercial  environment,  we reduced  all file  sizes  to
a  maximum of  1  MB,  on the  theory  that very  large  databases
----------------------------
 2
  50 days into this trace, there was a data gap of approximately
48 hours due to an administrative error.  It does not appear that
this gap affects the validity of the analysis.
 3
  We hope that even these will eventually fall under the purview
of an automated caching system.


would be represented by smaller slices in  a portable environment.
This  change  primarily affected  the  statistics  on  working-set
sizes  and  the  amount  of  data  involved   in  write  conflicts
and  attention shifts,  which  are measures  of file  sharing  and
working-set variability  that we will  define in Section  3.   For
the  productivity environment,  we  eliminated all  references  to
fax  spooling and  mail  files,  because  such files  are  handled
in  a  queued  (as  opposed  to  shared)  manner  in  disconnected
environments.    This change  affected all  of  the statistics  we
analyzed.   These two data sets  are referred to as the  ``reduced
commercial''  and  ``reduced productivity''  environments  in  the
tables and graphs.
   Once the traces were  collected, we canonicalized them  using a
simple awk  script that  converts relative  pathnames to  absolute
form,  correlates  each  close with  the  corresponding  open  and
produces  an  output  line whose  format  is  independent  of  the
operation  type to  make  subsequent  processing easier.     These
canonicalized files  were then  compressed and used  as the  basis
for  our  analysis.     The  largest  of  these  files  (from  the
productivity  server) is  nearly 18  megabytes  in its  compressed
form, and about 10 times that large when expanded.
   Originally, we used a  collection of shell and awk  scripts for
all analysis.   As the collected data grew, many of  these scripts
became computationally impractical  and were replaced by  tailored
programs.    The  current  design  performs the  analysis  in  two
phases.  First, a single-pass program reads  the data and extracts
summary information  of interest.   For example, for each  24-hour
day in the collected data, the extraction  program writes a single
line for each  user giving the total  size of that user's  working
set, measured  in both megabytes  and files.   A second pass  then
analyzes  these  summary files  with  general-purpose  statistical
tools, generating  the final tables  and graphs presented in  this
paper.
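
   The extraction phase for working sets can be sketched roughly
as follows.  This is not the authors' program: the function name,
the tuple layout of the input, the alignment of 24-hour days to
the start of the trace, the choice to count each file once per
day at its largest observed size, and the use of 2^20 bytes per
MB are all assumptions of this sketch.

```python
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60


def daily_working_sets(records, trace_start):
    """Summarize per-user, per-day working sets in one pass.

    records: iterable of (timestamp, uid, path, size_bytes) tuples.
    Returns {(uid, day_index): (file_count, megabytes)}.
    """
    # (uid, day) -> {path: largest size seen for that path on that day}
    files = defaultdict(dict)
    for ts, uid, path, size in records:
        day = (ts - trace_start) // SECONDS_PER_DAY
        seen = files[(uid, day)]
        seen[path] = max(size, seen.get(path, 0))
    return {
        key: (len(seen), sum(seen.values()) / 2**20)
        for key, seen in files.items()
    }
```

Each entry of the result corresponds to one summary line (one
user, one day) that a second-phase statistical tool could consume.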


3 Statistics

We  generated the  same  statistics  for each  parameter  in  each
environment:   mean,  standard deviation,  and maximum.    Besides
the  traditional measure  of working-set  size, we  looked at  two
measures  that  have  special  application  to  mobility:    write
conflicts and attention shifts.
   We  define a  write  conflict event  to  occur when  two  users
write  to the  same  file within  a  relatively short  time  span.
In a  mobile environment,  a conflicted  file might be  replicated
on two  or more  computers, and  the system would  be required  to
automatically resolve these  conflicts after the fact in  a manner
similar to  the Ficus distributed  file system [3,  4, 7, 13],  to
force the user  to resolve them by  hand [6], or to limit  writing
to  only one  user.    We  examined  conflicting writes  within  a
24-hour period (corresponding to taking a  machine home overnight)
and a 7-day period (corresponding to traveling with a machine).
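
   One possible way to detect such conflicts is sketched below.
The text does not say how the time span is aligned, so this
sketch assumes fixed windows measured from the start of the
trace and reports a conflict for every file written by two or
more distinct users within one window.  The function name and
input layout are likewise hypothetical.

```python
from collections import defaultdict


def find_write_conflicts(writes, trace_start, window_hours=24):
    """Group writes by (file, window) and keep those with >= 2 writers.

    writes: iterable of (timestamp, uid, path) for opens-for-write.
    Returns {(path, window_index): set_of_uids}.
    """
    window = window_hours * 3600
    writers = defaultdict(set)  # (path, window index) -> uids that wrote
    for ts, uid, path in writes:
        writers[(path, (ts - trace_start) // window)].add(uid)
    return {key: uids for key, uids in writers.items() if len(uids) >= 2}
```

Calling the function with window_hours=168 gives the 7-day
variant; the size of each returned set is the number of users
involved in that conflict.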
   An attention shift occurs when a single  user radically changes
his  or her  working  set.    We  identified attention  shifts  by
looking  at the  working  sets in  successive active  n-hour  time
periods  (which did  not necessarily  represent  adjacent days  or
weeks).  Within each time period, we counted the total numbers
of files accessed, k_1 and k_2, and then calculated
k = min(k_1, k_2).  Within the second period, we also counted the
total number m of files that had not been referenced during the
first period, but that had existed prior to either period.(4)  An
attention shift was defined to occur if m >= pk, where
0 <= p <= 1.  Attention shifts can be characterized by the
parameters p, expressed as a percentage, and n, the number of
hours in the period.  We use the notation p%/n to describe an
attention shift parameter pair.  Based on a sensitivity analysis
(see Figures 6--8), we chose p = 20%.  We chose n = 24 and
n = 168 (1 week) because these represent typical disconnection
periods for many portable users.
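
   The attention-shift test can be written down directly from
this definition.  The sketch below assumes that each period is
represented as a set of pathnames and that the set of files
existing before either period is known; the function and
parameter names are hypothetical.  The threshold is compared in
integer percent to avoid floating-point rounding at the boundary
m = pk.

```python
def is_attention_shift(first_period, second_period, preexisting, p_percent=20):
    """Apply the m >= p*k test to two successive active periods.

    first_period, second_period: sets of paths referenced in each period.
    preexisting: set of paths that existed prior to either period.
    Returns (shift_detected, new_files).
    """
    k = min(len(first_period), len(second_period))
    # "New" files: referenced in the second period, not in the first,
    # and not created during the periods under comparison.
    new_files = (second_period - first_period) & preexisting
    m = len(new_files)
    # Periods are assumed active (k > 0); an empty period never shifts.
    shifted = k > 0 and m * 100 >= p_percent * k
    return shifted, new_files
```

With p_percent=20 and periods of 24 or 168 hours, this corresponds
to the 20%/24 and 20%/168 parameter pairs.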
   A  final  characteristic of  an  attention  shift  is  the  age
of  the shift,  which  represents the  amount  of time  which  has
elapsed since the  user last referenced one of the  ``new'' files.
We  estimated the  age by  locating  the most  recently-referenced
``new''  file  (a  file included  in  count  m),  and  subtracting
its  reference time  from the  start time  of  the second  period.
This  is  a  conservative  measure,  since  it  assumes  that  the
most-recently-referenced  file  is representative  of  the  entire
group m of ``new'' files.
   However,  since many  of  the  newly-referenced files  did  not
appear previously  in the  trace, it  was not  always possible  to
find a file to use  in calculating the age of the shift.   In this
case, we  conservatively assumed that  the ``new'' files had  been
----------------------------
 4
  We eliminated files that  were created during the  second period
because they are not problematical for a  caching system that must
predict which existing files need to be stored.


referenced exactly one  second before the beginning of  the entire
trace.    Because of  these two  assumptions, the  attention-shift
ages reported  in this paper  are only a lower  bound on the  true
ages that would be encountered by a predictive caching system.
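
   A minimal sketch of this age estimate follows.  It assumes a
mapping from each pathname to the time of its most recent
reference before the second period; that mapping, the function
name, and the use of seconds as the unit are assumptions of the
sketch rather than details given in the paper.

```python
def attention_shift_age(new_files, last_reference, second_period_start, trace_start):
    """Return the conservative age of an attention shift, in seconds.

    new_files: the "new" files of the shift (those counted in m).
    last_reference: {path: time of latest reference before the second period}.
    """
    times = [last_reference[f] for f in new_files if f in last_reference]
    # If no "new" file appears earlier in the trace, assume a reference
    # exactly one second before the beginning of the entire trace.
    latest = max(times) if times else trace_start - 1
    return second_period_start - latest
```

Dividing the result by 86,400 gives the age in days.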
   The bounded  locality intervals  discussed in  [9] are  similar
to attention  shifts, but are  parameterized on working-set  sizes
rather than on the expected length of a disconnection.
   The statistics we report are:

Working-set statistics. For each day  and week, we calculated  the
     working  set size  in  files,  MB, and  number  of  accesses.
     Means and  standard deviations were  calculated by  averaging
     data across time for each UID, and then  calculating the mean
     and standard deviation across the per-UID means.

Attention-shift statistics. For  each 1-day  and  7-day  attention
     shift, we examined the  total size of the working  set needed
     to hold both the old  and the new data (in files and  MB). We
     also calculated  the per-user  attention shift  rate per  day
     and per week.  Finally, we calculated the age of each shift.

Conflict statistics. For each conflict, we examined the number of
     users involved and  the size of the file  involved.  We  also
     calculated the per-user conflict rate per day and per week.
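
   The two-level averaging described under the working-set
statistics can be sketched as follows.  The paper does not say
whether the sample or population standard deviation was used, or
over which values the maximum was taken; this sketch assumes the
sample standard deviation of the per-UID means and the maximum
over all individual observations.  The function name is
hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_across_users(samples):
    """samples: iterable of (uid, value), one value per user per period.

    Returns (mean of per-UID means, std. dev. of per-UID means, maximum).
    """
    per_uid = defaultdict(list)
    for uid, value in samples:
        per_uid[uid].append(value)
    # First average across time for each UID ...
    uid_means = [mean(values) for values in per_uid.values()]
    # ... then take the mean and deviation across the per-UID means.
    sigma = stdev(uid_means) if len(uid_means) > 1 else 0.0
    overall_max = max(v for values in per_uid.values() for v in values)
    return mean(uid_means), sigma, overall_max
```

Averaging per user first keeps one very active user from
dominating the mean.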

   Success  in  mobile  computing  depends  on  small  values  for
all  of these  statistics.    Clearly,  the  working set  must  be
small enough  to fit comfortably  on the typical portable's  disk.
The  attention-shift rate  should remain  low,  both so  that  the
longer-period working set  remains small and so that it  is easier
to predict the future  working set based on recent behavior.   The
conflict rate must remain low to allow convenient file updates.


4 Analysis

The results of our analysis are very  encouraging for our intended
application,  automated caching  of  files for  mobile  computers.
As hoped,  working sets  are small  and attention-shift rates  are
low.    Conflict rates  are generally  low,  and it  is clear  how
one  could handle  conflicts  in the  environments that  had  high
conflict rates.   However, attention-shift  ages tend to be  high,
indicating that a predictive caching system will  need to exercise
significant intelligence  to ensure  that a  portable computer  is
prepared for attention shifts.


   Each table  of statistics given  below lists  the mean for  the
statistic,  followed by  the standard  deviation (in  parentheses)
and the maximum.  For example, in Table 1,  the mean daily working
set for the  productivity environment was 1.0 MB, with  a standard
deviation of 2.0 MB and a maximum of 134.5 MB.
   With  the exception  of  Figures  6--8, all  figures  show  the
variation in a given measure over the duration of the  trace.  For
example,  Figure 1 shows  the daily  and weekly  working sets  for
the productivity environment, for each day and each week captured
during the trace.(5)

4.1 Working Sets

Table 1  summarizes the working-set  sizes we  observed.   Figures
1--4 show  the variation  in mean  and maximal  working set  sizes
with time.
   Mean  working-set  sizes  tended  to  be  small  in  all  three
environments, with the largest being about 18 MB per day and
27 MB per week, in the commercial environment.  Maximal
working  sets were  very  large  (148 MB  per  week) only  in  the
personal-productivity  environment,  apparently due  to  a  single
grep-style  operation  that occurred  in  week  9.    This  ``grep
phenomenon'' is  clearly visible in  Figure 1.   Eliminating  this
single  maximum  produced  a secondary  maximum  of  only  76  MB.
Maximal  working sets  in the  other environments  ranged only  to
66 MB.
   These working-set  figures indicate  that  it will  be easy  to
store  enough files  on a  portable disk  to  satisfy the  average
user,(6) although some software or user behavior may have to
----------------------------
 5
  In these and all other graphs, the lines connecting data points
are present only to make it easier to see associated points.
Although the daily maxima in the right-hand sides of Figures 4 and
5 appear to exceed the weekly maxima, careful examination shows
that only the connecting lines cross, and the actual data points
for weekly maxima are always larger than the daily values.
 6
  We expect working-set sizes to increase over the next few years
as users move towards multimedia applications, but we also expect
that disk sizes will increase sufficiently for portable computers
to keep pace.  In some sense, this phenomenon is self-regulating,
since users will not tend to use images


                          Daily WS Size          Daily WS Size          Weekly WS Size         Weekly WS Size
                               (MB)                 (Files)                  (MB)                 (Files)
 Environment            Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max
-------------------------------------------------------------------------------------------------------------
 Productivity            1.0   (2.0)  134.5 |   39    (80)   3293 |  2.7   (4.7)  148.4 |  110   (215)   3284
 Reduced Productivity    0.7   (1.8)   41.1 |    7    (10)    547 |  1.4   (2.8)   43.6 |   19    (31)    548
 Programming             0.3   (0.4)   18.0 |   10    (27)   2153 |  0.6   (1.1)   18.3 |   22    (55)   2170
 Commercial             18.2  (13.1)   65.0 |  294   (442)   1643 | 26.8  (16.6)   65.7 |  374   (553)   1638
 Reduced Commercial     10.9   (6.0)   33.6 |  294   (442)   1643 | 16.8   (8.7)   33.8 |  374   (553)   1638

                 Table 1:  Working-Set Statistics

------------------------------------------------------------------


    Figure 1:  Working-Set Sizes for Productivity Environment


Figure 2:  Working-Set Sizes for Reduced Productivity Environment
------------------------------------------------------------------


     Figure 3:  Working-Set Sizes for Programming Environment
------------------------------------------------------------------


     Figure 4:  Working-Set Sizes for Commercial Environment


 Figure 5:  Working-Set Sizes for Reduced Commercial Environment


                            Number Per                 MB                   Files                   Age
                           User Per Day             Involved               Involved               (Days)
 Environment            Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max
-------------------------------------------------------------------------------------------------------------
 Productivity            0.4   (0.3)    0.8 |  1.6   (6.5)  135.7 |   64   (164)   3296 | 10.0  (15.7)   64.7
 Reduced Productivity    0.2   (0.2)    0.5 |  0.8   (3.2)   41.1 |   13    (33)    548 | 26.2  (19.7)   64.7
 Programming             0.3   (0.2)    0.5 |  0.6   (1.6)   20.9 |   16   (109)   2161 | 28.0  (21.3)   70.2
 Commercial              0.3   (0.3)    0.9 | 21.8  (13.8)   65.7 |  217   (398)   1654 |  3.2   (4.6)   35.7
 Reduced Commercial      0.3   (0.3)    0.9 | 14.6   (8.1)   33.8 |  217   (398)   1654 |  3.2   (4.6)   35.7

        Table 2:  20%/24-Hour Attention Shifts (All Users)

change.   (For  example, instead  of relying  on a  large grep,  a
user might  use an inverted index  to locate the files  containing
references to a particular string [10].)


4.2 Attention Shifts

Tables 2 and 3 summarize the attention shifts observed.
Figures 6--8 show the sensitivity of attention-shift rates to the
parameter p.  Except in the commercial environment, the number
of attention shifts steadily decreases with increasing p, but the
exact shape of the curve is quite inconsistent.  In the absence
of a clear-cut change in curvature (a knee or cliff) to guide us
in the selection of p, we chose p = 20%, which is near enough to
the peak of the curves that we will not tend to underestimate the
number of attention shifts, yet not so small that we will detect
a shift every time a user accesses one or two new files.
   Figures  9--11 show  the  variations in  attention-shift  rates
with time, for p=20%.     The amount of data involved in attention
shifts was  generally small  (33 MB  or less),  though the  maxima
----------------------------
extensively if this would tax their portable storage capacity.


                            Number Per                 MB                   Files                   Age
                          User Per Week             Involved               Involved               (Days)
 Environment            Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max
-------------------------------------------------------------------------------------------------------------
 Productivity            0.6   (0.3)    0.8 |  4.7  (12.4)  151.8 |  177   (376)   3423 | 15.7  (15.2)   62.7
 Reduced Productivity    0.3   (0.2)    0.4 |  2.0   (5.5)   44.3 |   37    (71)    553 | 32.4  (18.9)   62.7
 Programming             0.4   (0.2)    0.6 |  1.7   (3.4)   22.6 |   55   (215)   2174 | 28.9  (20.0)   68.2
 Commercial              0.5   (0.4)    1.0 | 33.3  (17.4)   66.8 |  420   (584)   1661 | 11.1   (6.1)   33.7
 Reduced Commercial      0.5   (0.4)    1.0 | 21.1   (9.0)   33.8 |  420   (584)   1661 | 11.1   (6.1)   33.7

             Table 3:  20%/168-Hour Attention Shifts


------------------------------------------------------------------


Figure 6:  Attention-Shift Sensitivity for Productivity Environment


Figure 7:  Attention-Shift Sensitivity for Programming Environment
------------------------------------------------------------------


Figure 8:  Attention-Shift Sensitivity for Commercial Environment
------------------------------------------------------------------


Figure 9:  20% Attention-Shift Rates for Productivity Environment


Figure 10:  20% Attention-Shift Rates for Programming Environment


------------------------------------------------------------------


 Figure 11:  20% Attention-Shift Rates for Commercial Environment


were  large (up  to 152  MB; this  follows from  the  size of  the
maximal working  set and  the definition of  an attention  shift).
In  all three  environments, the  number of  attention shifts  was
surprisingly large  and consistent, averaging  up to 0.6 per  user
per week.  This has serious implications  for a predictive caching
scheme, because it shows that simply caching the most recently
used files (that is, LRU replacement) is not sufficient.
   However,  because  of  the  small  size  of  the  working  sets
involved  in   the  average  attention   shift,  a   well-designed
predictive cache  can afford  to store  both the old  and the  new
set, so that attention  shifts need not affect the usability  of a
mobile computer.
   Of  course,  if there  is  space  to  store both  the  old  and
new  working  set,  the  question  arises  whether  a  simple  LRU
scheme would  be sufficient to ensure  that both working sets  are
available.   The  attention-shift age  figures shown  in Tables  2
and  3 belie  this  notion.    For both  the programming  and  the
reduced productivity  environments, the mean  age of an  attention
shift  is  over  4  weeks and  the  maximum  is  near  the  length
of  the trace,  indicating that  an LRU  cache  would very  likely
have been  flushed by transient  phenomena before the older  files
were  re-referenced.    This  hypothesis is  strengthened  by  the
observation that  the conservative method  of estimating the  ages
of previously-unreferenced files, explained in Section 3, would
produce a mean age  of approximately half the length of  the trace
(about 5  weeks) if there  were absolutely  no historical data  in
the trace.   In actuality, the new  working set may not  have been
accessed  for many  months and  thus may  have  been flushed  from
even a very  lengthy LRU cache.   Other methods will be  needed to
ensure that  a mobile machine  will be  prepared for an  attention
shift.  The  above data merely assures us that there will  be room
to store both  today's and tomorrow's working sets once  they have
been identified.

4.3 Conflicts

Tables 4 and 5 show the rate of occurrence of conflicts and
statistics about the conflicts themselves, respectively.
Figures 12--14 show the variations in conflict rates with time.
Conflicts were very rare in the ``programming'' environment,
averaging 0.01 conflicts per user per day, and only 0.10 per
week.  In nearly every case only two users
were involved in  a given conflict, although occasionally  a third
would write to the same file within 24 hours.
   As expected,  the 7  users of  the ``commercial''  environment,


                           Daily Conflicts        Weekly Conflicts
   Environment            Mean   sigma    Max | Mean   sigma    Max
   ----------------------------------------------------------------
   Productivity           1.19  (1.16)   4.28 | 5.57  (3.58)  10.11
   Reduced Productivity   0.00  (0.01)   0.05 | 0.02  (0.03)   0.07
   Programming            0.01  (0.02)   0.06 | 0.10  (0.09)   0.28
   Commercial             4.29  (4.74)  16.29 |11.30  (8.92)  24.57
   Reduced Commercial     4.29  (4.74)  16.29 |11.30  (8.92)  24.57

                     Table 4:  Conflict Rates
------------------------------------------------------------------


                           MB Involved           Users Involved          MB Involved           Users Involved
                             in Daily               in Daily              in Weekly              in Weekly
                            Conflicts              Conflicts              Conflicts              Conflicts
 Environment            Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max | Mean   sigma    Max
-------------------------------------------------------------------------------------------------------------
 Productivity           0.02  (0.08)   2.05 | 3.39  (3.06)  22.00 | 0.02  (0.08)   2.05 | 3.61  (3.62)  27.00
 Reduced Productivity   0.04  (0.04)   0.12 | 2.00  (0.00)   2.00 | 0.04  (0.04)   0.12 | 2.00  (0.00)   2.00
 Programming            0.07  (0.16)   1.08 | 2.02  (0.15)   3.00 | 0.06  (0.12)   1.08 | 2.09  (0.29)   3.00
 Commercial             0.22  (0.81)   5.37 | 3.16  (1.10)   6.00 | 0.27  (0.83)   5.37 | 3.16  (1.28)   6.00
 Reduced Commercial     0.17  (0.81)   5.37 | 3.16  (1.10)   6.00 | 0.20  (0.83)   5.37 | 3.16  (1.28)   6.00

                       Table 5:  Conflicts


     Figure 12:  Conflict Rates for Productivity Environment
------------------------------------------------------------------


      Figure 13:  Conflict Rates for Programming Environment
------------------------------------------------------------------


      Figure 14:  Conflict Rates for Commercial Environment


with  its shared  accounting database,  produced  a high  conflict
rate of 11  per user per week, with  up to 6 users writing  to the
same file in a single day.  In a  mobile environment, an automated
resolver similar to  those discussed in [13] would be  required to
handle these  numerous conflicts.   Since accounting  applications
typically involve appending records to a  transaction database, we
expect that such a resolver would be easy to write.
   The  surprise was  the ``personal  productivity''  environment,
which produced  conflict rates up  to 1.2 per  user per day,  with
up  to 22  users writing  to the  same file  in  a single  24-hour
period.   We examined these conflicts  in more detail to  discover
the cause,  and found that nearly  all of them involved  mailboxes
or fax-spooling files.
   Since both mailbox and spooling files operate in a modified
append-only mode (all users but one append to the end of the
file, and a simple locking mechanism prevents updates while other
file contents are being modified), these conflicts do not present
a problem for mobility.  In fact, the retry-on-failure queuing
algorithm of mailers would handle mailbox conflicts with no
software changes.  In view of these observations, we generated
the ``reduced productivity'' trace, which omitted these files
from the statistics.  With this change, the conflict rate dropped
to only 0.04 per user per week, a number so small that it could
conceivably be handled even without the help of automatic
resolvers.
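
   The kind of measurement described here can be sketched as
follows.  The sketch assumes that a trace has been reduced to
(day, user, path) write records and that the quantity of interest
is the number of distinct users writing one file in one day.  The
record layout, the function names, and the mailbox and spool path
prefixes are hypothetical and are not taken from the traces.

```python
from collections import defaultdict

# Hypothetical path prefixes for mailbox and fax-spooling files.
APPEND_ONLY_PREFIXES = ("/usr/mail/", "/usr/spool/fax/")


def is_append_only(path):
    """Return True for files assumed to be mailboxes or spool files."""
    return path.startswith(APPEND_ONLY_PREFIXES)


def writers_by_file_day(writes, exclude=None):
    """Map (path, day) to the set of users who wrote that file that day.

    writes is an iterable of (day, user, path) records.  If exclude is
    given, paths for which it returns True are left out, which mimics
    building a "reduced" trace.
    """
    writers = defaultdict(set)
    for day, user, path in writes:
        if exclude is not None and exclude(path):
            continue
        writers[(path, day)].add(user)
    return writers


def max_concurrent_writers(writers):
    """Largest number of distinct users writing one file in one day."""
    return max((len(users) for users in writers.values()), default=0)
```

   Passing exclude=is_append_only corresponds to omitting the
mailbox and spooling files before the statistics are computed.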


5 Future Work

Based  on the  above  analysis, we  expect  to build  a  prototype
caching  system incorporating  a  prediction mechanism  which,  by
observing user behavior,  will calculate the current  working set,
detect  attention  shifts,  and predict  possible  future  working
sets.   A cache manager will  then ensure that these working  sets
are available  on the  portable computer when  it is  disconnected
from the network.
   A  cache  miss  during   disconnection  is  a  serious,   often
catastrophic event for a  user who cannot continue to work  in the
absence of a critical  file.  There are only two real  options for
dealing with this case:

  1. Provide  enough alternate  working  sets that  the  user  can
     shift to a secondary or tertiary task [6, 14].
  2. Provide a foreground or background method that initiates
     communication (most likely expensive and slow) to retrieve
     the missing file [5].

   We plan  to provide  both of  these options  in our  prototype,
though we hope to rely primarily on the first.
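
   One way the two options might fit together is sketched below
in Python.  This is only an illustration of the policy, not a
description of the prototype, which does not yet exist; the
function handle_miss, the representation of working sets as a
mapping from task names to sets of files, and the fetch queue are
all assumptions made for the example.

```python
def handle_miss(missing_file, cached_files, working_sets, fetch_queue):
    """React to a cache miss during disconnection (hypothetical sketch).

    cached_files is the set of files held on the portable computer.
    working_sets maps a task name to the set of files that task needs.
    Returns the name of a task the user could shift to, or None.
    """
    # Option 1: offer another task whose whole working set is cached.
    # The missing file is not cached, so such a task cannot need it.
    for task, files in working_sets.items():
        if files <= cached_files:
            return task
    # Option 2: no alternate task is available, so request retrieval
    # of the missing file over the (likely expensive and slow) link.
    fetch_queue.append(missing_file)
    return None
```

   Trying the alternate working sets first reflects the stated
preference for the first option over initiating communication.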


6 Conclusions

The data gathered and analysis performed in this study strongly
indicate that predictive file caching for mobile computing is a
feasible approach.  However, the data also indicate that simple
LRU caching is insufficient.  Therefore, we conclude that more
sophisticated automatic  predictive file  caching mechanisms  will
be required to  make the file system  of a mobile computer  appear
transparently the  same as the file  system of a desktop  machine.
We intend  to investigate  suitable algorithms  for this  purpose,
guided by these results and by further analysis of our data.

Trademarks

PC/Interface is a trademark of Locus Computing  Corporation.  Unix
is a trademark of X/Open Company, Ltd.


References

 [1] Mary  G.  Baker,   John  H.   Hartman,  Michael  D.   Kupfer,
     Ken W.  Sherriff, and John K.  Ousterhout. Measurements of  a
     distributed file  system.  In Proceedings  of the  Thirteenth
     Symposium on  Operating Systems  Principles, pages  198--211.
     ACM, October 1991.
 [2] Matthew  Blaze  and   Rafael  Alonso.  Dynamic   hierarchical
     caching  for   large-scale  distributed   file  systems.   In
     Proceedings  of  the  Twelfth  International   Conference  on
     Distributed Computing Systems, pages 521--528, June 1992.

 [3] Richard  G.  Guy.  Ficus:     A  Very  Large  Scale  Reliable
     Distributed File  System. Ph.D.  dissertation, University  of
     California, Los  Angeles, June 1991.  Also available as  UCLA
     technical report CSD-910018.

 [4] Richard G. Guy, John S. Heidemann, Wai Mak, Thomas W. Page,
     Jr., Gerald J. Popek, and Dieter Rothmeier. Implementation
     of the Ficus replicated file system. In USENIX Conference
     Proceedings, pages 63--71. USENIX, June 1990.

 [5] L. B. Huston  and Peter Honeyman. Disconnected  operation for
     AFS.  In Proceedings of  the USENIX  Symposium on Mobile  and
     Location-Independent Computing, pages 1--10. USENIX, 1993.

 [6] James Jay  Kistler. Disconnected  Operation in a  Distributed
     File System. Ph.D. dissertation,  Carnegie-Mellon University,
     May 1993.
 [7] Puneet Kumar and Mahadev Satyanarayanan. Supporting
     application-specific resolution in an optimistically
     replicated file system. In Proceedings of the Fourth Workshop
     on Workstation Operating Systems, pages 66--70, Napa,
     California, October 1993. IEEE.

 [8] Locus   Computing   Corporation,    Inglewood,    California.
     PC/Interface Reference Manual, February 1993.

 [9] Shikharesh  Majumdar and  Richard  B. Bunt.  Measurement  and
     analysis of locality phases in file referencing  behavior. In
     Proceedings of  Performance 86 and  ACM Sigmetrics 86,  Joint
     Conference  on Computer  Performance  Modelling,  Measurement
     and Evaluation, pages 180--192, Raleigh, NC, May 1986. ACM.
[10] Udi Manber  and Sun  Wu. GLIMPSE:  A tool  to search  through
     entire file systems. In USENIX  Conference Proceedings, pages
     23--32, San Francisco, CA, January 1994. USENIX.

[11] John K. Ousterhout,  Herve Da Costa, David Harrison,  John A.
     Kunze, Mike  Kupfer, and  James G.  Thompson. A  trace-driven
     analysis of  the Unix 4.2 BSD  file system. Technical  Report
     UCB/CSD 85/230, UCB, 1985.

[12] Gerald J.  Popek and Bruce J.  Walker. The Locus  Distributed
     System Architecture. The MIT Press, 1985.
[13] Peter  Reiher,  John  S.  Heidemann,  David  Ratner,  Gregory
     Skinner, and  Gerald J.  Popek. Resolving  file conflicts  in
     the  Ficus file  system.  In USENIX  Conference  Proceedings.
     USENIX, June 1994. To be published.

[14] Mahadev Satyanarayanan, James J. Kistler, Lily B. Mummert,
     Maria R. Ebling, Puneet Kumar, and Qi Lu. Experience with
     disconnected operation in a mobile computing environment.
     In Proceedings of the USENIX Symposium on Mobile and
     Location-Independent Computing, pages 11--28, Cambridge, MA,
     August 1993. USENIX.

[15] Carl  D. Tait  and Dan  Duchamp.  Detection and  exploitation
     of  file  working  sets.  In   Proceedings  of  the  Eleventh
     International  Conference on  Distributed Computing  Systems,
     pages 2--9, 1991.


Authors
Geoffrey H. Kuenning  is a Ph.D. candidate in computer  science at
UCLA. He  received the B.S. and  M.S. degrees in computer  science
from Michigan  State University  in 1973  and 1974,  respectively.
His  research  interests include  operating  systems,  distributed
environments,  and mobile  computing.    He is  a  member of  ACM,
the  IEEE Computer  Society,  and CPSR.  His Internet  address  is
geoff@ficus.cs.ucla.edu.
   Peter  Reiher  received  his  B.S.  in  electrical  engineering
from  the  University  of  Notre  Dame  in  1979.     He  received
his  M.S.  in  computer  science  from  UCLA  in  1984,   and  his
Ph.D. in  computer science  in 1987.    He has  worked on  several
distributed operating  systems projects.   His research  interests
include  distributed operating  systems,  optimistic  computation,
and security  for distributed systems.    His Internet address  is
reiher@ficus.cs.ucla.edu.
   Gerald J.  Popek has been  a Professor  of Computer Science  at
UCLA since  1973.   His academic  background includes a  doctorate
in  computer science  from  Harvard University.    He  co-authored
``The LOCUS  Distributed System Architecture,''  MIT Press,  1985,
and has written more than 70 professional  articles concerned with
computer security,  system software,  and computer  architectures.
Dr. Popek is  a principal founder of Locus  Computing Corporation,
the largest independent  developer of UNIX-based connectivity  and
distributed processing software technology.

