The following paper was originally published in the
Proceedings of the USENIX Summer 1993 Technical Conference
Cincinnati, Ohio, June 21-25, 1993

For more information about the USENIX Association contact:
    1. Phone:   510 528-8649
    2. FAX:     510 548-5738
    3. Email:   office@usenix.org
    4. WWW URL: https://www.usenix.org


              ``Stacking'' Vnodes: A Progress Report

                Glenn C. Skinner and Thomas K. Wong

                           SunSoft Inc.
                        2550 Garcia Avenue
                     Mountain View, CA 94043

ABSTRACT

People are dissatisfied with the file system services that come with
their UNIX systems.  They want to add new and better features.  At
present they have two choices: express their service as a user-level
NFS server, or use the vnode/VFS interface to build at least part of
it into the kernel.

Although the vnode/VFS interface has been remarkably successful as a
kernel structuring concept, it has failed to provide source
portability between UNIX versions or even binary compatibility between
releases of the same UNIX version.  It has been obvious for some time
that the vnode/VFS interface needs a redesign that allows file systems
to be shipped as binary kernel modules that survive from release to
release.  We describe a prototype kernel with a vnode/VFS interface
that would allow this.  It is based on earlier work on ``stacking''
vnodes at Sun and at UCLA, but it replaces the stacking concept with a
more strictly object-oriented concept of interposition.

1. Introduction

People are dissatisfied with the file system services that come with
their UNIX systems.  Existing file systems are looking increasingly
old and shopworn, and developers want to add new and better features.
At present they have two choices: express their service as a
user-level NFS server, or use the vnode and Virtual File System (VFS)
interfaces [Kleiman 86] to build at least part of it into the kernel.
Neither of these approaches is satisfactory.

User-level NFS servers are a viable approach for some purposes, and
people have used them to implement services such as automounters
[Callaghan & Lyon 89, Pendry 89], autocachers [Minnich 93], and the
Restore-o-Mounter [Moran & Lyon 93].  These services all primarily
concern themselves with name space manipulation and stay out of the
way of data movement.  Indeed, the big drawback of user-level NFS
servers is that it is nearly impossible to provide adequate
performance and semantics for data movement operations.  This
limitation severely restricts the approach's applicability.

Kernel-resident file system modules avoid the limitations of
user-level NFS servers, but suffer from problems of their own.  They
rely critically on the vnode/VFS interface as their means of
connection into the kernel environment.  But the interface suffers
from problems of compatibility and stability.  [Webber 93] states:

    UNIX kernels with a VFS architecture have been commercially
    available for many years.  Sun Microsystems, for example,
    described their VFS architecture in the 1986 Summer USENIX
    proceedings.
    By many measures the VFS concept has been quite successful, but
    from a third party point of view there are two major problems: few
    vendors have the same VFS interface [and] few vendors provide
    release-to-release source or binary compatibility for VFS modules.

We call these two problems the VFS portability problem and the
lock-step release problem, respectively.  Together, they make VFS
modules expensive to produce, expensive to port, and expensive to
maintain.  Compounding these problems is the interface's lack of
modularity: complete file system implementations are large and
complex, and the vnode/VFS interface provides no support for
decomposing them into smaller components.

It has been obvious for some time that a redesign of the vnode/VFS
interface that addressed these problems is needed.  We describe a
prototype kernel with a vnode/VFS interface that allows for binary
compatible, modular file system implementations.  It is based on
earlier work on ``stacking'' vnodes at Sun and at UCLA, but it
replaces the stacking concept with a more strictly object-oriented
concept of interposition.

In the rest of this paper, we start by analyzing the deficiencies of
the existing interface.  We then introduce the interposition model and
explain how it addresses these deficiencies.  In subsequent sections
we describe the design of a prototype system that realizes this model
and our experiences implementing the prototype.  Next we go over a
laundry list of problem areas we have not yet addressed.  Finally, we
examine related work and present concluding remarks.

2. Problems with the Vnode Interface

The Virtual File System (VFS) and vnode interfaces form the boundary
between file system implementations and the parts of the kernel that
access file system services.  The VFS interface is concerned with
operations pertaining to file systems as a whole, such as mount and
unmount, whereas the vnode interface governs access to individual
files.  The two interfaces have a similar structure, so the discussion
below concentrates on the vnode interface as illustrative of both.
Figure 1 shows an idealized sketch of the vnode interface.

    struct vnode {
            struct vfs      *v_vfsp;        /* owning VFS */
            enum vtype       v_type;        /* file, dir, etc. */
            struct vnodeops *v_op;          /* the ops vector */
            void            *v_private;     /* internal state */
            void            *v_public;      /* visible state */
    };

    struct vnodeops {
            int (*vop_open)();
            int (*vop_read)();
            int (*vop_write)();
            /* and so on */
    };

               Figure 1: Standard Vnode Interface

The interface is object-oriented in the sense that it represents each
file as a data structure called a vnode along with a set of operations
(a so-called ops vector) for manipulating it.  Every vnode belongs to
some VFS, which is responsible for defining its operations and
maintaining its internal state.  Clients of the interface invoke vnode
operations through macros like:

    #define VOP_X(vp, args) (*(vp)->v_op->vop_x)(vp, args)

In object-oriented terminology, calling this macro requests the vnode
object *vp to invoke its vop_x method.

In the introduction, we indicted the vnode interface on two basic
counts: that it lacks stability and that it lacks modularity.  Both of
these failings are aggravated by non-object-oriented aspects of the
interface.

o   Vnodes are not opaque.  Some of the vnode's internal structure is
    visible and available for direct manipulation by code outside of
    its owning VFS.
    This exposure makes it very difficult to change the vnode
    structure without breaking binary compatibility and is a major
    contributor to the interface's instability from release to
    release.  It also complicates resolving the next weakness:

o   The vnode interface makes no provision for inheritance.  From the
    interface inheritance viewpoint, the interface is not extensible,
    making it difficult to accommodate file systems with novel
    semantics.  From the implementation inheritance viewpoint, the
    interface is not modular, making it impossible to develop file
    systems as incremental units of behavior added to previously
    available components.

3. Interposition and Composition

The deficiencies listed in the last section lead directly to a set of
requirements for a revised vnode interface.  The interface should be
object-oriented.  Specifically, it must:

o   accommodate multiple file systems implementing the interface,

o   be opaque, and

o   support some form of implementation inheritance.

Another desirable requirement, that our work does not address, is that
the interface should:

o   support some form of interface inheritance.

The existing vnode interface already satisfies the first requirement.
The second is straightforward: in many cases existing references to
exposed state turn out to be unnecessary and can be eliminated.  We
handled the remaining references by converting them to calls to access
functions, which we added to the set of vnode operations.

The implementation inheritance requirement is trickier.  We require
that the interface allow for binary compatibility of independently
developed modules, which rules out language-based subclassing schemes.
One way to satisfy this requirement, and the way we chose, is to use
delegation rather than strict inheritance.  This technique allows one
file system module to forward an operation request (possibly modified)
to another module as part of that operation's implementation.

Experience with a previous attempt to add delegation to the vnode
interface [Rosenthal 90] and contemplation of its aftermath
[Rosenthal 92] had alerted us to a pitfall in adding delegation to the
interface.  In the existing vnode interface, a given vnode serves two
purposes: it is the in-kernel representative of some file system
object, and it records the state and operations that the file system
implementation applies to that object.  That is, vnodes capture both
identity and behavior.  We wanted to use delegation to add behavior,
but not at the price of adding unwanted identities.  This observation
led us to use a specialized form of delegation that we call
interposition, where a delegating (or interposing) file system module
in effect takes control of the target of the delegation, by capturing
all accesses to it.

Putting these requirements together led us to the revised vnode
interface outlined in Figure 2.

    struct vnode {
            struct pvnode *v_chain;     /* head of interposition chain */
    };

    struct pvnode {
            struct vfs      *p_vfsp;    /* owning VFS */
            struct vnodeops *p_op;      /* the ops vector */
            struct pvnode   *p_link;    /* next older interposer */
            void            *p_private; /* this interposer's state */
    };

    struct vnodeops {
            /* as before */
    };

             Figure 2: Interposition Vnode Interface

We retain the term vnode for the part that captures the identity of
the corresponding file system object.
A pvnode contains the information that describes the behavior that a
given file system imposes on an object; each pvnode belongs to a
particular file system implementation, as recorded in the p_vfsp
field, and is attached to a single vnode.  Pvnodes are organized into
interposition chains; the vnode's v_chain field names the first, most
recently interposed, pvnode in its chain, and the p_link pvnode field
links to the next most recent interposing pvnode.

We illustrate how interposition works through an example.

             .------.      +--------+      +-------+
    vp ---> (        ) --> |  dnlc  | ---> |  ufs  |
             `------'      +--------+      +-------+

               Figure 3: An Interposition Chain

Figure 3 depicts a vnode named by vp and its associated interposition
chain; the circle designates the vnode and the boxes its interposing
pvnodes.  Suppose the file associated with vp is a directory and some
process tries to look up a file in it.  The kernel's vnode layer will
issue the call VOP_LOOKUP(vp, NULL, name, &resultvp) to initiate the
lookup.  VOP_LOOKUP is defined as:[1]

    #define VOP_LOOKUP(vp, pvp, name, result) \
            vn_lookup(vp, pvp, name, result)

    int
    vn_lookup(struct vnode *vp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct pvnode *npvp = pvp ? pvp->p_link : vp->v_chain;

            return ((*(npvp->p_op->vop_lookup))(vp, npvp, name, result));
    }

    _________________________
    [1] This definition is simplified for the sake of illustration;
        locking considerations and the need to determine whether a new
        vnode has been created make the actual definition more
        complicated.

Since the initial call has a NULL pvnode argument, control passes to
the lookup method defined as part of dnlc's ops vector.  Dnlc is an
interposer that the directory name lookup cache VFS module defines.
Its lookup method consults the module's cache to see whether the
target file already has a vnode.  If so, the method sets the result
argument to that vnode and returns.  Otherwise, it forwards the
operation to the base ufs file system module with
VOP_LOOKUP(vp, pvp, name, result), where pvp is the pvnode argument it
was called with.  When the forwarded call returns, dnlc's lookup
method enters the resulting vnode into the cache and returns.
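
To make the forwarding pattern concrete, here is a simplified sketch
of such an interposer's lookup method.  The cache helpers and the
shape of the private state are illustrative only; they are not the
prototype's actual dnlc code.

    /*
     * Sketch of an interposer's lookup method in the style of the
     * dnlc interposer described above.  dnlc_cache_lookup() and
     * dnlc_cache_enter() stand in for the directory name lookup cache
     * and are hypothetical helpers, as is the layout of the private
     * state they operate on.
     */
    extern struct vnode *dnlc_cache_lookup(void *cache, char *name);
    extern void dnlc_cache_enter(void *cache, char *name,
        struct vnode *vp);

    static int
    dnlc_lookup(struct vnode *vp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct vnode *cached;
            int error;

            /* First consult the cache attached to this pvnode. */
            cached = dnlc_cache_lookup(pvp->p_private, name);
            if (cached != NULL) {
                    *result = cached;
                    return (0);
            }

            /*
             * Cache miss: forward the operation down the interposition
             * chain.  Passing our own pvnode makes vn_lookup() dispatch
             * to the next older interposer (here, the base ufs module).
             */
            error = VOP_LOOKUP(vp, pvp, name, result);
            if (error == 0)
                    dnlc_cache_enter(pvp->p_private, name, *result);
            return (error);
    }
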
The preceding example showed how one can build up behavior applicable
to a single object.  However, there is often need to combine multiple
objects together into a composite whole.  This process is called
composition; the interposition framework accommodates it by allowing
pvnodes to maintain references to other vnodes as part of their
private state.

A reference to the composing vnode acts as an entry point into the
overall composite object.  The components of the composition may be
named by referring to their vnodes; it is up to the composer to ensure
that such accesses are semantically sensible, perhaps by interposing
on each of the composees to exert control over their behavior.

Here is another example, showing how to combine composition with
interposition to handle file system mounts.  We use the technique
illustrated in this example to factor knowledge of mount point
semantics out of all our file system implementations.

    mnt_pt  ---> [ cross_mntpt ] ---> [ mnt_pt_dir ] ---> ...

    fs_root ---> [ do_dotdot ]   ---> [ root_dir ]   ---> ...

                  Figure 4: Mounted File Systems

In Figure 4, mnt_pt is a vnode naming a directory acting as a mount
point and fs_root names the root of the file system mounted there.
Cross_mntpt and do_dotdot are interposers that supply modified
versions of VOP_LOOKUP and pass all other operations through,
respectively, to mnt_pt_dir and root_dir; these in turn head the
remainder of mnt_pt's and fs_root's interposition chains.
Cross_mntpt and do_dotdot each compose on the other vnode
participating in the mount.

When a lookup request reaches mnt_pt, control first reaches
cross_mntpt.  Its lookup method compares the name argument with "..".
If they are equal, this component of the overall path name being
looked up passes out from underneath the part of the file system name
space the mount controls; otherwise, the component lands inside the
mount.  Cross_mntpt's lookup method proceeds accordingly, by
forwarding the request to the next interposer (mnt_pt_dir) if the name
is "..", and by passing the request on to its composee (fs_root)
otherwise.  Do_dotdot's lookup method is the dual of cross_mntpt's; if
the name is not "..", it forwards the request to the next interposer,
and otherwise it passes the request to its composee.

4. Design

We wanted to limit the scope of our prototyping effort so that the
project remained tractable.  Thus, we decided not to tackle adding
support for transactions or page cache consistency to the vnode
interface, since we knew from previous work that they pose tough
problems.  Instead, we decided to see how much mileage we could get
from developing our ideas about interposition.  We knew at the outset
that we would have to solve problems relating to vnode configuration
and locking.  Not surprisingly, we also encountered problems we had
not anticipated.  We discuss some of them below.

4.1. Vnode Configuration

When the system creates a new vnode it must be configured with
interposers that properly capture the semantics desired of the
corresponding file system object.  Thus, operations such as VOP_LOOKUP
that potentially create new vnodes must have hooks for distinguishing
newly created vnodes from previously existent ones and for adding
behavior (new interposers) to new vnodes.

We satisfy this requirement by having the vnode layer supply a virgin
vnode as a tentative candidate to be returned as the result of the
operation.  If an interposer decides that an existent vnode is
appropriate as the return value, it is free to substitute that vnode
for the tentatively supplied one.  If an interposer needs to tell
whether a forwarded operation has returned a new vnode, it can check
to see whether the forwardee returned the same vnode it was given.

Now we have the tools we need.  To ensure that the vnode resulting
from a lookup call behaves properly, we arrange for the directory
vnode handling the lookup call to have an interposer that examines the
result vnode.  If the vnode is new, the interposer pushes whatever VFS
modules are needed to supply the requisite behavior.  If not, the
interposer leaves the result vnode alone, since it presumably had
instances of the proper modules pushed when it was originally created.
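
The following sketch suggests how such a configuring interposer's
lookup method could behave.  The vn_push() call, the private-state
layout, and the assumption that the result argument arrives preloaded
with the virgin candidate vnode are all illustrative simplifications,
not the prototype's actual code.

    /*
     * Sketch of a configuring interposer's lookup method.  vn_push()
     * is a hypothetical framework call that pushes a new pvnode from
     * the named VFS module onto the front of a vnode's interposition
     * chain; struct config_state is likewise hypothetical.
     */
    struct config_state {
            struct vfs *cs_push_vfs;  /* module this configuration adds */
    };

    extern int vn_push(struct vnode *vp, struct vfs *vfsp);

    static int
    config_lookup(struct vnode *dvp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct config_state *cs = pvp->p_private;
            struct vnode *tentative = *result;  /* virgin candidate */
            int error;

            /* Forward the lookup down the rest of the chain. */
            error = VOP_LOOKUP(dvp, pvp, name, result);
            if (error != 0)
                    return (error);

            /*
             * A forwardee that returns the vnode it was given has
             * created a new vnode; push the behavior this
             * configuration calls for.  An existing vnode is left
             * alone, since it was configured when first created.
             */
            if (*result == tentative)
                    error = vn_push(*result, cs->cs_push_vfs);
            return (error);
    }
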
4.2. Concurrency and Locking

We wanted our implementation to allow maximal concurrency for
operations on a given vnode and, in particular, to allow pops
(removing the most recent interposer) to proceed in parallel with
other operations.  We also needed a way to ensure atomicity of creates
and destroys, so that intermediate states occurring as behavior is
added or removed are invisible.

Vnode destruction posed particular problems.  The implementation
destroys a given vnode by popping off interposers until none are left
and then deallocating the vnode itself.  During this time a concurrent
VOP_LOOKUP could retrieve an as yet unpopped pvnode from the defunct
vnode and attempt to return that vnode.  We had to design a locking
protocol between the interposition framework and file system
implementations that prevented this race condition and was robust in
the face of deallocating the vnode while the file system was
attempting to synchronize with it.

Our design uses a single lock in each vnode to guard that vnode's
interposition chain.  Multiple threads can acquire the lock for shared
access, or a single thread can acquire it for exclusive access.  A
given thread can acquire the lock recursively in either access mode,
although an attempt to upgrade from shared to exclusive access will
block until no other threads hold the lock.  Whenever a thread
executes a vnode operation, the interposition framework acquires the
vnode's lock on its behalf.  While holding the lock, the framework
obtains a reference to the pvnode supplying the method to be executed
and updates a count of threads executing in that pvnode.

When initiating operations that can cause plumbing changes, such as
VOP_POP or VOP_LOOKUP, the framework acquires the exclusive version of
the lock, to prevent other threads from accessing the vnode while it
is in an inconsistent state.  For destruction operations, the
framework locks the vnode being dismantled; for creation operations,
it locks the newly created vnode.  In either case, the thread
executing the operation can go on to initiate other operations on the
target vnode, since it holds the exclusive version of the lock and it
is legitimate to acquire that lock recursively.

We handle destroy/lookup races between the framework and VFS modules
by breaking the operations up into phases.  When popping a pvnode, the
framework notifies the pvnode's VFS by calling VOP_POP.  This call
informs the VFS that the pvnode is no longer attached to its vnode,
although there may still be threads executing within it or in other
pvnodes further down the interposition chain.  At this point, the VFS
must update the pvnode's internal state to mark it as having been
popped and henceforth refrain from trying to discover what vnode it is
attached to.  When no more threads are executing in the pvnode (the
reason for maintaining the thread execution count), the framework
calls VOP_INACTIVE to tell the owning VFS that it can destroy the
pvnode.

On the lookup side, a VFS that finds a pvnode that would satisfy the
lookup request must first mark it as a candidate for use, to prevent
the inactive routine from destroying it.  The VFS must then check to
see whether the pvnode has been popped but still has threads executing
against its old incarnation.  If so, it must wait for the threads to
finish and then recheck its lookup criteria.  Otherwise, it asks the
interposition framework to retrieve the vnode that the pvnode is
attached to.  This request may fail because of a concurrent VOP_POP or
vn_inactive call; if so, the VFS must try again after handshaking with
the other thread.  Finally, after successfully having run this
gauntlet, the VFS's lookup method can return the vnode.
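
As a rough illustration of the dispatch path just described, the
sketch below shows the vnode layer taking the chain lock, noting a
thread executing in the chosen pvnode, and then invoking the method.
The locking primitives and the p_threadcnt field are hypothetical
stand-ins; the prototype's actual interfaces differ in detail.

    /*
     * Sketch of method dispatch under the chain lock.  vn_lock_shared(),
     * vn_unlock(), and the p_threadcnt count of executing threads are
     * hypothetical names used only for illustration.
     */
    extern void vn_lock_shared(struct vnode *vp);
    extern void vn_unlock(struct vnode *vp);

    int
    vn_read(struct vnode *vp, struct pvnode *pvp, void *buf, int len)
    {
            struct pvnode *npvp;
            int error;

            vn_lock_shared(vp);         /* guard the interposition chain */
            npvp = pvp ? pvp->p_link : vp->v_chain;
            npvp->p_threadcnt++;        /* a thread now executes here */
            vn_unlock(vp);

            error = (*(npvp->p_op->vop_read))(vp, npvp, buf, len);

            vn_lock_shared(vp);
            npvp->p_threadcnt--;        /* may allow a pending VOP_INACTIVE */
            vn_unlock(vp);
            return (error);
    }
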
4.3. Ill-Defined Vnode Operations

Before starting to work on the design, we had realized that the
interposition model's strong object-orientation implied that it must
be possible to designate a single vnode as the recipient of each
operation, but we failed at first to appreciate all of the
ramifications.  We soon ran into trouble with operations, such as
VOP_CMP and the directory operations (VOP_LINK, VOP_RENAME, et al.),
that act on multiple vnodes.

In the old system, two vnodes from the same VFS exhibit the same
behavior because they inherit the same vnode ops vector.  However,
this may not be true under the interposition model, since different
behaviors may have been pushed onto either one or both of the vnodes.
An immediate consequence is that we don't know what to do with these
vnode operations under the interposition model, since the choice of
vnode to receive an operation may dictate its results.  (E.g., the
result of VOP_CMP(v1, v2) may not be the same as VOP_CMP(v2, v1).)
More fundamentally, it is only possible to interpose on one of the
vnodes involved in the operation (the one to which the operation is
directed), but a basic tenet of the model is that interposers have an
opportunity to influence all aspects of the behavior that the
corresponding vnode exhibits.  Hence, operations that use secondary
vnode arguments for anything other than their identity are
incompatible with the interposition model.

We addressed this problem by decomposing each vnode operation that has
multiple vnodes as input arguments into primitive vnode operations
that each require only one vnode as an input argument.  For example,
VOP_LINK turned into calls to the file vnode to fetch its file
identifier (fid) and to increment its link count, and a call to the
directory vnode to add an entry associating the fid and leaf name.

These decomposed operation sequences have to be atomic; it would be
incorrect to allow other threads to observe intermediate states
occurring part way through a sequence.  One obvious response to this
requirement would be to introduce transactions.  We were reluctant to
take that step and were able to avoid it by exploiting properties of
the vnode lock we had already introduced to maintain interposition
chain integrity.

When the vnode layer gets a multiple-vnode operation it starts by
determining which vnodes will participate in the operation.  It sorts
them into a canonical order (by address) to avoid deadlock, and
acquires each vnode's exclusive lock.  At this point, the thread
handling the operation can do whatever is required to the affected
vnodes without concern that intermediate states will be visible to
other threads.  It proceeds by executing the same algorithm that the
underlying file system implementation would execute in the old system,
calling the decomposed single-vnode methods to execute each step.
After completion, the vnode layer unlocks the vnodes, and the modified
state resulting from the overall operation becomes visible.

The vnode layer has to take care to handle errors from intermediate
operations by issuing operations to back out of the changes made so
far; it maintains enough state to allow it to do so.  There is no
guarantee that these back-out operations will succeed; this situation
is one where our pseudo-transaction technique has failure modes that
the old system does not exhibit.
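
The sketch below illustrates how the vnode layer might carry out a
link request using the decomposed operations listed in Table 1 below.
The locking helpers, the operation signatures, and the placeholder fid
layout are illustrative simplifications; back-out is shown only for
the link count.

    /*
     * Sketch of the vnode layer implementing link(file, dir, name)
     * with decomposed single-vnode operations (see Table 1).  The
     * signatures and helpers are simplified for illustration.
     */
    struct fid { char fd_data[16]; };   /* placeholder fid layout */

    extern void vn_lock_excl(struct vnode *vp);
    extern void vn_unlock(struct vnode *vp);

    int
    vn_link(struct vnode *filevp, struct vnode *dirvp, char *name)
    {
            struct fid fid;
            int error;

            /* Lock participants in canonical (address) order. */
            if (filevp < dirvp) {
                    vn_lock_excl(filevp);
                    vn_lock_excl(dirvp);
            } else {
                    vn_lock_excl(dirvp);
                    vn_lock_excl(filevp);
            }

            /* The old VOP_LINK, as primitive single-vnode steps. */
            error = VOP_FID(filevp, NULL, &fid);
            if (error == 0)
                    error = VOP_INCLINK(filevp, NULL);
            if (error == 0) {
                    error = VOP_DIRADDENTRY(dirvp, NULL, name, &fid);
                    if (error != 0)     /* back out; may itself fail */
                            (void) VOP_DECLINK(filevp, NULL);
            }

            vn_unlock(dirvp);
            vn_unlock(filevp);
            return (error);
    }
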
The current set of decomposed vnode operations is listed in Table 1
below.  We do not regard it as final and may change it as our design
evolves.

                Table 1: Decomposed Vnode Operations

    |_________________|___________________________________________|
    | Name            | Purpose                                   |
    |_________________|___________________________________________|
    | VOP_FID         | get a file's fid (file identifier)        |
    | VOP_MKOBJ       | create an object (e.g., regular file,     |
    |                 | directory, special file, symbolic link)   |
    |                 | and return its vnode and fid              |
    | VOP_DIRADDENTRY | add a directory entry for the file        |
    |                 | identified by the fid                     |
    | VOP_DIRRMENTRY  | remove a directory entry                  |
    | VOP_INCLINK     | increment a file's link count             |
    | VOP_DECLINK     | decrement a file's link count             |
    |_________________|___________________________________________|

4.4. Transforming VFS Operations into Vnode Operations

To operate correctly under the interposition model, many VFS
operations also need the ability to interpose on other VFSs.  For
example, the (old) VFS_SYNC call must be forwarded to every VFS that
is part of a given mounted file system to ensure proper ordering when
flushing data to the file system media.  To reduce complexity, we
would like to avoid interposing on VFS operations.  To achieve this
end, we have converted many VFS operations into vnode operations.  The
underlying idea is to treat a VFS operation as a vnode operation for
which the file system(s) interpreting the operation ignore the
identity of the vnode to which the operation was issued.  In other
words, a VFS operation can go to any vnode associated with the target
VFS.  Table 2 lists the set of VFS operations we have transformed into
vnode operations.

        Table 2: Vnode Operations Transformed from VFS Operations

    |____________________|____________________________________________|
    | Name               | Purpose                                    |
    |____________________|____________________________________________|
    | VOP_PREPARE_UMOUNT | check if the VFS identified by the vnode   |
    |                    | is ready to unmount (e.g., not busy)       |
    | VOP_UMOUNT         | unmount the VFS identified by the vnode    |
    | VOP_VGET           | return the vnode identified by the fid     |
    | VOP_STATVFS        | return statvfs information for the VFS     |
    |                    | identified by the vnode                    |
    | VOP_SYNC           | flush data belonging to the VFS identified |
    |                    | by the vnode to the underlying media       |
    |____________________|____________________________________________|

5. Implementation

Having resolved the design issues described above, it was time to
start implementing our prototype.  We soon had a key insight: that it
was possible to have the interposition framework coexist with the old
vnode layer, with all the vnodes associated with a given mount point
being uniformly either old style or new.  This idea allowed us to
develop our prototype incrementally, first converting simple file
systems such as fdfs and tmpfs to the interposition framework, and
later more complex file systems such as ufs and specfs.

For our prototype to be a successful proof of concept, we thought it
important for it to boot using new style mounts as the default.  We
have converted it to do so, although some file systems, such as nfs,
fifofs, and procfs, remain available only in their old form.

Our strategy of allowing old and new style file systems to coexist
worked according to plan.  This technique has allowed us to obtain
valuable performance information without having to convert all file
systems to be interposition-aware.  We also view it as the basis of a
migration strategy for converting to the interposition framework; file
system developers can choose to take the plunge at their own pace
without facing a system-wide flag day.
5.1. VFS Conversion Experience

The first step of our implementation was to change the kernel to allow
old and new style file systems to coexist.  We did this by redefining
the vnode structure to be the union of the old version and the new.
We added a new VOP_GETVNVER operation to return the vnode structure's
effective version number.  As mentioned above, we also added a set of
new vnode operations (e.g., VOP_GETVRDEV) to act as access functions
for previously accessible fields in the old vnode structure that are
no longer visible (e.g., v_rdev).  Since the ops vector field is
unused for new style vnodes, we assigned a set of ``just say no''
operations to that field in all such vnodes, to return ENOSYS if
invoked inadvertently.

We implemented parallel sets of routines in the vnode layer to handle
old and new style vnodes.  The implementation switches to the
appropriate routine based on the target vnode's VOP_GETVNVER value.
Some routines, such as lookuppn, required more work, as they have to
cope with both old and new vnodes, which can occur when crossing mount
points.

The first file system we converted was tmpfs.  The main objective was
to test file object creation and directory services.  We chose it
first because it has no persistent storage.  Therefore, it is simpler
to debug and has a smaller probability of triggering race conditions
that would expose bugs in the prototype.  The conversion was smooth
and produced no surprises.

Our next file system was ufs.  Here, we looked for components that
could be reimplemented as interposers and could therefore be reused by
other file system implementations.  This was the most difficult file
system to convert, because it is more likely to encounter race
conditions while waiting for disk i/o completions.  In addition, the
converted version had to be efficient, or else the performance goals
of the prototype would not be met.

Specfs presented a new challenge because it uses composition to
support devices that have multiple device names with the same major
and minor device number.  Since we retained the same basic design, we
thought the conversion would be straightforward.  This was true for
block devices, but we encountered difficulties when dealing with
character devices, especially with clone devices and the console
device.  The main problem was that the kernel directly accessed the
private data of specfs vnodes, violating vnode opaqueness.  There are
two reasons why this problem occurs:

o   The VOP_GETATTR operation is not extensible and cannot return
    attributes (e.g., dev_info and common_specvp) that are unique to
    specfs, but that the kernel needs.  This is a deficiency of the
    old vnode interface.

o   It is more efficient to access these attributes directly.

The ideal solution to this problem would be to provide an extensible
VOP_GETATTR operation that returns both the file system type
independent and the file system type specific attributes, thus
allowing the kernel to access specfs-specific data while still
respecting vnode opaqueness.  The short term solution we devised was
to add a new VOP_GETKEYEDDATA operation, which allows interposers that
recognize the input ``key'' argument to return interposer-specific
data.

Overall, our conversion effort was not too difficult and, in fact, it
became simpler as we identified more interposers and made them
available.  With the exception of debugging specfs, the toughest part
turned out to be breaking down existing directory-related vnode
operations into more primitive ones.
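
The following sketch shows one way the old/new coexistence described
above could be expressed; the field, type, and routine names
(v_version, VN_VERSION_NEW, vn_old_read, vn_new_read) are illustrative
stand-ins rather than the prototype's actual definitions.

    /*
     * Sketch of old/new vnode coexistence: the vnode is a union of
     * the old and new layouts, and the vnode layer dispatches on an
     * effective version number (cf. VOP_GETVNVER).  All names here
     * are hypothetical illustrations.
     */
    enum vn_version { VN_VERSION_OLD, VN_VERSION_NEW };

    struct old_vnode { struct vnodeops *v_op;    /* old fields ... */ };
    struct new_vnode { struct pvnode   *v_chain; /* new fields ... */ };

    struct vnode {
            enum vn_version v_version;      /* effective version */
            union {
                    struct old_vnode v_old; /* pre-interposition layout */
                    struct new_vnode v_new; /* interposition layout */
            } v_un;
    };

    extern int vn_old_read(struct vnode *vp, void *buf, int len);
    extern int vn_new_read(struct vnode *vp, void *buf, int len);

    int
    vn_read_dispatch(struct vnode *vp, void *buf, int len)
    {
            /* Parallel old and new style routines; pick one per vnode. */
            if (vp->v_version == VN_VERSION_NEW)
                    return (vn_new_read(vp, buf, len));
            else
                    return (vn_old_read(vp, buf, len));
    }
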
5.2. The Interposer Toolkit

One of our goals in this project is to make file system implementation
easier.  We plan to meet this goal by providing a toolkit of
interposers that file system implementations can depend on to meet
standard requirements (e.g., POSIX), to satisfy the UNIX file system
semantics, and to acquire new and enhanced file system features.  As
we convert existing file systems to operate under the interposition
model, we try to identify file system components that can be removed
and reimplemented as interposers.  The set of interposers we have
identified so far is listed below.  We expect to extend the toolkit
with new interposers as we continue our effort to factor existing file
systems into smaller components.

                    Table 3: The Interposer Toolkit

    |_________|____________________________________________________________|
    | Name    | Purpose                                                    |
    |_________|____________________________________________________________|
    | config  | push the proper behavior onto a new vnode                  |
    | dnlc    | add/remove a name to/from the directory name lookup cache  |
    | mfrl    | handle mandatory file and record locking                   |
    | mount   | splice a VFS into the file system name space               |
    | specref | provide access to special files                            |
    | ro      | allow only read-only vnode operations to proceed           |
    |_________|____________________________________________________________|

We were pleasantly surprised by how easy the interposer toolkit is to
use.  If certain standard file system features (e.g., dnlc) are
desired, all it takes is to include them as part of the file system
configuration at mount time.  There is no need to do any programming
in the file system implementation.  This is a big win over the
existing approach of providing common file system services through
library routines, because with libraries the file system implementors
must write code in advance in order to take advantage of these
services.

5.3. Performance

We used two sets of benchmarks to assess our prototype.  The first
benchmark consisted of a perl script used to time creation and
deletion of 1000 files of different sizes (e.g., 0K, 1K, 4K, 10K) on a
new style ufs file system that used all of the toolkit interposers
listed above except ro.  We used this test mainly during the early
stage of the prototype development.  The data obtained from this
benchmark showed no significant performance degradation.  We found
these results highly encouraging, since our prototype properly handled
multi-threaded locking and used interposition in a nontrivial way.
After we had converted specfs and switched the prototype to use new
style vnodes as the default, we re-ran the script, and again found no
significant performance degradation.

We then ran one of our standard benchmarks to measure the prototype's
throughput under a workload simulating a C program development
environment.  The system throughput from the benchmark is reported as
the number of work scripts processed per hour, for various values of a
workload parameter representing the number of simulated users.  We
expected this benchmark to reveal some amount of performance
degradation: because of the factoring that we have done, both of vnode
operations and of file system implementations, optimizations possible
under the old monolithic implementations are no longer available.

To run this throughput benchmark successfully, a system must be very
robust, and our prototype has only recently reached this level of
maturity.  We ran the benchmark on a Sun SPARCstation 1 with 16M of
memory, installed with SunOS 5.1.
Table 4 compares results from runs with our prototype kernel and with
the standard 5.1 kernel.

                     Table 4: Throughput Results

    |__________|______________|______________|____________|
    |          |  SunOS 5.1   |  Prototype   |            |
    | Workload | Scripts/Hour | Scripts/Hour | Change (%) |
    |__________|______________|______________|____________|
    |     1    |     13.83    |     13.83    |      0     |
    |    16    |    185.03    |    181.82    |    -1.7    |
    |    24    |    198.58    |    191.28    |    -3.6    |
    |    28    |    193.70    |    184.85    |    -4.5    |
    |    32    |    175.42    |    179.47    |    +2.3    |
    |    36    |    159.41    |    143.35    |    -9.9    |
    |__________|______________|______________|____________|

The system throughput benchmark result indicates that the prototype
suffers less than 5% degradation in peak throughput, but degrades more
rapidly after the peak workload.  We have not yet analyzed the cause
of the degradation, but suspect that it is a symptom of additional CPU
overhead.  Note that we have not yet tuned the prototype, and have not
implemented performance saving features that are now possible under
the interposition model.  We therefore believe that we should be able
to improve the prototype further.

6. The To-Do List

In this section we discuss areas that our prototype does not yet
address.  We start with things we think we understand and proceed to
issues of increasing difficulty.

6.1. Configurator Interposers

If we succeed in converting existing file systems into a rich
interposer toolkit of file system components, we will have created
another problem: how can people specify combinations of components
that do what they want?  We expect that most users would prefer to
remain ignorant of the issue altogether and that most others would
prefer to use pre-existent file system configurations that are known
to behave properly.  Thus, the problem reduces to devising a way for
file system vendors to specify ``canned configurations''.

In the existing system, each mount can differ by file system type from
other mounts, and the mount system call takes an argument specifying
that type.  We plan to exploit this interface by using the type
argument to name a configurator interposer that is responsible for
ensuring that all files underneath the new mount point behave
appropriately.  The mount system call will push this interposer on the
root vnode of the newly mounted file system.  It is subsequently
responsible for pushing a copy of itself on directories, as lookup
calls create vnodes for them, and for pushing the sequence of
interposers that makes up the canned configuration on all files, as
lookups create new vnodes for them.  By induction, the overall effect
is to ensure the proper layering configuration for all vnodes
associated with the mount.

6.2. A Directory Interposer

We want to split ufs into a directory piece and an inode-level file
access piece, and make the former available as an interposer that
other base file systems can use.  With the old set of vnode
operations, this separation is infeasible; operations such as
VOP_CREATE mix together actions from both pieces (creating a file and
entering it in a directory), preventing their separation.  Fortunately
for this purpose, we have already had to decompose operations that mix
directory access with file access, as explained above.

When pushed on top of a regular file, this interposer will make it
behave as a directory.  For example, it will handle VOP_LOOKUP calls
by reading the underlying file, looking for a matching entry.  If it
finds one, it will extract the fid from the entry and use it as the
argument to a VOP_VGET call on the file.  The VFS defining the file's
behavior will then look the fid up and return its vnode, which the
directory interposer then returns in turn.  Other operations proceed
similarly.

Note that the directory interposer expects the VFS supplying behavior
for the file it is pushed on to supply a primitive directory service
of its own, in the form of the VOP_VGET operation.  The interposer
converts that VFS's flat, name-by-fid name space into the standard
hierarchical name space.
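
A sketch of how such a directory interposer's lookup method might work
appears below.  The on-file entry format, the entry-reading helper,
and the exact VOP_VGET signature are illustrative guesses; this
interposer does not yet exist in the prototype.

    /*
     * Sketch of a directory interposer's lookup method: scan the
     * entries stored in the underlying file for a matching name, then
     * ask the underlying VFS to turn the entry's fid into a vnode via
     * VOP_VGET.  The entry format and helpers are hypothetical.
     */
    struct dir_entry {
            struct fid de_fid;          /* file identifier of the entry */
            char       de_name[256];    /* leaf name */
    };

    extern int dir_read_entry(struct vnode *vp, int index,
        struct dir_entry *de);

    static int
    dir_lookup(struct vnode *dvp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct dir_entry de;
            int i;

            /* Read the underlying file's entries, one at a time. */
            for (i = 0; dir_read_entry(dvp, i, &de) == 0; i++) {
                    if (strcmp(de.de_name, name) == 0) {
                            /*
                             * Found a match: the base VFS supplies a
                             * flat, name-by-fid name space via VOP_VGET.
                             */
                            return (VOP_VGET(dvp, pvp, &de.de_fid,
                                result));
                    }
            }
            return (ENOENT);
    }
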
6.3. Impedance Mismatches

Decomposing multi-vnode operations into more primitive single-vnode
operations has consequences for existing file systems.  Consider, for
example, the NFS client side implementation.  It has a particular
fixed vocabulary, in the form of the NFS protocol, that it must use to
communicate with the server.  This vocabulary is well matched to the
old set of vnode operations, but is at a considerably higher level
than the new set of operations.  Thus, the client implementation now
faces an impedance matching problem: it must convert an incoming
stream of vnode operations into a corresponding stream of NFS protocol
RPCs.  Because of the different levels of the two streams, the client
implementation must accumulate incoming vnode operations until they
add up to a protocol operation.

We can handle this new implementation requirement by adding to the
implementation a state machine that tracks progress in accumulating
operations that add up to a protocol request.  However, the mechanics
of the addition are likely to be ugly.  One complication is that
operations can occur concurrently on multiple vnodes, and the
implementation must keep these logically separate input streams
distinct.

We plan to address this problem by taking a small step toward adding
transactions to the vnode interface.  Our motivation is the
observation that the vnode layer now handles many incoming requests by
decomposing them into more primitive operations.  Thus, it has
knowledge of which operations are associated with which higher level
requests.  Stated differently, the vnode layer is effectively creating
transactions, so it might as well pass some information about them to
the participants.  Thus, we intend to add an additional argument to
each vnode operation identifying the transaction to which the
operation belongs.

6.4. Transactions

Having tiptoed this far toward full transaction support, why not go
the whole way?  We have a design sketch [Rosenthal 92], but are
reluctant to use it because of concerns about its ramifications.
Adding transaction support to the kernel's vnode layer and to file
system implementations is as invasive a step as converting them to the
interposition model.  We judged that making two changes of this
magnitude would add unacceptable risk to our project.

However, the project is now at a point where it is time to reexamine
this decision.  If we decide to proceed, our transaction facility
would be confined to the kernel's vnode layer and to file system
implementations; it would not be available at user level for general
use.  We would use it to replace our current ad hoc lock-based
implementations in those parts of the prototype.

Here is a sketch of the transaction design and the effect of using the
facility on some sample code.  Each vnode operation would take an
additional transaction context argument (replacing the transaction
identifier mentioned in the previous section).  If the context is
NULL, the operation is not part of a transaction and proceeds as
before.  Otherwise, the operation must join the transaction the
context argument identifies, by adding a record of its own to the
context.  If the operation cannot join the transaction, it must fail,
causing the transaction to abort.

Each record in the transaction context contains pointers to callback
routines to be executed when the transaction aborts, prepares to
commit, or commits.  The transaction framework provides routines to
invoke the appropriate callback from each record, for use when the
transaction reaches the abort, prepare, or commit stage.
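
One possible shape for the transaction context (the vtc used in
Figure 6 below) is sketched here; the record layout and field names
are illustrative only and are not settled parts of the design.

    /*
     * Illustrative shape for the transaction context.  Each
     * participating operation adds a vtc_record carrying its abort,
     * prepare, and commit callbacks.
     */
    struct vtc_record {
            struct vtc_record *r_next;      /* next participant */
            void (*r_abort)(void *arg);     /* undo this operation */
            int  (*r_prepare)(void *arg);   /* ready to commit? */
            void (*r_commit)(void *arg);    /* make effects permanent */
            void *r_arg;                    /* participant's private state */
    };

    struct vtc {
            struct vtc_record *vtc_records; /* joined operations */
    };

    /* Framework entry points invoked by code like that of Figure 6. */
    int vtc_try_commit(struct vtc *vtc);    /* prepare all, then commit all */
    int vtc_abort(struct vtc *vtc);         /* run abort callbacks, free context */
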
After restructuring to use transactions, the code in Figure 5 becomes
that of Figure 6.

    if (VOP_FOO(vp) != 0) {
            /* About a page of code cleaning up the mess */
            goto bad1;
    }
    /* The FOO operation succeeded - do something more */
    if (VOP_BAR(vp) != 0) {
            /* About a page of code cleaning up the mess */
            goto bad2;
    }
    /* The BAR operation succeeded - do something more */
    return (0);
bad2:
bad1:
    return (ENOTTY);

               Figure 5: Code Before Transactions

    vtc = kmem_alloc(sizeof (struct vtc));
    if (VOP_FOO(vp, vtc) == 0) {
            if (VOP_BAR(vp, vtc) == 0) {
                    if (vtc_try_commit(vtc) == 0)
                            return (0);
            }
    }
    return (vtc_abort(vtc));

               Figure 6: Code After Transactions

6.5. Page Cache Consistency

Up to now, we have avoided grappling with the interaction of the VM
system with the interposition framework.  A major issue is how to
structure the page cache.  In the old system, each vnode has an
associated list of pages whose contents cache parts of the
corresponding file.  The pages are each marked with the address of the
vnode and the offset into the file.

As a matter of philosophy, we believe that each VFS module should be
able to interact with the page cache as it sees fit, to store data
consistent with the behavior it imposes on a given file.  This
consideration suggests that the page cache should identify pages by
pvnode and offset instead of by vnode and offset, as in the old
system.

The other primary issue is how interposers and composers that
cooperatively interact to manage a composite file can maintain
coherence among the pages that they have entered into the cache while
still providing for efficient i/o involving those pages.

We have preliminary ideas about how to resolve these issues, but they
are not yet well enough developed to report.  However, it is already
apparent that we will have to add a richer set of methods for file
system modules and the VM system to use to communicate with each
other; anything that the VM system does that might affect the status
of a cached page must be visible to the relevant file system modules,
and vice versa.

7. Related Work

People have taken two basic approaches to overcoming deficiencies in
the system's file system infrastructure: finding better ways to
support user level file system modules, and attacking the limitations
of the vnode interface.

Watchdogs [Bershad & Pinkerton 88] and the File Monitor Interface
[Webber 93] are examples of the former approach.  Both are founded on
the premise that the kernel is a sufficiently inhospitable environment
that it is desirable to provide facilities for user-level file system
modules.  In contrast to the user-level NFS server approach, they rely
on kernel modifications and a specialized user-kernel communication
channel to provide finer control over how the user level file system
process can handle operations, both in the choice of which operations
require intervention and in how to dispose of them.
In contrast with in-kernel techniques such as the interposition model,
approaches that attempt to move most of the work to user level offer a
congenial and portable file system development environment, at the
price of context switching overhead, additional data copying overhead,
and the possibility of deadlock.  They are also potentially compatible
with the interposition approach; the kernel hooks for communicating
with a user-level process could be packaged as an interposer that is
pushed onto vnodes of interest.

The Ficus project at UCLA [Guy et al 90, Heidemann 92] and the
stacking vnode work of [Rosenthal 90] pioneered the other major
approach to improving the file system infrastructure.  Both extend the
vnode interface by introducing facilities for vnode ``stacking''.

Heidemann's Ficus work added interface inheritance to the vnode
interface through operation extensibility: file system modules are
free to define new vnode operations.  At system boot time, the
layering framework determines the complete set of vnode operations and
constructs ops vectors for each file system module.  His framework
also supports static implementation inheritance; file systems are
stackable in the sense that their vnodes can compose on those of other
file systems.  The layering configurations for vnodes of each
mountable file system type are determined at system boot time and
remain fixed for each vnode's lifetime.  This work has been successful
enough to merit incorporation into the Berkeley 4.4BSD release.

Our interposition work is a direct descendant of Rosenthal's stacking
vnode prototype [Rosenthal 90].  His system supports implementation
inheritance in the form of dynamic composition, by forming vnode
stacks.  The configuration of a stack may change dynamically over its
lifetime.  His layering framework provides explicit support for
stacking in that, unless a file system module explicitly chooses
otherwise, operations directed at any of the vnodes in a stack are
redirected to the vnode at the top of the stack.  This mechanism
guarantees that stack configurations are opaque to their clients, but
it also prevents a vnode from being part of more than one stack,
precluding ``fan in'' configurations.  (Neither Heidemann's framework
nor the interposition model suffers from this problem.)  The mechanism
also causes ambiguity between a given vnode pointer's role in naming
the vnode it points at and in naming the stack that vnode is in.  This
ambiguity is the fundamental flaw of Rosenthal's stacking model, and
attempts to resolve it led to our interposition model.

8. Conclusion

The interposition model has been the key to solving many of the
problems that bedeviled Rosenthal's original vnode stacking model.
While developing our prototype implementation we have demonstrated
that:

o   The interposition model works.

    We have implemented a robust prototype that heavily exploits
    interposition yet exhibits reasonable performance.  We see no
    significant technical barriers to bringing our prototype to
    production quality.

o   Interposers are a powerful file system implementation and
    packaging tool.

    We have used interposers to centralize many file system services,
    such as mounting, lookup performance caching, and access to
    special files.  We expect to extend this centralization further,
    to directory services and media allocation policies.  By doing so,
    we have simplified maintenance of these services and simplified
    the remaining file system implementations.
    We have made it possible to shift effort in file system
    development toward providing unique added value, assembling the
    remaining behavior from standard components.

Many issues remain, most notably the transaction and page caching
issues described above.  We think that our locking techniques are
(barely) sufficient to allow us to avoid solving the transaction
problem, but we know we must solve the page caching problem.  We are
encouraged by our progress so far and feel that we will be able to
resolve these problems, yielding a much improved file system
infrastructure.

9. Acknowledgements

Thanks go to David Rosenthal, for originating this line of work, and
to the members of the UNIX International Stackable Files Work Group,
for providing the impetus to resume the work after it had lain fallow
for a couple of years.

10. References

[Bershad & Pinkerton 88]  Bershad, Brian N. and C. Brian Pinkerton,
    ``Watchdogs: Extending the UNIX File System'', Proceedings of the
    Winter 1988 USENIX Conference, Dallas, 1988.

[Callaghan & Lyon 89]  Callaghan, Brent and Tom Lyon, ``The
    Automounter'', Proceedings of the Winter 1989 USENIX Conference,
    San Diego, 1989.

[Guy et al 90]  Guy, Richard G., John S. Heidemann, Wai Mak,
    Thomas W. Page, Jr., Gerald J. Popek, and Dieter Rothmeier,
    ``Implementation of the Ficus Replicated File System'',
    Proceedings of the Summer 1990 USENIX Conference, Anaheim, 1990.

[Heidemann 92]  Heidemann, John S., ``File System Development with
    Stackable Layers'', Technical Report, Department of Computer
    Science, University of California, Los Angeles, 1992.

[Kleiman 86]  Kleiman, Steven R., ``Vnodes: An Architecture for
    Multiple File System Types in Sun UNIX'', Proceedings of the
    Summer 1986 USENIX Conference, Atlanta, 1986.

[Minnich 93]  Minnich, Ronald G., ``The AutoCacher: A File Cache Which
    Operates at the NFS Level'', Proceedings of the Winter 1993 USENIX
    Conference, San Diego, 1993.

[Moran & Lyon 93]  Moran, Joe and Bob Lyon, ``The Restore-o-Mounter:
    The File Motel Revisited'', Proceedings of the Summer 1993 USENIX
    Conference, Cincinnati, 1993.

[Pendry 89]  Pendry, Jan-Simon, ``Amd - an Automounter'', Technical
    Report, Department of Computing, Imperial College, London, 1989.

[Rosenthal 90]  Rosenthal, David S. H., ``Evolving the Vnode
    Interface'', Proceedings of the Summer 1990 USENIX Conference,
    Anaheim, 1990.

[Rosenthal 92]  Rosenthal, David S. H., ``Requirements for a
    "Stacking" Vnode/VFS Interface'', UNIX International document
    SF-01-92-N014, Parsippany, 1992.

[Webber 93]  Webber, Neil, ``Operating System Support for Portable
    File System Extensions'', Proceedings of the Winter 1993 USENIX
    Conference, San Diego, 1993.

11. Author Information

Glenn Skinner is a Senior Staff Engineer at SunSoft.  He is
responsible for riding herd over the SunOS architecture and, in what
little time is left over, he tries to do real work.  He is discovering
that, no matter how hard he tries, he can't leave the influence of his
years of STREAMS development behind.  His e-mail address is
glenn.skinner@Eng.Sun.COM.

Thomas K. Wong is a member of the Technical Staff in the File System
Group at SunSoft.  He received his S.B. degree in Computer Science
from MIT.  He is a member of X3B11.1 and ECMA TC15, and was the
project editor for ECMA Standard 168 (ISO/IEC DIS 13490), Volume and
File Structure of Read-Only and Write-Once Compact Disc Media for
Information Interchange.  His e-mail address is
thomas.wong@Eng.Sun.COM.