The following paper was originally published in the
Proceedings of the USENIX Summer 1993 Technical Conference
Cincinnati, Ohio, June 21-25, 1993

For more information about the USENIX Association contact:
    1. Phone:   510 528-8649
    2. FAX:     510 548-5738
    3. Email:   office@usenix.org
    4. WWW URL: https://www.usenix.org


              ``Stacking'' Vnodes: A Progress Report

                Glenn C. Skinner and Thomas K. Wong

                           SunSoft Inc.
                        2550 Garcia Avenue
                     Mountain View, CA 94043

ABSTRACT

People are dissatisfied with the file system services that come with
their UNIX systems.  They want to add new and better features.  At
present they have two choices: express their service as a user-level
NFS server, or use the vnode/VFS interface to build at least part of
it into the kernel.

Although the vnode/VFS interface has been remarkably successful as a
kernel structuring concept, it has failed to provide source
portability between UNIX versions or even binary compatibility between
releases of the same UNIX version.  It has been obvious for some time
that the vnode/VFS interface needs a redesign that allows file systems
to be shipped as binary kernel modules that survive from release to
release.  We describe a prototype kernel with a vnode/VFS interface
that would allow this.  It is based on earlier work on ``stacking''
vnodes at Sun and at UCLA, but it replaces the stacking concept with a
more strictly object-oriented concept of interposition.

1. Introduction

People are dissatisfied with the file system services that come with
their UNIX systems.  Existing file systems are looking increasingly
old and shopworn, and developers want to add new and better features.
At present they have two choices: express their service as a
user-level NFS server, or use the vnode and Virtual File System (VFS)
interfaces [Kleiman 86] to build at least part of it into the kernel.
Neither of these approaches is satisfactory.

User-level NFS servers are a viable approach for some purposes, and
people have used them to implement services such as automounters
[Callaghan & Lyon 89, Pendry 89], autocachers [Minnich 93], and the
Restore-o-Mounter [Moran & Lyon 93].  These services all primarily
concern themselves with name space manipulation and stay out of the
way of data movement.  Indeed, the big drawback of user-level NFS
servers is that it is nearly impossible to provide adequate
performance and semantics for data movement operations.  This
limitation severely restricts the approach's applicability.

Kernel-resident file system modules avoid the limitations of
user-level NFS servers, but suffer from problems of their own.  They
rely critically on the vnode/VFS interface as their means of
connection into the kernel environment.  But the interface suffers
from problems of compatibility and stability.  [Webber 93] states:

    UNIX kernels with a VFS architecture have been commercially
    available for many years.  Sun Microsystems, for example,
    described their VFS architecture in the 1986 Summer USENIX
    proceedings.
    By many measures the VFS concept has been quite successful, but
    from a third party point of view there are two major problems: few
    vendors have the same VFS interface [and] few vendors provide
    release-to-release source or binary compatibility for VFS modules.

We call these two problems the VFS portability problem and the
lock-step release problem, respectively.  Together, they make VFS
modules expensive to produce, expensive to port, and expensive to
maintain.  Compounding these problems is the interface's lack of
modularity: complete file system implementations are large and
complex, and the vnode/VFS interface provides no support for
decomposing them into smaller components.

It has been obvious for some time that a redesign of the vnode/VFS
interface that addressed these problems is needed.  We describe a
prototype kernel with a vnode/VFS interface that allows for binary
compatible, modular file system implementations.  It is based on
earlier work on ``stacking'' vnodes at Sun and at UCLA, but it
replaces the stacking concept with a more strictly object-oriented
concept of interposition.

In the rest of this paper, we start by analyzing the deficiencies of
the existing interface.  We then introduce the interposition model and
explain how it addresses these deficiencies.  In subsequent sections
we describe the design of a prototype system that realizes this model
and our experiences implementing the prototype.  Next we go over a
laundry list of problem areas we have not yet addressed.  Finally, we
examine related work and present concluding remarks.

2. Problems with the Vnode Interface

The Virtual File System (VFS) and vnode interfaces form the boundary
between file system implementations and the parts of the kernel that
access file system services.  The VFS interface is concerned with
operations pertaining to file systems as a whole, such as mount and
unmount, whereas the vnode interface governs access to individual
files.  The two interfaces have a similar structure, so the discussion
below concentrates on the vnode interface as illustrative of both.
Figure 1 shows an idealized sketch of the vnode interface.

    struct vnode {
            struct vfs      *v_vfsp;        /* owning VFS */
            enum vtype       v_type;        /* file, dir, etc. */
            struct vnodeops *v_op;          /* the ops vector */
            void            *v_private;     /* internal state */
            void            *v_public;      /* visible state */
    };

    struct vnodeops {
            int (*vop_open)();
            int (*vop_read)();
            int (*vop_write)();
            /* and so on */
    };

               Figure 1: Standard Vnode Interface

The interface is object-oriented in the sense that it represents each
file as a data structure called a vnode along with a set of operations
(a so-called ops vector) for manipulating it.  Every vnode belongs to
some VFS, which is responsible for defining its operations and
maintaining its internal state.  Clients of the interface invoke vnode
operations through macros like:

    #define VOP_X(vp, args) (*(vp)->v_op->vop_x)(vp, args)

In object-oriented terminology, calling this macro requests the vnode
object *vp to invoke its vop_x method.

In the introduction, we indicted the vnode interface on two basic
counts: that it lacks stability and that it lacks modularity.  Both of
these failings are aggravated by non-object-oriented aspects of the
interface.

o   Vnodes are not opaque.  Some of the vnode's internal structure is
    visible and available for direct manipulation by code outside of
    its owning VFS.
    This exposure makes it very difficult to change the vnode
    structure without breaking binary compatibility and is a major
    contributor to the interface's instability from release to
    release.  It also complicates resolving the next weakness:

o   The vnode interface makes no provision for inheritance.  From the
    interface inheritance viewpoint, the interface is not extensible,
    making it difficult to accommodate file systems with novel
    semantics.  From the implementation inheritance viewpoint, the
    interface is not modular, making it impossible to develop file
    systems as incremental units of behavior added to previously
    available components.

3. Interposition and Composition

The deficiencies listed in the last section lead directly to a set of
requirements for a revised vnode interface.  The interface should be
object-oriented.  Specifically, it must:

o   accommodate multiple file systems implementing the interface,

o   be opaque, and

o   support some form of implementation inheritance.

Another desirable requirement, that our work does not address, is that
the interface should:

o   support some form of interface inheritance.

The existing vnode interface already satisfies the first requirement.
The second is straightforward: in many cases existing references to
exposed state turn out to be unnecessary and can be eliminated.  We
handled the remaining references by converting them to calls to access
functions, which we added to the set of vnode operations.

The implementation inheritance requirement is trickier.  We require
that the interface allow for binary compatibility of independently
developed modules, which rules out language-based subclassing schemes.
One way to satisfy this requirement, and the way we chose, is to use
delegation rather than strict inheritance.  This technique allows one
file system module to forward an operation request (possibly modified)
to another module as part of that operation's implementation.

Experience with a previous attempt to add delegation to the vnode
interface [Rosenthal 90] and contemplation of its aftermath
[Rosenthal 92] had alerted us to a pitfall in adding delegation to the
interface.  In the existing vnode interface, a given vnode serves two
purposes: it is the in-kernel representative of some file system
object, and it records the state and operations that the file system
implementation applies to that object.  That is, vnodes capture both
identity and behavior.  We wanted to use delegation to add behavior,
but not at the price of adding unwanted identities.  This observation
led us to use a specialized form of delegation that we call
interposition, where a delegating (or interposing) file system module
in effect takes control of the target of the delegation, by capturing
all accesses to it.

Putting these requirements together led us to the revised vnode
interface outlined in Figure 2.

    struct vnode {
            struct pvnode *v_chain;     /* head of interposition chain */
    };

    struct pvnode {
            struct vfs      *p_vfsp;    /* owning VFS */
            struct vnodeops *p_op;      /* the ops vector */
            struct pvnode   *p_link;    /* next older interposer */
            void            *p_private; /* this interposer's state */
    };

    struct vnodeops {
            /* as before */
    };

             Figure 2: Interposition Vnode Interface

We retain the term vnode for the part that captures the identity of
the corresponding file system object.
A pvnode contains the information that describes the behavior that a
given file system imposes on an object; each pvnode belongs to a
particular file system implementation, as recorded in the p_vfsp
field, and is attached to a single vnode.  Pvnodes are organized into
interposition chains; the vnode's v_chain field names the first, most
recently interposed, pvnode in its chain, and the p_link pvnode field
links to the next most recent interposing pvnode.

We illustrate how interposition works through an example.

             .------.      +--------+      +-------+
    vp ---> (        ) --> |  dnlc  | ---> |  ufs  |
             `------'      +--------+      +-------+

               Figure 3: An Interposition Chain

Figure 3 depicts a vnode named by vp and its associated interposition
chain; the circle designates the vnode and the boxes its interposing
pvnodes.  Suppose the file associated with vp is a directory and some
process tries to look up a file in it.  The kernel's vnode layer will
issue the call VOP_LOOKUP(vp, NULL, name, &resultvp) to initiate the
lookup.  VOP_LOOKUP is defined as:[1]

    #define VOP_LOOKUP(vp, pvp, name, result) \
            vn_lookup(vp, pvp, name, result)

    int
    vn_lookup(struct vnode *vp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct pvnode *npvp = pvp ? pvp->p_link : vp->v_chain;

            return ((*(npvp->p_op->vop_lookup))(vp, npvp, name, result));
    }

    _________________________
    [1] This definition is simplified for the sake of illustration;
        locking considerations and the need to determine whether a new
        vnode has been created make the actual definition more
        complicated.

Since the initial call has a NULL pvnode argument, control passes to
the lookup method defined as part of dnlc's ops vector.  Dnlc is an
interposer that the directory name lookup cache VFS module defines.
Its lookup method consults the module's cache to see whether the
target file already has a vnode.  If so, the method sets the result
argument to that vnode and returns.  Otherwise, it forwards the
operation to the base ufs file system module with
VOP_LOOKUP(vp, pvp, name, result), where pvp is the pvnode argument it
was called with.  When the forwarded call returns, dnlc's lookup
method enters the resulting vnode into the cache and returns.
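
To make the forwarding pattern concrete, here is a simplified sketch
of such an interposer's lookup method.  The cache helpers and the
shape of the private state are illustrative only; they are not the
prototype's actual dnlc code.

    /*
     * Sketch of an interposer's lookup method in the style of the
     * dnlc interposer described above.  dnlc_cache_lookup() and
     * dnlc_cache_enter() stand in for the directory name lookup cache
     * and are hypothetical helpers, as is the layout of the private
     * state they operate on.
     */
    extern struct vnode *dnlc_cache_lookup(void *cache, char *name);
    extern void dnlc_cache_enter(void *cache, char *name,
        struct vnode *vp);

    static int
    dnlc_lookup(struct vnode *vp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct vnode *cached;
            int error;

            /* First consult the cache attached to this pvnode. */
            cached = dnlc_cache_lookup(pvp->p_private, name);
            if (cached != NULL) {
                    *result = cached;
                    return (0);
            }

            /*
             * Cache miss: forward the operation down the interposition
             * chain.  Passing our own pvnode makes vn_lookup() dispatch
             * to the next older interposer (here, the base ufs module).
             */
            error = VOP_LOOKUP(vp, pvp, name, result);
            if (error == 0)
                    dnlc_cache_enter(pvp->p_private, name, *result);
            return (error);
    }
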
The preceding example showed how one can build up behavior applicable
to a single object.  However, there is often need to combine multiple
objects together into a composite whole.  This process is called
composition; the interposition framework accommodates it by allowing
pvnodes to maintain references to other vnodes as part of their
private state.

A reference to the composing vnode acts as an entry point into the
overall composite object.  The components of the composition may be
named by referring to their vnodes; it is up to the composer to ensure
that such accesses are semantically sensible, perhaps by interposing
on each of the composees to exert control over their behavior.

Here is another example, showing how to combine composition with
interposition to handle file system mounts.  We use the technique
illustrated in this example to factor knowledge of mount point
semantics out of all our file system implementations.

    mnt_pt  ---> [ cross_mntpt ] ---> [ mnt_pt_dir ] ---> ...

    fs_root ---> [ do_dotdot ]   ---> [ root_dir ]   ---> ...

                  Figure 4: Mounted File Systems

In Figure 4, mnt_pt is a vnode naming a directory acting as a mount
point and fs_root names the root of the file system mounted there.
Cross_mntpt and do_dotdot are interposers that supply modified
versions of VOP_LOOKUP and pass all other operations through,
respectively, to mnt_pt_dir and root_dir; these in turn head the
remainder of mnt_pt's and fs_root's interposition chains.
Cross_mntpt and do_dotdot each compose on the other vnode
participating in the mount.

When a lookup request reaches mnt_pt, control first reaches
cross_mntpt.  Its lookup method compares the name argument with "..".
If they are equal, this component of the overall path name being
looked up passes out from underneath the part of the file system name
space the mount controls; otherwise, the component lands inside the
mount.  Cross_mntpt's lookup method proceeds accordingly, by
forwarding the request to the next interposer (mnt_pt_dir) if the name
is "..", and by passing the request on to its composee (fs_root)
otherwise.  Do_dotdot's lookup method is the dual of cross_mntpt's; if
the name is not "..", it forwards the request to the next interposer,
and otherwise it passes the request to its composee.

4. Design

We wanted to limit the scope of our prototyping effort so that the
project remained tractable.  Thus, we decided not to tackle adding
support for transactions or page cache consistency to the vnode
interface, since we knew from previous work that they pose tough
problems.  Instead, we decided to see how much mileage we could get
from developing our ideas about interposition.  We knew at the outset
that we would have to solve problems relating to vnode configuration
and locking.  Not surprisingly, we also encountered problems we had
not anticipated.  We discuss some of them below.

4.1. Vnode Configuration

When the system creates a new vnode it must be configured with
interposers that properly capture the semantics desired of the
corresponding file system object.  Thus, operations such as VOP_LOOKUP
that potentially create new vnodes must have hooks for distinguishing
newly created vnodes from previously existent ones and for adding
behavior (new interposers) to new vnodes.

We satisfy this requirement by having the vnode layer supply a virgin
vnode as a tentative candidate to be returned as the result of the
operation.  If an interposer decides that an existent vnode is
appropriate as the return value, it is free to substitute that vnode
for the tentatively supplied one.  If an interposer needs to tell
whether a forwarded operation has returned a new vnode, it can check
to see whether the forwardee returned the same vnode it was given.

Now we have the tools we need.  To ensure that the vnode resulting
from a lookup call behaves properly, we arrange for the directory
vnode handling the lookup call to have an interposer that examines the
result vnode.  If the vnode is new, the interposer pushes whatever VFS
modules are needed to supply the requisite behavior.  If not, the
interposer leaves the result vnode alone, since it presumably had
instances of the proper modules pushed when it was originally created.
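
The following sketch suggests how such a configuring interposer's
lookup method could behave.  The vn_push() call, the private-state
layout, and the assumption that the result argument arrives preloaded
with the virgin candidate vnode are all illustrative simplifications,
not the prototype's actual code.

    /*
     * Sketch of a configuring interposer's lookup method.  vn_push()
     * is a hypothetical framework call that pushes a new pvnode from
     * the named VFS module onto the front of a vnode's interposition
     * chain; struct config_state is likewise hypothetical.
     */
    struct config_state {
            struct vfs *cs_push_vfs;  /* module this configuration adds */
    };

    extern int vn_push(struct vnode *vp, struct vfs *vfsp);

    static int
    config_lookup(struct vnode *dvp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct config_state *cs = pvp->p_private;
            struct vnode *tentative = *result;  /* virgin candidate */
            int error;

            /* Forward the lookup down the rest of the chain. */
            error = VOP_LOOKUP(dvp, pvp, name, result);
            if (error != 0)
                    return (error);

            /*
             * A forwardee that returns the vnode it was given has
             * created a new vnode; push the behavior this
             * configuration calls for.  An existing vnode is left
             * alone, since it was configured when first created.
             */
            if (*result == tentative)
                    error = vn_push(*result, cs->cs_push_vfs);
            return (error);
    }
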
4.2. Concurrency and Locking

We wanted our implementation to allow maximal concurrency for
operations on a given vnode and, in particular, to allow pops
(removing the most recent interposer) to proceed in parallel with
other operations.  We also needed a way to ensure atomicity of creates
and destroys, so that intermediate states occurring as behavior is
added or removed are invisible.

Vnode destruction posed particular problems.  The implementation
destroys a given vnode by popping off interposers until none are left
and then deallocating the vnode itself.  During this time a concurrent
VOP_LOOKUP could retrieve an as yet unpopped pvnode from the defunct
vnode and attempt to return that vnode.  We had to design a locking
protocol between the interposition framework and file system
implementations that prevented this race condition and was robust in
the face of deallocating the vnode while the file system was
attempting to synchronize with it.

Our design uses a single lock in each vnode to guard that vnode's
interposition chain.  Multiple threads can acquire the lock for shared
access, or a single thread can acquire it for exclusive access.  A
given thread can acquire the lock recursively in either access mode,
although an attempt to upgrade from shared to exclusive access will
block until no other threads hold the lock.  Whenever a thread
executes a vnode operation, the interposition framework acquires the
vnode's lock on its behalf.  While holding the lock, the framework
obtains a reference to the pvnode supplying the method to be executed
and updates a count of threads executing in that pvnode.

When initiating operations that can cause plumbing changes, such as
VOP_POP or VOP_LOOKUP, the framework acquires the exclusive version of
the lock, to prevent other threads from accessing the vnode while it
is in an inconsistent state.  For destruction operations, the
framework locks the vnode being dismantled; for creation operations,
it locks the newly created vnode.  In either case, the thread
executing the operation can go on to initiate other operations on the
target vnode, since it holds the exclusive version of the lock and it
is legitimate to acquire that lock recursively.

We handle destroy/lookup races between the framework and VFS modules
by breaking the operations up into phases.  When popping a pvnode, the
framework notifies the pvnode's VFS by calling VOP_POP.  This call
informs the VFS that the pvnode is no longer attached to its vnode,
although there may still be threads executing within it or in other
pvnodes further down the interposition chain.  At this point, the VFS
must update the pvnode's internal state to mark it as having been
popped and henceforth refrain from trying to discover what vnode it is
attached to.  When no more threads are executing in the pvnode (the
reason for maintaining the thread execution count), the framework
calls VOP_INACTIVE to tell the owning VFS that it can destroy the
pvnode.

On the lookup side, a VFS that finds a pvnode that would satisfy the
lookup request must first mark it as a candidate for use, to prevent
the inactive routine from destroying it.  The VFS must then check to
see whether the pvnode has been popped but still has threads executing
against its old incarnation.  If so, it must wait for the threads to
finish and then recheck its lookup criteria.  Otherwise, it asks the
interposition framework to retrieve the vnode that the pvnode is
attached to.  This request may fail because of a concurrent VOP_POP or
vn_inactive call; if so, the VFS must try again after handshaking with
the other thread.  Finally, after successfully having run this
gauntlet, the VFS's lookup method can return the vnode.
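
As a rough illustration of the dispatch path just described, the
sketch below shows the vnode layer taking the chain lock, noting a
thread executing in the chosen pvnode, and then invoking the method.
The locking primitives and the p_threadcnt field are hypothetical
stand-ins; the prototype's actual interfaces differ in detail.

    /*
     * Sketch of method dispatch under the chain lock.  vn_lock_shared(),
     * vn_unlock(), and the p_threadcnt count of executing threads are
     * hypothetical names used only for illustration.
     */
    extern void vn_lock_shared(struct vnode *vp);
    extern void vn_unlock(struct vnode *vp);

    int
    vn_read(struct vnode *vp, struct pvnode *pvp, void *buf, int len)
    {
            struct pvnode *npvp;
            int error;

            vn_lock_shared(vp);         /* guard the interposition chain */
            npvp = pvp ? pvp->p_link : vp->v_chain;
            npvp->p_threadcnt++;        /* a thread now executes here */
            vn_unlock(vp);

            error = (*(npvp->p_op->vop_read))(vp, npvp, buf, len);

            vn_lock_shared(vp);
            npvp->p_threadcnt--;        /* may allow a pending VOP_INACTIVE */
            vn_unlock(vp);
            return (error);
    }
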
4.3. Ill-Defined Vnode Operations

Before starting to work on the design, we had realized that the
interposition model's strong object-orientation implied that it must
be possible to designate a single vnode as the recipient of each
operation, but we failed at first to appreciate all of the
ramifications.  We soon ran into trouble with operations, such as
VOP_CMP and the directory operations (VOP_LINK, VOP_RENAME, et al.),
that act on multiple vnodes.

In the old system, two vnodes from the same VFS exhibit the same
behavior because they inherit the same vnode ops vector.  However,
this may not be true under the interposition model, since different
behaviors may have been pushed onto either one or both of the vnodes.
An immediate consequence is that we don't know what to do with these
vnode operations under the interposition model, since the choice of
vnode to receive an operation may dictate its results.  (E.g., the
result of VOP_CMP(v1, v2) may not be the same as VOP_CMP(v2, v1).)
More fundamentally, it is only possible to interpose on one of the
vnodes involved in the operation (the one to which the operation is
directed), but a basic tenet of the model is that interposers have an
opportunity to influence all aspects of the behavior that the
corresponding vnode exhibits.  Hence, operations that use secondary
vnode arguments for anything other than their identity are
incompatible with the interposition model.

We addressed this problem by decomposing each vnode operation that has
multiple vnodes as input arguments into primitive vnode operations
that each require only one vnode as an input argument.  For example,
VOP_LINK turned into calls to the file vnode to fetch its file
identifier (fid) and to increment its link count, and a call to the
directory vnode to add an entry associating the fid and leaf name.

These decomposed operation sequences have to be atomic; it would be
incorrect to allow other threads to observe intermediate states
occurring part way through a sequence.  One obvious response to this
requirement would be to introduce transactions.  We were reluctant to
take that step and were able to avoid it by exploiting properties of
the vnode lock we had already introduced to maintain interposition
chain integrity.

When the vnode layer gets a multiple-vnode operation it starts by
determining which vnodes will participate in the operation.  It sorts
them into a canonical order (by address) to avoid deadlock, and
acquires each vnode's exclusive lock.  At this point, the thread
handling the operation can do whatever is required to the affected
vnodes without concern that intermediate states will be visible to
other threads.  It proceeds by executing the same algorithm that the
underlying file system implementation would execute in the old system,
calling the decomposed single-vnode methods to execute each step.
After completion, the vnode layer unlocks the vnodes, and the modified
state resulting from the overall operation becomes visible.

The vnode layer has to take care to handle errors from intermediate
operations by issuing operations to back out of the changes made so
far; it maintains enough state to allow it to do so.  There is no
guarantee that these back-out operations will succeed; this situation
is one where our pseudo-transaction technique has failure modes that
the old system does not exhibit.
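
The sketch below illustrates how the vnode layer might carry out a
link request using the decomposed operations listed in Table 1 below.
The locking helpers, the operation signatures, and the placeholder fid
layout are illustrative simplifications; back-out is shown only for
the link count.

    /*
     * Sketch of the vnode layer implementing link(file, dir, name)
     * with decomposed single-vnode operations (see Table 1).  The
     * signatures and helpers are simplified for illustration.
     */
    struct fid { char fd_data[16]; };   /* placeholder fid layout */

    extern void vn_lock_excl(struct vnode *vp);
    extern void vn_unlock(struct vnode *vp);

    int
    vn_link(struct vnode *filevp, struct vnode *dirvp, char *name)
    {
            struct fid fid;
            int error;

            /* Lock participants in canonical (address) order. */
            if (filevp < dirvp) {
                    vn_lock_excl(filevp);
                    vn_lock_excl(dirvp);
            } else {
                    vn_lock_excl(dirvp);
                    vn_lock_excl(filevp);
            }

            /* The old VOP_LINK, as primitive single-vnode steps. */
            error = VOP_FID(filevp, NULL, &fid);
            if (error == 0)
                    error = VOP_INCLINK(filevp, NULL);
            if (error == 0) {
                    error = VOP_DIRADDENTRY(dirvp, NULL, name, &fid);
                    if (error != 0)     /* back out; may itself fail */
                            (void) VOP_DECLINK(filevp, NULL);
            }

            vn_unlock(dirvp);
            vn_unlock(filevp);
            return (error);
    }
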
The current set of decomposed vnode operations is listed in Table 1
below.  We do not regard it as final and may change it as our design
evolves.

                Table 1: Decomposed Vnode Operations

    |_________________|___________________________________________|
    | Name            | Purpose                                   |
    |_________________|___________________________________________|
    | VOP_FID         | get a file's fid (file identifier)        |
    | VOP_MKOBJ       | create an object (e.g., regular file,     |
    |                 | directory, special file, symbolic link)   |
    |                 | and return its vnode and fid              |
    | VOP_DIRADDENTRY | add a directory entry for the file        |
    |                 | identified by the fid                     |
    | VOP_DIRRMENTRY  | remove a directory entry                  |
    | VOP_INCLINK     | increment a file's link count             |
    | VOP_DECLINK     | decrement a file's link count             |
    |_________________|___________________________________________|

4.4. Transforming VFS Operations into Vnode Operations

To operate correctly under the interposition model, many VFS
operations also need the ability to interpose on other VFSs.  For
example, the (old) VFS_SYNC call must be forwarded to every VFS that
is part of a given mounted file system to ensure proper ordering when
flushing data to the file system media.  To reduce complexity, we
would like to avoid interposing on VFS operations.  To achieve this
end, we have converted many VFS operations into vnode operations.  The
underlying idea is to treat a VFS operation as a vnode operation for
which the file system(s) interpreting the operation ignore the
identity of the vnode to which the operation was issued.  In other
words, a VFS operation can go to any vnode associated with the target
VFS.  Table 2 lists the set of VFS operations we have transformed into
vnode operations.

        Table 2: Vnode Operations Transformed from VFS Operations

    |____________________|____________________________________________|
    | Name               | Purpose                                    |
    |____________________|____________________________________________|
    | VOP_PREPARE_UMOUNT | check if the VFS identified by the vnode   |
    |                    | is ready to unmount (e.g., not busy)       |
    | VOP_UMOUNT         | unmount the VFS identified by the vnode    |
    | VOP_VGET           | return the vnode identified by the fid     |
    | VOP_STATVFS        | return statvfs information for the VFS     |
    |                    | identified by the vnode                    |
    | VOP_SYNC           | flush data belonging to the VFS identified |
    |                    | by the vnode to the underlying media       |
    |____________________|____________________________________________|

5. Implementation

Having resolved the design issues described above, it was time to
start implementing our prototype.  We soon had a key insight: that it
was possible to have the interposition framework coexist with the old
vnode layer, with all the vnodes associated with a given mount point
being uniformly either old style or new.  This idea allowed us to
develop our prototype incrementally, first converting simple file
systems such as fdfs and tmpfs to the interposition framework, and
later more complex file systems such as ufs and specfs.

For our prototype to be a successful proof of concept, we thought it
important for it to boot using new style mounts as the default.  We
have converted it to do so, although some file systems, such as nfs,
fifofs, and procfs, remain available only in their old form.

Our strategy of allowing old and new style file systems to coexist
worked according to plan.  This technique has allowed us to obtain
valuable performance information without having to convert all file
systems to be interposition-aware.  We also view it as the basis of a
migration strategy for converting to the interposition framework; file
system developers can choose to take the plunge at their own pace
without facing a system-wide flag day.
5.1. VFS Conversion Experience

The first step of our implementation was to change the kernel to allow
old and new style file systems to coexist.  We did this by redefining
the vnode structure to be the union of the old version and the new.
We added a new VOP_GETVNVER operation to return the vnode structure's
effective version number.  As mentioned above, we also added a set of
new vnode operations (e.g., VOP_GETVRDEV) to act as access functions
for previously accessible fields in the old vnode structure that are
no longer visible (e.g., v_rdev).  Since the ops vector field is
unused for new style vnodes, we assigned a set of ``just say no''
operations to that field in all such vnodes, to return ENOSYS if
invoked inadvertently.

We implemented parallel sets of routines in the vnode layer to handle
old and new style vnodes.  The implementation switches to the
appropriate routine based on the target vnode's VOP_GETVNVER value.
Some routines, such as lookuppn, required more work, as they have to
cope with both old and new vnodes, which can occur when crossing mount
points.

The first file system we converted was tmpfs.  The main objective was
to test file object creation and directory services.  We chose it
first because it has no persistent storage.  Therefore, it is simpler
to debug and has a smaller probability of triggering race conditions
that would expose bugs in the prototype.  The conversion was smooth
and produced no surprises.

Our next file system was ufs.  Here, we looked for components that
could be reimplemented as interposers and could therefore be reused by
other file system implementations.  This was the most difficult file
system to convert, because it is more likely to encounter race
conditions while waiting for disk i/o completions.  In addition, the
converted version had to be efficient, or else the performance goals
of the prototype would not be met.

Specfs presented a new challenge because it uses composition to
support devices that have multiple device names with the same major
and minor device number.  Since we retained the same basic design, we
thought the conversion would be straightforward.  This was true for
block devices, but we encountered difficulties when dealing with
character devices, especially with clone devices and the console
device.  The main problem was that the kernel directly accessed the
private data of specfs vnodes, violating vnode opaqueness.  There are
two reasons why this problem occurs:

o   The VOP_GETATTR operation is not extensible and cannot return
    attributes (e.g., dev_info and common_specvp) that are unique to
    specfs, but that the kernel needs.  This is a deficiency of the
    old vnode interface.

o   It is more efficient to access these attributes directly.

The ideal solution to this problem would be to provide an extensible
VOP_GETATTR operation that returns both the file system type
independent and the file system type specific attributes, thus
allowing the kernel to access specfs-specific data while still
respecting vnode opaqueness.  The short term solution we devised was
to add a new VOP_GETKEYEDDATA operation, which allows interposers that
recognize the input ``key'' argument to return interposer-specific
data.

Overall, our conversion effort was not too difficult and, in fact, it
became simpler as we identified more interposers and made them
available.  With the exception of debugging specfs, the toughest part
turned out to be breaking down existing directory-related vnode
operations into more primitive ones.
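
The following sketch shows one way the old/new coexistence described
above could be expressed; the field, type, and routine names
(v_version, VN_VERSION_NEW, vn_old_read, vn_new_read) are illustrative
stand-ins rather than the prototype's actual definitions.

    /*
     * Sketch of old/new vnode coexistence: the vnode is a union of
     * the old and new layouts, and the vnode layer dispatches on an
     * effective version number (cf. VOP_GETVNVER).  All names here
     * are hypothetical illustrations.
     */
    enum vn_version { VN_VERSION_OLD, VN_VERSION_NEW };

    struct old_vnode { struct vnodeops *v_op;    /* old fields ... */ };
    struct new_vnode { struct pvnode   *v_chain; /* new fields ... */ };

    struct vnode {
            enum vn_version v_version;      /* effective version */
            union {
                    struct old_vnode v_old; /* pre-interposition layout */
                    struct new_vnode v_new; /* interposition layout */
            } v_un;
    };

    extern int vn_old_read(struct vnode *vp, void *buf, int len);
    extern int vn_new_read(struct vnode *vp, void *buf, int len);

    int
    vn_read_dispatch(struct vnode *vp, void *buf, int len)
    {
            /* Parallel old and new style routines; pick one per vnode. */
            if (vp->v_version == VN_VERSION_NEW)
                    return (vn_new_read(vp, buf, len));
            else
                    return (vn_old_read(vp, buf, len));
    }
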
5.2. The Interposer Toolkit

One of our goals in this project is to make file system implementation
easier.  We plan to meet this goal by providing a toolkit of
interposers that file system implementations can depend on to meet
standard requirements (e.g., POSIX), to satisfy the UNIX file system
semantics, and to acquire new and enhanced file system features.  As
we convert existing file systems to operate under the interposition
model, we try to identify file system components that can be removed
and reimplemented as interposers.  The set of interposers we have
identified so far is listed below.  We expect to extend the toolkit
with new interposers as we continue our effort to factor existing file
systems into smaller components.

                    Table 3: The Interposer Toolkit

    |_________|____________________________________________________________|
    | Name    | Purpose                                                    |
    |_________|____________________________________________________________|
    | config  | push the proper behavior onto a new vnode                  |
    | dnlc    | add/remove a name to/from the directory name lookup cache  |
    | mfrl    | handle mandatory file and record locking                   |
    | mount   | splice a VFS into the file system name space               |
    | specref | provide access to special files                            |
    | ro      | allow only read-only vnode operations to proceed           |
    |_________|____________________________________________________________|

We were pleasantly surprised by how easy the interposer toolkit is to
use.  If certain standard file system features (e.g., dnlc) are
desired, all it takes is to include them as part of the file system
configuration at mount time.  There is no need to do any programming
in the file system implementation.  This is a big win over the
existing approach of providing common file system services through
library routines, because with libraries the file system implementors
must write code in advance in order to take advantage of these
services.

5.3. Performance

We used two sets of benchmarks to assess our prototype.  The first
benchmark consisted of a perl script used to time creation and
deletion of 1000 files of different sizes (e.g., 0K, 1K, 4K, 10K) on a
new style ufs file system that used all of the toolkit interposers
listed above except ro.  We used this test mainly during the early
stage of the prototype development.  The data obtained from this
benchmark showed no significant performance degradation.  We found
these results highly encouraging, since our prototype properly handled
multi-threaded locking and used interposition in a nontrivial way.
After we had converted specfs and switched the prototype to use new
style vnodes as the default, we re-ran the script, and again found no
significant performance degradation.

We then ran one of our standard benchmarks to measure the prototype's
throughput under a workload simulating a C program development
environment.  The system throughput from the benchmark is reported as
the number of work scripts processed per hour, for various values of a
workload parameter representing the number of simulated users.  We
expected this benchmark to reveal some amount of performance
degradation: because of the factoring that we have done, both of vnode
operations and of file system implementations, optimizations possible
under the old monolithic implementations are no longer available.

To run this throughput benchmark successfully, a system must be very
robust, and our prototype has only recently reached this level of
maturity.  We ran the benchmark on a Sun SPARCstation 1 with 16M of
memory, installed with SunOS 5.1.
Table 4 compares results from runs with our prototype kernel and with
the standard 5.1 kernel.

                     Table 4: Throughput Results

    |__________|______________|______________|____________|
    |          |  SunOS 5.1   |  Prototype   |            |
    | Workload | Scripts/Hour | Scripts/Hour | Change (%) |
    |__________|______________|______________|____________|
    |     1    |     13.83    |     13.83    |      0     |
    |    16    |    185.03    |    181.82    |    -1.7    |
    |    24    |    198.58    |    191.28    |    -3.6    |
    |    28    |    193.70    |    184.85    |    -4.5    |
    |    32    |    175.42    |    179.47    |    +2.3    |
    |    36    |    159.41    |    143.35    |    -9.9    |
    |__________|______________|______________|____________|

The system throughput benchmark result indicates that the prototype
suffers less than 5% degradation in peak throughput, but degrades more
rapidly after the peak workload.  We have not yet analyzed the cause
of the degradation, but suspect that it is a symptom of additional CPU
overhead.  Note that we have not yet tuned the prototype, and have not
implemented performance saving features that are now possible under
the interposition model.  We therefore believe that we should be able
to improve the prototype further.

6. The To-Do List

In this section we discuss areas that our prototype does not yet
address.  We start with things we think we understand and proceed to
issues of increasing difficulty.

6.1. Configurator Interposers

If we succeed in converting existing file systems into a rich
interposer toolkit of file system components, we will have created
another problem: how can people specify combinations of components
that do what they want?  We expect that most users would prefer to
remain ignorant of the issue altogether and that most others would
prefer to use pre-existent file system configurations that are known
to behave properly.  Thus, the problem reduces to devising a way for
file system vendors to specify ``canned configurations''.

In the existing system, each mount can differ by file system type from
other mounts, and the mount system call takes an argument specifying
that type.  We plan to exploit this interface by using the type
argument to name a configurator interposer that is responsible for
ensuring that all files underneath the new mount point behave
appropriately.  The mount system call will push this interposer on the
root vnode of the newly mounted file system.  It is subsequently
responsible for pushing a copy of itself on directories, as lookup
calls create vnodes for them, and for pushing the sequence of
interposers that makes up the canned configuration on all files, as
lookups create new vnodes for them.  By induction, the overall effect
is to ensure the proper layering configuration for all vnodes
associated with the mount.

6.2. A Directory Interposer

We want to split ufs into a directory piece and an inode-level file
access piece, and make the former available as an interposer that
other base file systems can use.  With the old set of vnode
operations, this separation is infeasible; operations such as
VOP_CREATE mix together actions from both pieces (creating a file and
entering it in a directory), preventing their separation.  Fortunately
for this purpose, we have already had to decompose operations that mix
directory access with file access, as explained above.

When pushed on top of a regular file, this interposer will make it
behave as a directory.  For example, it will handle VOP_LOOKUP calls
by reading the underlying file, looking for a matching entry.  If it
finds one, it will extract the fid from the entry and use it as the
argument to a VOP_VGET call on the file.  The VFS defining the file's
behavior will then look the fid up and return its vnode, which the
directory interposer then returns in turn.  Other operations proceed
similarly.

Note that the directory interposer expects the VFS supplying behavior
for the file it is pushed on to supply a primitive directory service
of its own, in the form of the VOP_VGET operation.  The interposer
converts that VFS's flat, name-by-fid name space into the standard
hierarchical name space.
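
A sketch of how such a directory interposer's lookup method might work
appears below.  The on-file entry format, the entry-reading helper,
and the exact VOP_VGET signature are illustrative guesses; this
interposer does not yet exist in the prototype.

    /*
     * Sketch of a directory interposer's lookup method: scan the
     * entries stored in the underlying file for a matching name, then
     * ask the underlying VFS to turn the entry's fid into a vnode via
     * VOP_VGET.  The entry format and helpers are hypothetical.
     */
    struct dir_entry {
            struct fid de_fid;          /* file identifier of the entry */
            char       de_name[256];    /* leaf name */
    };

    extern int dir_read_entry(struct vnode *vp, int index,
        struct dir_entry *de);

    static int
    dir_lookup(struct vnode *dvp, struct pvnode *pvp, char *name,
        struct vnode **result)
    {
            struct dir_entry de;
            int i;

            /* Read the underlying file's entries, one at a time. */
            for (i = 0; dir_read_entry(dvp, i, &de) == 0; i++) {
                    if (strcmp(de.de_name, name) == 0) {
                            /*
                             * Found a match: the base VFS supplies a
                             * flat, name-by-fid name space via VOP_VGET.
                             */
                            return (VOP_VGET(dvp, pvp, &de.de_fid,
                                result));
                    }
            }
            return (ENOENT);
    }
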
6.3. Impedance Mismatches

Decomposing multi-vnode operations into more primitive single-vnode
operations has consequences for existing file systems.  Consider, for
example, the NFS client side implementation.  It has a particular
fixed vocabulary, in the form of the NFS protocol, that it must use to
communicate with the server.  This vocabulary is well matched to the
old set of vnode operations, but is at a considerably higher level
than the new set of operations.  Thus, the client implementation now
faces an impedance matching problem: it must convert an incoming
stream of vnode operations into a corresponding stream of NFS protocol
RPCs.  Because of the different levels of the two streams, the client
implementation must accumulate incoming vnode operations until they
add up to a protocol operation.

We can handle this new implementation requirement by adding to the
implementation a state machine that tracks progress in accumulating
operations that add up to a protocol request.  However, the mechanics
of the addition are likely to be ugly.  One complication is that
operations can occur concurrently on multiple vnodes, and the
implementation must keep these logically separate input streams
distinct.

We plan to address this problem by taking a small step toward adding
transactions to the vnode interface.  Our motivation is the
observation that the vnode layer now handles many incoming requests by
decomposing them into more primitive operations.  Thus, it has
knowledge of which operations are associated with which higher level
requests.  Stated differently, the vnode layer is effectively creating
transactions, so it might as well pass some information about them to
the participants.  Thus, we intend to add an additional argument to
each vnode operation identifying the transaction to which the
operation belongs.

6.4. Transactions

Having tiptoed this far toward full transaction support, why not go
the whole way?  We have a design sketch [Rosenthal 92], but are
reluctant to use it because of concerns about its ramifications.
Adding transaction support to the kernel's vnode layer and to file
system implementations is as invasive a step as converting them to the
interposition model.  We judged that making two changes of this
magnitude would add unacceptable risk to our project.

However, the project is now at a point where it is time to reexamine
this decision.  If we decide to proceed, our transaction facility
would be confined to the kernel's vnode layer and to file system
implementations; it would not be available at user level for general
use.  We would use it to replace our current ad hoc lock-based
implementations in those parts of the prototype.

Here is a sketch of the transaction design and the effect of using the
facility on some sample code.  Each vnode operation would take an
additional transaction context argument (replacing the transaction
identifier mentioned in the previous section).  If the context is
NULL, the operation is not part of a transaction and proceeds as
before.  Otherwise, the operation must join the transaction the
context argument identifies, by adding a record of its own to the
context.  If the operation cannot join the transaction, it must fail,
causing the transaction to abort.

Each record in the transaction context contains pointers to callback
routines to be executed when the transaction aborts, prepares to
commit, or commits.  The transaction framework provides routines to
invoke the appropriate callback from each record, for use when the
transaction reaches the abort, prepare, or commit stage.
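
One possible shape for the transaction context (the vtc used in
Figure 6 below) is sketched here; the record layout and field names
are illustrative only and are not settled parts of the design.

    /*
     * Illustrative shape for the transaction context.  Each
     * participating operation adds a vtc_record carrying its abort,
     * prepare, and commit callbacks.
     */
    struct vtc_record {
            struct vtc_record *r_next;      /* next participant */
            void (*r_abort)(void *arg);     /* undo this operation */
            int  (*r_prepare)(void *arg);   /* ready to commit? */
            void (*r_commit)(void *arg);    /* make effects permanent */
            void *r_arg;                    /* participant's private state */
    };

    struct vtc {
            struct vtc_record *vtc_records; /* joined operations */
    };

    /* Framework entry points invoked by code like that of Figure 6. */
    int vtc_try_commit(struct vtc *vtc);    /* prepare all, then commit all */
    int vtc_abort(struct vtc *vtc);         /* run abort callbacks, free context */
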
After restructuring to use transactions, the code in Figure 5 becomes
that of Figure 6.

    if (VOP_FOO(vp) != 0) {
            /* About a page of code cleaning up the mess */
            goto bad1;
    }
    /* The FOO operation succeeded - do something more */
    if (VOP_BAR(vp) != 0) {
            /* About a page of code cleaning up the mess */
            goto bad2;
    }
    /* The BAR operation succeeded - do something more */
    return (0);
bad2:
bad1:
    return (ENOTTY);

               Figure 5: Code Before Transactions

    vtc = kmem_alloc(sizeof (struct vtc));
    if (VOP_FOO(vp, vtc) == 0) {
            if (VOP_BAR(vp, vtc) == 0) {
                    if (vtc_try_commit(vtc) == 0)
                            return (0);
            }
    }
    return (vtc_abort(vtc));

               Figure 6: Code After Transactions

6.5. Page Cache Consistency

Up to now, we have avoided grappling with the interaction of the VM
system with the interposition framework.  A major issue is how to
structure the page cache.  In the old system, each vnode has an
associated list of pages whose contents cache parts of the
corresponding file.  The pages are each marked with the address of the
vnode and the offset into the file.

As a matter of philosophy, we believe that each VFS module should be
able to interact with the page cache as it sees fit, to store data
consistent with the behavior it imposes on a given file.  This
consideration suggests that the page cache should identify pages by
pvnode and offset instead of by vnode and offset, as in the old
system.

The other primary issue is how interposers and composers that
cooperatively interact to manage a composite file can maintain
coherence among the pages that they have entered into the cache while
still providing for efficient i/o involving those pages.

We have preliminary ideas about how to resolve these issues, but they
are not yet well enough developed to report.  However, it is already
apparent that we will have to add a richer set of methods for file
system modules and the VM system to use to communicate with each
other; anything that the VM system does that might affect the status
of a cached page must be visible to the relevant file system modules,
and vice versa.

7. Related Work

People have taken two basic approaches to overcoming deficiencies in
the system's file system infrastructure: finding better ways to
support user level file system modules, and attacking the limitations
of the vnode interface.

Watchdogs [Bershad & Pinkerton 88] and the File Monitor Interface
[Webber 93] are examples of the former approach.  Both are founded on
the premise that the kernel is a sufficiently inhospitable environment
that it is desirable to provide facilities for user-level file system
modules.  In contrast to the user-level NFS server approach, they rely
on kernel modifications and a specialized user-kernel communication
channel to provide finer control over how the user level file system
process can handle operations, both in the choice of which operations
require intervention and in how to dispose of them.
In contrast with in-kernel techniques such as the interposition model,
approaches that attempt to move most of the work to user level offer a
congenial and portable file system development environment, at the
price of context switching overhead, additional data copying overhead,
and the possibility of deadlock.  They are also potentially compatible
with the interposition approach; the kernel hooks for communicating
with a user-level process could be packaged as an interposer that is
pushed onto vnodes of interest.

The Ficus project at UCLA [Guy et al 90, Heidemann 92] and the
stacking vnode work of [Rosenthal 90] pioneered the other major
approach to improving the file system infrastructure.  Both extend the
vnode interface by introducing facilities for vnode ``stacking''.

Heidemann's Ficus work added interface inheritance to the vnode
interface through operation extensibility: file system modules are
free to define new vnode operations.  At system boot time, the
layering framework determines the complete set of vnode operations and
constructs ops vectors for each file system module.  His framework
also supports static implementation inheritance; file systems are
stackable in the sense that their vnodes can compose on those of other
file systems.  The layering configurations for vnodes of each
mountable file system type are determined at system boot time and
remain fixed for each vnode's lifetime.  This work has been successful
enough to merit incorporation into the Berkeley 4.4BSD release.

Our interposition work is a direct descendant of Rosenthal's stacking
vnode prototype [Rosenthal 90].  His system supports implementation
inheritance in the form of dynamic composition, by forming vnode
stacks.  The configuration of a stack may change dynamically over its
lifetime.  His layering framework provides explicit support for
stacking in that, unless a file system module explicitly chooses
otherwise, operations directed at any of the vnodes in a stack are
redirected to the vnode at the top of the stack.  This mechanism
guarantees that stack configurations are opaque to their clients, but
it also prevents a vnode from being part of more than one stack,
precluding ``fan in'' configurations.  (Neither Heidemann's framework
nor the interposition model suffers from this problem.)  The mechanism
also causes ambiguity between a given vnode pointer's role in naming
the vnode it points at and in naming the stack that vnode is in.  This
ambiguity is the fundamental flaw of Rosenthal's stacking model, and
attempts to resolve it led to our interposition model.

8. Conclusion

The interposition model has been the key to solving many of the
problems that bedeviled Rosenthal's original vnode stacking model.
While developing our prototype implementation we have demonstrated
that:

o   The interposition model works.

    We have implemented a robust prototype that heavily exploits
    interposition yet exhibits reasonable performance.  We see no
    significant technical barriers to bringing our prototype to
    production quality.

o   Interposers are a powerful file system implementation and
    packaging tool.

    We have used interposers to centralize many file system services,
    such as mounting, lookup performance caching, and access to
    special files.  We expect to extend this centralization further,
    to directory services and media allocation policies.  By doing so,
    we have simplified maintenance of these services and simplified
    the remaining file system implementations.
    We have made it possible to shift effort in file system
    development toward providing unique added value, assembling the
    remaining behavior from standard components.

Many issues remain, most notably the transaction and page caching
issues described above.  We think that our locking techniques are
(barely) sufficient to allow us to avoid solving the transaction
problem, but we know we must solve the page caching problem.  We are
encouraged by our progress so far and feel that we will be able to
resolve these problems, yielding a much improved file system
infrastructure.

9. Acknowledgements

Thanks go to David Rosenthal, for originating this line of work, and
to the members of the UNIX International Stackable Files Work Group,
for providing the impetus to resume the work after it had lain fallow
for a couple of years.

10. References

[Bershad & Pinkerton 88]  Bershad, Brian N. and C. Brian Pinkerton,
    ``Watchdogs: Extending the UNIX File System'', Proceedings of the
    Winter 1988 USENIX Conference, Dallas, 1988.

[Callaghan & Lyon 89]  Callaghan, Brent and Tom Lyon, ``The
    Automounter'', Proceedings of the Winter 1989 USENIX Conference,
    San Diego, 1989.

[Guy et al 90]  Guy, Richard G., John S. Heidemann, Wai Mak,
    Thomas W. Page, Jr., Gerald J. Popek, and Dieter Rothmeier,
    ``Implementation of the Ficus Replicated File System'',
    Proceedings of the Summer 1990 USENIX Conference, Anaheim, 1990.

[Heidemann 92]  Heidemann, John S., ``File System Development with
    Stackable Layers'', Technical Report, Department of Computer
    Science, University of California, Los Angeles, 1992.

[Kleiman 86]  Kleiman, Steven R., ``Vnodes: An Architecture for
    Multiple File System Types in Sun UNIX'', Proceedings of the
    Summer 1986 USENIX Conference, Atlanta, 1986.

[Minnich 93]  Minnich, Ronald G., ``The AutoCacher: A File Cache Which
    Operates at the NFS Level'', Proceedings of the Winter 1993 USENIX
    Conference, San Diego, 1993.

[Moran & Lyon 93]  Moran, Joe and Bob Lyon, ``The Restore-o-Mounter:
    The File Motel Revisited'', Proceedings of the Summer 1993 USENIX
    Conference, Cincinnati, 1993.

[Pendry 89]  Pendry, Jan-Simon, ``Amd - an Automounter'', Technical
    Report, Department of Computing, Imperial College, London, 1989.

[Rosenthal 90]  Rosenthal, David S. H., ``Evolving the Vnode
    Interface'', Proceedings of the Summer 1990 USENIX Conference,
    Anaheim, 1990.

[Rosenthal 92]  Rosenthal, David S. H., ``Requirements for a
    "Stacking" Vnode/VFS Interface'', UNIX International document
    SF-01-92-N014, Parsippany, 1992.

[Webber 93]  Webber, Neil, ``Operating System Support for Portable
    File System Extensions'', Proceedings of the Winter 1993 USENIX
    Conference, San Diego, 1993.

11. Author Information

Glenn Skinner is a Senior Staff Engineer at SunSoft.  He is
responsible for riding herd over the SunOS architecture and, in what
little time is left over, he tries to do real work.  He is discovering
that, no matter how hard he tries, he can't leave the influence of his
years of STREAMS development behind.  His e-mail address is
glenn.skinner@Eng.Sun.COM.

Thomas K. Wong is a member of the Technical Staff in the File System
Group at SunSoft.  He received his S.B. degree in Computer Science
from MIT.  He is a member of X3B11.1 and ECMA TC15, and was the
project editor for ECMA Standard 168 (ISO/IEC DIS 13490), Volume and
File Structure of Read-Only and Write-Once Compact Disc Media for
Information Interchange.  His e-mail address is
thomas.wong@Eng.Sun.COM.