The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993. For more information about the USENIX Association contact: Phone: 510 528-8649; FAX: 510 548-5738; Email: office@usenix.org; WWW URL: https://www.usenix.org

Discovery and Hot Replacement of Replicated Read-Only File Systems, with Application to Mobile Computing

Erez Zadok and Dan Duchamp
Computer Science Department
Columbia University

ABSTRACT

We describe a mechanism for replacing files, including open files, of a read-only file system while the file system remains mounted; the act of replacement is transparent to the user. Such a "hot replacement" mechanism can improve fault-tolerance, performance, or both. Our mechanism monitors, from the client side, the latency of operations directed at each file system. When latency degrades, the client automatically seeks a replacement file system that is equivalent to but hopefully faster than the current file system. The files in the replacement file system then take the place of those in the current file system.

This work has particular relevance to mobile computers, which in some cases might move over a wide area. Wide area movement can be expected to lead to highly variable response time, and give rise to three sorts of problems: increased latency, increased failures, and decreased scalability. If a mobile client moves through regions having partial replicas of common file systems, then the mobile client can depend on our mechanism to provide increased fault tolerance and more uniform performance.

1 Introduction

The strongest trend in the computer industry today is the miniaturization of workstations into portable "notebook" or "palmtop" computers. Wireless network links [3] and new internetworking technology [8] offer the possibility that computing sessions could run without interruption even as computers move, using information services drawn from an infrastructure of (mostly) stationary servers.

We contend that operation of mobile computers according to such a model will raise problems that require re-thinking certain issues in file system design.1 One such issue is how to cope with a client that moves regularly yet unpredictably over a wide area.

[Footnote 1: Examples of such re-thinking can be found in [25] and [26].]

Several problems arise when a client moves a substantial distance away from its current set of servers. One is worse latency, since files not cached at the client must be fetched over longer distances. Another problem is increased probability of loss of connectivity, since gateway failures often lead to partitions. The final problem is decreased overall system "scalability": more clients moving more data over more gateways means greater stress on the shared network.

One obvious way to mitigate these problems is to ensure that a file service client uses "nearby" servers at all times. A simple motivating example is that if a computer moves from New York to Boston, then in many cases it is advantageous to switch to using the Boston copies of "common" files like those in /usr/ucb.
As the client moves, the file service must be able to provide service first from one server, then from another. This switching mechanism should require no action on the part of administrators (since presumably too many clients will move too often and too quickly for administrators to track conveniently) and should be invisible to users, so that users need not become system administrators.

We have designed and implemented just such a file system: it adaptively discovers and mounts a "better" copy of a read-only file system which is fully or partially replicated. We define a better replica to be one providing better latency. Running our file service gives a mobile client some recourse to the disadvantages mentioned above. Our mechanism monitors file service latencies and, when response becomes inadequate, performs a dynamic attribute-guided search for a suitable replacement file system.

Many useful "system" file systems, and almost all file systems that one would expect to be replicated over a wide area, are typically exported read-only. Examples include common executables, manual pages, fonts, include files, etc. Indeed, read-only areas of the file space are growing fast, as programs increase the amount of configuration information, images, and on-line help facilities.

Although our work is motivated by the perceived needs of mobile computers that might roam over a wide area and/or frequently cross between public and private networks, our work can be useful in any environment characterized by highly variable response time and/or high failure rates.

Note that for a client to continue use of a file system as it moves, there must be underlying network support that permits the movement of a computer from one network to another without interruption of its sessions. Several such schemes have been developed [8, 7, 28, 29].

The remainder of this paper is organized as follows. In order to make a self-contained presentation, Section 2 provides brief explanations of other systems that we use in constructing ours. Section 3 outlines our design and Section 4 evaluates the work. Lastly, we mention related work in Section 5 and summarize in Section 6.

2 Background

Our work is implemented in and on SunOS 4.1.2. We have changed the kernel's client-side NFS implementation, and outside the operating system we have made use of the Amd automounter and the RLP resource location protocol. Each is explained briefly below.

2.1 NFS

Particulars about the NFS protocol and implementation are widely known and published [20, 12, 9, 19]. For the purpose of our presentation, the only uncommon facts that need to be known are:

o Translation of a file path name to a vnode is done mostly within a single procedure, called au_lookuppn(), that is responsible for detecting and expanding symbolic links and for detecting and crossing mount points.

o The name of the procedure in which an NFS client makes RPCs to a server is rfscall().

We have made substantial alterations to au_lookuppn(), and slight alterations to rfscall(), nfs_mount(), nfs_unmount() and copen().2 We added two new system calls: one for controlling and querying the added structures in the kernel, and the other for debugging. Finally, we added fields to three major kernel data structures: the vnode and vfs structures and the open file table.

[Footnote 2: Copen() is the common code for open() and create().]
2.2 RLP

We use the RLP resource location protocol [1] when seeking a replacement file system. RLP is a general-purpose protocol that allows a site to send broadcast or unicast request messages asking either of two questions:

1. Do you (recipient site) provide this service?
2. Do you (recipient site) know of any site that provides this service?

A service is named by the combination of its transport service (e.g., TCP), its well-known port number as listed in /etc/services, and an arbitrary string that has meaning to the service. Since we search for an NFS-mountable file system, our RLP request messages contain information such as the NFS transport protocol (UDP [16]), port number (2049) and service-specific information such as the name of the root of the file system.

2.3 Amd

Amd [15] is a widely-used automounter daemon. Its most common use is to demand-mount file systems and later unmount them after a period of disuse; however, Amd has many other capabilities.

Amd operates by mimicking an NFS server. An Amd process is identified to the kernel as the "NFS server" for a particular mount point. The only NFS calls for which Amd provides an implementation are those that perform name binding: lookup, readdir, and readlink. Since a file must have its name resolved before it can be used, Amd is assured of receiving control during the first use of any file below an Amd mount point. Amd checks whether the file system mapped to that mount point is currently mounted; if not, Amd mounts it, makes a symbolic link to the mount point, and returns to the kernel. If the file system is already mounted, Amd returns immediately.

An example, taken from our environment, of Amd's operation is the following. Suppose /u is designated as the directory in which all user file systems live; Amd services this directory. At startup time, Amd is instructed that the mount point is /n. If any of the three name binding operations mentioned above occurs for any file below /u, then Amd is invoked. Amd consults its maps, which indicate that /u/foo is available on server bar. This file system is then mounted locally at /n/bar/u/foo and /u/foo is made a symbolic link to /n/bar/u/foo. (Placing the server name in the name of the mount point is purely a configuration decision, and is not essential.)

Our work is not dependent on Amd; we use it for convenience. Amd typically controls the (un)mounting of all file systems on the client machines on which it runs, and there is no advantage to our work in circumventing it and performing our own (un)mounts.

2.3.1 How Our Work Goes Beyond Amd

Amd does not already possess the capabilities we need, nor is our work a simple extension to Amd. Our work adds at least three major capabilities:

1. Amd keeps a description of where to find to-be-mounted file systems in "mount-maps." These maps are written by administrators and are static in the sense that Amd has no ability for automated, adaptive, unplanned discovery and selection of a replacement file system.

2. Because it is only a user-level automount daemon, Amd has limited means to monitor the response of rfscall() or any other kernel routine. Many systems provide a tool, like nfsstat, that returns timing information gathered by the kernel.
However, nfsstat is inadequate because it is not as accurate as our measurements, and it provides a weighted average response time rather than measured response times. Our method additionally is less sensitive to outliers and measures both short-term and long-term performance.

3. Our mechanism provides for transparently switching open files from one file system to its replacement.

3 Design

The key issues we see in this work are:

1. Is a switching mechanism really needed? Why not use the same file systems no matter where you are?
2. When and how to switch from one replica to another.
3. How to ensure that the new file system is an acceptable replacement for the old one.
4. How to ensure consistency if updates are applied across different replicas.
5. Fault tolerance: how to protect a client from server unavailability.
6. Security: NFS is designed for a local "workgroup" environment in which the space of user IDs is centrally controlled.

These issues are addressed below.

3.1 Demonstrating the Need

We contend that adaptive client-server matchups are desirable because running file system operations over many network hops is bad for performance in three ways: increased latency, increased failures, and decreased scalability. It is hard to ascertain exact failure rates and load on shared resources without undertaking a full-scale network study; however, we were able to gather some key data to support our claim. We performed a simple study to measure how latency increases with distance.

First, we used the traceroute program3 to gather data points measured between a host at Columbia and several other hosts around the campus, city, region, and continent. Latencies were measured by a Columbia host, which is a Sun-4/75 equipped with a microsecond-resolution clock. The cost of entering the kernel and reading the clock is negligible, and so the measurements are accurate to a small fraction of a millisecond.

[Footnote 3: Written by Van Jacobson and widely available by anonymous ftp.]

Next, we mounted NFS file systems that are exported Internet-wide by certain hosts. We measured the time needed to copy 1MB from these hosts using a 1KB block size. A typical result is plotted in Figure 1. Latency jumps by almost two orders of magnitude at the tenth hop, which represents the first host outside Columbia.

[Figure 1: NFS Read Latency vs. Network Hop Count]

3.2 When to Switch

We have modified the kernel so that rfscall() measures the latency of every NFS lookup and maintains a per-filesystem data structure storing a number of recently measured latencies.

We chose to time the lookup operation rather than any other operation or mixture of operations for two reasons. The first is that lookup is the most frequently invoked NFS operation. We felt other calls would not generate enough data points to accurately characterize latency. The second reason is that lookup exhibits the least performance variability of the common NFS operations. Limiting variability of measured server latencies is important in our work, since we want to distinguish transient changes in server performance from long-term changes.
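To make the bookkeeping concrete, the following self-contained C fragment models one way the per-filesystem record of recent lookup latencies might be kept. It is only a sketch: the structure and function names are invented for this example and are not the actual SunOS 4.1.2 code, which hangs the equivalent state off the vfs structure (see Section 3.4.1).

    #include <time.h>

    #define NSAMPLES 300            /* long-term window used in the text */

    /* Simplified, user-level model of the per-filesystem latency record. */
    struct latency_record {
        long   samples[NSAMPLES];   /* recent lookup latencies, microseconds */
        int    next;                /* next ring-buffer slot to overwrite    */
        int    count;               /* valid samples so far (<= NSAMPLES)    */
        time_t last_sample;         /* enforces at most one sample/second    */
    };

    /* Record one measured lookup latency, taking a data point no more than
     * once per second (a policy described later in this section). */
    void
    record_lookup_latency(struct latency_record *lr, long usec, time_t now)
    {
        if (now == lr->last_sample)
            return;                 /* already sampled during this second */
        lr->last_sample = now;
        lr->samples[lr->next] = usec;
        lr->next = (lr->next + 1) % NSAMPLES;
        if (lr->count < NSAMPLES)
            lr->count++;
    }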
(At the outset of our work, we measured variances in the latency of the most common NFS operations and discovered huge swings, shown in Figure 2, even in an extended LAN environment that has been engineered to be uniform and not to have obvious bottlenecks. The measured standard deviations were 1027 msec for all NFS operations, 2547 msec for read, and 596 msec for lookup.)

[Figure 2: Variability and Latency of NFS operations]

After addition of each newly measured lookup operation, the median latency is computed over the last 30 and 300 calls. We compute medians because medians are relatively insensitive to outliers. We take a data point no more than once per second, so during busy times these sampling intervals correspond to 30 seconds and 5 minutes, respectively. This policy provides insurance against anomalies like ping-pong switching between a pair of file systems: a file system can be replaced no more frequently than every 5 minutes.

The signal to switch is when, at any moment, the short-term median latency exceeds the long-term median latency by a factor of 2. Looking for a factor of two difference between short-term and long-term medians is our attempt to detect a change in performance which is substantial and "sudden," yet not transient. The length of the short-term and long-term medians as well as the ratio used to signal a switch are heuristics chosen after experimentation in our environment. All these parameters can be changed from user level through a debugging system call that we have added.

3.3 Locating a Replacement

When a switch is triggered, rfscall() starts a non-blocking RPC out to our user-level process that performs replacement, nfsmgrd.4 The call names the guilty file server, the root of the file system being sought, the kernel architecture, and any mount options affecting the file system. Nfsmgrd uses these pieces of information to compose and broadcast an RLP request. The file system name keys the search, while the server name is a filter: the search must not return the same file server that is already in use.

[Footnote 4: Non-blocking operation is provided by a special kernel implementation of Sun RPC.]

The RLP message is received by the nfsmgrd at other sites on the same broadcast subnet. To formulate a proper response, an nfsmgrd must have a view of the mountable file systems stored at its site and also the mounted file systems that it is using; either type could be what is being searched for. Both pieces of information are trivially accessible through /etc/fstab, /etc/exports, and /etc/mtab.

The nfsmgrd at the site that originated the search uses the first response it gets; we suppose that the speed with which a server responds to the RLP request gives a hint about its future performance. (The Sun Automounter [2] makes the same assumption about replicated file servers.) If a read-only replacement file system is available, nfsmgrd instructs Amd to mount it and terminates the out-of-kernel RPC, telling the kernel the names of the replacement server and file system. The flow of control is depicted in Figure 3.

[Figure 3: Flow of Control During a Switch]
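Before turning to how the replacement is used, the switching heuristic of Section 3.2 can be rendered naively as the following self-contained sketch. It is illustrative only: the kernel implementation uses a fast incremental median algorithm rather than sorting (Section 4), and the window sizes and ratio shown are simply the defaults discussed above, all of which can be tuned through the added system call.

    #include <stdlib.h>

    #define SHORT_WINDOW 30         /* last 30 lookup latencies               */
    #define LONG_WINDOW  300        /* last 300 lookup latencies              */
    #define SWITCH_RATIO 2          /* short-term median must exceed          */
                                    /* SWITCH_RATIO times long-term median    */

    static int
    cmp_long(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    /* Median of the most recent n entries of a ring buffer of LONG_WINDOW
     * samples; "next" is the slot that would be overwritten next. */
    static long
    median_of_recent(const long *samples, int next, int n)
    {
        long tmp[LONG_WINDOW];
        int i;

        for (i = 0; i < n; i++)
            tmp[i] = samples[(next - 1 - i + 2 * LONG_WINDOW) % LONG_WINDOW];
        qsort(tmp, n, sizeof(tmp[0]), cmp_long);
        return tmp[n / 2];
    }

    /* Nonzero if the short-term median latency has degraded enough to
     * trigger a search; the real system additionally refuses to replace
     * a file system more often than once every 5 minutes. */
    int
    should_switch(const long *samples, int count, int next)
    {
        if (count < LONG_WINDOW)
            return 0;               /* not enough history yet */
        return median_of_recent(samples, next, SHORT_WINDOW) >
               SWITCH_RATIO * median_of_recent(samples, next, LONG_WINDOW);
    }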
3.4 Using the Replacement

Once a replacement file system has been located and mounted, all future attempts to open files on the replaced file system will be routed to the replacement whenever they can be. Also, in all cases for which it is possible, open files on the replaced file system will be switched to their equivalents on the replacement. We describe these two cases in Sections 3.4.2 and 3.4.3, respectively.

3.4.1 Relevant Changes to Kernel Data Structures

In order to accommodate file system replacement, we have added some fields to three important kernel data structures: struct vfs, which describes mounted file systems; struct vnode, which describes open files; and struct file, which describes file descriptors.

The fields added to struct vfs, excluding debugging fields, are:

o The field vfs_replaces is valid in the vfs structure of the replacement file system; it points to the vfs structure of the file system being replaced.

o The field vfs_replaced_by is valid in the replaced file system's vfs struct; it points to the vfs structure of the replacement file system. When a replacement file system is mounted, our altered version of nfs_mount() sets the replaced and replacement file systems pointing to each other.

o The field vfs_nfsmgr_flags is valid for any NFS file system. One flag indicates whether the file system is managed by nfsmgrd; another indicates whether a file system switch is in progress.

o The field vfs_median_info contains almost all of the pertinent information about the performance of the file system, including the 300 most recent nfs_lookup() response times.

o The field vfs_dft is the Duplicate File Table (DFT). This per-filesystem table lists which files in the replacement file system have been compared to the corresponding file on the original file system mounted by Amd. Only equivalent files can be accessed on the replacement file system. The mechanism for making comparisons is described in Section 3.4.2. The size of the DFT is fixed (but changeable) so that new entries inserted will automatically purge old ones. This is a simple method to maintain "freshness" of entries. The DFT is a hash table whose entries contain a file pathname relative to the mount point, a pointer to the vfs structure of the replacement file system, and an extra pointer for threading the entries in insertion order. This data structure permits fast lookups keyed by pathname and quick purging of older entries.

The only field added to struct vnode is v_last_used, which contains the last time that rfscall() made a remote call on behalf of this vnode. This information is used in "hot replacement," as described in Section 3.4.3.

The only field added to struct file is f_path, which contains the relative pathname from the mount point to the file for which the descriptor was opened. Different entries may have different pathnames for the same file if several hard links point to the file.

3.4.2 After Replacement: Handling New Opens

When Amd mounts a file system it makes a symlink from the desired location of the file system to the mount point. For example, /u/foo would be a symlink pointing to the real mount point of /n/bar/u/foo; by our local convention, this would indicate that server bar exports /u/foo. Users and application programs know only the name /u/foo.
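To make the fields of Section 3.4.1 concrete before continuing, here is a rough user-level rendition of the added structures. The declarations, names, and sizes are illustrative assumptions for this sketch, not the actual SunOS definitions.

    #include <time.h>

    #define MAXRELPATH 256         /* illustrative bound on a relative pathname */

    struct vfs_info;               /* forward declaration */

    /* One Duplicate File Table (DFT) entry: has this relative pathname been
     * compared against the master copy, and did the comparison succeed? */
    struct dft_entry {
        char              path[MAXRELPATH]; /* pathname relative to mount point   */
        struct vfs_info  *replacement;      /* vfs of the replacement file system */
        int               compared;         /* comparison finished?               */
        int               equivalent;       /* files found to be equal?           */
        int               in_progress;      /* a comparison is already under way  */
        struct dft_entry *hash_next;        /* hash chain, keyed by pathname      */
        struct dft_entry *order_next;       /* insertion order, for purging       */
    };

    /* Model of the fields added to struct vfs (one per mounted file system). */
    struct vfs_info {
        struct vfs_info  *replaced_by;      /* replacement file system, if any */
        struct vfs_info  *replaces;         /* file system this one replaces   */
        int               nfsmgr_flags;     /* managed? switch in progress?    */
        long              latencies[300];   /* recent nfs_lookup() times       */
        struct dft_entry *dft;              /* head of this file system's DFT  */
    };

    /* Models of the single fields added to struct vnode and struct file. */
    struct vnode_info { time_t last_used;      };   /* v_last_used */
    struct file_info  { char path[MAXRELPATH]; };   /* f_path      */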
The information that bar exports a proper version of /u/foo is placed in Amd's mount-maps by system administrators who presumably ensure that the file system bar:/u/foo is a good version of whatever /u/foo should be. Therefore, we regard the information in the client's Amd mount-maps as authoritative, and consider any file system that the client might mount and place at /u/foo as a correct and complete copy of the file system. We call this file system the master copy, and use it for comparison against the replacement file systems that our mechanism locates and mounts.

The new open algorithm is shown in Figure 4. After a replacement has been mounted, whenever name resolution must be performed for any file on the replaced file system, the file system's DFT is first searched for the relative pathname. If the DFT indicates that the replacement file system has an equivalent copy of the file, then that file is used.

    open() {
        examine vfs_replaced_by field to see if there is a replacement file system;
        if (no replacement file system) {
            continue name resolution;
            return;
        }
        if (DFT entry doesn't exist) {
            create and begin DFT entry;
            call out to perform file comparison;
            finish DFT entry;
        }
        if (files equivalent) {
            get vfs of replacement from vfs_replaces field;
            continue name resolution on replacement file system;
        } else
            continue name resolution on master copy;
    }

    Figure 4: New Open Algorithm

If the DFT contains an entry for the pathname, then the file on the replacement file system has already been compared to its counterpart on the master copy. A field in the DFT tells if the comparison was successful or not. If not, then the rest of the pathname has to be resolved on the master copy. If the comparison was successful, then the file on the replacement file system is used; in that case, name resolution continues at the root of the replacement file system.

If the DFT contains no entry for the pathname, then it is unknown whether the file on the replacement file system is equivalent to the corresponding file on the master copy.

To test equivalence, au_lookuppn() calls out of the kernel to nfsmgrd, passing it the two host names, the name of the file system, and the relative pathname to be compared. A partial DFT entry is constructed, and a flag in it is turned on to indicate that there is a comparison in progress and that no other process should initiate the same comparison.5

[Footnote 5: This avoids the need to lock the call out to nfsmgrd.]

Nfsmgrd then applies, at user level, whatever tests might be appropriate to determine whether the two files are equivalent. This flow of control is depicted in Figure 5. Presently, we are performing file checksum comparison: nfsmgrd calls a checksumd daemon on each of the file servers, requesting the checksum of the file being compared. Checksumd, which we have written for this work, computes MD4 [18] file checksums on demand and then stores them for later use; checksums can also be pre-computed and stored.

Nfsmgrd collects the two checksums, compares them, and responds to the kernel, telling au_lookuppn() which pathname to use, always indicating the file on the replacement file system if possible. Au_lookuppn() completes the construction of the DFT entry, unlocks it, and marks which vfs is the proper one to use whenever the same pathname is resolved again.
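The equivalence test can be pictured with the small user-level sketch below. It is only an illustration and makes simplifying assumptions: it reads two locally accessible paths and uses a toy checksum, whereas the real nfsmgrd never reads the files itself but instead collects MD4 checksums [18] from a checksumd running on each file server.

    #include <stdio.h>

    /* Toy checksum, standing in for the MD4 checksums computed by checksumd. */
    static unsigned long
    file_checksum(const char *path)
    {
        FILE *f = fopen(path, "r");
        unsigned long sum = 0;
        int c;

        if (f == NULL)
            return 0;
        while ((c = getc(f)) != EOF)
            sum = sum * 31 + (unsigned char)c;
        fclose(f);
        return sum;
    }

    /* Return 1 if the two copies appear equivalent, 0 otherwise.  A zero
     * checksum (e.g., an unreadable file) is never treated as a match.
     * In the real system this decision is made by nfsmgrd, and the result
     * is recorded in the DFT so the comparison need not be repeated. */
    int
    files_equivalent(const char *master_path, const char *replacement_path)
    {
        unsigned long a = file_checksum(master_path);
        unsigned long b = file_checksum(replacement_path);

        return a != 0 && a == b;
    }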
[Figure 5: Flow of Control During File Comparison]

In this fashion, all new pathname resolutions are re-directed to the replacement file system whenever possible.

Note that the master copy could be unmounted (e.g., Amd by default unmounts a file system after a few minutes of inactivity), and this would not affect our mechanism. The next use of a file in that file system would cause some master copy to be automounted, before any of our code is encountered.

3.4.3 After Replacement: Handling Files Already Open

When a file system is replaced, it is possible that some files will be open on the replaced file system at the moment when the replacement is mounted. Were the processes with these open files to continue to use the replaced file system, several negative consequences might ensue. First, since the replacement is presumed to provide faster response, the processes using files open on the replaced file system experience worse service. Second, since the total number of mounted file systems grows as replacements happen, the probability rises that some file system eventually becomes unavailable and causes processes to block. Further, the incremental effect of each successive file system replacement operation is reduced somewhat, since files that are open long-term do not benefit from replacement. Finally, kernel data structures grow larger as the number of mounted file systems climbs. Motivated by these reasons, we decided to switch open files from the replaced file system to the replacement file system whenever the file on the replacement file system is equivalent to that on the master copy.

Although this idea might at first seem preposterous, it is not, since we restrict ourselves to read-only file systems. We assume that files on read-only file systems6 change very infrequently and/or are updated with care to guard against inconsistent reads.7 Whether operating conditions uphold this assumption or not, the problem of a file being updated8 while being read exists independently of our work, and our work does not increase the danger.

[Footnote 6: That is, they are exported as read-only to some hosts, although they might be exported as read-write to others.]
[Footnote 7: An example of "careful update" is provided by the SUP utility [22].]
[Footnote 8: That is, updated by a host to which the file system is exported read-write.]

We allow for a replacement file system to be itself replaced. This raises the possibility of creating a "chain" of replacement file systems. Switching vnodes from the old file system to its replacement limits this chain to length two (the master copy and the current replacement) in steady state.

The "hot replacement" code scans through the global open file table, keying on entries by vfs. Once an entry is found that uses the file system being replaced, a secondary scan locates all other entries using the same vnode. In a single entry into the kernel (i.e., "atomically"), all file descriptors pointing to that vnode are switched, thereby avoiding complex questions of locking and reference counting.

Hot replacement requires knowing pathnames.
Thanks to our changes, the vfs structure records the pathname it is mounted on and identifies the replacement file system; also, the relative pathname of the file is stored in the file table entry. This information is extracted, combined with the host names, and passed out to nfsmgrd to perform comparison, as described above. If the comparison is successful, the pathname on the replacement file system is looked up, yielding a vnode on the replacement file system. This vnode simply replaces the previous vnode in all entries in the open file table. This results in a switch the next time a process uses an open file descriptor.

Hot replacement is enabled by the statelessness of NFS and by the vfs/vnode interfaces within the kernel. Since the replaced server keeps no state about the client, and since the open file table knows only a pointer to a vnode, switching this pointer in every file table entry suffices to do hot replacement.

An interesting issue is at which time to perform the hot replacement of vnodes. Since each file requires a comparison to determine equivalence, switching vnodes of all the open files of a given file system could be a lengthy process. The four options we considered are:

1. Switch as soon as a replacement file system is mounted (the early approach).
2. Switch only if/when an RPC for that vnode hangs (the late approach).
3. Switch if/when the vnode is next used (the "on-demand" approach).
4. Switch whenever a daemon instructs it to (the "flexible" approach).

The decision to switch earlier or later is affected by the tradeoff that early switching more quickly switches files to the faster file system and improves fault tolerance by reducing the number of file systems in use, but possibly wastes effort. Vnode switching is a waste in all cases when a vnode exists that will not be used again. Early switching also has the disadvantage of placing the entire delay of switching onto the single file reference that is unlucky enough to be the next one.

We chose the "flexible" approach of having a daemon make a system call into the kernel which then sweeps through the open file table and replaces some of the vnodes which can be replaced. We made this choice for three reasons. First, we lacked data indicating how long a vnode lingers after its final use. Second, we suspected that such data, if obtained, would not conclusively decide the question in favor of an early or late approach. Third, the daemon solution affords much more flexibility, including the possibility of more "intelligent" decisions such as making the switch during an idle period.

We emphasize that the system call into the kernel switches "some" of the vnodes, since it may be preferable to bound the delay imposed on the system by one of these calls. Two such bounding policies that we have investigated are, first, switching only N vnodes per call, and, second, switching only vnodes that have been accessed in the past M time units. Assuming that file access is bursty (a contention supported by statistics [14]), the latter policy reduces the amount of time wasted switching vnodes that will never be used again. We are currently using this policy of switching only recently used vnodes; this policy makes use of the v_last_used field that we added to the vnode structure.
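A rough illustration of this "flexible" sweep with the recency bound appears below. The structures, the particular RECENT_SECONDS value, and the two callback functions are assumptions made for the example; the real kernel code also switches every file table entry sharing a vnode in a single pass and can bound the number of vnodes switched per call.

    #include <stddef.h>
    #include <time.h>

    #define RECENT_SECONDS (10 * 60)    /* illustrative choice of "M time units" */

    struct vfs_model {
        struct vfs_model *replaced_by;  /* replacement file system, if any */
    };

    struct vnode_model {
        struct vfs_model *vfsp;         /* file system this vnode lives on */
        time_t            last_used;    /* models v_last_used              */
    };

    struct file_model {
        struct vnode_model *vnode;      /* open file's vnode                     */
        const char         *path;       /* models f_path: path relative to mount */
    };

    /* Sweep a model of the open file table and switch recently used vnodes
     * whose file system has been replaced.  equivalent() stands in for the
     * out-of-kernel comparison done by nfsmgrd, and lookup_on_replacement()
     * for the name resolution that yields a vnode on the replacement. */
    void
    sweep_open_files(struct file_model *table, int nfiles, time_t now,
                     int (*equivalent)(const char *, struct vfs_model *),
                     struct vnode_model *(*lookup_on_replacement)(const char *,
                                                                  struct vfs_model *))
    {
        int i;

        for (i = 0; i < nfiles; i++) {
            struct vnode_model *vn = table[i].vnode;
            struct vnode_model *newvn;
            struct vfs_model   *repl;

            if (vn == NULL || vn->vfsp == NULL)
                continue;
            repl = vn->vfsp->replaced_by;
            if (repl == NULL)
                continue;                   /* file system was not replaced */
            if (now - vn->last_used > RECENT_SECONDS)
                continue;                   /* not recently used; skip it   */
            if (!equivalent(table[i].path, repl))
                continue;                   /* keep using the master copy   */
            newvn = lookup_on_replacement(table[i].path, repl);
            if (newvn != NULL)
                table[i].vnode = newvn;     /* the actual switch            */
        }
    }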
3.5 Security

The NFS security model is the simple uid/gid scheme borrowed from UNIX, and is appropriate only in a "workgroup" situation where there is a central administrative authority. Transporting a portable computer from one NFS user ID domain to another presents a security threat, since processes assigned user ID X in one domain can access exported files owned by user ID X in the second domain.

Accordingly, we have altered rfscall() so that every call to a replacement file system has its user ID and group ID both mapped to "nobody" (i.e., value -2). Therefore, only world-readable files on replacement file systems can be accessed.

3.6 Code Size

Counting blank lines, comments, and debugging support, we have written close to 11,000 lines of C. More than half is for user-level utilities: 1200 lines for the RLP library and daemon, 3200 for nfsmgrd, 700 lines for checksumd, and 1200 lines for a control utility (called nfsmgr_ctl). New kernel code totals 4000 lines, of which 800 are changes to SunOS, mostly in the NFS module. The remaining 3200 lines comprise the four modules we have added: 880 lines to deal with storing and computing medians; 780 lines of "core NFS management" code, which performs file system switching, pathname storage and replacement, and out-of-kernel RPC; 540 lines to manage the DFT; and 1000 lines to support the nfsmgr_ctl system call.

The nfsmgr_ctl system call allows query and control over almost all data structures and parameters of the added facility. We chose a system call over a kmem program for security. This facility was used heavily during debugging; however, it is meant also for system administrators and other interested users who would like to change these "magic" variables to values more suitable for their circumstances.

4 Evaluation

This system is implemented and is receiving use on a limited number of machines.

The goal of this work is to improve overall file system performance, under certain circumstances at least, and to improve it enough to justify the extra complexity. For this method to really work, it must have:

1. Low-overhead latency measurement between switches.
2. A quick switch.
3. Low-overhead access to the replacement after a switch.
4. No anomalies or instabilities, like ping-pong switching.
5. No process hangs due to server failures.
6. No security or administrative complications.

We have carried out several measurements aimed at evaluating how well our mechanism meets these goals.

The overhead between switches is that of performance monitoring. We found the added cost of timing every rfscall() too small to measure. The cost of computing medians could be significant, since we retain 300 values, but we implemented a fast incremental median algorithm that requires just a negligible fraction of the time spent in nfs_lookup(). The kernel data structures are not so negligible: retaining 300 latency measurements costs about 2KB per file system. The reason for the expansion is the extra pointers that must be maintained to make the incremental median algorithm work. The extra fields in struct vfs, struct vnode, and struct file are small, with the exception of the DFT, which is large. The current size of each (per-filesystem) DFT is 60 slots, which occupy a total of 1KB-2KB on average.

Our measured overall switch time is approximately 3 sec. This is the time between the request for a new file system and when the new file system is mounted (messages 1-8 in Figure 3).
Three seconds is comparable to the time needed in our facility to mount a file system whose location is already encoded in Amd's maps, suggesting that most of the time goes to the mount operation.

The overhead after a switch consists mostly of doing equivalence checks outside the kernel; the time to access the vfs of the replacement file system and the DFT during au_lookuppn() is immeasurably small. Only a few milliseconds are devoted to calling checksumd: 5-7 msec if the checksum is already computed. This call to checksumd is done once and need not be done again so long as a record of equivalence remains in the DFT.

A major issue is how long to cache DFT entries that indicate equivalence. Being stateless, NFS does not provide any sort of server-to-client cache invalidation information. Not caching at all ensures that files on the replacement file system are always equal to those on the master copy, but of course this defeats the purpose of using the replacement. We suppose that most publicly-exported read-only file systems have their contents changed rarely, and thus one should cache to the maximum extent. Accordingly, we manage the DFT cache by LRU.

As mentioned above, switching instabilities are all but eliminated by preventing switches more frequently than every 5 minutes.

4.1 Experience

4.1.1 What is Read-Only

Most of the files in our facility reside on read-only file systems. However, sometimes one can be surprised. For example, GNU Emacs is written to require a world-writable lock directory. In this directory Emacs writes files indicating which users have which files in use. The intent is to detect and prevent simultaneous modification of a file by different processes. A side effect is that the "system" directory in which Emacs is housed (at our installation, /usr/local) must be exported read-write.

Deployment of our file service spurred us to change Emacs. We wanted /usr/local to be read-only so that we could mount replacements dynamically. Also, at our facility there are several copies of /usr/local per subnet, which defeats Emacs' intention of using /usr/local as a universally shared location. We re-wrote Emacs to write its lock files in the user's home directory since (1) for security, our system administrators wish to have as few read-write system areas as possible and, (2) in our environment by far the likeliest scenario of simultaneous modification is between two sessions of the same user, rather than between users.

4.1.2 Suitability of Software Base

Kernel. The vfs and vnode interfaces in the kernel greatly simplified our work. The hot replacement, in particular, proved far easier than we had feared, thanks to the vnode interface. The special out-of-kernel RPC library also was a major help. Nevertheless, work such as ours makes painfully obvious the benefits of implementing file service out of the kernel. The length and difficulty of the edit-compile-debug cycle, and the primitive debugging tools available for the kernel, were truly debilitating.

RLP. RLP was designed in 1983, when the evils of over-broadcasting were not as deeply appreciated as they are today and when there were few multicast implementations. Accordingly, RLP is specified as a broadcast protocol. A more up-to-date protocol would use multicast. The benefits would include causing much less waste (i.e., bothering hosts that lack an RLP daemon) and contacting many more RLP daemons.
Not surprisingly, we encountered considerable resistance from our bridges and routers when trying to propagate an RLP request. A multicast RLP request would travel considerably farther.

NFS. NFS is ill-suited for "cold replacement" (i.e., new opens on a replacement file system) caused by mobility, but is well suited for "hot replacement" because of its statelessness.

NFS' lack of cache consistency callbacks has long been bemoaned, and it affects this work since there is no way to invalidate DFT entries. Since we restrict ourselves to read-only files, the danger is assumed to be limited, but is still present. Most newer file service designs include cache consistency protocols. However, such protocols are not necessarily a panacea. Too much interaction between client and server can harm performance, especially if these interactions take place over a long distance and/or a low bandwidth connection. See [27] for a design that can ensure consistency with relatively little client-server interaction.

The primary drawback of using NFS for mobile computing is its limited security model. Not only can a client from one domain access files in another domain that are made accessible to the same user ID number, but even a well-meaning client cannot prevent itself from doing so, since there is no good and easy way to tell when a computer has moved into another uid/gid domain.

5 Related Work

It is a thesis of our work that in order for mobile computing to become the new standard model of computing, adaptive resource location and management will have to become an automatic function of distributed services software. The notion of constantly-networked, portable computers running modern operating systems is relatively new. Accordingly, we know of no work other than our own (already cited) on the topic of adaptive, dynamic mounting.

The Coda file system [21] supposes that mobile computing will take place in the form of "disconnected operation," and describes in [11] a method in which the user specifies how to "stash" (read/write) files before disconnection and then, upon reconnection, have the file service run an algorithm to detect version skew. Coda can be taken as a point of contrast to our system, since the idea of disconnection is antithetical to our philosophy. We believe trends in wireless communication point to the ability to be connected any time, anywhere. Users may decide not to connect (e.g., for cost reasons) but will not be forced not to connect (e.g., because the network is unreliable or not omnipresent). We call this mode of operation elective connectivity.

An obvious alternative to our NFS-based effort is to employ a file system designed for wide-area and/or multi-domain operation. Such file systems have the advantages of a cache consistency protocol and a security model that recognizes the existence of many administrative domains. Large scale file systems include AFS [6] and its spinoffs, Decorum [10] and IFS (Institutional File System) [5]. Experiments involving AFS as a "nation-wide" file service have been going on for years [23]. This effort has focused on stitching together distinct administrative domains so as to provide a single unified naming and protection space. However, some changes are needed to the present authentication model in order to support the possibility of a mobile client relocating in a new domain.
In particular, if the relocated client will make use of local services, then there should be some means whereby one authentication agent (i.e., that in the new domain) would accept the word of another authentication agent (i.e., that in the client's home domain) regarding the identity of the client.

The IFS project has also begun to investigate alterations to AFS in support of mobile computers [4]. Specifically, they are investigating cache pre-loading techniques for disconnected operation and transport protocols that are savvy about the delays caused by "cell handoff," the time during which a mobile computer moves from one network to another.

Plan 9's bind command has been designed to make it easy to mount new file systems. In particular, file systems can be mounted "before" or "after" file systems already mounted at the same point. The before/after concept replaces the notion of a search path. Plan 9 also supports the notion of a "union mount" [17]. The Plan 9 bind mechanism is a more elegant alternative to our double mounting plus comparison. However, a binding mechanism, even an unusually flexible one such as that of Plan 9, addresses only part of the problem of switching between file systems. The harder part of the problem is determining when to switch and what to switch to.

6 Conclusion

We have described the operation, performance, and convenience of a transparent, adaptive mechanism for file system discovery and replacement. The adaptiveness of the method lies in the fact that a file service client no longer depends solely on a static description of where to find various file systems, but instead can invoke a resource location protocol to inspect the local area for file systems to replace the ones it already has mounted.

Such a mechanism is generally useful, but offers particularly important support for mobile computers which may experience drastic differences in response time as a result of their movement. Reasons for experiencing variable response include: (1) moving beyond the home administrative domain and so increasing the "network distance" between client and server and (2) moving between high-bandwidth private networks and low-bandwidth public networks (such movement might occur even within a small geographic area). While our work does not address how to access replicated read/write file systems or how to access one's home directory while on the move, our technique does bear on the problems of the mobile user. Specifically, by using our technique, a mobile user can be relieved of the choice of either suffering with poor performance or devoting substantial local storage to "system" files.9 Instead, the user could rely on our mechanism to continuously locate copies of system files that provide superior latency, while allocating all or most of his/her limited local storage to caching or stashing read/write files such as those from the home directory.

[Footnote 9: One might suppose that a "most common subset" of system files could be designated and loaded, and this is true. However, specifying such a subset is ever harder as programs depend on more and more files for configuration and auxiliary information. This approach also increases the user's responsibility for system administration, which we regard as a poor way to design systems.]

Our work is partitioned into three modular pieces: heuristic methods for detecting performance degradation and triggering a search; a search technique coupled with a method for testing equivalence versus a master copy; and a method for force-switching open files from the use of vnodes on one file system to vnodes on another (i.e., "hot replacement"). There is little interrelationship among these techniques, and so our contributions can be viewed as consisting not just of the whole, but also of the pieces.
Accordingly, we see the contributions of our work as:

1. The observation that file system switching might be needed and useful.
2. The idea of an automatically self-reconfiguring file service, and of basing the reconfiguration on measured performance.
3. Quantification of the heuristics for triggering a search for a replacement file system.
4. The realization that a "hot replacement" mechanism should not be difficult to implement in an NFS/vnodes setting, and the implementation of such a mechanism.

6.1 Future Work

There are several directions for future work in this area.

The major direction is to adapt these ideas to a file service that supports a more appropriate security model. One part of an "appropriate" security model is support for cross-domain authentication such that a party from one domain can relocate to another domain and become authenticated in that domain. Another part of an appropriate security model should include accounting protocols allowing third parties to advertise and monitor (i.e., "sell") the use of their exported file systems. Within the limited context of NFS, a small step in the right direction would be a mechanism that allows clients (servers) to recognize servers (clients) from a different domain. The most recent version of Kerberos contains improved support for cross-domain authentication, so another step in the right direction would be to integrate the latest Kerberos with NFS, perhaps as originally sketched in [24].

Another desirable idea is to convert from using a single method of exact file comparison (i.e., checksumd) to per-user, possibly inexact comparison. For example, object files produced by gcc contain a timestamp in the first 16 bytes; two object files may be equal except for the embedded timestamps, which can be regarded as an insignificant difference. Another example is that data files may be equal except for gratuitous differences in floating-point format (e.g., 1.7 vs. 1.7000 vs. 1.70e01). Source files may be compared ignoring comments and/or white space. Intelligent comparison programs like diff or spiff [13] know how to discount certain simple differences.

Minor extensions to our work include: converting RLP from a broadcast protocol to a multicast protocol; and reimplementing in an environment (e.g., multi-server Mach 3.0) that supports out-of-kernel file service implementations.

7 Acknowledgements

We thank an anonymous member of the program committee for the suggestion to use file checksums. We thank the program committee, especially David Rosenthal and Matt Blaze, for valuable advice about presentation and emphasis.

This work was supported in part by the New York State Science and Technology Foundation's Center for Advanced Technology in Computers and Information Systems; by a National Science Foundation CISE Institutional Infrastructure grant, number CDA-90-24735; and by the Center for Telecommunications Research, an NSF Engineering Research Center supported by grant number ECD-88-11111.

8 References

[1] M. Accetta. Resource Location Protocol.
RFC 887, IETF Network Working Group, December 1983.

[2] B. Callaghan and T. Lyon. The Automounter. In Proc. 1989 Winter USENIX Conf., pages 43-51, January 1989.

[3] D. C. Cox. A Radio System Proposal for Widespread Low-power Tetherless Communication. IEEE Trans. Communications, 39(2):324-335, February 1991.

[4] P. Honeyman. Taking a LITTLE WORK Along. CITI Report 91-5, Univ. of Michigan, August 1991.

[5] J. Howe. Intermediate File Servers in a Distributed File System Environment. CITI Report 92-4, Univ. of Michigan, June 1992.

[6] J. H. Howard et al. Scale and Performance in a Distributed File System. ACM Trans. Computer Systems, 6(1):51-81, February 1988.

[7] J. Ioannidis et al. Protocols for Supporting Mobile IP Hosts. Draft RFC, IETF Mobile Hosts Working Group, June 1992.

[8] J. Ioannidis, D. Duchamp and G. Q. Maguire Jr. IP-based Protocols for Mobile Internetworking. In Proc. SIGCOMM '91, pages 235-245. ACM, September 1991.

[9] C. Juszczak. Improving the Performance and Correctness of an NFS Server. In Proc. 1989 Winter USENIX Conf., pages 53-63, January 1989.

[10] M. L. Kazar et al. Decorum File System Architectural Overview. In Proc. 1990 Summer USENIX Conf., pages 151-163, June 1990.

[11] J. J. Kistler and M. Satyanarayanan. Disconnected Operation in the Coda File System. ACM Trans. Computer Systems, 10(1):3-25, February 1992.

[12] S. R. Kleiman. Vnodes: An Architecture for Multiple File System Types in Sun Unix. In Proc. 1986 Summer USENIX Conf., pages 238-247, June 1986.

[13] D. Nachbar. Spiff - A Program for Making Controlled Approximate Comparisons of Files. In Proc. 1986 Summer USENIX Conf., pages 238-247, June 1986.

[14] J. Ousterhout et al. A Trace-Driven Analysis of the UNIX 4.2 BSD File System. In Proc. Tenth ACM Symp. on Operating System Principles, pages 15-24, December 1985.

[15] J. Pendry and N. Williams. Amd - The 4.4 BSD Automounter. Imperial College of Science, Technology, and Medicine, London, 5.3 alpha edition, March 1991.

[16] J. Postel. User Datagram Protocol. RFC 768, IETF Network Working Group, August 1980.

[17] D. Presotto et al. Plan 9, A Distributed System. In Proc. Spring 1991 EurOpen Conf., pages 43-50, May 1991.

[18] R. Rivest. The MD4 Message-Digest Algorithm. RFC 1186, IETF Network Working Group, April 1992.

[19] D. S. H. Rosenthal. Evolving the Vnode Interface. In Proc. 1990 Summer USENIX Conf., pages 107-117, June 1990.

[20] R. Sandberg et al. Design and Implementation of the Sun Network Filesystem. In Proc. 1985 Summer USENIX Conf., pages 119-130, June 1985.

[21] M. Satyanarayanan et al. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Trans. Computers, 39(4):447-459, April 1990.

[22] S. Shafer and M. R. Thompson. The SUP Software Upgrade Protocol. Unpublished notes available by ftp from mach.cs.cmu.edu:/mach3/doc/unpublished/sup/sup.doc.

[23] A. Z. Spector and M. L. Kazar. Uniting File Systems. UNIX Review, 7(3):61-71, March 1989.

[24] J. G. Steiner, C. Neuman, and J. I. Schiller. Kerberos: An Authentication Service for Open Network Systems. In Proc. 1988 Winter USENIX Conf., pages 191-202, February 1988.

[25] C. Tait and D. Duchamp. Detection and Exploitation of File Working Sets. In Proc. Eleventh Intl. Conf. on Distributed Computing Systems, pages 2-9. IEEE, May 1991.

[26] C. Tait and D. Duchamp. Service Interface and Replica Consistency Algorithm for Mobile File System Clients. In Proc. First Intl. Conf.
on Parallel and Distributed Information Systems, pages 190-197. IEEE, December 1991.

[27] C. Tait and D. Duchamp. An Efficient Variable Consistency Replicated File Service. In File Systems Workshop, pages 111-126. USENIX, May 1992.

[28] F. Teraoka, Y. Yokote, and M. Tokoro. A Network Architecture Providing Host Migration Transparency. In Proc. SIGCOMM '91, pages 209-220. ACM, September 1991.

[29] H. Wada et al. Mobile Computing Environment Based on Internet Packet Forwarding. In Proc. 1993 Winter USENIX Conf., pages 503-517, January 1993.

9 Author Information

Erez Zadok is an MS candidate and full-time Staff Associate in the Computer Science Department at Columbia University. This paper is a condensation of his Master's thesis. His primary interests include operating systems, file systems, and ways to ease system administration tasks. In May 1991, he received his B.S. in Computer Science from Columbia's School of Engineering and Applied Science. Erez came to the United States six years ago and has lived in New York "sin" City ever since. In his rare free time Erez is an amateur photographer, science fiction devotee, and rock-n-roll fan.

Mailing address: 500 West 120th Street, Columbia University, New York, NY 10027. Email address: ezk@cs.columbia.edu.

Dan Duchamp is an Assistant Professor of Computer Science at Columbia University. His current research interest is the various issues in mobile computing. For his initial efforts in this area, he was recently named an Office of Naval Research Young Investigator.

Mailing address: 500 West 120th Street, Columbia University, New York, NY 10027. Email address: djd@cs.columbia.edu.