The following paper was originally published in the
Proceedings of the USENIX SEDMS IV Conference
(Experiences with Distributed and Multiprocessor Systems)
San Diego, California, September 22-23, 1993

For more information about USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Experience Building a File System on a Highly Modular Operating System

Michael N. Nelson
Yousef A. Khalidi
Peter W. Madany

Sun Microsystems Laboratories, Inc.
Mountain View, CA 94043 USA
{yak, mnn, madany}@eng.sun.com

Abstract

File systems that employ caching have been built for many years. However, most work in file systems has been done as part of monolithic operating systems. In this paper we give our experience with building a high-performance distributed file system on Spring, a highly modular operating system where system services such as file systems are provided as user-level servers. The Spring file system described in this paper supports cache coherent file data and attributes. It uses the virtual memory system to provide data caching and uses the operations provided by the virtual memory system to keep the data coherent. The file system uses a unique dynamic caching algorithm that allows per-machine caching file servers to be located when a file object is passed from one machine to another. A per-machine caching file server utilizes the virtual memory system to provide caching of data for read and write operations, and it has a private protocol with the remote file servers to cache file attributes.
The result is an operating system that has all the advantages of modular systems while providing the efficiency of caching that was available in monolithic systems.

1. Introduction

Distributed file systems that utilize caching to provide good performance have existed for many years (e.g. Sprite [1], and Andrew [2]). However, until recently all of these file systems were implemented as part of a monolithic operating system. With the advent of microkernel systems (e.g. Mach [3] and CHORUS [4]) file systems are now being implemented outside the kernel in user level servers.

Some of the system properties on monolithic systems that were exploited in order to build distributed file systems have changed. These system properties include:

- Each system component knew about the location of other components. For example, the virtual memory system knew that files could only be implemented by the file systems that were in the kernel. In modular systems the different components could be anywhere, including across the network.
- File objects were always acquired through the cache manager. For example, files were always opened through the file system, which was the same file system that did the caching. In modular systems objects can be passed between user applications, so a caching server may not be involved when users acquire objects.
- All components could trust each other. When system components are implemented by user level servers, this is no longer true.
- Files and directories were only named via the file system. In systems with generic naming systems, files and directories can be bound and resolved via a naming system outside the file system.

This paper describes our experience building a file system on Spring, a highly modular, distributed, object-oriented operating system.
Spring has several properties that provide unique opportunities and challenges when building a file system, including:

- A powerful VM system with support for external pagers and operations that allow the construction of distributed shared memory systems.
- A naming system that allows objects of all types, including files, to be bound into the name space.
- A capability-based security model.
- An object model that allows objects to be passed freely between domains on the same or different machines.

The Spring File System was designed to take advantage of these and other Spring properties to build a powerful coherent distributed file system. The file system consists of two types of file servers: ones that provide access to files that they implement and ones that cache accesses to files implemented by remote file servers.

File servers of the first type (called storage file servers) are responsible for providing access control, coherent access to file data and attributes, and file naming. Data is kept coherent by using the primitives provided by the virtual memory system, and attributes are kept coherent by using a private protocol with caching file servers (see below). The storage file servers name their files by being one of the many name servers that compose the Spring naming system. In addition files can be stored in name servers that are not implemented by the file system.

There are actually two different types of storage file servers: one that runs on each Spring machine and provides access to files on the local disk and one that runs on the SunOS system and provides access to SunOS files. Except for file storage details the two implementations are identical.

The second type of file server (called a caching file server or CFS) is responsible for making access to remote data and attributes efficient. One of these file servers runs on each Spring machine that desires to have file caching.
The CFS is optional: remote files will be accessible without a CFS, but accesses will be slower.

The Spring File System utilizes a unique dynamic caching protocol to allow file objects to be cached by the CFS. Under this protocol the CFS is contacted to cache file objects when they first appear in a client domain. The result is that file objects are always cached by the CFS on the same machine as the client that possesses the file object.

The resulting file system provides good performance. Once files are cached on the local machine, no remote operations are required to perform any operation on the file. Preliminary measurements show that caching allows basic file operations such as read, write, and map to be executed at least 5 times faster than without caching.

The rest of this paper is organized as follows: Section 2 provides an overview of the Spring Operating System; Section 3 discusses the file interface; Section 4 discusses the implementation of the storage file servers; Section 5 describes the implementation of the CFS and discusses the coherency protocol used by it and the storage file servers; Section 6 describes some additional file system functionality; Section 7 discusses performance; Section 8 presents related work; Section 9 discusses the lessons that we learned from building the Spring file system; and Section 10 offers some conclusions.

2. The Spring Operating System

Spring is a distributed, multi-threaded, extensible operating system that is structured around the notion of objects. A Spring object is an abstraction that contains state and provides a set of operations to manipulate that state. The description of the object and its operations is specified in an interface definition language (IDL). IDL supports both single and multiple interface inheritance.

A Spring domain is an address space with a collection of threads. A given domain may act as the server (implementor) of some objects and the client of other objects.
The server and the client can be in the same domain or in different domains.

Spring objects consist of two parts: the object representation that lives in the domain that is using the object and the state kept by the server of the object. The object representation contains at least enough state to allow an invocation on the object to get to the object's server. Figure 1 shows an example of a Spring object where the client of the object and the server of the object are on different machines.

The Spring kernel supports basic cross domain invocations and threads, low-level machine-dependent handling, as well as basic virtual memory support for memory mapping and physical memory management [5, 6]. A Spring kernel does not know about other Spring kernels; all remote invocations are handled by a network proxy server.

A typical Spring node runs several servers besides the kernel. These include a name server, file servers, a linker domain that manages and caches dynamically linked libraries [7], a network proxy that handles remote invocations, a device server that provides basic terminal handling as well as frame-buffer and mouse support, and a UNIX server that provides support for running UNIX binaries on Spring [8].

2.1 Spring Security

If the server and the client of an object are in different domains, the representation of the object includes an unforgeable handle managed by the kernel that identifies the server domain. These unforgeable handles have many of the security properties of capabilities in traditional operating systems. If a server determines that a client is entitled to specific access rights to a given piece of state (e.g. a file), it can give the client an unforgeable handle X. Encapsulated in the server side state for handle X will be the granted access rights and possibly the principal name of the client. Whenever a call arrives quoting handle X, the server can permit the given access to the underlying state without further checks.
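The handle scheme above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the class `FileServer` and its method names are hypothetical, and a random token from `secrets.token_hex` stands in for the kernel-managed unforgeable handle, which in Spring is not a user-visible string at all.

```python
import secrets


class FileServer:
    """Toy server (hypothetical): each handle encapsulates rights and a principal."""

    def __init__(self):
        self._files = {}    # file name -> bytearray holding the contents
        self._handles = {}  # handle -> (file name, frozenset of rights, principal)

    def create(self, name, data=b""):
        self._files[name] = bytearray(data)

    def grant(self, name, rights, principal=None):
        # Assumption: a random token stands in for the kernel-managed handle.
        handle = secrets.token_hex(16)
        self._handles[handle] = (name, frozenset(rights), principal)
        return handle

    def _check(self, handle, right):
        # The only check on the fast path is a lookup of the quoted handle.
        if handle not in self._handles:
            raise PermissionError("unknown handle")
        name, rights, _principal = self._handles[handle]
        if right not in rights:
            raise PermissionError(f"handle lacks {right!r} right")
        return name

    def read(self, handle):
        return bytes(self._files[self._check(handle, "read")])

    def write(self, handle, data):
        self._files[self._check(handle, "write")][:] = data
```

A client holding a read-only handle can call `read` but gets a `PermissionError` from `write`; no ACL is consulted at call time because the rights were fixed when the handle was granted.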
Servers determine if a client is allowed access to a piece of state by consulting an access control list (ACL) that is associated with the state. Each ACL entry contains a principal name and a list of access rights. A server will only believe that a client is a given principal if that client has first been authenticated to be that principal. Once a client has been authenticated to the server as a given principal P, then the server will be willing to return objects that grant specific rights for P as determined by the ACL.

2.2 Spring Naming

The Spring name service [9] allows any object to be associated with any name. A name-to-object association is called a name binding. Each name binding is stored in a context. A context is an object that contains a set of name bindings in which each name is unique. An example of a context is a UNIX file directory. An object can be bound to several different names in possibly several different contexts at the same time.

Since a context is like any other object, it can also be bound to a name in some context. By binding contexts we can create a naming graph. The UNIX file system is a naming graph that is frequently restricted to a tree.

Spring contexts provide support for the Spring security model. When an object is bound, an ACL can be given that specifies which principals are allowed which rights for the object. When a name is resolved, a set of desired modes is specified. Modes are a superset of rights. For example, read and write modes correspond directly to read and write access rights; however, append mode implies write access but also indicates the "mode" with which the object should be accessed when writes occur. When a name is resolved, an object with the desired modes is returned if the client doing the resolve is allowed the corresponding rights.

2.3 Virtual Memory

A per-node virtual memory manager (VMM) is responsible for handling mapping, sharing, and caching of local memory.
The VMM depends on external pagers for accessing backing storage and maintaining inter-machine coherency [6, 10].

Most clients of the virtual memory system only deal with address space and memory objects. An address space object represents the virtual address space of a Spring domain while a memory object is an abstraction of storage (memory) that can be mapped into address spaces. An example of a memory object is a file object (the file interface in Spring inherits from the memory object interface). Address space objects are implemented by the VMM.

A memory object has operations to set and query the length, and operations to bind to the object (see below). There are no page-in/out or read/write operations on memory objects (which is in contrast to systems such as Mach [3]). The Spring file interface provides file read and write operations (but not page-in and page-out operations). Separating the memory abstraction from the interface that provides the paging operations is a feature of the Spring virtual memory system that we found very useful in implementing our file system. This separation enables the server of the memory object to be different from the server of the pager object that provides the contents of the memory object. We will show uses of this feature in Section 5.

2.3.1 Binding a memory object to a cache object

When a VMM is asked to map a memory object into an address space, the VMM must be able to obtain the contents of the memory object, since the memory object itself does not provide operations for obtaining this data. Therefore, the VMM contacts the pager domain that implements the memory object by invoking the bind operation on the memory object. The objective of the bind operation is to point the VMM to a local data cache that provides the contents of the memory object and to tell the VMM what rights are encapsulated by the memory object.
The details of the bind operation are given in [10]; in the rest of this section we will give a brief overview of the bind operation.

During the bind operation the VMM and the pager domain exchange two objects: a pager object and a cache object. The pager object provides operations to page-in and out memory blocks, and the VMM uses it to populate a local cache. The cache object is implemented by the VMM, and the pager domain uses it to affect the state of the cache. Tables 1 and 2 list the operations of the cache and pager objects, respectively. A given pager object-cache object pair constitutes a two-way communication channel between a pager and a VMM. Typically, there are many such channels between a given pager domain and a VMM (see Figure 2 for an example).

As far as the VMM is concerned, each memory object is unique: the VMM relies on the memory object's pager to point it to a data cache from which the VMM obtains the contents of the memory object, and it also relies on the pager to indicate the encapsulated access rights of the memory object. This extra level of indirection allows different memory objects that share the same pages (but perhaps encapsulate different access rights) to share the same cache at the VMM instead of flushing the same pages back and forth between two separate caches.

Operation      Description
flush_back     Remove data from the cache and send modified blocks to the pager.
deny_writes    Downgrade read-write blocks to read-only and return modified blocks to the pager.
write_back     Return modified blocks to the pager. Data is retained in the cache in the current mode.
delete_range   Remove data from the cache, return no data.
zero_fill      Indicate to the VMM that the given range of cache is zero-filled. Data blocks in the range are held by the VMM in read-write mode.
populate       Introduce data blocks into the cache.

TABLE 1. Cache object operations

Operation      Description
page_in        Request data be brought into the cache.
page_out       Write data to the pager and remove data from the cache.
write_out      Write data to the pager and retain data in read-only mode.
sync           Write data to the pager and retain data in the current mode.

TABLE 2. Pager object operations

3. The File Interface

Spring files contain data and attributes and support authentication. The interface provides access to the file's data through two mechanisms. One way is through read and write operations; these operations are inherited from the Spring io interface. The other way is by mapping the file object into an address space; this ability comes by having a file object inherit the memory object interface.

Spring files have three attributes: the length of the file, its access time, and its modify time. The file interface provides get_length and set_length operations to retrieve and change the file length; these operations are inherited from the memory object interface. All three attributes can be retrieved via the stat operation; there is no direct way to set the access or modify time.

Spring files support Spring authentication by inheriting the authenticated interface. The authenticated class provides support for access control lists, encapsulated rights and principals, and it allows new file objects to be created that reference the same underlying file state as the current file, yet contain different encapsulated rights.

4. The Storage File Server

In this section we will describe the implementation of the storage file servers. In this description we will ignore the issue of the caching file server since the caching file server is merely an optimization and is not required for the file system to function properly. In the next section when we discuss the caching file server, we will discuss the extra implementation required in the storage file servers to support caching by the CFS.

4.1 Naming Files

The Spring file system fits into the overall Spring naming system.
Spring files can be accessed via contexts implemented by the storage file servers or via contexts implemented by other domains. The context objects implemented by the storage file servers are only one of the many types of contexts that together compose the Spring naming system.

4.1.1 The File System Contexts

The storage file servers implement a subclass of the context class called fs_context. The fs_context class inherits from the authenticated interface and it adds the create_file operation, which creates a file and binds it to a name. Thus the fs_context objects implemented by the storage file servers contain an encapsulated principal, encapsulated rights, and an ACL. Fs_context objects are normally used to retrieve and bind file and fs_context objects, but other types of objects can be bound and retrieved as well (see Figure 3 for an example Spring name space).

The storage file servers export their files by binding fs_context objects into a public Spring name server. Storage file servers read configuration files that determine where to bind their context objects.

Each binding in an fs_context has an ACL. When a name resolution is invoked on an fs_context (e.g. someone wants to open a file for read-write), the file system ensures that the encapsulated principal of the context doing the lookup is allowed the desired access to the bound object. The resulting file or fs_context object will encapsulate the principal of the context doing the lookup and will also encapsulate the desired modes. For example, if a client had the root context object in Figure 3 authenticated with principal P and the client invoked the operation resolve("B/F/H", read-write), the client would get back a file object that encapsulated principal P and read-write mode, assuming that P had read access to contexts B and F and read-write access to file H.

4.1.2 Naming Separate From File System

File and fs_context objects can be bound into the Spring naming system just like any other object.
Thus these objects can be bound into contexts that are not implemented by the file system. When a client retrieves a file or fs_context object from a non-file-system context (e.g. the file named "A/D" in Figure 3), the context must be able to create a copy of the file or fs_context object that encapsulates the current principal and the desired modes. This is done using a Spring duplication service.

A standard Spring naming server does not know how to change the encapsulated principal or modes of an object. Thus any object server that wishes to allow its objects to be stored in name servers and allow the encapsulated access to be changed, must implement a Spring duplication service object. This object supports the dup operation which takes an object, a principal, and a desired set of modes and returns a copy of the object that encapsulates the given principal and modes.

The file servers implement two duplication services: one for files and one for contexts. When a file server is asked to duplicate an object it ensures that the caller has the right to produce an object with the desired principal and modes, and if so returns a copy of the object that encapsulates the given principal and modes.

4.2 The FS Object

The fs object can be used to create unnamed files. It supports one operation, get_file, which returns a new unnamed file object. In order for this new file object to be bound to a name it must be bound into some context.

4.3 File Implementation

Files are implemented by the storage file servers. In this section we will discuss the interesting details of the file implementation. Note that if a file that is being accessed is implemented by a remote storage file server, all operations invoked on the object will require a network RPC. The CFS that is discussed in Section 5 is able to eliminate most of these network RPCs.

4.3.1 Security

The file objects implemented by the storage file servers are authenticated objects.
Therefore they have both an encapsulated principal and encapsulated rights. The encapsulated rights are set when a file object is created, and the rights are checked on each access to the object. The encapsulated principal is not currently used for file objects. If we decide at some point to verify the principal on each access then we would use the encapsulated principal.

4.3.2 Mapping Files

As we described in Section 2.3, Spring files can be mapped into address spaces because the Spring file class inherits the memory object interface. When a client maps a file object into its address space, the virtual memory system and the file system follow the bind protocol described in Section 2.3. The result is that the cache object-pager object connection between the VMM and the file system is set up. Figure 4 gives the state of the system after a file object is bound into a client domain's address space.

4.3.3 Data Coherency

There is a potential coherency problem when a particular file is mapped into multiple clients' address spaces on several machines at the same time. For example, if two clients on different machines have the same page of a file mapped into their address spaces both readable and writable, then some action must be taken to ensure that both clients see a coherent view of the page.

One of the goals when building the file system was to give clients a coherent view of files. As a result one of the primary jobs of the file system is to keep files coherent. Since files can be cached a page at a time, coherency is done on a per page level; a file server keeps pages coherent by invoking operations on the cache objects that are associated with each file object.

The storage file servers implement a single-writer, multiple-reader per-page coherency algorithm. The file system can guarantee coherency because it gets all page-in requests. Each request indicates whether the page is desired in read-only or read-write mode.
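A single-writer, multiple-reader policy of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the Spring implementation: the classes `Cache` and `StorageServer` and the mode strings "ro" and "rw" are hypothetical, and the choice of which cache object operation (flush_back or deny_writes, from Table 1) to invoke in each case is an assumption made for the sketch.

```python
class Cache:
    """Stand-in for a VMM cache object; records the coherency calls it receives."""

    def __init__(self, name):
        self.name = name
        self.pages = {}  # page number -> "ro" or "rw"
        self.log = []    # coherency operations invoked by the server

    def flush_back(self, page):
        # Remove the page from this cache (modified data would go to the pager).
        self.log.append(("flush_back", page))
        self.pages.pop(page, None)

    def deny_writes(self, page):
        # Downgrade the page to read-only but keep it cached.
        self.log.append(("deny_writes", page))
        if page in self.pages:
            self.pages[page] = "ro"


class StorageServer:
    """Tracks, per page, the set of read-only holders and the single writer."""

    def __init__(self):
        self.readers = {}  # page -> set of caches holding it read-only
        self.writer = {}   # page -> cache holding it read-write

    def page_in(self, cache, page, mode):
        readers = self.readers.setdefault(page, set())
        writer = self.writer.get(page)
        if mode == "rw":
            # A writer needs exclusive access: evict every other holder.
            for other in readers - {cache}:
                other.flush_back(page)
            if writer is not None and writer is not cache:
                writer.flush_back(page)
            readers.clear()
            self.writer[page] = cache
        elif writer is not cache:
            # A reader may share the page once any writer is downgraded.
            if writer is not None:
                writer.deny_writes(page)
                readers.add(writer)
                del self.writer[page]
            readers.add(cache)
        else:
            return  # the requester already holds the page read-write
        cache.pages[page] = mode
```

Because every page-in request passes through `page_in`, the server always knows which caches hold a page and in which mode, which is the property the text relies on to guarantee coherency.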
4.3.4 Read and Write Caching

Read and write operations are cached by mapping the file that is being read or written into the storage file server's address space. Once the file is mapped, then the data is copied to or from the mapped region as appropriate. Since file mapping is used, all of the issues of data caching and coherency are handled by the vm-pager data coherency protocol.

4.3.5 Periodic Data Write Back

In order to reduce the amount of data lost in a machine crash, the storage file servers write back all modified data for their files cached at VMMs every 30 seconds. The file servers do this by invoking the write_back operation on the cache objects associated with each file.

4.3.6 Coherency Impact of the Length

Getting and setting the length may require coherency actions. Getting the length requires that the file server retrieve the length from anyone who is caching it writable. Setting the length requires a coherency action if the length is decreased. In this case the pages at the end of the file need to be eliminated from the file and from all caches of the file. If the pages are not removed from the caches, then clients will not see a consistent view of the file because some clients may be able to access parts of the file that no longer exist. Pages are deleted from caches by invoking the delete_range operation with the appropriate data range on all cache objects that possess deleted pages.

If a file's length is increased, then nothing has to be done in order to ensure coherency. However, there is an opportunity for an optimization that can best be done by the caching file server. We will discuss this optimization in Section 5.8.

5. The Caching File Server

In this section we describe the implementation of the Caching File Server (CFS). The CFS caches the following things in order to provide high performance:

- Attributes to eliminate remote get_length, set_length, and stat calls.
- Data to eliminate remote read and write calls.
- VM cache objects to eliminate remote bind calls and allow an additional optimization that eliminates most zero-fill page faults.

5.1 Basic Architecture

In order to allow local file caching to be implemented, the file objects used by client domains must be implemented by the CFS. In addition the CFS must have a special communication channel for caching with the storage file servers whose data and attributes it is caching and a copy of the VMM cache object for the file.

The other component of the caching architecture is the virtual memory system. The virtual memory system uses the cache and pager objects described in Section 2.3. In order to make page-ins and page-outs as efficient as possible, the virtual memory manager should be able to communicate directly with the file server that stores the data; in other words, the pager object should be implemented by the storage file server, not the file cacher. The desired structure for data caching involving the CFS, the storage file server, and the VMM is given in Figure 5.

5.2 The Caching Subcontract

When client domains receive objects from a remote file server, the CFS must somehow be able to interpose on these objects so that caching can occur. This is done through the use of the caching subcontract.

Every Spring object has an associated subcontract [11]. Subcontract is responsible for many things including marshaling, unmarshaling, and invoking operations on the object. Subcontract also defines the representation for each object that appears in a client domain's address space. The standard Spring subcontract is called singleton. The representation of a singleton object includes a kernel handle that identifies the server domain. When a client invokes an operation on an object that uses singleton, this handle is used to send the invocation to the server domain.

File objects use a different subcontract called the caching subcontract. File objects are only one of the users of the caching subcontract.
The representation for an object that uses the caching subcontract contains:

- A handle that identifies the server domain (this is the same handle that is in the singleton representation).
- An object, called the cached_object, that is implemented by a domain that caches the original object.
- A name, called cacher_name, that names the cacher to use.

Figure 6 shows the configuration after a file object with the caching subcontract is cached by a CFS domain.

The cached_object in the caching subcontract representation is used when an invocation occurs on an object that uses the caching subcontract. If the cached_object is non-null, then the invocation is done on the cached_object; if the cached_object is null, then the invocation is done on the server's handle. The cached_object will be null if there is no cacher or the server is on the local machine.

The cached_object is obtained using the cacher_name when an object is unmarshaled into a client domain. Each cacher domain (such as the CFS) implements a cacher object. This object provides the operation get_cached_obj that takes an object implemented by a remote server and returns an object implemented by the cacher domain. This cacher object is bound in the local machine's name space under a name that must be agreed upon by the implementor of the cacheable service and the implementors of cacher domains for the service. This is the name that is stored as the cacher_name in the subcontract representation. This name is put there by the server domain that created the cacheable object.

When an object is unmarshaled into a client domain the unmarshaling code resolves the cacher_name to a cacher object implemented by a cacher domain. The unmarshaling code then invokes the get_cached_obj operation on the cacher object, passing in a copy of the cacheable object. When the cacher domain receives the object, it creates a new object that it implements and returns this new object to the client domain.
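The unmarshal and invoke paths of the caching subcontract can be sketched as follows. This is a hedged illustration in Python: only cached_object, cacher_name, and get_cached_obj come from the text; the classes `RemoteFile`, `CachedFile`, `Cacher`, and `CachingRep`, the dictionary used as the local name space, and the trivial read-once caching policy are all assumptions made to keep the example self-contained.

```python
class RemoteFile:
    """Stand-in for a file object implemented by a remote storage file server."""

    def __init__(self, data):
        self.data = data
        self.calls = 0  # counts invocations that would cross the network

    def read(self):
        self.calls += 1
        return self.data


class CachedFile:
    """Object implemented by the cacher domain; serves repeat reads locally."""

    def __init__(self, remote):
        self._remote = remote
        self._copy = None

    def read(self):
        if self._copy is None:
            self._copy = self._remote.read()
        return self._copy


class Cacher:
    """Cacher object exported by a cacher domain such as the CFS."""

    def get_cached_obj(self, obj):
        return CachedFile(obj)


class CachingRep:
    """Client-side representation of an object under the caching subcontract."""

    def __init__(self, server_handle, cacher_name):
        self.server_handle = server_handle
        self.cacher_name = cacher_name
        self.cached_object = None

    def unmarshal(self, local_names):
        # Resolve cacher_name in the local name space; absent means no cacher.
        cacher = local_names.get(self.cacher_name)
        if cacher is not None:
            self.cached_object = cacher.get_cached_obj(self.server_handle)
        return self

    def invoke(self, op, *args):
        # Prefer the cached_object; fall back to the server's handle if null.
        if self.cached_object is not None:
            target = self.cached_object
        else:
            target = self.server_handle
        return getattr(target, op)(*args)
```

On a machine whose name space has no entry for the cacher name, `unmarshal` leaves `cached_object` as `None` and every invocation goes to the server, which matches the statement that the cacher is optional.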
The object returned from the cacher is stored as the cached_object in the subcontract representation.

5.3 The CFS Cacher Object

The CFS implements a cacher object that it exports in the machine name space under a known name. Whenever a storage file server creates a file object, it sets the cacher_name in the file object's representation to be the name of the CFS. Thus when a file object is unmarshaled, the CFS's cacher object will be found and the get_cached_obj operation will be invoked on the cacher object. The CFS will then return a file object that it implements.

When the CFS receives a file object to cache via the get_cached_obj call it must determine two things. First, it has to determine if it implements the cached_object that is in the file object's representation; if so it just returns the cached_object. Second, it has to find the internal cache state for the file and the file's encapsulated access rights; this is done by using the same bind protocol that the VMM uses to set up the cache object-pager object connection (see below).

5.4 CFS to Remote File Server Connection

The CFS and remote file servers need a connection similar to the connection between the VMM and pagers. The CFS needs to be able to get cached information for files and the remote file server needs to perform callbacks for cache coherency. This connection consists of two objects: an fs_cache object and an fs_pager object. The fs_cache object is a subclass of the VM cache object and is implemented by the CFS. The fs_pager object is a subclass of the VM pager object and is implemented by the storage file servers (see Tables 3 and 4 respectively for the extra operations added by the fs_pager and fs_cache objects). These objects are subclasses of the VM objects for two reasons:

- It allows the normal bind operation on a file object to be used to set up the connection and discover whether a file is already cached.
- It allows the storage file servers to keep data coherent while being ignorant of whether they are dealing with a VM system or a CFS: the file servers just use the VM cache object operations for data coherency.

The connection between the CFS and the remote file server is set up using the same bind protocol described in Section 2.3; it just involves different objects.

Operation            Description
cached_bind          Tell server file is cached at VMM.
cached_stat          Get cached attributes (writable if desired). Result indicates which attributes are cacheable.
set_length           Set the length.
release_cache_info   Release cached information.

TABLE 3. Fs_pager object operations

5.5 Caching Binds

One of the important jobs of the CFS is to cache the results of VM binds since they occur on every map call. When a bind occurs the CFS checks permissions and then checks to see if it already has a VM cache object for the file. If not it gets one in the following manner. The CFS first tells the remote file server that the VMM is caching file data so the remote file server knows that the file's data is being cached; this is done by invoking the cached_bind operation on the appropriate fs_pager object. The CFS then calls the local VMM with the fs_pager object implemented by the remote storage file server to create a VM cache object. Once the CFS has a cache object, it keeps a copy of it and returns the cache object to the caller of bind (i.e. the local VMM).

Operation         Description
get_back_times    Return access and modify times.
get_back_length   Return the length. A parameter indicates whether the length can still be cached.
dont_cache_time   Don't cache the time anymore.
delete_cache      The VM cache is no longer valid.

TABLE 4. Fs_cache object operations

Figure 7 shows the configuration after a successful cached bind operation. Note that the VMM has a direct pager connection to the remote file server and the remote file server's cache object is actually implemented by the CFS.
Thus all cache coherency operations on the cache object will indirect through the CFS. This does not significantly degrade performance since we are just adding one extra local call to two remote calls (the coherency call and the page-out operation) and all of the data is being transferred using the direct pager object connection.

5.6 Caching Reads and Writes

The CFS caches data for reads and writes on files by mapping the file that is being read or written into the CFS's own address space. Once the file is mapped, the data can be copied to or from the mapped region as appropriate. Since file mapping is used, all of the issues of data caching and coherency are handled by the virtual memory system and the remote file servers.

In order to implement the read and write operations, the file length must be available locally. In particular, for writes that append data to a file, the CFS must be able to modify the length locally.

5.7 Caching Length

Caching the length is important because it allows read, write, get_length, and some set_length operations to happen locally. In order to let set_length and write operations happen locally, a CFS must have the ability to modify the length locally. As a result a length coherency algorithm is necessary. This coherency algorithm is a simple single-writer, multiple-reader algorithm: a storage file server will allow multiple CFS domains to cache the length readable, but only one to cache it writable. A CFS retrieves the length by invoking the cached_stat operation on the appropriate fs_pager object, and a storage file server keeps the length coherent by invoking the get_back_length operation on the appropriate fs_cache objects.

The file length has to be retrieved by the storage file servers on page faults because the file server must know the current length of the file to determine if the page fault is legal.
Thus, if on a page fault the length is being cached read-write, the file server will fetch the length back from the CFS that is caching the length and revoke write permission.

Having the length cached read-write only allows a CFS to increase the length without informing the storage file server. A CFS still has to call through to the file server when a file is truncated so the file server can take necessary coherency actions.

5.8 Zero-filling Cache Objects

When a file is lengthened, all of the new pages between the old length and the new length will be read as zeros until the pages are modified. Instead of the remote file server zero-filling these pages on page faults, it would be much more efficient if the virtual memory system could zero-fill these pages itself, thus avoiding a cross-machine call and a data transfer. This optimization is implemented by the CFS. If the CFS has the length cached writable and the length is increased, the CFS invokes the zero_fill operation on the VM cache object. If the file object hasn't been bound yet, then the CFS will do the zero-fill after the object is bound.

The storage file servers have to keep track of pages that are being zero-filled by virtual memory managers. Whenever a storage file server discovers that the length of the file has been extended by a CFS, it assumes that all new pages between the old length and the new length are being zero-filled by the VMM on the CFS's machine. A storage file server can discover that a CFS has lengthened a file in three ways:

• The length is retrieved for coherency purposes.
• The CFS gives the length back because it is no longer caching it.
• A page-out past the end-of-file occurs from a machine that has the length cached writable. In this case the length is set to contain the last byte of the page.

5.9 Caching Time

Both the access time and the modify time can be cached by a CFS.
Both times can be cached writable, but we make no attempt at keeping the access time coherent because it is impossible to keep a cached access time coherent if the file is mapped in multiple caches. Thus if we insisted on a coherent access time, it would require that stat operations on all shared mapped files, even read-only ones, be remote. We do not know of any important application programs that require a coherent access time.

The modify time is kept coherent so that programs such as make can function properly. A CFS is allowed to cache the modify time if no one has the file cached read-write or the CFS is the only CFS that has the file cached read-write. In the second case, the CFS is allowed to change the modify time. A CFS retrieves the access and modify times by invoking the cached_stat operation on the appropriate fs_pager object and a storage file server keeps the modify time coherent by invoking the dont_cache_time operation on the appropriate fs_cache objects.

5.10 Data and Length Write Back Policy

Modified data is cached by the VMM for files that are cached by the CFS. If the machine that the data is cached on crashes, this data will be lost. As mentioned before, the storage file servers employ a 30 second write back policy for writing back this cached data. In order to make the data even more secure, the CFS employs its own write back policy: when the last reference to a cached file object is gone, the CFS will write back all modified data for the file. Data is not written back for temporary or anonymous files (see Section 6).

Writing back the data is not sufficient - the length must be written back as well. As we discussed in Section 5.8, the storage file servers implicitly lengthen the file when page-outs past the end-of-file occur. Since page-outs are in page-size quantities, the file length is set to include the whole page.
Thus the length has to be written back after the data is written back so the file server can know the true length of the file. When a storage file server gets the length from a CFS that is caching the length read-write, it will truncate the file to that length.

5.11 Security

The CFS is trusted by client domains to cache their files. The CFS needs to ensure that it does not accidentally allow a client to attain greater access to a cached file than the client is allowed. This is guaranteed by using the access rights obtained from the secure bind protocol described in Section 2.3.1. These access rights are checked on every operation on file objects to ensure that the client is allowed the desired access.

6. Additional Functionality

There are other pieces of functionality in the file system that we have not discussed. First, the file system is the source of anonymous memory objects. These memory objects are used by the system for things such as stacks and heap memory. These objects are acquired by the VMM via objects implemented by storage file servers and the CFS. The details of the anonymous memory object implementation are given in [12].

The other piece of functionality that we have not discussed is cache reclamation. The VMM, the CFS, and the storage file servers all cache information. When any of these services gets too many objects in its cache, it needs to reclaim some of them. Reclaiming can be complicated since it involves multiple domains. Details of cache reclamation are given in [6, 12].

7. Current Status and Performance

We have implemented the file system described in this paper. The file system that we have implemented consists of three file servers:

• a storage file server that provides coherent access to files stored on the local disk,
• a CFS that runs on each machine,
• and a storage file server that runs on the SunOS system and provides access to SunOS files.
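The interplay between the storage file server and the CFS listed above is easiest to see in the length coherency protocol of Section 5.7. The following Python sketch is an illustration only, not the Spring implementation: the operation names cached_stat and get_back_length come from Tables 3 and 4, while the class names, the page_fault and append helpers, and all signatures are assumptions made for this example.

```python
class CFS:
    """Toy caching file server holding one file's cached length (hypothetical)."""

    def __init__(self):
        self.length = None      # None means the length is not cached
        self.writable = False   # True while this CFS is the single writer

    def get_length(self, server):
        # Served locally once the length is cached.
        if self.length is None:
            self.length = server.cached_stat(self, writable=False)
        return self.length

    def append(self, server, nbytes):
        # Appending needs the length cached writable.
        if not self.writable:
            self.length = server.cached_stat(self, writable=True)
            self.writable = True
        self.length += nbytes   # purely local: no call to the server

    def get_back_length(self, can_still_cache):
        # Callback from the server: report the length, lose write permission.
        length = self.length
        self.writable = False
        if not can_still_cache:
            self.length = None
        return length


class StorageFileServer:
    """Toy storage file server: many readers of the length, at most one writer."""

    def __init__(self, length=0):
        self.length = length
        self.readers = set()
        self.writer = None

    def _recall_writer(self, can_still_cache):
        # Fetch the authoritative length back from the writer, if there is one.
        if self.writer is not None:
            cfs, self.writer = self.writer, None
            self.length = cfs.get_back_length(can_still_cache)
            if can_still_cache:
                self.readers.add(cfs)

    def cached_stat(self, cfs, writable):
        self._recall_writer(can_still_cache=not writable)
        if writable:
            # A new writer invalidates every other cached copy.
            for reader in self.readers - {cfs}:
                reader.get_back_length(can_still_cache=False)
            self.readers = set()
            self.writer = cfs
        else:
            self.readers.add(cfs)
        return self.length

    def page_fault(self, offset):
        # The server needs the current length to decide if the fault is legal.
        self._recall_writer(can_still_cache=True)
        return offset < self.length
```

In this sketch a CFS that appends to a file changes only its local copy of the length; the server learns the new value when another CFS asks for the length or when a page fault forces a recall, at which point the writer is downgraded to a reader.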
The Spring File System that we have implemented uses caching extensively to provide high per- formance. In the rest of this section we will examine just how effective this caching could be and how effective it really is. 7.1 Potential Improvements The caching by the CFS provides the ability for substantial increases in performance. Table 5 gives two examples of sequences of operations that clients can do on files and how caching dramatically reduces network accesses. The first example is the use of a 1 Mbyte temporary file accessed via the read-write interface. This shows the effect of the data and length caching done by the CFS. In this example, when caching is used there is virtually no network activity; this file can be read and writ- ten as fast as the local file system can copy data. The second example shows the use of a 1 Mbyte file accessed via memory mapping. This shows the effect of the zero-fill optimization. With the zero-fill optimization and length caching, there are virtually no network operations. 7.2 Measured Improvements In the previous section we discussed the potential benefits from caching. Table 6 gives measure- ments of some common file operations. The client machine is a SPARCstation? 2 running Spring. The operations without caching go to a storage file server that is running on the SunOS system on a SPARCstation 2. These measurements show that caching allows the operations to be executed at least 5 times faster than without caching. Operation Without Caching With Caching read 4K 11 ms 1.9 ms write 4K 51 ms 2.1 ms set_offset 3.4 ms 0.11 ms map/unmap 10.5 ms 2.1 ms TABLE 6. Measured Performance We need to do much more extensive performance evaluation of the Spring File System including comparing its performance to other systems. Perhaps the most interesting comparison would be to compare the performance to that of other non-modular systems such as the SunOS system. 
However, for now it is encouraging that caching is very effective for these simple measurements.

8. Related Work

There have been many instances of file systems that employ caching. Examples are NFS [13], the Sprite File System [1], and the Andrew File System [2]. All three of these file systems provide some level of caching of both data and attributes and some level of coherency. However, none of them provide distributed shared memory (DSM), and they were all built as part of or on top of monolithic operating systems. As a result many of the issues addressed by the Spring File System, such as dealing with external pagers, the separation of naming from the file system, and dynamically locating a per-machine cacher, were not addressed by these file systems.

There have also been several instances of systems that provide DSM including [14], [15], and [16]. However, these systems also did not address the issues involved in a system like Spring.

To our knowledge the only system that has addressed the caching problems in a distributed modular system besides Spring is CHORUS [4]. The CHORUS system implements distributed shared memory by having one global coherency manager that interacts with a per-machine cache manager. Each access to a file object is indirected through the local cache manager by using a coherent capability. When a file object is created it contains the known port of the local cache manager.

The special coherent capability in CHORUS provides functionality similar to the subcontract mechanism in Spring. However, the Spring subcontract mechanism is more general since it works even when the local cacher does not exist, and the cacher is identified by a name instead of a specific port number.

The notions of length and attributes coherency are not mentioned in [4]. Other issues, such as binding to caches, naming, and cache reclamation, are not mentioned either.
Thus although CHORUS has implemented something similar to the Spring File System, it is unclear if they have solved all of the hard problems solved by the Spring File System.

9. Lessons Learned

While building the Spring file system we learned several things about designing a file system for a modular system such as Spring:

• Splitting the memory object into a memory object and a pager object adds power. We used this feature to allow file operations such as getting attributes to go through the CFS while having all data transfers go directly to the storage file server. We would not have been able to implement our caching architecture as efficiently if the data had to be paged in via the memory object as was done in Mach [3].
• Using the VM system for data caching greatly simplifies things. This is a much better approach than requiring the file system to implement its own buffer cache for data as was done in older systems such as Sprite [1].
• Building file systems at user level is a good thing. We found it much easier to build a file system at user level than building one inside the kernel. We were able to try out new versions of the file system without rebooting the kernel and we were able to debug the file system using normal user level debugging tools.
• Strong interfaces with subclassing is the right way to build systems. Once we developed the interfaces to our objects we were able to produce many different implementations (including adding caching) without having to change any client code. In addition we were able to utilize interface subclassing so that we could add functionality to the cache and pager object interfaces while still using the standard VM bind protocol.
• Control over the object invocation mechanism is powerful. The Spring notion of subcontract was very useful in allowing us to transparently implement caching.
We were able to change the marshaling, unmarshaling, and invocation mechanisms for file objects so that they could be cached without programmers of client applications having to do anything.
• When splitting a system into components work is required to allow good performance. We worked very hard when we developed the interfaces between the VM system and the file system to allow performance optimizations such as zero-filling to be possible. More work is still necessary in this area so that we can support other optimizations such as input and output clustering [18].
• A general naming system is good. In Spring, the file system fits into the overall Spring naming system instead of trying to wedge naming for all objects into the file system as was done in other systems like Plan 9 [17]. This made the implementation of naming easier and cleaner.

10. Conclusions

File caching is crucial to good system performance in a distributed environment. The Spring File System provides effective caching in an environment different from the previous environments where caching was implemented. The Spring file data and attribute caches not only provide good performance but they are fully coherent as well. The Spring File System demonstrates that caching can be as effective in a highly modular distributed system as it is in monolithic systems such as the UNIX and Sprite operating systems.

The one open question about building a file system on a modular system such as Spring is how performance compares to that on monolithic systems. We are currently beginning the process of performance analysis and tuning of Spring, and we believe that with the proper amount of tuning, we can attain performance comparable to monolithic systems.

11. References

[1] Nelson, M. N., Welch, B. B., and Ousterhout, J. K., "Caching in the Sprite Network File System," ACM Transactions on Computer Systems 6, 1 (Feb. 1988), pp. 134-154.
[2] Howard, J. H. et al., "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems 6, 1 (Feb. 1988), pp. 51-81.
[3] Accetta, M. et al., "Mach: A New Kernel Foundation for UNIX Development," Proceedings of the USENIX 1986 Summer Conference, June 1986.
[4] Abrosimov, V., Armand, F., and Ortega, M. I., "A Distributed Consistency Server for the CHORUS system," Proceedings of Third Symposium on Experiences with Distributed and Multiprocessor Systems, March 1992, pp. 129-148.
[5] Hamilton, K. G. and Kougiouris, P., "The Spring Nucleus: A Microkernel for Objects," Proceedings of the 1993 Summer USENIX Conference, June 1993, pp. 147-160.
[6] Khalidi, Y. A. and Nelson, M. N., "The Spring Virtual Memory System," Sun Microsystems Laboratories, Technical Report SMLI-92-388, Sept. 1992.
[7] Nelson, M. N. and Hamilton, G., "High Performance Dynamic Linking Through Caching," Proceedings of the 1993 Summer USENIX Conference, June 1993, pp. 253-266.
[8] Khalidi, Y. A. and Nelson, M. N., "An Implementation of UNIX on an Object-oriented Operating System," Proceedings of the 1993 Winter USENIX Conference, Jan. 1993, pp. 469-480.
[9] Radia, S. R., Nelson, M. N., and Powell, M. L., "The Spring Name Service," Sun Microsystems Laboratories, Technical Report.
[10] Khalidi, Y. A. and Nelson, M. N., "A Flexible External Pager Interface," Proceedings of the Second Symposium on Microkernels & Other Kernel Architectures, Sept. 1993.
[11] Hamilton, G., Powell, M. L., and Mitchell, J. G., "Subcontract: A Flexible Base for Distributed Programming," Proceedings of Fourteenth ACM Symposium on Operating System Principles, to appear Dec. 1993.
[12] Nelson, M. N., Khalidi, Y. A., and Madany, P. W., "The Spring File System," Sun Microsystems Laboratories, Technical Report SMLI 93-10, Feb. 1993.
[13] Sandberg, R. et al., "Design and Implementation of the Sun Network Filesystem," Proceedings of the 1985 Summer USENIX Conference, June 1985, pp. 119-130.
[14] Li, K., Shared Virtual Memory on a Loosely Coupled Multiprocessor, Ph.D. Thesis, Yale University, 1986.
[15] Leach, P., Levine, P., Hamilton, J., and Stumpf, B., "The File System of an Integrated Local Network," Proceedings of the 1985 ACM Computer Science Conference, March 1985, pp. 309-324.
[16] Ramachandran, U. and Khalidi, Y. A., "An Implementation of Distributed Shared Memory," Software-Practice & Experience 21, 5 (May 1991), pp. 443-464.
[17] Pike, R., Presotto, D., Thompson, K., and Trickey, H., "Plan 9 from Bell Labs," Proceedings of 1990 UKUUG Conference, July 1990.
[18] McVoy, L. W. and Kleiman, S. R., "Extent-like Performance from a UNIX File System," Proceedings of the 1991 Winter USENIX Conference, January 1991.

Information on Spring

To get technical reports and other information on the Spring project send email to Corrine Dreisbach at corrine@eng.sun.com.

Trademarks

Sun, Sun Microsystems, SunOS, and SPARCstation are trademarks or registered trademarks of Sun Microsystems, Inc. UNIX is a registered trademark of UNIX System Laboratories, Inc.

FIGURE 5. Desired Caching Structure
[Diagram labels: Storage File Server, VMM, file object, pager object, cache object.]
The client has a file object that is implemented by the CFS. The CFS has a private caching communication channel with the storage file server. If the contents of the file object are cached by the VMM, then the VMM has a pager object implemented by the storage file server and the CFS has a copy of the VMM cache object.

The page size is assumed to be 4 Kbytes. The first example involves reading and writing a 1 Mbyte file in its entirety where each read and write transfers 4 Kbytes of data. Without caching, 256 network reads and 256 network writes are required. The second example involves accessing a 1 Mbyte file through the mapping interface. Without the zero-fill optimization 256 network page faults are required if all the pages are touched.
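The counts in this caption and the totals in Table 5 follow from simple arithmetic, which the short Python sketch below reproduces. It is only a worked check: the per-step counts for create, map, set length, and remove are copied from Table 5, and the function and variable names are made up for this example.

```python
PAGE_SIZE = 4 * 1024       # 4 Kbytes, as assumed in the caption
FILE_SIZE = 1024 * 1024    # 1 Mbyte


def transfers(file_size=FILE_SIZE, unit=PAGE_SIZE):
    """Number of unit-sized transfers needed to cover the file (rounded up)."""
    return -(-file_size // unit)


def read_write_example(cached):
    """Network operations for the temporary-file example of Table 5."""
    n = transfers()
    ops = {
        "create": 2,
        "write": 1 if cached else n,   # with caching only the bind goes remote
        "read": 0 if cached else n,
        "remove": 1,
    }
    return sum(ops.values())


def mapped_example(cached):
    """Network operations for the memory-mapped example of Table 5."""
    n = transfers()
    ops = {
        "create": 2,
        "map": 1,
        "set_length": 0 if cached else 1,  # length is cached writable
        "modify": 0 if cached else n,      # zero-fill avoids remote page faults
    }
    return sum(ops.values())
```

Here transfers() evaluates to 1 Mbyte / 4 Kbytes = 256, which gives 2 + 256 + 256 + 1 = 515 operations without caching against 4 with caching for the first example, and 2 + 1 + 1 + 256 = 260 against 3 for the second.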
Operation        Network Operations   Network Operations
                 Without Caching      With Caching
Create file      2                    2
Write 1 Mbyte    256                  1 (bind)
Read 1 Mbyte     256                  0
Remove           1                    1
Total            515                  4

Create file      2                    2
Map file         1                    1
Set length       1                    0
Modify pages     256                  0
Total            260                  3

TABLE 5. Possible improvements with caching

FIGURE 4. State after a file has been mapped.
The file server implements the file object. When the file is mapped into the client's address space, a pager object is created at the file server and a cache object is created at the client's VMM.

FIGURE 1. Spring Object
The client domain has an object that is implemented by a server domain. The client has a representation for the object that allows the invocations on the object to get to the server domain. The server keeps some state for the object.

FIGURE 2. Pager-cache object example
A VMM and a pager have one or more two-way cache-pager object connections. In this example Pager 1 is the pager for two distinct memory objects cached by VMM 1 so there are two pager-cache object connections, one for each memory object. Pager 2 is the pager for a single memory object cached at both VMM 1 and VMM 2 so there is a pager-cache object connection between Pager 2 and each of the VMMs.

FIGURE 3. Sample Name Space
A sample Spring name space that consists of fs_contexts and files implemented by the file system and other objects implemented by other domains. Files and fs_context objects can be bound and retrieved from fs_context objects or from other context objects. Although fs_context objects are normally used to store files and fs_context objects, other types of objects can be stored in fs_contexts as well.

FIGURE 6. State after object is cached by CFS.
The representation of a file object consists of a cached file object that is implemented by a CFS domain, a cacher name that names the CFS, and a handle to the storage file server domain.

FIGURE 7. State after a cached bind
The client has a file object that is implemented by the CFS. The private caching channel between the CFS and the storage file server shown in Figure 5 is actually the fs_cache object - fs_pager object pair plus the file object. The storage file server's pager object actually has an fs_cache object implemented by the CFS instead of the VM cache object shown in Figure 5. However, the VMM still has a direct pager connection to the storage file server.
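The state captioned in Figure 7 is the outcome of the bind sequence described in Section 5.5. As a rough illustration, the Python sketch below walks through that sequence. Only the cached_bind operation name comes from Table 3; every class, method signature, and the representation of access rights as a set of strings are assumptions made for this example and do not describe the actual Spring interfaces.

```python
class FsPager:
    """Stand-in for the fs_pager object implemented by a storage file server."""

    def __init__(self):
        self.cached_bind_calls = 0

    def cached_bind(self):
        # The server records that the file's data is now cached at a VMM.
        self.cached_bind_calls += 1


class VMCacheObject:
    """Stand-in for a VM cache object; it keeps the direct pager connection."""

    def __init__(self, pager):
        self.pager = pager


class VMM:
    """Stand-in for the local virtual memory manager."""

    def create_cache_object(self, pager):
        return VMCacheObject(pager)


class CFS:
    """Stand-in for the caching file server, caching the results of binds."""

    def __init__(self, vmm):
        self.vmm = vmm
        self.cache_objects = {}   # file id -> VM cache object kept by the CFS

    def bind(self, file_id, fs_pager, granted_rights, wanted_right):
        # Step 1: check permissions on every bind.
        if wanted_right not in granted_rights:
            raise PermissionError(f"{wanted_right} access not granted")
        # Step 2: reuse an existing VM cache object if there is one.
        cache = self.cache_objects.get(file_id)
        if cache is None:
            # Step 3: tell the remote file server that the VMM caches the data.
            fs_pager.cached_bind()
            # Step 4: have the local VMM create a cache object for the pager.
            cache = self.vmm.create_cache_object(fs_pager)
            # Step 5: keep a copy for later binds.
            self.cache_objects[file_id] = cache
        return cache
```

Under these assumptions, binding the same file twice contacts the storage file server only once: the second bind is answered from the copy kept by the CFS, while the returned cache object still refers directly to the server's pager.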