MULTI-RESIDENT AFS: AN ADVENTURE IN MASS STORAGE

Jonathan S. Goldick, Kathy Benninger, Christopher Kirby, Christopher Maher, Bill Zumach
Pittsburgh Supercomputing Center

1. Abstract

The Pittsburgh Supercomputing Center has been working to integrate distributed file system technology with hierarchical mass storage. We produced a system utilizing the Andrew File System that can be interfaced to many mass storage systems. We retained the semantics of AFS and compatibility with standard clients and servers. The architecture has a logical separation between the facility that provides the user interface and access semantics and the management of the storage systems that contain user data. Support for file-level replication provides high availability of data in a fashion that is transparent to users. This system is called Multi-Resident AFS.

2. Introduction

There has been a great deal of work in the last ten years in making distributed file systems faster, more functional, and easier to use. This has led to their widespread use from the desktop to the supercomputer. The explosive growth of data being stored in these systems, combined with the exponential growth in client capacity, emphasizes a necessary capability: support for hierarchical mass storage. The Pittsburgh Supercomputing Center has been working for several years to address this problem. This paper describes our experiences with our system based on the Andrew File System (AFS) from Transarc. While the paper concentrates on AFS, the lessons learned apply to most, if not all, distributed file systems.

3. The Problem

In 1990, the Pittsburgh Supercomputing Center was facing a serious problem in providing access to the large volume of data produced by our users. We had thousands of users spread out across the United States and needed a way to give them easy access to the data they produce on the center's supercomputers, as well as a vehicle for collaboration with other scientists.
While we were already running the Los Alamos Common File System (CFS) [1] to service our mass storage needs, it had only a simple get/put file interface and no distributed access. We identified distributed file system technology as a way to provide our users with both a friendly interface and the ability to access their data from their home sites. However, there were few products available that offered both distributed access and support for high latency, high capacity media like tape.

The products we did find offered FTP and NFS interfaces but suffered from several significant problems. When tape mounts weren't completed within the time-out period, users would see NFS time-outs. All of the software had to be run on a single, high capacity server machine; there was no support for distributing the load across several file servers without duplicating all of the equipment. This inability to spread the load over multiple servers was made more painful by the fact that clients of the NFS interface offered by these products had to query the servers to check the consistency of any data they might have cached. This was a major performance problem considering that our clients were much faster than any server available. There was also very little in the way of security in these systems. They were designed with the model that every machine and network between the data and users was under centralized control. Such an environment is becoming increasingly rare and certainly didn't include our site.

Some other, longer-range problems we observed in the available products included the lack of options for server platforms. These systems were generally designed for a specific machine and there was little choice in hardware. This was made even worse by the observation that there were also no standards to allow us to move data from one vendor's system to another, thereby locking us into a single product line.
While the amount of data we store is much larger than at most sites, our problem of not being able to afford as much magnetic disk as our users want is universal. The need for secure, distributed access to data from many clients is not limited to supercomputing centers.

4. Existing Solutions

There are a number of ways to approach this problem. Some systems extend the UNIX file system at the kernel level to intercept file I/O and do the necessary work to bring the data onto a disk. Examples of this method include NAStore from NASA Ames [2], InfiniteStorage from Epoch Systems [3], and CRAY Data Migration Facility from CRAY Research. The file system is periodically scanned for unused files, which are moved to slower, cheaper media. Distributed access to the data is generally provided by NFS exporting the local file system, with the resulting potential for NFS time-outs mentioned earlier. For those systems not provided by the vendor of the operating system, there is the even bigger problem of maintaining the kernel modifications across versions of the operating system and across hardware platforms.

Another approach to solving this problem is to let the mass storage system export an NFS server interface without an explicit UNIX file system underneath it. The file data is spread out among disks and tapes but appears as a single NFS mounted partition to the users. The file name space is kept in some sort of database. UniTree [4], which is provided by several vendors, uses this approach. This has the advantage of a standardized, user-mode interface running on the server. There are none of the problems typically associated with changing a kernel to provide the necessary interface. The downside of this approach is that the mass storage vendor has to reproduce in user mode the UNIX file system technology that vendors provide in the kernel. The caching strategies and use of kernel memory and buffers are not available to them.
This causes a general performance problem when doing operations that don't involve much actual movement of data.

More recently, the Center for Information Technology Integration at the University of Michigan proposed the mass store itself be considered the file system, not just a bag on the side of a disk-based file system [5]. This is based on the approach taken by the Plan 9 [6] file system where high-speed RAM acts as a cache for magnetic disk, which in turn acts as a cache for WORM optical disk. Disks and memory would be used as high speed caches in front of the relatively slow tape or optical mass storage system. The idea is to create a transparent byte-level interface to files located in the mass storage system. A major problem with this approach becomes apparent when file system meta-data is considered. File systems generally do many small, random access operations on meta-data files, such as directories. If a directory file is not in the disk or memory cache, performance is very poor when referencing any file below it in the name space. Given that tape systems are rarely random access and have generally poor start/stop behavior, meta-data operations couldn't be handled until the file had first been moved to the disk or memory cache. The implications of this are more clear when one considers what it would be like to fsck such a file system. That operation performs random access reads to every meta-data file in the system to check its integrity.

5. Why AFS?

We chose AFS [7][8] for our distributed file system for a number of reasons. From a user's perspective it has several desirable features including a global name space where the path to a file is the same from any client, Kerberos V4 security, access control lists on each directory, semantics reasonably close to UNIX file system semantics, and good performance using client-side caching.
Some of the important management features include the volume abstraction, location transparency for volumes, read-only volumes, replication of volumes across multiple servers, and volume quotas. We also observed that AFS scaled well; its servers could support a much larger number of clients per machine than NFS. This is due to its use of callbacks to inform clients when data they had cached became invalid. This eliminates the need for clients to make RPCs to the servers whenever they want to validate some object they have cached.

6. Mass Storage Extensions to AFS

The most fundamental goal of our extensions is to preserve the semantics of AFS and not attempt to make it an expert in mass storage systems. By leveraging each type of technology that is available, we provide a framework to allow each component to do what it does best. The file system handles the user interface and access semantics, and the mass storage system handles the bytes. The logical separation is that the file system handles the meta-data, and the mass storage system handles the user data. The system we created to implement this model is named Multi-Resident AFS.

Many changes are required to make AFS function as part of a true mass storage system. The main design goals are hierarchical storage management, fault tolerance, and access to existing technology, while still retaining compatibility with standard AFS clients and servers. We achieved the former goals through extensive generalizations of the AFS server code, and the latter by working within the standard AFS data structures.

7. Multiply-Resident Data

Our extensions allow AFS to move from a system capable only of serving UNIX disk partitions to a true hierarchical storage system. Where files were once restricted to exist in only one place, they are now able to exist in up to 32 separate locations. These new copies of data, known as residencies, give AFS many advantages.
These additional residencies not only protect AFS against media failure, but also provide high availability of user data. If a storage system is unavailable for preventive maintenance or any other reason, AFS will automatically obtain the data from the highest priority storage system that is available. This is all completely transparent to users. In this manner, we can take preventive maintenance time with minimal impact on our users.

This functionality is similar, but orthogonal, to AFS volume replication. Here we are talking about per-file replication, where each file is treated independently. In AFS volume replication, all of the files within a volume are copied at the same time and to the same destination.

The ability to have multiple copies of file data becomes more important as support for mass storage is added. A single 8 mm tape can hold 50,000 files with an average size of 100 KBytes. The loss of this single tape for any reason would have an enormous impact on users. As file servers gain the ability to utilize ever larger capacity media, this danger grows. This applies to disks as well as tape, given how fast the storage densities of disks are growing.

Figure 1 shows how an AFS volume, the management unit for an aggregate of files, changes with this additional support. Notice that files can be put on different storage systems based on their sizes.

8. Consistency

Allowing a file to exist in more than one storage system brings up the problem of ensuring that each copy contains the same data. When a multi-resident file is modified, only the new, modified copy remains. All of the references to copies of the previous version of the file are removed. This new version of the file is then treated the same as any new file in terms of data migration. The file server also guarantees that no partially created copy of a file will ever be referenced.

Figure 1. Volume layout

9. Data Migration

With the addition of multi-resident support, it is necessary to have a data migration system to add and remove these additional residencies. There is a small replicated residency database which contains information about the known storage systems, but not volume or file level information. For each storage system, the information includes a relative priority, a desired file size distribution, the list of machines that offer the storage system, and various data migration policy variables.

The goal of the data migrator is to optimize the use of storage by transparently moving data between members of the storage hierarchy. This includes adding redundant copies of data to provide fault tolerance, removing copies of data from storage systems whose free space falls too low, moving hot files to faster media, and moving cold files to slower media. Another important task of the data migrator is to ensure that each storage system has the desired file size distribution. This way we can tune each storage system to a specific target range of file sizes, making it much more efficient. The policies governing these actions can be changed easily, either by changing the contents of the residency database or by changing a configuration file. All policy variables in the system are dynamic and can be changed without interruption of service.

The data migrator also has the ability to move files or whole volumes to off-line storage. These file residencies are unavailable to users but can be migrated back by the site manager. This feature is intended to handle the case where a collection of data is no longer in use, but may be needed at a later date.

It should be noted that while this system moves data between storage systems, it has no knowledge of what those storage systems do with the data. For example, AFS might create a residency in an NFS archival system in which the data initially lands on a disk and is later moved to an optical platter.
AFS has no need to know that the file has been moved within the NFS storage system.

Another important responsibility of the data migration system is to create random access residencies of files on demand. AFS clients access files from the AFS server in a random access fashion. If the file does not exist on a random access storage system when a fetch request is received, the data migration system must create an additional residency. We call this process "spooling a file", and we call storage systems that do not offer random access "archival" storage systems. This ability to spool a file allows AFS to make use of storage systems for which random access is either not available, or too slow to be practical.

Figure 2.

10. Data Flow

At this point we will describe how data is moved through the storage systems. Figure 2 shows how a file is created. A storage system is chosen based on the file's size and the free space on the random access storage systems. The residency database is consulted to determine which storage system is the best choice for this new file. This newly created file is also added to a data migration queue for possible creation of additional residencies on archival storage.

Figure 3 shows how this data migration queue is processed into requests to copy the file to one or more archival storage systems. The residency database has a policy variable for each archival storage system that indicates how old a file must be before it gets an additional copy made there. Setting the minimum age to a small value reduces the window of vulnerability to media failures, but this also increases the likelihood that a temporary file will be moved to archival storage unnecessarily. For our environment we have found a minimum age of six hours to be a reasonable value.

Figure 3.

Figure 4 shows how space is made available when there is a shortage.
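The file creation and migration queue steps described above (Figures 2 and 3) can be illustrated with a minimal Python sketch. It is not the Multi-Resident AFS implementation: the names StorageSystem, QueuedFile, choose_system, and process_queue are hypothetical, as is the assumption that a larger priority number means a more preferred storage system. Only the inputs (file size, free space, priority, and a per-system minimum age) come from the text.

```python
from dataclasses import dataclass, field

SIX_HOURS = 6 * 60 * 60  # the minimum age reported in the text, in seconds


@dataclass
class StorageSystem:
    name: str
    min_size: int    # smallest file size (bytes) this system is tuned for
    max_size: int    # largest file size (bytes) this system is tuned for
    free_bytes: int
    priority: int    # assumption: larger number means higher priority


@dataclass
class QueuedFile:
    name: str
    created_at: float                            # seconds since the epoch
    copied_to: set = field(default_factory=set)  # archival systems already used


def choose_system(size, systems):
    """Pick the highest-priority system whose size range and free space fit."""
    fits = [s for s in systems
            if s.min_size <= size <= s.max_size and s.free_bytes >= size]
    return max(fits, key=lambda s: s.priority, default=None)


def process_queue(queue, min_age, now):
    """Turn queued files into copy requests once they reach the minimum age.

    min_age maps an archival system name to its minimum age in seconds.
    Returns (requests, waiting); the caller is assumed to drop deleted
    files from the queue before calling this.
    """
    requests, waiting = [], []
    for f in queue:
        pending = False
        for system, age_needed in min_age.items():
            if system in f.copied_to:
                continue
            if now - f.created_at >= age_needed:
                requests.append((f.name, system))
                f.copied_to.add(system)
            else:
                pending = True  # too young for this system; check again later
        if pending:
            waiting.append(f)
    return requests, waiting
```

A temporary file that is deleted before it reaches the minimum age never produces a copy request, which is the trade-off the minimum age setting controls.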
There is a new process that runs on each file server machine which checks storage systems for insufficient free space every five minutes. It processes the volume meta-data and bins up files that are candidates for removal from each storage system. A primary weighting function is used to determine into which bin a file should be placed and a secondary weighting function is used to sort the files within a bin. These weighting functions are configurable; we use seconds since the last user access as the primary weight and file length in bytes as the secondary. The deletion policy also includes the ability to provide a list of free space thresholds to be applied to different primary weight ranges. Once the bins of candidate files are created, the free space reported by the residency database is compared to the threshold that should be applied to the highest weight bin. If there is insufficient free space on a particular storage system, requests to remove candidate files from that storage system are sent to the file server until we reach a stopping threshold. These requests must go through the file server to preserve the AFS locking semantics.

One of the major advantages of using last access time as the primary weight is that it eliminates the need to process the volume meta-data every time the checker wakes up. Any bins of candidate files that remain from the last run are still valid. The highest weight bin from the last run is still the highest weight bin because all files age at the same rate. This would not be true if the primary weight were file length.

Figure 4.

Figure 5 shows how a file is spooled from a non-random access storage system to a random access storage system when a client tries to read it. As mentioned previously, the FetchData RPC requires a random access copy of the file. When the RPC is processed, the file server looks for a random access copy of the file.
When one isn't found, a new copy is made on a storage system chosen by the residency database, the decision being based on file size and storage system priority. Once the new copy has been created, the FetchData RPC continues. The new copy is not considered valid, and is not referenced, until it has been made completely. This is intended to prevent a file from having multiple residencies with different data.

Figure 5.

One of the other interesting features of this data migration architecture is that by adding additional file residencies as soon as a file has reached a minimum age, we can free up space on our random access storage systems at the speed of a delete operation. Many systems don't begin to create additional residencies until the random access storage system starts to get full. That can often lead to a situation where free space can only be made available at the speed at which one can write data to the generally slower archival storage system.

Also, unlike most other architectures, when one modifies a file there is no requirement that it be on random access storage before the modification can begin. Most products spool a file to a disk first, which is wasted effort if the modification involves a complete overwrite of the file. Our system treats the new version of a file as a separate copy of data and will only access the previous version in the event that some part of the old file was not overwritten. This also avoids getting into the situation where a file whose old version was small and whose new version is large ends up residing on a storage system reserved for small files.

Considering the exponential growth of data being stored, it becomes very difficult, as well as expensive, to do backups. The ability to create additional residencies for each file helps a great deal with this backup problem by offering a new level of fault tolerance. Our extended AFS system essentially provides a continuous, per-file, backup system.
This has enormous performance advantages over the usual snapshot backup systems which operate on an aggregate of files. Since every file is being treated independently, there is no need for the system to be quiescent to ensure self-consistent meta-data. Additionally, we can control which files are backed up, and to where, based on characteristics such as size and access history. However, this does not handle the case where a user deletes a file by accident. The backup volume ability in AFS provides some limited recovery from this failure, but backup volumes are often remade each night. If a user does not realize the mistake in time, the backup volume will not help. More will be said on this point in the lessons learned section.

Another important advantage of multiply-resident files is that we are able to bring new storage systems into service and take old ones out of service in a user transparent fashion. It is often the case that when replacing a major storage system, there is no simple way of moving the data from the old system to the new one. With only two directives, our AFS system will do the migration automatically.

11. Generic I/O Interface

Many of the storage systems we wanted to integrate with AFS did not offer the UNIX file system interface required by the AFS open-by-inode I/O calls. To solve this, we generalized the I/O system of AFS. By doing all I/O through a generic interface, the UNIX file system dependence of AFS is removed. This allows AFS to take advantage of the available technology without having to reinvent it. In order to interface to a given storage system, only a small I/O library is required. AFS can then access a wide variety of storage systems, including UniTree, CRAY Data Migration Facility, Epoch, and a Maximum Strategy RAID disk system.

There are three types of routines that must be provided when creating an I/O interface. The first group of routines does file management.
Open, Read, Write, Seek, Truncate, CloseIfOpen, Stat, and FsyncFile are generally well understood routines so we won't describe them here, except to say that an interface that doesn't support random access would not need to actually implement a seek routine. The other file management routines include IncDec and BulkIncDec, which are used to change the volume reference count on file data, and GenerateLookupTags, which is used to create a 64-bit identifier for a new file on a storage system. The first two routines are similar to link and unlink, and the last routine is similar to create.

The next group of routines are storage system management routines. They include GetFileSystems, GetFsAdvisoryLock, ReleaseFsAdvisoryLock, StatFs, VerifyResidency, ListFiles, ListVolumeHeaders, and Configure. These routines can be thought of as operating at the UNIX partition level. GetFileSystems is similar to returning a mount table. GetFsAdvisoryLock and ReleaseFsAdvisoryLock are used to control access to a storage system during execution of a salvage operation, the AFS equivalent of fsck. StatFs returns the free and total space available on a storage system. VerifyResidency reports whether a storage system is available for use. ListFiles is used to dump the list of files that exist on a storage system; this is equivalent to dumping the inode table on a UNIX file system. ListVolumeHeaders returns the list of AFS volumes that exist on a given local disk storage system. Finally, Configure is used to set up a storage system when it is first brought into service.

The last type of routine is the optional Import function. It addresses the problem of how to merge an existing mass storage system with AFS. The Import routine allows one to create a file in the AFS name space and map it to a pre-existing piece of user data. It is intended that this routine will just alter the lookup handle of the user data to conform with the AFS naming scheme.
No actual movement of the user data should be involved. This can be thought of as a rename operation, but not a copy. This feature also allows one to store data into a storage system shared with AFS, using some other mechanism, and later add it into the AFS name space.

12. AFS Servers Share Storage Systems

Up to this point we have only discussed a single AFS server environment. But in practice we have several servers, each with its own set of storage systems. Because of the expense involved in providing every server with direct access to every storage system, we provide a mechanism that allows the servers to share their storage systems. This mechanism, known as the remote I/O server, allows one AFS server to execute I/O calls on a storage system that is not directly connected to it. The remote I/O server is a small RPC service that allows a machine to offer to AFS servers all of the storage systems to which it has direct access. This has the added advantage of allowing us to bring more machines into the server hierarchy without having to run full AFS file servers on them. As long as a machine is capable of running an AFS RPC service, it can offer storage systems to AFS file servers.

It should be noted that each file server has a set of storage systems defined to be its "local disk" which only it may use. This "local disk" system corresponds directly with the standard AFS UNIX disk partitions in that data that is private to a server is kept here.

13. Salvaging

Salvaging consists of comparing the meta-data files, including directories, with what actually exists out on the various storage systems. Our system has a rewritten salvager to deal with the new problems that arise in this system. Now that files and directories can be spread out across multiple storage systems and multiple servers, the salvage process becomes much more complex. It is quite possible that when a salvage is being executed, not every machine or storage system is available.
The new salvager is smart enough to handle this case without removing references to files whose existence can't be verified.

Another major issue is salvaging directories that are not on a random access storage system. The salvager must read the contents of every directory so that files can be claimed. Any files that have not been claimed need to be added to a lost+found directory. Many hierarchical storage products do not allow directories to be migrated off of the local disk because of the need to have them available for salvaging in a random access fashion. The new salvager gets around that limitation by guaranteeing that directory check operations do no seeks and therefore do not require random access. This allows directories to be migrated off of the disks without the need to spool them back from tape whenever the salvager runs.

There is also support in the salvager for doing selective salvages of storage systems. As a rule, we trust our archival storage system not to lose files. Therefore we don't include it in the storage systems we salvage by default. This saves a great deal of time.

14. Comparison With Standard AFS

In light of what has been described above, we will now take a look at how some fundamental AFS concepts have changed. In standard AFS, the logical unit of files, a volume, is tied to a specific UNIX partition on a specific server. This has been extended to allow the volume to span many storage systems, as shown in Figure 1. By limiting only the volume headers to the original server and partition, we can effectively utilize a hierarchy of storage systems for user data, while retaining backwards compatibility with standard AFS volume utilities. Previously, one had to worry about assigning too many volumes to a single UNIX partition and the degree to which its space was being over-allocated. In the new model, this is no longer a consideration.
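The salvage rules from section 13 can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the actual salvager: the function name salvage and its data layout are hypothetical, residency references are modeled as (file id, storage system) pairs, and a listing of None stands for a storage system that is unavailable or deliberately excluded from the salvage.

```python
def salvage(references, listings):
    """Compare residency references against per-system file listings.

    references: list of (file_id, system) pairs taken from the meta-data.
    listings:   dict mapping a system name to the set of file ids found
                there (a stand-in for ListFiles output), or to None when
                the system is unavailable or excluded from this salvage.
    Returns (kept, dropped, orphans).
    """
    kept, dropped = [], []
    claimed = set()
    for file_id, system in references:
        listing = listings.get(system)
        if listing is None:
            # Existence cannot be verified, so the reference is retained.
            kept.append((file_id, system))
        elif file_id in listing:
            kept.append((file_id, system))
            claimed.add((file_id, system))
        else:
            # The system answered and the file is not there.
            dropped.append((file_id, system))
    # Files present on a checked system but claimed by no reference
    # are the candidates for lost+found.
    orphans = sorted(
        (file_id, system)
        for system, listing in listings.items() if listing is not None
        for file_id in listing
        if (file_id, system) not in claimed
    )
    return kept, dropped, orphans
```

In this sketch, skipping the archival storage system in a selective salvage amounts to passing None for its listing, so its references are kept without the cost of listing its files.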
Standard AFS has the concept of a read-only volume that is a replica of a read-write volume that may exist on another server and/or partition. These read-only volumes are used to distribute the load to a very active, and slowly changing, collection of files among a number of AFS servers. They also provide a limited amount of fault tolerance. If one replica is unavailable, AFS clients will automatically fail over to another replica. The improvement we have made to this system is that the read-only replica(s) and the read-write volume actually share copies of data in the shared storage systems, even if the volumes reside on different servers. In this manner we gain all of the advantages of having replicated volumes without paying the storage penalties. It is important to remember that we no longer use replicated volumes to protect against media failure; the multiple file residencies take care of that. Therefore we lose nothing by having the replicas share copies of data. The replicated volumes are used to spread out the load of AFS name space operations.

The lookup information for files in standard AFS is kept in one of two volume meta-data files in a data structure called a vice node (vnode). In order to support multiple residencies of a file, while working within existing data structures, we create chains of vnodes. The original vnode of a file is unchanged except that two previously spare fields are used to point to other vnodes. These other vnodes hold the lookup information for additional file residencies. This approach allows system utilities and dump formats to be backwards compatible.

Standard AFS keeps track of very little in the way of a file's usage history. We add a per-file access history with the addition of another vnode to the chain. This history includes details of the most recent accesses and keeps various counters that let us know how well the system is managing each file.
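The vnode chain idea can be pictured with a small Python sketch. The field names, the kind labels, and the use of a single next_index pointer are assumptions made for illustration; the text says only that two previously spare fields point to other vnodes and does not give the actual layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vnode:
    kind: str                             # "file", "residency", or "history" (hypothetical labels)
    storage_system: Optional[str] = None  # where this residency lives, if any
    lookup_tag: Optional[int] = None      # identifier of the data on that system
    next_index: Optional[int] = None      # stand-in for a formerly spare field


def residencies(table, index):
    """Walk the chain from a file's original vnode and list its residencies."""
    found = []
    seen = set()
    while index is not None and index not in seen:
        seen.add(index)  # guard against a corrupted, looping chain
        vnode = table[index]
        if vnode.storage_system is not None:
            found.append((vnode.storage_system, vnode.lookup_tag))
        index = vnode.next_index
    return found


# A file with a local disk residency, an archival residency, and a history vnode.
table = {
    0: Vnode("file", "local", 11, next_index=1),
    1: Vnode("residency", "dmf", 22, next_index=2),
    2: Vnode("history"),
}
```

A utility that understands only the original vnode still sees a valid file entry, which is the backwards compatibility the chain is meant to preserve. The history vnode in this sketch carries no storage location; it stands in for the per-file access history described above.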
This information is used by the data migration system to determine which files should be migrated and to where. We also provide reporting tools that generate reports based on this information.

15. Current System Usage

Our current mass storage system is shown in Figure 6. We use the local disk and the Maximum Strategy RAID disk banks for random access storage and CRAY Data Migration Facility (DMF) for archival storage. We configured our residency database to put files of 64 KBytes or less on the local disk residency and larger files on the RAID disks. Files that are over 4 KBytes will be migrated to DMF. The AFS clients range from hundreds of workstations to a CRAY C90. The bulk of the files are created by the workstations and the bulk of the data is created by the CRAY. The workstations include many personal workstations and a few dozen members of workstation farms doing large scale data processing.

Figure 6.

Figures 7 through 10 describe how this system is being used. Figure 7 shows that our average file size is somewhat larger than the typical UNIX file [5][9].

Figure 7.

Figures 8 and 9 show how the system sends small files to the local disk and larger files to the RAID disk banks.

Figure 8.

Figure 9.

Figure 10.

Figure 10 shows that almost seventy percent of the files, about half the data, are written but never read back from the file server. This should not be taken to mean that the file server processes mostly write requests. It demonstrates that a large share of our files and data are not used for longer than they can remain in our AFS clients' caches. This can be attributed to both the AFS client caching mechanism and the fact that our clients have relatively large cache sizes.

As of 7/5/94 the system managed 43434 directories (100 MBytes), 591044 files (221 GBytes), and 18736 symbolic links (338 KBytes).

16. Lessons Learned

File servers with integrated mass storage support can service much more data per server than conventional servers.
These systems not only save money by reducing the number of disks needed, but can reduce the number of file servers needed as well. The number of disks that can be attached to each file server becomes less of a consideration. The issue of how many file servers are needed to meet the transaction load remains. Currently, we service as much data from two RS/6000 servers running this software as we can service with six DECstation 5000 servers running standard AFS.

As the space available for storing files increases, so does the average file size. Between November 1992 and July 1994 the average file size grew from 51 KBytes to 338 KBytes. However, the median file size has not changed. This tells us that users are making use of the fact that they can now store more large files than ever before, but the typical file size hasn't changed.

File servers have more information on when a file is likely to be used again than the mass storage systems they use to store the bytes. Often the storage systems come with software which attempts to determine when files should be moved between its high and low performance media. The assumptions they use in that determination won't take into account the fact that the file server is caching the file on other storage systems that it may consider higher priority, and that AFS clients are also caching that data. For tape management systems, we find it best to issue a directive to indicate that it is relatively unlikely that it will receive any read requests for a file recently stored there. Such files are already on a random access storage system and probably in an AFS client's cache as well.

Many systems block writers when there isn't enough free space to continue. The idea is that files will be migrated to free up space and then the writers will be woken up.
This does not work for AFS because there is a limited number of execution threads on the server, and blocking writers ends up shutting out all client operations while the server waits for space to be freed. In addition, the AFS clients get stuck waiting for their RPCs to the server to complete, and AFS clients do not have the ability to interrupt Fetch and Store operations. In this architecture, if the server could keep up with the demand, it wouldn't need to block. If it can't, there is no reason to expect that blocking the writer will accomplish anything in a reasonable amount of time. Allowing servers to share storage systems helps a great deal in handling bursts of load, but when space runs out, a failure should be returned to the clients. A possible improvement would be the ability for clients to retry store operations later. This shouldn't be difficult, as the client is already caching the file, but it would require modifications to the AFS client software.

As with all mass storage systems where unused files are not likely to be on low latency storage, users learn not to do certain operations. A user will rarely make the mistake of typing 'file *' across a bunch of old files more than once. This learning process no doubt reduces the load on the mass storage system and further reduces the likelihood that old files are ever read again.

Even though the file server has longer code paths than standard AFS, the performance for most operations is at least as fast. This is true because, in our experience, AFS performance is limited by the RX transport layer and the ability of the server to process its UDP packets. This has remained true in the latest release of AFS, version 3.3a, when using an FDDI network. The server CPU saturates when trying to send a large number of UDP packets per second over FDDI. Therefore the extra server code does not have a noticeable effect.
It is also possible to get better performance than standard AFS because of the ability to use faster media and I/O interfaces that are optimized for a certain file size range. For example, our RAID-3 disks are much faster than our SCSI disks when transferring files over two megabytes in length. These RAID-3 disks cannot be accessed from standard AFS because they do not have a UNIX file system structure.

When running performance tests with a CRAY C90 client, an RS/6000-580 server, and an FDDI network, there is no measurable performance difference between standard AFS and Multi-Resident AFS when reading and writing files. Writes happen at about 1.2 MBytes/sec and reads at about 1 MByte/sec.

There are two operations that are noticeably slower than in standard AFS: file deletion and salvaging. Since files generally reside in more than one storage system, it takes longer to delete them. We will be addressing this problem by making file deletion asynchronous. The user will get an immediate return once the RPC has been processed, but the actual file deletion will occur at a later time. This should make file deletion faster than in standard AFS, at least from the users' perspective.

The salvager is slower for several reasons. There are more storage systems from which the list of resident files must be obtained, and there is more volume meta-data than in standard AFS to check for validity. For the system described earlier, we have found it to take 1.5 hours to salvage all of the volumes in the cell. The bulk of this time is spent running the ListFiles routine across the storage systems.

It has been suggested that performance could be improved if the server prefetched a whole directory at a time. The idea is that if a user accesses a directory, they are highly likely to read files in that directory. Analysis of our file accesses indicates that while it's true that a file in that directory is likely to be read, most of the files won't be.
The net effect of this strategy is to do much more I/O on the server and to use much more high priority storage than is strictly necessary. The overall performance of the system actually goes down. A mechanism that might allow this to work effectively would be the addition of affinity groups. A file could be added to one or more groups of related files to indicate that they are referenced as a group. This ability would offer another potential performance improvement when the files are out on tape. Many mass storage systems that offer media with a high mount latency attempt to reorder requests so that they are grouped by tape, thereby reducing the number of mounts. Since AFS clients do not send bulk requests for file fetches to the server, there is currently no ability to take advantage of this feature. With the addition of affinity groups, related files could be put on the same tape.

When configuring the system, we had to choose how old a file must be before it gets an additional residency on our tape system. The shorter this time, the faster one can free up space when needed. However, if a file is going to be temporary, adding a tape residency is a waste of resources. We found that a minimum age of six hours seems to keep most of the temporary files from being added to archival storage.

We also needed a way to back up files in case a user accidentally deleted their data. Using the AFS backup volume to provide access to the previous night's file system handled most, but not all, cases. We did not have the option of doing full backups because of the scaling problem. Additionally, most of our data was out in archival storage, requiring a tape mount for almost every file just to dump its contents. There was no possibility of finishing volume dumps within 24 hours. What we chose to do was to dump the meta-data for all files, but only the data that was resident on the local disk.
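
The dump rule just described (meta-data for every file, data only for files with a copy on the local disk) can be sketched as follows. This is an illustrative sketch only, not the PSC implementation: the `FileEntry` record, the residency names `local_disk` and `archive`, and the `plan_dump` function are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical residency names; the paper does not give identifiers.
LOCAL_DISK = "local_disk"
ARCHIVE = "archive"


@dataclass(frozen=True)
class FileEntry:
    """Assumed per-file record: a path plus the storage systems holding a copy."""
    path: str
    size_bytes: int
    residencies: FrozenSet[str]


def plan_dump(files: Iterable[FileEntry]) -> Tuple[List[str], List[str]]:
    """Return (paths whose meta-data is dumped, paths whose data is dumped)."""
    entries = list(files)
    # Meta-data is dumped for every file, wherever its bytes live.
    metadata_paths = [f.path for f in entries]
    # Data is dumped only when a copy is on local disk, so no tape mount is needed.
    data_paths = [f.path for f in entries if LOCAL_DISK in f.residencies]
    return metadata_paths, data_paths
```

Under these assumptions, a file whose only residency is `archive` appears in the first list but not the second, so dumping it never requires a tape mount.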
Since most of our files are small, they tended to land on the local disk and were included in the dump. For the larger files, we took advantage of the fact that our archival storage system supported both soft and hard deletion. When AFS files were deleted, they remained on tape until an administrator did a hard deletion. Having the meta-data for all files in the dumps allowed us to find files that had been deleted by users.

Currently, the data must still flow through the file servers. We are adding the ability to do third party transfers to eliminate this potential bottleneck. This will allow data to flow directly between the client and the storage system without passing through the file server's memory. This feature requires AFS client modifications, so it will not be available to standard AFS clients.

17. Summary

The Multi-Resident AFS architecture successfully integrates a wide variety of mass storage systems with a distributed file system. Its ability to separate the user interface and access semantics from the mechanisms by which data is stored makes the architecture very flexible. We believe it addresses the issues involved in bringing mass storage systems into more mainstream environments.

18. Acknowledgments

UNIX is a trademark of UNIX System Laboratories.

19. References

[1] Collins, M. W., Mexal, C. W., "The Los Alamos Common File System," Tutorial Notes, Ninth IEEE Symposium on Mass Storage Systems, October 1988.
[2] Tweten, D., "Hiding Mass Storage Under Unix: NASA's MSS-II Architecture," Tenth IEEE Symposium on Mass Storage Systems, pp. 140-145, May 1990.
[3] Foster, A., Habermehl, D., "Renaissance: Managing the Network Computer and its Storage Requirements," Eleventh IEEE Symposium on Mass Storage Systems, pp. 3-10, October 1991.
[4] McClain, F., "DataTree and UniTree: Software for File and Storage Management," Tenth IEEE Symposium on Mass Storage Systems, pp. 126-128, May 1990.
[5] Antonelli, C.
J., Honeyman, P., "Integrating Mass Storage and File Systems," Twelfth IEEE Symposium on Mass Storage Systems, pp. 133-138, April 1993.
[6] Quinlan, S., "A Cached WORM File System," Software Practice and Experience, Vol. 21, No. 12, pp. 1289-1299, December 1991.
[7] Morris, J.H., Satyanarayanan, M., Conner, M.H., Howard, J.H., Rosenthal, D.S.H., Smith, F.D., "Andrew: A Distributed Personal Computing Environment," Communications of the ACM, Vol. 29, No. 3, pp. 184-201, March 1986.
[8] Satyanarayanan, M., "Scalable, Secure, and Highly Available Distributed File Access," IEEE Trans. Computers, pp. 9-21, May 1990.
[9] Ousterhout, J., DaCosta, H.L., Harrison, D., Kunze, J., Kupfer, M., Thompson, J., "A Trace Driven Analysis of the Unix 4.2 BSD File System," Proceedings of the Tenth ACM Symposium on Operating Systems Principles, Orcas Island, December 1985.

20. Suggested Reading

"Mass Storage System Reference Model: Version 4," edited by Sam Coleman and Steve Miller, IEEE Technical Committee on Mass Storage Systems and Technology, May 1990.
"A Joint European Mass Storage Specification Effort," edited by Dick Dixon, European Centre for Medium-Range Weather Forecasts, Thirteenth IEEE Symposium on Mass Storage Systems, pp. 110-112, June 1994.

21. Biographies

Jonathan S. Goldick received a B.S. degree in Physics and an M.S. degree in Electrical Engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 1988 and 1989, respectively. He is currently a Senior Technical Specialist at the Pittsburgh Supercomputing Center, Pittsburgh, PA, USA. His research interests include the design, development, and analysis of distributed mass storage systems.

Kathy Benninger has been a hardware systems engineer with the Scientific Support Group at the PSC for four years. Her focus is on mass storage system specification and integration and on supporting systems for scientific visualization. She received a BSEE from Carnegie Mellon University in 1984.

Christopher Kirby is a Senior Research Systems Programmer at the Pittsburgh Supercomputing Center and has worked there since 1990. He received a B.S. in Applied Math/Computer Science from Carnegie Mellon University in 1988 and an M.S. in Computer Science from New York University in 1990. He was previously employed at AT&T Bell Laboratories. His professional interests include design and development of mass storage and distributed file systems.

Christopher J. Maher is Director of Scientific Support at the Pittsburgh Supercomputing Center. He has been with PSC since 1988, as a Scientific Specialist, Scientific Support Manager and Scientific Support Director. At PSC he has overseen the installation of the Center's first mass storage system, supervised the development of Multi-Resident AFS, and is responsible for most of PSC's software development projects. Maher held postdoctoral fellowships at the Massachusetts Institute of Technology and Carnegie Mellon University prior to joining PSC's staff. He received his B.S., M.S. and Ph.D. in Physics from Carnegie Mellon University in 1980, 1982 and 1986, respectively. Maher has authored and co-authored numerous scientific publications.

Bill Zumach is a Senior Research Systems Programmer at the Pittsburgh Supercomputing Center, where he has been employed since 1992. He works on design and development of distributed file systems. He received a B.A. in Mathematics from the University of Minnesota in 1987. He previously worked at the Astronomy Department of the University of Minnesota designing data collection and analysis software. His research interests include mass storage and distributed file system design and analysis.