Metadata Logging in an NFS Server Uresh Vahalia, EMC Corporation Cary G. Gray, Abilene Christian University Dennis Ting, EMC Corporation ABSTRACT Over the last few years, there have been several efforts to use logging to improve performance, reliability, and recovery times of file systems. The two major techniques are metadata logging, where the log records metadata changes and is a supplement to the on-disk file system, and log- structured file systems, whose log is their only on- disk representation. When the file system is mainly or wholly accessed through the Network File System (NFS) protocol, it adds new considerations to the suitability of the logging technique. NFS requires that all operations be updated to stable storage before returning. As a result, file system implementations that were effective for local access may perform poorly on an NFS server. This paper analyzes the issues regarding the use of logging on an NFS server, and describes an implementation of a BSD Fast File System (FFS) with metadata logging that performs effectively for a dedicated NFS server. 1. Introduction Recent years have seen improvements in CPU speeds that have not been matched by comparable improvements in disk access speeds. As a result, disk I/O has become the new bottleneck in operating system performance [Oust 90]. Traditional file systems perform poorly when run on fast machines with relatively slow disks. This has motivated the development of new file systems that seek to reduce the frequency and latency of disk access. One important technique is to use logging, which holds the promise of higher performance, greater reliability, and quick crash recovery. Transparent access to files on other machines is a relatively new development. Consequently, most file systems are designed primarily for local access. When the same file system is used for exporting files and directories over a network, many assumptions inherent in its design are invalid, and the file system may perform poorly. In particular, the Network File System (NFS) protocol [Sand 85] requires the server to commit any file system modification to stable storage before returning the results of a request. Consequently, an NFS file system must perform many more disk writes, most of them synchronous, than a file system used for local access. Recent implementations of logging in file systems require extensive changes both to the kernel algorithms and to the on-disk structures. Development costs are high, since not only the kernel code, but also utilities such as newfs(8) and dump(8) must be rewritten. To upgrade to the new file system, the system administrator must back up all partitions, boot the new kernel, reformat the disks, and finally restore all files. Their performance gains, even for local access, are not significant compared with other optimizations such as file system clustering [McVo 91]. Moreover, current logging file systems are optimized for local access, and their advantages are reduced for NFS use. The Calaveras project [Rama 94] was an advanced development effort at Digital Equipment Corporation. Its aim was to build a dedicated, high- performance, multi-protocol file server using commodity hardware and conforming to existing software standards. Its first prototype was an NFS server running on Intel platforms. While designing the Calaveras file system, we wanted to gain the advantages of logging, namely high reliability and quick crash recovery, while retaining the on-disk layout of the BSD Fast File System (FFS) [McKu 84]. It was also critical to optimize the file system for NFS-only access, which required addressing some of the drawbacks of existing logging implementations. We describe here the implementation and performance of the Calaveras file system, which uses metadata logging to enhance a traditional FFS. 2. Background The original UNIX file system [Thom 78] uses simple disk layout and algorithms, but performs very poorly. It uses small, fixed-size blocks that are allocated randomly from the disk. FFS, introduced in the 4.2BSD release, provides a major improvement. FFS uses large block sizes (typically 4K or 8K bytes), and tries to optimize disk access by intelligent placement of blocks. It divides the disk into cylinder groups, comprising a set of contiguous cylinders. Each cylinder group stores file data as well as metadata (inodes, directories, and indirect blocks), and the allocation algorithm tries to place related information in the same cylinder group. It also tries to minimize rotational delays by predicting the amount of disk rotation between consecutive read operations (the rotational delay, or rotdelay), and separating successive blocks of the same file by that amount. There are several limitations to FFS performance. The rotdelay factor optimizes for the case when the file is accessed one block at a time. At the same time, it limits the disk performance to a fraction of raw disk access speed. For instance, if the rotdelay for a disk is set to one (the best case), two successive blocks of a file are separated by one unrelated block. This limits the disk bandwidth to half its raw capacity. Secondly, many file operations require several different I/O operations, some of which have to be done synchronously to preserve file system consistency. For example, a file create operation allocates and writes a new inode, and modifies the parent directory and its inode. This requires three disk accesses (more if the blocks are not already in memory), each of them potentially involving a head seek. Further improvements have come in three major directions. One is to modify file system algorithms to reduce and optimize disk accesses, such as I/O clustering [McVo 91] and write-gathering of NFS operations [Jusz 94]. The second is to use non- volatile read-only memory (NV-RAM) [Mora 90, Hitz 94] to delay and batch writes that normally must be synchronous. The third is to use logging as either a supplement or a substitute for normal file system writes. The logging technique is particularly attractive since its benefits are not restricted to performance. It offers increased reliability, since the log may replicate some or all of the file system data and metadata. It also allows quick recovery after a system crash, since a log playback is usually much faster than the file system checking and patching performed by fsck(8) [Kowa 78] in traditional systems. This is an extremely important consideration for installations that require high availability and cannot tolerate long delays due to crashes. 2.1. Logging file systems A file system can use logging in two ways - it can be log-structured, or log-enhanced. The former approach represents the entire file system as a single, continuous log [Finl 87, Oust 89]. It relies on a large cache to handle most read requests, and tries to write the log in large chunks, sequentially on the disk. Operations are batched as far as possible to avoid small writes. Whenever a block is modified, it is simply rewritten to the tail of the log instead of updating in place, and other data structures are updated and similarly rewritten to reflect the new position. In time, many blocks in the log become invalid, either because they have been rewritten further ahead in the log, or because the corresponding files have been deleted. The file system tracks and garbage collects these blocks, freeing up space needed when the disk is full and the log must wrap around. This requires compacting the free space, rewriting scattered active blocks to the head of the log. [Selt 93] describes an implementation of a log- structured file system (LFS) for 4.4BSD UNIX, based on similar work for Sprite in [Rose 91]. While LFS performs better than traditional FFS, its performance gains are matched, and in some ways bettered, by simpler enhancements to FFS such as the file system clustering work of [McVo 91]. LFS also involves a major code rewrite and on-disk structures that are incompatible with FFS. Moreover, the performance benefits of LFS come directly from the ability to write the log in large chunks. This is generally not possible with NFS, which requires synchronous commits. Finally, the garbage collection and compaction costs substantially reduce the performance; in some benchmarks, the performance of LFS with garbage collection was as much as 20% worse than standard FFS. In log-enhanced file systems, the log is a supplement to, and replicates information in, the normal on-disk structures [Hagm 87]. Typically, the log only records changes to metadata objects (inodes, directories, allocation maps, etc.), perhaps in an ordinary file in the same file system. If the system is shutdown gracefully, the log can be discarded, since the file system is up to date and consistent. In the event of a crash, however, the log is used to rebuild the file system. During normal operation, each metadata write is first written synchronously to the log, and the on-disk structures are updated later during cache flushes. Hence after a crash, the on- disk structures may contain stale data, but the log has a record of all completed operations, and can be played back to recover the file system to a consistent state. Metadata logging may improve performance as well. On one hand, each metadata update is written to disk twice - once to the log, and once to its normal location on disk (we call this write the in- place update). On the other hand, since the in-place updates are delayed, they are often eliminated or batched. For example, the same inode may be modified several times before it is flushed, and multiple inodes in the same disk block are written out together. The log writes are batched as well. For a single operation such as create, the changes to the directory and the two inodes can be combined in a single log entry. Multiple operations that are temporally close to each other can be similarly batched. This reduces the total number of disk writes for metadata blocks. The overall impact on performance depends on the ratio of the metadata operations (such as create, delete, and link) to data writes. If much of the activity in a system is large file writes, the performance improvement is negligible. 2.2. Considerations for NFS access The behavior and performance of a file system are very different when it is accessed locally and when it is accessed by remote clients using a file access protocol such as NFS. NFS is a stateless protocol, and neither the server nor the clients are required to maintain state information about the other (although both usually maintain some state for performance reasons). When a server crashes and recovers, the client has no way of knowing it, and the effect to the client is similar to that of a network delay. For such a protocol to work consistently, the server is required to commit all file system modifications to stable storage before returning the results of an operation. This condition has a great impact on file system behavior. For local file systems, the kernel delays most disk writes until it needs to flush its cache. This has many advantages. Multiple writes to the same block between cache flushes are all committed by a single disk write. The disk driver can effectively reorder the writes to minimize head seeks. Many writes can be eliminated altogether - if a user creates a temporary file, writes some data into it, and deletes it shortly thereafter, the data blocks may never be written to disk. NFS requests, on the other hand, require frequent synchronous disk writes. Each write request that increases the size of a file causes at least two disk writes - one to write the data block, and one to write the updated inode. Additional writes are necessary if an indirect block must be created or modified. NFS access thus generates a lot more disk I/O and offers less room for traditional optimizations such as reordering of disk requests. Consequently, a file system that is suitable for local access may perform poorly if used mostly or wholly for NFS access. NFS server performance is characterized by two metrics - latency and throughput. Latency measures the average time taken for each NFS request, while throughput is the maximum load the server can bear, measured in NFS operations per second. The two are interrelated; as the load on a server increases, so does the average latency. There is also room in the design for a tradeoff. A log-structured file system, for instance, achieves high performance by batching writes. This results in increased throughput. On the other hand, since the write must be committed to disk before replying to the NFS request, such batching can result in unacceptably high latency. High latency causes several problems. Users see the system as slow and unresponsive. If a request takes very long to complete, the client may time out and retransmit the request. If the server cannot detect or handle these retransmissions effectively, it may perform a lot of duplicate processing. This further escalates the performance degradation, and also causes numerous correctness and consistency problems [Jusz 89]. 3. Calaveras file system design We began by porting the UNIX file system (ufs) from DEC OSF/1 to the Calaveras kernel. This required some changes since the storage interface, scheduling, and buffer cache of Calaveras [Rama 94] were different from OSF/1. The on-disk structures were identical to those of FFS. Once we had a working prototype, we decided to enhance it by adding metadata logging. The primary motivation for logging was to provide quick crash recovery. We wanted to eliminate the need for fsck, which can take tens of minutes on a file server with a large number of disks. We also wanted equal or better performance - in particular, it was essential that the addition of logging not increase the latency of the server. An important constraint was that the on-disk layout of the file system remain unchanged, so that users could migrate existing disks to the server without needing to back up and restore the file system. Finally, we wanted to restrict and isolate the changes to the file system, and avoid a large development effort or extensive code rewrite. These considerations precluded a log-structured file system approach. An important decision was whether to use an undo-redo or a redo-only log [Moha 92]. An undo- redo log records, for each modified object, both its old and new values. The advantage of this method is that it has looser consistency requirements governing the order of log writes and in-place updates [Chut 92]. On the other hand, it doubles the size of the log and of each write to it. The recovery algorithm is also more complex, since logged transactions can be either replayed or rolled back. A redo-only log only records the new value of each modified object. This imposes stricter ordering of operations (described below), but has a simple recovery algorithm and smaller log size. We decided to adopt a redo-only log. A related decision was to use physical block addressing, as opposed to logical block addressing or operations logging. Our log entries identify blocks by their physical disk locations, rather than by their logical names in the file system. This makes recovery simple, since the log contains the destination address of each item. For systems that use a logically addressed log, it is generally incorrect to replay the log from the beginning if the system crashes during recovery. Hence the recovery process must itself log its progress. This makes recovery slow and complex. The same problem occurs with systems that log operations rather than the new value of the data. Our basic approach is similar to that in the Cedar file system [Hagm 87]. Instead of storing the log in the same file system, however, we use a separate partition (preferably, a separate disk) for the log. This does not have to be a very large disk, since a metadata-only log does not take up too much space. Further, as we shall demonstrate, the log disk can be relatively slow, yet not degrade overall performance. This allows a system administrator to dedicate a small, inexpensive disk for the log. There is only one log in the system; it records changes to objects in all the file systems. FFS has five different types of metadata objects - inodes, directories, allocation bitmaps, indirect blocks, and cylinder group summaries. We log modified inodes in their entirety. For directories, we record the 512-byte chunk that contains the modified data. In the case of allocation bitmaps, we only log the changed bits. Cylinder group summaries can be quickly computed from the allocation bitmaps, and hence are not logged. Logging indirect blocks would have required a substantial change to several functions, and the benefits are small, since this is a relatively infrequent operation. We decided to defer it in the first implementation. The logging code only affects those NFS requests that can potentially modify the file system. We call these requests (setattr, write, link, symlink, create, remove, mkdir, rmdir, and rename) intrusive. Each intrusive request generates a log entry, which records all metadata changes made by that request. The server must write that entry to the disk before replying to the request. The in-place updates of the metadata wait for the next sync operation, which happens every 30 seconds in Calaveras. When all metadata objects described by a log entry are sync’ed to disk, the entry is obsolete and can be overwritten. The log is circular, wrapping around when it reaches the end. If the log is large enough, the sync operations will keep it clean, and no separate garbage collection is necessary. We found that a 10-Mbyte log was sufficient for a server with 12 disks, each with about 375 Mbytes of active data, running at a load of 650 NFS operations per second. We therefore mandated a minimum log size of 32 Mbytes (way more than enough for the loads our server could support), and restricted garbage collection code to merely track the log usage, and to panic if active data was overwritten. The initial performance results were extremely poor, due to the overhead of writing one log entry for each intrusive NFS operation. We devised a solution that allowed automatic batching of log entries without increasing the latency. We also added batching to the recovery algorithm. These enhancements, described in the following sections, allowed us to meet our performance goals. 3.1. Log consistency Crash recovery with a redo-only log is effected by replaying the log, writing all objects back to their correct locations on disk. This requires that the log copies of the objects are more current than the on- disk copies. Hence during normal operation, any change to a metadata object must be committed to the log before the in-place update of the object. The file system satisfies this condition by not releasing the buffers containing the metadata objects to the cache until the log write completes. There is, in fact, a much stronger requirement. It is incorrect to for an intrusive request to even read any object that has not been written out to the log. Figure 1 illustrates a potential problem. Thread t1 modifies object A, and is about to write it out, first to log and then to disk. Before it can do so, thread t2 reads object A, and based on that, modifies object B. It then writes B to the log, and is about to write it to the disk. If the system were to crash at this instant, the log contains the new value of B, but the new value of A is neither in the log nor on disk. Since the change to B depends on the change to A, this situation is potentially inconsistent. Figure 1: Race condition in a redo-only log To take a concrete example, suppose t1 is deleting a file from a directory, while t2 is creating a file by the same name in the same directory. t1 deletes the file name from block A of the directory. t2 finds that the directory does not have a file by that name, and proceeds to make a directory entry in block B of the directory. When the system recovers from the crash, it has the old block A and the new block B, both of which have a directory entry for the same file name. Figure 2: Data structures to implement logging To ensure consistency, the server must lock all metadata objects until their log entries are written out. This lock only affects intrusive NFS operations. It is perfectly valid to read uncommitted data if no modifications are made based on that. A separate logging lock has been added to vnodes to implement this synchronization. Only the intrusive requests acquire this lock. 4. Implementation Figure 2 describes the data structures used to implement logging. There are five main objects: * an update descriptor tracks all operations and synchronization associated with a single NFS request. * a log entry contains the actual data that is written to the log for one NFS request. * the log use map tracks which parts of the log are active and which are free. * staging buffers allow automatic batching of log writes. * log holds prevent premature release of an update descriptor. 4.1. The update descriptor An update descriptor baby-sits an NFS request, and holds all temporary information required to successfully complete it. Each intrusive NFS request first acquires an update descriptor, and passes it as an additional argument to many of the functions it calls. The descriptor is released not when the request completes, but when all the metadata objects that it has modified have been successfully updated in-place. At that time, the log entry on disk for this request becomes obsolete, and the corresponding bits in the log use map are cleared. This is the last step in processing an NFS request. The fields of the update descriptor include * a list of vnodes whose logging lock is held by this request. When the log entry is committed to disk, these locks are released. * a reference count of the number of log holds pointing to it. The descriptor is released when this count drops to zero. * base and size of the log entry. This determines the bits to clear in the log use map when the descriptor is released. * a list of items to log. Each item is identified by an item type (inode, directory chunk, inode allocation bitmap, or block allocation bitmap), pointer to the cached data, an inode number, the address of the block on disk and the offset of the object in that block. This information is used to create the log entry. 4.2. Log management A log entry holds all the metadata changes for a single NFS request. It comprises a header and a table of contents, followed by the actual inodes, directory chunks, etc. The entry is padded to a 512- byte boundary. The table of contents describes the type and disk location of each object in the entry. The header contains the number of items in the table of contents, as well as the total size of the entry. It has two other fields - recno and curhead. recno is a monotonically increasing record number, such that recno modulo the size of the log (in 512- byte sectors) equals the position of this entry in the log. curhead is the recno of the first active record in the log at the time the entry was generated. These fields are used for recovery. The tail of the log is the entry with the highest recno value. The curhead field of that entry identifies the head of the log. The recovery algorithm replays all entries between the head and the tail. The log use map is a bitmap with one bit for each 512-byte sector of the log. When an entry is active, the bits for the corresponding sectors are checked. The bitmap allocates new entries to NFS requests, and frees them when their update descriptor is released. The map tracks the current head and tail, and returns the information along with each allocated entry. New entries must be allocated at the tail, since the log must always be contiguous. If the bits at the tail are busy (the tail wraps around and catches up with the head), the log is considered full, even if there are free regions in the middle. This results in a panic. A log hold is simply a reference to an update descriptor. It contains pointers to chain the holds, and a pointer to the descriptor. In Calaveras, the file system maintains a dirty block list of modified metadata blocks; this list is periodically traversed and flushed by the sync daemon. The buffer headers for these blocks contain a linked list of log holds. Whenever an NFS request modifies a metadata object, it adds a log hold to the corresponding buffer, and increments the reference count of the update descriptor that the hold points to. When sync successfully writes the block to disk, it traverses the list of log holds, releases each of them, and decrements the reference counts on the update descriptors. When the count reaches zero, the descriptor and its log entry are freed. 4.3. Normal operation When an nfsd thread receives an intrusive request, it first allocates an update descriptor. It passes this descriptor to each vnode operation and onward to other functions that may need it. During the processing of the request, the thread acquires logging locks on the vnodes of any files or directories it accesses. Whenever it modifies a metadata object, it makes changes to the cached copy of the object. It does not release the corresponding buffer to the dirty block list, since it is not yet safe to write it to disk. It adds an entry in the update descriptor (in the list of items to log). This entry identifies the buffer and acts as a reference to it. When all the processing for the request is complete, the thread calls the processUpdateDescriptor() routine, which performs the following tasks: * Goes through the list of items to log, and computes the size of the log entry. * Reserves disk space for the entry from the log use map. * Reserves space in the staging buffer to write the entry. * Traverses the list again, copying each item (inode, directory chunk, or allocation bits) to the staging buffer. * Calls writeLog() to write the entry to disk. * Puts modified blocks on the dirty block list. * Adds a log hold on each modified block, and increments the reference count on the descriptor. * Releases all logging locks. Figure 3: Batching writes through staging buffers Multiple requests may modify a buffer between flushes. Each removes it from the dirty block list and replaces it after the log write completes. This results in multiple holds on each block. A block could also be modified more than once by a single request, for instance, when a request modifies two inodes in the same disk block. This causes multiple holds on the same block referencing the same descriptor. Eventually, the sync daemon removes the block from the list and flushes it to disk. It then releases all holds for the block, and decrements the reference counts on the corresponding descriptors. If the reference count on a descriptor reaches zero, its log entry is marked obsolete (by clearing the bits in the log use map), and the descriptor is released as well. 5. Batching log writes Writing each log entry to disk individually is expensive, and causes overall degradation of server performance. All logging systems rely on batching of log writes to obtain decent throughput. At the same time, the batching requires that some writes be delayed till sufficient data has been collected for a large write. Since NFS requests cannot be replied to until the write completes, this causes an increase in latency, which is usually unacceptable. For this reason, many logging file systems are unsuitable for an NFS server. We devised a solution to this problem, to get automatic batching without any increase in latency. The basic principle is that under heavy load, the log disk should always be busy. No write should be kept waiting if the disk is idle; the batching is restricted to entries that accumulate while another log write is in progress. The staging buffers are used to create and populate the log entry before writing it to disk, and they also provide the mechanism for automatic batching. We need a minimum of two staging buffers, so that one is filled while the other is being written to disk. Normally, we use three buffers, linked on a circular list. The buffer size is large, since it must hold all the data transferred in a single log write. We use 64-Kbyte buffers, since that was the size of a single disk track. In practice, we never write more than 16 Kbytes at a time. Figure 3 describes the organization of staging buffers. Each buffer header contains pointers to the next buffer, to the start and end of the data area, and to the first free byte in the buffer (posn). It also contains a queue of the threads that are blocked waiting for the buffer to be written to disk. There are two global pointers - curBuf points to the buffer currently being populated, and ioBuf points to the buffer currently being written to disk. The processUpdateDescriptor() routine allocates the space needed for the log entry from the buffer pointed to by curBuf, starting at posn. If there is not enough space between posn and end for this entry, the next buffer becomes curBuf, and the entry is allocated from that. processUpdateDescriptor() then copies the metadata into the entry, and calls writeLog() to write the entry to disk. writeLog() checks to see if another write is in progress (ioBuf is non-NULL). If not, it sets ioBuf to this buffer, advances curBuf to the next buffer, and initiates a write for this buffer. This will write out all log entries in this staging buffer. When the write completes, it wakes up all threads blocked on this buffer’s queue. If another write is already in progress, writeLog() adds the thread to the staging buffer’s queue, and blocks till this buffer is written out. In this way, each time a log write completes, all pending writes are batched into a single I/O operation. The batching does not impose any additional latency. This technique has some interesting properties. We define the average batching efficiency as batching efficiency = log entries written / num of write ops = intrusive requests per second / log writes per second If the disk is constantly busy, the number of log disk writes per second is bounded by the raw performance characteristics of the disk, and is usually smaller due to the processing delays between writes. The efficiency then is the average number of intrusive requests that arrive in the time taken for one log write (including the associated processing plus idle time). Under light loads, the efficiency is close to one, since there are few instances of multiple requests arriving in the before a disk write completes. As the incoming load increases, so does the batching efficiency. On a heavily loaded system, we measured efficiencies of up to 1.6. This translates to a 40% reduction in the total number of log writes. The remarkable property of this implementation is that the speed of the log disk does not substantially impact the performance of the system. If the disk is slower, the number of log writes per second decreases, resulting in greater batching efficiency. Hence while a slow log disk may slightly increase the latency of each operation, the overall throughput is not impacted, and may in fact improve. 6. Log recovery Recovering the log is conceptually straightforward - the log must be played back from start to end, writing back each metadata object in it to its in-place location. The two main implementation problems are how to find the start and end of the log, and how to reduce the total I/O needed for the recovery. The log is divided into 64-Kbyte segments, whose only property is that log entries may not cross segment boundaries. The log allocation routines enforce this by padding the last entry of a segment to the boundary. Thus the beginning of a segment always coincides with the start of a new log entry. The entry’s header contains its size, so the recovery algorithm can sequentially traverse entries in a segment. The recno field in the entry is a monotonically increasing record number (such that recno % log size equals the position of the entry in the log). Hence the tail of the log is the entry with the highest recno value (the log may wrap around several times, but recno continues to increase). The recovery procedure first uses a binary search of the segments (looking only at the first entry of each segment) to find the segment containing the highest recno (tail entry). It then reads the segment into memory, and scans it linearly to locate the tail entry. The curhead field in that entry identifies the head of the log. It then recovers the log one segment at a time, starting at the head. To recover a segment, it reads the whole segment into memory, and scans all entries to eliminate items obsoleted by later items in the same segment. For instance, several filenames may have been written to a directory chunk; the last instance of the chunk in the segment contains all the changes (to that point). It then sorts the remaining items based on file system id and block number. Finally, it copies the items back from the sorted list to the in- place disk locations. This way, it can combine all entries in a segment that modify the same block. This algorithm yields substantial savings. For a 32-Mbyte log (512 64-Kbyte segments), the binary search locates the head segment in 9 reads, as opposed to 256 reads for a linear search). By eliminating duplicate items, and sorting and combining the rest, we reduce the number of writes by about a factor of 8 to 10 in typical cases. 7. Performance We made our measurements on a DECpc 560ST, a 60 MHz Pentium machine with 192 Mbytes of RAM, an EISA bus, a DEFEA FDDI controller, and two Adaptec 1740 controllers. For the logging benchmarks, we used 10 Digital RZ-26 disks for file systems, and one more for the log. For the non- logging tests, we used all 11 disks for the file systems. We ran the Spec LADDIS benchmark [Witt 93], using DEC Alpha systems running OSF/1 as clients. We added code to the server to track the peak active log size and batching efficiency. The primary motivation for logging was quick crash recovery, and to measure that, we powered off the system when running at peak load. At that time, the file system had about 3 Gbytes of data over 10 (logging case) or 11 (no-logging case) disks. In absence of logging, the file system was recovered by fsck. In the best case, this took about 450 seconds. Whenever fsck ran into a serious error, it had to be run interactively; this took much longer. By comparison, the log recovery took between 3 and 14 seconds for the entire file system, depending on the size of the log at the time of the crash. This gain is somewhat exaggerated due to the fact that on our server, fsck recovers one disk at a time, while the log recovers all disks simultaneously. Even accounting for that, the log recovery would outperform fsck by a large factor. Figure 4: LADDIS benchmark results Figure 4 describes the results of the LADDIS benchmark. The maximum throughput we could achieve was about 650 NFS operations per second, both with and without logging. Beyond that, the CPU became a bottleneck. At that peak load, the peak active log size was 10 Mbytes, and the batching efficiency was 1.6 (40% reduction in log writes). The NoBatch curve describes the results of the logging implementation without batched writes. Besides uniformly high latency, the throughput peaks out at 400 NFS operations per second, since the log disk becomes a bottleneck. Figure 5: Batching efficiency as a function of server load The NoLogging and Logging curves are not too different. The average latency is a little higher with logging at low loads, but lower at high loads. This reflects the effect of the batching efficiency increasing with load. Figure 5 describes the variation of batching efficiency with incoming load. A close look at the LADDIS test suite explains the similarity between the logging and no-logging results. 80% of the LADDIS requests are non- intrusive (lookup, getattr, read, readlink, and readdir), which are completely unaffected by logging. Another 16% are write and setattr requests, which typically modify only one metadata object (the inode), and are also relatively unaffected by logging (an in-place write is replaced by a log write). Only 4% of the load consists of requests that affect multiple metadata objects (create, remove, rename, link, mkdir, and rmdir); these are the ones that benefit most from logging. To concentrate on these, we ran a “small file benchmark”, in which we ran four types of tests on a server with eight disks. The tests were: * CD: create 16000 directories; each disk has 20 directories with 100 sub-directories each. * RD: remove the above tree (thus, 16,000 rmdirs). * CF: copy a 1-Kbyte file 16000 times (same distribution as for the CD test). * RF: remove the above files. Table 1 shows the results of the benchmark. Logging is faster by a factor of 1.28 to 3.01 in the four tests. The actual improvement of server performance is even better, since the tests take into account client-side processing as well as other requests such as lookup and getattr that are necessary for these operations. Test CD RD CF RF Total No Logging 370 151 339 121 981 Logging 123 66 264 89 542 Gain 3.01 2.28 1.28 1.36 1.81 Table 1: Small file benchmark results. The numbers are the elapsed time in seconds Table 2 shows measurements made on the server of the average elapsed time (in milliseconds) for the four metadata requests tested by the small file benchmark. The improvements are substantial; mkdir, in particular, is improved by more than a factor of six. It has been difficult to compare these results with other NFS servers, for lack of clear criteria to base comparisons on. While many researchers [Hagm 87, Selt 93] have published performance measurements on log-structured or log-enhanced file systems, they have concentrated on benchmarks that deal with local access. Conversely, many others [Hitz 94, Jusz 94] have published measurements of NFS performance using LADDIS and other tests. Some of these do not use logging at all, and the others do not have any measurements that explicitly isolate the effect of logging. [Hitz 94] provides the closest point of reference. Its FAServer runs on the Intel 486 platform, and uses a log-structured file system in conjunction with NV-RAM. The configuration, therefore, is not too different from ours. Their published measurements, for an eight-system cluster, extrapolate to about 400 NFS ops/sec per server, at an average latency of under 15 miliseconds. We compare that to our benchmark on the i486, which peaks at 597 NFS ops/sec. with a latency of 29 miliseconds. The low latency of the FAServer is almost entirely due to their use of NV-RAM, which allows them to complete operations without waiting for disk I/O. 8. Conclusion Operation mkdir rmdir create remove No Logging 106.4 50.1 30.4 36.5 Logging 16.9 17.3 17.1 24.4 Gain 6.30 2.89 1.78 1.50 Table 2: Average turnaround times for metadata operations on server, in msec. The logging enhancements met all their goals. There was a tremendous improvement in crash recovery time. Under a heavy, mixed, load, logging yielded a small gain in latency and equal throughput. For small file and directory operations, the gains were substantial (a factor of 1.50 to 6.30 improvement in server latency). All this was achieved at a low cost. The entire implementation and testing took about twelve man- weeks. We wanted to keep code changes to a minimum, to make it easy to integrate enhancements to the baseline code. We were able to do that; the logging enhancements modify the ufs code in five regards: 1. The update descriptor is passed along as an additional argument to a number of vnode operations and ufs functions. 2. Intrusive operations acquire logging locks on metadata objects they access. 3. The functions that modify the metadata objects do not release the dirty buffers; instead, they add entries to the update descriptor’s item list. 4. Intrusive requests allocate an update descriptor at the beginning and call processUpdateDescriptor() at the end. 5. The sync() routine releases log holds and update descriptors. Besides these changes, we had to write the logging module. This consisted of the functions and data structures that implement logging and recovery. The module has limited interaction with the baseline code, and is accessed through a narrow, well- defined, interface. We left the FFS on-disk structures unchanged. This allows users to migrate existing disks to our server without backing up and restoring all files. The log needs to be kept on a separate disk, but that can be a small, inexpensive disk. The automatic batching technique gives greater savings under heavy load, and ensures that the performance of the log disk does not impact overall system throughput. Some work remains to be done. We need to extend logging to include indirect blocks. We need to integrate performance improvements such as NFS write gathering [Jusz 94] and file system clustering [McVo 91] with our logging framework. We must add a checksum to log entries to guard against a crash leaving behind a partially written log entry. We also need to provide a background fsck-like function to recover from hard disk errors that corrupt one or more sectors. Finally, we must evaluate the suitability of the implementation for an NFS v3.0 server [Pawl 94]. The new version of the protocol, with its inherent ability to batch writes, might change the assumptions made in our design. 9. Acknowledgments Several people have contributed to this effort. In particular, we would like to thank the rest of the Calaveras team - Percy Tzelnic, Steve Glaser, Lev Vaitzblit, Wayne Duso, and K. K. Ramakrishnan for their help and discussions throughout this project, and Chet Juszczak for his insightful discussions during the early presentations of this work. References [Chut 92] Chutani, S., Anderson, O.T., Kazar, M.L., Mason, W.A., and Sidebotham, R.N., “The Episode File System”, Proceedings of the Winter 1992 Usenix Technical Conference, Jan. 1992, pp. 43-59. [Finl 87] Finlayson, R.S., and Cheriton, D.R., “Log Files: An Extended File Service Exploiting Write-Once Storage”, Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, Nov. 1987, pp. 139-148. [Hagm 87] Hagman, R.B., “Reimplementing the Cedar File System Using Logging and Group Commit”, Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, Nov. 1987, pp. 155- 162. [Hitz 94] Hitz, D., Lau, J., and Malcolm, M., “File System Design for an NFS File Server Appliance”, Proceedings of the Winter 1994 Usenix Technical Conference, Jan. 1994, pp. 235-245. [Jusz 89] Juszczak, C., “Improving the Performance and Correctness of an NFS Server”, Proceedings of the Winter 1989 Usenix Technical Conference, Jan. 1989, pp. 53-63. [Jusz 94] Juszczak, C., “Improving the Write Performance of an NFS Server”, Proceedings of the Winter 1994 Usenix Technical Conference, Jan. 1994, pp. 247- 259. [Kaza 90] Kazar, M., Leverett, B., Anderson, O., Vasilis, A., Bottos, B., Chutani, S., Everhart, C., Mason, W.A., Tu, S., and Zayas, E., “Decorum File System Architectural Overview”, Proceedings of the Summer 1990 Usenix Technical Conference, Jun 1990, pp. 151-164. [Klei 86] Kleiman, S.R., “Vnodes: an Architecture for Multiple File System Types in Sun UNIX”, Proceedings of the Summer 1986 Usenix Technical Conference, Jun 1986, pp. [Kow 78] Kowalski, T., “FSCK: The UNIX System Check Program”, Bell Laboratory, Murray Hill, NJ, Mar. 1978. [McKu 84] McKusick, M.K., Joy, W.N., Leffler, S.J., and Fabry, R.S., “A Fast File System for UNIX”, Transactions on Computer Systems, Volume 2, No. 3, Aug. 1984, pp. 181-197. [McVo 91] McVoy, L.W., and Kleiman, S.R., “Extent-like Performance from a UNIX File System”, Proceedings of the Winter 1991 Usenix Technical Conference, Jan. 1991, pp. 1-11. [Moha 92] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwartz, P., “ARIES: A Transaction Recovery Method Supporting Fine-Grain Locking and Partial Rollbacks Using Write-Ahead Logging”, ACM Transactions on Database Systems, Vol. 17, No. 1, Mar. 1992, pp. 96-162. [Mora 90] Moran, J., Sandberg, R., Coleman, D., Kepecs, J., and Lyon, B., “Breaking Through the NFS Performance Barrier”, Proceedings of the Spring 1990 European UNIX Users Group Conference, Apr 1990, pp. 199-206. [Oust 89] Ousterhout, J.K., and Douglis, F., “Beating the I/O Bottleneck: A Case for Log-Structured File Systems”, Operating Systems Review, 23(1), Jan. 1989, pp. 11-27. [Oust 90] Ousterhout, J.K., “Why Aren’t Operating Systems Getting Faster as Fast as Hardware?”, Proceedings of the Summer 1990 Usenix Technical Conference, Jun 1990, pp. [Pawl 94] Pawlowski, B., Juszczak, C., Staubach, P., Smith, C., Lebel, D., and Hitz, D., NFS Version 3 Design and Implementation”, Proceedings of the Summer 1994 Usenix Technical Conference, Jun 1994, pp. 137-151. [Rama 94] Ramakrishnan, K.K., Vaitzblit, L., Gray, C., Vahalia, U., Ting, D., Tzelnic, P., Glaser, S., and Duso, W., “Operating System Support for a Video- On-Demand File Service”, ACM Multimedia Journal, to appear. [Rose 91] Rosenblum, M., and Ousterhout, J.K., “The Design and Implementation of a Log-Structured File System”, Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, Oct. 1991, pp. 1-15. [Sand 85] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, R., “Design and Implementation of the Sun Network File System”, Proceedings of the Summer 1985 Usenix Technical Conference, Jun 1985, pp. 119-130. [Selt 93] Seltzer, M., Bostic, K., McKusick, M.K., and Staelin, C., “An Implementation of a Log- Structured File System for UNIX”, Proceedings of the Winter 1993 Usenix Technical Conference, Jan. 1993, pp. 307-326. [Thom 78] Thompson, K., “UNIX Implementation”, Bell System Technical Journal, 57(6), Jul.-Aug. 1978. [Witt 93] Wittle, M., and Keith, B., “LADDIS: The Next Generation in NFS File Server Benchmarking”, Proceedings of the Summer 1993 Usenix Technical Conference, Jun 1993, pp. 111-128. Uresh Vahalia received an MS in Computer Science from Syracuse University in 1985. He has since been working on operating systems and network protocols. He is currently at EMC Corporation, wehre he is developing file and continuous media servers. Uresh can be reached by email at vahalia@emc.com. Dennis Ting received a BS in Computer Science from the University of Utah in 1971. He is currently a principal engineer at EMC Corporation. He has been involved in the development of real-time operating systems, network protocols, and high- performance network file servers. The work in this paper was performed at Digital Equipment Corporation. ufs refers to the implementation of FFS under the vnode/vfs interface [Klei 86] For the allocation bitmaps, the item record contains the offset and value of the modified bits in the map.