The fsck time crunch

Most existing file systems have been designed under the assumption that media errors are rare. In the event that a media error corrupts file system metadata, the file system must be unmounted and repaired with fsck, a program that checks for and repairs file system metadata inconsistencies. At a high level, fsck traverses all file system metadata starting with the root directory, checking inodes and directories for internal consistency, rebuilding inode and block allocation maps, and checking for overall consistency between all metadata. Unfortunately, this process takes time proportional to the amount of file system metadata and involves many seeks to follow links and block pointers. With some exceptions, file system metadata grows roughly as the size of the file system, and the size of the file system grows roughly proportional as the size of disks. As disk capacity grows, fsck time grows proportionally.

Given increasing capacity, lower relative bandwidth, flat seek time, and increasing number of per-disk errors, the implications for fsck are as follows: (1) Fsck time will grow in absolute terms, to be on the order of days or weeks for ``normal'' file systems, (2) Larger file systems will have greater likelihood of suffering corruption of at least one piece of metadata, (3) The fraction of time data is unavailable while running fsck may asymptotically approach unity in the absence of basic architectural changes.

What can be done to combat this trend? Journaling file systems only speed fsck time in the case of a system crash, disconnected disk, or other interruptions in the middle of file system updates. They do not speed recovery in the case ``real'' metadata corruption--a journaling file system must still read all metadata to correct any kind of error other than a half-finished update. The same goes for soft updates[11], copy-on-write[2,8] and log-structured file systems[14], which are again designed to speed fsck after a crash or similar event.

Checksums on file system metadata, such as those added to ext3 in the IRON project[13], will detect errors and improve reliability, but won't change the need to read all file system metadata to repair consistency. File system level replication of metadata combined with checksums, such as in ZFS[2], can eliminate nearly all need to run fsck for file system corruption caused below the file system layer, but uses disk space and does not help with corruption caused by file system bugs as the inconsistency will be present in both checksummed copies. Those inclined to dismiss file system bugs as a significant source of file system corruption are invited to consider the recent XFS bug in Linux 2.6.17[1] requiring repair via the XFS file system repair program, causing significant downtime. While file system bugs may be relatively rare, we must take them into consideration when the cost of such a bug is several hours or days of downtime to repair the file system.

RAID solutions can improve the reliability of the underlying disks, but corruption still occurs frequently enough that it we cannot rely on it alone when, again, the cost is hours or days of downtime. Recently the main server for kernel.org, which hosts several years' worth of Linux kernel archives, suffered file system corruption at the RAID level; running fsck on the (journaling) ext3 file system took over a week, more than the time required to restore the entire file system from backup. In addition, two of our primary design targets, desktops and laptops, so far have not been attractive markets for RAID.

To help understand why existing techniques do not solve the fsck time crunch in and of themselves, consider the case of a file with a corrupted link count. To find the true link count, you must read every directory entry and count up all the entries with links to that file. Fundamentally, answering the question of the correct value of one bit in the block allocation bitmap requires reading every piece of metadata in today's file systems.

One apparent solution is to optimize fsck--perhaps fsck only takes so long because it is (to state it baldly) poorly written. However, when it comes to optimizing fsck, most of the low-hanging fruit has already been plucked. For example, many of the more well-known inefficiencies of fsck[4] have been corrected in the ext2/3 fsck utility. And to compensate for exponential reduction in performance due to hardware trends, we would have to continuously improve fsck performance at the same exponential rate for years--an unsustainable solution.

Another workaround is to divide up the disk into many smaller file systems and manage them individually. This approach has many drawbacks. Each file system must be mounted at a particular point, and all files in each file system must be children of the same top-level directory, constraining the organization of files. Disk space becomes fragmented, so one file system may run out of space while several others have plenty of free space. When this happens, files must be rebalanced between file systems by hand. Most existing file system code is not designed for or tested well for I/O errors, and when one file system suffers a failure, it often causes cascading failures in other supposedly independent file systems sharing the same code or system resources. Simply partitioning the disk into many small file systems produces a brittle, hard-to-manage system with somewhat lower data loss but not much improvement in data availability.

Valerie Henson 2006-10-18