Recovering from Linux Hard Drive Failures

Uh oh! Your hard drive just failed. What do you do? Theodore Ts'o offered a training workshop on Tuesday afternoon to help admins answer just that question. Data loss happens for a variety of reasons, including human error, software failures, and hardware failures. The first thing to do regardless of the cause of loss is to remember the Hippocratic Oath: first, do no harm.

The tutorial began with an explanation of the physical operation of hard drives and the various issues that can lead to failure. Head rashes occur when the read-write head scrapes the drive platters. Drive spin-up causes a little bit of damage to the head, which can lead to failure after many thousands of spin-ups. An excessively violent impact of the read-write head on the platter can scrape away the iron oxide coating, sending tiny pieces flying around the drive enclosure and causing further damage.

Although solid state drives (SSDs) are immune to the issues described above, they are not without their own risks. Flash cells trap electrons to store state, but each program and erase cycle forces electrons in and out of the trap, slowly weakening the dielectric material. A brand-new cell at 50 degrees Celsius may hold state for 10-20 years, but after 10,000 writes the cell may lose state after only 6-12 months. Modern SSDs, with denser storage, may only last for 5,000-10,000 writes.

After discussing the mechanics of hard drives, the tutorial moved up one level to discuss partitioning, logical volume managers, and filesystems. Three major kinds of disk-based filesystems exist: FAT-based (e.g. FAT16), inode-based (e.g. ext*), and log-structured (e.g. ZFS). Each of these types has particular benefits and drawbacks when it comes to performance and other features. For example, log-structured filesystems offer excellent write performance, but seeks are expensive without extensive memory caching.

So what do you do when a failure happens? You must first ask yourself: what happened? What is the lowest corrupted level (e.g. file, hard drive)? How important is the data? (If it's not very important, you won't spend too much time trying to recover it.) And when was the last backup? Before taking any further steps, it's important to have a plan of attack, which includes making an image of the disk and attempting recovery operations against the image.
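The imaging step can be sketched with dd. The commands below run against a scratch file so they are safe to try; on a real failure the input would be the failing drive's device node (something like /dev/sdX, shown only in the comments):

```shell
# Demonstration on a scratch file; with a real failure the input would be
# the drive itself, e.g.:  dd if=/dev/sdX of=failing-disk.img bs=64K ...
dd if=/dev/urandom of=source.img bs=64K count=4 2>/dev/null

# conv=noerror,sync keeps dd going past read errors and pads unreadable
# blocks with zeros, so offsets in the image still match the original disk.
dd if=source.img of=rescue.img bs=64K conv=noerror,sync 2>/dev/null

cmp -s source.img rescue.img && echo "image matches"
```

GNU ddrescue, where available, is usually a better choice for a genuinely failing drive: it retries bad sectors and keeps a map file so an interrupted copy can be resumed.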

Kernel log messages often have indicators of hardware failure. Check to see if it's a one-time failure, or a prelude to major failure. The S.M.A.R.T. reports from the drive can also provide insight into any hardware problems. If the failure is at the hardware level, sending the drive off to a recovery service may be the only way to recover data.
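As a rough illustration of what to look for, the following greps a saved kernel-log excerpt (the log lines here are a made-up sample); on a live system you would pipe dmesg through the same pattern, and the smartctl invocations in the comments query the drive directly:

```shell
# Sample kernel-log excerpt; on a live system:
#   dmesg | grep -iE 'i/o error|sense|exception'
cat > kern-excerpt.log <<'EOF'
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
sd 2:0:0:0: [sdb] Unhandled sense code
end_request: I/O error, dev sdb, sector 12345678
EOF

# Lines like these suggest the drive, not the filesystem, is the problem
grep -iE 'i/o error|sense|exception' kern-excerpt.log

# S.M.A.R.T. data from the drive itself (smartmontools, needs root):
#   smartctl -H /dev/sdX    # overall health verdict
#   smartctl -a /dev/sdX    # full attributes and drive error log
```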

Partition table problems are easier to work around. The partition table information is small enough that it can be easily printed and taped to the side of the machine. Tools like gpart can scan a disk and attempt to reconstruct the partition table.

LVM problems have the fortunate feature of rarely involving metadata loss, since each physical volume in the volume group holds two copies of the metadata and an additional copy lives in /etc/lvm. What is more common is the physical failure of a physical volume in a volume group. LVM is no substitute for RAID (though LVM does have striping support, it is not the default), so a failed physical volume will likely lead to data loss. The rest of the volume group can be recovered by linking /dev/ioerror to /dev/zero and executing: vgchange --partial -ay

Corruption in the filesystem can be reported by e2fsck (or similar for non-ext filesystems) during boot, or by the kernel. Fixing a filesystem by running fsck (with the -y option for reduced interaction) is generally a workable solution, but save the output of fsck, as it may aid in post-repair cleanup.
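A minimal sketch of that repair-and-log step, run here against a small scratch filesystem image (on a real system the argument would be the unmounted partition's device node):

```shell
# Scratch ext4 image for demonstration; on a real system:
#   e2fsck -fy /dev/sdXN | tee fsck.log   (partition must be unmounted)
truncate -s 8M fs.img
mkfs.ext4 -q -F fs.img

# -f forces a full check even if the fs looks clean; -y answers "yes"
# to every repair prompt; tee keeps the log for post-repair cleanup.
e2fsck -fy fs.img | tee fsck.log
```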

Desperation searches are also available. An open source tool called PhotoRec searches for file fingerprints on disk to find files. Recovering the files requires that they were stored contiguously. The ever-present grep command can also be of assistance in searching for files if you know a string to look for.
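The grep approach can be sketched as follows: search the image for a known string, then carve out the bytes around the reported offset with dd. The "lost" image below is fabricated for the demonstration, and the offset in the final command matches this sample only:

```shell
# Build a tiny fake image containing a known string
printf 'junkjunk\n' > lost.img
printf 'Subject: quarterly report\n' >> lost.img
head -c 4096 /dev/zero >> lost.img

# -a treats the binary image as text; -b prints the byte offset of each
# matching line -- here the line starts at byte 9
grep -ab 'quarterly report' lost.img

# Carve 512 bytes starting at that offset for inspection
dd if=lost.img of=carved.bin bs=1 skip=9 count=512 2>/dev/null

# PhotoRec (from the testdisk package) does this systematically:
#   photorec lost.img
```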

The easiest way to recover from data loss is to not have it occur in the first place. Automated backups are the key part of this plan. It is even possible to back up ext* metadata by using the e2image command. Vigilance against impending hardware failures will also help. Using smartctl to check the health of disks is good, and enabling smartd to report automatically is better. Most importantly: don't panic!
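The e2image step can be sketched like so, again against a scratch filesystem image (on a real system the first argument would be the partition device, and the resulting file would be stored somewhere off the disk it describes):

```shell
# Scratch ext4 filesystem for demonstration; on a real system:
#   e2image /dev/sdXN sdXN.e2i
truncate -s 8M fs.img
mkfs.ext4 -q -F fs.img

# Save a metadata-only image (superblock, group descriptors, inodes)
# that e2fsck and debugfs can use during a later recovery
e2image fs.img fs.e2i
ls -l fs.e2i
```

For the monitoring half, smartmontools ships the smartd daemon, which polls S.M.A.R.T. attributes in the background and reports changes, so problems surface before you think to run smartctl by hand.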