Recovering From Linux Hard Drive Disasters

Ever had a hard drive failure? Ever kicked yourself because you didn't keep backups of critical files, or because you discovered that your regular nightly backup didn't succeed? If this sounds familiar, then Theodore Ts'o's training "Recovering From Linux Hard Drive Disasters" should be on your LISA schedule: this tutorial covers in depth how to recover from disasters caused by software or hardware failures.

After covering the basic types of failures (user goofs, software bugs and hardware failures), Theodore explained the first and most important step when data is lost: DON'T PANIC! Most of the time, the first reaction after a failure causes more damage than the initial failure itself. You should remain calm, try to determine what happened, create a backup image of the failed disk if necessary, and only after that try to recover the data. The backup can be as simple as using dd:
dd if=/dev/hda1 of=/dev/hdb1 bs=1k conv=sync,noerror
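
The dd invocation above can be wrapped in a small helper so that the same options are used every time. This is only a minimal sketch: the function name image_disk and its arguments are hypothetical, and only the bs=1k and conv=sync,noerror options come from the session.

```shell
#!/bin/sh
# image_disk SRC DEST: copy SRC to DEST with the dd options quoted above.
# SRC and DEST are placeholders; each can be a block device or a plain file.
image_disk() {
    src=$1
    dest=$2
    if [ ! -r "$src" ]; then
        echo "image_disk: cannot read $src" >&2
        return 1
    fi
    # noerror: keep going after a read error instead of stopping.
    # sync: pad every short or failed read with zeros up to the block size,
    # so the data that follows stays at the same offset in the copy.
    dd if="$src" of="$dest" bs=1k conv=sync,noerror
}
```

Something like image_disk /dev/hda1 /mnt/spare/hda1.img (both paths are made up) would then write the image to a file on a healthy disk. Because of conv=sync, the copy is padded with zeros when the source size is not a multiple of the block size.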

To make the different failures easier to understand, Theodore explained how data is stored on disks and which hardware components can fail. Next he moved on to partition types and the major filesystems, each with its own special characteristics and features:

  • FAT
  • ext2/3/4
  • reiserfs
  • JFS
  • XFS
  • ZFS

Here are some of my takeaways from this awesome session:

- you should monitor your logs for errors sent to the console or to the system/kernel log that might indicate hardware or filesystem failures. Being alerted promptly to such events helps you identify failures and react to them faster. For example, a hardware failure logged by the kernel:
– hda: dma_intr: status=0x51 { DriveReady SeekCompleteError }
– hda: dma_intr: error=0x40 { UncorrectableError } LBAsect = 408672, sector 1204 end_request

or an ext3 filesystem error:
EXT3-fs error (device md(9,2)): ext3_readdir: bad entry in directory #2670595: rec_len % 4 != 0 - offset=0, inode=
- there are different tools that can be used for backups and recoveries; it is important that you are familiar with them, know what their output looks like and have tested them beforehand. If you use them for the first time during a crisis, odds are you will not do very well. Some of the interesting tools recommended are e2image, dd_rescue, gpart and cfv.
- save the output of fsck commands, as it might be valuable later during troubleshooting.
- save your partition table. Even if you just take the output of fdisk -l, save it (even in a simple file), as you might need it if the partition table gets corrupted.
- LVM is not a substitute for RAID and RAID is not a substitute for backups
- and finally about backups: just do them. Use any tool you like, but just do them. Even a simple tar script will do its job.
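
Several of these habits are easy to script. The helpers below are a minimal sketch, not something shown in the session: the function names, the grep patterns (modelled on the two log excerpts above, with EXT2/EXT4 added as an assumption) and all paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; names, patterns and paths are illustrative only.

# scan_disk_errors FILE: print the lines of FILE that look like the kernel
# errors quoted above. Returns grep's status (non-zero when nothing matches).
scan_disk_errors() {
    grep -E 'dma_intr: (status|error)=|UncorrectableError|EXT[234]-fs error' "$1"
}

# save_output DIR NAME COMMAND...: run COMMAND, keep a timestamped copy of
# its output (stdout and stderr) in DIR, then print it.
# Returns COMMAND's exit status.
save_output() {
    dir=$1
    name=$2
    shift 2
    mkdir -p "$dir" || return 1
    out="$dir/$name-$(date +%Y%m%d-%H%M%S).log"
    "$@" >"$out" 2>&1
    status=$?
    cat "$out"
    return "$status"
}

# backup_tree SRC DESTDIR: write a dated, gzip-compressed tar archive of the
# directory SRC into DESTDIR.
backup_tree() {
    src=$1
    destdir=$2
    mkdir -p "$destdir" || return 1
    archive="$destdir/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}
```

For example, save_output /mnt/spare/notes fdisk fdisk -l keeps a copy of the partition table listing, save_output /mnt/spare/notes fsck-hda1 e2fsck -n /dev/hda1 records a read-only filesystem check, and backup_tree /home /mnt/spare/backups is about as simple as a tar backup gets. In each case the destination should live on a different disk from the one you are worried about.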

In this class, Theodore Ts'o covers all the possible (and impossible) hard drive and filesystem failures and how to deal with them. If you care about your data you should definitely attend, as it will help you prepare for the day a failure does happen. Finally, don't forget the most important step when data is lost: "DON'T PANIC"; with enough care, you can usually get your data back.