Reliable Storage & Recovery

Recent studies have shown that failures plague storage and other components of large computing datacenters [36]. As a result, many systems replicate data to reduce the risk of data loss [5,14,16,25,23,37]. However, replication is incomplete without a means of recovery.

Disaster recovery has received considerable attention [13,21,22]. In [20], for example, the authors propose a reactive approach to scheduling data recovery after a disaster has occurred. Candidate recovery processes are first mapped onto recovery graphs, which capture alternative approaches for recovering workloads, precedence relationships, timing constraints, and so on. The recovery scheduling problem is then encoded as an optimization problem whose goal is to find the schedule that minimizes some measure of penalty; several methods for finding optimal and near-optimal solutions are given.
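
To make the scheduling formulation concrete, the following is a minimal sketch and not the method of [20]. It assumes a single recovery resource that runs tasks one at a time, a penalty that accrues linearly until each task completes, and exhaustive search over orderings. The task names, durations, penalty rates, and precedence edge are all hypothetical.

```python
from itertools import permutations

# Hypothetical recovery tasks: name -> (duration_hours, penalty_per_hour_until_done)
TASKS = {
    "restore_db": (4, 10),
    "restore_web": (1, 3),
    "rebuild_index": (2, 5),
}

# Hypothetical precedence edges (before, after): the index needs the database.
PRECEDES = [("restore_db", "rebuild_index")]


def respects_precedence(order, precedes):
    """Return True if every 'before' task appears ahead of its 'after' task."""
    position = {name: i for i, name in enumerate(order)}
    return all(position[a] < position[b] for a, b in precedes)


def total_penalty(order, tasks):
    """Sum penalty_rate * completion_time over a serial schedule."""
    clock = 0
    penalty = 0
    for name in order:
        duration, rate = tasks[name]
        clock += duration          # task finishes at this time
        penalty += rate * clock    # penalty accrued while it was unavailable
    return penalty


def best_schedule(tasks, precedes):
    """Brute-force the feasible ordering with the lowest total penalty."""
    feasible = (o for o in permutations(tasks) if respects_precedence(o, precedes))
    return min(feasible, key=lambda o: total_penalty(o, tasks))


if __name__ == "__main__":
    order = best_schedule(TASKS, PRECEDES)
    print(order, total_penalty(order, TASKS))
```

Exhaustive search is only workable for a handful of tasks; it stands in here for the optimal and near-optimal solvers mentioned above.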

Aguilera et al. [4] explore the tradeoff between the ability to recover and the cost of recovery in enterprise storage systems. They propose a multi-tier file system called TierFS that employs a "recoverability log" to increase the recoverability of lower tiers by using the highest tier.

Both LOCKSS [26] and Deep Store [44] address the problem of reliably preserving large volumes of data for virtually indefinite periods of time, dealing with threats like format obsolescence and "bit-rot." LOCKSS consists of a set of low-cost, independent, persistent cooperating caches that use a voting scheme to detect and repair damaged content. Deep Store eliminates redundancy both within and across files; it distributes data for scalability and provides variable levels of replication based on the importance or the degree of dependency of each chunk of stored data.
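
The idea of detecting and repairing damage by voting can be illustrated with a simplified sketch. This is not the LOCKSS protocol; it assumes a small set of replicas that can all be read directly, a plain majority vote over SHA-256 digests, and repair by copying the winning content. The function name `audit_and_repair` and the peer names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint a replica's content."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(replicas):
    """Vote on content digests and overwrite replicas that lose the vote.

    replicas: dict mapping peer name -> bytes held by that peer.
    Returns (repaired_replicas, names_of_damaged_peers).
    Raises ValueError if no digest is held by a strict majority.
    """
    votes = Counter(digest(c) for c in replicas.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(replicas) // 2:
        raise ValueError("no majority; cannot decide which copy is intact")

    # Any replica matching the winning digest serves as the repair source.
    good = next(c for c in replicas.values() if digest(c) == winner)
    damaged = [p for p, c in replicas.items() if digest(c) != winner]
    repaired = {p: good for p in replicas}
    return repaired, damaged


if __name__ == "__main__":
    copies = {"peer_a": b"hello", "peer_b": b"hello", "peer_c": b"hellp"}
    fixed, damaged = audit_and_repair(copies)
    print(damaged, fixed["peer_c"])
```

A strict majority is required so that a tie between two differing copies is reported rather than resolved arbitrarily.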

Baker et al. [8] consider the problem of recovering from failures in long-term storage of digital information. They propose a "reliability model" that encompasses latent and correlated faults, as well as the time needed to detect latent faults. They show that a simple combination of auditing as soon as possible (to detect latent faults), automatic recovery, and independence of replicas yields the most benefit relative to the cost of each technique.

Hakim Weatherspoon 2009-01-14