Check out the new USENIX Web site.


NSDI '06 Paper    [NSDI '06 Technical Program]

Efficient Replica Maintenance for Distributed Storage Systems

Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon,
M. Frans Kaashoek, John Kubiatowicz, and Robert Morris

MIT Computer Science and Artificial Intelligence Laboratory,
Rice University/MPI-SWS, University of California, Berkeley



Abstract

This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.

The following insights in designing an efficient replication algorithm emerge from the paper's analysis. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, the focus of a durability algorithm must be to create new copies of data objects faster than permanent disk failures destroy the objects; careful choice of policies for what nodes should hold what data can decrease repair time. Third, increasing the number of replicas of each data object does not help a system tolerate a higher disk failure probability, but does help tolerate bursts of failures. Finally, ensuring that the system makes use of replicas that recover after temporary failure is critical to efficiency.

Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at a low cost. A simulation of Carbonite storing 1 TB of data over a 365 day trace of PlanetLab activity shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.

1  Introduction

Wide-area distributed storage systems typically use replication to provide two related properties: durability and availability. Durability means that objects that an application has put into the system are not lost due to disk failure whereas availability means that get will be able to return the object promptly. Objects can be durably stored but not immediately available: if the only copy of an object is on the disk of a node that is currently powered off, but will someday re-join the system with disk contents intact, then that object is durable but not currently available. The paper's goal is to develop an algorithm to store immutable objects durably and at a low bandwidth cost in a system that aggregates the disks of many Internet nodes.

The threat to durability is losing the last copy of an object due to permanent failures of disks. Efficiently countering this threat to durability involves three main challenges. First, network bandwidth is a scarce resource in a wide-area distributed storage system. To store objects durably, there must be enough network capacity to create copies of objects faster than they are lost due to disk failure. Second, a system cannot always distinguish between transient failures and permanent disk failures: it may waste network bandwidth by creating new copies during transient failures. Third, after recovery from transient failures, some replicas may be on nodes that the replica lookup algorithm does not query and are thus effectively lost.

Since transient failures are common in wide-area systems, replication algorithms can waste bandwidth by making unneeded replicas. For example, the initial replication algorithm [6] that the DHash distributed hash table (DHT) [9] turned out to be inadequate to build storage applications such as UsenetDHT [34], Antiquity [11], and OverCite [35, 36].

A problem with DHash was that its design was driven by the goal of achieving 100% availability; this decision caused it to waste bandwidth by creating new replicas in response to temporary failures. Its design and similar ones (such as Total Recall [3]) are overkill for durability. Furthermore, users of many Internet applications can tolerate some unavailability. For example, Usenet readers will see all articles eventually, as long as they are stored durably. Our experience with these DHT applications has led us to the following insights:

  • Durability is a more practical and useful goal than availability for applications that store objects (as opposed to caching objects).
  • The main goal of a durability algorithm should be to create new copies of an object faster than they are destroyed by disk failures; the choice of how replicas are distributed among nodes can make this task easier.
  • Increasing the replication level does not help tolerate a higher average permanent failure rate, but it does help cope with bursts of failures.
  • Reintegrating returning replicas is key to avoiding unnecessary copying.
Using these insights we have developed Carbonite, an efficient wide-area replication algorithm for keeping objects durable. After inserting a set of initial replicas, Carbonite begins by creating new replicas mostly in response to transient failures. However, over time it is increasingly able to ignore transient failures and approaches the goal of only producing replicas in response to permanent failures.

Carbonite's design assumes that the disks in the distributed storage system fail independently of each other: failures of geographically distributed hard drives from different manufacturers are likely to be uncorrelated.

In a year-long PlanetLab failure trace, however, we observe some correlated failures because of coordinated re-installs of the PlanetLab software. Despite this, an evaluation using the PlanetLab failure trace shows that Carbonite is able to keep 1 TB of data durable, and consumes only 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.

The rest of this paper explains our durability models and algorithms, interleaving evaluation results into the explanation. Section 2 describes the simulated evaluation environment. Section 3 presents a model of the relationship between network capacity, amount of replicated data, number of replicas, and durability. Section 4 explains how to decrease repair time, and thus increase durability, by proper placement of replicas on servers. Section 5 presents an algorithm that reduces the bandwidth wasted making copies due to transient failures. Section 6 outlines some of the challenges that face practical implementations of these ideas, Section 7 discusses related work, and Section 8 concludes.

2  System environment

The behavior of a replication algorithm depends on the environment in which it is used: high disk failure rates or low network access link speeds make it difficult for any system to maintain durability. We will use the characteristics of the PlanetLab testbed as a representative environment when evaluating wide-area replication techniques. For explanatory purposes, we will also use a synthetic trace that makes some of the underlying trends more visible. This section describes both environments, as well as the simulator we used to evaluate our algorithm.

2.1  PlanetLab characteristics

PlanetLab is a large (> 600 node) research testbed [28] with nodes located around the world. We chose this testbed as our representative environment mainly because it is a large, distributed collection of machines that has been monitored for long periods; we use this monitoring data to construct a realistic trace of failures in a mostly managed environment.

The main characteristics of PlanetLab that interest us are the rates of disk and transient failures. We use historical data collected by the CoMon project [25] to identify transient failures. CoMon has archival records collected on average every 5 minutes that include the uptime as reported by the system uptime counter on each node. We use resets of this counter to detect reboots, and we estimate the time when the node became unreachable based on the last time CoMon was able to successfully contact the node. This allows us to pinpoint failures without depending on the reachability of the node from the CoMon monitoring site.

We define a disk failure to be any permanent loss of disk contents, due to disk hardware failure or because its contents are erased accidentally or intentionally. In order to identify disk failures, the CoMon measurements were supplemented with event logs from PlanetLab Central [28]. This database automatically records each time a PlanetLab node is reinstalled (e.g., for an upgrade, or after a disk is replaced following a failure). The machine is then considered offline until the machine is assigned a regular boot state in the database. Table 1 summarizes the statistics of this trace. Figure 7(a) visualizes how transient and disk failures accumulate over time in this network.


Dates 1 March 2005 – 28 Feb 2006
Number of hosts 632
Number of transient failures 21255
Number of disk failures 219
Transient host downtime (s) 1208, 104647, 14242
Any failure interarrival (s) 305, 1467, 3306
Disk failures interarrival (s) 54411, 143476, 490047
  (Median/Mean/90th percentile)
Table 1: CoMon+PLC trace characteristics.


2.2  Synthetic trace

We also generated synthetic traces of failures by drawing failure inter-arrival times from exponential distributions. Synthetic traces have two benefits. First, they let us simulate longer time periods, and second, they allow us to increase the failure density, which makes the basic underlying trends more visible. We conjecture that exponential inter-failure times are a good model for disks that are independently acquired and operated at geographically separated sites; exponential intervals are possibly not so well justified for transient failures due to network problems.

Each synthetic trace contains 632 nodes, just like the PlanetLab trace. The mean session time and downtime match the values shown in Table 1; however, in order to increase the failure density, we extended the length to two years and reduced the average node lifetime to one year. Each experiment was run with ten different traces; the figures show the averages from these experiments.

2.3  Simulation

We use the failure traces to drive an event-based simulator. In the simulator, each node has unlimited disk capacity, but limited link bandwidth. However, it assumes that all network paths are independent so that there are no shared bottlenecks. Further it assumes that if a node is available, it is reachable from all other nodes. This is occasionally not the case on PlanetLab [14]; however, techniques do exist to mask the effects of partially unreachable nodes [1].

The simulator takes as input a trace of transient and disk failure events, node repairs and object insertions. It simulates the behavior of nodes under different protocols and produces a trace of the availability of objects and the amount of data sent and stored by each node for each hour of simulated time. Each simulation calls put with 50,000 data objects, each of size 20 MB. Unless otherwise noted, each node is configured with an access link capacity of 150 KBytes/s, roughly corresponding to the throughput achievable under the bandwidth cap imposed by PlanetLab. The goal of the simulations is to show the percentage of objects lost and the amount of bandwidth needed to sustain objects over time.

3  Understanding durability

We consider the problem of providing durability for a storage system composed of a large number of nodes spread over the Internet, each contributing disk space. The system stores a large number of independent pieces of data. Each piece of data is immutable. The system must have a way to name and locate data; the former is beyond the scope of this work, while the latter may affect the possible policies for placing replicas. While parts of the system will suffer temporary failures, such as network partitions or power failures, the focus of this section is on failures that result in permanent loss of data. Section 5 shows how to efficiently manage transient failures; this section describes some fundamental constraints and challenges in providing durability.

3.1  Challenges to durability

It is useful to view permanent disk and node failures as having an average rate and a degree of burstiness. To provide high durability, a system must be able to cope with both.

In order to handle some average rate of failure, a high-durability system must have the ability to create new replicas of objects faster than replicas are destroyed. Whether the system can do so depends on the per-node network access link speed, the number of nodes (and hence access links) that help perform each repair, and the amount of data stored on each failed node. When a node n fails, the other nodes holding replicas of the objects stored on n must generate replacements: objects will remain durable if there is sufficient bandwidth available on average for the lost replicas to be recreated. For example, in a symmetric system each node must have sufficient bandwidth to copy the equivalent of all data it stores to other nodes during its lifetime.

If nodes are unable to keep pace with the average failure rate, no replication policy can prevent objects from being lost. These systems are infeasible. If the system is infeasible, it will eventually “adapt” to the failure rate by discarding objects until it becomes feasible to store the remaining amount of data. A system designer may not have control over access link speeds and the amount of data to be stored; fortunately, choice of object placement can improve the speed that a system can create new replicas as discussed in Section 4.

If the creation rate is only slightly above the average failure rate, then a burst of failures may destroy all of an object's replicas before a new replica can be made; a subsequent lull in failures below the average rate will not help replace replicas if no replicas remain. For our purposes, these failures are simultaneous: they occur closer together in time than the time required to create new replicas of the data that was stored on the failed disk. Simultaneous failures pose a constraint tighter than just meeting the average failure rate: every object must have more replicas than the largest expected burst of failures. We study systems that aim to maintain a target number of replicas in order to survive bursts of failure; we call this target rL.

Higher values of rL do not allow the system to survive a higher average failure rate. For examples, if failures were to arrive at fixed intervals, then either rL=2 would always be sufficient, or no amount of replication would ensure durability. If rL = 2 is sufficient, there will always be time to create a new replica of the objects on the most recently failed disk before their remaining replicas fail. If creating new replicas takes longer than the average time between failures, no fixed replication level will make the system feasible; setting a replication level higher than two would only increase the number of bytes each node must copy in response to failures, which is already infeasible at rL=2.

3.2  Creation versus failure rate

It might seem that any creation rate higher than the average failure rate will lead to an unbounded number of replicas, thus satisfying the burst constraint. However, this intuition is false. To see why, let us model the number of replicas of an object as a birth-death process using a continuous time Markov chain, which assumes independent exponential inter-failure and inter-repair times. This assumption is reasonable for independent disk failures.

nsdi001.png
Figure 1: A continuous time Markov model for the process of replica failure and repair for a system that maintains three replicas (rL=3). Numbered states correspond to the number of replicas of each object that are durable. Transitions to the left occur at the rate at which replicas are lost; right-moving transitions happen at the replica creation rate.

An object is in state i when i disks hold a replica of the object. There are thus rL + 1 possible states, as we start with rL replicas and only create new replicas in response to failures. From a given state i, there is a transition to state i+1 with rate μi corresponding to repair, except for state 0 which corresponds to loss of durability and state rL which does not need repair. The actual rate μi depends on how bandwidth is allocated to repair and may change depending on the replication level of an object. There is a transition to the next lower state i−1 with rate iλf because each of the i nodes holding an existing replica might fail. Figure 1 shows this model for the case where rL = 3.

This model can be analyzed numerically to shed light on the impact of rL on the probability of data loss; we will show this in Section 3.3. However, to gain some intuition about the relationship between creation and failure rates and the impact this has on the number of replicas that can be supported, we consider a simplification of Figure 1 that uses a fixed μ but repairs constantly, even allowing for transitions out of state 0. While these changes make the model less realistic, they turn the model into an M/M/∞ queue [19] where the “arrival rate” is the repair rate and the “service rate” is the per-replica failure rate. The “number of busy servers” is the number of replicas: the more replicas an object has, the more probable it is that one of them will fail.

This simplification allows us to estimate the equilibrium number of replicas: it is μ/λf. Given μ and λf, a system cannot expect to support more than this number of replicas. For example, if the system must handle coincidental bursts of five failures, it must be able to support at least six replicas and hence the replica creation rate must be at least 6 times higher than the average replica failure rate. We will refer to μ/λf as θ. Choices for rL are effectively limited by θ. It is not the case that durability increases continuously with rL; rather, when using rL > θ, the system provides the best durability it can, given its resource constraints. Higher values of θ decrease the time it takes to repair an object, and thus the `window of vulnerability' during which additional failures can cause the object to be destroyed.

To get an idea of a real-world value of θ, we estimate μ and λf from the historical failure record for disks on PlanetLab. From Table 1, the average disk failure inter-arrival time for the entire test bed is 39.85 hours. On average, there were 490 nodes in the system, so we can estimate the mean time between failures for a single disk as 490⋅ 39.85 hours or 2.23 years. This translates to λf ≈ 0.439 disk failures per year.

The replica creation rate μ depends on the achievable network throughput per node, as well as the amount of data that each node has to store (including replication). PlanetLab currently limits the available network bandwidth to 150 KB/s per node, and if we assume that the system stores 500 GB of unique data per node with rL = 3 replicas each, then each of the 490 nodes stores 1.5 TB. This means that one node's data can be recreated in 121 days, or approximately three times per year. This yields μ ≈ 3 disk copies per year.

In a system with these characteristics, we can estimate θ = μ/λf ≈ 6.85, though the actual value is likely to be lower. Note that this ratio represents the equilibrium number of disks worth of data that can be supported; if a disk is lost, all replicas on that disk are lost. When viewed in terms of disk failures and copies, θ depends on the value of rL: as rLincreases, the total amount of data stored per disk (assuming available capacity) increases proportionally and reduces μ. If λf = μ, the system can in fact maintain rLreplicas of each object.

To show the impact of θ, we ran an experiment with the synthetic trace (i.e., with 632 nodes, a failure rate of λf = 1 per year and a storage load of 1 TB), varying the available bandwidth per node. In this case, 100 B/s corresponds to θ=1.81 / rL. Figure 2 shows that, as θ drops below one, the system can no longer maintain full replication and starts operating in a `best effort' mode, where higher values of rL do not give any benefit. The exception is if some of the initial rL replicas survive through the entire trace, which explains the small differences on the left side of the graph.


nsdi002.png
Figure 2: Average number of replicas per object at the end of a two-year synthetic trace for varying values of θ, which varies with bandwidth per node (on the x-axis) and total data stored (rL). Where θ < 1, the system cannot maintain the full replication level; increasing rL further does not have any effect.


3.3  Choosing rL


nsdi003.png
Figure 3: Frequency of “simultaneous” failures in the PlanetLab trace. These counts are derived from breaking the trace into non-overlapping 24 and 72 hour periods and noting the number of permanent failures that occur in each period. If there are x replicas of an object, there were y chances in the trace for the object to be lost; this would happen if the remaining replicas were not able to respond quickly enough to create new replicas of the object.


A system designer must choose an appropriate value of rLto meet a target level of durability. That is, for a given deployment environment, rLmust be high enough so that a burst of rL failures is sufficiently rare.

One approach is to set rLto one more than the maximum burst of simultaneous failures in a trace of a real system. For example, Figure 3 shows the burstiness of permanent failures in the PlanetLab trace by counting the number of times that a given number of failures occurs in disjoint 24 hour and 72 hour periods. If the size of a failure burst exceeds the number of replicas, some objects may be lost. From this, one might conclude that 12 replicas are needed to maintain the desired durability. This value would likely provide durability but at a high cost. If a lower value of rLwould suffice, the bandwidth spent maintaining the extra replicas would be wasted.

There are several factors to consider in choosing rLto provide a certain level of durability. First, even if failures are independent, there is a non-zero (though small) probability for every burst size up to the total number of nodes. Second, a burst may arrive while there are fewer than rLreplicas. One could conclude from these properties that the highest possible value of rLis desirable. On the other hand, the simultaneous failure of even a large fraction of nodes may not destroy any objects, depending on how the system places replicas (see Section 4). Also, the workload may change over time, affecting μ and thus θ.

The continuous time Markov model described in Figure 1 reflects the distributions of both burst size and object replication level. The effect of these distributions is significant. An analysis of the governing differential equations can be used to derive the probability that an object will be at a given replication level after a given amount of time. In particular, we can determine the probability that the chain is in state 0, corresponding to a loss of durability.

We show the results of such an analysis in Figure 4; for details, see [7]. To explore different workloads, we consider different amounts of data per node. The graph shows the probability that an object will survive after four years as a function of rL and data stored per node (which affects the repair rate and hence θ).

As rL increases, the system can tolerate more simultaneous failures and objects are more likely to survive. The probability of object loss at rL=1 corresponds to using no replication. This value is the same for all curves since it depends only on the lifetime of a disk; no new replicas can be created once the only replica of the object is lost. To store 50 GB durably, the system must use an rL of at least 3. As the total amount of data increases, the rL required to attain a given survival probability also increases. Experiments confirm that data is lost on the PlanetLab trace only when maintaining fewer than three replicas.


nsdi004.png
Figure 4: Analytic prediction for object durability after four years on PlanetLab. The x-axis shows the initial number of replicas for each object: as the number of replicas is increased, object durability also increases. Each curve plots a different per-node storage load; as load increases, it takes longer to copy objects after a failure and it is more likely that objects will be lost due to simultaneous failures.


4  Improving repair time

This section explores how the system can increase durability by replacing replicas from a failed disk in parallel. In effect, this reduces the time needed to repair the disk and increases θ.

Each node, n, designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. We will call the size of that set the node's scope, and consider only system designs in which every node has the same scope. Scope can range from a minimum of rL to a maximum of the number of nodes in the system.

A small scope means that all the objects stored on node n have copies on nodes chosen from the same restricted set of other nodes. The advantage of a small scope is that it makes it easier to keep track of the copies of each object. For example, DHash stores the copies of all the objects with keys in a particular range on the successor nodes of that key range; the result is that those nodes store similar sets of objects, and can exchange compressed summaries of the objects they store when they want to check that each object is replicated a sufficient number of times [6].

The disadvantage of a small scope is that the effort of creating new copies of objects stored on a failed disk falls on the small set of nodes in that disk's scope. The time required to create the new copies is proportional to the amount of data on one disk divided by the scope. Thus a small scope results in a long recovery time. Another problem with a small scope, when coupled with consistent hashing, is that the addition of a new node may cause needless copying of objects: the small scope may dictate that the new node replicate certain objects, forcing the previous replicas out of scope and thus preventing them from contributing to durability.

Larger scopes spread the work of making new copies of objects on a failed disk over more access links, so that the copying can be completed faster. In the extreme of a scope of N (the number of nodes in the system), the remaining copies of the objects on a failed disk would be spread over all nodes, assuming that there are many more objects than nodes. Furthermore, the new object copies created after the failure would also be spread over all the nodes. Thus the network traffic sources and destinations are spread over all the access links, and the time to recover from the failure is short (proportional to the amount of data on one disk divided by N).

A larger scope also means that a temporary failure will be noticed by a larger number of nodes. Thus, more access links are available to create additional replicas while the failure lasts. Unless these links are already fully utilized, this increases the effective replica creation rate, and thus improves durability.


nsdi005.png
Figure 5: Durability for different scopes in a synthetic trace with low θ. Larger scopes spread the repair work over more access links and improve the nodes' ability to monitor replicas and temporary failures, which results in higher durability.

Figure 5 shows how scope (and thus repair time) affects object durability in a simulation on a synthetic trace. To reduce θ, we limit the bandwidth per node to 1000 B/s in this experiment. We vary the repair threshold and the scope, and measure durability after two years of simulated time. Increasing the scope from 5 to 25 nodes reduces the fraction of lost objects by an order of magnitude, independent of rL. By including more nodes (and thus more network connections) in each repair effort, the work is spread over more access links and completes faster, limiting the window of time in which the system is vulnerable to another disk failure. Ideally, by doubling the scope, the window of vulnerability can be cut in half.

A large scope reduces repair time and increases durability; however, implementing a large scope presents two trade-offs. First, the system must monitor each node in the scope to determine the replication levels; when using a large scope, the system must monitor many nodes. This increased monitoring traffic limits scalability. Second, in some instances, a large scope can increase the likelihood that a simultaneous failure of multiple disks will cause some object to be lost.

If objects are placed randomly with scope N and there are many objects, then it is likely that all NCrL potential replica sets are used. In this scenario, the simultaneous failure of any rL disks is likely to cause data loss: there is likely to be at least one object replicated on exactly those disks. A small scope limits placement possibilities that are used, concentrating objects into common replica sets. As a result, it is less likely that a given set of rL failures will affect a replica set, but when data loss does occur, many more objects will be lost. These effects exactly balance: the expected number of objects lost during a large failure event is identical for both strategies. It is the variance that differs between the two strategies.

5  Reducing transient costs

The possibility of transient failures complicates providing durability efficiently: we do not want to make new copies in response to transient failures, but it is impossible to distinguish between disk failures and transient failures using only remote network measurements. This section focuses minimizing the amount of network traffic sent in response to transient failures.

The key technique needed to achieve this is to ensure that the system reintegrates object replicas stored on nodes after transient failures; this means the system must be able to track more than rL replicas of each object. The number of replicas that the system must remember turns out to be dependent on a, the average fraction of time that a node is available. However, we show that the correct number of extra replicas can be determined without estimating a by tracking the location of all replicas, including those that are offline. We introduce the Carbonite algorithm that uses this technique and demonstrate its effectiveness using simulations.

We additionally consider two other techniques for limiting response to transient failures: creating extra replicas in batches and using timeouts as a heuristic for distinguishing transient from disk failures. Both are of limited value: batching is best able to save bandwidth when using erasure codes and, in the presence of reintegration, timeouts work well only if node downtimes are notably shorter than node (and disk) lifetimes.

5.1  Carbonite details

The Carbonite maintenance algorithm focuses on reintegration to avoid responding to transient failures. Durability is provided by selecting a suitable value of rL; an implementation of Carbonite should place objects to maximize θ and preferentially repair the least replicated object. Within these settings, Carbonite works to efficiently maintain rLcopies, thus providing durability.

Because it is not possible to distinguish between transient and disk failures remotely, Carbonite simply responds to any detected failure by creating a new replica. This approach is shown in Figure 6. If fewer than rL replicas are detected as available, the algorithm creates enough new replicas to return the replication level to rL.


// Iterate through the object database
// and schedule an object for repair if needed
Maintain_Replicas ()
    keys = <DB.object_keys sorted number of available replicas>
    foreach k in keys:
	n = replicas[k].len ()
	if n < rL:
	    newreplica = enqueue_repair (k)
	    replicas[k].append (newreplica)
Figure 6: Each node maintains a list of objects for which it is responsible and monitors the replication level of each object using some synchronization mechanism. In this code, this state is stored in the replicas hash table though an implementation may choose to store it on disk. This code is called periodically to enqueue repairs on those objects that have too few replicas available; the application can issue these requests at its convenience.


However, Carbonite remembers which replicas were stored on nodes that have failed so that they can be reused if they return. This allows Carbonite to greatly reduce the cost of responding to transient failures. For example, if the system has created two replicas beyond rL and both fail, no work needs to be done unless a third replica fails before one of the two currently unavailable replicas returns. Once enough extra replicas have been created, it is unlikely that fewer than rL of them will be available at any given time. Over time, it is increasingly unlikely that the system will need to make any more replicas.

5.2  Reintegration reduces maintenance


nsdi006.png nsdi007.png
nsdi008.png nsdi009.png
(a) (b)
Figure 7: A comparison of the total amount of work done by different maintenance algorithms with rL=3 using a PlanetLab trace (left) and a synthetic trace (right). In all cases, no objects are lost. However, rL = 2 is insufficient: for the PlanetLab trace, even a system that could distinguish permanent from transient failures would lose several objects.


Figure 7 shows the importance of reintegrating replicas back into the system by comparing the behavior of Carbonite to two prior DHT systems and a hypothetical system that can differentiate disk from transient failures using an oracle and thus only reacts to disk failures. In the simulation, each system operates with rL=3. The systems are simulated against the PlanetLab trace (a) and a synthetic trace (b). The y-axes plot the cumulative number of bytes of network traffic used to create replicas; the x-axes show time.

Unlike all other synthetic traces used in this paper, whose parameters are different from the PlanetLab trace in order to bring out the basic underlying trends, the synthetic trace used in Figure 7 was configured to be similar to the PlanetLab trace. In particular, the average node lifetime and the median downtime are the same. The result is still an approximation (for example, PlanetLab grew during the trace) but the observed performance is similar. Some of the observed differences are due to batching (used by Total Recall) and timeouts (used by all systems); the impact of these are discussed in more detail in Sections 5.4 and 5.5.

Since the oracle system responds only to disk failures, it uses the lowest amount of bandwidth. The line labeled Cates shows a system that keeps track of exactly rL replicas per object; this system approximates the behavior of DHTs like DHash, PAST and OpenDHT. Each failure causes the number of replicas to drop below rL and causes this system to create a new copy of an object, even if the failure was transient. If the replica comes back online, it is discarded. This behavior results in the highest traffic rate shown. The difference in performance between the PlanetLab and Poisson trace is due to differences in the distribution of downtimes: Poisson is not a particularly good fit for the PlanetLab downtime distribution.

Total Recall [3] tracks up to a fixed number of replicas, controlled by a parameter rH; we show rH = 5 which is optimal for these traces, and rH = 9. As can be seen at the right of the graphs, this tracking of additional replicas allows Total Recall to create fewer replicas than the Cates system. When more than rL replicas are available, a transient failure will not cause Total Recall to make a new copy. However, Total Recall's performance is very sensitive to rH. If rH is set too low, a series of transient failures will cause the replication level to drop below rL and force it to create an unnecessary copy. This will cause Total Recall to approach Cates (when rH = rL). Worse, when the system creates new copies it forgets about any copies that are currently on failed nodes and cannot benefit from the return of those copies. Without a sufficiently long memory, Total Recall must make additional replicas. Setting rHtoo high imposes a very high insertion cost and results in work that may not be needed for a long time.

Carbonite reintegrates all returning replicas into the replica sets and therefore creates fewer copies than Total Recall. However, Carbonite's inability to distinguish between transient and disk failures means that it produces and maintains more copies than the oracle based algorithm. This is mainly visible in the first weeks of the trace as Carbonite builds up a buffer of extra copies. By the end of the simulations, the rate at which Carbonite produces new replicas approaches that of the oracle system.

5.3  How many replicas?

To formalize our intuition about the effect of extra replicas on maintenance cost and to understand how many extra replicas are necessary to avoid triggering repair following a transient failure, consider a simple Bernoulli process measuring R, the number of replicas available at a given moment, when there are r>rL total replicas. The availability of each node is a. Since repair is triggered when the number of available replicas is less than rL, the probability that a new replica needs to be created is the probability that less than rL replicas are available:
Pr[R < rL  | r extant copies ] =
rL − 1
Σ
i = 0

r
i

ai (1 − a)ri.
This probability falls rapidly as r increases but it will never reach zero; there is always a chance that a replica must be created due to a large number of concurrent failures, regardless of how many replicas exist already. However, when a large number of replicas exists, it is extremely unlikely that enough replicas fail such that fewer than rL are available.

By computing the Chernoff bound, it is possible to show that after the system has created 2rL/a replicas, the probability of a new object creation is exponentially small. 2rL/a is a rough (and somewhat arbitrary) estimate of when the probability of a new object creation is small enough to ignore. Figure 8 shows (on the y-axis) the probability that a new object must be created when an increasing number of replicas already exist. As r increases, the probability that a new replica needs to be created falls, and the algorithm creates replicas less frequently. As r approaches 2rL/a, the algorithm essentially stops creating replicas, despite not knowing the value of a.


nsdi010.png
Figure 8: Additional redundancy must be created when the amount of live redundancy drops below the desired amount (3 replicas in this example). The probability of this happening depends solely on the average node availability a and the amount of durable redundancy. This graph shows the probability of a repair action as a function of the amount of durable redundancy, with a = 0.5, a = 0.7 and a = 0.9 for a replication system.


This benefit is obtained only if returning replicas are reintegrated into the appropriate replica set, allowing more than rL to be available with high probability. As a result, the cost of responding to transient failures will be nearly zero. Still, this system is more expensive than an oracle system that can distinguish between disk and transient failures. While the latter could maintain exactly rL replicas, the former has to maintain approximately 2rL/a. The factor of 2/a difference in the cost is the penalty for not distinguishing disk and transient failures.

5.4  Create replicas as needed

Given that the system tends towards creating 2rL/a replicas in order to keep rL of them available, it is tempting to create the entire set—not just rL of them—when the object is first inserted into the system (Total Recall [3] uses a similar technique). However, this approach requires an accurate estimate for a to deliver good performance. If a is overestimated, the system quickly finds itself with less than rL replicas after a string of transient failures and is forced to create additional copies. If a is underestimated, the system creates unneeded copies and wastes valuable resources. Carbonite is simplified by the fact that it does not need to measure or estimate a to create the “correct” number of replicas.

Another idea is to create not only enough copies to bring the number of available replicas back up to rL, but also e additional copies beyond rL (this is similar to Total Recall's lazy repair technique). Creating a batch of copies makes repair actions less frequent, but at the same time, causes more maintenance traffic than Carbonite. The work required to create additional replicas will be wasted if those replicas are lost due to disk failures before they are actually required. Carbonite, on the other hand, only creates replicas that are necessary to keep rL replicas available. In other words, either Carbonite would eventually create the same number of replicas as a scheme that creates replicas in batches, or some replicas created in the batch were unnecessary: batch schemes do, at best, the same amount of work as Carbonite.


nsdi011.png
Figure 9: Total repair cost with extra replicas, and with and without reintegration after repair. Without reintegration, extra replicas reduce the rate at which repair is triggered and thus reduce maintenance cost; there is an optimal setting (here e=8). With reintegration, the cost is lowest if no extra replicas are used.


Figure 9 shows the bytes sent in a simulation experiment using a five-year synthetic trace with a=0.88, rL=3, and an average node lifetime of one year. The graph shows results for different values of e (in Total Recall, e = rHrL) and for two different scenarios. In the scenario with reintegration, the system reintegrates all replicas as they return from transient failures. This scenario represents the behavior of Carbonite when e=0 and causes the least traffic.

In the scenario without reintegration, replicas that are unavailable when repair is triggered are not reintegrated into the replica set even if they do return. Total Recall behaves this way. Extra replicas give the system a short-term memory. Additional replicas increase the time until repair must be made (at which time failed replicas will be forgotten); during this time failed replicas can be reintegrated. Larger values of e give the system a longer memory but also put more data at risk of failure: on this synthetic trace, a value of e=8 is optimal. Taking advantage of returning replicas is simpler and more efficient than creating additional replicas: a system that reintegrates returning replicas will always make fewer copies than a system that does not and must replace forgotten replicas.

For systems that use erasure codes, there is an additional read cost since a complete copy of the object is needed in order to generate a new fragment [32]. The cost of reading a sufficient number of fragments prior to recreating a lost fragment can overwhelm the savings that erasure codes provide. A common approach is to amortize this cost by batching fragment creation but simply caching the object at the node responsible for repair is much more effective. A simulation contrasting both caching and batching (but both with reintegration) shows results similar to Figure 9: caching the object with a 7/14 erasure code uses 85% of the bandwidth that the optimal batching strategy would use.

5.5  Timeouts

A common approach to reduce transient costs is to use long timeouts, as suggested by Blake [4]. Timeouts are a heuristic to avoid misclassifying temporary failures as permanent: failures are considered to be permanent only when the corresponding node has not responded for some number of seconds. Longer timeouts reduce the number of misclassified transient failures and thus the number of repairs. On the other hand, a longer timeout also increases the latency between failure and repair in the event of a true disk failure; if additional permanent failures occur during this larger “window of vulnerability,” data may be lost.

The goal of both reintegrating replicas and use of timeouts is to reduce the number of repairs without decreasing durability. Figure 7 demonstrates that reintegration is effective for Carbonite. However, it also illustrates that timeouts are important in systems without reintegration: on the PlanetLab trace, the timeout used is able to mask 87.7% of transient failures whereas it only masks 58.3% of transient failures on the Poisson trace. If replicas are reintegrated, what extra benefit does a timeout provide?

Timeouts are most effective when a significant percentage of the transient failures can be ignored, which is dependent on the downtime distribution. However, for durability to remain high, the expected node lifetime needs to be significantly greater than the timeout.


[]nsdi012.png []nsdi013.png
Figure 10: The impact of timeouts on bandwidth and durability on a synthetic trace. Figure 10(a) shows the number of copies created for various timeout values; fig10b shows the corresponding object durability. In this trace, the expected downtime is about 29 hours. Longer timeouts allow the system to mask more transient failures and thus reduce maintenance cost; however, they also reduce durability.


To evaluate this scenario where timeouts should have impact, we performed an experiment using a synthetic trace where we varied the repair threshold and the node timeout. Since the system would recognize nodes returning after a permanent failure and immediately expire all pending timeouts for these nodes, we assigned new identities to such nodes to allow long timeouts to expire normally.

Figure 10 shows the results of this simulation: fig10a shows the total bytes sent as a function of timeout while fig10b shows the durability at the end of the trace. As the length of the timeout increases past the average downtime, we observe a reduction in the number of bytes sent without a decrease in durability. However, as the timeout grows longer, durability begins to fall: the long timeout delays the point at which the system can begin repair, reducing the effective repair rate. Thus setting a timeout can reduce response to transient failures but its success depends greatly on its relationship to the downtime distribution and can in some instances reduce durability as well.

6  Implementing Carbonite

While the discussion of durability and efficient maintenance may be broadly applicable, in this section, we focus on our experience in implementing Carbonite in the context of distributed hash tables (DHTs).

In a DHT, each node is algorithmically assigned a portion of the total identifier space that it is responsible for maintaining. Carbonite requires that each node know the number of available replicas of each object for which it is responsible. The goal of monitoring is to allow the nodes to track the number of available replicas and to learn of objects that the node should be tracking but is not aware of. When a node n fails the new node n' that assumes responsibility of n's blocks begins tracking replica availability; monitored information is soft state and thus can be failed over to a "successor" relatively transparently.

Monitoring can be expensive: a node might have to contact every node in the scope of each object it holds. While developing two prototype implementations of Carbonite in the PlanetLab environment, we found it necessary to develop different techniques for monitoring: the monitoring problem is slightly different in systems that use distributed directories and those that use consistent hashing. Figure 11 illustrates the structures of these systems.


[DHT]nsdi014.png   [Directory]nsdi015.png
Figure 11: DHT- and Directory- Based Storage System Architectures.


The Chord/DHash system [8, 9] served as the basis for our consistent hashing implementation. It uses a small scope and thus monitors a small number of nodes. DHash does not need to record the location of failed replicas: a node will return to the same place in the ring and thus the same replica sets, as long as it returns with the same logical identifier.

We used Oceanstore [20, 30] and the BambooDHT [31] to develop a distributed directory system using large scope and random placement. Oceanstore must maintain pointers to all nodes that have ever held data for a given object and has a scope of N. Responsibility for keys is still assigned using consistent hashing: the pointer database for each key is replicated on the successors of the key. In this case, the location of objects is hard state. Unfortunately, it could be easy for this system to have very high monitoring costs: if each node communicate with every other node periodically, the resulting N2 probe traffic may limit the system's scalability.

6.1  Monitoring consistent hashing systems

In systems that use a small scope, it is possible to make an end-to-end check that data is stored on the disk of each node. The naive way to do this is to arrange for nodes to repeatedly exchange key lists, but such an exchange would be extremely costly.

DHash uses a synchronization protocol based on Merkle trees [6] that takes advantage of the fact that most objects are typically correctly placed. In this common case, adjacent nodes store largely similar keys and two nodes can exchange a single message (containing a digest of the stored keys) to verify that they are synchronized.

Carbonite allows replicas to be placed anywhere in the placement scope. This flexibility lets the system avoid moving and replicating objects during most joins (until the system grows dramatically). However, it also causes the Merkle synchronization protocol to operate outside of its common case: adjacent nodes are no longer likely to store nearly identical sets of objects. In this environment the synchronizer “discovers” that nodes in the scope are missing objects each time it is run. Repeatedly exchanging this information can be costly: if the synchronization protocol runs once a minute, the cost of repeatedly transferring the 20-byte key of an 8 KB data object will exceed the cost of transferring the object itself to a newly joined node in about 8 hours.

To avoid this problem, each node maintains, for each object, a list of nodes in the scope without a copy of the object. The node uses this information to adjust its Merkle tree to avoid re-learning the information again during the next run of the synchronizer. For instance, when a node n synchronizes with a replica node n' that is known to be missing an object with key k, n leaves k out of the Merkle tree used for synchronization: this prevents n' from reporting what n already knew. The amount of extra state needed to perform this optimization per object is small relative to the size of storing the object itself, and can be maintained lazily, unlike the object itself which is hard state.

6.2  Monitoring host availability

In a directory-style system, the same synchronization techniques just described can be used to monitor the directory itself (which is replicated on successor nodes); however, it is likely infeasible to explicitly monitor the liveness of objects themselves using the algorithm described above since two nodes are not likely to store the same keys. Instead, node availability can be monitored as a proxy for object availability. Node availability can be monitored using a multicast mechanism that propagates the liveness state of each node to each other node.

The DHT's routing tables are used to establish a unique spanning tree rooted at each node a O(logN) out-degree per node. Each node periodically broadcasts a heartbeat message to its children in the tree; this message includes a generation identifier that is randomly generated when the node is installed or reinstalled following a disk failure. The children rebroadcast the heartbeat to their children until it is received by all nodes.

Over time, each node expects to receive regular notification of node liveness. If a heartbeat is missed, the monitoring node triggers repair for every object stored on the newly down node. When a node returns and its generation identifier has not changed, the monitoring node can conclude that objects stored on that node are again accessible.

7  Related work

7.1  Replication analysis

The use of a birth-death data-loss model is a departure from previous analyses of reliability. Most DHT evaluations consider whether data would survive a single event involving the failure of many nodes [8, 40]. This approach does not separate durability from availability, and does not consider the continuous bandwidth consumed by replacing replicas lost to disk failure.

The model and discussion in this paper is similar to contemporary work that looks at churn [37] and analyzes the expected object lifetime [29]. The birth-death model is a generalization of the calculations that predict the MTBF for RAID storage systems [26]. Owing to its scale, a distributed system has more flexibility to choose parameters such as the replication level and number of replica sets when compared to RAID systems.

Blake and Rodrigues argue that wide-area storage systems built on unreliable nodes cannot store a large amount of data [4]. Their analysis is based on the amount of data that a host can copy during its lifetime and mirrors our discussion of feasibility. We come to a different conclusion because we consider a relatively stable system membership where data loss is driven by disk failure, while they assumed a system with continual membership turnover.

The selection of a target replication level for surviving bursts differs from many traditional fault tolerant storage systems. Such systems, designed for single-site clusters, typically aim to continue operating despite some fixed number of failures and choose number of replicas so that a voting algorithm can ensure correct updates in the presence of partitions or Byzantine failures [23, 5, 17, 33, 24].

FAB [33] and Chain Replication [38] both consider how the number of possible replicas sets affects data durability. The two come to opposite conclusions: FAB recommends a small number of replica sets since more replica sets provide more ways for data to fail; chain replication recommends many replica sets to increase repair parallelism and thus reduce repair time. These observations are both correct: choosing a replica placement strategy requires balancing the probability of losing some data item during a simultaneous failure (by limiting the number of replica sets) and improving the ability of the system to tolerate a higher average failure rate (by increasing the number of replica sets and reconstruction parallelism).

Weatherspoon et al [39] studied the increased costs due to transient failures. Their results quantify the benefits of maintaining extra replicas in reducing these transient costs. However, their analysis focuses on systems that forget about extant replicas that exist when repair is initiated and do not discuss the benefits of reintegrating them.

7.2  Replicated systems

Replication has been widely used to reduce the risk of data loss and increase data availability in storage systems (e.g., RAID [26], System R duplex disks [16], Harp [23], xFS [2], Petal [21], DDS [17], GFS [15]). The algorithms traditionally used to create and maintain data redundancy are tailored for the environment in which these systems operate: well-connected hosts that rarely lose data or become unavailable. As a result they can maintain a small, fixed number of replicas and create a new replica immediately following a failure. This paper focuses on wide-area systems that are bandwidth-limited, where transient network failures are common, and where it is difficult to tell the difference between transient failures and disk failures.

Distributed databases [10], online disaster recovery systems such as Myriad [22], and storage systems [13, 12, 27] use replication and mirroring to distribute load and increase durability. These systems store mutable data and focus on the cost of propagating updates, a consideration not applicable to the immutable data we assume. In some cases, data is replicated between a primary and backup sites and further replicated locally at each site using RAID. Wide area recovery is initiated only after site failure; individual disk failure can be repaired locally.

Total Recall is the system most similar to our work [3]. We borrow from Total Recall the idea that creating and tracking additional replicas can reduce the cost of transient failures. Total Recall's lazy replication keeps a fixed number of replicas and fails to reincorporate replicas that return after a transient failure if a repair had been performed. Total Recall also requires introspection or guessing to determine an appropriate high water mark that Carbonite can arrive at naturally.

Glacier [18] is a distributed storage system that uses massive replication to provide data durability across large-scale correlated failure events. The resulting tradeoffs are quite different from those of Carbonite, which is designed to handle a continuous stream of at small-scale failure events. For example, due to its high replication level, Glacier can afford very long timeouts and thus mask almost all transient failures.

8  Conclusions and future work

Inexpensive hardware and the increasing capacity of wide-area network links have spurred the development of applications that store a large amount of data on wide-area nodes. However, the feasibility of applications based on distributed storage systems is currently limited by the expense of maintaining data. This paper has described a set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data.

These techniques have allowed us to develop and deploy prototypes of UsenetDHT [34], OverCite [35], and Antiquity [11]. These systems must store large amounts of data durably and were infeasible without the techniques we have presented. In the future, we hope to report on our long-term experience with these systems.

Acknowledgments
The authors would like to thank Vivek Pai and Aaron Klingaman for their assistance in compiling the data used for the PlanetLab traces. This paper has benefited considerably from the comments of the anonymous reviewers and our shepherd, Larry Peterson.

References

[1]
Andersen, D. Improving End-to-End Availability Using Overlay Networks. PhD thesis, Massachusetts Institute of Technology, 2004.
[2]
Anderson, T. E., Dahlin, M. D., Neefe, J. M., Patterson, D. A., Roselli, D. S., and Wang, R. Y. Serverless network file systems. In Proc. of the 15th ACM Symposium on Operating System Principles (Dec. 1995).
[3]
Bhagwan, R., Tati, K., Cheng, Y.-C., Savage, S., and Voelker, G. M. Total Recall: System support for automated availability management. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).
[4]
Blake, C., and Rodrigues, R. High availability, scalable storage, dynamic peer networks: Pick two. In Proc. of the 9th Workshop on Hot Topics in Operating Systems (May 2003), pp. 1–6.
[5]
Castro, M., and Liskov, B. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems 20, 4 (2002), 398–461.
[6]
Cates, J. Robust and efficient data management for a distributed hash table. Master's thesis, Massachusetts Institute of Technology, May 2003.
[7]
Dabek, F. A Distributed Hash Table. PhD thesis, Massachusetts Institute of Technology, 2005.
[8]
Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Wide-area cooperative storage with CFS. In Proc. of the 18th ACM Symposium on Operating System Principles (Oct. 2001).
[9]
Dabek, F., Li, J., Sit, E., Robertson, J., Kaashoek, M. F., and Morris, R. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).
[10]
Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D., and Terry, D. Epidemic algorithms for replicated database maintenance. In Proc. of the 6th ACM Symposium on Principles of Distributed Computing (1987), pp. 1–12.
[11]
Eaton, P., Weatherspoon, H., and Kubiatowicz, J. Efficiently binding data to owners in distributed content-addressable storage systems. In Proc. of the 3rd International Security in Storage Workshop (Dec. 2005).
[12]
EMC. Centera—content addressed storage system. http://www.emc.com/products/systems/centera.jsp. Last accessed March 2006.
[13]
EMC. Symmetrix remote data facility. http://www.emc.com/products/networking/srdf.jsp. Last accessed March 2006.
[14]
Freedman, M. J., Lakshminarayanan, K., Rhea, S., and Stoica, I. Non-transitive connectivity and DHTs. In Proc. of the 2nd Workshop on Real Large Distributed Systems (Dec. 2005).
[15]
Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. of the 2003 19th ACM Symposium on Operating System Principles (Oct. 2003).
[16]
Gray, J., McJones, P., Blasgen, M., Lindsay, B., Lorie, R., Price, T., Putzolu, F., and Traiger, I. The recovery manager of the System R database manager. ACM Computing Surveys 13, 2 (1981), 223–242.
[17]
Gribble, S., Brewer, E., Hellerstein, J., and Culler, D. Scalable, distributed data structures for internet service construction. In Proc. of the 4th Symposium on Operating Systems Design and Implementation (Oct. 2004).
[18]
Haeberlen, A., Mislove, A., and Druschel, P. Glacier: Highly durable, decentralized storage despite massive correlated failures. In Proc. of the 2nd Symposium on Networked Systems Design and Implementation (May 2005).
[19]
Kleinrock, L. Queueing Systems, Volume I: Theory. John Wiley & Sons, Jan. 1975.
[20]
Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. OceanStore: An architecture for global-scale persistent storage. In Proc. of the 9th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (Nov. 2000), pp. 190–201.
[21]
Lee, E. K., and Thekkath, C. A. Petal: Distributed virtual disks. In Proc. of the 7th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (1996), pp. 84–92.
[22]
Leung, S.-T. A., MacCormick, J., Perl, S. E., and Zhang, L. Myriad: Cost-effective disaster tolerance. In Proc. of the 1st USENIX Conference on File and Storage Technologies (Jan. 2002).
[23]
Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., Shrira, L., and Williams, M. Replication in the Harp file system. In Proc. of the 13th ACM Symposium on Operating System Principles (Oct. 1991), pp. 226–38.
[24]
Litwin, W., and Schwarz, T. LH* RS : A high-availability scalable distributed data structure using reed solomon codes. In Proc. of the 2000 ACM SIGMOD Intl. Conference on Management of Data (May 2000), pp. 237–248.
[25]
Park, K. S., and Pai, V. CoMon: a mostly-scalable monitoring system for PlanetLab. ACM SIGOPS Operating Systems Review 40, 1 (Jan. 2006), 65–74. http://comon.cs.princeton.edu/.
[26]
Patterson, D., Gibson, G., and Katz, R. A case for redundant arrays of inexpensive disks (RAID). In Proc. of the ACM SIGMOD International Conference on Management of Data (June 1988).
[27]
Patterson, H., Manley, S., Federwisch, M., Hitz, D., Kleiman, S., and Owara, S. Snapmirror: File system based asynchronous mirroring for disaster recovery. In Proc. of the 1st USENIX Conference on File and Storage Technologies (Jan. 2002).
[28]
Peterson, L., Anderson, T., Culler, D., and Roscoe, T. A blueprint for introducing disruptive technology into the Internet. In Proc. of the First ACM Workshop on Hot Topics in Networks (Oct. 2002). http://www.planet-lab.org.
[29]
Ramabhadran, S., and Pasquale, J. Analysis of long-running replicated systems. In Proc. of the 25th IEEE Annual Conference on Computer Communications (INFOCOM) (Apr. 2006).
[30]
Rhea, S., Eaton, P., Geels, D., Weatherspoon, H., Zhao, B., and Kubiatowicz, J. Pond: the OceanStore prototype. In Proc. of the 2nd USENIX Conference on File and Storage Technologies (Apr. 2003).
[31]
Rhea, S., Geels, D., Roscoe, T., and Kubiatowicz, J. Handling churn in a DHT. In Proc. of the 2004 Usenix Annual Technical Conference (June 2004).
[32]
Rodrigues, R., and Liskov, B. High availability in DHTs: Erasure coding vs. replication. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (Feb. 2005).
[33]
Saito, Y., Froelund, S., Veitch, A., Merchant, A., and Spence, S. FAB: building distributed enterprise disk arrays from commodity components. In Proc. of the 11th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, 2004), ACM Press, pp. 48–58.
[34]
Sit, E., Dabek, F., and Robertson, J. UsenetDHT: A low overhead Usenet server. In Proc. of the 3rd International Workshop on Peer-to-Peer Systems (Feb. 2004).
[35]
Stribling, J., Councill, I. G., Li, J., Kaashoek, M. F., Karger, D. R., Morris, R., and Shenker, S. OverCite: A cooperative digital research library. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (Feb. 2005).
[36]
Stribling, J., Li, J., Councill, I. G., Kaashoek, M. F., and Morris, R. Exploring the design of multi-site web services using the OverCite digital library. In Proc. of the 3rd Symposium on Networked Systems Design and Implementation (May 2006).
[37]
Tati, K., and Voelker, G. M. On object maintenance in peer-to-peer systems. In Proc. of the 5th International Workshop on Peer-to-Peer Systems (Feb. 2006).
[38]
van Renesse, R., and Schneider, F. B. Chain replication for supporting high throughput and availability. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (Dec. 2004).
[39]
Weatherspoon, H., Chun, B.-G., So, C. W., and Kubiatowicz, J. Long-term data maintenance in wide-area storage systems: A quantitative approach. Tech. Rep. UCB//CSD-05-1404, U. C. Berkeley, July 2005.
[40]
Weatherspoon, H., and Kubiatowicz, J. D. Erasure coding vs. replication: A quantitative comparison. In Proc. of the 1st International Workshop on Peer-to-Peer Systems (Mar. 2002).

This research was supported by the National Science Foundation under Cooperative Agreement No. ANI-0225660, http://project-iris.net/. Andreas Haeberlen was supported in part by the Max Planck Society. Emil Sit was supported in part by the Cambridge-MIT Institute. Hakim Weatherspoon was supported by an Intel Foundation PhD Fellowship.
This paper was originally published in the Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI '06)
May 8–10, 2006, San Jose, CA


Last changed: 3 May 2006 jel