
Disaster Test

Figure 7: Data loss as a result of disaster and wide-area link failure, varying the FEC parameter $c$ (50ms one-way latency, 1% link loss).
[figure: results/graph6/graph6.eps]

Figure 6 shows the amount of data lost in the event of a disaster for the local-sync, local-sync+FEC, and network-sync solutions; we omit the remote-sync and remote-sync+FEC solutions from this experiment since those solutions do not lose data.

The rolling disaster, in which the wide-area link fails and all primary-site processes crash, occurred two minutes into the experiment (see Section 2 for a description of rolling disasters). The wide-area link operated at 0% loss until immediately before the disaster, when the loss rate was increased for 0.5 seconds; the link was then killed. The x-axis shows the wide-area link loss rate immediately before the link is killed; link losses are random, independent, and identically distributed. The y-axis shows both the total number of messages sent and the total number of messages lost--lost messages were reported to the application as durable but were never received by the remote mirror. Messages were 4 kB in size.
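The link behavior in this experiment amounts to a simple loss schedule. The sketch below is only illustrative; the function name and Python rendering are ours, and `burst_loss` stands for the x-axis value being tested:

```python
def link_loss_rate(t, t_disaster=120.0, burst_loss=0.01):
    """Emulated WAN link loss rate as a function of time t (seconds).

    The link runs loss-free until 0.5s before the rolling disaster,
    drops packets at 'burst_loss' (the x-axis value, e.g. 1%) during
    that final 0.5s window, and is killed outright at t_disaster
    (two minutes into the experiment).
    """
    if t < t_disaster - 0.5:
        return 0.0          # healthy link: 0% loss
    if t < t_disaster:
        return burst_loss   # brief lossy window just before the disaster
    return 1.0              # link killed: nothing gets through
```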

Figure 8: Latency distribution as a function of wide-area link loss (50ms one-way latency).
[figure panels at 0%, 0.1%, and 1% link loss: results/graph3/lat-bins-0-50.eps, lat-bins-0.1-50.eps, lat-bins-1-50.eps]

The total number of messages sent is similar across all configurations, since the link loss rate was 0% for most of the experiment. However, local-sync lost a significant number of messages that had been reported to the application as durable under the policy discussed in Section 3.1. These unrecoverable messages were either still buffered in the kernel or still in transit on the wide-area link; when the sending datacenter crashed and the link (independently) dropped the original copy of a message, TCP recovery could not overcome the loss.

Local-sync+FEC also lost packets: those still buffered in the kernel, but not those that had already been transmitted -- for the latter, the proactive redundancy mechanism was sufficient to overcome the loss. The best outcome is visible in the right-most histogram at 0.1%, 0.5%, and 1% link loss: although the network-sync solution experienced the same level of link-induced message loss, every lost packet that had been reported to the sender application as durable was in fact recovered on the receiver side of the link. This supports the premise that a network-sync solution can tolerate disaster while minimizing loss. Combined with the results from Section 5.4, this demonstrates that the network-sync solution achieves the best balance between reliability and performance.
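The difference between the two acknowledgment policies can be captured in a toy model. The sketch below is ours, with illustrative probabilities: local-sync+FEC acknowledges a message as soon as it is handed to the kernel, whereas network-sync acknowledges only after the egress gateway confirms the message has left the primary site; "dropped" here means loss on the link beyond what FEC could repair.

```python
import random

def acked_but_lost(mode, n_msgs=1000, p_kernel_buffered=0.05,
                   p_unrepaired_drop=0.01, seed=1):
    """Count messages acknowledged as durable but never received by
    the remote mirror at the moment of a rolling disaster.

    mode='local'   : ack on hand-off to the kernel (local-sync+FEC);
                     both kernel-buffered and unrepaired in-transit
                     messages were already acked, so both count as lost.
    mode='network' : ack only on the egress-gateway callback
                     (network-sync); kernel-buffered messages were
                     never acked, so only unrepaired in-transit drops
                     can count.
    All probabilities are illustrative, not measured values.
    """
    rng = random.Random(seed)
    lost = 0
    for _ in range(n_msgs):
        in_kernel = rng.random() < p_kernel_buffered  # never left the site
        dropped = rng.random() < p_unrepaired_drop    # FEC could not repair
        if mode == "local":
            if in_kernel or dropped:
                lost += 1
        else:  # network-sync
            if not in_kernel and dropped:
                lost += 1
    return lost
```

Under the same random draws, the local-sync+FEC count always dominates the network-sync count, mirroring the gap between the two solutions in Figure 6.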

Figure 7 quantifies the advantage of network-sync over local-sync+FEC. In this experiment, we run the same disaster scenario as above, but with 1% link loss during the disaster, and we vary the FEC parameter $c$ (i.e., the number of recovery packets). At $c=0$, there are no recovery packets for either local-sync+FEC or network-sync: if a data packet is lost during the disaster, it cannot be recovered, and TCP cannot deliver any subsequent data to the remote mirror process. Similarly, at $c=1$, the number of lost packets is relatively high for both local-sync+FEC and network-sync, since one recovery packet is not sufficient to mask 1% link loss. With $c > 1$, the number of recovery packets is usually sufficient to mask loss on the wide-area link; however, local-sync+FEC still loses data packets that had not left the local area before the disaster, whereas with network-sync, primary storage servers respond to the client only after receiving a callback from the egress gateway. As a result, network-sync can potentially reduce data loss in a disaster.
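The effect of $c$ on loss masking follows directly from the binomial distribution. The sketch below assumes an idealized erasure code over groups of $n$ data packets plus $c$ recovery packets, where a group is recoverable iff at most $c$ of its $n+c$ packets are lost; the group size of 8 is our assumption, not the paper's encoder setting:

```python
from math import comb

def p_unmasked(n_data, c, p_loss):
    """Probability that an FEC group cannot be reconstructed, i.e.
    that more than c of its n_data + c packets are dropped, under
    i.i.d. packet loss with rate p_loss (ideal erasure code)."""
    n = n_data + c
    p_recoverable = sum(comb(n, k) * p_loss**k * (1 - p_loss)**(n - k)
                        for k in range(c + 1))   # at most c packets lost
    return 1 - p_recoverable

# At 1% link loss and groups of 8 data packets, the failure
# probability drops steeply as c grows past 1.
for c in range(4):
    print(c, p_unmasked(8, c, 0.01))
```

At $c=0$ roughly 8% of groups fail outright, and even $c=1$ leaves a failure rate that is non-negligible across many groups, consistent with the observation that a single recovery packet cannot mask 1% link loss.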


Hakim Weatherspoon 2009-01-14