Discussion


The key metric for any disaster-tolerant remote mirroring technology is the distance by which datacenters can be separated. Today, a disturbing number of New York City banks maintain backups in New Jersey or Brooklyn, because they simply cannot tolerate higher latencies.

The underlying problem is that these systems typically operate over TCP/IP. Operators do tune the system to match the properties of the network. For example, TCP can be configured to use massive sender buffers and unusually large segments, and an application can be modified to employ multiple side-by-side streams (e.g. GridFTP). Yet even with such steps, the protocol remains purely reactive: recovery packets are sent only in response to actual indications of failure, in the form of duplicate acknowledgments acting as implicit negative acknowledgments (i.e. fast retransmit) or timeouts keyed to the round-trip time (RTT). Consequently, its recovery time is tightly linked to the distance between communicating end-points. TCP/IP, for example, requires a minimum of around RTTs to recover lost data, which translates into substantial fractions of a second if the mirrors are on different continents. No matter how large we make the TCP buffers, the remote data stream will experience an RTT hiccup each time loss occurs: to deliver data in order, the receiver must await the missing data before subsequent packets can be delivered.
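The link between distance and recovery time can be illustrated with a small model. The following is a minimal sketch, not a measurement: it assumes light travels through fiber at roughly 200,000 km/s, ignores queueing and processing delay, and treats the number of RTTs needed to recover a lost packet as a hypothetical parameter (`recovery_rtts`) rather than a figure taken from this paper. The function names are illustrative only.

```python
# Assumption: signal speed in optical fiber is about 200,000 km/s,
# i.e. 200 km per millisecond (roughly two thirds of c).
FIBER_KM_PER_MS = 200.0


def rtt_ms(distance_km):
    """Lower bound on the round-trip time (ms) over a fiber path."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


def delivery_times(n_packets, lost, rtt, recovery_rtts, send_gap_ms=0.1):
    """Model in-order delivery of a stream in which one packet is lost.

    Packet i is sent at i * send_gap_ms and normally arrives rtt / 2 later.
    The lost packet arrives an extra recovery_rtts * rtt later
    (recovery_rtts is a hypothetical parameter of this sketch).
    Returns (arrivals, delivered), both lists of times in ms.
    """
    arrivals = []
    for i in range(n_packets):
        t = i * send_gap_ms + rtt / 2.0
        if i == lost:
            t += recovery_rtts * rtt
        arrivals.append(t)

    # In-order delivery: packet i reaches the application only after
    # every packet up to and including i has arrived.
    delivered = []
    latest = 0.0
    for t in arrivals:
        latest = max(latest, t)
        delivered.append(latest)
    return arrivals, delivered


if __name__ == "__main__":
    for km in (50, 1000, 4000, 10000):
        rtt = rtt_ms(km)
        arrivals, delivered = delivery_times(
            n_packets=5, lost=1, rtt=rtt, recovery_rtts=1.0
        )
        # Stall seen by the packet immediately behind the lost one.
        stall = delivered[2] - arrivals[2]
        print(f"{km:>6} km  RTT {rtt:7.2f} ms  stall {stall:7.2f} ms")
```

In this model, every packet queued behind the lost one is held back until the retransmission arrives, so the stall grows linearly with distance regardless of how much buffering either end provides.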

Network-sync evades this RTT issue, but does not protect the application against every possible rolling disaster scenario. Packets can still be queued in the local area when disaster strikes. Further, the network can be partitioned in the split second(s) before a primary site fails. Neither proactive redundancy nor network-level callbacks will prevent loss in these cases. Accordingly, we envision that applications will need a mixture of remote-sync and network-sync, with the former reserved for particularly sensitive scenarios, and the latter used in most cases.

Another issue is failover and recovery. Since the network-sync option enhances remote mirroring protocols, we assume that a complete remote mirroring protocol will itself handle failover and recovery directly [19,22,20]. As a result, in this work, we focus on evaluating the fault-tolerance capabilities of a network-sync option and do not discuss failover and recovery protocols.

**Figure 4:** Format of a log after writing a file system with two subdirectories, holding the files `/dir1/file1` and `/dir2/file2`.


Hakim Weatherspoon 2009-01-14