Failure Model and Assumptions:

We assume that failures can occur at any level: storage devices, the storage area network, network links, switches, hubs, the wide-area network, or an entire site. Further, we assume that components can fail simultaneously or in sequence, as in a rolling disaster. However, we assume that the storage system at each site is capable of tolerating and recovering from all but the most extreme local failures, and that sites may have redundant network paths connecting them. This allows us to focus on tolerating failures that disable an entire site, and on combinations of failures such as the loss of both an entire site and the network connecting it to the backup (what we call a rolling disaster). Figure 2 illustrates some points of failure.

Although industry standards essentially preclude data loss on wide-area optical links themselves, wide-area connections include layers of electronics: routers, gateways, firewalls, etc. These components can and do drop packets and, at very high data rates, so can the operating system on the destination machine. Accordingly, our model assumes wide-area networks with high data rates (10 to 40 Gbit/s) but sporadic, potentially bursty packet loss. The packet loss model used in our experiments is based on actual observations of TeraGrid, a scientific data network that links supercomputing centers and has precisely these characteristics. In particular, Balakrishnan et al. [10] cite loss rates over 0.1% at times on uncongested optical-link paths between supercomputing centers. As a result, we emulate disaster with loss rates of up to 1% in our evaluation in Section 5.
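To give a feel for what these loss rates mean at such data rates, the following back-of-the-envelope sketch estimates how many packets per second would be lost on average. The 1500-byte packet size and the function names are assumptions made for illustration only; they are not taken from our experimental setup, and the calculation ignores burstiness.

```python
def packets_per_second(rate_gbps: float, packet_bytes: int = 1500) -> float:
    """Packets sent per second on a fully used link.

    rate_gbps    -- link rate in Gbit/s (1 Gbit = 1e9 bits)
    packet_bytes -- assumed packet size; 1500 bytes is an illustrative choice
    """
    return rate_gbps * 1e9 / (packet_bytes * 8)


def expected_losses_per_second(rate_gbps: float, loss_rate: float,
                               packet_bytes: int = 1500) -> float:
    """Average packets lost per second for a given loss probability.

    loss_rate is a fraction, e.g. 0.001 for 0.1%.
    """
    return packets_per_second(rate_gbps, packet_bytes) * loss_rate


if __name__ == "__main__":
    # The two ends of the range discussed in the text.
    for rate, loss in [(10, 0.001), (40, 0.01)]:
        lost = expected_losses_per_second(rate, loss)
        print(f"{rate} Gbit/s at {loss:.1%} loss: about {lost:,.0f} packets/s lost")
```

Under these assumptions, even the 0.1% figure corresponds to hundreds of lost packets per second on a saturated 10 Gbit/s link, so loss cannot be treated as a rare event.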

Of course, reliable transmission protocols such as TCP are typically used to communicate updates and acknowledgments between sites. Nonetheless, under our assumptions, a lost packet may prevent packets received later from being delivered to the mirrored storage system. The problem is that once the primary site has failed, there may be no way to recover a lost packet. Because TCP delivers data in sequence, all data sent after the lost packet is discarded in such situations: the gap prevents its delivery.
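The effect of such a gap can be sketched with a toy model of in-order delivery. This is a deliberate simplification and not an implementation of TCP: the function name `deliverable_prefix` and its parameters are hypothetical, sequence numbers count whole packets, and a live sender is assumed to retransmit every lost packet successfully.

```python
def deliverable_prefix(sent_count: int, lost: set, sender_alive: bool) -> list:
    """Return the packet sequence numbers handed to the application, in order.

    sent_count   -- packets the primary sent, numbered 0 .. sent_count - 1
    lost         -- sequence numbers dropped in the network
    sender_alive -- whether the primary can still retransmit
    """
    if sender_alive:
        # Assumption: a live sender eventually fills every gap.
        return list(range(sent_count))

    # Packets that actually reached the receiver, in sequence order.
    received = [seq for seq in range(sent_count) if seq not in lost]

    delivered = []
    expected = 0
    for seq in received:
        if seq != expected:
            # Gap: everything after the first missing packet stays
            # buffered and is never delivered, since no one can resend it.
            break
        delivered.append(seq)
        expected += 1
    return delivered


if __name__ == "__main__":
    # Ten packets sent, packet 3 lost, then the primary site fails.
    print(deliverable_prefix(10, {3}, sender_alive=False))
    # Same loss, but the primary survives and retransmits.
    print(deliverable_prefix(10, {3}, sender_alive=True))
```

In the first case only packets 0 through 2 reach the mirrored storage system, even though packets 4 through 9 arrived intact at the receiver.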

Hakim Weatherspoon 2009-01-14