
Failure recovery

Failure recovery in Pangaea is simplified by three properties: 1) the randomized nature of replica graphs, which lets them tolerate disrupted operations; 2) the idempotency of update operations, including NFS requests; and 3) the use of a unified logging module that allows any operation to be restarted.
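To illustrate why idempotency and logging together make restarts safe, here is a minimal sketch. It is not Pangaea's code: the names `Update`, `Replica`, `apply_update`, and `replay`, the log entry format, and the use of per-file version numbers to obtain idempotency are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Update:
    """A logged update (hypothetical format): set `path` to `contents` at `version`."""
    path: str
    version: int
    contents: bytes


@dataclass
class Replica:
    # Maps path -> (version, contents). Assumed in-memory stand-in for replica state.
    files: dict = field(default_factory=dict)

    def apply_update(self, u: Update) -> None:
        # Idempotent: the update takes effect only if it is newer than the stored
        # version, so applying the same update twice leaves the state unchanged.
        current = self.files.get(u.path)
        if current is None or current[0] < u.version:
            self.files[u.path] = (u.version, u.contents)


def replay(replica: Replica, log: list) -> None:
    # After a crash the node need not know which entries already took effect;
    # re-applying the whole log is safe because every entry is idempotent.
    for u in log:
        replica.apply_update(u)
```

In this sketch, a node that crashed partway through applying its log can simply call `replay` on the full log again after restarting; entries that had already been applied are skipped by the version check.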

We distinguish two types of failures: temporary and permanent. They are currently distinguished simply by their duration: a failure is treated as permanent once a node has been suspected of being down continuously for more than two weeks. Given that the vast majority of failures are temporary [11,3], we set a different goal for each type. For temporary failures, we try to reduce the recovery cost. For permanent failures, we try to clean up all data structures associated with the failed node so that the system runs as if the node had never existed in the first place.
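The duration-based rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify_failure`, the `suspected_since` parameter (the start of the current uninterrupted suspicion period, or `None` if the node is not suspected), and the returned labels are hypothetical; only the two-week threshold comes from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

# From the text: a failure becomes permanent after more than two weeks.
PERMANENT_THRESHOLD = timedelta(weeks=2)


def classify_failure(suspected_since: Optional[datetime], now: datetime) -> str:
    """Classify a node by how long it has been continuously suspected.

    `suspected_since` is assumed to be reset to None whenever the node is
    heard from again, so it always measures an uninterrupted period.
    """
    if suspected_since is None:
        return "alive"
    if now - suspected_since > PERMANENT_THRESHOLD:
        return "permanent"
    return "temporary"
```

A caller would pick the recovery path from the result: cheap resynchronization for `"temporary"`, and removal of every data structure that refers to the node for `"permanent"`.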


Yasushi Saito 2002-10-08