Check out the new USENIX Web site. next up previous
Next: Controlling replica divergence Up: Propagating updates Previous: Optimization 3: exploiting physical

Conflict resolution

With optimistic replication, concurrent updates are inevitable, although rare [35,31]. We use a combination of version vectors and the last-writer-wins rule to resolve conflicts.

First, recall that when delta timestamps mismatch, servers revert to full-state transfer. We then use version vectors [23] to separate true conflicts from other causes (e.g., missing updates) that can be fixed simply by overwriting the replica. This simplifies conflict resolution.

For conflicts on the contents of a regular file, we currently offer users two options. The first is the ``last-writer-wins'' rule using update timestamps (attribute ts in Figure 3). In this case, the clocks of servers should be loosely synchronized, e.g., using NTP, to respect the users' intuitive sense of update ordering. The second option is to concatenate two versions in the file and let the user fix the conflict manually. Other options, such as application-specific resolvers [36,18,32], are certainly possible, but we have not implemented them yet.

Conflicts regarding file attributes or directory entries are more difficult to handle. They fall into two categories. The first is a conflict between two directory-update operations; for example, Alice does ``mv /foo /alice/foo'' and Bob does ``mv /foo /bob/foo'' concurrently. In the end, we want one of the updates to take effect, but not both. The second category is a conflict between ``rmdir'' and any other operation; for example, Alice does ``mv /foo /alice/foo'' and Bob does ``rmdir /alice''. These problems are difficult to handle, because files may be replicated on different sets of nodes, and a node might receive only one of the conflicting updates and fail to detect the conflict in the first place.

We only outline our solution here, as it is fully described in [28]. Our principle is always to let the child file (``foo'' in our example), rather than its parent (``alice'' or ``bob''), dictate the outcome of the conflict resolution using the ``last-writer-wins'' rule. We thus let the file's backpointer (Section 3.3) authoritatively define the file's location in the file-system namespace. We implement directory operations, such as ``mv'' and ``rm'', as a change to the file's backpointer(s). When a replica receives a change to its backpointer, it also reflects the change to its parents by creating, deleting, or modifying the corresponding entries.5The parent directory will, in turn, flood the change to its replicas. In practice, we randomly delay the directory-entry patching and subsequent flooding, because there is a good chance that other replicas of the file will do the same. Figure 6 illustrates how Pangaea resolves the first conflict scenario. The same policy is used to resolve the mv-rmdir conflict: when a replica detects the absence of the directory entry corresponding to its backpointer, it re-creates the entry, which potentially involves re-creating the directory itself and the ancestor directories recursively, all the way to the root.

A directory in Pangaea is, in effect, merely a copy of the backpointers of its children. Thus, resolving conflicts on directory contents is done by applying the ``last-writer-wins'' rule to individual entries. If a file is to be removed from a directory, the directory still keeps the entry but marks it as ``dead'' (i.e., it acts as a ``death certificate'' [7]), so that we can detect when a stale change to the entry arrives in the future.

Figure 6: Example of conflict resolution involving four files, ``/'' (FileID=50), ``/foo'' (FileID=51), ``/alice/'' (FileID=52), and ``/bob/'' (FileID=53). ``ts=2'' shows the replica's timestamp. ``bp=[50,foo]'' shows that the backpointer of the replica indicates that the file has the name ``foo'' in the directory 50 (``/''). ``d={[51,foo,4]}'' means that the directory contains one entry, a file ``foo'' with ID of 51 and timestamp of 4. Entries marked ``*foo'' are death certificates. (1) Two sites initially store the same contents. (2) Alice does ``mv /foo /alice/foo''. (2') Bob concurrently does ``mv /foo /bob/foo'' on another node. Because Bob's update has a newer timestamp (ts=9) than Alice's (ts=8), we want Bob's to win over Alice's. (3) When Alice's node receives the update from Bob's, the replica of file 51 will notice that its backpointer has changed from [52, foo] to [53, foo]. This change triggers the replica to delete the entry from /alice and add the entry to /bob.

next up previous
Next: Controlling replica divergence Up: Propagating updates Previous: Optimization 3: exploiting physical
Yasushi Saito 2002-10-08