Non-Transitive Connectivity and DHTs

Michael J. Freedman, Karthik Lakshminarayanan, Sean Rhea, and Ion Stoica

New York University and University of California, Berkeley
mfreed@cs.nyu.edu, {karthik,srhea,istoica}@cs.berkeley.edu

1  Introduction

The most basic functionality of a distributed hash table, or DHT, is to partition a key space across the set of nodes in a distributed system such that all nodes agree on the partitioning. For example, the Chord DHT assigns each node a random identifier from the key space of integers modulo 2^160 and maps each key to the node whose identifier most immediately follows it. Chord is thus said to implement the successor relation, and so long as each node in the network knows its predecessor in the key space, any node can compute which keys are mapped onto it.
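For concreteness, the successor relation itself is only a few lines of code. The following sketch is illustrative rather than taken from any of our implementations; in particular, hashing node addresses with SHA-1 is an assumption made only for this example.

    import hashlib
    from bisect import bisect_left

    RING = 2 ** 160  # size of the circular identifier space

    def node_id(address):
        """Map an arbitrary node address into the identifier space (SHA-1 is assumed here)."""
        return int.from_bytes(hashlib.sha1(address.encode()).digest(), "big") % RING

    def root(key, sorted_node_ids):
        """Return the key's root: the first identifier at or after the key, wrapping around."""
        i = bisect_left(sorted_node_ids, key)
        return sorted_node_ids[i % len(sorted_node_ids)]

    ids = sorted(node_id("10.0.0.%d:5000" % i) for i in range(8))
    print(hex(root(node_id("some-key"), ids)))
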
An implicit assumption in Chord and other DHT protocols is that all nodes are able to communicate with each other, yet we know this assumption is unfounded in practice. We say that three hosts, A, B, and C, exhibit non-transitivity if A can communicate with B, and B can communicate with C, but A cannot communicate with C. As we show in Section 2, 2.3% of all pairs of nodes on PlanetLab exhibit transient periods in which they cannot communicate with each other, but in which they can communicate through a third node. These transient periods of non-transitivity occur for many reasons, including link failures, BGP routing updates, and ISP peering disputes (e.g., [13]).
Such non-transitivity in the underlying network is problematic for DHTs. Consider for example the Chord network illustrated in Figure 1. Identifiers increase from the left, so node B is the proper successor to key k. If nodes A and B are unable to communicate with each other, A will believe that C is its successor. Upon receiving a lookup request for k, A will return C to the requester. If the requester then tries to insert a document associated with k at node C, node C will refuse, since according to its view it is not responsible for key k.
While this example may seem contrived, it is in fact quite common. If each pair of nodes with adjacent identifiers in a 300-node Chord network (independently) has a 0.1% chance of being unable to communicate, then we expect that there is a 1 - 0.999^300 ≈ 26% chance that some pair will be unable to communicate at any time. However, both nodes in such a pair have a 0.999^2 chance of being able to communicate with the node that most immediately precedes them both.
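The arithmetic behind these figures is easy to reproduce; the following lines simply restate it.

    p_fail = 0.001   # chance that a given adjacent pair cannot communicate
    n_pairs = 300    # adjacent pairs around a 300-node ring
    print(1 - (1 - p_fail) ** n_pairs)   # ~0.26: some adjacent pair is broken
    print((1 - p_fail) ** 2)             # ~0.998: both nodes reach their shared predecessor
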
Collectively, the authors have produced three independent DHT implementations: the Bamboo [18] implementation in OpenDHT [19], the Chord [23] implementation in i3 [22], and the Kademlia [11] implementation in Coral [8]. Moreover, we have run public deployments of these three DHTs on PlanetLab for over a year.

Figure 1: Non-transitivity in Chord. The dashed lines represent predecessor links.

While DHT algorithms seem quite elegant on paper, in practice we found that a great deal of our implementation effort was spent discovering and fixing problems caused by non-transitivity. Of course, maintaining a full link-state routing table at each DHT node would have sufficed to solve all such problems, but would also require considerably more bandwidth than a basic DHT.^1 Instead, we each independently discovered a set of "hacks" to cover up the false assumption of full connectivity on which DHTs are based.
In this paper, we categorize the ways in which Bamboo, Chord, and Kademlia break down under non-transitivity, and we enumerate the ways we modified them to cope with these shortcomings. We also discuss application-level solutions to the problem. Many of these failure modes and fixes were quite painful for us to discover, and we hope that, at least in the short term, this work will save others the effort. In the longer term, we hope that by focusing attention on the problem, we will encourage future DHT designers to tackle non-transitivity head-on.
The next section quantifies the prevalence of non-transitivity on the Internet and surveys related work in this area. Section 3 presents a brief review of DHT terminology. Section 4 discusses four problems caused by non-transitivity in DHTs and our solutions to them. Finally, Section 5 concludes.

2  Prevalence of Non-Transitivity

The Internet is known to suffer from network outages (such as extremely heavy congestion or routing convergence problems) that result in the loss of connectivity between some pairs of nodes [14,3]. Furthermore, the loss of connectivity is often non-transitive; in fact, RON [3] and SOSR [10] take advantage of such non-transitivity (the fact that two nodes that temporarily cannot communicate with one another often have a third node that can communicate with them both) to improve resilience by routing around network outages.
Gerding and Stribling [9] observed a significant degree of non-transitivity among PlanetLab hosts; of all possible unordered three-tuples of nodes (A, B, C), about 9% exhibited non-transitivity. Furthermore, they attributed this non-transitivity to the fact that PlanetLab consists of three classes of nodes: Internet1-only, Internet2-only, and multi-homed nodes. Although Internet1-only and Internet2-only nodes cannot directly communicate, multi-homed nodes can communicate with them both.
Extending the above study, we have found that transient routing problems are also a major source of non-transitivity in PlanetLab. In particular, we considered a three-hour window on August 3, 2005 from the all-pairs ping dataset [1]. The dataset consists of pings between all pairs of nodes conducted every 15 minutes, with each data point averaged over ten ping attempts.
We counted the number of unordered pairs of hosts (A, B) such that A and B cannot reach each other but some other host C can reach both A and B. We found that about 5.2% of all pairs of nodes belonged to this category at some point during the three-hour window. Of these pairs, about 56% had persistent problems, probably due to the Internet1/Internet2 split described above. The remaining 44% of the pairs exhibited problems only intermittently; in fact, about 25% of the pairs could not communicate with each other in only one of the 15-minute snapshots. This suggests that non-transitivity is not entirely an artifact of the PlanetLab testbed, but is also caused by transient routing problems.
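The counting itself is simple once each snapshot has been reduced to a reachability matrix; the sketch below shows the per-snapshot computation (parsing the all-pairs-ping data into the matrix is omitted, and the matrix form is our assumption, not the dataset's native format).

    from itertools import combinations

    def non_transitive_pairs(reach):
        """reach[a][b] is True when host a received ping responses from host b."""
        n = len(reach)
        pairs = []
        for a, b in combinations(range(n), 2):
            if reach[a][b] or reach[b][a]:
                continue  # A and B can still communicate; not a broken pair
            # Broken pair: is there a third host C that can reach both A and B?
            if any(reach[c][a] and reach[c][b] for c in range(n) if c not in (a, b)):
                pairs.append((a, b))
        return pairs
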

Figure 2: Two styles of DHT routing for source node S to perform a lookup that terminates at root node R.

3  DHT Background

Before moving on to the core of this paper, we first briefly review basic DHT nomenclature. We assume the reader has some familiarity with basic DHT routing protocols. For more information, see [23,11,21].
The DHT assigns every key in the identifier space to a node, which is called the root (or the successor) of the key. The main primitive that DHTs support is lookup, in which a node can efficiently discover a key's root. The lookup protocol greedily traverses the nodes of the DHT, progressing closer to the root of the key at each step.
Each node maintains a set of neighbors that it uses to route packets. Typically, such neighbors are divided into (a) short links chosen from the node's immediate neighborhood in the ID space to ensure correctness of lookups, and (b) long links chosen to ensure that lookups are efficient (e.g., take no more than O(log n) hops for a network with n nodes). In Chord and Bamboo, the set of short links is called the node's successor list and leaf set, respectively, and the long links are called fingers and routing table entries. While Kademlia uses a single routing table, one can still differentiate between its closest bucket of short links and farther buckets of long links.
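As a rough illustration of this split (the names, ring size, and leaf-set width below are assumptions, not any particular DHT's parameters), the two kinds of neighbors might be derived as follows.

    RING = 2 ** 160

    def short_links(my_id, all_ids, width=4):
        """The width nodes on either side of my_id in the ID space (leaf set / successor list)."""
        ring = sorted(all_ids)
        i = ring.index(my_id)
        return [ring[(i + off) % len(ring)] for off in range(-width, width + 1) if off != 0]

    def long_link_targets(my_id, bits=160):
        """Finger-table-style targets at exponentially spaced points: my_id + 2^i around the ring."""
        return [(my_id + 2 ** i) % RING for i in range(bits)]
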
DHT routing can be either iterative or recursive [7] (see Figure 2). Consider a simple example, in which source node S initiates a lookup for some key whose root is node R. In iterative routing, node S first contacts node A to learn about node B, and then S subsequently contacts B. In recursive routing, S contacts A, and A contacts B in turn.
The two routing techniques have different strengths. For example, recursive routing is faster than iterative routing using the same bandwidth budget [7,17] and can use faster per-node timeouts [18]. On the other hand, iterative routing gives the initiating node more end-to-end control, which can be used, for instance, for better parallelization [11,17]. We discuss the impact of both approaches in the following section.
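The difference between the two styles lies only in who sends each message. The sketch below traces both patterns over the four-node path of Figure 2; the next_hop table is a stand-in for each node's actual routing decision.

    next_hop = {"S": "A", "A": "B", "B": "R", "R": None}  # hop sequence from Figure 2

    def iterative_lookup(source):
        """The source itself contacts every hop in turn until it reaches the root."""
        current = next_hop[source]
        while current is not None:
            print(source, "->", current)   # the source sends each query directly
            current = next_hop[current]

    def recursive_lookup(node):
        """Each hop forwards the query onward; the source sends only one message."""
        nxt = next_hop[node]
        if nxt is not None:
            print(node, "->", nxt)         # the current hop forwards the query
            recursive_lookup(nxt)

    iterative_lookup("S")   # S -> A, S -> B, S -> R
    recursive_lookup("S")   # S -> A, A -> B, B -> R
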

4  Problems and Solutions

This section presents problems caused by non-transitivity in DHTs and the methods we use to mitigate them. We present these problems in increasing order of how difficult they are to solve.

4.1  Invisible Nodes

One problem due to non-transitivity occurs when a node learns about system participants from other nodes, yet cannot directly communicate with these newly discovered nodes. This problem arises both during neighbor maintenance and while performing lookups.
For example, assume that a node A learns about a potential neighbor B through a third node C, but A and B cannot directly communicate. We say that from A's perspective B is an invisible node. In early versions of both Bamboo and i3-Chord, A would blindly add B as a neighbor. Later, A would notice that B was unreachable and remove it, but in the meantime A would try to route messages through B.
A related problem occurs when nodes blindly trust failure notifications from other nodes. Continuing the above example, when A fails to contact B due to non-transitivity, in a naive implementation A will inform C of this fact, and C will erroneously remove B as a neighbor.
A simple fix for both of these problems is to prevent nodes from blindly trusting other nodes with respect to which nodes in the network are up or down. Instead, a node A should only add a neighbor B after successfully communicating with it, and A should only remove a neighbor with whom it can no longer directly communicate. This technique is used by all three of our DHTs.
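A minimal sketch of this first-hand-evidence rule follows; ping stands in for whatever direct liveness probe the DHT already performs, and the failure threshold is an assumed value.

    class NeighborTable:
        def __init__(self, ping, max_failures=3):
            self.ping = ping              # callable: ping(addr) -> bool, a direct probe
            self.max_failures = max_failures
            self.neighbors = {}           # addr -> consecutive failed direct probes

        def consider(self, addr):
            """Add a candidate learned from a third party only after reaching it ourselves."""
            if addr not in self.neighbors and self.ping(addr):
                self.neighbors[addr] = 0

        def report_failure(self, addr, from_third_party=False):
            """Ignore second-hand failure reports; count only our own failed probes."""
            if from_third_party or addr not in self.neighbors:
                return
            self.neighbors[addr] += 1
            if self.neighbors[addr] >= self.max_failures:
                del self.neighbors[addr]  # remove only after repeated direct failures

        def probe_all(self):
            for addr in list(self.neighbors):
                if self.ping(addr):
                    self.neighbors[addr] = 0
                else:
                    self.report_failure(addr)
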

Figure 3: Invisible nodes. S learns about M and N from A while trying to route to R, but S has no direct connectivity to M. By sending lookup messages to M and N in parallel, S avoids being stalled while its request to M times out.

Invisible nodes also cause performance problems during iterative routing, where the node performing a lookup must communicate with nodes that are not its immediate neighbors in the overlay. For example, as shown in Figure 3, a node S may learn of another node M through its neighbor A, but may be unable to directly communicate with M to perform a lookup. S will eventually time out its request to M, but such timeouts increase the latency of lookups substantially.
Three techniques can mitigate the effect of invisible nodes on lookup performance in iterative routing. First, a DHT can use virtual coordinates such as those computed by Vivaldi [6] to choose tighter timeouts. This technique should work well in general, although we have found that the Vivaldi implementations in both Bamboo and Coral are too inaccurate on PlanetLab to be of much use.^2
Second, a node can send several messages in parallel for each lookup, allowing requests to continue towards the root even when some others time out. As shown in Figure 3, S can send lookup messages to M and N in parallel. This technique was first proposed in Kademlia [11].
Third, a node can remember other nodes that it was unable to reach in the past. Using this technique, which we call an unreachable-node cache, a node S marks M as unreachable after a few failed communication attempts. Then, if M is encountered again during a subsequent lookup request, S immediately concludes that it is unreachable without wasting bandwidth or suffering a timeout.
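A sketch of such a cache appears below; the failure threshold and expiry are illustrative values, not the settings used by any of our deployments.

    import time

    class UnreachableNodeCache:
        def __init__(self, failures_needed=2, expiry_seconds=600):
            self.failures_needed = failures_needed
            self.expiry_seconds = expiry_seconds
            self.failures = {}        # addr -> consecutive failed contact attempts
            self.blocked_until = {}   # addr -> time before which lookups skip this node

        def should_skip(self, addr):
            """During a lookup, skip nodes we have recently been unable to reach."""
            return self.blocked_until.get(addr, 0) > time.time()

        def record_failure(self, addr):
            self.failures[addr] = self.failures.get(addr, 0) + 1
            if self.failures[addr] >= self.failures_needed:
                self.blocked_until[addr] = time.time() + self.expiry_seconds

        def record_success(self, addr):
            self.failures.pop(addr, None)
            self.blocked_until.pop(addr, None)
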
OpenDHT and i3 both use recursive routing, but Coral implements iterative routing using the above approach, maintaining three parallel RPCs and an unreachable-node cache.

4.2  Routing Loops


Figure 4: Routing loops. In Chord, if a lookup passes by the correct successor on account of non-transitivity, a routing loop arises. The correctness of lookup can be improved in such cases by traversing predecessor links.

In Chord, non-transitivity causes routing loops as follows. The root for a key k in Chord is the node whose identifier most immediately succeeds k in the circular key space. In Figure 4, let the proper root for k be R. Also, assume that P cannot communicate with R. A lookup routed through P thus skips over R to N, the next node in the key space with which P can communicate. N, however, knows its correct predecessor in the network, and therefore knows that it is not the root for k. It thus forwards the lookup around the ring, and a loop is formed.
Bamboo and Kademlia avoid routing loops by defining a total ordering over nodes during routing. In these networks, a node A only forwards a lookup on key k to another node B if |B-k| < |A-k|, where "-" represents modular subtraction in Bamboo and XOR in Kademlia.
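Written out, the rule is a one-line check on a distance metric; the modular distance below is one natural reading of |B - k|, and the ring size is the usual 2^160.

    RING = 2 ** 160

    def bamboo_distance(node_id, key):
        """|node - key| on the ring: the shorter of the two ways around."""
        d = (node_id - key) % RING
        return min(d, RING - d)

    def kademlia_distance(node_id, key):
        """The XOR metric."""
        return node_id ^ key

    def may_forward(a_id, b_id, key, distance):
        """Node A forwards a lookup on key to node B only if B is strictly closer to key."""
        return distance(b_id, key) < distance(a_id, key)
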
Introducing such a total ordering in Chord is straightforward: instead of blindly forwarding a lookup towards the root, a node can stop any lookup that has already passed its root. For example, when N receives a lookup for k from P, it knows something is amiss: P < k < N, yet N's own predecessor also succeeds k, so N cannot be the direct successor of k.
Stopping a lookup in this way avoids loops, but it is often possible to get closer to the root for a key by routing along predecessor links once normal routing has stopped. i3's Chord implementation backtracks in this way. For example, the dashed lines from N back to R in Figure 4 show the path of the lookup using predecessor links. To guarantee termination when backtracking, once a packet begins following predecessor links it is never again routed along forward links.
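The backtracking step itself can be sketched as follows; the Node structure and its reachability flag are hypothetical stand-ins for i3's actual state, and the loop never returns to forward links.

    from dataclasses import dataclass
    from typing import Optional

    RING = 2 ** 160

    @dataclass
    class Node:
        nid: int
        predecessor: "Optional[Node]" = None
        reachable: bool = True   # stand-in for "the current hop can contact this node"

    def backtrack_to_root(node, key):
        """Walk predecessor links while the predecessor still succeeds key and is reachable."""
        while (node.predecessor is not None
               and node.predecessor.reachable
               and (node.predecessor.nid - key) % RING < (node.nid - key) % RING):
            node = node.predecessor
        return node
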

4.3  Broken Return Paths

Figure 5: Broken return paths. Although S can route a put or get request to R through the overlay, there may be no direct IP route back from R to S. One alternative is to route the result back along the path taken from S to R; the other is to route through a random neighbor T.

Often an application built atop a DHT routing layer wants to not only route to the root of a key but also to retrieve some value back. For example, it may route a put request to the root, in which case it expects an acknowledgment of its request in return. Likewise, with a get request, it expects to receive any values stored under the given key. In one very important case, it routes a request to join the DHT to the root and expects to receive the root's leaf set or successor list in return.
As shown in Figure 5, when a source S routes a request recursively to the root R, the most obvious and least costly way for R to respond is to communicate with S directly over IP. While this approach works well in the common case, it fails with non-transitivity; the existence of a route from S to R through the overlay does not guarantee the existence of the direct IP route back. We know of two solutions to this problem.
The first solution is to source route the message backwards along the path it traveled from S to R in the first place, as shown by the dotted line in Figure 5. Since each node along the path forwarded the message through a neighbor that had been responding to its probes for liveness, it is likely that this return path is indeed routable. A downside of this solution is that the message takes several hops to return to the client, wasting the bandwidth of multiple nodes.^3
A less costly solution is to have R source route its response to S through a random member of its leaf set or successor list, as shown by the dashed line in Figure 5. Because node identifiers are assigned randomly, these neighbors are effectively random with respect to the underlying network, so most of them are likely to be able to route to S. Moreover, we already know that R can route to them, or it would not have them as neighbors.
A problem with both of these solutions is that they waste bandwidth in the common case where R can indeed send its response directly to S. To avoid this waste, we have S acknowledge the direct response from R. If R fails to receive an acknowledgment after some timeout, R source routes the response back (either along the request path or through a single neighbor). This timeout can be chosen using virtual coordinates, although we have had difficulty with Vivaldi on PlanetLab as discussed earlier. Alternatively, we can simply choose a conservative timeout value: as it is used only in the uncommon case where R cannot route directly to S, it affects the latency of only a few requests in practice. Bamboo/OpenDHT routes back through a random leaf-set neighbor in the case of non-transitivity, using a timeout of five seconds.
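The resulting reply logic is short. In the sketch below, send_direct, wait_for_ack, and source_route_via are hypothetical transport helpers, and the five-second timeout is the Bamboo/OpenDHT value mentioned above.

    import random

    ACK_TIMEOUT = 5.0   # seconds; paid only when the direct path turns out to be broken

    def reply(source_addr, response, leaf_set, send_direct, wait_for_ack, source_route_via):
        """Answer a recursive request: try direct IP first, then fall back through the overlay."""
        send_direct(source_addr, response)                 # cheap common case
        if wait_for_ack(source_addr, timeout=ACK_TIMEOUT):
            return
        # No ack: assume non-transitivity and relay through a random leaf-set
        # neighbor, which is random with respect to the underlying network.
        relay = random.choice(leaf_set)
        source_route_via(relay, source_addr, response)
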
We note that iterative routing does not directly suffer from this problem. Since S directs the routing process itself, it will assume R is down and look for an alternate root R' (i.e., the node that would be the root if R were actually down). Of course, depending on the application, R' may not be a suitable replacement for R, but that reduces to the inconsistent root problem, which we discuss next.

4.4  Inconsistent Roots


Figure 6: Inconsistent roots. A put from S1 is routed to the root, R, which should replicate it on R', C, and D. But since R cannot communicate with R', it replicates it on C, D, and E instead. R' will later acquire a replica during synchronization with C, D, and E.

The problems we have discussed so far are all routing problems. In this section, we discuss a problem caused by non-transitivity that affects the correctness of the partitioning of the DHT key space.
Most DHT applications assume that there is only one root for a given key in the DHT at any given time. As shown in Figure 6, however, this assumption may be invalid in the presence of non-transitivity. In the figure, node R is the proper root of key k, but since R and R' cannot communicate, R' mistakenly believes it is the root for k. A lookup from S1 finds the correct root, but a lookup from S2 travels through node I, which also cannot communicate with R, and terminates instead at R'.
Prior work has explored the issue of multiple roots due to transient conditions created by nodes joining and leaving the overlay, but has not explored the effects of misbehavior in the underlying network [4].
Given a complete partition of the network, it is difficult to solve this problem at all, and we are not aware of any existing solutions to it. On the other hand, if the degree of non-transitivity is limited, the problem can be eliminated by the use of a consensus algorithm. The use of such algorithms in DHTs is an active area of research [12,20].
Nonetheless, consensus is expensive in messages and bandwidth, so many existing DHTs use a probabilistic approach to solving the problem instead. For example, FreePastry 1.4.1 maintains full link-state routing information for each leaf set, and a node is considered alive if any other member of its leaf set can route to it [2]. Once routability has been provided in this manner, existing techniques (e.g., [4]) can be used to provide consistency.
An alternative approach used by both DHash [5] and OpenDHT [16] is to solve the inconsistent root problem at the application layer. Consider the traditional put/get interface to hash tables. As shown in Figure 6, DHash sends a put request from S1 for a key-value pair (k,v) to the r closest successors of k, each of which stores a replica of (k,v).^4 In the figure, R cannot communicate with R', and hence the wrong set of nodes store replicas.
To handle this case, as well as normal failures, the nodes in each successor list periodically synchronize with each other to discover values they should be storing (see [5,16] for details). As shown in the figure, R' synchronizes with C-E and learns about the value put by S1. A subsequent get request from S2 which is routed to R' will thus find the value despite the non-transitivity.
Of course, if R' fails to synchronize with C-E between the put from S1 and the get from S2, it will mistakenly send an empty response for the get. To avoid this case, for each get request on key k, DHash and OpenDHT query multiple successors of k. For example, in the figure, R' would send the get request to C-E, and all four nodes would respond to S2, which would then compile a combined response. This extra step increases correctness at the cost of increased latency and load; OpenDHT uses heuristics to decide when this extra step can be eliminated safely [17].
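In outline, the extra step fans a get out to several successors of the key and merges their answers; the sketch below assumes a hypothetical fetch RPC and an illustrative replica count.

    def robust_get(key, successors, fetch, replicas=4):
        """Query several successors of key and combine their stored values."""
        values = set()
        for node in successors[:replicas]:   # e.g., R', C, D, and E in Figure 6
            try:
                values |= set(fetch(node, key))
            except ConnectionError:          # tolerate successors we cannot reach
                continue
        return values
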

5  Conclusion

In this paper, we enumerated several ways in which naive DHT implementations break down under non-transitivity, and we presented our experiences in dealing with these problems when building and deploying three independent DHT-based systems: OpenDHT [19], which uses Bamboo [18]; i3 [22], which uses Chord [23]; and Coral [8], which uses Kademlia [11]. While we believe that the ultimate long-term answer to the issues arising from non-transitivity is perhaps a fresh DHT design, we hope that, at least in the short term, this work will save others the effort of finding and fixing the problems we encountered.

References

[1] PlanetLab All-Pairs Pings. http://pdos.lcs.mit.edu/~strib/pl_app/.
[2] FreePastry release notes. http://freepastry.rice.edu/FreePastry/README-1.4.1.html, May 2005.
[3] D. Andersen, H. Balakrishnan, F. Kaashoek, and R. Morris. Resilient Overlay Networks. In Proc. SOSP, 2001.
[4] M. Castro, M. Costa, and A. Rowstron. Performance and dependability of structured peer-to-peer overlays. Technical Report MSR-TR-2003-94, Dec. 2003.
[5] J. Cates. Robust and efficient data management for a distributed hash table. Master's thesis, Massachusetts Institute of Technology, May 2003.
[6] F. Dabek, R. Cox, F. Kaashoek, and R. Morris. Vivaldi: A decentralized network coordinate system. In Proc. SIGCOMM, 2004.
[7] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris. Designing a DHT for low latency and high throughput. In Proc. NSDI, 2004.
[8] M. J. Freedman, E. Freudenthal, and D. Mazières. Democratizing content publication with Coral. In Proc. NSDI, Mar. 2004.
[9] S. Gerding and J. Stribling. Examining the tradeoffs of structured overlays in a dynamic non-transitive network, 2003. Class project: http://pdos.csail.mit.edu/~strib/docs/projects/networking_fall2003.pdf.
[10] K. P. Gummadi, H. V. Madhyastha, S. D. Gribble, H. M. Levy, and D. Wetherall. Improving the reliability of Internet paths with one-hop source routing. In Proc. OSDI, 2004.
[11] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. In Proc. IPTPS, 2002.
[12] A. Muthitacharoen, S. Gilbert, and R. Morris. Etna: A fault-tolerant algorithm for atomic mutable DHT data. Technical Report MIT-LCS-TR-993, MIT-LCS, June 2005.
[13] D. Neel. Cogent, Level 3 in standoff over Internet access. TechWeb, Oct. 2005.
[14] V. Paxson. Measurements and Analysis of End-to-End Internet Dynamics. PhD thesis, U.C. Berkeley, 1997.
[15] P. Pietzuch, J. Ledlie, and M. Seltzer. Supporting network coordinates on PlanetLab. 2005.
[16] S. Rhea. OpenDHT: A public DHT service. PhD thesis, U.C. Berkeley, Aug. 2005.
[17] S. Rhea, B.-G. Chun, J. Kubiatowicz, and S. Shenker. Fixing the embarrassing slowness of OpenDHT on PlanetLab. In Proc. WORLDS, Dec. 2005.
[18] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling churn in a DHT. In Proc. USENIX Annual Technical Conference, June 2004.
[19] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu. OpenDHT: A public DHT service and its uses. In Proc. SIGCOMM, Aug. 2005.
[20] R. Rodrigues and B. Liskov. Rosebud: A scalable Byzantine-fault-tolerant storage architecture. Technical Report TR/932, MIT CSAIL, Dec. 2003.
[21] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. IFIP/ACM Middleware, Nov. 2001.
[22] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, and S. Surana. Internet Indirection Infrastructure. In Proc. SIGCOMM, Aug. 2002.
[23] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. SIGCOMM, Aug. 2001.

Footnotes:

1. For some applications, link-state routing may in fact be the right solution, but such systems are outside the scope of our consideration.
2. We note, however, that neither of our Vivaldi implementations includes the kinds of filtering used by Pietzuch, Ledlie, and Seltzer to produce more accurate coordinates on PlanetLab [15]; it is possible that their implementation would produce more accurate timeout values.
3. A similar approach, in which R uses the DHT's routing algorithm to route its response to S's identifier, has a similar cost but a lower likelihood of success in most cases, so we ignore it here.
4. DHash actually stores erasure codes rather than replicas, but the distinction is not relevant to this discussion.