Home About USENIX Events Membership Publications Students
OSDI '04 Paper    [OSDI '04 Technical Program]

Improving the Reliability of Internet Paths with One-hop Source Routing

Krishna P. Gummadi, Harsha V. Madhyastha,
Steven D. Gribble, Henry M. Levy, and David Wetherall
Department of Computer Science & Engineering
University of Washington
{gummadi, harsha, gribble, levy, djw}@cs.washington.edu

Abstract

Recent work has focused on increasing availability in the face of Internet path failures.  To date, proposed solutions have relied on complex routing and path-monitoring schemes, trading scalability for availability among a relatively small set of hosts.

This paper proposes a simple, scalable approach to recover from Internet path failures.  Our contributions are threefold.  First, we conduct a broad measurement study of Internet path failures on a collection of 3,153 Internet destinations consisting of popular Web servers, broadband hosts, and randomly selected nodes.  We monitored these destinations from 67 PlanetLab vantage points over a period of seven days, and found availabilities ranging from 99.6% for servers to 94.4% for broadband hosts.  When failures do occur, many appear too close to the destination (e.g., last-hop and end-host failures) to be mitigated through alternative routing techniques of any kind. Second, we show that for the failures that can be addressed through routing, a simple, scalable technique, called one-hop source routing, can achieve close to the maximum benefit available with very low overhead. When a path failure occurs, our scheme attempts to recover from it by routing indirectly through a small set of randomly chosen intermediaries.

Third, we implemented and deployed a prototype one-hop source routing infrastructure on PlanetLab. Over a three day period, we repeatedly fetched documents from 982 popular Internet Web servers and used one-hop source routing to attempt to route around the failures we observed.  Our results show that our prototype successfully recovered from 56% of network failures. However, we also found a large number of server failures that cannot be addressed through alternative routing.

Our research demonstrates that one-hop source routing is easy to implement, adds negligible overhead, and achieves close to the maximum benefit available to indirect routing schemes, without the need for path monitoring, history, or a-priori knowledge of any kind.

1  Introduction

Internet reliability demands continue to escalate as the Internet evolves to support applications such as banking and telephony. Yet studies over the past decade have consistently shown that the reliability of Internet paths falls far short of the "five 9s" (99.999%) of availability expected in the public-switched telephone network [11]. Small-scale studies performed in 1994 and 2000 found the chance of encountering a major routing pathology along a path to be 1.5% to 3.3% [17,26].

Previous research has attempted to improve Internet reliability by various means, including server replication, multi-homing, or overlay networks. While effective, each of these techniques has limitations. For example, replication through clustering or content-delivery networks is expensive and commonly limited to high-end Web sites. Multi-homing (provisioning a site with multiple ISP links) protects against single-link failure, but it cannot avoid the long BGP fail-over times required to switch away from a bad path [12]. Overlay routing networks, such as RON, have been proposed to monitor path quality and select the best available path via the Internet or a series of RON nodes [2]. However, the required background monitoring is not scalable and therefore limits the approach to communication among a relatively small set of nodes.

This paper re-examines the potential of overlay routing techniques for improving end-to-end Internet path reliability. Our goal is to answer three questions:

  1. What do the failure characteristics of wide-area Internet paths imply about the potential reliability benefits of overlay routing techniques?
  2. Can this potential be realized with a simple, stateless, and scalable scheme?
  3. What benefits would end-users see in practice for a real application, such as Web browsing, when this scheme is used?

To answer the first question, we performed a large-scale measurement study that uses 67 PlanetLab vantage points to probe 3,153 Internet destinations for failures over seven days. Of these destinations, 378 were popular Web servers, 1,139 were broadband hosts, and 1,636 were randomly selected IP addresses. During the course of our 7-day study we observed more failures than the 31 node RON testbed saw in 9 months [7].

Our results show that end-to-end path availability varies substantially across different destination sets. On average, paths to popular Web servers had 99.6% availability, but paths to broadband hosts had only 94.4% availability. The vast majority of paths experienced at least one failure. Unfortunately, many failures are located so close to the destination that no alternative routing or overlay scheme can avoid them: 16% of failures on paths to servers and 60% of failures on paths to broadband hosts occur were last-hop or end-system failures. Effective remedies for these failures are increased end-system reliability and multi-homing.

To answer the second question, we use our measurement results to show that when an alternative path exists, that path can be exploited through extremely simple means. Inspired by the Detour study [21] and RON [2], we explore the use of a technique we call one-hop source routing. When a communication failure occurs, the source attempts to reach the destination indirectly through a small set of intermediary nodes. We show that a selection policy in which the source node chooses four potential intermediary nodes at random (called random-4) can obtain close to the maximum possible benefit. This policy gives well-connected clients and servers the ability to route around failures in the middle of the network without the need for complex schemes requiring a priori communication or path knowledge.

To answer the third question, we built and evaluated a prototype one-hop source routing implementation called SOSR (for Scalable One-hop Source Routing). SOSR uses the Linux netfilter/iptables facility to implement alternative packet routing for sources and NAT-style forwarding for intermediaries - both at user level. SOSR is straightforward to build and completely transparent to destinations. On a simple workload of Web-browsing requests to popular servers, SOSR with the random-4 policy recovered from 56% of network failures. However, many failures that we saw were application-level failures, such as server timeouts, which are not recoverable through any alternative routing or overlay schemes. Including such application-level failures, SOSR could recover from only 20% of the failures we encountered. The user-level perception of any alternative routing scheme is ultimately limited by the behavior of the servers as well as of the network.

The rest of the paper is organized as follows. We present our measurement study characterizing failures in the next section. Section 3 then shows the potential effectiveness of different practical policies for improving Internet path reliability. Section 4 describes the design, implementation, and evaluation of our prototype one-hop source routing system. We end with a discussion of related work (Section 5) and our conclusions (Section 6).

2  Characterizing Path Failures

This section describes our large-scale measurement study of Internet path failures. Our goals were: (1) to discover the frequency, location, and duration of path failures, and (2) to assess the potential benefits of one-hop source routing in recovering from those failures.

2.1  Trace methodology

From August 20, 2004 to August 27, 2004 we monitored the paths between a set of PlanetLab [18] vantage points and sets of destination hosts in the Internet. We periodically sent a probe on each path and listened for a response. If we received a response within a pre-defined time window, we declared the path to be normal. If not, we declared a loss incident on that path. Once a loss incident was declared, we began to probe the path more frequently to: (1) distinguish between a "true" path failure and a short-lived congestion episode and (2) measure the failure duration. We used traceroute to determine where along the path the failure occurred. In parallel, we also sent probes from the vantage point to the destination indirectly through a set of intermediary nodes. This allowed us to measure how often indirect paths succeeded when the default path failed.

2.1.1  Probes and traceroutes

Our probes consist of TCP ACK packets sent to a high-numbered port on the destination. Correspondingly, probe responses consist of TCP RST packets sent by the destination. We used TCP ACK packets instead of UDP or ICMP probes for two reasons. First, many routers and firewalls drop UDP or ICMP probes, or treat them with lower priority than TCP packets, which would interfere with our study. Second, we found that TCP ACK probes raise fewer security alarms than other probes. Before we included a candidate destination in our study, we validated that it would successfully respond to a burst of 10 TCP ACK probes. This ensured that destinations were not rate-limiting their responses, avoiding confusion between rate-limiting and true packet loss.

To determine where a failure occurred along a path, we used a customized version of traceroute to probe the path during a loss incident. Our version of traceroute uses TTL-limited TCP ACK packets, probing multiple hops along the route in parallel. This returns results much faster than the standard traceroute and permits us to determine the location of even short-lived failures.

2.1.2  Node selection

Initially, we selected 102 geographically distributed PlanetLab nodes as vantage points. Following the experiment, we examined the logs on each node to determine which had crashed or were rebooted during our trace. We then filtered the trace to remove any nodes with a total downtime of more than 24 hours, reducing the set to 67 stable vantage points for our analysis. We similarly selected 66 PlanetLab nodes to use as intermediaries, but only 39 of these survived crash/reboot post-filtering.

Using our vantage points, we monitored paths to three different sets of Internet hosts: popular Web servers, broadband hosts, and randomly selected IP addresses. A full list of IP addresses in each set and additional details describing our selection process are available at http://www.cs.washington.edu/homes/gummadi/sosr.

The members of each set were chosen as follows:

  • We culled our popular server set from a list of the 2,000 most popular Web sites according to www.ranking.com. Removing hosts that failed the TCP ACK rate-limit test and filtering duplicate IP addresses left us with 692 servers. The path behavior to a popular server is meant to be representative of the experience of a client when contacting a well-provisioned server.
  • We selected our broadband hosts from an IP address list discovered through a 2002 crawl of Gnutella [20]. From that set, we removed hosts whose reverse DNS lookup did not match a list of major broadband providers (e.g., adsl*bellsouth.net) and again filtered those that failed the rate-limit test. Finally, we selected 2,000 nodes at random from those that survived this filtering. The path behavior to a broadband host is meant to be representative of a peer-to-peer application or voice-over-IP (VoiP).
  • The random IP address set consists of 3,000 IP addresses that were randomly generated and that survived the rate-limit test. We use this set only as a basis for comparison.

We partitioned the destination sets across our initial vantage points such that each destination node was probed by only one vantage point. Because some of the vantage points were filtered from the trace due to failure or low availability, some of the destinations were consequently removed as well. Following this filtering, 378 servers, 1,139 broadband hosts, and 1,636 random IP addresses remained in the trace. Note that while we filtered vantage points and intermediaries for availability, we did not filter any of the destination sets beyond the initial TCP ACK rate-limit test. As a consequence, some destinations crashed or otherwise shut down during the trace, causing last-hop path failures.

2.1.3        Probe timing

Figure 1: Probe timing. (a) The sequence of probes that are sent along each path during the trace. (b) A loss incident begins with a single probe loss, and ends after 10 consecutive successful probe responses. (c) For each of the first 10 probe intervals of a loss incident, we issued indirect probes through each of 66 intermediaries.

During the 7-day trace period, we probed each path every 15 seconds. If the vantage point failed to receive a response within 3 seconds, we declared a loss to have occurred. A single loss transitioned the path into a loss incident - an event initiated by a single probe loss and ended by the reception of ten consecutive probe responses (Figure 1a). While a path is in the midst of a loss incident, we probed every 5 seconds (Figure 1b). We also issued a traceroute from the vantage point to the destination at the start of the loss incident.

For each of the first 10 probe intervals during a loss incident, we also attempted to probe the destination indirectly through each of the 66 PlanetLab intermediaries selected at the beginning of the experiment. Thus, during one of these probe intervals, the vantage point emits a probe to an intermediary every 5/66th of a second (Figure 1c). We allow six seconds for a response to flow back from the destination through the intermediary to the vantage point; if no response is received in this time we declare a loss through that intermediary.

2.1.4  Failures vs. loss incidents

In principle, it may seem simple to declare a path failure when some component of a path has malfunctioned and all packets sent on that path are lost. In practice, however, failures are more complex and difficult to define. For example, packet loss may be due to a true long-term failure or a short-term congestion event. In general, any operational definition of failure based on packet loss patterns is arbitrary.

We strove to define failure as a sequence of packet losses that would have a significant or noticeable application impact. We did not want to classify short sequences of packet drops as failures, since standard reliability mechanisms (such as TCP retransmission) can successfully hide these. Accordingly, we elevated a loss incident to a failure if and only if the loss incident began with three consecutive probe losses and the initial traceroute failed. We defined the failure to last from the send of the first failed probe until the send of the first of the ten successful probes that terminated the loss incident. For example, the loss incident shown in Figure 1b corresponds to a failure that lasted for 30 seconds.

2.2  Failure characteristics

Table 1 summarizes the high-level characteristics of the failures we observed, broken down by our three destination sets. In the table we show as classifiable those failures for which our modified traceroute was able to determine the location of the failure; the remainder we show as unclassifiable. The classifiable failures are further broken down into last-hop failures, which are failures of either the end-system or last-hop access link (we cannot distinguish the two), and non-last-hop failures, which occurred within the network.

For the popular servers, we saw 1,486 failures spread over 294 paths and 337 links along these paths. Of the 962 classifiable failures, 811 (84%) occurred within the network, while 16% were last-hop failures. On average, a path experienced 3.9 failures during the week-long trace, of which 0.4 were last-hop failures, 2.1 were non-last-hop failures, and 1.4 were unclassifiable.

Table 1: High-level characterization of path failures, observed from 08/20/04 to 08/27/04. We obtained path information using traceroutes. A failure is identified as a last-hop failure when it is attributable to either the access link connecting the destination to the network or the destination host itself.

For the broadband hosts we saw 7,560 failures of which 5,723 were classifiable. On average, a broadband path experienced 6.6 failures over the week, nearly double that of the popular server set. Of these 6.6 failures, 3.0 were last-hop, 2.0 non-last-hop, and 1.6 were unclassifiable. Comparing server and broadband paths, we saw approximately the same rate of non-last-hop failures, but broadband paths showed a much higher rate of last-hop failures (0.4 per path per week for servers, and 3.0 per path per week for broadband). Therefore, the network behaved similarly for broadband and popular server hosts, except over the last-hop.

2.2.1  Location of failures

To describe the failure locations in a meaningful way, we divide each path into four parts: last_hop, middle_core, src_side, and dst_side. Last_hop are either end-system failures or last-hop access-link failures. Middle_core failures occur in the "core of the Internet," which we define as the Tier1 domains. These are the few important domains, such as AT&T and Sprint, through which the vast majority of all Internet paths pass. We identify them using the methodology of Subramanian et al. [23]. Src_side and dst_side are therefore the remaining path segments between the core and source, or core and destination, respectively. If traceroute could not classify the failure location, we labeled it "unclassifiable."

Figure 2: Location of failures. Failures are spread throughout the Internet for all three destination sets. Last-hop failures dominate other failures for broadband hosts, but for popular servers, last-hop failures are rare.

Figure 2 shows the distribution of failures across these categories. Failures are spread throughout the Internet, and all three data sets observe approximately equal source-side and core failures. For popular servers, there are relatively few last-hop failures, and in fact the last-hop appears to be more reliable than the rest of the Internet! This is strong evidence that techniques such as one-hop source routing can improve end-to-end availability for server paths, as it targets these non-last-hop failures. For broadband hosts, however, last-hop failures dominate all other failures, and accordingly we should expect less of a benefit from one-hop source routing.

Not surprisingly, the random IP destination set behaves in a manner that is consistent with a blend of server-like and broadband-like hosts. Somewhat surprisingly, however, the random IP set sees a greater rate of destination-side failures than both servers and broadband hosts. We do not yet have an explanation for this.

2.2.2  Duration of failures

In Table 2, we show high-level statistics that characterize failure duration and path availability in our trace. The average path to a server is down for 2,561 seconds during our week long trace, which translates into an average availability of 99.6%. In comparison, the average path to a broadband host is down for 33,630 seconds during trace, leading to an average availability of 94.4%. Paths to broadband hosts are an order of magnitude less available than paths to server hosts. This is unsurprising, of course, since broadband hosts are less well maintained, tend to be powered off, and likely have worse quality last-hop network connections.

Table 2: Path downtime and failure durations. This table shows average and median path downtime, as well as average and median failure durations, for our three destination sets. The downtime for a path is the sum of all its failure durations.

The median path availability is significantly better than the average path availability, suggesting that the distribution of path availabilities is non-uniform. Figure 3 confirms this: for all three destination sets, more than half of the paths experienced less than 15 minutes of downtime over the week. As well as being generally less available than server paths, a larger fraction broadband paths suffer from high unavailability: more than 30% of broadband paths are down for more than an hour, and 13% are down for more than a day (not shown in graph).

Figure 3: Availability of paths (CDF). The cumulative distribution of total downtime experienced by the paths during our trace, for each of our three destination sets.

Table 2 also shows the average and median failure durations. On paths to servers, the average failure lasted for just under 11 minutes; in comparison, on paths to broadband hosts, the average failure lasted for 84 minutes. For both destination sets, last-hop failures lasted approximately an order of magnitude longer than non-last-hop failures. Unfortunately, this reduces the potential effectiveness of one-hop source routing. Last-hop failures can last for a long time, and they are also hard to route around. Like path availability, failure duration has a non-uniform distribution - median failure durations are significantly lower than average failure durations.

2.2.3  Frequency of failures

We can compute the number of failure-free paths by subtracting the number of failed paths from the number of paths probed (both shown in Table 1). This reveals that only 22% of server paths, 12% of broadband paths, and 15% of random paths were failure-free: most paths in each destination set experienced at least one failure.

Figure 4: Frequency of path failures. Paths are ordered by the number of failures experienced. Most paths experience few failures, but a small number of paths experience many failures. Note that both axes are in log-scale.

Figure 4 plots the number of failures each path experienced, using a log-scale on both axes. Each destination set is sorted in rank order from most-failing to least-failing path. This graph demonstrates two points: (1) a small number paths experience a very large number of failures, and (2) most paths experienced a small but non-zero number of failures. Additional analysis (not shown) also demonstrates that for broadband hosts, the failure-prone paths tend to fail on the last-hop, while for servers, the failure-prone paths tend to fail uniformly across the Internet, favoring neither the source-side, nor the core, nor the destination-side.

2.3  The potential of one-hop source routing

As previously described, during a loss incident we probed the destination indirectly through each of 39 intermediaries. If any of these indirect probes were successful, we considered the path to be recoverable using one-hop source routing. If not, we considered it to be unrecoverable. Note that this definition of recoverable provides an upper bound, since in practice an implementation is not likely to try such a large number of intermediaries when attempting to route around a failure.

Table 3: Potential effectiveness of one-hop source routing. Source routing can help recover from 66% of all failures on paths to servers, but fewer on paths to broadband hosts. Last-hop failures tend to confound recovery, while core failures are more recoverable.

The results of this experiment, shown in Table 3, indicate that 66% of all failures to servers are potentially recoverable through at least one intermediary. A smaller fraction (39%) of broadband failures are potentially recoverable. For all destination sets, one-hop source routing is very effective for failures in the Internet core, but it is less effective for source-side or destination-side failures. Somewhat surprisingly, some last-hop failures are recoverable. In part, this is due to multi-homing: i.e., there may be a last-hop failure on the default path to a destination, but a different last-hop link may be accessible on a different path through a destination. However, this is also due in part to our failure definition. If a last-hop link is not dead but merely "sputtering," sometimes probes along the default path will fail while an intermediary will be more "lucky" and succeed.

2.4  Summary

Our study examined failures of Internet paths from 67 vantage points to over 3,000 widely dispersed Internet destinations, including popular servers, broadband hosts, and randomly selected IP addresses. Overall, we found that most Internet paths worked well: most paths only experienced a handful of failures, and most paths experienced less than 15 minutes of downtime over our week-long trace. But failures do occur, and when they do, they were widely distributed across paths and portions of the network. However, broadband hosts tend to experience significantly more last-hop failures than servers, and last-hop failures tend to last long.

These failure characteristics have mixed implications for the potential effectiveness of one-hop source routing. Since server path failures are rarely on the last hop, there should be plenty of opportunity to route around them. Indeed, our initial results suggest that one-hop source routing should be able to recover from 66% of server path failures. In contrast, since broadband path failures are often on the last hop, there is less opportunity for alternative routing. Our results show than one-hop source routing will work less than 39% of the time in this case.

In the next section of the paper, we will examine one-hop source routing in greater detail, focusing initially on its potential for improving server path availability. Our goal in the next section is to use our trace to hone in on an effective, but practical, one-hop source routing policy. By effective, we mean that it successfully routes around recoverable failures. By practical, we mean that it succeeds quickly and with little overhead.

Figure 5: One-hop source routing. (a) The source (src) experiences a path failure to destination dst1, but it successfully routes through intermediary i5. (b) The source experiences a path failure to destination dst2. It uses a more aggressive recovery policy by simultaneously routing to intermediaries i2, i3, i4, and i5. The path to intermediary i2 experiences a path failure of its own, as does the path from intermediary i3 to the destination. Fortunately, the source is able reach dst2 through both i4 and i5.

3  One-Hop Source Routing

We have seen that 66% of all popular server path failures and 39% of all broadband host path failures are potentially recoverable through at least one of the 39 preselected intermediary nodes. This section investigates an obvious implication of this observation, namely, that one-hop source routing is a potentially viable technique for recovering from Internet path failures.

One-hop source routing is conceptually straightforward, as shown in Figure 5. After a node detects a path failure, it selects one or more intermediaries and attempts to reroute its packets through them. If the resulting indirect path is sufficiently disjoint from the default route, the packets will flow around the faulty components and successfully arrive at the destination. Assuming that the reverse path through the intermediary also avoids the fault, end-to-end communication is restored.

This approach raises several questions. Given a set of potential intermediaries for a failed path, how many of them on average will succeed at contacting the destination? What policy should the source node use to select among the set of potential intermediaries? To what extent does the effectiveness of one-hop source routing depend on the location of the failure along the path? Does a priori knowledge of Internet topology or the ability to maintain state about previous failures increase the effectiveness of a policy? When should recovery be initiated and when should recovery attempts be abandoned? The remainder of this section answers these questions.

3.1  Methodology

To answer these questions we rely on the data described in Section 2. As previously noted, following each failure we sent probe messages from the source to 39 PlanetLab intermediaries. The intermediaries then probed the destination and returned the results. If the source heard back from an intermediary before it heard back directly from the (recovered) destination, then we considered that intermediary to be successful. Thus, for each default-path failure, we were able to determine how many of the 39 PlanetLab intermediaries could have been used to route around it.

From this data we can analyze the effectiveness of policies that route through specific subsets of the intermediaries. For example, one policy might route through a single, randomly chosen intermediary; another policy might route through two preselected intermediaries in parallel, and so on. We can therefore compare various policies by simulating their effect using the data from our intermediate-node measurements.

3.2     What fraction of intermediaries help?

Figure 6: Disjoint paths. (a) The default path to the destination fails due to a faulty link. (b) Routing through intermediary i2 would succeed, since the diverted path is disjoint from the faulty link. (c) Routing through intermediary i1 would fail, since the diverted path rejoins the default path before the faulty link.

How many of the intermediaries succeed in routing around a particular failure depends on a number of factors, including the positions of the source, the destination, and the intermediaries in the network. For example, some intermediaries may not divert the packet flow sufficiently, either failing to pull packets from the default path before the fault or failing to return them to the default path after the fault. This can be seen in Figure 6a, where the route from src to dst fails due to the failure marked "X." An attempt to re-route through intermediary i2 would succeed (Figure 6b). However, routing through i1 would fail (Figure 6c), because i1's path to dst joins src's path to dst before the failure.

                   

Figure 7: The number of useful intermediaries. For each failure, we measured the number of useful intermediaries in our candidate set of 39. (a) This histogram shows the aggregated results for popular servers only. For example, we observed 168 failures (165 non-last-hop and 3 last-hop) for which there were exactly 21-25 useful intermediaries. (b) A CDF of the same data for the recoverable failures only.

As described above, for each detected failure we counted the number of "useful intermediaries" through which the source node could recover. Note that we continue attempting to recover until either an intermediary succeeds or the default path self-repairs, up to a maximum of 10 probe intervals. If the default path self-repairs before any intermediary succeeds, we do not classify the event as a recovered failure.

Figure 7(a) shows the results for popular servers, grouped by the number of useful intermediaries on the x-axis. Out of the 1486 failures, 503 (34%) could not recover through any intermediary, as shown by the left-most bar. Last-hop failures accounted for 89 of those unrecoverable failures.

Figure 7(b) presents a CDF of this data for the remaining 983 failures (66%) that were recoverable. The figure shows that 798 (81%) of these failures could be recovered through at least 21 of the 39 intermediaries. It's clear, then, that a significant fraction of failures are recoverable through a large number of intermediaries. However, there are also failures for which only a small number of intermediaries are useful. For example, 55 (5.6%) of the 983 recoverable failures could be recovered through only 1-5 nodes. Thus, some recoverable failures require a careful choice of intermediary.

None of the failures were recoverable through more than 36 of the 39 intermediaries. Investigating further, we found that four PlanetLab intermediaries were subject to a routing policy that prevented them from communicating with the vast majority of our destinations. If we exclude these nodes from consideration, many failures would be recoverable through all 35 of the remaining intermediaries. However, in many cases, there were still a few intermediaries that should be avoided.

3.3  Is random-k an effective policy?

The results in Figure 7 suggest that a very simple strategy for selecting intermediaries may work well. Similar in spirit to randomized load-balancing [15,6], a source should be able to avoid failures by randomly picking k intermediaries through which to attempt recovery. The source could send packets through all k intermediaries in parallel and then route through the intermediary whose response packet is first returned.

To evaluate this strategy, we examine a policy in which the random selection is done once for each failure instance. When a failure is detected, the source selects a set of k random intermediaries. During the failure, if none of the k intermediaries succeed on the first attempt, the source continues to retry those same intermediaries. At the next failure, the source selects a new set of k random intermediaries. We call this policy random-k.

Since many intermediaries can be used to avoid most recoverable faults, even a single random selection (random-1) should frequently succeed. By selecting more than one random intermediary, the source ensures that a single unlucky selection is not fatal. However, as there are some failures for which only a few specific intermediaries are helpful, picking a small number of random intermediaries will not always work.

Figure 8: The effectiveness of random-k. This graph shows the effectiveness of random-k at recovering from failures as a function of k (the number of randomly selected intermediaries the source tries concurrently).

Figure 8 shows the effectiveness of random-k as a function of k. For popular servers, random-1 can route around 43% of all failures we observed. By definition, random-39 can route around all recoverable failures (66% of all failures for popular servers). The "knee in the curve" is approximately k=4: random-4 can route around 61% of all failures (92% of all recoverable failures) for popular servers. From this, we conclude that random-4 makes a reasonable tradeoff between effort (the number of concurrent intermediaries invoked per recovery attempt) and the probability of success.

3.4  Which failures can random-k handle?

As we showed in Section 2, the location of a failure has a significant impact on the likelihood that one-hop source routing can recover. For example, last-hop failures are much harder to recover from than core failures. To understand the impact of failure location on random-k's ability to recover, we classified recovery attempts according to the failure location. Figure 9 shows the same data as Figure 8 broken down according to this classification, but for popular servers only.

Figure 9: Failures random-k can handle. This graph breaks Figure 8 data down according to the classes of failures from which random-k recovers, for servers. Random-k can recover from most failures in the core, and near the destination, but it is less capable at handle failures near the source or on the last hop.

Random-k recovers poorly from near-source and last-hop failures, as shown in the figure. For example, random-4 recovers from only 37% of last-hop failures and 50% of near-source. However, random-4 is very successful at coping with the other failure locations, recovering from 89% of middle_core and 72% of near-destination failures. Intuitively, the Internet core has significant path diversity, therefore a failure in the core is likely to leave alternative paths between many intermediaries and the source and destination. However, the closer the failure is to the source or the destination, the more intermediaries it will render ineffective.

3.5  Are there better policies than random-k?

Figure 8 shows that random-k is a very effective policy: there is little room to improve above random-k before "hitting the ceiling" of recoverable failures. But can we be smarter? This subsection explores two alternative policies that use additional information or state to further improve on random-k. These policies might not be practical to implement, as they require significant amounts of prior knowledge of Internet topology or state. Nonetheless, an analysis of these policies offers additional insight into one-hop source routing. Like random-k, each of the alternative policies selects k intermediaries through which to route concurrently for each recovery attempt. The two additional policies we consider are:

  1. History-k. In this policy, we assume that the source node remembers the intermediary that it most recently used to recover from a path failure for each destination. When the source experiences a path failure, it selects k-1 intermediaries at random, but it chooses this recently successful intermediary as the kth intermediary. If the source has never experienced a failure for the destination, this policy reverts to random-k. The rationale for this policy is that an intermediary that previously provided a sufficiently disjoint path to a destination is likely to do so again in the future.