USENIX '04 Paper
[USENIX '04 Technical Program]
A Transport Layer Approach for Improving End-to-End Performance and Robustness Using Redundant Paths
Ming Zhang, Junwen Lai
8,27]. In this paper, we propose an end-to-end transport layer protocol, mTCP, which can aggregate the available bandwidth of those redundant paths in parallel. By striping one flow's packets across multiple paths, mTCP can not only obtain higher end-to-end throughput but also be more robust under path failures. When some paths fail, mTCP can continue sending packets on other paths, and the recovery process normally takes only a few seconds. Because mTCP could obtain an unfair share of bandwidth under shared congestion, we integrate a shared congestion detection mechanism into our system. It allows us to dynamically detect and suppress paths with shared congestion so as to alleviate the aggressiveness problem. mTCP can also passively monitor the performance of several paths in parallel and discover better paths than the path provided by the underlying routing infrastructure. We also propose a heuristic to find disjoint paths between pairs of nodes using traceroute. We have implemented our system on top of overlay networks and evaluated it in both PlanetLab and Emulab.
Recent work on Internet measurement and overlay networks has shown that redundant paths are common between pairs of hosts . One can often achieve better end-to-end performance by adaptively choosing an alternate path other than the direct Internet path . At the same time, stub networks are increasingly resorting to multihoming to improve the reliability of their network connectivity . The reliability is usually achieved by having sufficiently disjoint paths to the destinations of interest. Moreover, with the rapid growth of wireless coverage, mobile users can often have access to multiple communication channels simultaneously [17,21]. All of the above means redundant paths are quite common between pairs of hosts.
Our goal is to design an end-to-end transport layer protocol (mTCP) which can not only aggregate the bandwidth of several paths concurrently but also enhance the robustness under path failures by taking advantage of redundant paths. Compared with the conventional single-path TCP flows, mTCP stripes a flow's packets across several paths. It can be viewed as a group of single-path subflows, with each subflow going through a separate path. A number of challenges arise when one attempts to develop a transport layer protocol that utilizes multiple paths simultaneously.
First, the traditional congestion control mechanism in TCP needs to be modified in order to fully exploit the aggregate bandwidth of the multiple paths used by mTCP. If one were to use the traditional congestion control mechanism on the whole flow (wherein the congestion window for the entire flow is halved for every packet loss), the mTCP connection would remain under-utilized in the following scenario. Assume that one of the sub-flows is using a heavily congested path. Severe packet losses on that path will result in repeated shrinkage of the window, resulting in sub-optimal use of the other sub-flows that remain uncongested. mTCP therefore needs to perform congestion control on each subflow independently so as to minimize the negative influence of congested subflows.
Second, paths may fail during data transmission. mTCP should not stall as long as there exists one operational path. It should be able to quickly detect failed paths and continue sending or retransmitting packets on live paths.
Third, when subflows of a mTCP flow share congested links, the whole flow can obtain an unfairly larger share of bandwidth than other single-path TCP flows since each subflow behaves as a single-path TCP flow. To alleviate this aggressiveness problem, we integrate a shared congestion detection mechanism into our system so as to identify and suppress subflows that traverse the same set of congested links.
Finally, there might exist many alternate paths between a pair of source and destination nodes. We want to select a small number of candidate paths for mTCP flows since it is impractical to use all the paths simultaneously. We use a heuristic to identify and select disjoint paths using traceroute. This can minimize the possibility of shared congestion and concurrent path failures.
We do note that mTCP provides performance benefits only when there are multiple paths that do not share congested physical links. Although in today's Internet, many congested or bottleneck links lie at the edge of the network and limit the performance benefits of mTCP, this is likely to change with the growing popularity of high speed Internet access. In , Akella, Seshan and Shaikh measured a diverse set of paths traversing Tier-1, Tier-2, Tier-3, and Tier-4 ISPs. They discovered that about 50% of the paths have bottleneck links that are located within ISPs or between neighboring ISPs. The available capacity of those bottleneck links are less than 50Mbps, well below the 100Mbps Ethernet speed that could be attained within local area networks. In such situations, mTCP is likely to find multiple paths that are not constrained by the bandwidths attainable at the edge of the network. Other scenarios where mTCP could find an useful set of disjoint paths to transmit data include multi-homed clients .
We also note that our focus is on improving the performance and robustness for large data transfers. Although most flows on the Internet are small, most of the traffic on the Internet is contributed by a small percentage of big flows [34,13]. Therefore, improving the performance of such big flows is important. Additionally, while small flows might not necessarily benefit from throughput aggregation, they might still benefit from mTCP's robustness that would enable them to quickly detect and recover from path failures.
To the best of our knowledge, we are the first to implement and evaluate a transport layer protocol that can utilize redundant paths concurrently in real systems. We try to provide a comprehensive design that addresses the inter-related issues of sub-flow congestion control, unfair use of congested links, path selection, and recovery from path failures. We believe that it is beneficial to tackle all of these issues in a single tightly-coupled system. For instance, suboptimal decisions from the path selection mechanism could be corrected by a mechanism that detects the use of shared congested links. Alternately, shared congestions could be detected easily by monitoring TCP events (such as fast retransmits) without requiring separate probe messages. Furthermore, the system could quickly recover from failures by maintaining and transmitting along multiple paths. Finally, a mTCP flow can passively monitor the performance of several paths in parallel and estimate their available bandwidths. The bandwidth estimates are typically more accurate than the estimates provided by the underlying overlay routing mechanisms. This in turn can help select better paths.
The rest of the paper is organized as follows: Section 2 will describe related work. Section 3 will discuss the specific design problems of mTCP in detail. Section 4 briefly describes the implementation of our system. Section 5 demonstrates the results from experiments conducted on PlanetLab  and Emulab . Finally, Section 6 concludes.
The general idea of using multiple paths in a network to obtain better performance has been explored in a number of different research efforts. We briefly discuss how our work relates to previous research in this area.
One area of related work is the use of striping  or inverse-multiplexing in link-layer protocols to enhance the throughput by aggregating the bandwidth of different links. Adiseshu et al , Duncanson et al  and Snoeren  provide link-striping algorithms that address the issues of load-balancing over multiple paths and preserving in-order delivery of packets to the receiver. These efforts propose transparent use of link-level striping without requiring any changes to the upper layers of the protocol stack.
Another area of related work is the use of multiple paths by transport protocols to enhance reliability [10,24,20,9]. Banerjea  proposed the use of redundant paths in his dispersity routing scheme to improve reliable packet delivery for real-time applications. Nguyen and Zakhor  also propose the use of multiple paths to reduce packet losses for delay-sensitive applications. They employ UDP streams to route data whose redundancy is enhanced through forward error correction techniques.
The most directly relevant related work is the use of multiple paths for improving the throughput or robustness of end-to-end connections. Several application-layer approaches have been proposed to improve throughput by opening multiple TCP sockets concurrently [7,14,19,28], but the multiple TCP connections utilize the same physical path. These approaches obtain an unfair share of the throughput of congested links and seem to primarily benefit from increased window sizes over long-latency connections. SCTP  is a reliable transport protocol which supports multiple streams across different paths. However, it does not provide strict ordering across all the streams, and it cannot utilize the aggregate bandwidth on multiple paths as we do. The systems that are closest to what is described in this paper is R-MTP  and pTCP . R-MTP provides bandwidth aggregation by striping packets across multiple paths based on bandwidth estimation. It estimates the available bandwidth by periodically probing the paths. As a result, its performance greatly relies on the accuracy of the estimation and the probing rate. It could suffer from bandwidth fluctuation as shown in . pTCP uses multiple paths to transmit TCP streams and describes mechanisms for striping packets across the different paths. They however assume the existence of a separate mechanism that identifies what paths to use for their pTCP connections, and they also do not address the issues of recovering from path failures or obtaining an unfair share of the throughput of congested links if the paths are not disjoint. Their study is also limited to simulations using ns.
The design of our system seeks to satisfy three goals. First, given several paths, mTCP should be able to make full use of the available bandwidth on those paths. Second, when mTCP uses paths with shared congested links, it should be able to alleviate the aggressiveness problem by suppressing some of the paths. Third, when some paths fail, mTCP should quickly detect and recover from the failures.
mTCP provides the same semantics to applications as TCP. It preserves properties such as reliability and congestion control. Because mTCP uses several paths in parallel, it has to decide how to stripe packets across the paths and how to manage congestion control for each subflow.
In mTCP, all subflows share the same send/receive buffer. Packets are assigned sequence numbers in the same way as in TCP. But it does congestion control independently on each subflow. Each subflow maintains a congestion window as in TCP. The congestion window changes independently as the subflow adapts to the network state. When there are no packet losses in the subflow, it linearly increases. Upon detecting packet losses, it is halved. When timeout occurs, it is reset to one and the subflow enters slow-start.
mTCP strives to keep all subflows independent from each other. Suppose we had used only one global congestion window for the entire flow that limits the total amount of outstanding data across all subflows. The packet losses on any one of the paths will cause the global congestion window to be halved. If one subflow happens to traverse a heavily congested path, it can keep the global congestion window small, and the other subflows will not be able to utilize the available bandwidth on other good paths. In fact, this can sometimes decrease the throughput of the whole flow to be even lower than that of a single-path TCP flow on a single good path. This phenomenon was also studied in .
TCP uses (sndnxt - snduna) to estimate the number of outstanding packets in the network. (For convenience, we assume packets are of the same size and use packets instead of bytes for discussion.) Here sndnxt is the next packet to be sent and snduna is the next packet for which an ACK is expected. The difference should be no more than the congestion window ( cwnd). In mTCP, since packets are striped across different paths, we need to keep track of how many outstanding packets are in each path to ensure that the number does not exceed the cwnd of that path.
Our mTCP is based on TCP SACK , which is an extension of TCP Reno. In Reno, the receiver only reports the greatest packet number that arrives in-order. But in mTCP, different paths have different latencies. Many packets can arrive at the receiver out-of-order. We want to accurately know which packets have been received, irrespective of whether or not they arrived in-order. This helps us to compute the number of outstanding packets on each path, which is essential for performing congestion control separately on each subflow. In SACK, the sender maintains a scoreboard data structure to keep track of which packets have or have not been received. An acknowledgement (ACK) packet may carry several SACK blocks, where each SACK block reports a non-contiguous set of packets that has been received. The first SACK block reports the most recently received packet and additional SACK blocks repeat the most recently reported SACK blocks. The SACK blocks allows the sender to identify what packets have been newly received irrespective of whether or not the data packets arrive in-order.
We augment the scoreboard data structure so that it records the path over which each packet is transmitted or retransmitted. For each , we maintain a to represent the number of outstanding packets on . is incremented by 1 when the sender either sends or retransmits a packet over . It is decremented when an incoming ACK indicates that a packet previously sent on has been received. New packets are allowed to be sent over only when . Retransmitted packets require special handling. Suppose the original packet is sent over and the retransmitted packet is sent over . When the retransmitted packet is ACKed, the sender decrements both and by 1, because it represents two packets having left the network: the original one on , which is assumed to be lost, and the retransmitted one on , which has been received. We want to emphasize that the original and retransmitted packets do not have to be sent over the same path. We will discuss this in more detail in Section 3.1.4. Finally, if times-out, will be reset to 0.
Since mTCP sends packets along several paths with different latencies, packets can arrive at the receiver out-of-order. This can cause duplicate acknowledgement packets ( dupack), which will trigger fast retransmits. These fast retransmits are caused by packet reorderings and not by packet losses, therefore we want to avoid them.Although packets sent through different paths can be received out-of-order, packets within each subflow will still mostly arrive in-order. Each therefore maintains the following path-specific state: , the number of dupack along that path and , the next packet requiring an ACK. If an incoming ACK indicates the receipt of a packet sent through and if that packet is , this packet is considered to be in-order within that subflow. If that packet is greater than , is incremented by 1. When reachs , will enter fast retransmit and fast recovery.
mTCP separates the decisions of when to send a packet, which packet to send, and which path to use to send the packet. The sender is allowed to send a new packet when there exists at least one satisfying . The packet to send is usually determined by , which represents the next packet to send as in TCP. But if there is a with packets to retransmit, i.e. is in fast recovery, the sender has to retransmit those packets inferred to be lost before sending any new data packets. Once again the scoreboard is consulted to determine whether there are any such packets that need to be retransmitted. Otherwise, a new data packet referenced by will be sent.
Next, the sender needs to decide the path over which the packet will be sent. There may be several candidate paths. We associate a with each . We choose the path with the minimum score. This form of proportional scheduling results in a fair striping of packets and avoids sending a burst of packets on one path.
Because mTCP separates the decisions about when to send, which to send and which path to use for sending, it has more flexibility in striping packets. By postponing the decision about which path to use until just before sending out the packet, it can quickly adapt to dynamic variations in path characteristics. If a path encounters congestion or fails, its will be reduced. The mTCP flow does not have to wait for the re-opening of the on that path to retransmit the outstanding packets. It can retransmit those outstanding packets on other paths. We want to emphasize that, unlike the re-striping scheme used in pTCP , our scheme will not retransmit packets that have already been received, because we can precisely infer missing packets from the scoreboard data structure. In pTCP, such re-striping overhead becomes more significant when fast retransmit occurs more frequently.
In our design, despite the fact that data packets are striped over several paths, all ACKs return over the same path. There are two advantages of using one path for ACKs. First, it preserves the ACK ordering for all the subflows. If ACKs return from different paths, this may introduce ACK reorderings even within one subflow, which will increase the burstiness of the sender. Second, striping ACKs across different paths makes our system more complicated. Receiver has to maintain additional states about which ACKs going through which paths. We try to keep the receiver side as simple as possible, following the design principle of TCP.
The advantage of using multiple reverse path is that it is more robust to congestion or failure. Note that ACKs normally are small and are less likely to cause congestion as data packets do. In addition to this, we try to avoid congestion on the reverse path by selecting the best path among all the candidate paths with the help of underlying overlay router. This will be discussed in Section 3.3. To recover from path failures using single reverse path is a little more complicated. But if we stripe ACKs across multiple paths, we have to constantly suffer from ACK ordering problem although it is more robust to failures. Since failures generally will not occur very often, by using single reverse path, we sacrifice a little performance during failure recovery but obtain better performance in most time during data transmission. In Section 3.5.2 of our paper, we show that we can still quickly recover from path failures even if we use single reverse path.
We could have avoided much complexity by opening separate TCP sockets for each path and then striping packets at the application layer . But this approach has some limitations which can prevent us from efficiently aggregating the bandwidth over multiple paths, To fully utilize the bandwidth on multiple paths, we must ensure the amount of data striped over a paths is proportional to the bandwidth of that path. This can be very difficult to achieve in current Internet where the conditions of paths are quite dynamic. Modifying TCP directly gives us more flexibility on striping data streams across multiple paths. We can decide, for each packet, an appropriate path the packet should traverse and this decision is made just before the packet is sent out. This is especially useful for migrating the outstanding packets on bad paths to good paths when the quality of paths changes dynamically or during path failures. Striping at the application layer cannot adapt to changes in path quality responsively. The pTCP study  has shown in simulations that such scheme cannot fully utilize multiple paths when the number of paths exceeds two.
When mTCP uses paths that are not completely disjoint and if some of the shared physical links are congested, the whole mTCP flow will obtain more bandwidth than other single-path TCP flows along those congested links, since each of the subflows behaves as a TCP flow. mTCP tries to alleviate the aggressiveness problem by detecting shared congestion among its subflows and suppressing some of them. Previous work [26,15,18,33] on shared congestion detection is based on the observation that if two single-path flows share congestion, packets from two flows traversing a congested link at about the same time are likely to be either dropped or delayed. Rubenstein et al.  actively inject probing packets through the two paths to compute the correlation of packet losses or packet delays and thereby identify shared congestions.
Certainly, we can directly use one of the above approaches in our system since shared congestion detection is quite independent from other parts of the system. We however take a simpler approach based on the following observations. mTCP transmits a steady stream of packets through different paths. In this setting, there is no need to send probing packets. Instead, one can passively monitor the subflows by studying the behavior of the data packets.Furthermore, since individual packet drops will result in fast retransmits along the corresponding subflows, the sender can detect shared congestions by examining the correlations between the fast retransmit times of the subflows. Since data packets also double as probe packets and since there are a large number of data packets transmitted through a subflow, our passive monitoring strategy requires little overhead and generates a continuous stream of information resulting in fast detection of shared congestion.
Let us focus on detecting shared congestion between a pair of subflows. For more than two subflows, we need to detect shared congestion between every pair of them. For abbreviation, if two subflows or paths share congestion, we say that they are correlated, otherwise, they are independent. We first assume that two paths have the same latency so that we do not have to worry about the time synchronization problem between them. Later, we will extend our algorithm so that it can deal with paths with different latencies.
Each time that a subflow enters fast retransmit, the sender records a timestamp in the subflow's list of fast retransmit events. After some time, we have two lists of timestamps, and , from two flows: and . Each timestamp represents a fast retransmit event. Then we try to match a timestamp in with in . If , we call a match. Intuitively, a match means the two subflows enter fast retransmit around the same time. This also means packets from the two flows are dropped at about the same time, so it is likely they share the same congested link. We define to be the maximum number of pairs , such that matches . Please note that each cannot be matched with multiple . Finally, two subflows are considered to be correlated if:
is intended to identify what fraction of fast retransmits occur at about the same time in the two subflows. Since some of the fast retransmits are due to congestion on disjoint links, reflects the level of shared congestion. We consider two subflows to be correlated when is greater than some threshold .
Our method uses fast retransmits instead of individual packet losses to infer shared congestion. This is because when a data flow encounters congestion, there normally will be a burst of packet losses. All these losses are caused by one congestion period at some link. Therefore, the congestion period corresponds more directly to a fast retransmit other than any individual packet loss. We would like to declare to be a match only when packets from two subflows are dropped at one link during the same congestion period. So cannot be too small, otherwise even if and occur in the same congestion period, the system will not detect the match. On the other hand, cannot be too large, otherwise the system would consider to be a match even when they are not due to shared congestion. Although the shared congestion detection may not work well under active queue management schemes, most routers on today's Internet use drop-tail queues, which lead to periods of bursty losses during congestion. In , the authors find that 95% of the duration of bursty losses are less than 220ms. So should be on that time scale. We will study how to choose and in more detail in Section 5.4.
We need to emphasize that our goal is to suppress correlated subflows in order to alleviate the aggressiveness problem. We need to detect shared congestion as quickly as possible. Other efforts focus more on the accuracy of shared congestion detection, and they may take several hundred seconds to reach a decision. This does not work well for our purpose, because a mTCP flow could have ended before shared congestion is detected.
Our algorithm works as follows. After some number of fast retransmit events have been observed, we will check for shared congestion between the two subflows. If there is shared congestion, we can suppress one of them. Otherwise, we will wait until the occurrence of the next fast retransmit to check for shared congestion again. The question we now address is determining the number of fast retransmit events that we need to observe before we start checking for shared congestion.
We use a heuristic to estimate the probability of two fast retransmit events from two independent flows accidentally occurring within a small period of time. Suppose the fast retransmit events of two subflows, S and T, are completely independent when two subflows are independent, we compute the average interval of two consecutive fast retransmit events in S: , where is the current time when shared congestion detection is invoked. is computed in a similar way. Then we define . Suppose , we have , and . For each , if there exists a match , must be in the . Because we assume and are independent events, the probability that matches some is roughly . So the total expected number of matches is roughly . Because , we will misinterpret S and T to share congestion if . According to Chernoff bound :
where . So we need to wait for fast retransmit events to ensure that the probability of a false positive is less than . We will see in Section 5.4 the convergence time is mostly within 15 seconds in our Emulab and PlanetLab experiments. We want to emphasize that even if false positive does occur, it will only degrade a mTCP flow into a single-path flow.
This heuristic might encounter problems when . Because is small (200ms in our experiments), this can only occur when a path is so heavily congested that fast retransmit happens almost every 400ms. The mTCP flows will try to suppress such paths, because using them will not bring much benefit. This is discussed in Section 3.4.1.
Finally, when two paths have different latencies, there is a time-lag, , between them. We estimate by shifting one sequence, say , by in time and calculating on sequences (, , ..., ) and (, , ..., ) as described before. Because the between two paths can be at most one ( is the larger round trip time of the two paths), we go through all possible value in incrementally using some fundamental step , then choose that maximizes as . This is similar to calculate the correlation between two signals.
In the previous sections, we assumed that flows have a number of candidate paths. Now we describe how they obtain such information. We use Resilient Overlay Networks (RON)  as our underlying routing layer. RON is an application-layer overlay.When mTCP starts, it queries RON to obtain multiple paths between a source-destination pair. For each pair, RON provides the direct Internet path and alternate single-hop indirect paths through other RON nodes. With a RON of nodes, there are totally paths between each pair. RON uses a score to represent the quality of each path based on latency, loss rate or throughput. RON can effectively bybass performance failure or path faults by using an alternate path with higher score. In the following, we only use the throughput score.
Since can be large (greater than 10 in our experiments), mTCP will only select at most (5 in our experiments) paths from them. A single-path flow will normally select the path with the best score, which we call the RON path. mTCP could select the best paths. But this simple strategy may select paths with many overlapping physical links. This leads to two disadvantages: First, paths are more likely to fail simultaneously, which is bad for the robustness. Second, paths are more likely to share congestion, which is bad for performance. To avoid these problems, we want to select sufficiently disjoint paths.
We use a heuristic based on traceroute to estimate the disjointness of paths. Using traceroute, we can obtain the IPs of the routers along a path and the latency of each physical link. Due to IP aliases, the same router might have different IPs in different paths. We use ``Ally'', a tool from Rocketfuel , to resolve IP aliases and assign a unique IP to each router. Although some routers may not respond to traceroute probes and the alias resolution may not be completely accurate, we only use the traceroute information as a hint to estimate path disjointness and eliminate many of the significantly overlapping paths. We also rely on the techniques described in Section 3.2 to further detect shared congestion.
After alias resolution, suppose we have the IPs of two paths and . Let be the set of overlapping links of and , we define the overlapping between and as: . An alternative is to use the size of to quantify the degree of overlap. We use latency instead because we hope to distinguish among different types of link. Most nodes on PlanetLab are connected through ethernet links to backbones. Those ethernet links usually have smaller latency than backbone links. Because the sharing of the local ethernet links are almost unavoidable, we focus on finding disjoint paths that traverse different backbone links. By using link latencies, will be mostly determined by the shared backbone links instead of ethernet links. This argument might not be true if nodes are connected through modem or wireless links that have high latency. Using traceroute to find disjoint paths is only suitable for small-scale overlay networks. As the number of nodes increases, we need a more scalable way to discover disjoint paths. In , Nakao, Peterson and Bavier propose to use BGP information to find disjoint Autonomous System (AS) paths, which incur little cost. Athough disjoint AS paths are not as fine-grained as disjoint router-level paths, it would greatly simplify disjoint path search by providing a small set of promising candidate paths which we can further verify using traceroute.
Finally, we estimate the disjointness of and by:
We say that and are disjoint if . Using the disjointness metric between each pair of paths, we select at most paths from paths using a greedy algorithm as follows: (1) Initialize the set of selected paths to be empty. (2) Pick the path with the highest score from the set of paths and check if it is disjoint from all the previously selected paths. (3) If so, select this path, otherwise pick the path with the next highest score and repeat step (2) until we find paths or we have tried all paths. The first selected forward path and the reverse path will always be the RON path, which is optimized for throughput in RON.
Figure 1 plots the cumulative distribution function (CDF) of the number of disjoint paths between 630 node pairs based on traceroute among 36 PlanetLab nodes that are used in our experiments. When decreases, the number of disjoint paths between node-pairs increases. We want a such that there are sufficient number of disjoint paths which we can choose from while eliminating most significantly overlapping paths. When (the value used in our experiments), 90% of node-pairs have more than 4 disjoint paths but less than 16 disjoint paths. If we use a larger , many node pairs will not have enough candidate disjoint paths.
In mTCP, a subflow on may be suppressed because of one of the three reasons: First, shares congestion with another subflow and its throughput is lower than . This is because we want mTCP not to be too aggressive to other single-path TCP flows. Second, suppose has the highest throughput among all the subflows and . This is because is too poor and using it does not bring much benefit. Third, fails.
We define a family of mTCP flows, called flows. An flow will try to use at least () paths, which means we will not suppress any path because of shared congestion when the number of paths being used is less than or equal to . For example, this avoids all paths getting suppressed in flows. The value of is a tradeoff between performance/robustness and friendliness. With a larger , mTCP can obtain more bandwidth because it uses more paths. And it is more reliable because the probability that paths fail simultaneously normally gets smaller as increases. But it can be more aggressive to single-path flows under shared congestion. The aggressiveness problem can be alleviated by suppressing some subflows. But when there are only subflows, no subflow would be suppressed. The actual value of should be decided by the application. Applications that want higher performance and more reliability should choose a larger . Applications that care more about friendliness should choose a smaller . In our experiments, we choose to demonstrate how much performance improvement mTCP can obtain without being too aggressive to other TCP flows.
An flow can dynamically add new paths because of two reasons: First, some paths that are not being used become better than those paths being used. Second, it is using less than paths because some paths are too bad or have failed. mTCP will periodically update the information about all the paths by querying RON. If an unused path has much higher a score than a path being used, it can start using the new path. Then it runs the path suppression algorithm on all the paths to suppress any possible paths with shared congestion. By doing this, mTCP can gradually replace bad paths with good ones. This is especially useful for long-lived flows.
mTCP may encounter path failures during transmission. If all the paths fail simultaneously, we call it a fatal path failure, otherwise we call it a partial path failure. We will focus on partial failures in this section. To recover from fatal failures, mTCP rely on the routing layer to establish new paths just like single-path flows.
When a path fails, the data packets sent over it will no longer be acknowledged (ACKed) because the packets have been dropped. We maintain one failure detection timer, , for each . When a data packet sent over is ACKed, is reset. is considered to have failed when expires.
We need to decide a timeout value for . On one hand, we want a small so that failures can be detected quickly. On the other hand, cannot be too small, otherwise it may misinterpret good path to have failed. The retransmission timeout () provides a good base for computing . First, during timeout, the sender will go into idling and no packets will be ACKed in that period. So should be at least greater than . Second, several consecutive timeouts means either the path has failed or it is heavily congested. In either case, we would like to abandon . So we choose . Here reflects how many consecutive retransmission timeouts mTCP is willing to tolerate before it consider a path to have failed. In our experiments, we choose , because we have observed that consecutive retransmission timeouts rarely occur on good paths. We should emphasize that even if a good path is misinterpreted as a failed one, it will only degrade the performance of mTCP to that of a single-path flow in the worst case. The path addition technique described in Section 3.4.2 allows us to reclaim a path if it had been previously misinterpreted to be a failed path.
We now describe how to recover from failure after expires. Since all ACKs return over the same path, we call that path a primary path. The other paths are auxiliary paths. We need to distinguish between primary and auxiliary path failures. When an auxiliary fails, the sender will mark as failed and retransmit the outstanding packets of over other paths. When a primary path fails, the situation is more complicated. Because all the ACKs are lost, it may appear to the sender that all paths have failed. To deal with this problem, sender records the time when is detected to have failed. Suppose at time , the primary is also detected to have failed and let the timeout of be . We know that must have failed at some point between and . For an auxiliary , if , its failure is possibly due to the failure of . In this case, we will change the status of to be active and the status of to be failed. After doing this for all the paths, the sender starts to send packets over all active paths. During this period, these data packets serve as ``probing'' packets that solicit ACKs from the receiver. All timers are stopped to prevent any auxiliary path from being misinterpreted as failed due to the lack of an active primary path during this period. The receiver will also detect the primary path failure because it no longer receives any data packets over that path. Then it elects a new primary path and sends ACKs along that path in response to those ``probing'' packets from sender. It chooses the best path (based on the path score in RON) among all the active paths to be the new primary path. Later, when the sender receives the ACKs and knows that a new primary path has been elected, it restarts all the timers and proceeds as normal.
Typically, is one second, therefore the is two seconds. The total detection and recovery time will be between two and three seconds in most cases. The interruption due to partial path failures will be fairly short. Furthermore, partial path failure does not cause mTCP to stall since packets will continue to be transmitted through living paths. Since mTCP uses several paths concurrently and since it typically employs disjoint paths, the probability of fatal path failures is much lower than that of single-path failure. mTCP is therefore more robust than single-path flows.
Our system is implemented at the user-level and is composed of a Portable User-Level TCP/IP stack (PULTI) and an overlay router/forwarder modified from RON. RON is an application-layer overlay on top of the Internet. PULTI and RON run in two separate processes and we change RON so that it can communicate with PULTI using UDP sockets and export the multiple paths between a source-destination pair. The whole system does not require any root privilege, which can easily be deployed on shared distributed platforms such as PlanetLab. Currently, it runs on Linux, NetBSD and FreeBSD. Note that although our system in build on top of overlay networks, it will work in a non-overlay setting as well. The only requirement of our system is there exists an underlying routing system that can provide multiple paths between a source-destination pair and let us select paths for sending packets. Overlay network is just one of such systems. Our system will also work in other Internet settings, such as multi-homed hosts.
PULTI is a full user-level TCP/IP stack based on FreeBSD 4.6.2. We extract the network-related code from the kernel source and wrap it with some basic kernel environment support, such as timing, timer, synchronization and memory allocation. We do not modify any network-related code. Because FreeBSD 4.6 does not support SACK, we also add SACK-related code in PULTI which is required by our system. OS dependent information is hidden by device drivers. With different device drivers, PULTI can send or receive through UDP socket, IP_QUEUE in Linux or divert socket in FreeBSD. PULTI provides standard socket interface and supports multiple applications through multithreading. It can query RON to learn about multiple paths between a source-destination pair. The mTCP code only affects a few files in PULTI. It can be easily moved into FreeBSD kernel.
In this section, we validate our protocol in both emulation and real-world deployment. The emulations are run on Emulab , which is a time- and space-shared network emulator. Emulab consists of several hundred PCs, which can be configured to emulate different network scenarios. Users can specify parameters such as packet loss rate, latency, and bandwidth. While an experiment is running, the experiment gets exclusive use of the assigned machines. While Emulab provides a controlled environment for our experiments, we further conduct experiments on PlanetLab, a wide-area distributed testbed for running large-scale network services . The experiments on the PlanetLab allow us to study our protocol for Internet settings, where latency, bandwidth and background traffic are more realistic and unpredictable.
In this experiment, we study whether mTCP can obtain the total available bandwidth over multiple independent paths. We use the topology in Figure 2 on Emulab. Because each PC in Emulab has four Ethernet cards, each node can have at most four links. There are six endhosts () and ten routers (). RON is running on the six endhosts to construct an overlay network. All routers have drop-tail queues. The source and destination nodes are and respectively. Each of the remaining endhosts provides an alternate path. For example, we can use to construct an alternate path . So the topology contains five independent paths, which include one direct path and four alternate paths. We use the direct path as the reverse path for ACKs. The capacity of all the paths is 16Mbps and their RTTs vary from 52-147ms. The figure annotates each link with its corresponding bandwidth and latency. The arrows represent background flows. We use Iperf  to generate 25 TCP and 25 1Mbps UDP flows as background traffic, with 5 TCP and 5 UDP flows on each path. Each experiment runs for 40 seconds and the results are obtained by averaging three runs.
Figure 3 shows the results when the number of paths used by mTCP increases from 1 to 5. In this figure, ``combined'' represents mTCP flows with congestion control performed on the entire flow, ``separate'' represents regular mTCP flows with congestion control performed separately on each subflow, and ``NxSingle-path flow'' is the throughput of a single-path flow on one path multiplied by the number of paths. Because each path has the same available bandwidth, ``NxSingle-path'' throughput represents the ideal throughput of a mTCP flow. The results verify that mTCP can effectively aggregate the available bandwidth on multiple independent paths. The results also show that higher throughput can be achieved only when congestion control is performed for each subflow separately.
We conduct similar experiments on PlanetLab. We use one node in Princeton and one node in Berkeley as source and destination nodes. As shown in Table 1, the four Intel nodes serve as intermediate nodes for the alternate paths. We only use the four alternate paths in this experiment, because they do not share any congestion links. To verify this, we examined the traceroute data to find that any pair of the alternate paths only share the initial and final hops, which are unavoidable. The capacity of these two links are 100Mbps, which is far greater than the total throughput of the single-path TCP flows on these four paths. Therefore, we conclude that the initial and final hops are not congested and the four alternate paths are independent.
Each experiment measures the throughput of flows lasting for 60 seconds. The average throughput of three runs is reported. For convenience, we use to denote the throughput of a single-path flow on . Similarly, denotes the throughput of a mTCP flow using and . In Figure 4, (i,j) on the x-axis means and are used in that experiment. We first run single-path flows on and respectively, then run a mTCP flow on both paths simultaneously. The corresponding column compares the percentage that the throughput of an individual flow, , or , contributes to the total throughput of these flows. Ideally, we expect , so the percentage of should be around 50%. With the exception of the experiment involving and , which suffered from unexpected bandwidth variations, the rest of the experiments indeed provide the expected throughputs. The last column in Figure 4 shows the result of the experiment using , , and . Again, the net throughput of is close to the sum of , and . We have conducted experiments between different source-destination pairs on PlanetLab. The results are similar. We omit them due to space constraints.
Now we will study whether mTCP can quickly recover from partial path failures using experiments on PlanetLab. Because path failures on the Internet are unpredictable, we intentionally introduce failures by killing the appropriate RON agent. The source and destination nodes are still the Princeton and Berkeley nodes. The paths are shown in Table 2.
The two graphs in Figure 5 show how the congestion window () of the primary and auxiliary paths changes over time. As shown in the first graph in Figure 5, the primary path fails at about 20s. It is quickly detected so that the of the subflow on this path is reduced to 0. At the same time, the of the subflow on the auxiliary path also decreases to 0, because the auxiliary path was misinterpreted to have failed (as explained in Section 3.5.2). But a few seconds later, the subflow on the auxiliary path recovers from this false decision by restoring its to the previous value with slow start. Finally the auxiliary path becomes the new primary path and the whole flow proceeds using only one path. The behavior of mTCP during auxiliary path failures is similar, and we omit the corresponding results.
The total recovery time of mTCP during partial path failures is only about 3s, which is negligible for most applications. In contrast, a TCP flow will completely stall when its path fails, and it typically takes about 18s for RON to establish a new path. RON is optimized for quickly recovering from path failures. On wide area network that uses BGP to detect failures, recovery could take several minutes. Hence, mTCP is more responsive and robust than single-path flows.
In this section, we will evaluate shared congestion detection. We first use experiments on Emulab to study the behavior of our algorithm with different parameters in a controlled environment. Then we further validate it using experiments on PlanetLab. The topologies for the Emulab experiments are shown in Figures 6 and 7. Between the source node and the destination node , there is one direct path and one alternate path through the intermediate node .
In Figure 6, The two paths only share the initial and final hops with link capacities of 100Mbps. We generate 12 TCP flows and 18 1Mbps UDP flows as background traffic, with 2 TCP flows and 3 UDP flows on each link between each pair of neighboring routers. With this scheme, we ensure that congestion only occurs on the links between pairs of routers and not on the links between endhosts and routers. As a result, the two paths and are independent.
In Figure 7, The two paths share the four links between and . We generate 8 TCP flows and 8 1Mbps UDP flows as background traffic, with 2 TCP and 2 UDP flows on each of the four shared links. By doing this, we ensure that congestion only occurs on the four shared links. As a result, paths and share congested links.
We run mTCP flows for 300s using the two paths in Figure 7. The results in Figure 8 compare the estimated of shared congestion with different values of 5ms, 10ms, 25ms, 50ms, 100ms, 200ms, and 400ms. Each data point represents the average of five runs. As increases from 5 to 100ms, increases quickly from 0.4 to 0.8 as expected. When increases beyond 100ms, only increases slightly. When is 400ms, reaches 0.96. The ideal is 1 because the two paths share all the congestion.
We next run mTCP flows for 300s on the two paths in Figure 6. The results are shown in Figure 9, which plots over time for different values of 5ms, 25ms, 100ms, 200ms and 400ms. As explained in Section 3.2, a smaller will lead to smaller estimated values of . At the end of the experiments, drops quickly from 0.39 to 0.28 when decreases from 400 to 200ms. When deceases further, drops more slowly until it reachs 0.14 when . The ideal is 0 because the 2 paths are independent. We also notice that the curve for a smaller value decreases faster than that for a larger .
According to the above experiment results, an value between 100 and 200ms seems to balance the goal of minimizing both false negatives and false positives. Consequently, the threshhold should fall between 0.3 and 0.8. If it is less than 0.3, it is very likely to cause false positives when the is 200ms. If it is greater than 0.8, it can easily cause false negatives when the is 100ms. By setting and , we successfully detect shared congestion between the two paths in all five runs for the topology in Figure 7. For the topology in Figure 6, no shared congestion is detected and the two paths are determined to be independent as expected.
Next, we go on to evaluate the shared congestion detection on PlanetLab. As explained in Section 3.2, by setting to be no less than the congestion period during which bursty losses occur, we can avoid false negatives. In , the authors find that 95% of the duration of bursty losses on the Internet are very short-lived (less than 220ms). By choosing an around that value, we should be able to avoid most false negatives. At the same time, the average time between consecutive fast retransmits is mostly on the order of several seconds or more, much greater than 220ms. (Otherwise, mTCP will suppress such path because the path is too lossy.) Therefore, this value will also allow us to avoid most false positives, as long as we wait for enough number of fast retransmits. In the following experiments, we report the results using and .
We first need to choose paths such that we can be reasonably sure as to whether they share congestion or not. Then, we can compare the measured results with the expected results. We conduct two sets of experiments. The mTCP flow is running on a pair of paths for 60 seconds in each experiment. As explained in Section 3.2, the probability of false positive decreases very fast as the number of fast retransmit increases. We find that a 60 second period is long enough for our algorithm to converge. Each experiment is repeated three times. We use the Princeton and Berkeley nodes as source and destination in all experiments, but we choose different pairs of paths in different sets of experiments.
In the first set of experiments, we use the four alternate paths in Table 1, where we know that all these paths are independent. The results are in Table 3. The first column shows the pairs of paths used by the mTCP flows. The remaining three columns show the results. A No means two paths are independent, a Yes means they share congestion, and N/A means one of the subflows is suppressed because its throughput is much lower than the other subflow before the end of the experiment. All the results in Table 1 conform to our expectation except the one false positive for using and . As explained before, a false positive will only degrade the performance of the mTCP flow to that of a single-path flow.
The second set of experiments use the paths in Table 4. From traceroute, we know the underlying physical links of any pair of these paths are mostly overlapping, so they should share congested links. The results are shown in Table 5. The first column gives the pairs of paths used in the experiments. The following three columns give the time in seconds when shared congestion is detected in each run. The last column gives the average detection time. Shared congestion is correctly detected in all cases.
Unlike other shared congestion detection algorithms, our algorithm seeks to minimize the detection time while maintaining a low false positive rate. In the second set of experiments, shared congestion is correctly detected mostly within 15 seconds. At the same time, such early decisions do not cause too many false positives in the first set of experiments.
In this section, we demonstrate mTCP can be more friendly to other single-path flows by suppressing its subflows that share congestion. We construct the topology of Figure 10 on Emulab. The source and destination nodes are and . There are one direct path and four alternate paths provided by the remaining four endhosts. Their RTTs are from 124ms to 133ms and they share the three links between and . We generate 12 1Mbps UDP flows as background traffic, with 4 UDP flows on each of the three shared links. By doing this, we ensure that all five paths share congestion. Each experiment runs for 300 seconds and the results are obtained by averaging three runs.
In Figure 11, the first five columns give the throughput of the flows when the number of paths being used increases from one to five. The first column is the throughput of single-path TCP flows. Under shared congestion, the mTCP flows become more aggressive as they use more paths. The sixth column shows the throughput of the mTCP flow with path suppression. Although it uses five paths in the beginning, it quickly detects shared congestion and suppresses all but one path. So its throughput is very close to that of a single-path flow and less aggressive than the flow using all five paths without suppression.
In this experiment, we demonstrate that mTCP can effectively aggregate the bandwidth of multiple paths with sufficiently differing characteristic, and path suppression can help avoid the penalty from using bad paths. We use the same topology as in Figure 2. The bandwidth of direct path is still 16Mbps. But the bandwidth of four alternate paths is 1/2, 1/4, 1/8 and 1/1000 of the bandwidth of the direct path. In Figure 12, on the x-axis means the direct path and alternate path with 1/n bandwidth are used in that experiment. We first run a single-path flow on each path respectively, then run a mTCP flow on both paths. The corresponding column compares the percentage that the throughput of an individual flow contributes to the total throughput of these flows. Ideally, the throughput percentage of mTCP flows should be 50%. Figure 12 shows mTCP can efficiently utilize the aggregate bandwidth of two paths even when one path has only 1/8 the bandwidth of the other path. The mTCP flows in the last 2 columns use the direct path and the alternate path with 1/1000 bandwidth. Such a scenario could occur when a path becomes heavily congested or even temporarily fails. Using such bad paths can bring no benefit but impair the performance of the whole flow. Because most packets are lost along that path, it persistently causes timeouts. While packets can still be sent over the other path for some time, the flow will finally stall when the send/receive buffer is exhausted. As explained in Section 3.4.1, mTCP will suppress the paths with too low a throughput to avoid such penalty. (We choose in our experiments.) This is confirmed by the last two columns which represent the throughput of mTCP flows with and without suppression.
We are going to compare three types of flows: single-path flows using direct Internet path (INET), single-path flows using RON path optimized for bandwidth (RON) and flows. flows will use multiple paths when there is no shared congestion. We use flows to demonstrate how much performance improvement mTCP can obtain without being too aggressive to other TCP flows. Table 6 shows the 10 nodes that serve as endhosts in an overlay network for this experiment. (We actually use a total of 24 nodes to form the overlay network, with the remaining 14 nodes only serving the role of packet forwarders.) For each source-destination pair, we transfer data for 40 seconds using each of the three types of flows. Each experiment is repeated three times and we report the average throughput.
The available bandwidth of the paths between the pairs of endhosts can be very high, because nine of them are connected to Internet2. We bypass those pairs with very high available bandwidth on the corresponding direct paths because: First, these paths are between pairs of nodes that exhibit shared congestion/bottleneck at the initial and/or final hops. Second, the bandwidth-delay products of the paths between such pairs of nodes are very large. The maximum send/receive buffer size of our user-level TCP implementation is 1MB and is not large enough to utilize the bandwidth on other alternate paths besides the direct path. We estimate the available bandwidth of a direct path between a source-destination pair by running a TCP flow for 10 seconds. If the measured throughput is less than Mbps, we will use that pair for our experiments. Among the 90 pairs, we got 15 pairs that satisfy the above condition. We want to emphasize that we are not trying to study the popularity of independent paths with distinct points of congestion between node-pairs on the Internet; such topic has been studied by others .Instead, we focus on demonstrating that mTCP can achieve better performance by taking advantage of such redundant paths.
Among the 15 pairs, flows achieve significantly higher throughput in 6 pairs, as shown in Figure 13. They achieve 33% to more than a factor of 60 better performance than single-path flows. We have to mention that flows only try to aggregate the available bandwidth on multiple paths when there is no shared congestion. Other () flows would obtain better performance, but they are potentially more aggressive.
The performance improvement of mTCP does not solely come from bandwidth aggregation on multiple paths, it is also because mTCP can help select better paths than those provided by the routing layer, such as the direct path or the RON path optimized for throughput. RON estimates the available bandwidth of a path using . Here is the packet loss rate and is the round trip time, both of which are obtained by active probing. Although it can help RON distinguish paths with significant performance difference and select better alternate path, this estimate may not be accurate; a path with high score may actually have low available bandwidth . In mTCP, the sender can monitor the performance of several paths in parallel. The throughput of each subflow provides a fairly good estimate of the available bandwidth on that path. This does not require any active probing because the data packets serve as probing packets. This can help mTCP discover and utilize better paths than the suboptimal RON path or direct path. We examined the paths in those 6 pairs and found that flows do take paths different from either the direct paths or the RON paths. The achieved throughput of on those paths are higher than that of direct path or RON path.
Figure 14 shows the results of the remaining 9 pairs. By examining the paths, we find that all flows degrade to single-path flows because of shared congestion, and the RON/INET/ flows all take the same single path for the whole transfer. Hence, the throughput of flows should be comparable to that of INET/RON flows, as shown in Figure 14. In three pairs, flows obtain slightly lower performance than RON/INET flows, this is because different types of flows are run sequentially and there is minor fluctuations in the available bandwidth of a path over time.
In this paper, we present mTCP, a transport layer protocol, for improving end-to-end throughput and robustness. mTCP can efficiently aggregate the available bandwidth on several paths in parallel. To address the aggressiveness of mTCP during shared congestion, we integrate a shared congestion detection mechanism into our system so that correlated subflows can be suppressed. mTCP flows are more robust to path failures than TCP flows, because they will not stall even when some paths fail. The failure detection time is within several seconds. We also propose a heuristic to find disjoint paths based on traceroute. We have implemented our system on top of overlay networks and evaluated it on PlanetLab and Emulab.
This document was generated using the LaTeX2HTML translator Version 99.1 release (March 30, 1999)
The command line arguments were:
The translation was initiated by Ming Zhang on 2004-05-03
This paper was originally published in the
Proceedings of the 2004 USENIX Annual Technical Conference,
June 27-July 2, 2004, Boston, MA, USA
Last changed: 10 June 2004 aw