F10: A Fault-Tolerant Engineered Network

Authors:

Vincent Liu, Daniel Halperin, Arvind Krishnamurthy, and Thomas Anderson, University of Washington
Awarded Best Paper!

Abstract:

The data center network is increasingly a cost, reliability, and performance bottleneck for cloud computing. Although multi-tree topologies can provide scalable bandwidth and traditional routing algorithms can provide eventual fault tolerance, we argue that recovery speed can be dramatically improved through the co-design of the network topology, routing algorithm, and failure detector. We create an engineered network and routing protocol that directly address the failure characteristics observed in data centers. At the core of our proposal is a novel network topology that has many of the same desirable properties as FatTrees, but with much better fault recovery properties. We then create a series of failover protocols that benefit from this topology and are designed to cascade and complement each other. The resulting system, F10, can almost instantaneously reestablish connectivity and load balance, even in the presence of multiple failures. Our results show that following network link and switch failures, F10 has less than 1/7th the packet loss of current schemes. A trace-driven evaluation of MapReduce performance shows that F10’s lower packet loss yields a median application-level speedup of 30%.

BibTeX

@inproceedings {180325,
author = {Vincent Liu and Daniel Halperin and Arvind Krishnamurthy and Thomas Anderson},
title = {F10: A {Fault-Tolerant} Engineered Network},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {399--412},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/liu_vincent},
publisher = {USENIX Association},
month = apr
}

Public Summary:

by George Porter

The rise of "Big Data" processing has been widely reported, from large-scale MapReduce jobs that generate purchasing recommendations to algorithms that run over social networking graphs with hundreds of millions of connections. The rich interconnection structure of these computations has placed an increased focus on networks able to deliver large amounts of bisection bandwidth. Delivering this bandwidth despite increasingly fast link speeds has led to the development of new scale-out network topologies that use multiple, parallel network paths from source to destination, most notably FatTree networks based on folded-Clos topologies. FatTrees rely on multipath topologies not just to increase bisection bandwidth, but also to route around link and switch failures. This latter property is particularly important for the largest networks, which experience frequent failures.

F10 addresses a particular problem in existing scale-out networks, namely slow recovery after a failure. The authors show that while FatTree networks do support rerouting around failures, the number of nodes in the network with paths around any particular failure is surprisingly small. This necessitates longer, more time-consuming rerouting, which results in higher network latency and increased congestion. This is particularly troubling for low-latency networked applications whose performance is gated on the 95th or 99th percentile of latency.

The authors propose F10, which is actually a trio of contributions: a new asymmetric network topology called an AB-FatTree, a new routing protocol, and a low-latency failure detector. F10 demonstrates the benefits that co-designing each of these aspects of the network can bring. The proposed topology improves the network's ability to reroute around failures without adding switches or links, so it adds failure resilience without extra cost or network complexity. F10's routing protocol delivers three types of failure recovery: a very quick local rerouting (in less time than it takes to trigger a single TCP flow timeout), a slower method of recovery via failure informers, and finally a method based on a centralized load balancer. The program committee found F10 to be eminently deployable and practical, and an excellent example of the benefits that co-designing an entire engineered network can bring. While the particulars of F10's recovery are based on careful tuning of a set of system parameters whose effects will only be determined over time as it is deployed, we are impressed with the benefits shown both in simulation and emulation, and expect F10 to have impact in the design of new data center networks.
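
To make the idea of cascading recovery stages more concrete, here is a minimal Python sketch of how a fast local step and a slower centralized step might complement each other. It is an illustration only and not the F10 protocol: the function names (`local_reroute`, `inform_neighbors`, `global_rebalance`), the link names, and the flow-to-link hashing are all assumptions made for this example.

```python
from collections import Counter


def local_reroute(flow_id, uplinks, failed):
    """Stage 1 (hypothetical): a switch keeps a flow on its usual uplink if
    that link is alive, otherwise it falls through to the next live uplink."""
    n = len(uplinks)
    for offset in range(n):
        candidate = uplinks[(flow_id + offset) % n]
        if candidate not in failed:
            return candidate
    raise RuntimeError("no live uplink")


def inform_neighbors(avoid_sets, neighbors, failed_link):
    """Stage 2 (hypothetical): record, per neighboring switch, a link that
    the neighbor should steer traffic away from."""
    for switch in neighbors:
        avoid_sets.setdefault(switch, set()).add(failed_link)


def global_rebalance(flow_ids, uplinks, failed):
    """Stage 3 (hypothetical): a central component spreads all flows
    round-robin over the uplinks that are still alive."""
    live = [link for link in uplinks if link not in failed]
    if not live:
        raise RuntimeError("no live uplink")
    return {flow: live[i % len(live)] for i, flow in enumerate(sorted(flow_ids))}


if __name__ == "__main__":
    uplinks = ["u0", "u1", "u2", "u3"]   # assumed link names
    failed = {"u1"}                      # assumed single link failure
    flows = range(8)

    after_local = Counter(local_reroute(f, uplinks, failed) for f in flows)
    after_global = Counter(global_rebalance(flows, uplinks, failed).values())
    print("after local reroute:   ", dict(after_local))
    print("after global rebalance:", dict(after_global))
```

In this toy setup the local step restores connectivity immediately but piles the displaced flows onto one neighboring link (four flows on `u2`, two each on `u0` and `u3`), while the centralized step evens the load out to three, three, and two. That is one way to read the summary's point that the stages cascade: the quick mechanism buys time for the slower ones to restore balance.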
