NSDI '12 Technical Sessions

To access a presentation's content, please click on its title below.

All sessions will take place in Regency 2 unless otherwise noted.

Download the full NSDI Proceedings now.

Attendee Files 
nsdi12_proceedings_full.pdf


Wednesday, April 25, 2012

8:45 a.m.–9:00 a.m. Wednesday

Opening Remarks and Awards Presentations

Program Co-Chairs: Steven Gribble, University of Washington; Dina Katabi, Massachusetts Institute of Technology

Available Media
9:00 a.m.–10:30 a.m. Wednesday

Big Data

Session Chair: Nick Feamster, Georgia Tech

CORFU: A Shared Log Design for Flash Clusters

Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, and Ted Wobber, Microsoft Research Silicon Valley; Michael Wei, University of California, San Diego; John D. Davis, Microsoft Research Silicon Valley

CORFU organizes a cluster of flash devices as a single, shared log that can be accessed concurrently by multiple clients over the network. The CORFU shared log makes it easy to build distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services. CORFU can be viewed as a distributed SSD, providing advantages over conventional SSDs such as distributed wear-leveling, network locality, fault tolerance, incremental scalability and geodistribution. A single CORFU instance can support up to 200K appends/sec, while reads scale linearly with cluster size. Importantly, CORFU is designed to work directly over network-attached flash devices, slashing cost, power consumption and latency by eliminating storage servers.

 

Available Media
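
To make the shared-log abstraction concrete, here is a minimal single-process sketch in Python: a sequencer hands out log positions and write-once "flash units" store entries by position. The class and method names are hypothetical, and the sketch ignores faults, holes, and the network; it is not CORFU's actual protocol or API.

```python
# Hypothetical sketch of a shared-log abstraction in the spirit of CORFU.
# The real system maps positions onto network-attached flash pages and
# handles failures, holes, and reconfiguration; none of that is shown.

class Sequencer:
    """Hands out the next free log position (a soft-state counter)."""
    def __init__(self):
        self.tail = 0

    def next_position(self):
        pos = self.tail
        self.tail += 1
        return pos

class FlashUnit:
    """Write-once storage for the log entries assigned to this unit."""
    def __init__(self):
        self.pages = {}

    def write(self, pos, data):
        if pos in self.pages:
            raise ValueError("position already written")  # write-once semantics
        self.pages[pos] = data

    def read(self, pos):
        return self.pages[pos]

class SharedLog:
    """Any client may append; reads address entries by log position."""
    def __init__(self, units, sequencer):
        self.units = units            # the "cluster" of flash units
        self.seq = sequencer

    def append(self, data):
        pos = self.seq.next_position()
        self.units[pos % len(self.units)].write(pos, data)   # toy round-robin projection
        return pos

    def read(self, pos):
        return self.units[pos % len(self.units)].read(pos)

log = SharedLog([FlashUnit() for _ in range(4)], Sequencer())
p = log.append(b"txn-1")
assert log.read(p) == b"txn-1"
```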

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, University of California, Berkeley
    Awarded Best Paper!
    Awarded Community Award Honorable Mention!

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

 

Available Media
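
A toy sketch of the lineage idea behind RDDs (this is not Spark's API; all names are invented): a dataset records the parent and the coarse-grained transformation that produced it, so a lost partition can be recomputed from lineage instead of being restored from a replica or checkpoint.

```python
# Toy sketch of lineage-based recovery for coarse-grained transformations.
# Not Spark code: a real RDD is partitioned across a cluster and evaluated lazily.

class Dataset:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of lists of records
        self.parent = parent           # lineage: where the data came from
        self.fn = fn                   # lineage: how it was derived

    def map(self, fn):
        child = [list(map(fn, part)) for part in self.partitions]
        return Dataset(child, parent=self, fn=fn)

    def recompute(self, i):
        """Rebuild partition i from lineage after a (simulated) node failure."""
        if self.parent is None:
            raise RuntimeError("base data must be reloaded from stable storage")
        self.partitions[i] = list(map(self.fn, self.parent.partitions[i]))

base = Dataset([[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
squares.partitions[1] = None           # pretend the node holding this partition died
squares.recompute(1)
assert squares.partitions[1] == [9, 16]
```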

Camdoop: Exploiting In-network Aggregation for Big Data Applications

Paolo Costa, Microsoft Research Cambridge and Imperial College London; Austin Donnelly, Antony Rowstron, and Greg O'Shea, Microsoft Research Cambridge

Large companies like Facebook, Google, and Microsoft as well as a number of small and medium enterprises daily process massive amounts of data in batch jobs and in real time applications. This generates high network traffic, which is hard to support using traditional, oversubscribed, network infrastructures. To address this issue, several alternative network topologies have been proposed, aiming to increase the bandwidth available in enterprise clusters.

We observe that in many of the commonly used workloads, data is aggregated during the process and the output size is a fraction of the input size. This motivated us to explore a different point in the design space. Instead of increasing the bandwidth, we focus on decreasing the traffic by pushing aggregation from the edge into the network.

We built Camdoop, a MapReduce-like system running on CamCube, a cluster design that uses a direct-connect network topology with servers directly linked to other servers. Camdoop exploits the fact that CamCube servers forward traffic to perform in-network aggregation of data during the shuffle phase. Camdoop supports the same functions used in MapReduce and is compatible with existing MapReduce applications. We demonstrate that, in common cases, Camdoop significantly reduces network traffic and delivers large performance gains over both a version of Camdoop running over a switch and two production systems, Hadoop and Dryad/DryadLINQ.

 

Available Media
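
The in-network aggregation idea can be illustrated in a few lines of Python: servers along a tree combine partial results (here, word counts) before forwarding them upstream, so traffic shrinks at every hop. The topology and names below are invented; Camdoop performs this on CamCube's direct-connect topology during the shuffle phase.

```python
# Illustrative sketch only: hop-by-hop aggregation of associative partial results.
from collections import Counter

def aggregate(children_outputs):
    """Combine the children's partial results before forwarding upstream."""
    merged = Counter()
    for child in children_outputs:
        merged += child          # word-count partial sums are associative
    return merged

# Leaf servers hold local map output; interior servers combine on the path.
leaf1 = Counter({"cat": 3, "dog": 1})
leaf2 = Counter({"cat": 2})
leaf3 = Counter({"dog": 5})

interior = aggregate([leaf1, leaf2])
root = aggregate([interior, leaf3])
print(root)   # Counter({'dog': 6, 'cat': 5}); the data shipped above each hop shrinks
```
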
10:30 a.m.–11:00 a.m. Wednesday

Break

11:00 a.m.–Noon Wednesday

Wireless

Session Chair:  Brad Karp, University College London

WiFi-NC: WiFi Over Narrow Channels

Krishna Chintalapudi, Microsoft Research India; Bozidar Radunovic, Microsoft Research UK; Vlad Balan, USC; Michael Buettner, University of Washington; Srinivas Yerramalli, USC; Vishnu Navda and Ramachandran Ramjee, Microsoft Research India

The quest for higher data rates in WiFi is leading to the development of standards that make use of wide channels (e.g., 40MHz in 802.11n and 80MHz in 802.11ac). In this paper, we argue against this trend of using wider channels, and instead advocate that radios should communicate over multiple narrow channels for efficient and fair spectrum utilization. We propose WiFi-NC, a novel PHY-MAC design that allows radios to use WiFi over multiple narrow channels simultaneously. To enable WiFi-NC, we have developed the compound radio, a single wideband radio that exposes the abstraction of multiple narrow channel radios, each with independent transmission, reception and carrier sensing capabilities. The architecture of WiFi-NC makes it especially suitable for use in white spaces where free spectrum may be fragmented. Thus, we also develop a frequency band selection algorithm for WiFi-NC making it suitable for use in white spaces. WiFi-NC has been implemented on an FPGA-based software defined radio platform. Through real experiments and simulations, we demonstrate that WiFi-NC provides better efficiency and fairness in both common WiFi as well as future white space scenarios.

 

Available Media

Catching Whales and Minnows Using WiFiNet: Deconstructing Non-WiFi Interference Using WiFi Hardware

Shravan Rayanchu, Ashish Patro, and Suman Banerjee, University of Wisconsin Madison

We present WiFiNet — a system to detect, localize, and quantify the interference impact of various non-WiFi interference sources on WiFi traffic using commodity WiFi hardware alone. While there are numerous specialized solutions today that can detect the presence of non-WiFi devices in the unlicensed spectrum, the unique aspects of WiFiNet are four-fold: First, WiFiNet quantifies the actual interference impact of each non-WiFi device on specific WLAN traffic in real-time, which can vary from being a whale — a device that currently causes a significant reduction in WiFi throughput — to being a minnow — a device that currently has minimal impact. WiFiNet continuously monitors changes in a device’s impact that depend on many spatio-temporal factors. Second, it can accurately discern an individual device’s impact in the presence of multiple and simultaneously operating non-WiFi devices, even if the devices are of the exact same type. Third, it can pinpoint the location of these non-WiFi interference sources in the physical space. Finally, and most importantly, WiFiNet meets all these objectives not by using sophisticated and high resolution spectrum sensors, but by using emerging off-the-shelf WiFi cards that provide coarse-grained energy samples per sub-carrier. Our deployment and evaluation of WiFiNet demonstrates its high accuracy — interference estimates are within ±10% of the ground truth and the median localization error is ≤ 4 meters. We believe a system such as WiFiNet can empower existing WiFi clients and APs to adapt against non-WiFi interference in ways that have not been possible before.

 

Available Media
Noon–1:30 p.m. Wednesday
Regency 1

Symposium Luncheon

1:30 p.m.–3:00 p.m. Wednesday

Content and Service-Oriented Networking

Session Chair: Emin Gün Sirer, Cornell University

RPT: Re-architecting Loss Protection for Content-Aware Networks

Dongsu Han, Carnegie Mellon University; Ashok Anand and Aditya Akella, University of Wisconsin—Madison; Srinivasan Seshan, Carnegie Mellon University

We revisit the design of redundancy-based loss protection schemes in light of recent advances in content-aware networking. Content-aware networks minimize the overhead of redundancy if the redundancy is introduced in a way that the network can understand. With this insight, we propose a new loss protection scheme called redundant packet transmission (RPT). Using redundant video streaming as an example, we show that our approach, unlike FEC in traditional networks, provides low latency with high robustness and is insensitive to parameter selection. We tackle practical issues such as minimizing the impact on other traffic and the network. We show that RPT provides a simple and general mechanism for application-specific control and flow prioritization.

 

Available Media
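
A rough sketch of the intuition, assuming a hypothetical redundancy-eliminating hop: the sender transmits each packet several times, but the content-aware network forwards the full payload only once and turns the extra copies into small tokens, so loss protection costs little bandwidth. This is illustrative only, not RPT's actual encoding or protocol.

```python
# Illustrative sketch: duplicate transmissions become cheap when a
# content-aware link suppresses payloads it has already forwarded.
import hashlib

def sender(payload, copies=3):
    return [payload for _ in range(copies)]

def content_aware_link(packets, cache):
    """Replace payloads the downstream hop has already seen with small tokens."""
    on_wire = []
    for p in packets:
        h = hashlib.sha1(p).hexdigest()
        if h in cache:
            on_wire.append(("TOKEN", h))        # a few bytes instead of the payload
        else:
            cache.add(h)
            on_wire.append(("DATA", h, p))
    return on_wire

cache = set()
wire = content_aware_link(sender(b"frame-0017"), cache)
full = sum(1 for item in wire if item[0] == "DATA")
print(full, "full copy;", len(wire) - full, "token copies")   # 1 full copy; 2 tokens
```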

Serval: An End-Host Stack for Service-Centric Networking

Erik Nordström, David Shue, Prem Gopalan, Rob Kiefer, and Matvey Arye, Princeton University; Steven Ko, University at Buffalo; Jennifer Rexford and Michael J. Freedman, Princeton University
    Awarded Community Award Honorable Mention!

Internet services run on multiple servers in different locations, serving clients that are often mobile and multi-homed. This does not match well with today’s network stack, designed for communication between fixed hosts with topology-dependent addresses. As a result, on-line service providers resort to clumsy and management-intensive work-arounds—forfeiting the scalability of hierarchical addressing to support virtual server migration, directing all client traffic through dedicated load balancers, restarting connections when hosts move, and so on.

In this paper, we revisit the design of the network stack to meet the needs of online services. The centerpiece of our Serval architecture is a new Service Access Layer (SAL) that sits above an unmodified network layer, and enables applications to communicate directly on service names. The SAL provides a clean service-level control/data plane split, enabling policy, control, and in-stack name-based routing that connects clients to services via diverse discovery techniques. By tying active sockets to the control plane, applications trigger updates to service routing state upon invoking socket calls, ensuring up-to-date service resolution. With Serval, end-points can seamlessly change network addresses, migrate flows across interfaces, or establish additional flows for efficient and uninterrupted service access. Experiments with our high-performance in-kernel prototype, and several example applications, demonstrate the value of a unified networking solution for online services.

 

Available Media
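
As a purely hypothetical illustration of communicating on service names (none of these classes correspond to Serval's real in-kernel interfaces), the sketch below keeps a control-plane table from service identifiers to registered endpoints and resolves a name at connection time, so instances can register, leave, or move without the client ever naming an address.

```python
# Hypothetical sketch: clients name a service, not a host address.
import random

class ServiceTable:
    """Control plane: maps a service name to its currently registered endpoints."""
    def __init__(self):
        self.entries = {}

    def register(self, service_id, addr):
        self.entries.setdefault(service_id, set()).add(addr)

    def unregister(self, service_id, addr):
        self.entries[service_id].discard(addr)

    def resolve(self, service_id):
        instances = self.entries.get(service_id)
        if not instances:
            raise LookupError("no instance registered for " + service_id)
        return random.choice(sorted(instances))   # trivial stand-in for load balancing

table = ServiceTable()
table.register("webmail", ("10.0.0.5", 443))
table.register("webmail", ("10.0.1.9", 443))

# The application asks for "webmail"; the address behind it can change
# later (e.g., on server migration) without the application noticing.
addr = table.resolve("webmail")
print("connecting to", addr)
```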

Reliable Client Accounting for P2P-Infrastructure Hybrids

Paarijaat Aditya, Max Planck Institute for Software Systems (MPI-SWS); Mingchen Zhao, University of Pennsylvania; Yin Lin, Duke University and Akamai Technologies; Andreas Haeberlen, University of Pennsylvania; Peter Druschel, Max Planck Institute for Software Systems (MPI-SWS); Bruce Maggs, Duke University and Akamai Technologies; Bill Wishon, Akamai Technologies

Content distribution networks (CDNs) have started to adopt hybrid designs, which employ both dedicated edge servers and resources contributed by clients. Hybrid designs combine many of the advantages of infrastructure-based and peer-to-peer systems, but they also present new challenges. This paper identifies reliable client accounting as one such challenge. Operators of hybrid CDNs are accountable to their customers (i.e., content providers) for the CDN’s performance. Therefore, they need to offer reliable quality of service and a detailed account of content served. Service quality and accurate accounting, however, depend in part on interactions among untrusted clients. Using the Akamai NetSession client network in a case study, we demonstrate that a small number of malicious clients used in a clever attack could cause significant accounting inaccuracies.

We present a method for providing reliable accounting of client interactions in hybrid CDNs. The proposed method leverages the unique characteristics of hybrid systems to limit the loss of accounting accuracy and service quality caused by faulty or compromised clients. We also describe RCA, a system that applies this method to a commercial hybrid content-distribution network. Using trace-driven simulations, we show that RCA can detect and mitigate a variety of attacks, at the expense of a moderate increase in logging overhead.

 

Available Media
3:00 p.m.–3:30 p.m. Wednesday

Break

3:30 p.m.–5:00 p.m. Wednesday

Network Robustness

Session Chair: Ranveer Chandra, Microsoft Research

Header Space Analysis: Static Checking for Networks

Peyman Kazemian, Stanford University; George Varghese, UCSD and Yahoo! Research; Nick McKeown, Stanford University

Today’s networks typically carry or deploy dozens of protocols and mechanisms simultaneously, such as MPLS, NAT, ACLs, and route redistribution. Even when individual protocols function correctly, failures can arise from the complex interactions of their aggregate, requiring network administrators to be masters of detail. Our goal is to automatically find an important class of failures, regardless of the protocols running, for both operational and experimental networks.

To this end we developed a general and protocol-agnostic framework, called Header Space Analysis (HSA). Our formalism allows us to statically check network specifications and configurations to identify an important class of failures such as Reachability Failures, Forwarding Loops and Traffic Isolation and Leakage problems. In HSA, protocol header fields are not first class entities; instead we look at the entire packet header as a concatenation of bits without any associated meaning. Each packet is a point in the {0, 1}^L space where L is the maximum length of a packet header, and networking boxes transform packets from one point in the space to another point or set of points (multicast).

We created a library of tools, called Hassel, to implement our framework, and used it to analyze a variety of networks and protocols. Hassel was used to analyze the Stanford University backbone network, and found all the forwarding loops in less than 10 minutes, and verified reachability constraints between two subnets in 13 seconds. It also found a large and complex loop in an experimental loose source routing protocol in 4 minutes.

 

Available Media
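
The core formalism is easy to sketch: a header is a string of bits with 'x' as a wildcard, and each box is a transfer function from headers to sets of (header, output port) pairs. The 4-bit headers and rules below are invented; Hassel applies the same idea to real forwarding tables with much longer headers.

```python
# Tiny illustration of header-space transfer functions (not Hassel itself).

def matches(header, pattern):
    return all(p in ('x', h) for h, p in zip(header, pattern))

def rewrite(header, mask):
    """Overwrite bits where mask is 0 or 1; keep bits where mask is 'x'."""
    return ''.join(h if m == 'x' else m for h, m in zip(header, mask))

def box(rules):
    """A networking box: a list of (match, rewrite, out_port) rules."""
    def transfer(header):
        return [(rewrite(header, rw), port)
                for match, rw, port in rules if matches(header, match)]
    return transfer

router = box([("1xxx", "xxxx", "p1"),    # forward headers starting with 1 unchanged
              ("0xxx", "1xxx", "p2")])   # rewrite the first bit and send to p2

print(router("1010"))   # [('1010', 'p1')]
print(router("0010"))   # [('1010', 'p2')]
```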

A NICE Way to Test OpenFlow Applications

Marco Canini, Daniele Venzano, Peter Perešíni, and Dejan Kostić, EPFL; Jennifer Rexford, Princeton University

The emergence of OpenFlow-capable switches enables exciting new network functionality, at the risk of programming errors that make communication less reliable. The centralized programming model, where a single controller program manages the network, seems to reduce the likelihood of bugs. However, the system is inherently distributed and asynchronous, with events happening at different switches and end hosts, and inevitable delays affecting communication with the controller. In this paper, we present efficient, systematic techniques for testing unmodified controller programs. Our NICE tool applies model checking to explore the state space of the entire system—the controller, the switches, and the hosts. Scalability is the main challenge, given the diversity of data packets, the large system state, and the many possible event orderings. To address this, we propose a novel way to augment model checking with symbolic execution of event handlers (to identify representative packets that exercise code paths on the controller). We also present a simplified OpenFlow switch model (to reduce the state space), and effective strategies for generating event interleavings likely to uncover bugs. Our prototype tests Python applications on the popular NOX platform. In testing three real applications—a MAC-learning switch, in-network server load balancing, and energy-efficient traffic engineering—we uncover eleven bugs.

 

Available Media
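
A minimal flavor of the approach, using an invented toy controller: enumerate orderings of the pending events, run the controller through each ordering, and check an invariant in every reached state. Real NICE additionally uses symbolic execution to pick representative packets and a simplified OpenFlow switch model to keep the state space tractable.

```python
# Toy state-space exploration over event orderings; not the NICE tool itself.
from itertools import permutations

def controller(state, event):
    """Toy MAC-learning step: remember which port a source was last seen on."""
    src, in_port = event
    new = dict(state)
    new[src] = in_port
    return new

def invariant(state):
    return all(isinstance(p, int) for p in state.values())   # no corrupt entries

events = [("host-A", 1), ("host-B", 2), ("host-A", 3)]        # host A moves ports

explored = 0
violations = 0
for ordering in permutations(events):                          # every interleaving
    explored += 1
    state = {}
    for ev in ordering:
        state = controller(state, ev)
        if not invariant(state):
            violations += 1
print("orderings explored:", explored, "invariant violations:", violations)
```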

Toward Predictable Performance in Software Packet-Processing Platforms

Mihai Dobrescu and Katerina Argyraki, EPFL, Switzerland; Sylvia Ratnasamy, UC Berkeley

To become a credible alternative to specialized hardware, general-purpose networking needs to offer not only flexibility, but also predictable performance. Recent projects have demonstrated that general-purpose multicore hardware is capable of high-performance packet processing, but under a crucial simplifying assumption of uniformity: all processing cores see the same type/amount of traffic and run identical code, while all packets incur the same type of conventional processing (e.g., IP forwarding). Instead, we present a general-purpose packet-processing system that combines ease of programmability with predictable performance, while running a diverse set of applications and serving multiple clients with different needs. Offering predictability in this context is considered a hard problem, because software processes contend for shared hardware resources—caches, memory controllers, buses—in unpredictable ways. Still, we show that, in our system, (a) the way in which resource contention affects performance is predictable and (b) the overall performance depends little on how different processes are scheduled on different cores. To the best of our knowledge, our results constitute the first evidence that, when designing software network equipment, flexibility and predictability are not mutually exclusive goals.

 

Available Media

Thursday, April 26, 2012

9:00 a.m.–10:30 a.m. Thursday

Privacy

Session Chair: Xiaowei Yang, Duke University

Detecting and Defending Against Third-Party Tracking on the Web

Franziska Roesner, Tadayoshi Kohno, and David Wetherall, University of Washington

While third-party tracking on the web has garnered much attention, its workings remain poorly understood. Our goal is to dissect how mainstream web tracking occurs in the wild. We develop a client-side method for detecting and classifying five kinds of third-party trackers based on how they manipulate browser state. We run our detection system while browsing the web and observe a rich ecosystem, with over 500 unique trackers in our measurements alone. We find that most commercial pages are tracked by multiple parties, trackers vary widely in their coverage with a small number being widely deployed, and many trackers exhibit a combination of tracking behaviors. Based on web search traces taken from AOL data, we estimate that several trackers can each capture more than 20% of a user’s browsing behavior. We further assess the impact of defenses on tracking and find that no existing browser mechanisms prevent tracking by social media sites via widgets while still allowing those widgets to achieve their utility goals, which leads us to develop a new defense. To the best of our knowledge, our work is the most complete study of web tracking to date.

 

Available Media

Towards Statistical Queries over Distributed Private User Data

Ruichuan Chen, Alexey Reznichenko, and Paul Francis, Max Planck Institute for Software Systems (MPI-SWS); Johannes Gehrke, Cornell University

To maintain the privacy of individual users’ personal data, a growing number of researchers propose storing user data in client computers or personal data stores in the cloud, and allowing users to tightly control the release of that data. While this allows specific applications to use certain approved user data, it precludes broad statistical analysis of user data. Distributed differential privacy is one approach to enabling this analysis, but previous proposals are not practical in that they scale poorly or require trusted clients. This paper proposes a design that overcomes these limitations. It places tight bounds on the extent to which malicious clients can distort answers, scales well, and tolerates churn among clients. This paper presents a detailed design and analysis, and gives performance results of a complete implementation based on the deployment of over 600 clients.

 

Available Media
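
For reference, the sketch below shows the basic differentially private counting mechanism that such designs build on: each answer is perturbed with Laplace noise scaled to the query's sensitivity and the privacy budget epsilon. The paper's actual contribution, distributing this across untrusted clients, bounding how much a malicious client can distort the result, and tolerating churn, is not captured here.

```python
# Standard Laplace mechanism for a count query; illustrative parameters only.
import math
import random

def laplace_noise(scale):
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(answers, epsilon, sensitivity=1.0):
    """answers: 0/1 per client for a predicate such as 'visited site X today'."""
    true_count = sum(answers)
    return true_count + laplace_noise(sensitivity / epsilon)

answers = [random.random() < 0.3 for _ in range(600)]   # ~600 clients, as in the deployment
print(round(private_count(answers, epsilon=0.5), 1))
```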

Koi: A Location-Privacy Platform for Smartphone Apps

Saikat Guha, Mudit Jain, and Venkata N. Padmanabhan, Microsoft Research India

With mobile phones becoming first-class citizens in the online world, the rich location data they bring to the table is set to revolutionize all aspects of online life including content delivery, recommendation systems, and advertising. However, user-tracking is a concern with such location-based services, not only because location data can be linked uniquely to individuals, but because the low-level nature of current location APIs and the resulting dependence on the cloud to synthesize useful representations virtually guarantees such tracking.

In this paper, we propose privacy-preserving location-based matching as a fundamental platform primitive and as an alternative to exposing low-level, latitude-longitude (lat-long) coordinates to applications. Applications set rich location-based triggers and have these be fired based on location updates either from the local device or from a remote device (e.g., a friend’s phone). Our Koi platform, comprising a privacy-preserving matching service in the cloud and a phone-based agent, realizes this primitive across multiple phone and browser platforms. By mask-ing low-level lat-long information from applications, Koi not only avoids leaking privacy-sensitive information, it also eases the task of programmers by providing a higher-level abstraction that is easier for applications to build upon. Koi’s privacy-preserving protocol prevents the cloud service from tracking users. We verify the non-tracking properties of Koi using a theorem prover, illustrate how privacy guarantees can easily be added to a wide range of location-based applications, and show that our public deployment is performant, being able to perform 12K matches per second on a single core.

 

Available Media
10:30 a.m.–11:00 a.m. Thursday

Break

11:00 a.m.–Noon Thursday

Security and Availability

Session Chair: Ed Nightingale, Microsoft Research

Aiding the Detection of Fake Accounts in Large Scale Social Online Services

Qiang Cao, Duke University; Michael Sirivianos, Telefonica Research; Xiaowei Yang, Duke University; Tiago Pregueiro, Tuenti, Telefonica Digital

Users increasingly rely on the trustworthiness of the information exposed on Online Social Networks (OSNs). In addition, OSN providers base their business models on the marketability of this information. However, OSNs suffer from abuse in the form of the creation of fake accounts, which do not correspond to real humans. Fakes can introduce spam, manipulate online ratings, or exploit knowledge extracted from the network. OSN operators currently expend significant resources to detect, manually verify, and shut down fake accounts. Tuenti, the largest OSN in Spain, dedicates 14 full-time employees to that task alone, incurring a significant monetary cost. Such a task has yet to be successfully automated because of the difficulty in reliably capturing the diverse behavior of fake and real OSN profiles.

We introduce a new tool in the hands of OSN operators, which we call SybilRank. It relies on social graph properties to rank users according to their perceived likelihood of being fake (Sybils). SybilRank is computationally efficient and can scale to graphs with hundreds of millions of nodes, as demonstrated by our Hadoop prototype. We deployed SybilRank in Tuenti’s operation center. We found that ∼90% of the 200K accounts that SybilRank designated as most likely to be fake actually warranted suspension. On the other hand, with Tuenti’s current user-report-based approach, only ∼5% of the inspected accounts are indeed fake.

 

Available Media
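
A rough sketch of this style of trust propagation (the seeding, iteration count, and termination details below are simplifications, not the deployed algorithm): trust starts at a few manually verified seed accounts, spreads along friendship edges for a small number of rounds, and accounts are then ranked by degree-normalized trust, so fakes with few links to real users sink toward the bottom.

```python
# Simplified trust-propagation ranking on a tiny invented friendship graph.
import math

def rank_accounts(graph, seeds, rounds=None):
    n = len(graph)
    rounds = rounds or int(math.log2(n)) + 1          # early termination
    trust = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in graph}
    for _ in range(rounds):
        nxt = {v: 0.0 for v in graph}
        for v, neighbors in graph.items():
            share = trust[v] / len(neighbors)          # spread trust over edges
            for u in neighbors:
                nxt[u] += share
        trust = nxt
    # Rank by degree-normalized trust, most suspicious (lowest) first.
    return sorted(graph, key=lambda v: trust[v] / len(graph[v]))

graph = {                       # undirected friendship graph, invented for illustration
    "alice": ["bob", "carol"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob", "sybil1"],
    "sybil1": ["carol", "sybil2"],
    "sybil2": ["sybil1"],
}
print(rank_accounts(graph, seeds={"alice", "bob"}))   # the sybil accounts rank first
```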

Don't Lose Sleep Over Availability: The GreenUp Decentralized Wakeup Service

Siddhartha Sen, Princeton University; Jacob R. Lorch, Richard Hughes, Carlos Garcia Jurado Suarez, and Brian Zill, Microsoft Research; Weverton Cordeiro, Universidade Federal do Rio Grande do Sul; Jitendra Padhye, Microsoft Research

Large enterprises can save significant energy and money by putting idle desktop machines to sleep. Many systems that let desktops sleep and wake them on demand have been proposed, but enterprise IT departments refuse to deploy them because they require special hardware, disruptive virtualization technology, or dedicated per-subnet proxies, none of which are cost-effective. In response, we devised GreenUp, a minimal software-only system that allows any machine to act as a proxy for other sleeping machines in its subnet. To achieve this, GreenUp uses novel distributed techniques that spread load through randomization, efficiently synchronize state within a subnet, and maintain a minimum number of proxies despite the potential for correlated sleep times. In this paper, we present the details of GreenUp’s design as well as a theoretical analysis demonstrating its correctness and efficiency, using empirically-derived models where appropriate. We also present results and lessons from a seven-month live deployment on over 100 machines; a larger deployment on ~1,100 machines is currently ongoing.

 

Available Media
Noon–1:30 p.m. Thursday

Lunch (on your own)

1:30 p.m.–3:00 p.m. Thursday

Data Center Networking

Session Chair: Srikanth Kandula, Microsoft Research

Jellyfish: Networking Data Centers Randomly

Ankit Singla and Chi-Yao Hong, University of Illinois at Urbana-Champaign; Lucian Popa, HP Labs;  P. Brighten Godfrey, University of Illinois at Urbana-Champaign

Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect which, by adopting a random graph topology, lends itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree, supporting as many as 25% more servers at full capacity using the same equipment at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish’s unstructured design brings new challenges in routing, physical layout, and wiring. We describe approaches to resolve these challenges, and our evaluation suggests that Jellyfish could be deployed in today’s data centers.

 

Available Media
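
The construction behind a random topology is simple to sketch: keep linking random pairs of switches that still have free ports and are not already neighbors. The end-game step Jellyfish uses when the remaining free ports all sit on adjacent switches (swapping out an existing link) is omitted here, and the port counts are arbitrary.

```python
# Simplified random-topology construction; not Jellyfish's full algorithm.
import random

def random_topology(num_switches, ports_per_switch, seed=0):
    rng = random.Random(seed)
    links = set()
    free = {s: ports_per_switch for s in range(num_switches)}
    while True:
        candidates = [s for s, f in free.items() if f > 0]
        pairs = [(a, b) for i, a in enumerate(candidates)
                 for b in candidates[i + 1:] if (a, b) not in links]
        if not pairs:
            break                      # no non-neighbor pair with free ports remains
        a, b = rng.choice(pairs)
        links.add((a, b))
        free[a] -= 1
        free[b] -= 1
    return links

topo = random_topology(num_switches=20, ports_per_switch=4)
print(len(topo), "links")   # close to 20 * 4 / 2 = 40
```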

OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility

Kai Chen, Northwestern University; Ankit Singla, UIUC; Atul Singh, Kishore Ramachandran, Lei Xu, and Yueping Zhang, NEC Labs America, Inc.; Xitao Wen and Yan Chen, Northwestern University

Data center networks (DCNs) form the backbone infrastructure of many large-scale enterprise applications as well as emerging cloud computing providers. This paper describes the design, implementation and evaluation of OSA, a novel Optical Switching Architecture for DCNs. Leveraging runtime reconfigurable optical devices, OSA dynamically changes its topology and link capacities, thereby achieving unprecedented flexibility to adapt to dynamic traffic patterns. Extensive analytical simulations using both real and synthetic traffic patterns demonstrate that OSA can deliver high bisection bandwidth (60%-100% of the non-blocking architecture). Implementation and evaluation of a small-scale functional prototype further demonstrate the feasibility of OSA.

 

Available Media

Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center

Mohammad Alizadeh, Stanford University; Abdul Kabbani, Google; Tom Edsall, Cisco Systems; Balaji Prabhakar, Stanford University; Amin Vahdat, Google and U.C. San Diego; Masato Yasuda, NEC Corporation, Japan

Traditional measures of network goodness—goodput, quality of service, fairness—are expressed in terms of bandwidth. Network latency has rarely been a primary concern because delivering the highest level of bandwidth essentially entails driving up latency—at the mean and, especially, at the tail. Recently, however, there has been renewed interest in latency as a primary metric for mainstream applications. In this paper, we present the HULL (High-bandwidth Ultra-Low Latency) architecture to balance two seemingly contradictory goals: near baseline fabric latency and high bandwidth utilization. HULL leaves ‘bandwidth headroom’ using Phantom Queues that deliver congestion signals before network links are fully utilized and queues form at switches. By capping utilization at less than link capacity, we leave room for latency sensitive traffic to avoid buffering and the associated large delays. At the same time, we use DCTCP, a recently proposed congestion control algorithm, to adaptively respond to congestion and to mitigate the bandwidth penalties which arise from operating in a bufferless fashion. HULL further employs packet pacing to counter burstiness caused by Interrupt Coalescing and Large Send Offloading. Our implementation and simulation results show that by sacrificing a small amount (e.g., 10%) of bandwidth, HULL can dramatically reduce average and tail latencies in the data center.

 

Available Media
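
A Phantom Queue can be sketched as a counter that drains at a fraction of the real link speed and marks packets once its virtual backlog crosses a threshold, so congestion signals fire before the physical queue ever builds. The rates and thresholds below are illustrative, not the paper's configuration.

```python
# Illustrative phantom-queue marking; parameters are made up for the example.

def phantom_queue(packets, link_bps, drain_fraction=0.9, mark_bytes=3000):
    """packets: list of (arrival_time_sec, size_bytes). Returns per-packet ECN marks."""
    drain_bps = drain_fraction * link_bps
    vqueue = 0.0            # virtual backlog in bytes (no packet is actually queued)
    last_t = 0.0
    marks = []
    for t, size in packets:
        vqueue = max(0.0, vqueue - drain_bps / 8.0 * (t - last_t))  # drain since last arrival
        vqueue += size
        marks.append(vqueue > mark_bytes)
        last_t = t
    return marks

# 1500-byte packets arriving back-to-back at 1 Gbps: the virtual queue,
# drained at only 0.9 Gbps, grows until packets start getting marked.
gbps = 1e9
packets = [(i * 1500 * 8 / gbps, 1500) for i in range(200)]
marks = phantom_queue(packets, link_bps=gbps)
print("first marked packet:", marks.index(True))
```
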
3:00 p.m.–3:30 p.m. Thursday

Break

3:30 p.m.–5:00 p.m. Thursday

Big Data (2)

Session Chair: Michael Freedman, Princeton University

PACMan: Coordinated Memory Caching for Parallel Jobs

Ganesh Ananthanarayanan, Ali Ghodsi, and Andrew Wang, University of California, Berkeley; Dhruba Borthakur, Facebook; Srikanth Kandula, Microsoft Research; Scott Shenker and Ion Stoica, University of California, Berkeley

Data-intensive analytics on large clusters is important for modern Internet services. As machines in these clusters have large memories, in-memory caching of inputs is an effective way to speed up these analytics jobs. The key challenge, however, is that these jobs run multiple tasks in parallel and a job is sped up only when inputs of all such parallel tasks are cached. Indeed, a single task whose input is not cached can slow down the entire job. To meet this “all-or-nothing” property, we have built PACMan, a caching service that coordinates access to the distributed caches. This coordination is essential to improve job completion times and cluster efficiency. To this end, we have implemented two cache replacement policies on top of PACMan’s coordinated infrastructure — LIFE that minimizes average completion time by evicting large incomplete inputs, and LFU-F that maximizes cluster efficiency by evicting less frequently accessed inputs. Evaluations on production workloads from Facebook and Microsoft Bing show that PACMan reduces average completion time of jobs by 53% and 51% (small interactive jobs improve by 77%), and improves efficiency of the cluster by 47% and 54%, respectively.

 

Available Media
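
The "all-or-nothing" intuition behind the LIFE policy can be sketched as a victim-selection rule: when space is needed, evict a block from the largest input that is already incompletely cached (its job was not going to see a speedup anyway) before breaking a fully cached input. This is a single-machine caricature; PACMan coordinates the decision across the cluster, and LFU-F uses access frequency instead.

```python
# Simplified all-or-nothing-aware eviction; invented names, not PACMan's code.

def pick_victim(cache, inputs):
    """cache: set of cached block ids; inputs: {job: set of its input block ids}."""
    def cached_fraction(blocks):
        return len(blocks & cache) / len(blocks)

    # Prefer inputs that are already incomplete; among them, evict from the largest.
    incomplete = [b for b in inputs.values() if 0 < cached_fraction(b) < 1]
    candidates = incomplete or [b for b in inputs.values() if cached_fraction(b) == 1]
    victim_input = max(candidates, key=len)
    return next(iter(victim_input & cache))   # any cached block of that input

inputs = {"jobA": {1, 2, 3, 4}, "jobB": {5, 6}}
cache = {1, 2, 3, 5, 6}                        # jobA's input incomplete, jobB's complete
print(pick_victim(cache, inputs))               # evicts one of jobA's blocks, keeps jobB whole
```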

Reoptimizing Data Parallel Computing

Sameer Agarwal, University of California, Berkeley; Srikanth Kandula, Microsoft Research; Nico Bruno and Ming-Chuan Wu, Microsoft Bing; Ion Stoica, University of California, Berkeley; Jingren Zhou, Microsoft Bing

Performant execution of data-parallel jobs needs good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generate these plans. Yet, these properties are difficult to estimate due to the highly distributed nature of these frameworks, the freedom that allows users to specify arbitrary code as operations on the data, and the fact that jobs in modern clusters have evolved beyond single map and reduce phases to logical graphs of operations. Using fixed a priori estimates of these properties to choose execution plans, as modern systems do, leads to poor performance in several instances. We present RoPE, a first step towards re-optimizing data-parallel jobs. RoPE collects certain code and data properties by piggybacking on job execution. It adapts execution plans by feeding these properties to a query optimizer. We show how this improves the future invocations of the same (and similar) jobs and characterize the scenarios of benefit. Experiments on Bing’s production clusters show up to 2× improvement in response time for production jobs at the 75th percentile while using 1.5× fewer resources.

Available Media

Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions

Jiaxing Zhang and Hucheng Zhou, Microsoft Research Asia; Rishan Chen, Microsoft Research Asia and Peking University; Xuepeng Fan, Microsoft Research Asia and Huazhong University of Science and Technology; Zhenyu Guo and Haoxiang Lin, Microsoft Research Asia; Jack Y. Li, Microsoft Research Asia and Georgia Institute of Technology; Wei Lin and Jingren Zhou, Microsoft Bing; Lidong Zhou, Microsoft Research Asia

Map/Reduce style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as “black boxes”, we propose to analyze those functions to turn them into “gray boxes” that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% in terms of disk and network I/O for shuffling, and up to 48% in terms of cross-pod network traffic.

 

Available Media
6:00 p.m.–7:30 p.m. Thursday  
Regency 1

Poster Session and Reception

Session Chair: Srikanth Kandula, Microsoft Research

NSDI will be continuing its tradition of showcasing early research in progress at a poster and demo session during the reception. See the Poster and Demo Session for submission instructions.

Friday, April 27, 2012

9:00 a.m.–10:30 a.m. Friday

New Architectures and Platforms

Session Chair: Srinivasan Seshan, Carnegie Mellon University

XIA: Efficient Support for Evolvable Internetworking

Dongsu Han, Carnegie Mellon University; Ashok Anand, University of Wisconsin—Madison; Fahad Dogar, Boyan Li, and Hyeontaek Lim, Carnegie Mellon University; Michel Machado, Boston University; Arvind Mukundan, Carnegie Mellon University; Wenfei Wu and Aditya Akella, University of Wisconsin—Madison; David G. Andersen, Carnegie Mellon University; John W. Byers, Boston University; Srinivasan Seshan and Peter Steenkiste, Carnegie Mellon University

Motivated by limitations in today’s host-centric IP network, recent studies have proposed clean-slate network architectures centered around alternate first-class principals, such as content, services, or users. However, much like the host-centric IP design, elevating one principal type above others hinders communication between other principals and inhibits the network’s capability to evolve. This paper presents the eXpressive Internet Architecture (XIA), an architecture with native support for multiple principals and the ability to evolve its functionality to accommodate new, as yet unforeseen, principals over time. We describe key design requirements, and demonstrate how XIA’s rich addressing and forwarding semantics facilitate flexibility and evolvability, while keeping core network functions simple and efficient. We describe case studies that demonstrate key functionality XIA enables.

 

Available Media

Design and Implementation of a Consolidated Middlebox Architecture

Vyas Sekar, Intel Labs; Norbert Egi, Huawei; Sylvia Ratnasamy, UC Berkeley; Michael K. Reiter, UNC Chapel Hill; Guangyu Shi, Huawei

Network deployments handle changing application, workload, and policy requirements via the deployment of specialized network appliances or “middleboxes”. Today, however, middlebox platforms are expensive and closed systems, with little or no hooks for extensibility. Furthermore, they are acquired from independent vendors and deployed as standalone devices with little cohesiveness in how the ensemble of middleboxes is managed. As network requirements continue to grow in both scale and variety, this bottom-up approach puts middlebox deployments on a trajectory of growing device sprawl with corresponding escalation in capital and management costs.

To address this challenge, we present CoMb, a new architecture for middlebox deployments that systematically explores opportunities for consolidation, both at the level of building individual middleboxes and in managing a network of middleboxes. This paper addresses key resource management and implementation challenges that arise in exploiting the benefits of consolidation in middlebox deployments. Using a prototype implementation in Click, we show that CoMb reduces the network provisioning cost 1.8–2.5× and reduces the load imbalance in a network by 2–25×.

 

Available Media

An Operating System for the Home

Colin Dixon, IBM Research; Ratul Mahajan, Sharad Agarwal, A.J. Brush, Bongshin Lee, Stefan Saroiu, and Paramvir Bahl, Microsoft Research

Network devices for the home such as remotely controllable locks, lights, thermostats, cameras, and motion sensors are now readily available and inexpensive. In theory, this enables scenarios like remotely monitoring cameras from a smartphone or customizing climate control based on occupancy patterns. However, in practice today, such smarthome scenarios are limited to expert hobbyists and the rich because of the high overhead of managing and extending current technology.

We present HomeOS, a platform that bridges this gap by presenting users and developers with a PC-like abstraction for technology in the home. It presents network devices as peripherals with abstract interfaces, enables cross-device tasks via applications written against these interfaces, and gives users a management interface designed for the home environment. HomeOS already has tens of applications and supports a wide range of devices. It has been running in 12 real homes for 4–8 months, and 42 students have built new applications and added support for additional devices independent of our efforts.

 

Available Media
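
The "devices as peripherals with abstract interfaces" idea can be illustrated with a hypothetical DimmableLight interface: an application is written against the interface rather than a particular vendor's protocol, so the same app drives any device whose driver implements it. The class names below are invented for illustration and are not HomeOS APIs.

```python
# Illustrative device abstraction: one app, interchangeable drivers.
from abc import ABC, abstractmethod

class DimmableLight(ABC):
    """Abstract interface an application programs against."""
    @abstractmethod
    def set_level(self, percent: int) -> None: ...

class ZWaveDimmer(DimmableLight):
    def __init__(self, node_id):
        self.node_id = node_id
    def set_level(self, percent):
        print(f"z-wave node {self.node_id}: level {percent}%")   # stand-in for the radio command

class SimulatedDimmer(DimmableLight):
    def set_level(self, percent):
        print(f"simulated light set to {percent}%")

def movie_mode(lights):
    """An 'app': dim every light in the room, whatever hardware it is."""
    for light in lights:
        light.set_level(20)

movie_mode([ZWaveDimmer(node_id=7), SimulatedDimmer()])
```
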
10:30 a.m.–11:00 a.m. Friday

Break

11:00 a.m.–Noon Friday

Cloud Performance

Session Chair: David Andersen, Carnegie Mellon University

Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems

Karthik Nagaraj, Charles Killian, and Jennifer Neville, Purdue University

Diagnosis and correction of performance issues in modern, large-scale distributed systems can be a daunting task, since a single developer is unlikely to be familiar with the entire system and it is hard to characterize the behavior of a software system without completely understanding its internal components. This paper describes DISTALYZER, an automated tool to support developer investigation of performance issues in distributed systems. We aim to leverage the vast log data available from large scale systems, while reducing the level of knowledge required for a developer to use our tool. Specifically, given two sets of logs, one with good and one with bad performance, DISTALYZER uses machine learning techniques to compare system behaviors extracted from the logs and automatically infer the strongest associations between system components and performance. The tool outputs a set of inter-related event occurrences and variable values that exhibit the largest divergence across the log sets and most directly affect the overall performance of the system. These patterns are presented to the developer for inspection, to help them understand which system component(s) likely contain the root cause of the observed performance issue, thus alleviating the need for many human hours of manual inspection. We demonstrate the generality and effectiveness of DISTALYZER on three real distributed systems by showing how it discovers and highlights the root cause of six performance issues across the systems. DISTALYZER has broad applicability to other systems since it is dependent only on the logs for input, and not on the source code.

 

Available Media

Orchestrating the Deployment of Computations in the Cloud with Conductor

Alexander Wieder, Pramod Bhatotia, Ansley Post, and Rodrigo Rodrigues, Max Planck Institute for Software Systems (MPI-SWS)

When organizations move computation to the cloud, they must choose from a myriad of cloud services that can be used to outsource these jobs. The impact of this choice on price and performance is unclear, even for technical users. To further complicate this choice, factors like price fluctuations due to spot markets, or the cost of recovering from faults must also be factored in. In this paper, we present Conductor, a system that frees cloud customers from the burden of deciding which services to use when deploying MapReduce computations in the cloud. With Conductor, customers only specify goals, e.g., minimizing monetary cost or completion time, and the system automatically selects the best cloud services to use, deploys the computation according to that selection, and adapts to changing conditions at deployment time. The design of Conductor includes several novel features, such as a system to manage the deployment of cloud computations across different services, and a resource abstraction layer that provides a unified interface to these services, therefore hiding their low-level differences and simplifying the planning and deployment of the computation. We implemented Conductor and integrated it with the Hadoop framework. Our evaluation using Amazon Web Services shows that Conductor can find very subtle opportunities for cost savings while meeting deadline requirements, and that Conductor incurs a modest overhead due to planning computations and the resource abstraction layer.

 

Available Media
Noon–1:30 p.m. Friday

Lunch (on your own)

1:30 p.m.–3:00 p.m. Friday

Transport

Session Chair: Jon Crowcroft, University of Cambridge

Fitting Square Pegs Through Round Pipes: Unordered Delivery Wire-Compatible with TCP and TLS

Michael F. Nowlan, Yale University; Nabin Tiwari and Janardhan Iyengar, Franklin and Marshall College; Syed Obaid Amin and Bryan Ford, Yale University

Internet applications increasingly employ TCP not as a stream abstraction, but as a substrate for application-level transports, a use that converts TCP’s in-order semantics from a convenience blessing to a performance curse. As Internet evolution makes TCP’s use as a substrate likely to grow, we offer Minion, an architecture for backward-compatible out-of-order delivery atop TCP and TLS. Small OS API extensions allow applications to manage TCP’s send buffer and to receive TCP segments out-of-order. Atop these extensions, Minion builds application-level protocols offering true unordered datagram delivery, within streams preserving strict wire-compatibility with unsecured or TLS-secured TCP connections. Minion’s protocols can run on unmodified TCP stacks, but benefit incrementally when either endpoint is upgraded, for a backward-compatible deployment path. Experiments suggest that Minion can noticeably improve performance of applications such as conferencing, virtual private networking, and web browsing, while incurring minimal CPU or bandwidth costs.

 

Available Media

How Hard Can It Be? Designing and Implementing a Deployable Multipath TCP

Costin Raiciu, Universitatea Politehnica Bucuresti; Christoph Paasch and Sebastien Barre, Université Catholique de Louvain; Alan Ford; Michio Honda, Keio University; Fabien Duchene and Olivier Bonaventure, Université Catholique de Louvain; Mark Handley, University College London
    Awarded Community Award!

Networks have become multipath: mobile devices have multiple radio interfaces, datacenters have redundant paths, and multihoming is the norm for big server farms. Meanwhile, TCP is still only single-path.

Is it possible to extend TCP to enable it to support multiple paths for current applications on today’s Internet? The answer is positive. We carefully review the constraints—partly due to various types of middleboxes—that influenced the design of Multipath TCP and show how we handled them to achieve its deployability goals.

We report our experience in implementing Multipath TCP in the Linux kernel and we evaluate its performance. Our measurements focus on the algorithms needed to efficiently use paths with different characteristics, notably send and receive buffer tuning and segment reordering. We also compare the performance of our implementation with regular TCP on web servers. Finally, we discuss the lessons learned from designing MPTCP.

 

Available Media

The TCP Outcast Problem: Exposing Unfairness in Data Center Networks

Pawan Prakash, Advait Dixit, Y. Charlie Hu, and Ramana Kompella, Purdue University

In this paper, we observe that bandwidth sharing via TCP in commodity data center networks organized in multi-rooted tree topologies can lead to severe unfairness under many common traffic patterns, which we term the TCP Outcast problem. When a large set of flows and a small set of flows arrive at two input ports of a switch destined to one common output port, the small set of flows loses out on its throughput share significantly (sometimes by almost an order of magnitude). The Outcast problem occurs mainly in the taildrop queues that commodity switches use. Using careful analysis, we discover that taildrop queues exhibit a phenomenon known as port blackout, where a series of packets from one port are dropped. Port blackout affects the fewer flows more significantly, as they lose more consecutive packets, leading to TCP timeouts. In this paper, we show the existence of the TCP Outcast problem on a data center network testbed with real hardware under different scenarios. We then evaluate different solutions such as RED, SFQ, TCP pacing, and a new solution called equal-length routing to mitigate the Outcast problem.

 

Available Media
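
Port blackout is easy to reproduce in a toy simulation: two input ports feed one taildrop output queue, and whenever the queue fills during a burst, the rest of that burst is dropped back to back. The queue and burst sizes below are arbitrary, but they show why a port carrying only a couple of flows keeps losing entire windows (and hence takes TCP timeouts) while the many-flow port spreads its losses across flows.

```python
# Toy taildrop simulation of the port-blackout effect; parameters are arbitrary.
from collections import deque

QUEUE_CAP = 8
queue = deque()
drops = {"many_flow_port": 0, "few_flow_port": 0}
whole_burst_lost = {"many_flow_port": 0, "few_flow_port": 0}

def send_burst(port, size):
    dropped = 0
    for _ in range(size):
        if len(queue) >= QUEUE_CAP:
            drops[port] += 1          # taildrop: arrivals to a full queue are lost
            dropped += 1
        else:
            queue.append(port)
    if dropped == size:
        whole_burst_lost[port] += 1   # consecutive losses of a whole window -> likely timeout

for round_ in range(50):
    for _ in range(4):                # the output link drains a few packets
        if queue:
            queue.popleft()
    if round_ % 2:                    # arrival order alternates between rounds
        send_burst("many_flow_port", 6)
        send_burst("few_flow_port", 2)
    else:
        send_burst("few_flow_port", 2)
        send_burst("many_flow_port", 6)

print(drops)
print(whole_burst_lost)               # only the few-flow port keeps losing whole bursts
```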
