Diego Didona, EPFL; Willy Zwaenepoel, EPFL and University of Sydney
This paper introduces the concept of size-aware sharding to improve tail latencies for in-memory key-value stores, and describes its implementation in the Minos key-value store. Tail latencies are crucial in distributed applications with high fan-out ratios, because overall response time is determined by the slowest response. Size-aware sharding distributes requests for keys to cores according to the size of the item associated with the key. In particular, requests for small and large items are sent to disjoint subsets of cores. Size-aware sharding improves tail latencies by avoiding head-of-line blocking, in which a request for a small item gets queued behind a request for a large item. Alternative size-unaware approaches to sharding such as keyhash-based sharding, request dispatching and stealing do not avoid head-of-line blocking, and therefore exhibit worse tail latencies. The challenge in implementing size-aware sharding is to maintain high throughput by avoiding the cost of software dispatching and by achieving load balancing between different cores. Minos uses hardware dispatch for all requests for small items, which form the very large majority of all requests. It achieves load balancing by adapting the number of cores handling requests for small and large items to their relative presence in the workload. We compare Minos to three state-of-the-art designs of in-memory KV stores. Compared to its closest competitor, Minos achieves a 99th percentile latency that is up to two orders of magnitude lower. Put differently, for a given value for the 99th percentile latency equal to 10 times the mean service time, Minos achieves a throughput that is up to 7.4 times higher.
Cheng Tan, NYU; Ze Jin, Cornell University; Chuanxiong Guo, Bytedance; Tianrong Zhang, Microsoft; Haitao Wu, Google; Karl Deng, Dongming Bi, and Dong Xiang, Microsoft
The availability of data center services is jeopardized by various network incidents. One of the biggest challenges for network incident handling is to accurately localize the failures, among millions of servers and tens of thousands of network devices. In this paper, we propose NetBouncer, a failure localization system that leverages the IP-in-IP technique to actively probe paths in a data center network. NetBouncer provides a complete failure localization framework which is capable of detecting both device and link failures. It further introduces an algorithm for high accuracy link failure inference that is resilient to real-world data inconsistency by integrating both our troubleshooting domain knowledge and machine learning techniques. NetBouncer has been deployed in Microsoft Azure’s data centers for three years. And in practice, it produced no false positives and only a few false negatives so far.
Daehyeok Kim and Tianlong Yu, Carnegie Mellon University; Hongqiang Harry Liu, Alibaba; Yibo Zhu, Jitu Padhye, and Shachar Raindel, Microsoft; Chuanxiong Guo, Bytedance; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University
Many popular large-scale cloud applications are increasingly using containerization for high resource efficiency and lightweight isolation. In parallel, many data-intensive applications (e.g., data analytics and deep learning frameworks) are adopting or looking to adopt RDMA for high networking performance. Industry trends suggest that these two approaches are on an inevitable collision course. In this paper, we present FreeFlow, a software-based RDMA virtualization framework designed for containerized clouds. FreeFlow realizes virtual RDMA networking purely with a software-based approach using commodity RDMA NICs. Unlike existing RDMA virtualization solutions, FreeFlow fully satisfies the requirements from cloud environments, such as isolation for multi-tenancy, portability for container migrations, and controllability for control and data plane policies. FreeFlow is also transparent to applications and provides networking performance close to bare-metal RDMA with low CPU overhead. In our evaluations with TensorFlow and Spark, FreeFlow provides almost the same application performance as bare-metal RDMA.
Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David Andersen, Carnegie Mellon University
It is commonly believed that datacenter networking software must sacrifice generality to attain high performance. The popularity of specialized distributed systems designed specifically for niche technologies such as RDMA, lossless networks, FPGAs, and programmable switches testifies to this belief. In this paper, we show that such specialization is not necessary. eRPC is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics. eRPC performs well in three key metrics: message rate for small messages; bandwidth for large messages; and scalability to a large number of nodes and CPU cores. It handles packet loss, congestion, and background request execution. In microbenchmarks, one CPU core can handle up to 10 million small RPCs per second, or send large messages at 75 Gbps. We port a production-grade implementation of Raft state machine replication to eRPC without modifying the core Raft source code. We achieve 5.5 microseconds of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.
Ran Shu and Peng Cheng, Microsoft Research; Guo Chen, Microsoft Research & Hunan University; Zhiyuan Guo, Microsoft Research & Beihang University; Lei Qu and Yongqiang Xiong, Microsoft Research; Derek Chiou and Thomas Moscibroda, Microsoft Azure
FPGAs have been deployed at massive scale in data centers. The currently available communication architectures, however, make FPGAs very difficult to utilize resources in data center. In this paper, we present Direct Universal Access (DUA), a communication architecture that provides uniform access for FPGA to heterogeneous data center resources. Without considering machine boundaries, DUA provides global names and a common interface for communicating with various resources, where the underlying network automatically routes traffic and manages resource multiplexing. Our benchmarks show that DUA provides simple and fair-share resource access with small logic area overhead (<10%) and negligible latency (<0.2$\mu$s). We also build two practical multi-FPGA applications, deep crossing and regular expression matching, on top of DUA to demonstrate the usability and efficiency.
David Sommer, Aritra Dhar, Luka Malisa, and Esfandiar Mohammadi, ETH Zurich; Daniel Ronzani, Ronzani Schlauri Attorneys; Srdjan Capkun, ETH Zurich
Shizhen Zhao, Rui Wang, Junlan Zhou, Joon Ong, Jeffrey C. Mogul, and Amin Vahdat, Google, Inc.
Clos topologies have been widely adopted for large-scale data center networks (DCNs), but it has been difficult to support incremental expansions for Clos DCNs. Some prior work has claimed that the structure of Clos topologies hinders incremental expansion.
We demonstrate that it is indeed possible to design expandable Clos DCNs, and to expand them while they are carrying live traffic, without incurring packet loss. We use a layer of patch panels between blocks of switches in a Clos DCN, which makes physical rewiring feasible, and we describe how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper. We also describe a block-aggregation technique that makes our ILP approach scalable. We tested our "minimal-rewiring" solver on two kinds of fine-grained expansions using 2250 synthetic DCN topologies, and found that the solver can handle 99% of these cases while changing under 25% of the connections. Compared to prior approaches, this solver (on average) reduces the average number of "stages" per expansion from 4 to 1.29, and reduces the number of wires changed by an order of magnitude or more—a significant improvement to our operational costs, and to our exposure (during expansions) to capacity-reducing faults.
Junaid Khalid and Aditya Akella, University of Wisconsin - Madison
Network functions virtualization (NFV) allows operators to employ NF chains to realize custom policies, and dynamically add instances to meet demand or for failover. NFs maintain detailed per- and cross-flow state which needs careful management, especially during dynamic actions. Crucially, state management must: (1) ensure NF chain-wide correctness and (2) have good performance. To this end, we built CHC, an NFV framework that leverages an external state store coupled with state management algorithms and metadata maintenance for correct operation even under a range of failures. Our evaluation shows that CHC can support ~10Gbps per-NF throughput and $<0.6μs increase in median per-NF packet processing latency, and chain-wide correctness at little additional cost.
Vishal Shrivastav, Cornell University; Asaf Valadarsky, Hebrew University of Jerusalem; Hitesh Ballani and Paolo Costa, Microsoft Research; Ki Suh Lee, Waltz Networks; Han Wang, Barefoot Networks; Rachit Agarwal and Hakim Weatherspoon, Cornell University
Disaggregated racks comprise a dense cluster of separate pools of compute, memory and storage blades, all inter-connected through an internal network within a single rack. However, their density poses a unique challenge for the rack’s network: it needs to connect an order of magnitude more nodes than today’s racks without exceeding the rack’s fixed power budget and without compromising on performance. We present Shoal, a power-efficient yet performant intra-rack network fabric built using fast circuit switches. Such switches consume less power as they have no buffers and no packet inspection mechanism, yet can be reconfigured in nanoseconds. Rack nodes transmit according to a static schedule such that there is no in-network contention without requiring a centralized controller. Shoal’s congestion control leverages the physical fabric to achieve fairness and both bounded worst-case network throughput and queuing. We use an FPGA-based prototype, testbed experiments, and simulations to illustrate Shoal’s mechanisms are practical, and can simultaneously achieve high density and high performance: 71% lower power and comparable or higher performance than today’s network designs.
Da Yu, Brown University; Yibo Zhu and Behnaz Arzani, Microsoft; Rodrigo Fonseca, Brown University; Tianrong Zhang, Karl Deng, and Lihua Yuan, Microsoft
Distributed, in-network packet capture is still the last resort for diagnosing network problems. Despite recent advances in collecting packet traces scalably, effectively utilizing pervasive packet captures still poses important challenges. Arbitrary combinations of middleboxes which transform packet headers make it challenging to even identify the same packet across multiple hops; packet drops in the collection system create ambiguities that must be handled; the large volume of captures, and their distributed nature, make it hard to do even simple processing; and the one-off and urgent nature of problems tends to generate ad-hoc solutions that are not reusable and do not scale. In this paper we propose dShark to address these challenges. dShark allows intuitive groupings of packets across multiple traces that are robust to header transformations and capture noise, offering simple streaming data abstractions for network operators. Using dShark on production packet captures from a major cloud provider, we show that dShark makes it easy to write concise and reusable queries against distributed packet traces that solve many common problems in diagnosing complex networks. Our evaluation shows that dShark can analyze production packet traces with more than 10 Mpps throughput on a commodity server, and has near-linear speedup when scaling out on multiple servers.
Ahmed Saeed and Yimeng Zhao, Georgia Institute of Technology; Nandita Dukkipati, Google; Ellen Zegura and Mostafa Ammar, Georgia Institute of Technology; Khaled Harras, Carnegie Mellon University; Amin Vahdat, Google
Packet scheduling determines the ordering of packets in a queuing data structure with respect to some ranking function that is mandated by a scheduling policy. It is the core component in many recent innovations in optimizing network performance and utilization. Packet scheduling is used for network resource allocation, meeting network-wide delay objectives, or providing isolation and differentiation of service. Our focus in this paper is on the design and deployment of packet scheduling in software. Software schedulers have several advantages including shorter development cycle and flexibility in functionality and deployment location. We substantially improve software packet scheduling performance, while maintaining its flexibility, by exploiting underlying features of packet ranking; the fact that packet ranks are integers that have predetermined ranges and that many packets will typically have equal rank. This allows us to rely on integer priority queues, compared to existing ranking algorithms, that rely on comparison-based priority queues that assume continuous ranks with infinite range. We introduce Eiffel, a novel programmable packet scheduling system. At the core of Eiffel is an integer priority queue based on the Find First Set (FFS) instruction and designed to support a wide range of policies and ranking functions efficiently. As an even more efficient alternative, we also propose a new approximate priority queue that can outperform FFS-based queues for some scenarios. To support flexibility, Eiffel introduces novel programming abstractions to express scheduling policies that cannot be captured by current, state of the art scheduler programming models. We evaluate Eiffel in a variety of settings and in both Kernel and userspace deployments. We show that it outperforms state of the art systems by 3-40x in terms of either number of cores utilized for network processing or number of flows given fixed processing capacity.
Salvatore Pontarelli, Axbryd/CNIT; Roberto Bifulco, NEC Laboratories Europe; Marco Bonola, Axbryd/CNIT; Carmelo Cascone, Open Networking Foundation; Marco Spaziani and Valerio Bruschi, CNIT/University of Rome Tor Vergata; Davide Sanvito, Politecnico di Milano; Giuseppe Siracusano, NEC Laboratories Europe; Antonio Capone, Politecnico di Milano; Michio Honda and Felipe Huici, NEC Laboratories Europe; Giuseppe Bianchi, CNIT/University of Rome Tor Vergata
Programmable NICs allow for better scalability to handle growing network workloads, however, providing an expressive, yet simple, abstraction to program stateful network functions in hardware remains a research challenge. We address the problem with FlowBlaze, an open abstraction for building stateful packet processing functions in hardware. The abstraction is based on Extended Finite State Machines and introduces the explicit definition of flow state, allowing FlowBlaze to leverage flow-level parallelism. FlowBlaze is expressive, supporting a wide range of complex network functions, and easy to use, hiding low-level hardware implementation issues from the programmer. Our implementation of FlowBlaze on a NetFPGA SmartNIC achieves very low latency (on the order of a few microseconds), consumes relatively little power, can hold per-flow state for hundreds of thousands of flows and yields speeds of 40 Gb/s, allowing for even higher speeds on newer FPGA models. Both hardware and software implementations of FlowBlaze are publicly available.
Anurag Khandelwal, UC Berkeley; Rachit Agarwal, Cornell University; Ion Stoica, UC Berkeley
Confluo is an end-host stack that can be integrated with existing network management tools to enable monitoring and diagnosis of network-wide events using telemetry data distributed across end-hosts, even for high-speed networks. Confluo achieves these properties using a new data structure—Atomic MultiLog—that supports highly-concurrent read-write operations by exploiting two properties specific to telemetry data: (1) once processed by the stack, the data is neither updated nor deleted; and (2) each field in the data has a fixed pre-defined size. Our evaluation results show that, for packet sizes 128B or larger, Confluo executes thousands of triggers and tens of filters at line rate (for 10Gbps links) using a single core.
Assaf Eisenman, Stanford University; Asaf Cidon, Stanford University and Barracuda Networks; Evgenya Pergament and Or Haimovich, Stanford University; Ryan Stutsman, University of Utah; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University
As its price per bit drops, SSD is increasingly becoming the default storage medium for hot data in cloud application databases. Even though SSD’s price per bit is more than 10× lower, and it provides sufficient performance (when accessed over a network) compared to DRAM, the durability of flash has limited its adoption in write-heavy use cases, such as key-value caching. This is because key-value caches need to frequently insert, update and evict small objects. This causes excessive writes and erasures on flash storage, which significantly shortens the lifetime of flash. We present Chimera, a hybrid key-value cache that uses DRAM as a “filter” to control and limit writes to SSD. Chimera performs lightweight machine learning admission control to predict which objects are likely to be read frequently without getting updated; these objects, which are prime candidates to be stored on SSD, are written to SSD in large chunks sequentially. In order to efficiently utilize the cache’s available memory, we design a novel in-memory index for the variable-sized objects stored on flash that requires only 4 bytes per object in DRAM. We describe Chimera’s design and implementation, and evaluate it on real-world traces from a widely used caching service, Memcachier. Compared to state-of-the-art systems that suffer a write amplification of 2.5× or more, Chimera maintains a median write amplification of 0.5× (since many filtered objects are never written to flash at all), without any loss of hit rate or throughput.
Mehrdad Hessar, Ali Najafi, and Shyamnath Gollakota, University of Washington
We present the first wireless protocol that scales to hundreds of concurrent transmissions from backscatter devices. Our key innovation is a distributed coding mechanism that works below the noise floor, operates on backscatter devices and can decode all the concurrent transmissions at the receiver using a single FFT operation. Our design addresses practical issues such as timing and frequency synchronization as well as the near-far problem. We deploy our design using a testbed of backscatter hardware and show that our protocol scales to concurrent transmissions from 256 devices using a bandwidth of only 500 kHz. Our results show throughput and latency improvements of 14–62x and 15–67x over existing approaches and 1–2 orders of magnitude higher transmission concurrency.
Frank Wang, MIT CSAIL; Ronny Ko and James Mickens, Harvard University
Riverbed is a new framework for building privacy-respecting web services. Using a simple policy language, users define restrictions on how a remote service can process and store sensitive data. A transparent Riverbed proxy sits between a user's front-end client (e.g., a web browser) and the back-end server code. The back-end code remotely attests to the proxy, demonstrating that the code respects user policies; in particular, the server code attests that it executes within a Riverbed-compatible managed runtime that uses IFC to enforce user policies. If attestation succeeds, the proxy releases the user's data, tagging it with the user-defined policies. On the server-side, the Riverbed runtime places all data with compatible policies into the same universe (i.e., the same isolated instance of the full web service). The universe mechanism allows Riverbed to work with unmodified, legacy software; unlike prior IFC systems, Riverbed does not require developers to reason about security lattices, or manually annotate code with labels. Riverbed imposes only modest performance overheads, with worst-case slowdowns of 10% for several real applications.
Suhas Jayaram Subramanya and Harsha Vardhan Simhadri, Microsoft Research India; Srajan Garg, IIT Bombay; Anil Kag and Venkatesh Balasubramanian, Microsoft Research India
Many large scale machine learning training and inference tasks are memory-bound rather than compute-bound. That is, on large data sets, the working set of these algorithms does not fit in memory for jobs that could run overnight on a few multi-core processors. This often forces an expensive redesign of the algorithm to distributed platforms such as parameter servers and Spark.
We propose an inexpensive and efficient alternative based on the observation that many ML tasks admit algorithms that can be programmed with linear algebra subroutines. A library that supports BLAS and sparseBLAS interface on large SSD-resident matrices can enable multi-threaded code to scale to industrial scale data sets on a single workstation.
We demonstrate that not only can such a library provide near in-memory performance for BLAS, but can also be used to write implementations of complex algorithms such as eigensolvers that outperform in-memory (ARPACK) and distributed (Spark) counterparts.
Existing multi-threaded in-memory code can link to our library with minor changes and scale to hundreds of Gigabytes of training or inference data at near in-memory processing speeds. We demonstrate this with two industrial scale use cases arising in ranking and relevance pipelines: training large scale topic models and inference for extreme multi-label learning.
This suggests that our approach could be an efficient alternative to expensive big-data compute systems for scaling up structurally complex machine learning tasks.
Yang Wu, Facebook; Ang Chen, Rice University; Linh Thi Xuan Phan, University of Pennsylvania
When diagnosing a problem in a distributed system, it is sometimes necessary to explain the timing of an event—for instance, why a response has been delayed, or why the network latency is high. Existing tools o er some support for this, typically by tracing the problem to a bottleneck or to an overloaded server. However, locating the bottleneck is merely the first step: the real problem may be some other service that is sending traffic over the bottleneck link, or a misbehaving machine that is overloading the server with requests. These off-path causes do not appear in a conventional trace and will thus be missed by most existing diagnostic tools.
In this paper, we introduce a new concept we call temporal provenance that can help with diagnosing timing-related problems. Temporal provenance is inspired by earlier work on provenance-based network debugging; however, in addition to the functional problems that can already be handled with classical provenance, it can also diagnose problems that are related to timing. We present an algorithm for generating temporal provenance and an experimental debugger called Zeno; our experimental evaluation shows that Zeno can successfully diagnose several realistic performance bugs.
Carlo Curino, Subru Krishnan, and Konstantinos Karanasos, Microsoft; Sriram Rao, Facebook; Giovanni M. Fumarola, Botong Huang, Kishore Chaliparambil, Arun Suresh, Young Chen, Solom Heddaya, Roni Burd, Sarvesh Sakalanaga, Chris Douglas, Bill Ramsey, and Raghu Ramakrishnan, Microsoft
Microsoft's internal data lake processes exabytes of data over millions of cores daily on behalf of thousands of tenants. Scheduling this workload requires 10x to 100x more decisions per second than existing, general-purpose resource management frameworks are known to handle. In 2013, we were faced with a growing demand for workload diversity and richer sharing policies that our legacy system could not meet. In this paper, we present Hydra, the resource management infrastructure we built to meet these requirements.
Hydra leverages a federated architecture, in which a cluster is comprised of multiple, loosely coordinating subclusters. This allows us to scale by delegating placement of tasks on machines to each sub-cluster, while centrally coordinating only to ensure that tenants receive the right share of resources. To adapt to changing workload and cluster conditions promptly, Hydra's design features a control plane that can push scheduling policies across tens of thousands of nodes within seconds. This feature combined with the federated design allows for great agility in developing, evaluating, and rolling out new system behaviors.
We built Hydra by leveraging, extending, and contributing our code to Apache Hadoop YARN. Hydra is currently the primary big-data resource manager at Microsoft. Over the last few years, Hydra has scheduled nearly one trillion tasks that manipulated close to a Zettabyte of production data.