# NSDI '19 Technical Sessions

All sessions will be held in Constitution Ballroom unless otherwise noted.

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

Attendee Files
NSDI '19 Proceedings Web Archive (ZIP)
NSDI '19 Attendee List (PDF)

## NSDI '19 Preview Session

Grand Ballroom

Are you new to NSDI? Are you a networking expert but feel bewildered when talk turns to security? Are you interested in engaging more deeply with paper presentations outside your research area? Join us for the NSDI preview session, where area experts will give short introductions to the Symposium's major technical sessions.

• Host Networking: Brent Stephens, University of Illinois at Chicago
• Distributed Systems: Aurojit Panda, New York University
• Modern Network Hardware: Akshay Narayan, Massachusetts Institute of Technology
• Analytics: Aurojit Panda, New York University
• Data Center Network Architecture: Anirudh Sivaraman, New York University
• Wireless Technologies: Dinesh Bharadia, University of California, San Diego
• Operating Systems: Amy Ousterhout, Massachusetts Institute of Technology
• Monitoring and Diagnosis: Anurag Khandelwal, University of California, Berkeley
• Improving Machine Learning: Junchen Jiang, University of Chicago
• Network Functions: Radhika Mittal, Massachusetts Institute of Technology and University of Illinois at Urbana–Champaign
• Network Characterization: David Choffnes, Northeastern University
• Network Modeling: Costin Raiciu, University Politehnica of Bucharest

## Continental Breakfast

Grand Ballroom Foyer

## Opening Remarks and Best Paper Awards

Program Co-Chairs: Jay Lorch, Microsoft Research, and Minlan Yu, Harvard University

## Host Networking

Session Chair: Anirudh Sivaraman, New York University

## Datacenter RPCs can be General and Fast

Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David Andersen, Carnegie Mellon University
Awarded Best Paper!

It is commonly believed that datacenter networking software must sacrifice generality to attain high performance. The popularity of specialized distributed systems designed specifically for niche technologies such as RDMA, lossless networks, FPGAs, and programmable switches testifies to this belief. In this paper, we show that such specialization is not necessary. eRPC is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics. eRPC performs well in three key metrics: message rate for small messages; bandwidth for large messages; and scalability to a large number of nodes and CPU cores. It handles packet loss, congestion, and background request execution. In microbenchmarks, one CPU core can handle up to 10 million small RPCs per second, or send large messages at 75 Gbps. We port a production-grade implementation of Raft state machine replication to eRPC without modifying the core Raft source code. We achieve 5.5 microseconds of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.

## Eiffel: Efficient and Flexible Software Packet Scheduling

Ahmed Saeed and Yimeng Zhao, Georgia Institute of Technology; Nandita Dukkipati, Google; Ellen Zegura and Mostafa Ammar, Georgia Institute of Technology; Khaled Harras, Carnegie Mellon University; Amin Vahdat, Google

Packet scheduling determines the ordering of packets in a queuing data structure with respect to some ranking function that is mandated by a scheduling policy. It is the core component in many recent innovations in optimizing network performance and utilization. Packet scheduling is used for network resource allocation, meeting network-wide delay objectives, or providing isolation and differentiation of service. Our focus in this paper is on the design and deployment of packet scheduling in software. Software schedulers have several advantages, including a shorter development cycle and flexibility in functionality and deployment location. We substantially improve software packet scheduling performance, while maintaining its flexibility, by exploiting underlying features of packet ranking: the fact that packet ranks are integers with predetermined ranges, and that many packets will typically have equal rank. This allows us to rely on integer priority queues, whereas existing ranking algorithms rely on comparison-based priority queues that assume continuous ranks with an infinite range. We introduce Eiffel, a novel programmable packet scheduling system. At the core of Eiffel is an integer priority queue based on the Find First Set (FFS) instruction, designed to support a wide range of policies and ranking functions efficiently. As an even more efficient alternative, we also propose a new approximate priority queue that can outperform FFS-based queues in some scenarios. To support flexibility, Eiffel introduces novel programming abstractions to express scheduling policies that cannot be captured by current state-of-the-art scheduler programming models. We evaluate Eiffel in a variety of settings and in both kernel and userspace deployments. We show that it outperforms state-of-the-art systems by 3-40x in terms of either the number of cores utilized for network processing or the number of flows handled given fixed processing capacity.
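
As a rough illustration of the bucketed FFS design the abstract describes (a minimal sketch, not Eiffel's kernel implementation, which uses hierarchical bitmaps among other refinements), an integer priority queue can keep one bucket per rank and use a find-first-set operation on a bitmap to locate the minimum non-empty bucket in constant time:

```python
# Sketch of an FFS-based integer priority queue (illustrative only).
class FFSQueue:
    """Buckets packets by integer rank; FFS on a bitmap finds the min rank."""

    def __init__(self, max_rank):
        self.buckets = [[] for _ in range(max_rank)]
        self.bitmap = 0  # bit i set <=> bucket i is non-empty

    def enqueue(self, rank, pkt):
        self.buckets[rank].append(pkt)
        self.bitmap |= 1 << rank

    def dequeue(self):
        if self.bitmap == 0:
            return None
        rank = (self.bitmap & -self.bitmap).bit_length() - 1  # find first set
        pkt = self.buckets[rank].pop(0)
        if not self.buckets[rank]:
            self.bitmap &= ~(1 << rank)
        return pkt

q = FFSQueue(max_rank=1024)
q.enqueue(7, "pkt-a"); q.enqueue(3, "pkt-b"); q.enqueue(7, "pkt-c")
assert q.dequeue() == "pkt-b"  # lowest rank first; equal ranks stay FIFO
```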

## Loom: Flexible and Efficient NIC Packet Scheduling

Brent Stephens, University of Illinois at Chicago; Aditya Akella and Michael Swift, University of Wisconsin-Madison

In multi-tenant cloud data centers, operators need to ensure that competing tenants and applications are isolated from each other and fairly share limited network resources. With current NICs, operators must either 1) use a single NIC queue and enforce network policy in software, which incurs high CPU overheads and struggles to drive increasing line rates (100 Gbps), or 2) use multiple NIC queues and accept imperfect isolation and policy enforcement. These problems arise due to inflexible and static NIC packet schedulers and an inefficient OS/NIC interface.

To overcome these limitations, we present Loom, a new NIC design that moves all per-flow scheduling decisions out of the OS and into the NIC. The key aspects of Loom's design are 1) a new network policy abstraction: restricted directed acyclic graphs (DAGs), 2) a programmable hierarchical packet scheduler, and 3) a new expressive and efficient OS/NIC interface that enables the OS to precisely control how the NIC performs packet scheduling while still ensuring low CPU utilization. Loom is the only multiqueue NIC design that is able to efficiently enforce network policy. We find empirically that Loom lowers latency, increases throughput, and improves fairness for collocated applications and tenants.

## Break with Refreshments

Grand Ballroom Foyer

## Distributed Systems

Session Chair: Manos Kapritsos, University of Michigan

## Exploiting Commutativity For Practical Fast Replication

Seo Jin Park and John Ousterhout, Stanford University

Traditional approaches to replication require client requests to be ordered before making them durable by copying them to replicas. As a result, clients must wait for two round-trip times (RTTs) before updates complete. In this paper, we show that this entanglement of ordering and durability is unnecessary for strong consistency. Consistent Unordered Replication Protocol (CURP) allows clients to replicate requests that have not yet been ordered, as long as they are commutative. This strategy allows most operations to complete in 1 RTT (the same as an unreplicated system). We implemented CURP in the Redis and RAMCloud storage systems. In RAMCloud, CURP improved write latency by ~2× (14 µs → 7.1 µs) and write throughput by 4×. Compared to unreplicated RAMCloud, CURP's latency overhead for 3-way replication is just 1 µs (6.1 µs vs. 7.1 µs). CURP transformed a non-durable Redis cache into a consistent and durable storage system with only a small performance overhead.
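
A minimal sketch of the witness-side commutativity check (hypothetical names, not the RAMCloud/Redis implementation): a witness durably records a request only if it commutes with every other unsynced request, approximated here by touching a distinct key, so commutative writes complete in 1 RTT while conflicting ones fall back to the ordered 2-RTT path.

```python
# Toy CURP witness: record a request only if its key does not conflict
# with any other unsynced record (illustrative simplification).
class Witness:
    def __init__(self):
        self.unsynced = {}  # key -> recorded request

    def record(self, key, request):
        if key in self.unsynced:
            return False  # conflict: client must use the slow ordered path
        self.unsynced[key] = request
        return True       # accepted: the operation is durable after 1 RTT

    def gc(self, synced_keys):
        # Once the master syncs to its backups, witnesses can drop records
        # for requests that are now durably ordered.
        for key in synced_keys:
            self.unsynced.pop(key, None)

w = Witness()
assert w.record("x", "set x=1")       # commutative: accepted
assert not w.record("x", "set x=2")   # same key: rejected, 2-RTT fallback
assert w.record("y", "set y=9")       # different key: accepted
```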

## Flashield: a Hybrid Key-value Cache that Controls Flash Write Amplification

Assaf Eisenman, Stanford University; Asaf Cidon, Stanford University and Barracuda Networks; Evgenya Pergament and Or Haimovich, Stanford University; Ryan Stutsman, University of Utah; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University

As its price per bit drops, SSD is increasingly becoming the default storage medium for hot data in cloud application databases. Even though SSD’s price per bit is more than 10× lower, and it provides sufficient performance (when accessed over a network) compared to DRAM, the durability of flash has limited its adoption in write-heavy use cases, such as key-value caching. This is because key-value caches need to frequently insert, update and evict small objects. This causes excessive writes and erasures on flash storage, which significantly shortens the lifetime of flash. We present Flashield, a hybrid key-value cache that uses DRAM as a “filter” to control and limit writes to SSD. Flashield performs lightweight machine learning admission control to predict which objects are likely to be read frequently without getting updated; these objects, which are prime candidates to be stored on SSD, are written to SSD in large chunks sequentially. In order to efficiently utilize the cache’s available memory, we design a novel in-memory index for the variable-sized objects stored on flash that requires only 4 bytes per object in DRAM. We describe Flashield’s design and implementation, and evaluate it on real-world traces from a widely used caching service, Memcachier. Compared to state-of-the-art systems that suffer a write amplification of 2.5× or more, Flashield maintains a median write amplification of 0.5× (since many filtered objects are never written to flash at all), without any loss of hit rate or throughput.
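
The admission idea can be sketched as follows (a deliberate simplification: Flashield trains a machine-learning classifier over per-object features, whereas this toy version merely thresholds the reads and updates observed while an object sits in DRAM):

```python
# Toy DRAM-filter admission: admit to SSD only objects that were read
# repeatedly and never updated while resident in DRAM (illustrative).
class DramFilter:
    def __init__(self, min_reads=2):
        self.stats = {}           # key -> [reads, updates] while in DRAM
        self.min_reads = min_reads

    def on_read(self, key):
        self.stats.setdefault(key, [0, 0])[0] += 1

    def on_update(self, key):
        self.stats.setdefault(key, [0, 0])[1] += 1

    def flash_worthy(self, key):
        reads, updates = self.stats.get(key, [0, 0])
        return reads >= self.min_reads and updates == 0

f = DramFilter()
f.on_read("a"); f.on_read("a")     # hot and read-only: admit to SSD
f.on_read("b"); f.on_update("b")   # updated: keep it (or evict) in DRAM
assert f.flash_worthy("a") and not f.flash_worthy("b")
```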

## Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores

Diego Didona, EPFL; Willy Zwaenepoel, EPFL and University of Sydney

This paper introduces the concept of size-aware sharding to improve tail latencies for in-memory key-value stores, and describes its implementation in the Minos key-value store. Tail latencies are crucial in distributed applications with high fan-out ratios, because overall response time is determined by the slowest response. Size-aware sharding distributes requests for keys to cores according to the size of the item associated with the key. In particular, requests for small and large items are sent to disjoint subsets of cores. Size-aware sharding improves tail latencies by avoiding head-of-line blocking, in which a request for a small item gets queued behind a request for a large item. Alternative size-unaware approaches to sharding, such as keyhash-based sharding, request dispatching, and stealing, do not avoid head-of-line blocking, and therefore exhibit worse tail latencies. The challenge in implementing size-aware sharding is to maintain high throughput by avoiding the cost of software dispatching and by achieving load balancing between different cores. Minos uses hardware dispatch for all requests for small items, which form the very large majority of all requests. It achieves load balancing by adapting the number of cores handling requests for small and large items to their relative presence in the workload. We compare Minos to three state-of-the-art designs of in-memory KV stores. Compared to its closest competitor, Minos achieves a 99th percentile latency that is up to two orders of magnitude lower. Put differently, for a target 99th percentile latency equal to 10 times the mean service time, Minos achieves a throughput that is up to 7.4 times higher.
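
The dispatch rule itself is simple, as the sketch below shows (illustrative only; Minos dispatches small requests in NIC hardware and resizes the two core pools dynamically to balance load):

```python
# Size-aware sharding sketch: small and large items go to disjoint cores,
# so a small request is never queued behind a large one.
SIZE_CUTOFF = 1024                  # bytes; "small" vs. "large" boundary
SMALL_CORES = [0, 1, 2, 3, 4, 5]    # pools are resized as the workload shifts
LARGE_CORES = [6, 7]

def pick_core(key, item_size):
    pool = SMALL_CORES if item_size <= SIZE_CUTOFF else LARGE_CORES
    return pool[hash(key) % len(pool)]

assert pick_core("k1", 100) in SMALL_CORES
assert pick_core("k2", 1 << 20) in LARGE_CORES
```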

## Monoxide: Scale out Blockchains with Asynchronous Consensus Zones

Jiaping Wang, ICT/CAS, Sinovation Ventures; Hao Wang, Ohio State University

Cryptocurrencies have provided a promising infrastructure for pseudonymous online payments. However, low throughput has significantly hindered the scalability and usability of cryptocurrency systems for increasing numbers of users and transactions. Another obstacle to achieving scalability is that every node is required to duplicate the communication, storage, and state representation of the entire network.

In this paper, we introduce Asynchronous Consensus Zones, which scale blockchain systems linearly without compromising decentralization or security. We achieve this by running multiple independent and parallel instances of single-chain consensus (zones). Consensus happens independently within each zone with minimized communication, which partitions the workload of the entire network and ensures a moderate burden for each individual node as the network grows. We propose eventual atomicity to ensure transaction atomicity across zones, which guarantees the efficient completion of transactions without the overhead of a two-phase commit protocol. We also propose Chu-ko-nu mining to ensure that the effective mining power in each zone is at the same level as that of the entire network, making an attack on any individual zone as hard as one on the entire network. Our experimental results show the effectiveness of our work: on a test-bed of 1,200 virtual machines worldwide supporting 48,000 nodes, our system delivers 1,000× the throughput and capacity of the Bitcoin and Ethereum networks.
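
Eventual atomicity can be sketched as a local withdraw plus a relay transaction that the destination zone applies when it produces its next block (toy code with hypothetical names; the real protocol carries relay transactions inside mined blocks with proofs):

```python
# Toy eventual atomicity across two consensus zones (illustrative).
def zone_of(account):
    return 0 if account < "m" else 1   # toy partition of accounts by name

class Zone:
    def __init__(self):
        self.balances = {}
        self.inbox = []                # relay txs awaiting block inclusion

    def transfer(self, sender, recipient, amount, zones):
        assert self.balances[sender] >= amount
        self.balances[sender] -= amount                   # step 1: withdraw
        zones[zone_of(recipient)].inbox.append((recipient, amount))

    def mine_block(self):
        # Relay txs take effect when the destination zone includes them in
        # a block; atomicity is eventual rather than immediate, and no
        # two-phase commit spans the zones.
        for recipient, amount in self.inbox:
            self.balances[recipient] = self.balances.get(recipient, 0) + amount
        self.inbox.clear()

zones = {0: Zone(), 1: Zone()}
zones[0].balances["alice"] = 10
zones[0].transfer("alice", "zack", 4, zones)   # "zack" lives in zone 1
zones[1].mine_block()
assert zones[0].balances["alice"] == 6 and zones[1].balances["zack"] == 4
```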

## Symposium Luncheon and Test of Time Award Presentation

Back Bay Ballroom ABC

## Modern Network Hardware

Session Chair: Manya Ghobadi, Massachusetts Institute of Technology

## FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds

Daehyeok Kim and Tianlong Yu, Carnegie Mellon University; Hongqiang Harry Liu, Alibaba; Yibo Zhu, Microsoft and Bytedance; Jitu Padhye and Shachar Raindel, Microsoft; Chuanxiong Guo, Bytedance; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University

Many popular large-scale cloud applications are increasingly using containerization for high resource efficiency and lightweight isolation. In parallel, many data-intensive applications (e.g., data analytics and deep learning frameworks) are adopting or looking to adopt RDMA for high networking performance. Industry trends suggest that these two approaches are on an inevitable collision course. In this paper, we present FreeFlow, a software-based RDMA virtualization framework designed for containerized clouds. FreeFlow realizes virtual RDMA networking purely with a software-based approach using commodity RDMA NICs. Unlike existing RDMA virtualization solutions, FreeFlow fully satisfies the requirements from cloud environments, such as isolation for multi-tenancy, portability for container migrations, and controllability for control and data plane policies. FreeFlow is also transparent to applications and provides networking performance close to bare-metal RDMA with low CPU overhead. In our evaluations with TensorFlow and Spark, FreeFlow provides almost the same application performance as bare-metal RDMA.

## Direct Universal Access: Making Data Center Resources Available to FPGA

Ran Shu and Peng Cheng, Microsoft Research; Guo Chen, Microsoft Research & Hunan University; Zhiyuan Guo, Microsoft Research & Beihang University; Lei Qu and Yongqiang Xiong, Microsoft Research; Derek Chiou and Thomas Moscibroda, Microsoft Azure

FPGAs have been deployed at massive scale in data centers. The currently available communication architectures, however, make it very difficult for FPGAs to utilize data center resources. In this paper, we present Direct Universal Access (DUA), a communication architecture that provides FPGAs with uniform access to heterogeneous data center resources. Without considering machine boundaries, DUA provides global names and a common interface for communicating with various resources, where the underlying network automatically routes traffic and manages resource multiplexing. Our benchmarks show that DUA provides simple and fair-share resource access with small logic area overhead (<10%) and negligible latency (<0.2 µs). We also build two practical multi-FPGA applications, deep crossing and regular expression matching, on top of DUA to demonstrate its usability and efficiency.

## Stardust: Divide and Conquer in the Data Center Network

Noa Zilberman, University of Cambridge; Gabi Bracha and Golan Schzukin, Broadcom

Building scalable data centers, and network devices that fit within these data centers, has become increasingly hard. With modern switches pushing at the boundary of manufacturing feasibility, being able to build suitable, and scalable network fabrics becomes of critical importance. We introduce Stardust, a fabric architecture for data center scale networks, inspired by network-switch systems. Stardust combines packet switches at the edge and disaggregated cell switches at the network fabric, using scheduled traffic. Stardust is a distributed solution that attends to the scale limitations of network-switch design, while also offering improved performance and power savings compared with traditional solutions. With ever-increasing networking requirements, Stardust predicts the elimination of packet switches, replaced by cell switches in the network, and smart network hardware at the hosts.

## Blink: Fast Connectivity Recovery Entirely in the Data Plane

Thomas Holterbach, Edgar Costa Molero, and Maria Apostolaki, ETH Zurich; Alberto Dainotti, CAIDA / UC San Diego; Stefano Vissicchio, University College London; Laurent Vanbever, ETH Zurich

In this paper, we explore new possibilities, created by programmable switches, for fast rerouting upon signals triggered by Internet traffic disruptions. We present Blink, a data-driven system exploiting TCP-induced signals to detect failures. The key intuition behind Blink is that a TCP flow exhibits a predictable behavior upon disruption: retransmitting the same packet over and over, at epochs exponentially spaced in time. When compounded over multiple flows, this behavior creates a strong and characteristic failure signal. Blink efficiently analyzes TCP flows, at line rate, to: (i) select flows to track; (ii) reliably and quickly detect major traffic disruptions; and (iii) recover data-plane connectivity, via next-hops compatible with the operator’s policies.

We present an end-to-end implementation of Blink in P4 together with an extensive evaluation on real and synthetic traffic traces. Our results indicate that Blink: (i) can achieve sub-second rerouting for realistic Internet traffic; (ii) prevents unnecessary traffic shifts, in the presence of noise; and (iii) scales to protect large fractions of realistic Internet traffic, on existing hardware. We further show the feasibility of Blink by running our system on a real Tofino switch.
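
The failure signal can be approximated in a few lines (a host-side sketch with made-up thresholds; Blink implements this per prefix in P4 with constant per-packet work): a flow that re-sends the same sequence number is retransmitting, and many tracked flows retransmitting within a short window indicate an outage worth rerouting around.

```python
# Sketch of Blink's TCP-retransmission failure signal (illustrative).
import time

class BlinkMonitor:
    def __init__(self, flow_threshold=32, window_s=1.0):
        self.last_seq = {}    # flow -> last sequence number seen
        self.retx_time = {}   # flow -> time of last detected retransmission
        self.flow_threshold = flow_threshold
        self.window_s = window_s

    def on_packet(self, flow, seq, now=None):
        now = time.monotonic() if now is None else now
        if self.last_seq.get(flow) == seq:     # same seq again: retransmit
            self.retx_time[flow] = now
        self.last_seq[flow] = seq
        recent = sum(1 for t in self.retx_time.values()
                     if now - t <= self.window_s)
        return recent >= self.flow_threshold   # True => trigger reroute

mon = BlinkMonitor(flow_threshold=2)
mon.on_packet("f1", 100, now=0.0); mon.on_packet("f1", 100, now=0.2)
assert mon.on_packet("f2", 55, now=0.3) is False
assert mon.on_packet("f2", 55, now=0.5) is True  # 2 flows retransmitting
```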

## Break with Refreshments

Grand Ballroom Foyer

## Analytics

## Hydra: a federated resource manager for data-center scale analytics

Carlo Curino, Subru Krishnan, and Konstantinos Karanasos, Microsoft; Sriram Rao, Facebook; Giovanni M. Fumarola, Botong Huang, Kishore Chaliparambil, Arun Suresh, Young Chen, Solom Heddaya, Roni Burd, Sarvesh Sakalanaga, Chris Douglas, Bill Ramsey, and Raghu Ramakrishnan, Microsoft

Microsoft's internal data lake processes exabytes of data over millions of cores daily on behalf of thousands of tenants. Scheduling this workload requires 10x to 100x more decisions per second than existing, general-purpose resource management frameworks are known to handle. In 2013, we were faced with a growing demand for workload diversity and richer sharing policies that our legacy system could not meet. In this paper, we present Hydra, the resource management infrastructure we built to meet these requirements.

Hydra leverages a federated architecture, in which a cluster comprises multiple, loosely coordinating sub-clusters. This allows us to scale by delegating the placement of tasks on machines to each sub-cluster, while coordinating centrally only to ensure that tenants receive the right share of resources. To adapt to changing workload and cluster conditions promptly, Hydra's design features a control plane that can push scheduling policies across tens of thousands of nodes within seconds. This feature, combined with the federated design, allows for great agility in developing, evaluating, and rolling out new system behaviors.

We built Hydra by leveraging, extending, and contributing our code to Apache Hadoop YARN. Hydra is currently the primary big-data resource manager at Microsoft. Over the last few years, Hydra has scheduled nearly one trillion tasks that manipulated close to a zettabyte of production data.

## Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure

Qifan Pu, UC Berkeley; Shivaram Venkataraman, University of Wisconsin-Madison; Ion Stoica, UC Berkeley

Serverless computing is poised to fulfill the long-held promise of transparent elasticity and millisecond-level pricing. To achieve this goal, service providers impose a fine-grained computational model where every function has a maximum duration, a fixed amount of memory, and no persistent local storage. We observe that the fine-grained elasticity of serverless is key to achieving high utilization for general computations such as analytics workloads, but that resource limits make it challenging to implement such applications as they need to move large amounts of data between functions that don’t overlap in time. In this paper, we present Locus, a serverless analytics system that judiciously combines (1) cheap but slow storage with (2) fast but expensive storage, to achieve good performance while remaining cost-efficient. Locus applies a performance model to guide users in selecting the type and the amount of storage to achieve the desired cost-performance trade-off. We evaluate Locus on a number of analytics applications, including TPC-DS, CloudSort, and the Big Data Benchmark, and show that Locus can navigate the cost-performance trade-off, leading to 4×-500× performance improvements over a slow-storage-only baseline, reducing resource usage by up to 59% while achieving comparable performance to a cluster of virtual machines, and running at most 1.99× slower than Redshift.
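
The flavor of the performance model can be conveyed with a toy estimator (every rate and price below is a made-up placeholder, not a number from the paper): sweep the fraction of shuffle data placed in fast storage and pick the cheapest mix that meets a deadline.

```python
# Toy cost/performance model for a hybrid shuffle (illustrative numbers).
def shuffle_estimate(data_gb, fast_fraction,
                     slow_gbps=1.0, fast_gbps=10.0,         # throughput
                     slow_cost_gb=0.01, fast_cost_gb=0.5):  # $/GB (toy)
    fast_gb = data_gb * fast_fraction
    slow_gb = data_gb - fast_gb
    time_s = slow_gb / slow_gbps + fast_gb / fast_gbps
    cost = slow_gb * slow_cost_gb + fast_gb * fast_cost_gb
    return time_s, cost

def cheapest_mix(data_gb, deadline_s):
    options = []
    for f in [x / 10 for x in range(11)]:   # try 0%, 10%, ..., 100% fast
        t, c = shuffle_estimate(data_gb, f)
        if t <= deadline_s:
            options.append((c, t, f))
    return min(options) if options else None  # cheapest feasible mix

print(cheapest_mix(data_gb=100, deadline_s=50))  # -> (30.4, 46.0, 0.6)
```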

## dShark: A General, Easy to Program and Scalable Framework for Analyzing In-network Packet Traces

Da Yu, Brown University; Yibo Zhu, Microsoft and Bytedance; Behnaz Arzani, Microsoft; Rodrigo Fonseca, Brown University; Tianrong Zhang, Karl Deng, and Lihua Yuan, Microsoft

Distributed, in-network packet capture is still the last resort for diagnosing network problems. Despite recent advances in collecting packet traces scalably, effectively utilizing pervasive packet captures still poses important challenges. Arbitrary combinations of middleboxes which transform packet headers make it challenging to even identify the same packet across multiple hops; packet drops in the collection system create ambiguities that must be handled; the large volume of captures, and their distributed nature, make it hard to do even simple processing; and the one-off and urgent nature of problems tends to generate ad-hoc solutions that are not reusable and do not scale. In this paper we propose dShark to address these challenges. dShark allows intuitive groupings of packets across multiple traces that are robust to header transformations and capture noise, offering simple streaming data abstractions for network operators. Using dShark on production packet captures from a major cloud provider, we show that dShark makes it easy to write concise and reusable queries against distributed packet traces that solve many common problems in diagnosing complex networks. Our evaluation shows that dShark can analyze production packet traces with more than 10 Mpps throughput on a commodity server, and has near-linear speedup when scaling out on multiple servers.

## Data Center Network Architecture

Session Chair: George Porter, University of California, San Diego

## Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks

Shizhen Zhao, Rui Wang, Junlan Zhou, Joon Ong, Jeffrey C. Mogul, and Amin Vahdat, Google, Inc.

Clos topologies have been widely adopted for large-scale data center networks (DCNs), but it has been difficult to support incremental expansions for Clos DCNs. Some prior work has claimed that the structure of Clos topologies hinders incremental expansion.

We demonstrate that it is indeed possible to design expandable Clos DCNs, and to expand them while they are carrying live traffic, without incurring packet loss. We use a layer of patch panels between blocks of switches in a Clos DCN, which makes physical rewiring feasible, and we describe how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper. We also describe a block-aggregation technique that makes our ILP approach scalable. We tested our "minimal-rewiring" solver on two kinds of fine-grained expansions using 2250 synthetic DCN topologies, and found that the solver can handle 99% of these cases while changing under 25% of the connections. Compared to prior approaches, this solver reduces the average number of "stages" per expansion from 4 to 1.29, and reduces the number of wires changed by an order of magnitude or more—a significant improvement to our operational costs, and to our exposure (during expansions) to capacity-reducing faults.

## Understanding Lifecycle Management Complexity of Datacenter Topologies

Mingyang Zhang, University of Southern California; Radhika Niranjan Mysore, VMware Research; Sucha Supittayapornpong and Ramesh Govindan, University of Southern California
Awarded Best Paper!

Most recent datacenter topology designs have focused on performance properties such as latency and throughput. In this paper, we explore a new dimension, lifecycle management, which attempts to capture the operational costs of topologies. Specifically, we consider costs associated with the deployment and expansion of topologies, and explore how the structural properties of two different topology families (Clos and expander graphs, as exemplified by Xpander) affect these costs. Using the insights from our study, we also develop a new topology that has the wiring simplicity of Clos and the expandability of expander graphs.

## Shoal: A Network Architecture for Disaggregated Racks

Vishal Shrivastav, Cornell University; Asaf Valadarsky, Hebrew University of Jerusalem; Hitesh Ballani and Paolo Costa, Microsoft Research; Ki Suh Lee, Waltz Networks; Han Wang, Barefoot Networks; Rachit Agarwal and Hakim Weatherspoon, Cornell University

Disaggregated racks comprise a dense cluster of separate pools of compute, memory, and storage blades, all inter-connected through an internal network within a single rack. However, their density poses a unique challenge for the rack’s network: it needs to connect an order of magnitude more nodes than today’s racks without exceeding the rack’s fixed power budget and without compromising on performance. We present Shoal, a power-efficient yet performant intra-rack network fabric built using fast circuit switches. Such switches consume less power as they have no buffers and no packet inspection mechanism, yet can be reconfigured in nanoseconds. Rack nodes transmit according to a static schedule such that there is no in-network contention, without requiring a centralized controller. Shoal’s congestion control leverages the physical fabric to achieve fairness as well as bounded worst-case network throughput and queuing. We use an FPGA-based prototype, testbed experiments, and simulations to show that Shoal’s mechanisms are practical and can simultaneously achieve high density and high performance: 71% lower power and comparable or higher performance than today’s network designs.
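
The schedule itself can be illustrated with a rotor-style round-robin (a sketch under simplified assumptions; Shoal's actual schedule and slot sizing differ): each time slot is a fixed permutation, so circuits carry traffic with no in-network contention and no central scheduler, and every node pair meets once per N-1 slots.

```python
# Static, contention-free circuit schedule sketch (illustrative).
def peer(i, t, n):
    """Node i's transmit peer in time slot t (directed circuit)."""
    return (i + 1 + t % (n - 1)) % n

N = 8
for t in range(N - 1):
    matching = {i: peer(i, t, N) for i in range(N)}
    # every slot is a permutation with no self-loops: contention-free
    assert sorted(matching.values()) == list(range(N))
    assert all(i != p for i, p in matching.items())

# across one epoch of N-1 slots, each node reaches every other node once
reached = {(i, peer(i, t, N)) for i in range(N) for t in range(N - 1)}
assert len(reached) == N * (N - 1)
```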

## Continental Breakfast

Grand Ballroom Foyer

## Wireless Technologies

Session Chair: Swarun Kumar, Carnegie Mellon University

## NetScatter: Enabling Large-Scale Backscatter Networks

Mehrdad Hessar, Ali Najafi, and Shyamnath Gollakota, University of Washington

We present the first wireless protocol that scales to hundreds of concurrent transmissions from backscatter devices. Our key innovation is a distributed coding mechanism that works below the noise floor, operates on backscatter devices and can decode all the concurrent transmissions at the receiver using a single FFT operation. Our design addresses practical issues such as timing and frequency synchronization as well as the near-far problem. We deploy our design using a testbed of backscatter hardware and show that our protocol scales to concurrent transmissions from 256 devices using a bandwidth of only 500 kHz. Our results show throughput and latency improvements of 14–62x and 15–67x over existing approaches and 1–2 orders of magnitude higher transmission concurrency.

## Towards Programming the Radio Environment with Large Arrays of Inexpensive Antennas

Zhuqi Li, Yaxiong Xie, and Longfei Shangguan, Princeton University; Rotman Ivan Zelaya, Yale University; Jeremy Gummeson, UMass Amherst; Wenjun Hu, Yale University; Kyle Jamieson, Princeton University

Conventional thinking treats the wireless channel as a given constraint. Therefore, wireless network designs to date center on the problem of endpoint optimization that best utilizes the channel, for example, via rate and power control at the transmitter or sophisticated decoding mechanisms at the receiver. We instead explore whether it is possible to reconfigure the environment itself to facilitate wireless communication. In this work, we instrument the environment with a large array of inexpensive antennas (LAIA) and design algorithms to configure them in real time. Our system achieves this level of programmability through rapid adjustments of an on-board phase shifter in each LAIA device. We design a channel decomposition algorithm to quickly estimate the wireless channel due to the environment alone, which leads us to a process to align the phases of the array elements. Variations of our core algorithm can then optimize wireless channels on the fly for single- and multi-antenna links, as well as nearby networks operating on adjacent frequency bands. We design and deploy a 36-element passive array in a real indoor home environment. Experiments with this prototype show that, by reconfiguring the wireless environment, we can achieve a 24% TCP throughput improvement on average and a median improvement of 51.4% in Shannon capacity over baseline single-antenna links. Over baseline multi-antenna links, LAIA achieves an improvement of 12.23% to 18.95% in Shannon capacity.
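
The phase-alignment step has a simple mathematical core, sketched below with a toy channel model (this is the underlying physics, not LAIA's channel-decomposition algorithm): if element i contributes complex channel h_i, setting its phase shifter to -arg(h_i) makes all contributions add coherently at the receiver.

```python
# Coherent combining across array elements (illustrative toy model).
import cmath, random

random.seed(1)
elements = [cmath.rect(random.uniform(0.5, 1.0),              # element gain
                       random.uniform(-cmath.pi, cmath.pi))   # random phase
            for _ in range(36)]                               # 36 elements

unaligned = abs(sum(elements))
aligned = abs(sum(h * cmath.exp(-1j * cmath.phase(h)) for h in elements))

assert aligned >= unaligned          # coherent combining never loses
print(f"gain from phase alignment: {aligned / unaligned:.1f}x")
```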

## Pushing the Range Limits of Commercial Passive RFIDs

Jingxian Wang, Carnegie Mellon University; Junbo Zhang, Tsinghua University; Rajarshi Saha, IIT Kharagpur; Haojian Jin and Swarun Kumar, Carnegie Mellon University

This paper asks: “Can we push the prevailing range limits of commercial passive RFIDs?”. Today’s commercial passive RFIDs report ranges of 5-15 meters at best. This constrains RFIDs to be detected only at specific checkpoints in warehouses, stores and factories today, leaving them outside of communication range beyond these spaces. State-of-the-art approaches to improve the range of RFIDs develop new tag hardware that necessarily sacrifices some of the most attractive features of passive RFIDs such as their low cost, small form-factor or the absence of a battery.

We present PushID, a system that exploits collaboration between readers to enhance the range of commercial passive RFID tags, without altering the tags whatsoever. PushID uses distributed MIMO to coherently combine signals across geographically separated RFID readers at the tags. In doing so, it resolves the chicken-or-egg problem of inferring the optimal beamforming parameters to beam energy to a tag without any feedback from the tag itself, which needs this energy to respond in the first place. A prototype evaluation of PushID with 8 distributed RFID readers reveals a range of 64 meters to the closest reader: a 7.4×, 1.2×, and 1.6× range improvement over state-of-the-art commercial readers and two other schemes [10, 31], respectively.

## SweepSense: Sensing 5 GHz in 5 Milliseconds with Low-cost Radios

Yeswanth Guddeti, UC San Diego; Raghav Subbaraman, IIT Madras; Moein Khazraee, Aaron Schulman, and Dinesh Bharadia, UC San Diego

Wireless transmissions occur intermittently across the entire spectrum. For example, WiFi and Bluetooth devices transmit frames across the 100 MHz-wide 2.4 GHz band, and LTE devices transmit frames between 700 MHz and 3.7 GHz. Today, only high-cost radios can sense across the spectrum with sufficient temporal resolution to observe these individual transmissions.

We present “SweepSense”, a low-cost radio architecture that senses the entire spectrum with high-temporal resolution by rapidly sweeping across it. Sweeping introduces new challenges for spectrum sensing: SweepSense radios only capture a small number of distorted samples of transmissions. To overcome this challenge, we correct the distortion with self-generated calibration data, and classify the protocol that originated each transmission with only a fraction of the transmission’s samples. We demonstrate that SweepSense can accurately identify four protocols transmitting simultaneously in the 2.4 GHz unlicensed band. We also demonstrate that it can simultaneously monitor the load of several LTE base stations operating in disjoint bands.

## Break with Refreshments

Grand Ballroom Foyer

## Operating Systems

Session Chair: Haryadi Gunawi, University of Chicago

## Slim: OS Kernel Support for a Low-Overhead Container Overlay Network

Danyang Zhuo and Kaiyuan Zhang, University of Washington; Yibo Zhu, Microsoft and Bytedance; Hongqiang Harry Liu, Alibaba; Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson, University of Washington

Containers have become the de facto method for hosting large-scale distributed applications. Container overlay networks are essential to providing portability for containers, yet they impose significant overhead in terms of throughput, latency, and CPU utilization. The key problem is a reliance on packet transformation to implement network virtualization. As a result, each packet has to traverse the network stack twice in both the sender and the receiver’s host OS kernel. We have designed and implemented Slim, a low-overhead container overlay network that implements network virtualization by manipulating connection-level metadata. Our solution maintains compatibility with today’s containerized applications. Evaluation results show that Slim improves the throughput of an in-memory key-value store by 66% while reducing the latency by 42%. Slim reduces the CPU utilization of the in-memory key-value store by 54%. Slim also reduces the CPU utilization of a web server by 28%-40%, a database server by 25%, and a stream processing framework by 11%.

## Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency

Kostis Kaffes, Timothy Chong, and Jack Tigar Humphries, Stanford University; Adam Belay, Massachusetts Institute of Technology; David Mazières and Christos Kozyrakis, Stanford University

The recently proposed dataplanes for microsecond scale applications, such as IX and ZygOS, use non-preemptive policies to schedule requests to cores. For the many real-world scenarios where request service times follow distributions with high dispersion or a heavy tail, they allow short requests to be blocked behind long requests, which leads to poor tail latency.

Shinjuku is a single-address space operating system that uses hardware support for virtualization to make preemption practical at the microsecond scale. This allows Shinjuku to implement centralized scheduling policies that preempt requests as often as every 5 µs and work well for both light- and heavy-tailed request service time distributions. We demonstrate that Shinjuku provides significant tail latency and throughput improvements over IX and ZygOS for a wide range of workload scenarios. For the case of a RocksDB server processing both point and range queries, Shinjuku achieves up to 6.6× higher throughput and 88% lower tail latency.

## Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads

Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan, MIT CSAIL

Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today's kernels, the best available solution to achieve microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load.

Shenango achieves comparable latencies but at far greater CPU efficiency. It reallocates cores across applications at very fine granularity—every 5 µs—enabling cycles unused by latency-sensitive applications to be used productively by batch processing applications. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications, such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art, kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for batch processing application throughput, vastly increasing CPU efficiency.
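
The congestion-detection rule can be sketched in a few lines (simplified from the paper's description; the real IOKernel inspects shared-memory packet and thread queues): if some item is still waiting in an app's queue at the next 5 µs check, the app is deemed congested and granted another core.

```python
# Sketch of queue-based congestion detection for core allocation.
class AppQueues:
    def __init__(self):
        self.queue = []        # pending packets/threads (oldest first)
        self.prev_head = None  # head item seen at the previous check

    def congested(self):
        head = self.queue[0] if self.queue else None
        result = head is not None and head == self.prev_head
        self.prev_head = head
        return result          # same item still queued => grant a core

app = AppQueues()
app.queue.append("pkt-1")
assert not app.congested()     # first sighting: it may be drained in time
assert app.congested()         # still queued at the next check: add a core
```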

## Monitoring and Diagnosis

Session Chair: Ankit Singla, ETH Zurich

## End-to-end I/O Monitoring on a Leading Supercomputer

Bin Yang, Shandong University, National Supercomputing Center in Wuxi; Xu Ji, Tsinghua University, National Supercomputing Center in Wuxi; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Xiyang Wang, National Supercomputing Center in Wuxi; Tianyu Zhang and Xiupeng Zhu, Shandong University, National Supercomputing Center in Wuxi; Nosayba El-Sayed, Emory University; Haidong Lan and Yibo Yang, Shandong University; Jidong Zhai, Tsinghua University; Weiguo Liu, Shandong University, National Supercomputing Center in Wuxi; Wei Xue, Tsinghua University, National Supercomputing Center in Wuxi

This paper presents an effort to overcome the complexities of production-use I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40,960-node Sunway TaihuLight supercomputer, currently ranked No. 3 in the world. It simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online+offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification.

With its deployment on TaihuLight for around 18 months, we demonstrate Beacon's effectiveness with a collection of real-world use cases for I/O performance issue identification and diagnosis. It has successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others currently being addressed. In addition, we demonstrate Beacon's generality through its recent extension to monitor interconnection networks, another contention point on supercomputers. Finally, both the code and the data collected are to be released.

## Zeno: Diagnosing Performance Problems with Temporal Provenance

Yang Wu, Facebook; Ang Chen, Rice University; Linh Thi Xuan Phan, University of Pennsylvania

When diagnosing a problem in a distributed system, it is sometimes necessary to explain the timing of an event—for instance, why a response has been delayed, or why the network latency is high. Existing tools offer some support for this, typically by tracing the problem to a bottleneck or to an overloaded server. However, locating the bottleneck is merely the first step: the real problem may be some other service that is sending traffic over the bottleneck link, or a misbehaving machine that is overloading the server with requests. These off-path causes do not appear in a conventional trace and will thus be missed by most existing diagnostic tools.

In this paper, we introduce a new concept we call temporal provenance that can help with diagnosing timing-related problems. Temporal provenance is inspired by earlier work on provenance-based network debugging; however, in addition to the functional problems that can already be handled with classical provenance, it can also diagnose problems that are related to timing. We present an algorithm for generating temporal provenance and an experimental debugger called Zeno; our experimental evaluation shows that Zeno can successfully diagnose several realistic performance bugs.

## Confluo: Distributed Monitoring and Diagnosis Stack for High-speed Networks

Anurag Khandelwal, UC Berkeley; Rachit Agarwal, Cornell University; Ion Stoica, UC Berkeley

Confluo is an end-host stack that can be integrated with existing network management tools to enable monitoring and diagnosis of network-wide events using telemetry data distributed across end-hosts, even for high-speed networks. Confluo achieves these properties using a new data structure—Atomic MultiLog—that supports highly-concurrent read-write operations by exploiting two properties specific to telemetry data: (1) once processed by the stack, the data is neither updated nor deleted; and (2) each field in the data has a fixed pre-defined size. Our evaluation results show that, for packet sizes 128B or larger, Confluo executes thousands of triggers and tens of filters at line rate (for 10Gbps links) using a single core.
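
A rough sketch of the append-only, fixed-size-record design (illustrative Python, not Confluo's C++ Atomic MultiLog; a lock stands in for an atomic fetch-add, and tail publication is simplified):

```python
# Append-only log with fixed-size records and a published read tail.
import struct, threading

RECORD = struct.Struct("<QQ")   # fixed pre-defined size: (timestamp, value)

class MultiLog:
    def __init__(self, capacity=1 << 20):
        self.buf = bytearray(capacity * RECORD.size)
        self.write_tail = 0     # next free record slot
        self.read_tail = 0      # records below this are visible to readers
        self.lock = threading.Lock()  # stand-in for atomic fetch-add

    def append(self, ts, value):
        with self.lock:                     # reserve a slot
            slot = self.write_tail
            self.write_tail += 1
        RECORD.pack_into(self.buf, slot * RECORD.size, ts, value)
        with self.lock:                     # publish to readers
            self.read_tail = max(self.read_tail, slot + 1)
        return slot

    def get(self, slot):
        if slot >= self.read_tail:
            return None                     # not yet published
        return RECORD.unpack_from(self.buf, slot * RECORD.size)

log = MultiLog()
s = log.append(ts=1, value=42)   # records are never updated or deleted
assert log.get(s) == (1, 42)
```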

## DETER: Deterministic TCP Replay for Performance Diagnosis

Yuliang Li, Harvard University; Rui Miao, Alibaba Group; Mohammad Alizadeh, Massachusetts Institute of Technology; Minlan Yu, Harvard University

TCP performance problems are notoriously tricky to diagnose, because a subtle choice of TCP parameters or features may lead to completely different performance. A gold standard for diagnosis is to collect packet traces and trace TCP executions. However, it is not easy to use such tools in large-scale data centers, where many TCP connections interact with each other. In this paper, we introduce DETER, a deterministic TCP replay tool, which runs lightweight recording all the time at all hosts, and then replays selected connections so that operators can collect packet traces and trace TCP executions for diagnosis. The key challenge for deterministic TCP replay is the butterfly effect---a small timing variation causes a chain reaction between TCP and the network that drives the system to a completely different state in the replay. To eliminate the butterfly effect, we propose to replay each TCP connection separately and capture all the interactions of a connection with the applications and the network. Our evaluation shows that DETER has low recording overhead and can help diagnose many TCP performance problems, ranging from long latency related to zero-window probes, late fast retransmissions, and frequent retransmission timeouts, to problems related to the switch shared buffer.

## Break with Refreshments

Grand Ballroom Foyer

## Improving Machine Learning

Session Chair: Mosharaf Chowdhury, University of Michigan

## JANUS: Fast and Flexible Deep Learning via Symbolic Graph Execution of Imperative Programs

Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dong-Jin Shin, and Byung-Gon Chun, Seoul National University

The rapid evolution of deep neural networks is demanding deep learning (DL) frameworks not only to satisfy the requirement of quickly executing large computations, but also to support straightforward programming models for quickly implementing and experimenting with complex network structures. However, existing frameworks fail to excel in both departments simultaneously, leading to diverged efforts for optimizing performance and improving usability.

This paper presents JANUS, a system that combines the advantages from both sides by transparently converting an imperative DL program written in Python, the de facto scripting language for DL, into an efficiently executable symbolic dataflow graph. JANUS can convert various dynamic features of Python, including dynamic control flow, dynamic types, and impure functions, into symbolic graph operations. Experiments demonstrate that JANUS can achieve fast DL training by exploiting the techniques employed by symbolic graph-based DL frameworks, while maintaining the simple and flexible programmability of imperative DL frameworks.
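
A toy version of imperative-to-symbolic conversion conveys the idea (far simpler than JANUS, which also has to handle dynamic control flow, dynamic types, and impure functions): run the user's Python function once on tracer objects to record a dataflow graph, then execute the recorded graph directly.

```python
# Minimal trace-based conversion of an imperative function into a graph.
class Sym:
    def __init__(self, graph, name):
        self.graph, self.name = graph, name

    def _emit(self, op, other):
        out = Sym(self.graph, f"t{len(self.graph)}")
        self.graph.append((op, self.name,
                           getattr(other, "name", other), out.name))
        return out

    def __add__(self, other): return self._emit("add", other)
    def __mul__(self, other): return self._emit("mul", other)

def trace(fn, *arg_names):
    graph = []
    args = [Sym(graph, n) for n in arg_names]
    return graph, fn(*args).name

def run(graph, out_name, env):
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for op, a, b, dst in graph:
        env[dst] = ops[op](env.get(a, a), env.get(b, b))  # consts pass through
    return env[out_name]

graph, out = trace(lambda x, y: x * y + 2, "x", "y")
assert run(graph, out, {"x": 3, "y": 4}) == 14  # same result, via the graph
```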

## BLAS-on-flash: An Efficient Alternative for Large Scale ML Training and Inference?

Suhas Jayaram Subramanya and Harsha Vardhan Simhadri, Microsoft Research India; Srajan Garg, IIT Bombay; Anil Kag and Venkatesh Balasubramanian, Microsoft Research India

Many large scale machine learning training and inference tasks are memory-bound rather than compute-bound. That is, on large data sets, the working set of these algorithms does not fit in memory for jobs that could run overnight on a few multi-core processors. This often forces an expensive redesign of the algorithm to distributed platforms such as parameter servers and Spark.

We propose an inexpensive and efficient alternative based on the observation that many ML tasks admit algorithms that can be programmed with linear algebra subroutines. A library that supports BLAS and sparseBLAS interface on large SSD-resident matrices can enable multi-threaded code to scale to industrial scale data sets on a single workstation.

We demonstrate that not only can such a library provide near in-memory performance for BLAS, but can also be used to write implementations of complex algorithms such as eigensolvers that outperform in-memory (ARPACK) and distributed (Spark) counterparts.

Existing multi-threaded in-memory code can link to our library with minor changes and scale to hundreds of Gigabytes of training or inference data at near in-memory processing speeds. We demonstrate this with two industrial scale use cases arising in ranking and relevance pipelines: training large scale topic models and inference for extreme multi-label learning.

This suggests that our approach could be an efficient alternative to expensive big-data compute systems for scaling up structurally complex machine learning tasks.
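
The core access pattern, streaming blocks of SSD-resident matrices through a small in-memory working set, can be sketched with numpy memory maps (an illustration only; the actual library is a C++ BLAS/sparseBLAS implementation with pipelined prefetch):

```python
# Blocked matrix multiply over memory-mapped, file-backed matrices.
import numpy as np

def memmap_matmul(a_path, b_path, c_path, n, block=512):
    A = np.memmap(a_path, dtype=np.float32, mode="r", shape=(n, n))
    B = np.memmap(b_path, dtype=np.float32, mode="r", shape=(n, n))
    C = np.memmap(c_path, dtype=np.float32, mode="w+", shape=(n, n))
    for i in range(0, n, block):
        for j in range(0, n, block):
            acc = np.zeros((min(block, n - i), min(block, n - j)), np.float32)
            for k in range(0, n, block):  # only three blocks resident at once
                acc += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
            C[i:i+block, j:j+block] = acc
    C.flush()

if __name__ == "__main__":
    n = 2048
    for p in ("a.bin", "b.bin"):   # create toy file-resident inputs
        m = np.memmap(p, dtype=np.float32, mode="w+", shape=(n, n))
        m[:] = np.random.rand(n, n).astype(np.float32)
        m.flush()
    memmap_matmul("a.bin", "b.bin", "c.bin", n)
```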

## Tiresias: A GPU Cluster Manager for Distributed Deep Learning

Juncheng Gu, Mosharaf Chowdhury, and Kang G. Shin, University of Michigan, Ann Arbor; Yibo Zhu, Microsoft and Bytedance; Myeongjae Jeon, Microsoft and UNIST; Junjie Qian, Microsoft; Hongqiang Liu, Alibaba; Chuanxiong Guo, Bytedance

Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCTs). Given that a DL job's execution time is often unpredictable, we propose two scheduling algorithms that aim to minimize the average JCT: Discretized Two-Dimensional Gittins index, which relies on partial information, and Discretized Two-Dimensional LAS, which is information-agnostic. Additionally, we describe when the consolidated placement constraint can be relaxed, and present a placement algorithm that leverages these observations without any user input. Experiments on the Michigan ConFlux cluster with 60 P100 GPUs and large-scale trace-driven simulations show that Tiresias improves the average JCT by up to 5.5× over an Apache YARN-based resource manager used in production. More importantly, Tiresias's performance is comparable to that of solutions assuming perfect knowledge.
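
The "two-dimensional" part of the idea fits in a few lines (a simplification: the real Discretized 2D-LAS bins attained service into a small number of priority queues and also handles preemption and placement): a job's attained service is its GPU count multiplied by its executed time, so neither dimension alone decides priority.

```python
# Two-dimensional least-attained-service priority (illustrative sketch).
def attained_service(job):
    return job["gpus"] * job["seconds_run"]   # GPUs x time, not time alone

def pick_next(jobs):
    return min(jobs, key=attained_service)    # least attained service first

jobs = [
    {"name": "big-old",   "gpus": 16, "seconds_run": 3600},
    {"name": "small-new", "gpus": 2,  "seconds_run": 10},
]
assert pick_next(jobs)["name"] == "small-new"
```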

## Network Functions

Session Chair: Ryan Stutsman, University of Utah