NSDI '19 Technical Sessions

Registered attendees can download prepublication versions of papers submitted for the spring submission deadline. Abstracts of all papers are available to everyone now. Copyright to the individual works is retained by the author(s). In February, the full Proceedings as well as all of the final paper PDFs will be posted.

Monday, February 25

7:00 pm–9:30 pm

NSDI '19 Preview Session

Are you new to NSDI? Are you a networking expert but feel bewildered when talk turns to security? Are you interested in engaging more deeply with paper presentations outside your research area? Join us for the NSDI preview session, where area experts will give short introductions to the Symposium's major technical sessions.

Tuesday, February 26

7:30 am–8:30 am

Continental Breakfast

8:30 am–8:45 am

Opening Remarks and Best Paper Awards

Program Co-Chairs: Jay Lorch, Microsoft Research, and Minlan Yu, Harvard University

8:45 am–10:00 am

Host Networking

Datacenter RPCs can be General and Fast

Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David Andersen, Carnegie Mellon University

Available Media

It is commonly believed that datacenter networking software must sacrifice generality to attain high performance. The popularity of specialized distributed systems designed specifically for niche technologies such as RDMA, lossless networks, FPGAs, and programmable switches testifies to this belief. In this paper, we show that such specialization is not necessary. eRPC is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics. eRPC performs well in three key metrics: message rate for small messages; bandwidth for large messages; and scalability to a large number of nodes and CPU cores. It handles packet loss, congestion, and background request execution. In microbenchmarks, one CPU core can handle up to 10 million small RPCs per second, or send large messages at 75 Gbps. We port a production-grade implementation of Raft state machine replication to eRPC without modifying the core Raft source code. We achieve 5.5 microseconds of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.

Eiffel: Efficient and Flexible Software Packet Scheduling

Ahmed Saeed and Yimeng Zhao, Georgia Institute of Technology; Nandita Dukkipati, Google; Ellen Zegura and Mostafa Ammar, Georgia Institute of Technology; Khaled Harras, Carnegie Mellon University; Amin Vahdat, Google

Available Media

Packet scheduling determines the ordering of packets in a queuing data structure with respect to some ranking function that is mandated by a scheduling policy. It is the core component in many recent innovations in optimizing network performance and utilization. Packet scheduling is used for network resource allocation, meeting network-wide delay objectives, or providing isolation and differentiation of service. Our focus in this paper is on the design and deployment of packet scheduling in software. Software schedulers have several advantages, including a shorter development cycle and flexibility in functionality and deployment location. We substantially improve software packet scheduling performance, while maintaining its flexibility, by exploiting underlying features of packet ranking: the fact that packet ranks are integers with predetermined ranges, and that many packets will typically have equal rank. This allows us to rely on integer priority queues, whereas existing schedulers rely on comparison-based priority queues that assume continuous ranks with an infinite range. We introduce Eiffel, a novel programmable packet scheduling system. At the core of Eiffel is an integer priority queue based on the Find First Set (FFS) instruction, designed to support a wide range of policies and ranking functions efficiently. As an even more efficient alternative, we also propose a new approximate priority queue that can outperform FFS-based queues in some scenarios. To support flexibility, Eiffel introduces novel programming abstractions to express scheduling policies that cannot be captured by current, state-of-the-art scheduler programming models. We evaluate Eiffel in a variety of settings and in both kernel and userspace deployments. We show that it outperforms state-of-the-art systems by 3–40x in terms of either the number of cores utilized for network processing or the number of flows handled given fixed processing capacity.
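
The FFS-based integer priority queue at the heart of Eiffel can be illustrated with a minimal Python sketch (the class name and bucket layout here are illustrative, not Eiffel's implementation): bounded integer ranks index FIFO buckets, and a bitmap plus a find-first-set operation locates the minimum occupied rank in constant time.

```python
from collections import deque

class FFSQueue:
    """Minimal integer priority queue: bounded ranks, FIFO within a rank.

    A bitmap records which rank buckets are non-empty; find-first-set
    on the bitmap locates the minimum occupied rank in O(1).
    """

    def __init__(self, max_rank):
        self.buckets = [deque() for _ in range(max_rank + 1)]
        self.bitmap = 0  # bit r set <=> bucket r non-empty

    def push(self, rank, item):
        self.buckets[rank].append(item)
        self.bitmap |= 1 << rank

    def pop(self):
        # Find-first-set: isolate the lowest set bit, take its index.
        r = (self.bitmap & -self.bitmap).bit_length() - 1
        item = self.buckets[r].popleft()
        if not self.buckets[r]:
            self.bitmap &= ~(1 << r)
        return r, item

q = FFSQueue(63)
q.push(5, "b"); q.push(2, "a"); q.push(5, "c")
print(q.pop(), q.pop(), q.pop())  # rank 2 first, then FIFO within rank 5
```

On x86, the `(x & -x).bit_length()` idiom corresponds to a single BSF/TZCNT instruction, which is what makes this dequeue effectively constant-time.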

Loom: Flexible and Efficient NIC Packet Scheduling

Brent Stephens, UIC; Aditya Akella and Michael Swift, UW-Madison

In multi-tenant cloud data centers, operators need to ensure that competing tenants and applications are isolated from each other and fairly share limited network resources. With current NICs, operators must either 1) use a single NIC queue and enforce network policy in software, which incurs high CPU overheads and struggles to drive increasing line-rates (100 Gbps), or 2) use multiple NIC queues and accept imperfect isolation and policy enforcement. These problems arise due to inflexible and static NIC packet schedulers and an inefficient OS/NIC interface.

To overcome these limitations, we present Loom, a new NIC design that moves all per-flow scheduling decisions out of the OS and into the NIC. The key aspects of Loom's design are 1) a new network policy abstraction: restricted directed acyclic graphs (DAGs), 2) a programmable hierarchical packet scheduler, and 3) a new expressive and efficient OS/NIC interface that enables the OS to precisely control how the NIC performs packet scheduling while still ensuring low CPU utilization. Loom is the only multiqueue NIC design that is able to efficiently enforce network policy. We find empirically that Loom lowers latency, increases throughput, and improves fairness for collocated applications and tenants.

10:00 am–10:30 am

Break with Refreshments

10:30 am–12:10 pm

Distributed Systems

Exploiting Commutativity For Practical Fast Replication

Seo Jin Park and John Ousterhout, Stanford University

Traditional approaches to replication require client requests to be ordered before making them durable by copying them to replicas. As a result, clients must wait for two round-trip times (RTTs) before updates complete. In this paper, we show that this entanglement of ordering and durability is unnecessary for strong consistency. Consistent Unordered Replication Protocol (CURP) allows clients to replicate requests that have not yet been ordered, as long as they are commutative. This strategy allows most operations to complete in 1 RTT (the same as an unreplicated system). We implemented CURP in the Redis and RAMCloud storage systems. In RAMCloud, CURP improved write latency by ~2x (14 µs → 7.1 µs) and write throughput by 4x. Compared to unreplicated RAMCloud, CURP's latency overhead for 3-way replication is just 1 µs (6.1 µs vs. 7.1 µs). CURP transformed a non-durable Redis cache into a consistent and durable storage system with only a small performance overhead.
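
The commutativity test at CURP's core is easy to sketch: for a key-value store, two updates commute iff they touch different keys. This toy witness (the names and the fallback behavior are hypothetical simplifications of the protocol) accepts an unordered request only when it does not conflict with any request it already holds:

```python
class Witness:
    """Toy CURP-style witness for a key-value store: records unsynced
    requests and rejects any request that does not commute with those
    it holds. For put/get on a KV store, operations commute iff their
    keys differ."""

    def __init__(self):
        self.unsynced = {}  # key -> request

    def record(self, key, request):
        if key in self.unsynced:
            return False  # conflict: client falls back to the ordered, 2-RTT path
        self.unsynced[key] = request
        return True

    def gc(self, key):
        # Once the master has ordered and replicated the request,
        # the witness can drop its copy.
        self.unsynced.pop(key, None)

w = Witness()
print(w.record("x", ("put", "x", 1)))  # True: no conflict, 1-RTT path
print(w.record("y", ("put", "y", 2)))  # True: different key, commutes
print(w.record("x", ("put", "x", 3)))  # False: conflicts with unsynced "x"
```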

Flashield: a Hybrid Key-value Cache that Controls Flash Write Amplification

Assaf Eisenman, Stanford University; Asaf Cidon, Stanford University and Barracuda Networks; Evgenya Pergament and Or Haimovich, Stanford University; Ryan Stutsman, University of Utah; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University

Available Media

As its price per bit drops, SSD is increasingly becoming the default storage medium for hot data in cloud application databases. Even though SSD’s price per bit is more than 10× lower, and it provides sufficient performance (when accessed over a network) compared to DRAM, the durability of flash has limited its adoption in write-heavy use cases, such as key-value caching. This is because key-value caches need to frequently insert, update and evict small objects. This causes excessive writes and erasures on flash storage, which significantly shortens the lifetime of flash. We present Flashield, a hybrid key-value cache that uses DRAM as a “filter” to control and limit writes to SSD. Flashield performs lightweight machine learning admission control to predict which objects are likely to be read frequently without getting updated; these objects, which are prime candidates to be stored on SSD, are written to SSD in large chunks sequentially. In order to efficiently utilize the cache’s available memory, we design a novel in-memory index for the variable-sized objects stored on flash that requires only 4 bytes per object in DRAM. We describe Flashield’s design and implementation, and evaluate it on real-world traces from a widely used caching service, Memcachier. Compared to state-of-the-art systems that suffer a write amplification of 2.5× or more, Flashield maintains a median write amplification of 0.5× (since many filtered objects are never written to flash at all), without any loss of hit rate or throughput.
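
The admission idea, writing to flash only objects predicted to be re-read without being updated, can be caricatured with a simple threshold rule. The paper trains a lightweight classifier for this; the fixed rule below is purely illustrative, not the system's actual model:

```python
def flash_admission(history, min_reads=2):
    """Toy stand-in for learning-based flash admission control.

    `history` is an object's access log while it sat in the DRAM
    "filter", e.g. ["read", "read", "update"]. Admit to flash only
    if the object was read at least `min_reads` times and never
    updated, i.e. it is likely to keep being read without forcing
    rewrites on flash. (Threshold rule is illustrative only.)
    """
    return history.count("read") >= min_reads and "update" not in history

print(flash_admission(["read", "read", "read"]))    # True: read-heavy, stable
print(flash_admission(["read", "update", "read"]))  # False: updated in DRAM
print(flash_admission(["read"]))                    # False: too few reads
```

Objects that fail the test stay in DRAM or are evicted, which is why many filtered objects never touch flash at all.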

Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores

Diego Didona, EPFL; Willy Zwaenepoel, EPFL and University of Sydney

Available Media

This paper introduces the concept of size-aware sharding to improve tail latencies for in-memory key-value stores, and describes its implementation in the Minos key-value store. Tail latencies are crucial in distributed applications with high fan-out ratios, because overall response time is determined by the slowest response. Size-aware sharding distributes requests for keys to cores according to the size of the item associated with the key. In particular, requests for small and large items are sent to disjoint subsets of cores. Size-aware sharding improves tail latencies by avoiding head-of-line blocking, in which a request for a small item gets queued behind a request for a large item. Alternative size-unaware approaches to sharding such as keyhash-based sharding, request dispatching and stealing do not avoid head-of-line blocking, and therefore exhibit worse tail latencies. The challenge in implementing size-aware sharding is to maintain high throughput by avoiding the cost of software dispatching and by achieving load balancing between different cores. Minos uses hardware dispatch for all requests for small items, which form the very large majority of all requests. It achieves load balancing by adapting the number of cores handling requests for small and large items to their relative presence in the workload. We compare Minos to three state-of-the-art designs of in-memory KV stores. Compared to its closest competitor, Minos achieves a 99th percentile latency that is up to two orders of magnitude lower. Put differently, for a given value for the 99th percentile latency equal to 10 times the mean service time, Minos achieves a throughput that is up to 7.4 times higher.
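
The sharding rule itself is simple to sketch: route requests by item size to disjoint core sets, so a small request can never queue behind a large one. The threshold, core split, and hash function below are illustrative choices, not Minos's actual parameters:

```python
import zlib

def shard_for(key, item_size, small_cores, large_cores, threshold=512):
    """Toy size-aware sharding: requests for small and large items go
    to disjoint core sets, avoiding head-of-line blocking of small
    requests behind large ones. A stable hash spreads keys within a
    set. (512-byte threshold and core split are illustrative.)"""
    cores = small_cores if item_size <= threshold else large_cores
    return cores[zlib.crc32(key.encode()) % len(cores)]

small, large = [0, 1, 2], [3]
print(shard_for("user:42", 100, small, large) in small)  # True: small item
print(shard_for("blob:7", 4096, small, large))           # 3: only large core
```

In the real system the small/large core split is adapted at runtime to the workload's size mix, which is what keeps the cores load-balanced.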

Monoxide: Scale Out Blockchain with Asynchronized Consensus Zones

Jiaping Wang; Hao Wang, The Ohio State University

Cryptocurrencies have provided a promising infrastructure for pseudonymous online payments. However, low transaction-per-second (TPS) rates have significantly hindered the scalability and usability of cryptocurrency systems for increasing numbers of users and transactions. Another obstacle to achieving scalability is that every node is required to duplicate the communication, storage, and state representation of the entire network.

In this paper, we introduce a novel Nakamoto consensus system, which scales linearly without compromising decentralization or security. We achieve this by running multiple independent and parallel instances of Nakamoto consensus (shards). Consensus happens independently within each shard with minimized communication, which partitions the workload of the entire network and ensures a moderate burden for each individual node as the network grows. We propose eventual atomicity to ensure transaction atomicity across shards, which guarantees the efficient completion of transactions without the overhead of a two-phase commit protocol. We also propose batch mining to keep the effective mining power in each shard at the same level as that of the entire network, making an attack on any individual shard as hard as one on the entire network. Our experimental results show the effectiveness of our work: on a testbed of 1,200 virtual machines worldwide supporting 48,000 nodes, our system delivers 1,000× the TPS of Bitcoin.

12:10 pm–1:30 pm

Symposium Luncheon and Test of Time Award Presentation

1:30 pm–3:10 pm

Modern Network Hardware

FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds

Daehyeok Kim and Tianlong Yu, Carnegie Mellon University; Hongqiang Harry Liu, Alibaba; Yibo Zhu, Microsoft and Bytedance; Jitu Padhye and Shachar Raindel, Microsoft; Chuanxiong Guo, Bytedance; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University

Available Media

Many popular large-scale cloud applications are increasingly using containerization for high resource efficiency and lightweight isolation. In parallel, many data-intensive applications (e.g., data analytics and deep learning frameworks) are adopting or looking to adopt RDMA for high networking performance. Industry trends suggest that these two approaches are on an inevitable collision course. In this paper, we present FreeFlow, a software-based RDMA virtualization framework designed for containerized clouds. FreeFlow realizes virtual RDMA networking purely with a software-based approach using commodity RDMA NICs. Unlike existing RDMA virtualization solutions, FreeFlow fully satisfies the requirements from cloud environments, such as isolation for multi-tenancy, portability for container migrations, and controllability for control and data plane policies. FreeFlow is also transparent to applications and provides networking performance close to bare-metal RDMA with low CPU overhead. In our evaluations with TensorFlow and Spark, FreeFlow provides almost the same application performance as bare-metal RDMA.

Direct Universal Access: Making Data Center Resources Available to FPGA

Ran Shu and Peng Cheng, Microsoft Research; Guo Chen, Microsoft Research & Hunan University; Zhiyuan Guo, Microsoft Research & Beihang University; Lei Qu and Yongqiang Xiong, Microsoft Research; Derek Chiou and Thomas Moscibroda, Microsoft Azure

Available Media

FPGAs have been deployed at massive scale in data centers. The currently available communication architectures, however, make it very difficult for FPGAs to utilize data center resources. In this paper, we present Direct Universal Access (DUA), a communication architecture that provides FPGAs with uniform access to heterogeneous data center resources. Without considering machine boundaries, DUA provides global names and a common interface for communicating with various resources, where the underlying network automatically routes traffic and manages resource multiplexing. Our benchmarks show that DUA provides simple and fair-share resource access with a small logic area overhead (<10%) and negligible latency (<0.2 µs). We also build two practical multi-FPGA applications, deep crossing and regular expression matching, on top of DUA to demonstrate its usability and efficiency.

Stardust: Divide and Conquer in the Data Center Network

Noa Zilberman, University of Cambridge; Gabi Bracha and Golan Schzukin, Broadcom

Building scalable data centers, and network devices that fit within these data centers, is becoming increasingly hard. With modern switches pushing at the boundary of manufacturing feasibility, being able to build simple and scalable network fabrics is of utmost importance in data centers. We introduce Stardust, a scalable fabric architecture for data center networks, inspired by network-switch systems and scaled up to the size of the data center. Stardust combines packet switches at the edge and disaggregated cell switches in the network fabric, using scheduled traffic. Stardust is a scalable, distributed solution that attends to the limitations of network-switch design while offering improved performance and power savings compared with traditional solutions. With networking requirements ever increasing, Stardust predicts the elimination of packet switches in the network, replaced by cell switches, with smart network interface cards at the edges.

Blink: Fast Connectivity Recovery Entirely in the Data Plane

Thomas Holterbach, Edgar Costa Molero, and Maria Apostolaki, ETH Zurich; Alberto Dainotti, CAIDA / UC San Diego; Stefano Vissicchio, University College London; Laurent Vanbever, ETH Zurich

In this paper, we explore new possibilities, created by programmable switches, for fast rerouting upon signals triggered by Internet traffic disruptions. We present Blink, a data-driven system exploiting TCP-induced signals to detect failures. The key intuition behind Blink is that a TCP flow exhibits a predictable behavior upon disruption: retransmitting the same packet over and over, at epochs exponentially spaced in time. When compounded over multiple flows, this behavior creates a strong and characteristic failure signal. Blink efficiently analyzes TCP flows, at line rate, to: (i) select flows to track; (ii) reliably and quickly detect major traffic disruptions; and (iii) recover data-plane connectivity, via next-hops compatible with the operator’s policies.

We present an end-to-end implementation of Blink in P4 together with an extensive evaluation on real and synthetic traffic traces. Our results indicate that Blink: (i) can achieve sub-second rerouting for realistic Internet traffic; (ii) prevents unnecessary traffic shifts, in the presence of noise; and (iii) scales to protect large fractions of realistic Internet traffic, on existing hardware. We further show the feasibility of Blink by running our system on a real Tofino switch.
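
Blink's failure signal can be caricatured in a few lines: per monitored flow, detect a repeated sequence number, and fire when enough flows are stuck at once. Windowing, the exponential epoch spacing, and the line-rate data structures are all omitted in this illustrative sketch:

```python
def failure_signal(flow_packets, threshold=0.5):
    """Toy version of Blink's intuition: a flow hitting a failure keeps
    retransmitting the same TCP sequence number. Count monitored flows
    whose last two observed packets carry the same seq, and fire when
    the fraction of such flows exceeds `threshold`. (Timing and
    windowing are omitted; threshold is illustrative.)"""
    retransmitting = sum(
        1 for seqs in flow_packets.values()
        if len(seqs) >= 2 and seqs[-1] == seqs[-2]
    )
    return retransmitting / max(len(flow_packets), 1) >= threshold

flows = {
    "f1": [100, 200, 200],  # stuck: retransmitting seq 200
    "f2": [7, 7],           # stuck: retransmitting seq 7
    "f3": [1, 2, 3],        # progressing normally
}
print(failure_signal(flows))  # True: 2 of 3 monitored flows are stuck
```

A single stuck flow can just be an unlucky endpoint; it is the compounding across many flows that makes the signal reliable enough to trigger rerouting.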

3:10 pm–3:40 pm

Break with Refreshments

3:40 pm–4:55 pm

Analytics

Hydra: a federated resource manager for data-center scale analytics

Carlo Curino, Subru Krishnan, and Konstantinos Karanasos, Microsoft; Sriram Rao, Facebook; Giovanni M. Fumarola, Botong Huang, Kishore Chaliparambil, Arun Suresh, Young Chen, Solom Heddaya, Roni Burd, Sarvesh Sakalanaga, Chris Douglas, Bill Ramsey, and Raghu Ramakrishnan, Microsoft

Available Media

Microsoft's internal data lake processes exabytes of data over millions of cores daily on behalf of thousands of tenants. Scheduling this workload requires 10x to 100x more decisions per second than existing, general-purpose resource management frameworks are known to handle. In 2013, we were faced with a growing demand for workload diversity and richer sharing policies that our legacy system could not meet. In this paper, we present Hydra, the resource management infrastructure we built to meet these requirements.

Hydra leverages a federated architecture, in which a cluster is comprised of multiple, loosely coordinating subclusters. This allows us to scale by delegating placement of tasks on machines to each sub-cluster, while centrally coordinating only to ensure that tenants receive the right share of resources. To adapt to changing workload and cluster conditions promptly, Hydra's design features a control plane that can push scheduling policies across tens of thousands of nodes within seconds. This feature combined with the federated design allows for great agility in developing, evaluating, and rolling out new system behaviors.

We built Hydra by leveraging, extending, and contributing our code to Apache Hadoop YARN. Hydra is currently the primary big-data resource manager at Microsoft. Over the last few years, Hydra has scheduled nearly one trillion tasks that manipulated close to a Zettabyte of production data.

Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure

Qifan Pu, UC Berkeley; Shivaram Venkataraman, University of Wisconsin, Madison; Ion Stoica, UC Berkeley

Serverless computing is poised to fulfill the long-held promise of transparent elasticity and millisecond-level pricing. To achieve this goal, service providers impose a fine-grained computational model where every function has a maximum duration, a fixed amount of memory and no persistent local storage. We observe that the fine-grained elasticity of serverless is key to achieve high utilization for general computations such as analytics workloads, but that resource limits make it challenging to implement such applications as they need to move large amounts of data between functions that don’t overlap in time. In this paper, we present Locus, a serverless analytics system that judiciously combines (1) cheap but slow storage with (2) fast but expensive storage, to achieve good performance while remaining cost-efficient. Locus applies a performance model to guide users in selecting the type and the amount of storage to achieve the desired cost-performance trade-off. We evaluate Locus on a number of analytics applications including TPC-DS, CloudSort, Big Data Benchmark and show that Locus can navigate the cost-performance trade-off, leading to 4×-500× performance improvements over a slow-storage-only baseline and reducing resource usage by up to 59% while achieving comparable performance on a cluster of virtual machines.

dShark: A General, Easy to Program and Scalable Framework for Analyzing In-network Packet Traces

Da Yu, Brown University; Yibo Zhu, Microsoft and Bytedance; Behnaz Arzani, Microsoft; Rodrigo Fonseca, Brown University; Tianrong Zhang, Karl Deng, and Lihua Yuan, Microsoft

Available Media

Distributed, in-network packet capture is still the last resort for diagnosing network problems. Despite recent advances in collecting packet traces scalably, effectively utilizing pervasive packet captures still poses important challenges. Arbitrary combinations of middleboxes which transform packet headers make it challenging to even identify the same packet across multiple hops; packet drops in the collection system create ambiguities that must be handled; the large volume of captures, and their distributed nature, make it hard to do even simple processing; and the one-off and urgent nature of problems tends to generate ad-hoc solutions that are not reusable and do not scale. In this paper we propose dShark to address these challenges. dShark allows intuitive groupings of packets across multiple traces that are robust to header transformations and capture noise, offering simple streaming data abstractions for network operators. Using dShark on production packet captures from a major cloud provider, we show that dShark makes it easy to write concise and reusable queries against distributed packet traces that solve many common problems in diagnosing complex networks. Our evaluation shows that dShark can analyze production packet traces with more than 10 Mpps throughput on a commodity server, and has near-linear speedup when scaling out on multiple servers.
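
The grouping idea can be sketched as keying packets on fields that survive header rewriting. Which fields are invariant depends on the middleboxes in the path; in dShark the operator declares them, and the (IP ID, TCP seq) choice below is just an example:

```python
from collections import defaultdict

def group_packets(captures):
    """Toy dShark-style grouping: identify 'the same packet' across
    capture points by fields that typically survive header rewriting
    (here IP ID and TCP sequence number; the real system lets the
    operator declare which fields are invariant for their network).
    Returns {packet_key: [capture points where it was seen]}."""
    groups = defaultdict(list)
    for hop, pkts in captures.items():
        for p in pkts:
            groups[(p["ip_id"], p["seq"])].append(hop)
    return groups

captures = {
    "tor1":  [{"ip_id": 9, "seq": 100, "src": "10.0.0.1"}],
    "spine": [{"ip_id": 9, "seq": 100, "src": "192.168.0.5"}],  # src NAT'd
    "tor2":  [],  # packet never arrived here
}
g = group_packets(captures)
print(g[(9, 100)])  # ['tor1', 'spine'] -> dropped between spine and tor2
```

Once packets are grouped per-hop like this, drop localization and other diagnoses become simple streaming queries over the groups.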

4:55 pm–5:10 pm

Short Break

5:10 pm–6:25 pm

Data Center Network Architecture

Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks

Shizhen Zhao, Rui Wang, Junlan Zhou, Joon Ong, Jeffrey C. Mogul, and Amin Vahdat, Google, Inc.

Available Media

Clos topologies have been widely adopted for large-scale data center networks (DCNs), but it has been difficult to support incremental expansions for Clos DCNs. Some prior work has claimed that the structure of Clos topologies hinders incremental expansion.

We demonstrate that it is indeed possible to design expandable Clos DCNs, and to expand them while they are carrying live traffic, without incurring packet loss. We use a layer of patch panels between blocks of switches in a Clos DCN, which makes physical rewiring feasible, and we describe how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper. We also describe a block-aggregation technique that makes our ILP approach scalable. We tested our "minimal-rewiring" solver on two kinds of fine-grained expansions using 2250 synthetic DCN topologies, and found that the solver can handle 99% of these cases while changing under 25% of the connections. Compared to prior approaches, this solver (on average) reduces the average number of "stages" per expansion from 4 to 1.29, and reduces the number of wires changed by an order of magnitude or more—a significant improvement to our operational costs, and to our exposure (during expansions) to capacity-reducing faults.

Understanding Lifecycle Management Complexity of Datacenter Topologies

Mingyang Zhang, University of Southern California; Radhika Niranjan Mysore, VMware Research; Sucha Supittayapornpong and Ramesh Govindan, University of Southern California

Most recent datacenter topology designs have focused on performance properties such as latency and throughput. In this paper, we explore a new dimension, life cycle management, which attempts to capture operational costs of topologies. Specifically, we consider costs associated with deployment and expansion of topologies and explore how structural properties of two different topology families (Clos and expander graphs as exemplified by Xpander) affect these. We also develop a new topology that has the wiring simplicity of Clos and the expandability of expander graphs using the insights from our study.

Shoal: A Network Architecture for Disaggregated Racks

Vishal Shrivastav, Cornell University; Asaf Valadarsky, Hebrew University of Jerusalem; Hitesh Ballani and Paolo Costa, Microsoft Research; Ki Suh Lee, Waltz Networks; Han Wang, Barefoot Networks; Rachit Agarwal and Hakim Weatherspoon, Cornell University

Available Media

Disaggregated racks comprise a dense cluster of separate pools of compute, memory and storage blades, all inter-connected through an internal network within a single rack. However, their density poses a unique challenge for the rack’s network: it needs to connect an order of magnitude more nodes than today’s racks without exceeding the rack’s fixed power budget and without compromising on performance. We present Shoal, a power-efficient yet performant intra-rack network fabric built using fast circuit switches. Such switches consume less power as they have no buffers and no packet inspection mechanism, yet can be reconfigured in nanoseconds. Rack nodes transmit according to a static schedule such that there is no in-network contention, without requiring a centralized controller. Shoal’s congestion control leverages the physical fabric to achieve fairness as well as bounded worst-case network throughput and queuing. We use an FPGA-based prototype, testbed experiments, and simulations to show that Shoal’s mechanisms are practical and can simultaneously achieve high density and high performance: 71% lower power and comparable or higher performance than today’s network designs.
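
A static, controller-free circuit schedule of the kind Shoal relies on can be sketched in a few lines. The shifted round-robin below is a standard construction used here for illustration, not necessarily Shoal's exact schedule:

```python
def static_schedule(n):
    """Toy static circuit schedule: in timeslot t, every node i is
    connected to node (i + t) mod n. Each slot is a perfect matching
    of transmitters to receivers, so there is no in-network
    contention, and over n-1 slots every node reaches every other
    node exactly once, with no central controller."""
    return [[(i + t) % n for i in range(n)] for t in range(1, n)]

for t, dests in enumerate(static_schedule(4), start=1):
    print(f"slot {t}: node i -> {dests}")
```

Because the schedule is fixed and known to every node, the circuit switches only need to step through a precomputed configuration sequence, which is what lets them be bufferless and reconfigure in nanoseconds.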

Wednesday, February 27

7:30 am–8:30 am

Continental Breakfast

8:30 am–10:10 am

Wireless Technologies

NetScatter: Enabling Large-Scale Backscatter Networks

Mehrdad Hessar, Ali Najafi, and Shyamnath Gollakota, University of Washington

Available Media

We present the first wireless protocol that scales to hundreds of concurrent transmissions from backscatter devices. Our key innovation is a distributed coding mechanism that works below the noise floor, operates on backscatter devices and can decode all the concurrent transmissions at the receiver using a single FFT operation. Our design addresses practical issues such as timing and frequency synchronization as well as the near-far problem. We deploy our design using a testbed of backscatter hardware and show that our protocol scales to concurrent transmissions from 256 devices using a bandwidth of only 500 kHz. Our results show throughput and latency improvements of 14–62x and 15–67x over existing approaches and 1–2 orders of magnitude higher transmission concurrency.

Towards Programming the Radio Environment with Large Arrays of Inexpensive Antennas

Zhuqi Li, Yaxiong Xie, and Longfei Shangguan, Princeton University; Rotman Ivan Zelaya, Yale University; Jeremy Gummeson, UMass Amherst; Wenjun Hu, Yale University; Kyle Jamieson, Princeton University

Conventional thinking treats the wireless channel as a given constraint. Therefore, wireless network designs to date center on the problem of endpoint optimization that best utilizes the channel, for example, via rate and power control at the transmitter or sophisticated decoding mechanisms at the receiver. We instead explore whether it is possible to reconfigure the environment itself to facilitate wireless communication. In this work, we instrument the environment with a large array of inexpensive antennas (LAIA) and design algorithms to configure them in real time. Our system achieves this level of programmability through rapid adjustments of an on-board phase shifter in each LAIA device. We design a channel decomposition algorithm to quickly estimate the wireless channel due to the environment alone, which leads us to a process to align the phases of the array elements. Variations of our core algorithm can then optimize wireless channels on the fly for single- and multi-antenna links, as well as nearby networks operating on adjacent frequency bands. We design and deploy a 36-element passive array in a real indoor home environment. Experiments with this prototype show that, by reconfiguring the wireless environment, we can achieve a 24% TCP throughput improvement on average and a median improvement of 51.4% in Shannon capacity over the baseline single-antenna links. Over the baseline multi-antenna links, LAIA achieves an improvement of 12.23% to 18.95% in Shannon capacity.
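
The phase-alignment step can be sketched numerically: once the per-element channels are estimated, each element's phase shifter is set so its reflected path adds constructively with the direct path. Continuous phase shifts are assumed here; real hardware quantizes them, and the channel values are made up for illustration:

```python
import cmath

def align_phases(h_direct, h_elements):
    """Toy phase alignment for a passive antenna array: element i
    applies shift phi_i = angle(h_direct) - angle(h_i), so every
    reflected path arrives in phase with the direct path and the
    magnitudes add. Returns the shifts and the combined channel gain.
    (Continuous phases assumed; hardware phase shifters are quantized.)"""
    shifts = [cmath.phase(h_direct) - cmath.phase(h) for h in h_elements]
    combined = h_direct + sum(h * cmath.exp(1j * s)
                              for h, s in zip(h_elements, shifts))
    return shifts, abs(combined)

h_d = 1.0 + 0.0j                                  # direct path
h_e = [0.3 * cmath.exp(1j * 2.1),                 # two reflected paths,
       0.2 * cmath.exp(-1j * 0.7)]                # arbitrary phases
_, gain = align_phases(h_d, h_e)
print(round(gain, 2))  # 1.5: magnitudes 1.0 + 0.3 + 0.2 add coherently
```

Without alignment the reflected paths add with random phases and can even cancel the direct path; alignment turns the same reflections into pure gain.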

Pushing the Range Limits of Commercial Passive RFIDs

Jingxian Wang, CMU; Junbo Zhang, Tsinghua University; Rajarshi Saha, IIT Kharagpur; Haojian Jin and Swarun Kumar, CMU

This paper asks: "Can we push the prevailing range limits of commercial passive RFIDs?". Today’s commercial passive RFIDs report ranges of 5-15 meters at best. This constrains RFIDs to be detected only at specific checkpoints in warehouses, stores and factories today, leaving them outside of communication range beyond these spaces. State-of-the-art approaches to improve the range of RFIDs develop new tag hardware that necessarily sacrifices some of the most attractive features of passive RFIDs such as their low cost, small form-factor or the absence of a battery.

We present PushID, a system that exploits collaboration between readers to enhance the range of commercial passive RFID tags, without altering the tags whatsoever. PushID uses distributed MIMO to coherently combine signals across geographically separated RFID readers at the tags. In doing so, it resolves the chicken-or-egg problem of inferring the optimal beamforming parameters to beam energy to a tag without any feedback from the tag itself, which needs this energy to respond in the first place. A prototype evaluation of PushID with 8 distributed RFID readers reveals a range of 64-meters to the closest reader, a 7.4x improvement in range over state-of-the-art commercial RFID systems.

SweepSense: Sensing 5 GHz in 5 Milliseconds with Low-cost SDRs

Yeswanth Guddeti, UC San Diego; Raghav Subbaraman, IIT Madras; Moein Khazraee, Aaron Schulman, and Dinesh Bharadia, UC San Diego

We introduce a new spectrum sensing paradigm called "SweepSense" where spectrum sensors rapidly sweep 5 GHz in only five milliseconds. The insight behind SweepSense is that commercial Software Defined Radios can be easily modified to make them capable of rapidly sweeping the spectrum. However, this introduces a new challenge: SweepSense only captures a small number of distorted samples of the transmissions in the spectrum. To overcome this challenge, we introduce a novel technique to correct the distortion with self-generated calibration data, and to classify the protocol that originated each transmission with only a fraction of the transmission's samples. Surprisingly, we discovered that these samples contain sufficient signal to identify the protocol that originated a transmission in the heavily used ISM bands, and to determine the load on the downlink of an LTE basestation.

10:10 am–10:40 am

Break with Refreshments

10:40 am–11:55 am

Operating Systems

Slim: OS Kernel Support for a Low-Overhead Container Overlay Network

Danyang Zhuo and Kaiyuan Zhang, University of Washington; Yibo Zhu, Microsoft and Bytedance; Hongqiang Harry Liu, Alibaba; Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson, University of Washington

Containers have become the de facto method for hosting large-scale distributed applications. Container overlay networks are essential to providing portability for containers, yet they impose significant overhead in terms of throughput, latency, and CPU utilization. The key problem is a reliance on packet transformation to implement network virtualization. As a result, each packet has to traverse the network stack twice in both the sender and the receiver’s host OS kernel. We have designed and implemented Slim, a low-overhead container overlay network that implements network virtualization by manipulating connection-level metadata. Our solution maintains compatibility with today’s containerized applications. Evaluation results show that Slim improves the throughput of an in-memory key-value store by 66% while reducing the latency by 42%. Slim reduces the CPU utilization of the in-memory key-value store by 54%. Slim also reduces the CPU utilization of a web server by 28%-40%, a database server by 25%, and a stream processing framework by 11%.
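
The contrast between per-packet and connection-level translation can be sketched in a few lines of Python (purely illustrative: the mapping table, addresses, and function names below are invented, and Slim itself operates inside the host OS kernel rather than in application code):

```python
# Toy contrast between per-packet transformation (standard container overlays)
# and per-connection transformation (the idea behind Slim). All names and
# addresses are invented for illustration.

# Virtual (container) endpoint -> physical (host) endpoint.
VIRT_TO_HOST = {("10.1.0.2", 80): ("192.168.1.7", 30080)}

def overlay_per_packet(packets):
    # A conventional overlay rewrites every packet (and, in a real system,
    # each packet re-traverses the host network stack after encapsulation).
    return [(VIRT_TO_HOST[dst], payload) for dst, payload in packets]

def slim_connect(virtual_dst):
    # Translate once, at connection setup, by manipulating connection-level
    # metadata; every subsequent packet then flows through the host network
    # untouched.
    return VIRT_TO_HOST[virtual_dst]

packets = [(("10.1.0.2", 80), b"a"), (("10.1.0.2", 80), b"b")]
host_dst = slim_connect(("10.1.0.2", 80))   # one lookup for the whole flow
```

The per-flow cost is paid once instead of once per packet, which is where the throughput, latency, and CPU savings come from.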

Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency

Kostis Kaffes, Timothy Chong, and Jack Tigar Humphries, Stanford University; Adam Belay, Massachusetts Institute of Technology; David Mazières and Christos Kozyrakis, Stanford University

The recently proposed dataplanes for microsecond-scale applications, such as IX and ZygOS, use non-preemptive policies to schedule requests to cores. For the many real-world scenarios where request service times follow distributions with high dispersion or a heavy tail, they allow short requests to be blocked behind long requests, which leads to poor tail latency.

Shinjuku is a single-address space operating system that uses hardware support for virtualization to make preemption practical at the microsecond scale. This allows Shinjuku to implement centralized scheduling policies that preempt requests as often as every 5µsec and work well for both light- and heavy-tailed request service time distributions. We demonstrate that Shinjuku provides significant tail latency and throughput improvements over IX and ZygOS for a wide range of workload scenarios. For the case of a RocksDB server processing both point and range queries, Shinjuku achieves up to 6.6× higher throughput and 88% lower tail latency.
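
The queueing intuition behind microsecond-scale preemption can be illustrated with a toy single-core simulation (all numbers and both schedulers below are invented stand-ins for illustration, not Shinjuku's actual design):

```python
# Toy single-core simulation: compare run-to-completion FCFS with a
# 5-unit-quantum preemptive scheduler on a bimodal workload where one rare
# long request sits in front of many short ones.

def fcfs(jobs):
    """Run to completion in arrival order; return per-job completion times."""
    t, done = 0, []
    for size in jobs:
        t += size
        done.append(t)
    return done

def preemptive_rr(jobs, quantum=5):
    """Round-robin with a fixed preemption quantum (all jobs arrive at t=0)."""
    remaining = list(jobs)
    done = [0] * len(jobs)
    t = 0
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r <= 0:
                continue
            run = min(quantum, r)
            t += run
            remaining[i] -= run
            if remaining[i] == 0:
                done[i] = t
    return done

# One 500-unit request followed by ten 1-unit requests, all queued at t=0.
workload = [500] + [1] * 10
fcfs_done = fcfs(workload)
rr_done = preemptive_rr(workload)

worst_short_fcfs = max(fcfs_done[1:])   # every short request waits for the long one
worst_short_rr = max(rr_done[1:])       # short requests finish within a few quanta
```

With run-to-completion, the tail latency of the short requests is dominated by the long request; with a small quantum, the shorts all finish early while the long request's completion time is essentially unchanged.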

Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads

Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan, MIT CSAIL

Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today's kernels, the best available solution to achieve microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load.

Shenango achieves comparable latencies but at far greater CPU efficiency. It reallocates cores across applications at very fine granularity—every 5us—enabling cycles unused by latency-sensitive applications to be used productively by background tasks. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications, such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art, kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for background application throughput, vastly increasing CPU efficiency.

11:55 am–1:30 pm

Lunch (on your own)

1:30 pm–3:10 pm

Monitoring and Diagnosis

End-to-end I/O Monitoring on a Leading Supercomputer

Bin Yang, Shandong University; Xu Ji, Tsinghua University; Xiaosong Ma, Qatar Computing Research Institute; Xiyang Wang, National Supercomputing Center in Wuxi; Tianyu Zhang and Xiupeng Zhu, Shandong University; Nosayba El-Sayed, Emory University; Haidong Lan, Shandong University; Yibo Yang, Shandong University; Jidong Zhai, Tsinghua University; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University

This paper presents an effort to overcome the complexities of production-use I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40960-node Sunway TaihuLight supercomputer, currently ranked No. 2 in the world. It simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online+offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code/script modification.

With its deployment on TaihuLight for around 18 months, we demonstrate Beacon's effectiveness with a collection of real-world use cases for I/O performance issue identification and diagnosis. It has successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with the others currently being addressed. In addition, we demonstrate Beacon's generality by its recent extension to monitor interconnection networks, another contention point on supercomputers. Finally, both the code and the collected data are to be released.

Zeno: Diagnosing Performance Problems with Temporal Provenance

Yang Wu, Facebook; Ang Chen, Rice University; Linh Thi Xuan Phan, University of Pennsylvania

Available Media

When diagnosing a problem in a distributed system, it is sometimes necessary to explain the timing of an event—for instance, why a response has been delayed, or why the network latency is high. Existing tools offer some support for this, typically by tracing the problem to a bottleneck or to an overloaded server. However, locating the bottleneck is merely the first step: the real problem may be some other service that is sending traffic over the bottleneck link, or a misbehaving machine that is overloading the server with requests. These off-path causes do not appear in a conventional trace and will thus be missed by most existing diagnostic tools.

In this paper, we introduce a new concept we call temporal provenance that can help with diagnosing timing-related problems. Temporal provenance is inspired by earlier work on provenance-based network debugging; however, in addition to the functional problems that can already be handled with classical provenance, it can also diagnose problems that are related to timing. We present an algorithm for generating temporal provenance and an experimental debugger called Zeno; our experimental evaluation shows that Zeno can successfully diagnose several realistic performance bugs.
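
The notion that an event's timing can be explained by an off-path chain of events can be sketched as a critical-path computation over a toy timed dependency graph (the event names and durations below are invented; Zeno's temporal provenance is far richer than this):

```python
# Toy "why was this event late?" computation. Each event has a duration and
# dependencies; an event's timing is explained by the dependency chain that
# dominates its start time -- which can run through work done on behalf of
# *other* requests (an off-path cause).
from functools import lru_cache

events = {
    # name: (duration_ms, dependencies)
    "other_svc_burst": (40, []),          # off-path traffic on a shared link
    "request_sent":    (1,  []),
    "link_transfer":   (2,  ["request_sent", "other_svc_burst"]),  # queued behind burst
    "server_compute":  (5,  ["link_transfer"]),
    "response":        (1,  ["server_compute"]),
}

@lru_cache(maxsize=None)
def finish(name):
    dur, deps = events[name]
    return dur + max((finish(d) for d in deps), default=0)

def critical_path(name):
    _, deps = events[name]
    if not deps:
        return [name]
    blocker = max(deps, key=finish)       # the dependency that dominates timing
    return critical_path(blocker) + [name]

# The dominating chain runs through the other service's burst, not through
# anything on the request's own path.
why_late = critical_path("response")
```

A conventional trace of the request alone would show only `request_sent → link_transfer → server_compute → response` and miss the actual cause.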

Confluo: Distributed Monitoring and Diagnosis Stack for High-speed Networks

Anurag Khandelwal, UC Berkeley; Rachit Agarwal, Cornell University; Ion Stoica, UC Berkeley

Available Media

Confluo is an end-host stack that can be integrated with existing network management tools to enable monitoring and diagnosis of network-wide events using telemetry data distributed across end-hosts, even for high-speed networks. Confluo achieves these properties using a new data structure—Atomic MultiLog—that supports highly-concurrent read-write operations by exploiting two properties specific to telemetry data: (1) once processed by the stack, the data is neither updated nor deleted; and (2) each field in the data has a fixed pre-defined size. Our evaluation results show that, for packet sizes 128B or larger, Confluo executes thousands of triggers and tens of filters at line rate (for 10Gbps links) using a single core.
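
The two telemetry-data properties the abstract mentions—write-once records and fixed pre-defined field sizes—can be sketched in a minimal single-process log (illustrative only: the record layout below is invented, and the real Atomic MultiLog is a lock-free, shared-memory C++ structure):

```python
# Minimal sketch of the write-once, fixed-size-record idea behind the Atomic
# MultiLog. Because records are never updated or deleted and have a fixed
# size, a writer can reserve its slot with a single counter increment and
# know the byte offset before writing anything.
import itertools
import struct

RECORD_FMT = "!IHH"                      # e.g. src IP, src port, pkt length (invented)
RECORD_SIZE = struct.calcsize(RECORD_FMT)

class FixedRecordLog:
    def __init__(self, capacity):
        self.buf = bytearray(capacity * RECORD_SIZE)
        self._next = itertools.count()   # slot-reservation counter
        self.read_tail = 0               # records visible to readers

    def append(self, src_ip, src_port, length):
        slot = next(self._next)          # reserve; offset is now fixed forever
        struct.pack_into(RECORD_FMT, self.buf, slot * RECORD_SIZE,
                         src_ip, src_port, length)
        self.read_tail = slot + 1        # publish (an atomic update in the real thing)
        return slot

    def get(self, slot):
        assert slot < self.read_tail, "record not yet published"
        return struct.unpack_from(RECORD_FMT, self.buf, slot * RECORD_SIZE)

log = FixedRecordLog(capacity=1024)
s = log.append(0x0A000001, 4242, 128)    # 10.0.0.1
```

Readers only ever see offsets below the published tail, so reads need no locks at all; that is what makes highly concurrent read-write operation cheap.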

DETER: Deterministic TCP Replay For Performance Diagnosis

Yuliang Li, Harvard University; Rui Miao, Alibaba Group; Mohammad Alizadeh, Massachusetts Institute of Technology; Minlan Yu, Harvard University

TCP performance problems are notoriously tricky to diagnose because a subtle choice of TCP parameters or features may lead to completely different performance. A gold standard for diagnosis is to collect packet traces and trace TCP executions. However, it is not easy to use such tools in large-scale data centers where many TCP connections interact with each other. In this paper, we introduce DETER, a deterministic TCP replay tool, which runs lightweight recording all the time at all hosts and then replays selected connections so that operators can collect packet traces and trace TCP executions for diagnosis. The key challenge for deterministic TCP replay is the butterfly effect: a small timing variation causes a chain reaction between TCP and the network that drives the system to a completely different state in the replay. To eliminate the butterfly effect, we propose to replay each TCP connection separately and capture all the interactions between a connection and the applications and the network. Our evaluation shows that DETER has low recording overhead and can help diagnose many TCP performance problems, such as long latency related to zero-window probes, late fast retransmissions, frequent retransmission timeouts, and problems related to the switch's shared buffer.
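
The record-and-replay principle can be illustrated in miniature (purely illustrative: the "protocol" below is an invented toy, not DETER's kernel-level mechanism): if every non-deterministic input to a connection is logged, feeding the log back through the same code reproduces the exact same state sequence.

```python
# Toy record/replay. The event sequence is the only source of
# non-determinism, so logging it makes the run reproducible.
import random

def toy_tcp(events):
    """A toy AIMD window: +1 on 'ack', halve on 'drop'. Returns the cwnd trace."""
    cwnd, trace = 10, []
    for ev in events:
        cwnd = cwnd + 1 if ev == "ack" else max(1, cwnd // 2)
        trace.append(cwnd)
    return trace

# Live run: events arrive non-deterministically (here: randomly), and the
# lightweight recording phase simply logs them as they happen.
rng = random.Random(7)
recorded = [("ack" if rng.random() < 0.9 else "drop") for _ in range(1000)]
live_trace = toy_tcp(recorded)

# Replay: the same recorded inputs drive the same code to the same states,
# so heavyweight diagnostics (packet traces etc.) can be collected now.
replay_trace = toy_tcp(recorded)
```

The butterfly effect the abstract describes is exactly what goes wrong when the replayed inputs are *not* bit-for-bit identical to the recorded ones.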

3:10 pm–3:40 pm

Break with Refreshments

3:40 pm–4:55 pm

Improving Machine Learning

JANUS: Fast and Flexible Deep Learning via Symbolic Graph Execution of Imperative Programs

Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, DongJin Shin, and Byung-Gon Chun, Seoul National University

The rapid evolution of deep neural networks is demanding deep learning (DL) frameworks not only to satisfy the traditional requirement of quickly executing large computations, but also to support straightforward programming models for quickly implementing and experimenting with complex network structures. However, existing frameworks fail to excel in both departments simultaneously, leading to diverged efforts for optimizing performance and improving usability. This paper presents JANUS, a system that combines the advantages from both sides by transparently converting an imperative DL program written in Python, the de facto scripting language for DL, into an efficiently executable symbolic dataflow graph. JANUS can convert various dynamic features of Python, including dynamic control flow, dynamic types, and impure functions, into the symbolic graph operations. Experiments demonstrate that JANUS can achieve fast DL training by exploiting the techniques employed by symbolic graph-based DL frameworks, while maintaining the simple and flexible programmability of imperative DL frameworks at the same time.

BLAS-on-flash: An Efficient Alternative for Large Scale ML Training and Inference?

Suhas Jayaram Subramanya and Harsha Vardhan Simhadri, Microsoft Research India; Srajan Garg, IIT Bombay; Anil Kag and Venkatesh Balasubramanian, Microsoft Research India

Available Media

Many large scale machine learning training and inference tasks are memory-bound rather than compute-bound. That is, on large data sets, the working set of these algorithms does not fit in memory for jobs that could run overnight on a few multi-core processors. This often forces an expensive redesign of the algorithm to distributed platforms such as parameter servers and Spark.

We propose an inexpensive and efficient alternative based on the observation that many ML tasks admit algorithms that can be programmed with linear algebra subroutines. A library that supports BLAS and sparseBLAS interface on large SSD-resident matrices can enable multi-threaded code to scale to industrial scale data sets on a single workstation.

We demonstrate that not only can such a library provide near in-memory performance for BLAS, but can also be used to write implementations of complex algorithms such as eigensolvers that outperform in-memory (ARPACK) and distributed (Spark) counterparts.

Existing multi-threaded in-memory code can link to our library with minor changes and scale to hundreds of Gigabytes of training or inference data at near in-memory processing speeds. We demonstrate this with two industrial scale use cases arising in ranking and relevance pipelines: training large scale topic models and inference for extreme multi-label learning.

This suggests that our approach could be an efficient alternative to expensive big-data compute systems for scaling up structurally complex machine learning tasks.
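
The core idea—keep the matrix on storage and stream it through memory in blocks so the working set is one block rather than the whole matrix—can be sketched with a disk-resident matrix-vector product (a minimal stand-in: the real library implements BLAS/sparseBLAS on SSDs with pipelined, multi-threaded I/O, which this sketch does not attempt):

```python
# Sketch of an out-of-core matrix-vector product: the matrix lives in a file
# and is read a few rows at a time, so memory use is bounded by the block
# size regardless of matrix size.
import array
import os
import tempfile

def write_matrix(path, rows):
    with open(path, "wb") as f:
        for row in rows:
            array.array("d", row).tofile(f)

def matvec_from_disk(path, n_rows, n_cols, x, block_rows=2):
    """y = A @ x, reading A from disk block_rows rows at a time."""
    y = []
    row_bytes = n_cols * 8                       # 8 bytes per double
    with open(path, "rb") as f:
        for start in range(0, n_rows, block_rows):
            rows = min(block_rows, n_rows - start)
            block = array.array("d")
            block.frombytes(f.read(rows * row_bytes))
            for r in range(rows):
                y.append(sum(block[r * n_cols + c] * x[c]
                             for c in range(n_cols)))
    return y

fd, path = tempfile.mkstemp()
os.close(fd)
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
write_matrix(path, A)
y = matvec_from_disk(path, 3, 2, [1.0, 1.0])
os.remove(path)
```

Because the access pattern is large sequential reads, SSD bandwidth rather than latency is the limiting factor, which is why such code can approach in-memory speeds.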

Tiresias: A GPU Cluster Manager for Distributed Deep Learning

Juncheng Gu, Mosharaf Chowdhury, and Kang G. Shin, University of Michigan, Ann Arbor; Yibo Zhu, Microsoft and Bytedance; Myeongjae Jeon, UNIST and Microsoft; Junjie Qian, Microsoft Research; Hongqiang Liu, Alibaba; Chuanxiong Guo, Bytedance

Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly popular. Existing cluster managers face some unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data schedulers – coupled with a consolidated job placement constraint, whereby GPUs for the same job must be allocated in as few machines as possible – cause long queueing delays and low overall performance.

We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job’s execution time is often unpredictable, we propose two scheduling algorithms – Discretized Two-Dimensional Gittins Index relies on partial information and Discretized Two-Dimensional LAS is information-agnostic – that aim to minimize the average JCT. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm to leverage these observations without any user input. Experiments on a cluster with 60 P100 GPUs – and large-scale trace-driven simulations – show that Tiresias improves the average JCT by up to 5.5× over an Apache YARN-based resource manager used in production. More importantly, Tiresias’s performance is comparable to that of solutions assuming perfect knowledge.
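
The Discretized Two-Dimensional LAS idea can be sketched as follows (illustrative only: the thresholds, job fields, and tie-breaking rule below are invented simplifications of the paper's design):

```python
# Sketch of discretized two-dimensional least-attained-service scheduling:
# a job's attained service is GPUs x executed time (the two dimensions),
# discretized into priority queues so that jobs are not constantly preempted.

QUEUE_THRESHOLDS = [100, 1000]       # attained-service boundaries (invented)

def queue_index(job):
    attained = job["gpus"] * job["executed_time"]
    for i, t in enumerate(QUEUE_THRESHOLDS):
        if attained < t:
            return i
    return len(QUEUE_THRESHOLDS)

def pick_next(jobs):
    # Highest-priority queue first; FIFO by arrival within a queue, which
    # limits preemptions compared to pure least-attained-service.
    runnable = [j for j in jobs if not j["done"]]
    return min(runnable, key=lambda j: (queue_index(j), j["arrival"]))

jobs = [
    {"name": "big",   "gpus": 16, "executed_time": 500, "arrival": 0, "done": False},
    {"name": "small", "gpus": 2,  "executed_time": 10,  "arrival": 1, "done": False},
    {"name": "new",   "gpus": 8,  "executed_time": 0,   "arrival": 2, "done": False},
]
```

A job that has consumed a lot of GPU-time sinks to a lower-priority queue, so newly arrived or lightly served jobs get ahead of it even though no job durations are known in advance.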

4:55 pm–5:10 pm

Short Break

5:10 pm–6:25 pm

Network Functions

Correctness and Performance for Stateful Chained Network Functions

Junaid Khalid and Aditya Akella, University of Wisconsin - Madison

Available Media

Network functions virtualization (NFV) allows operators to employ NF chains to realize custom policies, and dynamically add instances to meet demand or for failover. NFs maintain detailed per- and cross-flow state which needs careful management, especially during dynamic actions. Crucially, state management must: (1) ensure NF chain-wide correctness and (2) have good performance. To this end, we built CHC, an NFV framework that leverages an external state store coupled with state management algorithms and metadata maintenance for correct operation even under a range of failures. Our evaluation shows that CHC can support ~10 Gbps per-NF throughput with a <0.6 µs increase in median per-NF packet processing latency, while ensuring chain-wide correctness at little additional cost.

Performance contracts for software network functions

Rishabh Iyer, Luis Pedrosa, Arseniy Zaostrovnykh, Solal Pirelli, Katerina Argyraki, and George Candea, EPFL

While software network functions (NFs) promise great flexibility and easy deployment of network services, they face the challenge of unpredictable performance. We propose Bolt, a technique and tool for predicting the performance of the entire software stack of an NF, comprising the core NF logic, the DPDK packet processing framework, and the NIC driver. Bolt takes as input the NF implementation and generates a performance contract that provides, for any arbitrary packet scenario, a precise characterization of the NF's performance. Under the covers, Bolt leverages a state-based demarcation of NFs and combines a pre-analysis of stateful data structures with automated symbolic execution of the stateless NF code. Performance contracts allow scrutiny of NF performance with a fine level of granularity, enabling network developers and operators to understand the performance of the NF in the face of any workload, whether typical, exceptional, or adversarial. We evaluate Bolt on four realistic NFs—a NAT, a Maglev-like load balancer, an LPM Router, and a MAC bridge—and show that Bolt's performance contracts predict the dynamic instruction count and memory accesses of the NF to within a maximum of 7% of real executions, for all NFs and traffic classes analyzed.

FlowBlaze: Stateful Packet Processing in Hardware

Salvatore Pontarelli, Axbryd/CNIT; Roberto Bifulco, NEC Laboratories Europe; Marco Bonola, Axbryd/CNIT; Carmelo Cascone, Open Networking Foundation; Marco Spaziani and Valerio Bruschi, CNIT/University of Rome Tor Vergata; Davide Sanvito, Politecnico di Milano; Giuseppe Siracusano, NEC Laboratories Europe; Antonio Capone, Politecnico di Milano; Michio Honda and Felipe Huici, NEC Laboratories Europe; Giuseppe Bianchi, CNIT/University of Rome Tor Vergata

Available Media

Programmable NICs allow for better scalability to handle growing network workloads; however, providing an expressive, yet simple, abstraction to program stateful network functions in hardware remains a research challenge. We address the problem with FlowBlaze, an open abstraction for building stateful packet processing functions in hardware. The abstraction is based on Extended Finite State Machines and introduces the explicit definition of flow state, allowing FlowBlaze to leverage flow-level parallelism. FlowBlaze is expressive, supporting a wide range of complex network functions, and easy to use, hiding low-level hardware implementation issues from the programmer. Our implementation of FlowBlaze on a NetFPGA SmartNIC achieves very low latency (on the order of a few microseconds), consumes relatively little power, can hold per-flow state for hundreds of thousands of flows and yields speeds of 40 Gb/s, allowing for even higher speeds on newer FPGA models. Both hardware and software implementations of FlowBlaze are publicly available.
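
A minimal software rendition of the per-flow state-machine abstraction looks like this (the toy stateful firewall below is invented for illustration; FlowBlaze compiles such machines into hardware tables rather than Python dictionaries):

```python
# Each flow key maps to an explicit state, and packets drive
# (state, event) -> (new state, action) transitions -- the Extended Finite
# State Machine abstraction, here as a toy stateful firewall that allows
# inbound traffic only for flows first seen outbound.

TRANSITIONS = {
    ("new",         "out"): ("established", "forward"),
    ("new",         "in"):  ("new",         "drop"),
    ("established", "out"): ("established", "forward"),
    ("established", "in"):  ("established", "forward"),
}

flow_state = {}   # the explicit per-flow state the abstraction makes first-class

def process(flow_key, direction):
    state = flow_state.get(flow_key, "new")
    new_state, action = TRANSITIONS[(state, direction)]
    flow_state[flow_key] = new_state
    return action

a1 = process(("10.0.0.1", "8.8.8.8", 53), "in")    # unsolicited inbound
a2 = process(("10.0.0.2", "8.8.8.8", 53), "out")   # opens the flow
a3 = process(("10.0.0.2", "8.8.8.8", 53), "in")    # reply now allowed
```

Because each flow's machine is independent, different flows can be processed in parallel, which is the flow-level parallelism the abstract refers to.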

6:30 pm–8:00 pm

Poster Session and Reception

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers.

See the Call for Posters for information on how to submit your poster. The submission deadline is Wednesday, January 16, 2019.

Thursday, February 28

7:30 am–8:30 am

Continental Breakfast

8:30 am–10:10 am

Network Characterization

SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks

Yilong Geng, Shiyu Liu, and Zi Yin, Stanford University; Ashish Naik, Google Inc.; Balaji Prabhakar and Mendel Rosenblum, Stanford University; Amin Vahdat, Google Inc.

Network measurement and monitoring have been key to understanding the inner workings of computer networks and debugging the performance problems of distributed applications. Despite many products and much research on these topics, in the context of data centers, performing accurate measurement at scale in near real-time has remained elusive. On one hand, switch-based telemetry can give accurate per-packet views, but these must be assembled across the network and across packets to get network- and application-level insight: this is not scalable. On the other hand, purely end-host-based measurement is naturally scalable but so far has only provided partial views of in-network operation.

In this paper, we set out to push the boundary of edge-based measurement by scalably and accurately reconstructing the full queueing dynamics in the network with data gathered entirely at the transmit and receive network interface cards (NICs). We begin with a Signal Processing framework for quantifying a key trade-off: reconstruction accuracy versus the amount of data gathered. Based on this, we propose SIMON, an accurate and scalable measurement system for data centers that reconstructs key network state variables like packet queuing times at switches, link utilizations, and queue and link compositions at the flow level. We then demonstrate that the function approximation capability of multi-layered neural networks can speed up SIMON by a factor of 5,000–10,000, enabling it to run in near real-time. We deployed SIMON in three testbeds with different link speeds, layers of switching, and numbers of servers; evaluations with NetFPGAs and a cross-validation technique show that SIMON reconstructs queue lengths to within 3-5 KB and link utilizations to within less than 1% of actual. The accuracy and speed of SIMON enable sensitive A/B tests, which greatly aids the real-time development of algorithms, protocols, network software, and applications.
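
A much-simplified version of the edge-based starting point can be sketched as follows (illustrative only: SIMON's actual reconstruction is a network-wide inference problem, accelerated by neural networks; this shows just the single-path intuition):

```python
# With NIC timestamps at sender and receiver, each packet's one-way delay,
# minus the minimum delay observed on that path (propagation + transmission),
# leaves the in-network queuing delay it experienced.

def queuing_delays(tx_ts, rx_ts):
    owd = [r - t for t, r in zip(tx_ts, rx_ts)]
    baseline = min(owd)   # assumes at least one packet saw empty queues
    return [d - baseline for d in owd]

# Timestamps in microseconds (invented); packet 2 hit a 40us standing queue.
tx = [0.0, 10.0, 20.0, 30.0]
rx = [5.0, 15.0, 65.0, 35.0]
qd = queuing_delays(tx, rx)
```

Combining such per-path signals from many NIC pairs into per-switch queue estimates is the harder reconstruction problem the paper actually solves.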

Is advance knowledge of flow sizes a plausible assumption?

Vojislav Dukic, ETH Zurich; Sangeetha Abdu Jyothi, University of Illinois at Urbana–Champaign; Bojan Karlas, Muhsen Owaida, Ce Zhang, and Ankit Singla, ETH Zurich

Recent research has proposed several packet, flow, and coflow scheduling methods that could substantially improve performance for data center workloads. Most of this work assumes advance knowledge of flow sizes, but the lack of a clear path to obtaining such knowledge has also prompted some work on non-clairvoyant scheduling, albeit with more limited performance benefits and narrower applicability.

We thus investigate whether flow sizes can be known in advance in practice, using both simple heuristics and learning methods. Our systematic and substantial efforts across these approaches for estimating flow sizes indicate, unfortunately, that such knowledge is likely hard to obtain with high confidence across many settings of practical interest. However, our prognosis is ultimately more positive: even simple heuristics can help estimate flow sizes for many flows, and this partial knowledge has utility even in schedulers designed for fully clairvoyant operation. These results indicate that a presumed lack of advance knowledge of flow sizes is not necessarily prohibitive for highly efficient scheduling, and suggest further exploration in two directions: (a) scheduling under partial knowledge; and (b) evaluating the practical payoff and expense of obtaining more knowledge.

Stable and Practical AS Relationship Inference with ProbLink

Yuchen Jin, University of Washington; Colin Scott, UC Berkeley; Amogh Dhamdhere, CAIDA; Vasileios Giotsas, Lancaster University; Arvind Krishnamurthy, University of Washington; Scott Shenker, UC Berkeley, ICSI

Knowledge of the business relationships between Autonomous Systems (ASes) is essential to understanding the behavior of the Internet routing system. Despite significant progress in the development of sophisticated relationship inference algorithms, the resulting datasets are impractical for many critical real-world applications, cannot offer adequate predictability in the configuration of routing policies, and suffer from inference oscillations. To achieve more practical and stable relationship inferences we first illuminate the root causes of the contradictions between these shortcomings and the near-perfect validation results of AS-Rank, the state-of-the-art relationship inference algorithm. Using a "naive" inference approach as a benchmark, we find that the available validation datasets over-represent AS links with easier inference requirements. We identify which types of links are harder to infer, and we develop appropriate validation subsets to enable more representative evaluation.

We then develop a probabilistic algorithm, ProbLink, to overcome the inference barriers for hard links, such as non-valley-free routing, limited visibility, and non-conventional peering practices. To this end, we identify key interconnection features that provide stochastically informative and highly predictive relationship inference signals. Compared to AS-Rank, our approach reduces the error rate for all links by 1.6×, and importantly, by up to 6.1× for different types of hard links. We demonstrate the practical significance of our improvements by evaluating their impact on three applications. Compared to the current state-of-the-art, ProbLink increases the precision and recall of route leak detection by 4.1× and 3.4× respectively, reveals 27% more complex relationships, and increases the precision of predicting the impact of selective advertisements by 34%.

NetBouncer: Active Device and Link Failure Localization in Data Center Networks

Cheng Tan, NYU; Ze Jin, Cornell University; Chuanxiong Guo, Bytedance; Tianrong Zhang, Microsoft; Haitao Wu, Google; Karl Deng, Dongming Bi, and Dong Xiang, Microsoft

Available Media

The availability of data center services is jeopardized by various network incidents. One of the biggest challenges for network incident handling is to accurately localize the failures, among millions of servers and tens of thousands of network devices. In this paper, we propose NetBouncer, a failure localization system that leverages the IP-in-IP technique to actively probe paths in a data center network. NetBouncer provides a complete failure localization framework which is capable of detecting both device and link failures. It further introduces an algorithm for high-accuracy link failure inference that is resilient to real-world data inconsistency by integrating both our troubleshooting domain knowledge and machine learning techniques. NetBouncer has been deployed in Microsoft Azure's data centers for three years. In practice, it has produced no false positives and only a few false negatives so far.
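
The basic inference step behind active path probing can be sketched in a few lines (a deliberately naive version: the topology below is invented, and NetBouncer's real algorithm reasons statistically about noisy, inconsistent probe data rather than treating probes as perfectly reliable):

```python
# Naive link-failure localization from path probes: every link on a
# successful probe path is known-good; the suspects are the links that
# appear on failed paths but on no successful one.

def localize(paths, results):
    """paths: list of link tuples; results: True if the probe succeeded."""
    good = set()
    for links, ok in zip(paths, results):
        if ok:
            good.update(links)
    suspects = set()
    for links, ok in zip(paths, results):
        if not ok:
            suspects.update(l for l in links if l not in good)
    return suspects

# Invented 3-probe example over a small Clos-like topology.
paths = [
    ("s1-t1", "t1-c1", "c1-t2", "t2-s2"),   # fails
    ("s1-t1", "t1-c2", "c2-t2", "t2-s2"),   # succeeds
    ("s3-t1", "t1-c1", "c1-t2", "t2-s4"),   # fails
]
bad = localize(paths, [False, True, False])
```

Note that links never covered by a successful probe stay in the suspect set; choosing probe paths so that every link is well covered, and tolerating probes that lie, is where the real system's difficulty lies.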

10:10 am–10:40 am

Break with Refreshments

10:40 am–12:20 pm

Privacy and Security

Riverbed: Enforcing User-defined Privacy Constraints in Distributed Web Services

Frank Wang, MIT CSAIL; Ronny Ko and James Mickens, Harvard University

Available Media

Riverbed is a new framework for building privacy-respecting web services. Using a simple policy language, users define restrictions on how a remote service can process and store sensitive data. A transparent Riverbed proxy sits between a user's front-end client (e.g., a web browser) and the back-end server code. The back-end code remotely attests to the proxy, demonstrating that the code respects user policies; in particular, the server code attests that it executes within a Riverbed-compatible managed runtime that uses information flow control (IFC) to enforce user policies. If attestation succeeds, the proxy releases the user's data, tagging it with the user-defined policies. On the server side, the Riverbed runtime places all data with compatible policies into the same universe (i.e., the same isolated instance of the full web service). The universe mechanism allows Riverbed to work with unmodified, legacy software; unlike prior IFC systems, Riverbed does not require developers to reason about security lattices, or manually annotate code with labels. Riverbed imposes only modest performance overheads, with worst-case slowdowns of 10% for several real applications.

Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs

Xiang Wang, Yang Hong, and Harry Chang, Intel; KyoungSoo Park, KAIST; Geoff Langdale, branchfree.org; Jiayu Hu and Heqing Zhu, Intel

Regular expression matching serves as the key functionality of modern network security applications that detect malicious traffic. Unfortunately, it often becomes the performance bottleneck as it involves compute-intensive scan of every byte of packet payload. With trends towards a larger ruleset with more complex patterns, the performance requirement gets ever more demanding.

In this paper, we present Hyperscan, a high performance regular expression matcher for commodity server machines. Hyperscan employs two core techniques for efficient pattern matching. First, it exploits graph decomposition that translates regular expression matching into a series of string and finite automata matching. Unlike existing solutions, string matching becomes a part of regular expression matching, eliminating duplicate matching operations. Decomposed regular expression components also increase the chance of fast DFA matching as they tend to be smaller than the original pattern. Second, Hyperscan accelerates both string and finite automata matching by SIMD operations, which brings substantial throughput improvement. Our evaluation shows that Hyperscan improves the performance of Snort by a factor of 3.8 and 47.5 for a real traffic trace.
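
The decomposition idea can be sketched with Python's standard library (illustrative only: the pattern is invented, and `str.find`/`re` stand in for Hyperscan's SIMD string engines and decomposed automata):

```python
# Sketch of regex decomposition: extract a literal string factor from the
# pattern, find it with a fast string search, and run the (smaller) anchored
# automaton only at candidate offsets instead of at every byte.
import re

def decomposed_scan(payload, literal, tail_pattern):
    tail = re.compile(tail_pattern)
    matches, start = [], 0
    while True:
        i = payload.find(literal, start)          # fast path: string match
        if i < 0:
            return matches
        m = tail.match(payload, i + len(literal)) # small anchored FA match
        if m:
            matches.append((i, m.end()))
        start = i + 1

# The pattern /User-Agent: [a-zA-Z0-9\/.]+/ decomposes into the literal
# "User-Agent: " plus a small tail automaton.
hits = decomposed_scan("GET / HTTP/1.1\r\nUser-Agent: curl/7.58\r\n",
                       "User-Agent: ", r"[a-zA-Z0-9/.]+")
```

The string search never false-negatives and rarely false-positives, so the expensive automaton runs only on a tiny fraction of the payload; decomposed components are also smaller than the original pattern, making DFA (rather than NFA) execution more likely.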

Deniable Upload and Download via Passive Participation

David Sommer, Aritra Dhar, Luka Malisa, and Esfandiar Mohammadi, ETH Zurich; Daniel Ronzani, Ronzani Schlauri Attorneys; Srdjan Capkun, ETH Zurich

Available Media

Downloading or uploading controversial information can put users at risk, making them hesitant to access or share such information. While anonymous communication networks (ACNs) are designed to hide communication meta-data, merely connecting to an ACN can raise suspicion. In order to enable plausible deniability while providing or accessing controversial information, we design CoverUp: a system that enables users to asynchronously upload and download data. The key idea is to involve visitors from a collaborating website. This website serves a JavaScript snippet, which, after the user's consent, produces cover traffic for the controversial site/content. This cover traffic is indistinguishable from the traffic of participants interested in the controversial content; hence, they can deny that they actually up- or downloaded any data.

CoverUp provides a feed-receiver that achieves a downlink rate of 10 to 50 Kbit/s. The indistinguishability guarantee of the feed-receiver holds against strong global network-level attackers who control everything except for the user's machine. We extend CoverUp to a full upload and download system with a rate of 10 up to 50 Kbit/s. In this case, we additionally need the integrity of the JavaScript snippet, for which we introduce a trusted party. The analysis of our prototype shows a very small timing leakage, even after half a year of continual observation. Finally, as passive participation raises ethical and legal concerns for the collaborating websites and the visitors of the collaborating website, we discuss these concerns and describe how they can be addressed.

CAUDIT: Continuous Auditing of SSH-Servers To Mitigate Brute-Force Attacks

Phuong Cao, Yuming Wu, and Subho Banerjee, UIUC; Alex Withers and Justin Azoff, NCSA; Zbigniew Kalbarczyk and Ravishankar Iyer, UIUC

This paper describes CAUDIT, an operational system deployed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. CAUDIT is a fully automated system to enable the identification and exclusion of hosts that are vulnerable to SSH brute-force attacks. Its key features include: 1) a honeypot for attracting SSH-based attacks over a /16 IP address range and extracting key metadata (e.g., source IP, password, SSH-client version, or -key) from these attacks; 2) executing audits on the live production network by replaying attack attempts recorded by the honeypot; 3) using the IP addresses recorded by the honeypot to block SSH attack attempts at the network border using a Black Hole Router (BHR) while significantly reducing the load on NCSA's security monitoring system; and 4) informing peer sites of attack attempts in real-time to ensure containment of coordinated attacks. The system is composed of existing techniques with custom-built components, and its novelty is to execute at a scale that has not been validated earlier (thousands of nodes and tens of millions of attack attempts per day). Experience over 463 days shows that CAUDIT successfully blocks an average of 57 million attack attempts on a daily basis using the proposed BHR. This represents a 66× reduction in the number of SSH attempts compared to the daily average and has reduced traffic to the NCSA internal network-security-monitoring infrastructure by 78%.

12:20 pm–1:50 pm

Lunch (on your own)

1:50 pm–3:05 pm

Network Modeling

Dataplane equivalence and its applications

Dragos Dumitrescu, Radu Stoenescu, Matei Popovici, Lorina Negreanu, and Costin Raiciu, University Politehnica of Bucharest

Network verification promises to find rare bugs in networks, but using it requires that administrators (completely) characterize the expected behavior of the network in formal languages such as Datalog or CTL. The difficulty of this task hampers wide deployment of verification. We propose to use equivalence between different network dataplanes as an implicit, simpler way to specify the required correctness properties. While equivalence is a well-known undecidable problem for general-purpose programs, we show that for network dataplanes without infinite loops it is decidable and can be checked efficiently. We present netdiff, an algorithm that checks the equivalence of two network dataplanes implemented in the SEFL language by using symbolic execution [11]. We implement netdiff and use it to catch a variety of bugs in OpenStack Neutron, P4 programs, and network dataplane updates. Our evaluation highlights that equivalence is an easy way to find bugs, scales well to relatively large programs, and discovers subtle issues that are otherwise difficult to find.
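To give a flavor of dataplane equivalence checking: netdiff uses symbolic execution over SEFL, but the core idea can be mimicked by comparing two dataplanes over a small, exhaustively enumerated header space. The example dataplanes and names below are ours, not the paper's:

```python
# Two toy dataplanes that should forward identically: one applies NAT before
# an ACL, the other applies the ACL first. Equivalence means identical output
# (or an identical drop) for every packet in the header space.
def dataplane_a(pkt):
    if pkt["dst"] == 10:                 # NAT: rewrite dst 10 -> 20
        pkt = {**pkt, "dst": 20}
    return pkt if pkt["dst"] != 99 else None   # ACL: drop dst 99

def dataplane_b(pkt):
    if pkt["dst"] == 99:                 # ACL first
        return None
    if pkt["dst"] == 10:                 # then NAT
        return {**pkt, "dst": 20}
    return pkt

def counterexamples(dp1, dp2, header_space):
    """Return all headers on which the two dataplanes disagree."""
    return [h for h in header_space if dp1(dict(h)) != dp2(dict(h))]

space = [{"dst": d} for d in range(128)]
diffs = counterexamples(dataplane_a, dataplane_b, space)   # empty: equivalent
```

A real checker replaces the brute-force enumeration with symbolic packets, so the comparison covers the full header space without enumerating it.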

Alembic: Automated Model Inference for Stateful Network Functions

Soo-Jin Moon, Carnegie Mellon University; Jeffrey Helt, Princeton University; Yifei Yuan, Intentionet; Yves Bieri, ETH Zurich; Sujata Banerjee, VMWare; Vyas Sekar, Carnegie Mellon University; Wenfei Wu, Tsinghua University; Mihalis Yannakakis, Columbia University; Ying Zhang, Facebook

Network operators need to trust and verify correct network behaviors even when they deploy proprietary network functions (NFs) in their environments. Typically, only the binary executable with configuration interfaces and operational manuals is provided by the NF vendor. It is challenging for operators to ensure correct operation and debug problems with complex stateful NFs under a variety of configurations and deployment scenarios. Thus, offline NF testing and verification tools are used that typically employ models of NFs. However, the success of this process depends on the fidelity and accuracy of the NF models used. Today, such models are handwritten, which is both tedious and error prone, and they do not account for implementation artifacts. To address this gap, our goal is to automatically infer the behavioral model of stateful NFs for a given configuration with high fidelity. This goal presents several challenges, because NF configurations can contain a diverse set of rule types and the state space of dynamic and stateful NF behaviors to be explored is enormous. In this paper, we design Alembic, which views an NF as an ensemble of finite-state machines (FSMs) and infers an FSM for each rule type in the configuration. For scalability, Alembic consists of an offline stage that learns symbolic finite-state machine representations for each rule type and a fast online stage that quickly generates a concrete behavioral model for a new configuration using these learned models. We demonstrate that Alembic is accurate and scalable. It uncovers key differences across NF implementations and enables more accurate network verification.
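As a minimal illustration of the "NF as an ensemble of FSMs" view, here is a hand-written FSM for one stateful firewall rule type. Alembic's contribution is to *infer* such machines automatically from the binary; this hand-built table (with our own state and event names) only shows what the inferred artifact looks like:

```python
# One stateful firewall rule ("allow outbound; allow only related inbound")
# modeled as a finite-state machine: (state, event) -> (next_state, action).
FSM = {
    ("closed", "out_syn"): ("open",   "forward"),
    ("closed", "in_syn"):  ("closed", "drop"),     # unsolicited inbound dropped
    ("open",   "in_syn"):  ("open",   "forward"),  # return traffic allowed
    ("open",   "out_syn"): ("open",   "forward"),
}

def run_nf(events):
    """Drive the per-connection FSM over a packet-event trace."""
    state, actions = "closed", []
    for ev in events:
        state, action = FSM[(state, ev)]
        actions.append(action)
    return actions

# Unsolicited inbound packet is dropped; after an outbound SYN opens the
# connection, inbound return traffic is forwarded.
trace = run_nf(["in_syn", "out_syn", "in_syn"])
```

An ensemble of such machines, one per rule type in the configuration, forms the behavioral model that a verifier can then consume.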

Model-Agnostic and Efficient Exploration of Numerical State Space of Real-World TCP Congestion Control Implementations

Wei Sun and Lisong Xu, UNL; Sebastian Elbaum, University of Virginia; Di Zhao, UNL

The significant impact of TCP congestion control on the Internet highlights the importance of testing the correctness and performance of congestion control algorithm implementations (CCAIs) in various network environments. Many CCAI testing questions can be answered by exploring the numerical state space of a CCAI, which is defined by a group of numerical (and non-numerical) state variables of the CCAI. However, current practices for automated numerical state space exploration are either limited by approximate abstract CCAI models or inefficient due to the large space of network environment parameters and the complicated relation between CCAI states and network environment parameters. In this paper, we propose an automated numerical state space exploration method, called ACT, which leverages the model-agnostic nature of random testing and greatly improves its efficiency by guiding random testing with feedback obtained iteratively during a test. Our experiments on five representative Linux TCP CCAIs show that ACT explores a large numerical state space more efficiently than manual testing, undirected random testing, and symbolic-execution-based testing, without requiring any abstract CCAI models. ACT successfully detects multiple implementation bugs and design issues in these Linux TCP CCAIs, including new bugs and issues not reported before.
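The feedback-guided random testing idea can be sketched as follows. This is our own simplified model, not ACT's implementation: inputs (network-environment parameters) that drive the system under test into a previously unseen state are kept in a corpus and mutated further, instead of sampling uniformly on every trial:

```python
# Sketch of feedback-guided random testing: mutate inputs that reached new
# states of the system under test, so exploration concentrates on promising
# regions of the parameter space.
import random

def guided_random_test(step, n_iters=2000, seed=0):
    rng = random.Random(seed)
    seen = set()                       # distinct states observed so far
    corpus = [(0, 0)]                  # corpus of (loss_rate%, delay_ms) inputs
    for _ in range(n_iters):
        base = rng.choice(corpus)
        # Mutate a promising input rather than sampling uniformly at random
        cand = (max(0, base[0] + rng.randint(-5, 5)),
                max(0, base[1] + rng.randint(-20, 20)))
        state = step(cand)             # run the model, observe its state
        if state not in seen:          # feedback: new state => keep this input
            seen.add(state)
            corpus.append(cand)
    return seen

# A toy stand-in for a congestion-control implementation: its "state" is a
# congestion window that depends on loss rate and delay.
def toy_cca(params):
    loss, delay = params
    return max(1, 100 - 2 * loss - delay // 10)

states = guided_random_test(toy_cca)
```

The real system replaces `toy_cca` with an actual Linux CCAI running in an emulated network, and the "state" is the tuple of numerical CCAI variables.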

3:05 pm–3:35 pm

Break with Refreshments

3:35 pm–5:15 pm

Wireless Applications

Scaling Community Cellular Networks with CommunityCellularManager

Shaddi Hasan, UC Berkeley; Claire Barela, University of the Philippines Diliman; Matthew Johnson, University of Washington; Eric Brewer, UC Berkeley; Kurtis Heimerl, University of Washington

Hundreds of millions of people still live beyond the coverage of basic mobile connectivity, primarily in rural areas with low population density. Mobile-network operators (MNOs) traditionally struggle to justify expansion into these rural areas due to the high infrastructure costs necessary to provide service. Community cellular networks, networks built "by and for" the people they serve, represent an alternative model that, to an extent, bypasses these business case limitations and enables sustainable rural coverage. Yet despite aligned economic incentives, real deployments of community cellular networks still face significant regulatory, commercial and technical challenges.

In this paper, we present CommunityCellularManager (CCM), a system for operating community cellular networks at scale. CCM enables multiple community networks to operate under the control of a single, multi-tenant controller and in partnership with a traditional MNO. CCM preserves flexibility for each community network to operate independently, while allowing the mobile network operator to safely make critical resources such as spectrum and phone numbers available to these networks. We evaluate CCM through a multi-year, large-scale community cellular network deployment in the Philippines in partnership with a traditional MNO, providing basic communication services to over 2,000 people in 15 communities without requiring changes to the existing regulatory framework, and using existing handsets. We demonstrate that CCM can support independent community networks with unique service offerings and operating models while providing a basic level of MNO-defined service. To our knowledge, this represents the largest deployment of community cellular networks to date.

TrackIO: Tracking First Responders Inside-Out

Ashutosh Dhekne, University of Illinois, Urbana-Champaign; Ayon Chakraborty, Karthikeyan Sundaresan, and Sampath Rangarajan, NEC Labs America

First responders, a critical lifeline of any society, often find themselves in precarious situations. The ability to track them in real time in unknown indoor environments contributes significantly to the success of their mission as well as to their safety. In this work, we present the design, implementation, and evaluation of TrackIO, a novel system that is capable of accurately localizing and tracking mobile responders in real time in large indoor environments.

TrackIO leverages the mobile virtual infrastructure offered by unmanned aerial vehicles (UAVs), coupled with the balanced penetration-accuracy tradeoff offered by ultra-wideband (UWB), to accomplish this objective directly from outside, without relying on access to any indoor infrastructure. Towards a practical system, TrackIO incorporates four novel mechanisms in its design that address key challenges in tracking responders (i) who are mobile with potentially non-uniform velocities (e.g., during turns), (ii) deep indoors with challenged reachability, (iii) in real time even for a large network, and (iv) with high accuracy even when impacted by the UAV's own position error. TrackIO's real-world performance reveals that it can track static nodes with a median accuracy of about 1-2 m and mobile (even running) nodes with a median accuracy of 2-2.5 m in large buildings in real time.
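To give a flavor of the range-based localization underlying such a system, here is a minimal 2D trilateration sketch: given three anchor positions (the UAVs' role) and a UWB range to each, subtracting one range equation from the other two yields a linear system for the target position. This is a textbook linearization, not TrackIO's algorithm, which additionally handles mobility, deep-indoor reachability, and UAV position error:

```python
import math

def multilaterate_2d(anchors, ranges):
    """Solve for (x, y) from three anchors and measured ranges by
    linearizing: subtracting the first range equation from the others
    cancels the quadratic terms, leaving a 2x2 linear system."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = ranges
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21          # Cramer's rule for the 2x2 system
    return ((b1 * a22 - b2 * a12) / det,
            (a11 * b2 - a21 * b1) / det)

anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
target = (3.0, 4.0)                      # ground truth, to generate ranges
ranges = [math.dist(target, a) for a in anchors]
x, y = multilaterate_2d(anchors, ranges)  # recovers (3.0, 4.0)
```

With noisy real-world ranges and more than three anchors, the same linear system is solved in a least-squares sense instead of exactly.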

3D Backscatter Localization for Fine-Grained Robotics

Zhihong Luo, Qiping Zhang, Yunfei Ma, Manish Singh, and Fadel Adib, MIT Media Lab

This paper presents the design and implementation of TurboTrack, a 3D localization system for fine-grained robotic tasks. TurboTrack's unique capability is that it can localize backscatter nodes with sub-centimeter accuracy without any constraints on their locations or mobility. TurboTrack makes two key technical contributions. First, it presents a pipelined architecture that can extract a sensing bandwidth from every single backscatter packet that is three orders of magnitude larger than the backscatter communication bandwidth. Second, it introduces a Bayesian space-time super-resolution algorithm that combines time series of the sensed bandwidth across multiple antennas to enable accurate positioning. Our experiments show that TurboTrack simultaneously achieves sub-centimeter median accuracy in each of the x/y/z dimensions and a 99th-percentile latency of less than 7.5 milliseconds in 3D localization. This enables TurboTrack's real-time prototype to achieve fine-grained positioning for agile robotic tasks, as we demonstrate in multiple collaborative applications with robotic arms and nanodrones, including indoor tracking, packaging, assembly, and handover.

Many-to-Many Beam Alignment in Millimeter Wave Networks

Suraj Jog, Jiaming Wang, Junfeng Guan, Thomas Moon, Haitham Hassanieh, and Romit Roy Choudhury, UIUC

Millimeter Wave (mmWave) networks can deliver multi-Gbps wireless links that use extremely narrow directional beams. This provides us with a new opportunity to exploit spatial reuse in order to scale network throughput. Exploiting such spatial reuse, however, requires aligning the beams of all nodes in a network, a difficult process that is complicated both by indoor multipath, which can create interference, and by the inefficiency of carrier sense at detecting interference in directional links. This paper presents BounceNet, the first many-to-many millimeter wave beam alignment protocol that can exploit dense spatial reuse to allow many links to operate in parallel in a confined space and scale the wireless throughput with the number of clients. Results from three millimeter wave testbeds show that BounceNet can scale the throughput with the number of clients to deliver a total network data rate of more than 39 Gbps for 10 clients, which is up to 6.6× higher than current 802.11 mmWave standards.