All sessions will be held in the Santa Clara Ballroom unless otherwise noted.
NSDI '20 Program Grid
View the program in mobile-friendly grid format.
See the Program at a Glance for a quick overview of what's happening at NSDI '20.
Papers and Proceedings
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author[s].
Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)
Monday, February 24
7:00 pm–10:00 pm
NSDI '20 Preview Session
Are you new to NSDI? Are you a networking expert but feel bewildered when talk turns to security? Are you interested in engaging more deeply with paper presentations outside your research area? Join us for the NSDI preview session, where area experts will give short introductions to the Symposium's major technical sessions.
- Network Stacks, RMDA, and Protocols: Brent Stephens, University of Illinois at Chicago
- Datacenter Network Architectures: Prateesh Goyal, Massachusetts Institute of Technology
- OS, Storage, and Hardware: Amy Ousterhout, University of California, Berkeley
- Monitoring and Diagnosis: Behnaz Arzani, Microsoft Research
- Network Characterization and Measurement: Robert Beverly, Naval Postgraduate School
- Security: Raluca Ada Popa, University of California, Berkeley
8:30 am–8:40 am
Opening Remarks and Awards
Program Co-Chairs: Ranjita Bhagwan, Microsoft Research India, and George Porter, University of California, San Diego
8:40 am–10:00 am
Datacenter Networking 1
Session Chair: Aditya Akella, University of Wisconsin—Madison
William M. Mellette, Rajdeep Das, Yibo Guo, Rob McGuinness, Alex C. Snoeren, and George Porter, University of California San Diego
Datacenters need networks that support both low-latency and high-bandwidth packet delivery to meet the stringent requirements of modern applications. We present Opera, a dynamic network that delivers latency-sensitive traffic quickly by relying on multi-hop forwarding in the same way as expander-graph-based approaches, but provides near-optimal bandwidth for bulk flows through direct forwarding over time-varying source-to-destination circuits. Unlike prior approaches, Opera requires no separate electrical network and no active circuit scheduling. The key to Opera's design is the rapid and deterministic reconfiguration of the network, piece-by-piece, such that at any moment in time the network implements an expander graph, yet, integrated across time, the network provides bandwidth-efficient single-hop paths between all racks. We show that Opera supports low-latency traffic with flow completion times comparable to cost-equivalent static topologies, while delivering up to 4x the bandwidth for all-to-all traffic and supporting up to 60% higher load for published datacenter workloads.
Wenxue Cheng and Kun Qian, Tsinghua University and Beijing National Research Center for Information Science and Technology (BNRist); Wanchun Jiang, Central South University; Tong Zhang, Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist), and Nanjing University of Aeronautics and Astronautics; Fengyuan Ren, Tsinghua University and Beijing National Research Center for Information Science and Technology (BNRist)
The lossless Ethernet is attractive for data centers and cluster systems, but various performance issues, such as unfairness, head-of-line blocking and congestion spreading, etc., impede its large-scale deployment in production systems. Through fine-grained experimental observations, we inspect the interactions between flow control and congestion control, and are aware that the radical cause of performance problems is the ineffective elements in the congestion management architecture for lossless Ethernet, including the improper congestion detection mechanism and inadequate rate adjustment law.
Inspired by these insights and findings obtained in experiment investigations, we revise the congestion management architecture, and propose the Photonic Congestion Notification (PCN) scheme, which consists of two basic components: (i) a novel congestion detection and identification mechanism to recognize which flows are really responsible for congestion; (ii) a receiver-driven rate adjustment method to alleviate congestion in as short as 1 RTT. We implement PCN using DPDK NICs and conduct evaluations using testbed experiments and simulations. The results show that PCN greatly improves performance under concurrent burst workload, and significantly mitigates PFC PAUSE messages and reduces the flow completion time under realistic workload.
Saurabh Jha and Archit Patke, University of Illinois at Urbana-Champaign; Jim Brandt and Ann Gentile, Sandia National Lab; Benjamin Lim, University of Illinois at Urbana-Champaign; Mike Showerman and Greg Bauer, National Center for Supercomputing Applications; Larry Kaplan, Cray Inc.; Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign; William Kramer, University of Illinois at Urbana-Champaign and National Center for Supercomputing Applications; Ravi Iyer, University of Illinois at Urbana-Champaign
While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.
Albert Gran Alcoz, Alexander Dietmüller, and Laurent Vanbever, ETH Zürich
Push-In First-Out (PIFO) queues are hardware primitives which enable programmable packet scheduling by allowing to perfectly reorder packets at line rate. While promising, implementing PIFO queues in hardware and at scale is not easy: only hardware designs (not implementations) exist and they can only support about 1000 flows.
In this paper, we introduce SP-PIFO, a programmable packet scheduler which closely approximates the behavior of PIFO queues using strict-priority queues—at line rate, at scale, and on existing devices. The key insight behind SP-PIFO is to dynamically adapt the mapping between packet ranks and available queues to minimize the scheduling errors. We present a mathematical formulation of the problem and derive an adaptation technique which closely approximates the optimal queue mapping without any traffic knowledge.
We fully implement SP-PIFO in P4 and evaluate it on real workloads. We show that SP-PIFO: (i) closely matches ideal PIFO performance, with as little as 8 priority queues; (ii) arbitrarily scales to large amount of flows and ranks; and (iii) quickly adapts to traffic variations. We also show that SP-PIFO runs at line rate on existing programmable data planes.
10:00 am–10:30 am
Break with Coffee and Tea
10:30 am–12:30 pm
OS, Storage, and Hardware
Session Chair: Rachit Agarwal, Cornell University
YoungGyoun Moon and SeungEon Lee, KAIST; Muhammad Asim Jamshed, Intel Labs; KyoungSoo Park, KAIST
The performance of modern key-value servers or layer-7 load balancers often heavily depends on the efficiency of the underlying TCP stack. Despite numerous optimizations such as kernel-bypassing and zero-copying, performance improvement with a TCP stack is fundamentally limited due to the protocol conformance overhead for compatible TCP operations. Unfortunately, the protocol conformance overhead amounts to as large as 60% of the entire CPU cycles for short-lived connections or degrades the performance of L7 proxying by 3.2x to 6.3x.
This work presents AccelTCP, a hardware-assisted TCP stack architecture that harnesses programmable network interface cards (NICs) as a TCP protocol accelerator. AccelTCP can offload complex TCP operations such as connection setup and teardown completely to NIC, which simplifies the host stack operations and frees a significant amount of CPU cycles for application processing. In addition, it supports running connection splicing on NIC so that the NIC relays all packets of the spliced connections with zero DMA overhead. Our evaluation shows that AccelTCP enables short-lived connections to perform comparably to persistent connections. It also improves the performance of Redis, a popular in-memory key-value store, and HAProxy, a widely-used layer-7 load balancer, by 2.3x and 11.9x, respectively.
Mina Tahmasbi Arashloo and Alexey Lavrov, Princeton University; Manya Ghobadi, MIT; Jennifer Rexford, David Walker, and David Wentzlaff, Princeton University
Data-center network stacks are moving into hardware to achieve 100 Gbps data rates and beyond at low latency and low CPU utilization. However, hardwiring the network stack in the NIC would stifle innovation in transport protocols. In this paper, we enable programmable transport protocols in high-speed NICs by designing Tonic, a flexible hardware architecture for transport logic. At 100 Gbps, transport protocols must generate a data segment every few nanoseconds using only a few kilobits of per-flow state on the NIC. By identifying common patterns across transport logic of different transport protocols, we design an efficient hardware "template" for transport logic that satisfies these constraints while being programmable with a simple API. Experiments with our FPGA-based prototype show that Tonic can support the transport logic of a wide range of protocols and meet timing for 100 Gbps of back-to-back 128-byte packets. That is, every 10 ns, our prototype generates the address of a data segment for one of more than a thousand active flows for a downstream DMA pipeline to fetch and transmit a packet.
Jian Yang, UC San Diego; Joseph Izraelevitz, University of Colorado, Boulder; Steven Swanson, UC San Diego
The emergence of dense, byte-addressable nonvolatile main memories (NVMMs) allows application developers to combine storage and memory into a single layer. With NVMMs, servers can equip terabytes of memory that survive power outages, and all of this persistent capacity can be managed through a specialized NVMM file system. NVMMs appear to mesh perfectly with another popular technology, remote direct memory access (RDMA). RDMA gives a client direct access to memory on a remote machine and mediates this access through a memory region abstraction that handles the necessary translations and permissions.
NVMM and RDMA seem eminently compatible: by combining them, we should be able to build network-attached, byte-addressable, persistent storage. Unfortunately, however, the systems were not designed to work together. AnNVMM-aware file system manages persistent memory as files, whereas RDMA uses memory regions to organize remotely accessible memory. As a result, in practice, building RDMA-accessible NVMMs requires expensive translation layers resulting from this duplication of effort that spans permissions, naming, and address translation.
This work introduces two changes to the existing RDMA protocol: the file memory region (FileMR) and range based address translation. These optimizations create an abstraction that combines memory regions and files: a client can directly access a file backed by NVMM file system through RDMA, addressing its contents via file offsets. By eliminating redundant translations, it minimizes the amount of translations done at the NIC, reducing the load on the NIC's translation cache and increasing the hit rate by 3.8X -340X and resulting in application performance improvement by 1.8X - 2.0X.
Jaehyun Hwang, Qizhe Cai, Ao Tang, and Rachit Agarwal, Cornell University
This paper presents design, implementation and evaluation of i10, a new remote storage stack implemented entirely in the kernel. i10 runs on commodity hardware, allows unmodified applications to operate directly on kernel's TCP/IP network stack, and yet, saturates a 100Gbps link for remote accesses using CPU utilization similar to state-of-the-art user-space and RDMA-based solutions.
Yohei Kuga and Ryo Nakamura, The University of Tokyo; Takeshi Matsuya, Keio University; Yuji Sekiya, The University of Tokyo
Observability on data communication is always essential for prototyping, developing, and optimizing communication systems. However, it is still challenging to observe transactions flowing inside PCI Express (PCIe) links despite them being a key component for emerging peripherals such as smart NICs, NVMe, and accelerators. To offer the practical observability on PCIe and for productively prototyping PCIe devices, we propose NetTLP, a development platform for software PCIe devices that can interact with hardware root complexes. On the NetTLP platform, software PCIe devices on top of IP network stacks can send and receive Transaction Layer Packets (TLPs) to and from hardware root complexes or other devices through Ethernet links, an Ethernet and PCIe bridge called a NetTLP adapter, and PCIe links. This paper describes the NetTLP platform and its implementation: the NetTLP adapter and LibTLP, which is a software implementation of the PCIe transaction layer. Moreover, this paper demonstrates the usefulness of NetTLP through three use cases: (1) observing TLPs sent from four commercial PCIe devices, (2) 400 LoC software Ethernet NIC implementation that performs an actual NIC for a hardware root complex, and (3) physical memory introspection.
Muhammed Uluyol, Anthony Huang, Ayush Goel, Mosharaf Chowdhury, and Harsha V. Madhyastha, University of Michigan
By replicating data across sites in multiple geographic regions, web services can maximize availability and minimize latency for their users. However, when sacrificing data consistency is not an option, we show that service providers have to today incur significantly higher cost to meet desired latency goals than the lowest cost theoretically feasible. We show that the key to addressing this sub-optimality is to 1) allow for erasure coding, not just replication, of data across data centers, and 2) mitigate the resultant increase in read and write latencies by rethinking how to enable consensus across the wide-area network. Our extensive evaluation mimicking web service deployments on the Azure cloud service shows that we enable near-optimal latency versus cost tradeoffs.
12:30 pm–1:50 pm
Symposium Luncheon and Test of Time Award Presentation
1:50 pm–3:30 pm
Session Chair: Cole Schlesinger, Barefoot Networks
Yifei Yuan, Intentionet; Soo-Jin Moon, Sahil Uppal, Limin Jia, and Vyas Sekar, Carnegie Mellon University
Modern networks enforce rich and dynamic policies (e.g., dynamic service chaining and path pinning) over a number of complex and stateful NFs (e.g., stateful firewall and load balancer). Verifying if those policies are correctly implemented is important to ensure the network’s availability, safety, and security. Unfortunately, theoretical results suggest that verifying even simple policies (e.g., A cannot talk to B) in stateful networks is undecidable. Consequently, any approach for stateful network verification has to fundamentally make some relaxations; e.g., either on policies supported, or the network behaviors it can capture, or in terms of the soundness/completeness guarantees. In this paper, we identify practical opportunities for relaxations in order to develop an efficient verification tool. First, we identify key domain-specific insights to develop a more compact network semantic model which is equivalent to a general semantic model for checking a wide range of policies under practical conditions. Second, we identify a restrictive-yet-expressive policy language to support a wide range of policies including dynamic service chaining and path pinning while enable efficient verification. Third, we develop customized symbolic model checking algorithms as our model and policy specification allows us to succinctly encode network states using existential first-order logic, which enables efficient checking algorithms. We prove the correctness of our approach for a subset of policies and show that our tool, NetSMC, achieves orders of magnitude speedup compared to existing approaches.
Anubhavnidhi Abhashkumar, University of Wisconsin - Madison; Aaron Gember-Jacobson, Colgate University; Aditya Akella, University of Wisconsin - Madison
Today's distributed network control planes are highly sophisticated, with multiple interacting protocols operating at layers 2 and 3. The complexity makes network configurations highly complex and bug-prone. State-of-the-art tools that check if control plane bugs can lead to violations of key properties are either too slow, or do not model common network features. We develop a new, general multilayer graph control plane model that enables using fast, property-customized verification algorithms. Our tool, Tiramisu can verify if policies hold under failures for various real-world and synthetic configurations in < 0.08s in small networks and < 2.2s in large networks. Tiramisu is 2-600X faster than state-of-the-art without losing generality.
Kaiyuan Zhang, University of Washington; Danyang Zhuo, Duke University; Aditya Akella, University of Wisconsin–Madison; Arvind Krishnamurthy and Xi Wang, University of Washington
Building a formally-verified software middlebox is attractive for network reliability. In this paper, we explore the feasibility of verifying "almost unmodified" software middleboxes. Our key observation is that software middleboxes are already designed and implemented in a modular way (e.g., Click). Further, to achieve high performance, the number of operations each element or module performs is finite and small. These two characteristics place them within reach of automated verification through symbolic execution.
We perform a systematic study to test how many existing Click elements can be automatically verified using symbolic execution. We show that 45% of the elements can be automatically verified and an additional 33% of Click elements can be automatically verified with slight code modifications. To allow automated verification, we build Gravel, a software middlebox verification framework. Gravel allows developers to specify high-level middlebox properties and checks correctness in the implementation without requiring manual proofs. We then use Gravel to specify and verify middlebox-specific properties for several Click-based middleboxes. Our evaluation shows that Gravel avoids bugs that are found in today's middleboxes with minimal code changes and that the code modifications needed for proof automation do not affect middlebox performance.
Peng Zhang and Xu Liu, Xi'an Jiaotong University; Hongkun Yang, Google; Ning Kang, Zhengchang Gu, and Hao Li, Xi'an Jiaotong University
Realtime network verification ensures the correctness of network by incrementally checking data plane updates in real time (e.g., < 1ms per rule update). Even state-of-the-art methods can already achieve sub-millisecond verification time, such speed is achieved mostly for pure IP forwarding devices, and is unrealistic for real-world networks, due to two reasons. (1) Their network models cannot express the forwarding behavior of real devices, which have various functions including IP forwarding, ACL, NAT, policy-based routing, etc. (2) Their update algorithms do not scale in space and/or time: multi-field rules (e.g., ACL rules) can make these tools run out of memory and/or incur long verification time. To scale real- time verification to real networks, we propose APKeep based on a new modular network model that is expressive for real devices, and propose new algorithms that can achieve low memory cost and fast update speed at the same time. Our experiments show that for real-world update traces consisting of IP forwarding rules and ACL rules, existing methods either run out of memory or incur a prohibitively long verification time, while APKeep still achieves a sub-millisecond verifica- tion time. We also show that APKeep can verify an update of NAT rule mostly in less than 1 millisecond.
Farnaz Yousefi, Johns Hopkins University; Anubhavnidhi Abhashkumar and Kausik Subramanian, University of Wisconsin-Madison; Kartik Hans, IIT Delhi; Soudeh Ghorbani, Johns Hopkins University; Aditya Akella, University of Wisconsin-Madison
Network verification tools focus almost exclusively on various safety properties such as reachability invariants, e.g., is there a path from host A to host B? Thus, they are inapplicable to providing strong correctness guarantees for modern programmable networks that increasingly rely on stateful network functions. Correct operations of such networks depend on the validity of a larger set of properties, in particular liveness properties. For instance, a stateful firewall that only allows solicited external traffic works correctly if it eventually detects and blocks malicious connections, e.g., if it eventually blocks an external host E that tries to reach the internal host I before receiving a request from I.
Alas, verifying liveness properties is computationally expensive and in some cases undecidable. Existing verification techniques do not scale to verify such properties. In this work, we provide a compositional programming abstraction that decouples reachability from stateful network functions. We then model the behavior of the programs expressed in this abstraction using compact Boolean formulas, and show that verification of complex properties is fast on these formulas, e.g., for a network with 100 hosts, these formulas result in 8x speedup in verifying key properties compared to the state-of-the-art. Finally, we provide a compiler that translates the programs written using our abstraction to P4 programs.
3:30 pm–4:00 pm
Break with Refreshments
4:00 pm–5:20 pm
Session Chair: Manya Ghobadi, Massachusetts Institute of Technology
Fan Lai, Jie You, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury, University of Michigan
The popularity of big data and AI has led to many optimizations at different layers of distributed computation stacks. Despite – or perhaps, because of – its role as the narrow waist of such software stacks, the design of the execution engine, which is in charge of executing every single task of a job, has mostly remained unchanged. As a result, the execution engines available today are ones primarily designed for low latency and high bandwidth datacenter networks. When either or both of the network assumptions do not hold, CPUs are significantly underutilized.
In this paper, we take a first-principles approach toward developing an execution engine that can adapt to diverse network conditions. Sol, our federated execution engine architecture, flips the status quo in two respects. First, to mitigate the impact of high latency, Sol proactively assigns tasks, but does so judiciously to be resilient to uncertainties. Second, to improve the overall resource utilization, Sol decouples communication from computation internally instead of committing resources to both aspects of a task simultaneously. Our evaluations on EC2 show that, compared to Apache Spark in resource-constrained networks, Sol improves SQL and machine learning jobs by 16.4× and 4.2× on average.
Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, and Aditya Akella, University of Wisconsin-Madison; Amar Phanishayee, Microsoft Research; Shuchi Chawla, University of Wisconsin-Madison
Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads. We find that established cluster scheduling disciplines are a poor fit because of ML workloads' unique attributes: ML jobs have long-running tasks that need to be gang-scheduled, and their performance is sensitive to tasks' relative placement.
We propose Themis, a new scheduling framework for ML training workloads. It's GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off fairness for efficiency in the short term, but ensuring finish-time fairness in the long term. Our evaluation on a production trace shows that Themis can improve fairness by more than 2.25X and is ~5% to 250% more cluster efficient in comparison to state-of-the-art schedulers.
Ming Liu and Arvind Krishnamurthy, University of Washington; Harsha V. Madhyastha, University of Michigan; Rishi Bhardwaj, Karan Gupta, Chinmay Kamat, Huapeng Yuan, Aditya Jaltade, Roger Liao, Pavan Konka, and Anoop Jawahar, Nutanix
We describe the design and implementation of a consistent and fault-tolerant metadata index for a scalable block storage system. The block storage system supports the virtualized execution of legacy applications inside enterprise clusters by automatically distributing the stored blocks across the cluster’s storage resources. To support the availability and scalability needs of the block storage system, we develop a distributed index that provides a replicated and consistent key-value storage abstraction.
The key idea underlying our design is the use of fine-grained replicated state machines, wherein every key-value pair in the index is treated as a separate replicated state machine. This approach has many advantages over a traditional coarse-grained approach that represents an entire shard of data as a state machine: it enables effective use of multiple storage devices and cores, it is more robust to both short- and long-term skews in key access rates, and it can tolerate variations in key-value access latencies. The use of fine-grained replicated state machines, however, raises new challenges, which we address by co-designing the consensus protocol with the data store and streamlining the operation of the per-key replicated state machines. We demonstrate that fine-grained replicated state machines can provide significant performance benefits, characterize the performance of the system in the wild, and report on our experiences in building and deploying the system.
Cong Ding, David Chu, and Evan Zhao, Cornell University; Xiang Li, Alibaba Group; Lorenzo Alvisi and Robbert van Renesse, Cornell University
The shared log paradigm is at the heart of modern distributed applications in the growing cloud computing industry. Often, application logs must be stored durably for analytics, regulations, or failure recovery, and their smooth operation depends closely on how the log is implemented. Scalog is a new implementation of the shared log abstraction that offers an unprecedented combination of features for continuous smooth delivery of service: Scalog allows applications to customize data placement, supports reconfiguration with no loss in availability, and recovers quickly from failures. At the same time, Scalog provides high throughput and total order.
The paper describes the design and implementation of Scalog and presents examples of applications running upon it. To evaluate Scalog at scale, we use a combination of real experiments and emulation. Using 4KB records, a 10 Gbps infrastructure, and SSDs, Scalog can totally order up to 52 million records per second.
5:20 pm–5:30 pm
5:30 pm–6:30 pm
Wireless Networks 1
Session Chair: Keith Winstein, Stanford University
Akshay Gadre, Carnegie Mellon University; Revathy Narayanan, Carnegie Mellon University and IIT Madras; Anh Luong, Anthony Rowe, Bob Iannucci, and Swarun Kumar, Carnegie Mellon University
Low-power Wide-Area Networks (LP-WANs) are seen as a leading candidate to network the Internet-of-Things at city-scale. Yet, the battery life and performance of LP-WAN devices varies greatly based on their operating frequency. In multipath-rich urban environments, received signal power varies rapidly with a low-power transmitter's frequency, impacting its transmission time, data rate and battery life. However, the low bandwidth of LP-WANs means that there are hundreds of operating frequencies to choose from. Among them, we show how choosing a select few of these frequencies(<3.55%) effectively triples the battery life when compared to the rest for LP-WAN devices.
This paper presents Chime, a system enabling LP-WAN base stations to identify an optimal frequency of operation after the client sends one packet at one frequency. Chime achieves this by analyzing the wireless channels of this packet across many base stations to disentangle multipath and ascertain an optimal frequency that maximizes client battery life and minimizes interference. We implement Chime on a campus-scale test-bed and achieve a median gain of 3.4 dB in SINR leading to a median increase in battery life of 230% (~1.4-5.7 years), data rate by 3.3X and reduction in interference of 2.8X over commodity LP-WANs.
Prateesh Goyal, MIT CSAIL; Anup Agarwal, CMU; Ravi Netravali, UCLA; Mohammad Alizadeh and Hari Balakrishnan, MIT CSAIL
We propose Accel-Brake Control (ABC), a simple and deployable explicit congestion control protocol for network paths with time-varying wireless links. ABC routers mark each packet with an “accelerate” or “brake”, which causes senders to slightly increase or decrease their congestion windows. Routers use this feedback to quickly guide senders towards a desired target rate. ABC requires no changes to header formats or user devices, but achieves better performance than XCP. ABC is also incrementally deployable; it operates correctly when the bottleneck is a non-ABC router, and can coexist with non-ABC traffic sharing the same bottleneck link. We evaluate ABC using a Wi-Fi implementation and trace-driven emulation of cellular links. ABC achieves 30-40% higher throughput than Cubic+Codel for similar delays, and 2.2× lower delays than BBR on a Wi-Fi path. On cellular network paths, ABC achieves 50% higher throughput than Cubic+Codel.
Charles J. Carver and Zhao Tian, Department of Computer Science, Dartmouth College; Hongyong Zhang and Kofi M. Odame, Thayer School of Engineering, Dartmouth College; Alberto Quattrini Li and Xia Zhou, Department of Computer Science, Dartmouth College
Awarded Best Paper!
Air-water communication is fundamental for efficient underwater operations, such as environmental monitoring, surveying, or coordinating of heterogeneous aerial and underwater systems. Existing wireless techniques mostly focus on a single physical medium and fall short in achieving high-bandwidth bidirectional communication across the air-water interface. We propose a bidirectional, direct air-water wireless communication link based on laser light, capable of (1) adapting to water dynamics with ultrasonic sensing and (2) steering within a full 3D hemisphere using only a MEMS mirror and passive optical elements. In real-world experiments, our system achieves static throughputs up to 5.04 Mbps, zero-BER transmission ranges up to 6.1m in strong ambient light conditions, and connection time improvements between 47.1% and 29.5% during wave dynamics.
8:30 am–10:30 am
Session Chair: Sujata Banerjee, VMware Research
Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
Ze Li, Qian Cheng, Ken Hsieh, and Yingnong Dang, Microsoft Azure; Peng Huang, Johns Hopkins University; Pankaj Singh and Xinsheng Yang, Microsoft Azure; Qingwei Lin, Microsoft Research; Youjiang Wu, Sebastien Levy, and Murali Chintalapati, Microsoft Azure
Modern cloud systems have a vast number of components that continuously undergo updates. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals. It will correlate each signal against all the ongoing rollouts using a spatial and temporal correlation algorithm. The core decision logic of Gandalf includes an ensemble ranking algorithm that determines which rollout may have caused the fault signals, and a binary classifier that assesses the impact of the fault signals. The analysis result will decide whether a rollout is safe to proceed or should be stopped.
By using a lambda architecture, Gandalf provides both real-time and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. It achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts) for data-plane rollouts. For control-plane rollouts, Gandalf achieves 94.9% precision and 99.8% recall.
Jeffrey C. Mogul, Drago Goricanec, Martin Pool, Anees Shaikh, Douglas Turk, and Bikash Koley, Google; Xiaoxue Zhao, Alibaba Group Inc.
Network management is becoming increasingly automated, and automation depends on detailed, explicit representations of data about the state of a network and about an operator's intent for its networks. In particular, we must explicitly represent the desired and actual topology of a network. Almost all other network-management data either derives from its topology, constrains how to use a topology, or associates resources (e.g., addresses) with specific places in a topology.
MALT, a Multi-Abstraction-Layer Topology representation, supports virtually all network management phases: design, deployment, configuration, operation, measurement, and analysis. MALT provides interoperability across our network-management software, and its support for abstraction allows us to explicitly tie low-level network elements to high-level design intent. MALT supports a declarative style, simplifying what-if analysis and testbed support.
We also describe the software base that supports efficient use of MALT, as well as numerous, sometimes painful lessons we have learned about curating the taxonomy for a comprehensive, and evolving, representation for topology.
Alexandru Agache, Marc Brooker, Andreea Florescu, Alexandra Iordache, Anthony Liguori, Rolf Neugebauer, Phil Piwonka, and Diana-Maria Popa, Amazon Web Services
Serverless containers and functions are widely used for deploying and managing software in the cloud. Their popularity is due to reduced cost of operations, improved utilization of hardware, and faster scaling than traditional deployment methods. The economics and scale of serverless applications demand that workloads from multiple customers run on the same hardware with minimal overhead, while preserving strong security and performance isolation. The traditional view is that there is a choice between virtualization with strong security and high overhead, and container technologies with weaker security and minimal overhead. This tradeoff is unacceptable to public infrastructure providers, who need both strong security and minimal overhead. To meet this need, we developed Firecracker, a new open source Virtual Machine Monitor (VMM) specialized for serverless workloads, but generally useful for containers, functions and other compute workloads within a reasonable set of constraints. We have deployed Firecracker in two publically-available serverless compute services at AWS (Lambda and Fargate), where it supports millions of production workloads, and trillions of requests per month. We describe how specializing for serverless informed the design of Firecracker, and what we learned from seamlessly migrating AWS Lambda customers to Firecracker.
Sonu Mehta, Ranjita Bhagwan, and Rahul Kumar, Microsoft Research India; Chetan Bansal, Microsoft Research; Chandra Maddila and B. Ashok, Microsoft Research India; Sumit Asthana, University of Michigan; Christian Bird, Microsoft Research; Aditya Kumar
Large services experience extremely frequent changes to code and configuration. In many cases, these changes are correlated across files. For example, an engineer introduces a new feature following which they also change a configuration file to enable the feature only on a small number of experimental machines. This example captures only one of numerous types of correlations that emerge organically in large services. Unfortunately, in almost all such cases, no documentation or specification guides engineers on how to make correlated changes and they often miss making them. Such misses can be vastly disruptive to the service.
We have designed and deployed Rex, a tool that, using a combination of machine learning and program analysis, learns change-rules that capture such correlations. When an engineer changes only a subset of files in a change-rule, Rex suggests additional changes to the engineer based on the change-rule. Rex has been deployed for 14 months on 360 repositories within Microsoft that hold code and configuration for services such as Office 365 and Azure. Rex has so far positively affected 4926 changes without which, at the very least, code-quality would have degraded and, in some cases, the service would have been severely disrupted.
Midhul Vuppalapati, Justin Miron, and Rachit Agarwal, Cornell University; Dan Truong, Ashish Motivala, and Thierry Cruanes, Snowflake Computing
We present operational experience running Snowflake, a cloud-based data warehousing system with SQL support similar to state-of-the-art databases. Snowflake design is motivated by three goals: (1) compute and storage elasticity; (2) support for multi-tenancy; and, (3) high performance. Over the last few years, Snowflake has grown to serve thousands of customers executing millions of queries on petabytes of data every day.
This paper presents Snowflake design and implementation, along with a discussion on how recent changes in cloud infrastructure (emerging hardware, fine-grained billing, etc.) have altered the many assumptions that guided the design and optimization of Snowflake system. Using data collected from various components of our system during execution of 70 million queries over a 14 day period, our study both deepens the understanding of existing problems and highlights new research challenges along a multitude of dimensions including design of storage systems and high-performance query execution engines.
Marc Brooker, Tao Chen, and Fan Ping, Amazon Web Services
Starting in 2013, we set out to build a new database to act as the configuration master for a high-performance cloud block storage system (Amazon EBS). This database needs to be not only highly available, durable, and scalable but also strongly consistent. We quickly realized that the constraints on availability imposed by the CAP theorem, and the realities of operating distributed systems, meant that we didn't want one database. We wanted millions. Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client.
This paper describes Physalia in context of Amazon EBS, and some other uses within Amazon Web Services. We believe that the same patterns, and approach to design, is widely applicable to distributed systems problems like control planes, configuration management, and service discovery.
10:30 am–11:00 am
Break with Coffee and Tea
11:00 am–12:20 pm
Measurement and Adaptation
Session Chair: Renata Teixeira, Inria
Kevin Vermeulen, Sorbonne Université; Justin P. Rohrer and Robert Beverly, Naval Postgraduate School; Olivier Fourmaux and Timur Friedman, Sorbonne Université
Despite the well-known existence of load-balanced forwarding paths in the Internet, current active topology Internet-wide mapping efforts are multipath agnostic — largely because of the probing volume and time required for existing multipath discovery techniques. This paper introduces D-Miner, a system that marries previous work on high-speed probing with multipath discovery to make Internet-wide topology mapping, inclusive of load-balanced paths, feasible. We deploy D-Miner and collect multiple IPv4 interface-level topology snapshots, where we find >64% more edges, and significantly more complex topologies relative to existing systems. We further scrutinize topological changes between snapshots and attribute forwarding differences not to routing or policy changes, but to load balancer "remapping" events. We precisely categorize remapping events and find that they are a much more frequent contributor of path changes than previously recognized. By making D-Miner and our collected Internet-wide topologies publicly available, we hope to help facilitate better understanding of the Internet's true structure and resilience.
Francis Y. Yan and Hudson Ayers, Stanford University; Chenzhi Zhu, Tsinghua University; Sadjad Fouladi, James Hong, Keyi Zhang, Philip Levis, and Keith Winstein, Stanford UniversityCommunity Award Winner!
We describe the results of a randomized controlled trial of video-streaming algorithms for bitrate selection and network prediction. Over the last year, we have streamed 38.6 years of video to 63,508 users across the Internet. Sessions are randomized in blinded fashion among algorithms.
We found that in this real-world setting, it is difficult for sophisticated or machine-learned control schemes to outperform a "simple" scheme (buffer-based control), notwithstanding good performance in network emulators or simulators. We performed a statistical analysis and found that the heavy-tailed nature of network and user behavior, as well as the challenges of emulating diverse Internet paths during training, present obstacles for learned algorithms in this setting.
We then developed an ABR algorithm that robustly outperformed other schemes, by leveraging data from its deployment and limiting the scope of machine learning only to making predictions that can be checked soon after. The system uses supervised learning in situ, with data from the real deployment environment, to train a probabilistic predictor of upcoming chunk transmission times. This module then informs a classical control policy (model predictive control).
To support further investigation, we are publishing an archive of data and results each week, and will open our ongoing study to the community. We welcome other researchers to use this platform to develop and validate new algorithms for bitrate selection, network prediction, and congestion control.
Alexandru Uta and Alexandru Custura, Vrije Universiteit Amsterdam; Dmitry Duplyakin, University of Utah; Ivo Jimenez, UC Santa Cruz; Jan Rellermeyer, TU Delft; Carlos Maltzahn, UC Santa Cruz; Robert Ricci, University of Utah; Alexandru Iosup, Vrije Universiteit Amsterdam
Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our dataset consists of millions of datapoints gathered while transferring over 9 petabytes on cloud providers' networks. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs, and is even exacerbated by such mechanisms and service provider policies. We show how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines to reduce the volatility of big data performance, making experiments more repeatable.
Zhenyu Song, Princeton University; Daniel S. Berger, Microsoft Research & Carnegie Mellon University; Kai Li and Wyatt Lloyd, Princeton University
This paper presents a new approach for caching in CDNs that uses machine learning to approximate the Belady MIN algorithm. To accomplish this complex task, we introduce the Relaxed Belady algorithm, the Belady boundary, and the good decision ratio that inform the design of Learning Relaxed Belady (LRB). LRB addresses the necessary system challenges to build an end-to-end machine learning caching prototype, including how to gather training data, limit memory overhead, and have lightweight training and inference paths.
We implement an LRB simulator and a prototype within Apache Traffic Server. Our simulation using 6 production CDN traces show LRB reduces WAN traffic compared to a typical production CDN cache design by 5–24%, and consistently outperform other state-of-the-art methods. Our evaluation of the LRB prototype shows its overhead is modest and it can be deployed on today’s CDN servers.
12:20 pm–1:50 pm
Lunch (on your own)
1:50 pm–3:30 pm
Fault Tolerance and Availability
Session Chair: Hakim Weatherspoon, Cornell University
Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan, Google
High availability is a critical requirement for cloud applications: if a system does not have high availability, users cannot count on it for their critical work. Having a metric that meaningfully captures availability is useful for both users and system developers. It informs users what they should expect of the availability of the application. It informs developers what they should focus on to improve user-experienced availability. This paper presents and evaluates, in the context of Google's G Suite, a novel availability metric: windowed user-uptime. This metric has two main components. First, it directly models user-perceived availability and avoids the bias in commonly used availability metrics. Second, by simultaneously calculating the availability metric over many windows it can readily distinguish between many short periods of unavailability and fewer but longer periods of unavailability.
Chang Lou, Peng Huang, and Scott Smith, Johns Hopkins University
Awarded Best Paper!
Partial failures occur frequently in cloud systems and can cause serious damage including inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing effective detectors to systematically detect such failures is both time-consuming and error-prone. We thus propose OmegaGen, a static analysis tool that automatically generates customized watchdogs for a given program by using a novel program reduction technique. We have successfully applied OmegaGen to six large distributed systems. In evaluating 22 real-world partial failure cases in these systems, the generated watchdogs can detect 20 cases with a median detection time of 4.2 seconds, and pinpoint the failure scope for 18 cases. The generated watchdogs also expose an unknown, confirmed partial failure bug in the latest version of ZooKeeper.
Ennan Zhai, Alibaba Group; Ang Chen, Rice University; Ruzica Piskac, Yale University; Mahesh Balakrishnan, Facebook; Bingchuan Tian, Nanjing University; Bo Song and Haoliang Zhang, Google
The reliability of cloud services can be significantly undermined by correlated failures due to shared service dependencies, even when the services are already replicated across machines. State-of-the-art failure prevention systems can proactively audit a service before its deployment to detect risks for correlated failures, but their auditing speeds are too slow for frequent service updates. This paper presents CloudCanary, a system that can perform real-time audits on service updates to identify the root causes of correlated failure risks, and generate improvement plans with increased reliability.
CloudCanary achieves this with two primitives, SnapAudit and DepBooster. SnapAudit leverages two insights to achieve high accuracy and efficiency: a) service updates typically affect only a small part of the service stack, allowing the majority of previous auditing results to be reused; and b) structural reliability auditing tasks can be reduced to a Boolean satisfiability problem, which can then be solved efficiently using modern SAT solvers. DepBooster, on the other hand, can generate improvement plans efficiently by reducing the required reasoning load, using novel techniques such as model counting. We demonstrate in our experiments that CloudCanary can perform audits over large deployments 200x faster than state-of-the-art systems, and that it consistently generates high-quality improvement plans within minutes. Moreover, CloudCanary can yield valuable insights over real-world traces collected from production environments.
Matthew Burke, Cornell University; Audrey Cheng and Wyatt Lloyd, Princeton University
Linearizability reduces the complexity of building correct applications. However, there is a tradeoff between using linearizability for geo-replicated storage and low tail latency. Traditional approaches use consensus to implement linearizable replicated state machines, but consensus is inefficient for workloads composed mostly of reads and writes.
We present the design, implementation, and evaluation of Gryff, a system that offers linearizability and low tail latency by unifying consensus with shared registers. Gryff introduces carstamps to correctly order reads and writes without incurring unnecessary constraints that are required when ordering stronger synchronization primitives. Our evaluation shows that Gryff’s combination of an optimized shared register protocol with EPaxos allows it to provide lower service-level latency than EPaxos or MultiPaxos due to its lower tail latency for reads.
Jiyao Hu, Zhenyu Zhou, and Xiaowei Yang, Duke University; Jacob Malone, CableLabs; Jonathan W Williams, The University of North Carolina at Chapel Hill
Cable broadband networks are one of the few “last-mile” broadband technologies widely available in the U.S. Unfortunately, they have poor reliability after decades of deployment. Cable industry proposed a framework called Proactive Network Maintenance (PNM) to diagnose the cable networks. However, there is little public knowledge or systematic study on how to use these data to detect and localize cable network problems. Existing tools in the public domain have prohibitive high false-positive rates. In this paper, we propose CableMon, the first public-domain system that applies machine learning techniques to PNM data to improve the reliability of cable broadband networks. CableMon uses statistical models to generate features from time series data and uses customer trouble tickets as hints to infer abnormal thresholds for these generated features. We use eight-month of PNM data and customer trouble tickets from an ISP to evaluate CableMon’s performance. Our results show that 81.9% of the abnormal events detected by CableMon overlap with at least one customer trouble ticket. This ticket prediction accuracy is four times higher than that of the existing public-domain tools used by ISPs. The tickets predicted by CableMon constitute 23.0% of the total network-related trouble tickets, suggesting that if an ISP deploys CableMon and proactively repairs the faults detected by CableMon, it can preempt those customer calls. Our current results, while still not mature, can already tangibly reduce an ISP’s operational expenses and improve customers’ quality of experience. We expect future work can further improve these results.
3:30 pm–4:00 pm
Break with Refreshments
4:00 pm–5:20 pm
Datacenter Networking 2
Session Chair: Behnaz Arzani, Microsoft Research
Tamás Lévai, Budapest University of Technology and Economics & University of Southern California; Felicián Németh, Budapest University of Technology and Economics; Barath Raghavan, University of Southern California; Gábor Rétvári, MTA-BME Information Systems Research Group & Ericsson Research, Hungary
Data flow graphs are a popular program representation in machine learning, big data analytics, signal processing, and, increasingly, networking, where graph nodes correspond to processing primitives and graph edges describe control flow. To improve CPU cache locality and exploit data-level parallelism, nodes usually process data in batches. Unfortunately, as batches are split across dozens or hundreds of parallel processing pathways through the graph they tend to fragment into many small chunks, leading to a loss of batch efficiency.
We present Batchy, a scheduler for run-to-completion packet processing engines, which uses controlled queuing to efficiently reconstruct fragmented batches in accordance with strict service-level objectives (SLOs). Batchy comprises a runtime profiler to quantify batch-processing gain on different processing functions, an analytical model to fine-tune queue backlogs, a new queuing abstraction to realize this model in run-to-completion execution, and a one-step receding horizon controller that adjusts backlogs across the pipeline. We present extensive experiments on five networking use cases taken from an official 5G benchmark suite to show that Batchy provides 2--3x the performance of prior work while accurately satisfying delay SLOs.
Matthew K. Mukerjee, Carnegie Mellon University / Nefeli Networks; Christopher Canel, Carnegie Mellon University; Weiyang Wang, UC San Diego; Daehyeok Kim, Carnegie Mellon University / Microsoft Research; Srinivasan Seshan, Carnegie Mellon University; Alex C. Snoeren, UC San Diego
Reconfigurable datacenter networks (RDCNs) augment traditional packet switches with high-bandwidth reconfigurable circuits. In these networks, high-bandwidth circuits are assigned to particular source-destination rack pairs based on a schedule. To make efficient use of RDCNs, active TCP flows between such pairs must quickly ramp up their sending rates when high-bandwidth circuits are made available. Past studies have shown that TCP performs well on RDCNs with millisecond-scale reconfiguration delays, during which time the circuit network is offline. However, modern RDCNs can reconfigure in as little as 20 μs, and maintain a particular configuration for fewer than 10 RTTs. We show that existing TCP variants cannot ramp up quickly enough to work well on these modern RDCNs. We identify two methods to address this issue: First, an in-network solution that dynamically resizes top-of-rack switch virtual output queues to prebuffer packets; Second, an endpoint-based solution that increases the congestion window, cwnd, based on explicit circuit state feedback sent via the ECN-echo bit. To evaluate these techniques, we build an open-source RDCN emulator, Etalon, and show that a combination of dynamic queue resizing and explicit circuit state feedback increases circuit utilization by 1.91× with an only 1.20× increase in tail latency.
Tom Barbette, Chen Tang, Haoran Yao, Dejan Kostić, Gerald Q. Maguire Jr., Panagiotis Papadimitratos, and Marco Chiesa, KTH Royal Institute of Technology
Large service providers use load balancers to dispatch millions of incoming connections per second towards thousands of servers. There are two basic yet critical requirements for a load balancer: uniform load distribution of the incoming connections across the servers and per-connection-consistency (PCC), i.e., the ability to map packets belonging to the same connection to the same server even in the presence of changes in the number of active servers and load balancers. Yet, meeting both these requirements at the same time has been an elusive goal. Today's load balancers minimize PCC violations at the price of non-uniform load distribution.
This paper presents Cheetah, a load balancer that supports uniform load distribution and PCC while being scalable, memory efficient, resilient to clogging attacks, and fast at processing packets. The Cheetah LB design guarantees PCC for any realizable server selection load balancing mechanism and can be deployed in both a stateless and stateful manner, depending on the operational needs. We implemented Cheetah on both a software and a Tofino-based hardware switch. Our evaluation shows that a stateless version of Cheetah guarantees PCC, has negligible packet processing overheads, and can support load balancing mechanisms that reduce the flow completion time by a factor of 2–3×.
Naveen Kr. Sharma, Chenxingyu Zhao, and Ming Liu, University of Washington; Pravein G Kannan, School of Computing, National University of Singapore; Changhoon Kim, Barefoot Networks; Arvind Krishnamurthy, University of Washington; Anirudh Sivaraman, NYU
Packet schedulers traditionally focus on the prioritized transmission of packets. Scheduling is often realized through coarse-grained queue-level priorities, as in today's switches, or through fine-grained packet-level priorities, as in recent proposals such as PIFO. Unfortunately, fixed packet priorities determined when a packet is received by the traffic manager are not sufficient to support a broad class of scheduling algorithms that require the priorities of packets to change as a function of the time it has spent inside the network. In this paper, we revisit the Calendar Queue abstraction and show that it is an appropriate fit for scheduling algorithms that not only require prioritization but also perform dynamic escalation of packet priorities. We show that the calendar queue abstraction can be realized using either dataplane primitives or control-plane commands that dynamically modify the scheduling status of queues. Further, when paired with programmable switch pipelines, we can realize programmable calendar queues that can emulate a diverse set of scheduling policies. We demonstrate the power of this abstraction using three case studies that implement variants of LSTF, Fair Queueing, and pFabric in order to provide stronger delay guarantees, burst-friendly fairness, and starvation-free prioritization of short flows, respectively. We evaluate the benefits associated with these scheduling policies using both a custom simulator and a small-scale testbed.
5:20 pm–5:30 pm
5:30 pm–6:30 pm
Session Chair: KyoungSoo Park, Korea Advanced Institute of Science and Technology (KAIST)
Kuo-Feng Hsu, Rice University; Ryan Beckett, Microsoft Research; Ang Chen, Rice University; Jennifer Rexford, Praveen Tammana, and David Walker, Princeton University
We present Contra, a system for performance-aware routing that can adapt to traffic changes at hardware speeds. While point solutions exist for a fixed topology (e.g., a Fattree) with a fixed routing policy (e.g., use least utilized paths), Contra can operate seamlessly over any network topology and a wide variety of sophisticated routing policies. Users of Contra write network-wide policies that rank network paths given their current performance. A compiler then analyzes such policies in conjunction with the network topology and decomposes them into switch-local P4 programs, which collectively implement a new, specialized distance-vector protocol. This protocol generates compact probes that traverse the network, gathering path metrics to optimize for the user policy dynamically. Switches respond to changing network conditions by routing flowlets along the best policy-compliant paths. Our experiments show that Contra scales to large networks, and that in terms of flow completion times, it is competitive with hand-crafted systems that have been customized for specific topologies and policies.
Hatem Takruri, Ibrahim Kettaneh, Ahmed Alquraan, and Samer Al-Kiswany, University of Waterloo
We present FLAIR, a novel approach for accelerating read operations in leader-based consensus protocols. FLAIR leverages the capabilities of the new generation of programmable switches to serve reads from follower replicas without compromising consistency. The core of the new approach is a packet-processing pipeline that can track client requests and system replies, identify consistent replicas, and at line speed, forward read requests to replicas that can serve the read without sacrificing linearizability. An additional benefit of FLAIR is that it facilitates devising novel consistency-aware load balancing techniques. Following the new approach, we designed FlairKV, a key-value store atop Raft. FlairKV implements the processing pipeline using the P4 programming language. We evaluate the benefits of the proposed approach and compare it to previous approaches using a cluster with a Barefoot Tofino switch. Our evaluation indicates that, compared to state-of-the-art alternatives, the proposed approach can bring significant performance gains: up to 42% higher throughput and 35-97% lower latency for most workloads.
Shahrooz Pouryousef, Lixin Gao, and Arun Venkataramani, University of Massachusetts at Amherst
In this paper, we present the design and implementation of CIRCA, a logically centralized architecture and system for interdomain routing that enables operators to offload BGP-style route computation to the cloud while preserving the confidentiality of proprietary information. To this end, our work presents the first provably safe, live, and fully distributed convergence detection algorithm for decentralized policy routing and, somewhat surprisingly, shows that long MRAI timers can almost completely be eliminated while significantly improving convergence delays with logical centralization. Our experiments with a Quagga-based CIRCA prototype and the Internet's AS topologies suggest that CIRCA can improve interdomain routing convergence delays and transient route inconsistencies by over an order of magnitude and offers nontrivial incremental deployability benefits with modest changes to widely deployed routing infrastructure.
6:30 pm–8:00 pm
Poster Session and Reception
Cypress Room and Mezzanine
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers. View the complete list of accepted posters.
8:30 am–10:30 am
Session Chair: Rachit Agarwal, Cornell University
Albert Kwon, MIT; David Lu, MIT PRIMES; Srinivas Devadas, MIT
Even as end-to-end encrypted communication becomes more popular, private messaging remains a challenging problem due to metadata leakages, such as who is communicating with whom. Most existing systems that hide communication metadata either (1) do not scale easily, (2) incur significant overheads, or (3) provide weaker guarantees than cryptographic privacy, such as differential privacy or heuristic privacy. This paper presents XRD (short for Crossroads), a metadata private messaging system that provides cryptographic privacy, while scaling easily to support more users by adding more servers. At a high level, XRD uses multiple mix networks in parallel with several techniques, including a novel technique we call aggregate hybrid shuffle. As a result, XRD can support 2 million users with 228 seconds of latency with 100 servers. This is 13.3× and 4.1× faster than Atom and Pung, respectively, which are prior scalable messaging systems with cryptographic privacy.
Vibhaalakshmi Sivaraman, Massachusetts Institute of Technology; Shaileshh Bojja Venkatakrishnan, Ohio State University; Kathleen Ruan, Carnegie Mellon University; Parimarjan Negi and Lei Yang, Massachusetts Institute of Technology; Radhika Mittal, University of Illinois at Urbana-Champaign; Giulia Fanti, Carnegie Mellon University; Mohammad Alizadeh, Massachusetts Institute of Technology
Despite growing adoption of cryptocurrencies, making fast payments at scale remains a challenge. Payment channel networks (PCNs) such as the Lightning Network have emerged as a viable scaling solution. However, completing payments on PCNs is challenging: payments must be routed on paths with sufficient funds. As payments flow over a single channel (link) in the same direction, the channel eventually becomes depleted and cannot support further payments in that direction; hence, naive routing schemes like shortest-path routing can deplete key payment channels and paralyze the system. Today’s PCNs also route payments atomically, worsening the problem. In this paper, we present Spider, a routing solution that "packetizes" transactions and uses a multi-path transport protocol to achieve high-throughput routing in PCNs. Packetization allows Spider to complete even large transactions on low-capacity payment channels over time, while the multi-path congestion control protocol ensures balanced utilization of channels and fairness across flows. Extensive simulations comparing Spider with state-of-the-art approaches shows that Spider requires less than 25% of the funds to successfully route over 95% of transactions on balanced traffic demands, and offloads 4x more transactions onto the PCN on imbalanced demands.
Behnaz Arzani, Microsoft Research; Selim Ciraci, Microsoft; Stefan Saroiu, Alec Wolman, and Jack Stokes, Microsoft Research; Geoff Outhred and Lechao Diwu, Microsoft
Today, it is difficult for operators to detect compromised VMs in their data centers (DCs). Despite their benefits, the compromise detection systems operators offer are mostly unused. Operators are faced with a dilemma: allow VMs to remain unprotected, or mandate all customers use the compromise detection systems they provide. Neither is appealing: unprotected VMs can be used to attack other VMs. Many customers would view a mandate to use these detection systems as unacceptable due to privacy and performance concerns. Data from a production cloud show their compromise detection systems protect less than 5% of VMs.
PrivateEye is a scalable and privacy-preserving solution. It uses sanitized summaries of network traffic patterns obtained from the vSwitch, rather than installing binaries in customer VMs, introspection at the hypervisor, or packet captures. The challenge it addresses is protecting all VMs at DC-scale while preserving customer privacy using low-signal data. We developed PrivateEye to meet the needs of production DCs, and our data collection agent is deployed across all DCs of a large cloud. Evaluation on VMs of both internal and customer VM's shows it has an area under the ROC curve -- the curve showing the model's true positive rate vs its false-positive rate -- of 0.96.
Tyler Hunt, Zhipeng Jia, Vance Miller, Ariel Szekely, and Yige Hu, The University of Texas at Austin; Christopher J. Rossbach, The University of Texas at Austin and VMware Research; Emmett Witchel, The University of Texas at Austin
GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real side-channel concerns. To illustrate the severity of the problem, we demonstrate a novel attack that enables an attacker to correctly classify images from ImageNet by observing only the timing of GPU kernel execution, rather than the images themselves.
Telekine enables applications to use GPU acceleration in the cloud securely, based on a novel GPU stream abstraction that ensures execution and interaction through untrusted components are independent of any secret data. Given a GPU with support for a TEE, Telekine employs a novel variant of API remoting to partition application-level software into components to ensure secret-dependent behaviors occur only on trusted components. Telekine can securely train modern image recognition models on MXNet with 10%–22% performance penalty relative to an insecure baseline with a locally attached GPU. It runs graph algorithms using Galois on one and two GPUs with 18%–41% overhead.
Lukas Burkhalter, ETH Zurich; Anwar Hithnawi, UC Berkeley, ETH Zurich; Alexander Viand and Hossein Shafagh, ETH Zurich; Sylvia Ratnasamy, UC Berkeley
A growing number of devices and services collect detailed time series data that is stored in the cloud. Protecting the confidentiality of this vast and continuously generated data is an acute need for many applications in this space. At the same time, we must preserve the utility of this data by enabling authorized services to securely and selectively access and run analytics. This paper presents TimeCrypt, a system that provides scalable and real-time analytics over large volumes of encrypted time series data. TimeCrypt allows users to define expressive data access and privacy policies and enforces it cryptographically via encryption. In TimeCrypt, data is encrypted end-to-end, and authorized parties can only decrypt and verify queries within their authorized access scope. Our evaluation of TimeCrypt shows that its memory overhead and performance are competitive and close to operating on data in the clear.
Yuncong Hu, Sam Kumar, and Raluca Ada Popa, University of California, Berkeley
Data-sharing systems are often used to store sensitive data. Both academia and industry have proposed numerous solutions to protect the user privacy and data integrity from a compromised server. Practical state-of-the-art solutions, however, use weak threat models based on centralized trust—they assume that part of the server will remain uncompromised, or that the adversary will not perform active attacks. We propose Ghostor, a data-sharing system that, using only decentralized trust, (1) hides user identities from the server, and (2) allows users to detect server-side integrity violations. To achieve (1), Ghostor avoids keeping any per-user state at the server, requiring us to redesign the system to avoid common paradigms like per-user authentication and user-specific mailboxes. To achieve (2), Ghostor develops a technique called verifiable anonymous history. Ghostor leverages a blockchain rarely, publishing only a single hash to the blockchain for the entire system once every epoch. We measured that Ghostor incurs a 4–5x throughput overhead compared to an insecure baseline. Although significant, Ghostor's overhead may be worth it for security- and privacy-sensitive applications.
10:30 am–11:00 am
Break with Coffee and Tea
11:00 am–12:20 pm
Wireless Networks 2
Session Chair: Mohammad Alizadeh, Massachusetts Institute of Technology
Shaghayegh Mardani, UCLA; Mayank Singh, IIT Delhi; Ravi Netravali, UCLA
Despite the rapid increase in mobile web traffic, page loads still fall short of user performance expectations. State-of-the-art web accelerators optimize computation or network fetches that occur after a page’s HTML has been fetched. However, clients still suffer multiple round trips and server processing delays to fetch that HTML; during that time, a browser cannot display any visual content, frustrating users. This problem persists in warm cache settings since HTML is most often marked as uncacheable because it usually embeds a mixture of static and dynamic content.
Xin Liu, Zicheng Chi, Wei Wang, Yao Yao, and Ting Zhu, University of Maryland, Baltimore County
In this paper, we design and implement a versatile MIMO backscatter (VMscatter) system, which leverages the diversity features of MIMO to dramatically decrease bit error rate (BER) and increase throughput with negligible overhead. Our approach is different from existing WiFi MIMO backscatter approaches which simply reflect the signals from the WiFi MIMO sender and do not take advantage of MIMO technologies' advanced features (i.e., low bit error rate and high throughput). In our approach, the backscatter can achieve the same full diversity gain as traditional MIMO system by implementing the space-time coding on the backscatter tag under the constraint that backscatter tags cannot control the reflected signals to be orthogonal. Moreover, the backscatter can reflect excitation signals from the senders that have either a single antenna or multiple antennas. To implement the VMscatter system, we addressed the special design challenges such as complicated channel estimations among the sender, tag, and receiver by using a novel pre-scatter channels elimination method and a post-scatter channels equalization method. Our VMscatter design introduces negligible overheads (in terms of hardware cost, energy consumption, and computation) on the backscatter tag. We further extended our design to support any number of antennas that the sender, tag, and receiver have. Our MIMO backscatter design is generic and has the potential to be extended to achieve massive MIMO. We extensively evaluated our system in different real-world scenarios. Results show that the BER is reduced by a factor of 862 compared to the most related work MOXcatter.
Sam Kumar, Michael P Andersen, Hyung-Sin Kim, and David E. Culler, University of California, Berkeley
Low-power and lossy networks (LLNs) enable diverse applications integrating many resource-constrained embedded devices, often requiring interconnectivity with existing TCP/IP networks as part of the Internet of Things. But TCP has received little attention in LLNs due to concerns about its overhead and performance, leading to LLN-specific protocols that require specialized gateways for interoperability. We present a systematic study of a well-designed TCP stack in IEEE 802.15.4-based LLNs, based on the TCP protocol logic in FreeBSD. Through careful implementation and extensive experiments, we show that modern low-power sensor platforms are capable of running full-scale TCP and that TCP, counter to common belief, performs well despite the lossy nature of LLNs. By carefully studying the interaction between the transport and link layers, we identify subtle but important modifications to both, achieving TCP goodput within 25% of an upper bound (5–40x higher than prior results) and low-power operation commensurate to CoAP in a practical LLN application scenario. This suggests that a TCP-based transport layer, seamlessly interoperable with existing TCP/IP networks, is viable and performant in LLNs.
Shangqing Zhao, Zhe Qu, Zhengping Luo, Zhuo Lu, and Yao Liu, University of South Florida
Packet collisions happen every day in WiFi networks. RTS/CTS is a widely-used approach to reduce the cost of collisions of long data packets as well as combat the hidden terminal problem. In this paper, we present a new design called comb decoding (CombDec) to efficiently resolve RTS collisions without changing the 802.11 standard. We observe that an RTS payload, when treated as a vector in a vector space, exhibits a comb-like distribution; i.e., a limited number of vectors are much more likely to be used than the others due to RTS payload construction and firmware design. This enables us to reformulate RTS collision resolution as a sparse recovery problem. We create algorithms that carefully construct the search range for sparse recovery, making the complexity feasible for system design and implementation. Experimental results show that CombDec boosts the WiFi throughput by 33.6%–46.2% in various evaluation scenarios.
12:20 pm–1:50 pm
Lunch (on your own)
1:50 pm–3:30 pm
Session Chair: Lalith Suresh, VMware Research
Santhosh Prabhu, Kuan-Yen Chou, Ali Kheradmand, Brighten Godfrey, and Matthew Caesar, University of Illinois at Urbana–Champaign
Network configuration verification enables operators to ensure that the network will behave as intended, prior to deployment of their configurations. Although techniques ranging from graph algorithms to SMT solvers have been proposed, scalable configuration verification with sufficient protocol support continues to be a challenge. In this paper, we show that by combining equivalence partitioning with explicit-state model checking, network configuration verification can be scaled significantly better than the state of the art, while still supporting a rich set of protocol features. We propose Plankton, which uses symbolic partitioning to manage large header spaces and efficient model checking to exhaustively explore protocol behavior. Thanks to a highly effective suite of optimizations including state hashing, partial order reduction, and policy-based pruning, Plankton successfully verifies policies in industrial-scale networks quickly and compactly, at times reaching a 10000x speedup compared to the state of the art.
Rüdiger Birkner, ETH Zürich; Dana Drachsler-Cohen, Technion; Laurent Vanbever and Martin Vechev, ETH Zürich
Network verification and configuration synthesis are promising approaches to make networks more reliable and secure by enforcing a set of policies. However, these approaches require a formal and precise description of the intended network behavior, imposing a major barrier to their adoption: network operators are not only reluctant to write formal specifications, but often do not even know what these specifications are.
We present Config2Spec, a system that automatically synthesizes a formal specification (a set of policies) of a network given its configuration and a failure model (e.g., up to two link failures). A key technical challenge is to design a synthesis algorithm which can efficiently explore the large space of possible policies. To address this challenge, Config2Spec relies on a careful combination of two well-known methods: data plane analysis and control plane verification.
Experimental results show that Config2Spec scales to mining specifications of large networks (>150 routers).
Sam Burnett and Lily Chen, Google; Douglas A. Creager, GitHub; Misha Efimov, Ilya Grigorik, and Ben Jones, Google; Harsha V. Madhyastha, Google and University of Michigan; Pavlos Papageorge, Brian Rogan, Charles Stahl, and Julia Tuttle, Google
We present NEL (Network Error Logging), a planet-scale, client-side, network reliability measurement system. NEL is implemented in Google Chrome and has been proposed as a new W3C standard, letting any web site operator collect reports of clients’ successful and failed requests to their sites. These reports are similar to web server logs, but include information about failed requests that never reach serving infrastructure. Reports are uploaded via redundant failover paths, reducing the likelihood of shared-fate failures of report uploads. We have designed NEL such that service providers can glean no additional information about users or their behavior compared to what services already have visibility into during normal operation. Since 2014, NEL has been invaluable in monitoring all of Google’s domains, allowing us to detect and investigate instances of DNS hijacking, BGP route leaks, protocol deployment bugs, and other problems where packets might never reach our servers. This paper presents the design of NEL, case studies of real outages, and deployment lessons for other operators who choose to use NEL to monitor their traffic.
Siva Kesava Reddy Kakarla and Alan Tang, UCLA; Ryan Beckett, Microsoft Research; Karthick Jayaraman, Microsoft Azure; Todd Millstein, UCLA / Intentionet; Yuval Tamir and George Varghese, UCLA
Network verification to detect router configuration errors typically requires an explicit correctness specification. Unfortunately, specifications often either do not exist, are incomplete, or are written informally in English. We describe an approach to infer likely network configuration errors without a specification through a form of automated outlier detection. Unlike prior techniques, our approach can identify outliers even for complex, structured configuration elements that have a variety of intentional differences across nodes, like access-control lists, prefix lists, and route policies.
Given a collection of configuration elements, our approach automatically infers a set of parameterized templates, modeling the (likely) intentional differences as variations within a template while modeling the (likely) erroneous differences as variations across templates. We have implemented our algorithm, which we call structured generalization, in a tool called SelfStarter and used it to automatically identify configuration outliers in a collection of datacenter networks from a large cloud provider, the wide-area network from the same cloud provider, and the campus network of a large university. SelfStarter found misconfigurations in all three networks, including 43 previously unknown bugs, and is in the process of adoption in the configuration management system of a major cloud provider.
Nofel Yaseen, John Sonchack, and Vincent Liu, University of Pennsylvania
When designing, understanding, or optimizing a computer network, it is often useful to identify and rank common patterns in its usage over time. Often referred to as a network traffic pattern, identifying the patterns in which the network spends most of its time can help ease network operators' tasks considerably. Despite this, extracting traffic patterns from a network is, unfortunately, a difficult and highly manual process.
In this paper, we introduce tpprof, a profiler for network traffic patterns. tpprof is built around two novel abstractions: (1) network states, which capture an approximate snapshot of network link utilization and (2) traffic pattern sub-sequences, which represent a finite-state automaton over a sequence of network states. Around these abstractions, we introduce novel techniques to extract these abstractions, a robust tool to analyze them, and a system for alerting operators of their presence in a running network.
3:30 pm–4:00 pm
Break with Refreshments
4:00 pm–6:00 pm
Session Chair: Arpit Gupta, University of California, Santa Barbara
Mehrdad Hessar, Ali Najafi, Vikram Iyer, and Shyamnath Gollakota, University of Washington
Wireless protocol design for IoT networks is an active area of research which has seen significant interest and developments in recent years. The research community is however handicapped by the lack of a flexible, easily deployable platform for prototyping IoT endpoints that would allow for ground up protocol development and investigation of how such protocols perform at scale. We introduce tinySDR, the first software-defined radio platform tailored to the needs of power-constrained IoT endpoints. TinySDR provides a standalone, fully programmable low power software-defined radio solution that can be duty cycled for battery operation like a real IoT endpoint, and more importantly, can be programmed over the air to allow for large scale deployment. We present extensive evaluation of our platform showing it consumes as little as 30 uW of power in sleep mode, which is 10,000x lower than existing SDR platforms. We present two case studies by implementing LoRa and BLE beacons on the platform and achieve sensitivities of -126 dBm and -94 dBm respectively while consuming 11% and 3% of the FPGA resources. Finally, using tinySDR, we explore the research question of whether an IoT device can demodulate concurrent LoRa transmissions in real-time, within its power and computing constraints.
Venkat Arun and Hari Balakrishnan, Massachusetts Institute of Technology
To reduce transmit power, increase throughput, and improve range, radio systems benefit from the ability to direct more of the transmitted power toward the intended receiver. Many modern systems beamform with antenna arrays for this purpose. However, a radio's ability to direct its signal is fundamentally limited by its size. This limitation is acute on IoT and mobile devices, which are small and inexpensive, but even access points and base stations are typically constrained to a modest number of antennas.
To address this problem, we introduce RFocus, which moves beamforming functions from the radio endpoints to the environment. RFocus includes a two-dimensional surface with a rectangular array of simple RF switch elements. Each switch element either lets the signal through or reflects it. The surface does not emit any power of its own. The state of the elements is set by a software controller to maximize the signal strength at a receiver, with a novel optimization algorithm that uses signal strength measurements from the receiver. The RFocus surface can be manufactured as an inexpensive thin "wallpaper". In one floor of an office building (at MIT CSAIL), our prototype improves the median signal strength by 9.5×.and the median channel capacity by 2.0×.
Fawad Ahmad and Hang Qiu, University of Southern California; Ray Eells, California State Polytechnic University, Pomona; Fan Bai, General Motors; Ramesh Govindan, University of Southern California
Autonomous vehicles need an accurate, up-to-date, 3D map to localize themselves with respect to their surroundings. Today, map collection runs infrequently and uses a fleet of specialized vehicles. In this paper, we explore a different approach: near-real time crowd-sourced 3D map collection from vehicles with advanced sensors (LiDAR, stereo cameras). Our main technical challenge is to find a lean representation of a 3D map such that new map segments, or updates to existing maps, are compact enough to upload in near real-time over a cellular network. To this end, we develop CarMap, which finds a parsimonious representation of a feature map, contains novel object filtering and position-based feature matching techniques to improve localization robustness, and incorporates a novel stitching algorithm to combine map segments from multiple vehicles for unmapped regions and an efficient map-update operation for updating existing map regions. Evaluations show that CarMap takes less than a second (0.6 seconds) to update a map, reduces map sizes by 75× relative to competing strategies, has higher localization accuracy, and is able to localize in corner cases (e.g., multi-lane scenarios) when other approaches fail.
Unsoo Ha, Junshan Leng, Alaa Khaddaj, and Fadel Adib, Massachusetts Institute of Technology
We present the design and implementation of RF-EATS, a system that can sense food and liquids in closed containers without opening them or requiring any contact with their contents. RF-EATS uses passive backscatter tags (e.g., RFIDs) placed on a container, and leverages near-field coupling between a tag’s antenna and the container contents to sense them noninvasively.
In contrast to prior proposals that are invasive or require strict measurement conditions, RF-EATS is non- invasive and does not require any calibration; it can robustly identify contents in practical indoor environments and generalize to unseen environments. These capabilities are made possible by a learning framework that adapts recent advances in variational inference to the RF sensing problem. The framework introduces an RF kernel and incorporates a transfer model that together allow it to generalize to new contents in a sample-efficient manner, enabling users to extend it to new inference tasks using a small number of measurements.
We built a prototype of RF-EATS and tested it in seven different applications including identifying fake medicine, adulterated baby formula, and counterfeit beauty products. Our results demonstrate that RF-EATS can achieve over 90% classification accuracy in scenarios where state-of-the-art RFID sensing systems cannot perform better than a random guess.
Xingyu Chen, Jia Liu, Xia Wang, Haisong Liu, Dong Jiang, and Lijun Chen, Nanjing University
RFID tag authentication is challenging because most advanced cryptographic algorithms cannot be afforded by passive tags. Recent physical-layer identification utilizes unique features of RF signals as the fingerprint to authenticate a tag. This approach is effective but difficult for practical use because it either requires a purpose-built device to extract the signal features or is sensitive to environmental conditions. In this paper, we present a new energy-related fingerprint called Eingerprint to authenticate passive tags with commodity RFID devices. The competitive advantage of Eingerprint is that it is fully compatible with the RFID standard EPCglobal Gen2, which makes it more applicable and scalable in practice. Besides, it takes the electrical energy stored in a tag's resistor-capacitor (RC) circuit as the fingerprint, which is robust to environmental changes such as tag position, communication distance, transmit power, and multi-path effects. We propose a new metric called persistence time to indirectly estimate the energy level in the RC circuit. A select-query based scheme is designed to extract the persistence time by flipping and observing a flag in the tag's volatile memory. We implement a prototype of Eingerprint with commodity RFID devices without any modifications to the hardware or the firmware. Experiment results show that Eingerprint is able to achieve a high authentication accuracy of 99.4% when three persistence times are used, regardless of device diversity and environmental conditions.
Roshan Ayyalasomayajula, Aditya Arun, Chenfeng Wu, Shrivatsan Rajagopalan, Shreya Ganesaraman, Aravind Seetharaman, Ish Kumar Jain, and Dinesh Bharadia, University of California, San Diego
Indoor localization has been studied for nearly two decades fueled by wide interest in indoor navigation, achieving the necessary decimeter-level accuracy. However, there are no real-world deployments of WiFi-based user localization algorithms, primarily because these algorithms are infrastructure dependent and therefore assume the location of the Access Points, their antenna geometries, and deployment orientations in the physical map. In the real world, such detailed knowledge of the location attributes of the access point is seldom available, thereby making WiFi localization hard to deploy. In this paper, for the first time, we establish the accuracy requirements for the location attributes of access points to achieve decimeter level user localization accuracy. Surprisingly, these requirements for antenna geometries and deployment orientation are very stringent, requiring millimeter level and sub-10○ of accuracy respectively, which is hard to achieve with manual effort. To ease the deployment of real-world WiFi localization, we present LocAP, which is an autonomous system to physically map the environment and accurately locate the attributes of existing wireless infrastructure in the physical space down to the required stringent accuracy of 3 mm antenna separation and 3o deployment orientation median errors, whereas state-of-the-art algorithm reports 150 mm and 25o respectively.