# USENIX ATC '18 Technical Sessions

### USENIX ATC '18 Proceedings

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

USENIX ATC '18 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
USENIX ATC '18 Full Proceedings (Mobi)

Attendee Files
USENIX ATC '18 Attendee List (PDF)
USENIX ATC '18 Proceedings Web Archive (ZIP)
Display:
View mode:

## Continental Breakfast

Essex Ballroom Foyer

## Opening Remarks and Awards

Essex Ballroom North/Center

## Blockchain in the Lens of BFT

Dahlia Malkhi, VMware Research

Available Media

Blockchain is a Byzantine Fault Tolerant (BFT) replicated state machine, in which each state-update is by itself a Turing machine with bounded resources. The core algorithm for achieving BFT in a Blockchain appears completely different from classical BFT algorithms:

• Classical solutions like DLS, PBFT solve BFT among a small-to-medium group of known participants. Such algorithms consist of multiple rounds of message exchanges carrying votes and safety-proofs. They are evidently quite intimidating to the non-expert.
• In contrast, Bitcoin solves BFT among a very large group of unknown users. In each time-period, one user broadcasts a single message carrying a Proof-of-Work (PoW). No other messages or information is exchanged.

What a difference between the two worlds!

Recent advances in blockchain technology blur these boundaries. Namely, hybrid solutions such as Byzcoin, Bitcoin-NG, Hybrid Consensus, Casper and Solida, anchor off-chain BFT decisions inside a PoW chain or the other way around. Moreover, innovative solutions in the age of blockchains, such as Honeybadger, ALGORAND, Tendermint, SBFT, and Hot-Stuff, revisit the BFT setting with greater scalability and simplicity.

Confused? Come hear this keynote in which we describe Blockchain in the lens of BFT and BFT in the lens of Blockchain, and provide common algorithmic foundations for both.

## Dahlia Malkhi, VMware Research

Dahlia is an applied and foundational researcher, since the early nineties, in broad aspects of reliability and security in distributed systems.

In 2014, after the closing of the Microsoft Research Silicon Valley lab, Dahlia co-founded VMware Research and became a Principal Researcher at VMware. From 2004-2014, she was a principal researcher at Microsoft Research, Silicon Valley. From 1999-2007, she was a tenured associate professor at the Hebrew University of Jerusalem. In 2004, Dahlia actually left for a brief sabbatical at Microsoft Research, but was bitten by the Silicon Valley bug and stayed there. Dahlia holds a Ph.D., an M.Sc. and a B.Sc. in computer science from the Hebrew University of Jerusalem.

## Break with Refreshments

Essex Ballroom Foyer

### 10:30 am–12:10 pm

Refereed Papers Track I

## Performance

Essex Ballroom North

## Tributary: spot-dancing for elastic services with latency SLOs

Aaron Harlap and Andrew Chung, Carnegie Mellon University; Alexey Tumanov, UC Berkeley; Gregory R. Ganger and Phillip B. Gibbons, Carnegie Mellon University

Available Media

The Tributary elastic control system embraces the uncertain nature of transient cloud resources, such as AWS spot instances, to manage elastic services with latency SLOs more robustly and more cost-effectively. Such resources are available at lower cost, but with the proviso that they can be preempted en masse, making them risky to rely upon for business-critical services. Tributary creates models of preemption likelihood and exploits the partial independence among different resource offerings, selecting collections of resource allocations that satisfy SLO requirements and adjusting them over time, as client workloads change. Although Tributary’s collections are often larger than required in the absence of preemptions, they are cheaper because of both lower spot costs and partial refunds for preempted resources. At the same time, the often-larger sets allow unexpected workload bursts to be absorbed without SLO violation. Over a range of web service workloads, we find that Tributary reduces cost for achieving a given SLO by 81–86% compared to traditional scaling on non-preemptible resources, and by 47–62% compared to the high-risk approach of the same scaling with spot resources.

## FastTrack: Foreground App-Aware I/O Management for Improving User Experience of Android Smartphones

Sangwook Shane Hahn, Seoul National University; Sungjin Lee, DGIST; Inhyuk Yee, AIBrain Asia; Donguk Ryu, Samsung Electronics; Jihong Kim, Seoul National University

Available Media

The quality of user experience on a smartphone is directly affected by how fast a foreground app reacts to user inputs. Although existing Android smartphones properly differentiate a foreground app from background apps for most system activities, one major exception is the I/O service where I/O-priority inversions between a foreground app and background apps are commonly observed. In this paper, we investigate the I/O-priority inversion problem on Android smartphones. From our empirical study with real Android smartphones, we observed that the existing techniques for mitigating I/O-priority inversions are not applicable for smartphones where frequently inverted I/O priorities should be quickly corrected to avoid any user-perceived extra delay. We also identified that most noticeable I/O-priority inversions occur in the page cache and a flash storage device. Based on the analysis results, we propose a foreground app-aware I/O management scheme, called FastTrack, that accelerates foreground I/O requests by 1) preempting background I/O requests in the entire I/O stacks including the storage device and 2) preventing foreground app’s data from being flushed from the page cache. Our experimental results using a prototype FastTrack implementation on four smartphones show that a foreground app can achieve the equivalent level of user-perceived responsiveness regardless of the number of background apps. Over the existing Android I/O implementation, FastTrack can reduce the average user response time by 94% when six I/O-intensive apps run as background apps.

## Mainstream: Dynamic Stem-Sharing for Multi-Tenant Video Processing

Angela H. Jiang, Daniel L.-K. Wong, Christopher Canel, Lilia Tang, and Ishan Misra, Carnegie Mellon University; Michael Kaminsky, Michael A. Kozuch, and Padmanabhan Pillai, Intel Labs; David G. Andersen and Gregory R. Ganger, Carnegie Mellon University

Available Media

Mainstream is a new video analysis system that jointly adapts concurrent applications sharing fixed edge resources to maximize aggregate result quality. Mainstream exploits partial-DNN (deep neural network) compute sharing among applications trained through transfer learning from a common base DNN model, decreasing aggregate per-frame compute time. Based on the available resources and mix of applications running on an edge node, Mainstream automatically determines at deployment time the right trade-off between using more specialized DNNs to improve per-frame accuracy, and keeping more of the unspecialized base model to increase sharing and process more frames per second. Experiments with several datasets and event detection tasks on an edge node confirm that Mainstream improves mean event detection F1-scores by up to 47% relative to a static approach of retraining only the last DNN layer and sharing all others (“Max-Sharing”) and by 87X relative to the common approach of using fully independent per-application DNNs (“No-Sharing”).

## VideoChef: Efficient Approximation for Streaming Video Processing Pipelines

Ran Xu, Jinkyu Koo, Rakesh Kumar, and Peter Bai, Purdue University; Subrata Mitra, Adobe Research; Sasa Misailovic, University of Illinois Urbana-Champaign; Saurabh Bagchi, Purdue University

Available Media

Many video streaming applications require low-latency processing on resource-constrained devices. To meet the latency and resource constraints, developers must often approximate filter computations. A key challenge to successfully tuning approximations is finding the optimal configuration suited for content characteristics, which are changing across and within the input videos. Searching through the entire search space for every frame in the video stream is infeasible, while tuning the pipeline off-line, on a set of training videos, yields suboptimal results.

We present VideoChef, a system for approximate optimization of video pipelines. VideoChef finds the optimal configurations of approximate filters at runtime, by leveraging the previously proposed concept canary inputs (using small inputs to tune the accuracy of the computations and transferring the approximate configurations to full inputs). VideoChef is the first system to show that canary inputs can be used for complex streaming applications. The two key innovations of VideoChef are (1) an accurate error mapping from the approximate processing with downsampled inputs to that with full inputs and (2) a directed search that balances the cost of each search step with the estimated reduction in the run time.

We evaluate our approach on 106 videos obtained from YouTube, on a set of 9 video processing pipelines (in total having 10 distinct filters). Our results show significant performance improvement over the baseline and the previous approach that uses canary inputs. We also perform a user study that shows that the videos produced by VideoChef are often acceptable to human subjects.

Refereed Papers Track II

## Kernel

Session Chair: Shivaram Venkataraman, University of Wisconsin-Madison and Microsoft Research

Essex Ballroom Center

## SOCK: Rapid Task Provisioning with Serverless-Optimized Containers

Edward Oakes, Leon Yang, Dennis Zhou, and Kevin Houck, University of Wisconsin-Madison; Tyler Harter, Microsoft, GSL; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison

Available Media

Serverless computing promises to provide applications with cost savings and extreme elasticity. Unfortunately, slow application and container initialization can hurt common-case latency on serverless platforms. In this work, we analyze Linux container primitives, identifying scalability bottlenecks related to storage and network isolation. We also analyze Python applications from GitHub and show that importing many popular libraries adds about 100ms to startup. Based on these findings, we implement SOCK, a container system optimized for serverless workloads. Careful avoidance of kernel scalability bottlenecks gives SOCK an 18x speedup over Docker. A generalized-Zygote provisioning strategy yields an additional 3x speedup. A more sophisticated three-tier caching strategy based on Zygotes provides a 45x speedup over SOCK without Zygotes. Relative to AWS Lambda and OpenWhisk, OpenLambda with SOCK reduces platform overheads by 2.8x and 5.3x respectively in an image processing case study.

## DynaMix: Dynamic Mobile Device Integration for Efficient Cross-device Resource Sharing

Dongju Chae, POSTECH; Joonsung Kim and Gwangmu Lee, Seoul National University; Hanjun Kim, POSTECH; Kyung-Ah Chang and Hyogun Lee, Samsung Electronics; Jangwoo Kim, Seoul National University

Available Media

In the era of the Internet of Things, users desire more valuable services by simultaneously utilizing various resources available in remote devices. As a result, cross-device resource sharing, a capability to utilize the resources of a remote device, becomes a desirable feature to enable interesting multi-device services. However, the existing resource sharing mechanisms either have limited resource coverage, involve complex programming efforts for utilizing multiple devices, or more importantly, incur huge inter-device network traffic.

We propose DynaMix, a novel framework that realizes efficient cross-device resource sharing. First, DynaMix maximizes resource coverage by dynamically integrating computation and I/O resources of remote devices with distributed shared memory and I/O request forwarding. Second, DynaMix obviates the need for multi-device programming by providing the resource sharing capability at the low level. Third, DynaMix minimizes inter-device network traffic by adaptively redistributing tasks between devices based on their dynamic resource usage. By doing so, DynaMix achieves efficient resource sharing along with dynamic plug-and-play and reconfigurability. Our example implementation on top of Android and Tizen devices shows that DynaMix enables efficient cross-device resource sharing in multi-device services.

## The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS

Justinien Bouron, Sebastien Chevalley, Baptiste Lepers, and Willy Zwaenepoel, EPFL; Redha Gouicem, Julia Lawall, Gilles Muller, and Julien Sopena, Sorbonne University/Inria/LIP6

Available Media

This paper analyzes the impact on application performance of the design and implementation choices made in two widely used open-source schedulers: ULE, the default FreeBSD scheduler, and CFS, the default Linux scheduler.

We compare ULE and CFS in otherwise identical circumstances. We have ported ULE to Linux, and use it to schedule all threads that are normally scheduled by CFS. We compare the performance of a large suite of applications on the modified kernel running ULE and on the standard Linux kernel running CFS. The observed performance differences are solely the result of scheduling decisions, and do not reflect differences in other subsystems between FreeBSD and Linux.

There is no overall winner. On many workloads the two schedulers perform similarly, but for some workloads there are significant and even surprising differences. ULE may cause starvation, even when executing a single application with identical threads, but this starvation may actually lead to better application performance for some workloads. The more complex load balancing mechanism of CFS reacts more quickly to workload changes, but ULE achieves a better load balance in the long run.

## The Design and Implementation of Hyperupcalls

Nadav Amit and Michael Wei, VMware Research
Awarded Best Paper!

Available Media

The virtual machine abstraction provides a wide variety of benefits which have undeniably enabled cloud computing. Virtual machines, however, are a double-edged sword as hypervisors they run on top of must treat them as a black box, limiting the information which the hypervisor and virtual machine may exchange, a problem known as the semantic gap. In this paper, we present the design and implementation of a new mechanism, hyperupcalls, which enables a hypervisor to safely execute verified code provided by a guest virtual machine in order to transfer information. Hyperupcalls are written in C and have complete access to guest data structures such as page tables. We provide a complete framework which makes it easy to access familiar kernel functions from within a hyperupcall. Compared to state-of-the-art paravirtualization techniques and virtual machine introspection, Hyperupcalls are much more flexible and less intrusive. We demonstrate that hyperupcalls can not only be used to improve guest performance for certain operations by up to 2×but hyperupcalls can also serve as a powerful debugging and security tool.

### 2:00 pm–3:40 pm

Refereed Papers Track I

## Security 1

Session Chair: Julia Lawall, Inria/LIP6

Essex Ballroom North

## AIQL: Enabling Efficient Attack Investigation from System Monitoring Data

Peng Gao, Princeton University; Xusheng Xiao, Case Western Reserve University; Zhichun Li and Kangkook Jee, NEC Laboratories America, Inc.; Fengyuan Xu, National Key Lab for Novel Software Technology, Nanjing University; Sanjeev R. Kulkarni and Prateek Mittal, Princeton University

Available Media

The need for countering Advanced Persistent Threat (APT) attacks has led to the solutions that ubiquitously monitor system activities in each host, and perform timely attack investigation over the monitoring data for analyzing attack provenance. However, existing query systems based on relational databases and graph databases lack language constructs to express key properties of major attack behaviors, and often execute queries inefficiently since their semantics-agnostic design cannot exploit the properties of system monitoring data to speed up query execution.

To address this problem, we propose a novel query system built on top of existing monitoring tools and databases, which is designed with novel types of optimizations to support timely attack investigation. Our system provides (1) domain-specific data model and storage for scaling the storage, (2) a domain-specific query language, Attack Investigation Query Language (AIQL) that integrates critical primitives for attack investigation, and (3) an optimized query engine based on the characteristics of the data and the semantics of the queries to efficiently schedule the query execution. We deployed our system in NEC Labs America comprising 150 hosts and evaluated it using 857 GB of real system monitoring data (containing 2.5 billion events). Our evaluations on a real-world APT attack and a broad set of attack behaviors show that our system surpasses existing systems in both efficiency (124x over PostgreSQL, 157x over Neo4j, and 16x over Greenplum) and conciseness (SQL, Neo4j Cypher, and Splunk SPL contain at least 2.4x more constraints than AIQL).

## Application Memory Isolation on Ultra-Low-Power MCUs

Taylor Hardin, Dartmouth College; Ryan Scott, Clemson University; Patrick Proctor, Dartmouth College; Josiah Hester, Northwestern University; Jacob Sorber, Clemson University; David Kotz, Dartmouth College

Available Media

The proliferation of applications that handle sensitive user data on wearable platforms generates a critical need for embedded systems that offer strong security without sacrificing flexibility and long battery life. To secure sensitive information, such as health data, ultra-low-power wearables must isolate applications from each other and protect the underlying system from errant or malicious application code. These platforms typically use microcontrollers that lack sophisticated Memory Management Units (MMU). Some include a Memory Protection Unit (MPU), but current MPUs are inadequate to the task, leading platform developers to software-based memory-protection solutions. In this paper, we present our memory isolation technique, which leverages compiler inserted code and MPU-hardware support to achieve better runtime performance than software-only counterparts.

## Peeking Behind the Curtains of Serverless Platforms

Liang Wang, UW-Madison; Mengyuan Li and Yinqian Zhang, The Ohio State University; Thomas Ristenpart, Cornell Tech; Michael Swift, UW-Madison

Available Media

Serverless computing is an emerging paradigm in which an application's resource provisioning and scaling are managed by third-party services. Examples include AWS Lambda, Azure Functions, and Google Cloud Functions. Behind these services' easy-to-use APIs are opaque, complex infrastructure and management ecosystems. Taking on the viewpoint of a serverless customer, we conduct the largest measurement study to date, launching more than 50,000 function instances across these three services, in order to characterize their architectures, performance, and resource management efficiency. We explain how the platforms isolate the functions of different accounts, using either virtual machines or containers, which has important security implications. We characterize performance in terms of scalability, coldstart latency, and resource efficiency, with highlights including that AWS Lambda adopts a bin-packing-like strategy to maximize VM memory utilization, that severe contention between functions can arise in AWS and Azure, and that Google had bugs that allow customers to use resources for free.

## Soteria: Automated IoT Safety and Security Analysis

Z. Berkay Celik, Patrick McDaniel, and Gang Tan, The Pennsylvania State University

Available Media

Broadly defined as the Internet of Things (IoT), the growth of commodity devices that integrate physical processes with digital systems have changed the way we live, play and work. Yet existing IoT platforms cannot evaluate whether an IoT app or environment is safe, secure, and operates correctly. In this paper, we present \Soteria, a static analysis system for validating whether an IoT app or IoT environment (collection of apps working in concert) adheres to identified safety, security, and functional properties. Soteria operates in three phases; (a) translation of platform-specific IoT source code into an intermediate representation (IR), (b) extracting a state model from the IR, (c) applying model checking to verify desired properties. We evaluate Soteria on 65 SmartThings market apps through 35 properties and find nine (14%) individual apps violate ten (29%) properties. Further, our study of combined app environments uncovered eleven property violations not exhibited in the isolated apps. Lastly, we demonstrate Soteria on MalIoT, a novel open-source test suite containing 17 apps with 20 unique violations.

Refereed Papers Track II

## Virtualization

Session Chair: Yu Hua, Huazhong University of Science and Technology

Essex Ballroom Center

## Scaling Guest OS Critical Sections with eCS

Sanidhya Kashyap, Georgia Institute of Technology; Changwoo Min, Virginia Tech; Taesoo Kim, Georgia Institute of Technology

Available Media

Multi-core virtual machines (VMs) are now a norm in data center environments. However, one of the well-known problems that VMs suffer from is the vCPU scheduling problem that causes poor scalability behaviors. More specifically, the symptoms of this problem appear as preemption problems in both under- and over-committed scenarios. Although prior research efforts attempted to alleviate these symptoms separately, they fail to address the common root cause of these problems: the missing semantic gap that occurs when a guest OS is preempted while executing its own critical section, thereby leading to degradation of application scalability.

In this work, we strive to address all preemption problems together by bridging the semantic gap between guest OSes and the hypervisor: the hypervisor now knows whether guest OSes are running in critical sections and a guest OS has hypervisor's scheduling context. We annotate all critical sections by using the lightweight para-virtualized APIs, so we called enlightened critical sections (eCS), that provide scheduling hints to both the hypervisor and VMs. The hypervisor uses the hint to reschedule a vCPU to fundamentally overcome the double scheduling problem for these annotated critical sections and VMs use the hypervisor provided hints to further mitigate the blocked-waiter wake-up problem. Our evaluation results show that eCS guarantees the forward progress of a guest OS by 1) decreasing preemption counts by 85--100% while 2) improving the throughput of applications up to 2.5X in an over-committed scenario and 1.6X in an under-committed scenario for various real-world workloads on an 80-core machine.

## KylinX: A Dynamic Library Operating System for Simplified and Efficient Cloud Virtualization

Yiming Zhang, NiceX Lab, NUDT; Jon Crowcroft, University of Cambridge; Dongsheng Li and Chengfen Zhang, NUDT; Huiba Li, Alibaba; Yaozheng Wang and Kai Yu, NUDT; Yongqiang Xiong, Microsoft; Guihai Chen, SJTU

Available Media

Unikernel specializes a minimalistic LibOS and a target application into a standalone single-purpose virtual machine (VM) running on a hypervisor, which is referred to as (virtual) appliance. Compared to traditional VMs, Unikernel appliances have smaller memory footprint and lower overhead while guaranteeing the same level of isolation. On the downside, Unikernel strips off the process abstraction from its monolithic appliance and thus sacrifices flexibility, efficiency, and applicability.

This paper examines whether there is a balance embracing the best of both Unikernel appliances (strong isolation) and processes (high flexibility/efficiency). We present KylinX, a dynamic library operating system for simplified and efficient cloud virtualization by providing the pVM (process-like VM) abstraction. A pVM takes the hypervisor as an OS and the Unikernel appliance as a process allowing both page-level and library-level dynamic mapping. At the page level, KylinX supports pVM fork plus a set of API for inter-pVM communication (IpC). At the library level, KylinX supports shared libraries to be linked to a Unikernel appliance at runtime. KylinX enforces mapping restrictions against potential threats. KylinX can fork a pVM in about 1.3 ms and link a library to a running pVM in a few ms, both comparable to process fork on Linux (about 1 ms). Latencies of KylinX IpCs are also comparable to that of UNIX IPCs.

## Virtualizing Energy Storage Management Using RAIBA

Tzi-cker Chiueh, Mao-Cheng Huang, Kai-Cheung Juang, Shih-Hao Liang, and Welkin Ling, Industrial Technology Research Institute

Available Media

Because of the intermittent nature of renewable energy-based electricity generation, a key building block for a sustainable renewable energy-based electricity infrastructure is cost-effective energy storage management, which is largely determined by the cost of electric batteries. Despite substantial technological advances in recent years, batteries used in consumer devices and electric vehicles are still too expensive to be feasible for large-scale deployment. One promising way to reduce the battery cost of an energy storage system is to leverage retired batteries from electric vehicles. However, because the charging/discharging characteristics of retired batteries tend to vary widely from one another, putting these heterogeneous batteries into the same module or energy storage system pose significant safety risks and efficiency challenges. This paper presents the design, implementation and evaluation of a reconfigurable battery array system called RAIBA that is designed to address the heterogeneity issue in retired battery-based energy storage systems by allowing the inter-battery connectivity to be recofigurable at run time. In addition, RAIBA also enables virtualization of the electrical energy resources in a battery array in the same way as how computing, storage and network resources are virtualized. Empirical measurements on a fully operational RAIBA prototype demonstrate that it can effectively increase the discharge service time by more than 80% under a set of real-world electric load traces.

## Cntr: Lightweight OS Containers

Jörg Thalheim and Pramod Bhatotia, University of Edinburgh; Pedro Fonseca, University of Washington; Baris Kasikci, University of Michigan

Available Media

Container-based virtualization has become the de-facto standard for deploying applications in data centers. However, deployed containers frequently include a wide-range of tools (e.g., debuggers) that are not required for applications in the common use-case, but they are included for rare occasions such as in-production debugging. As a consequence, containers are significantly larger than necessary for the common case, thus increasing the build and deployment time.

Cntr provides the performance benefits of lightweight containers and the functionality of large containers by splitting the traditional container image into two parts: the “fat” image — containing the tools, and the “slim” image — containing the main application. At run-time, Cntr allows the user to efficiently deploy the “slim” image and then expand it with additional tools, when and if necessary, by dynamically attaching the “fat” image.

To achieve this, Cntr transparently combines the two container images using a new nested namespace, without any modification to the application, the container manager, or the operating system. We have implemented Cntr in Rust, using FUSE, and incorporated a range of optimizations. Cntr supports the full Linux filesystem API, and it is compatible with all container implementations (i.e., Docker, rkt, LXC, systemd-nspawn). Through extensive evaluation, we show that Cntr incurs reasonable performance overhead while reducing, on average, by 66.6% the image size of the Top-50 images available on Docker Hub.

## Break with Refreshments

Essex Ballroom Foyer

### 4:10 pm–5:50 pm

Refereed Papers Track I

## Security 2

Session Chair: Marcos Aguilera, VMware Research

Essex Ballroom North

## Throwhammer: Rowhammer Attacks over the Network and Defenses

Andrei Tatar and Radhesh Krishnan Konoth, Vrije Universiteit Amsterdam; Elias Athanasopoulos, University of Cyprus; Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi, Vrije Universiteit Amsterdam

Available Media

Increasingly sophisticated Rowhammer exploits allow an attacker that can execute code on a vulnerable system to escalate privileges and compromise browsers, clouds, and mobile systems. In all these attacks, the common assumption is that attackers first need to obtain code execution on the victim machine to be able to exploit Rowhammer either by having (unprivileged) code execution on the victim machine or by luring the victim to a website that employs a malicious JavaScript application. In this paper, we revisit this assumption and show that an attacker can trigger and exploit Rowhammer bit flips directly from a remote machine by only sending network packets. This is made possible by increasingly fast, RDMA-enabled networks, which are in wide use in clouds and data centers. To demonstrate the new threat, we show how a malicious client can exploit Rowhammer bit flips to gain code execution on a remote key-value server application. To counter this threat, we propose protecting unmodified applications with a new buffer allocator that is capable of fine-grained memory isolation in the DRAM address space. Using two real-world applications, we show that this defense is practical, self-contained, and can efficiently stop remote Rowhammer attacks by surgically isolating memory buffers that are exposed to untrusted network input.

## Varys: Protecting SGX Enclaves from Practical Side-Channel Attacks

Oleksii Oleksenko, Bohdan Trach, Robert Krahn, and André Martin, TU Dresden; Mark Silberstein, Technion; Christof Fetzer, TU Dresden

Available Media

Numerous recent works have experimentally shown that Intel Software Guard Extensions (SGX) are vulnerable to cache timing and page table side-channel attacks which could be used to circumvent the data confidentiality guarantees provided by SGX. Existing mechanisms that protect against these attacks either incur high execution costs, are ineffective against certain attack variants, or require significant code modifications.

We present Varys, a system that protects unmodified programs running in SGX enclaves from cache timing and page table side-channel attacks. Varys takes a pragmatic approach of strict reservation of physical cores to security-sensitive threads, thereby preventing the attacker from accessing shared CPU resources during enclave execution. The key challenge that we are addressing is that of maintaining the core reservation in the presence of an untrusted OS.

Varys fully protects against all L1/L2 cache timing attacks and significantly raises the bar for page table side-channel attacks - all with only 15% overhead on average for Phoenix and PARSEC benchmarks. Additionally, we propose a set of minor hardware extensions that hold the potential to extend Varys' security guarantees to L3 cache and further improve its performance.

## Kernel-Supported Cost-Effective Audit Logging for Causality Tracking

Shiqing Ma, Purdue University; Juan Zhai, Nanjing University; Yonghwi Kwon, Purdue University; Kyu Hyung Lee, University of Georgia; Xiangyu Zhang, Purdue University; Gabriela Ciocarlie, Ashish Gehani, and Vinod Yegneswaran, SRI International; Dongyan Xu, Purdue University; Somesh Jha, University of Wisconsin-Madison

Available Media

The Linux Audit system is widely used as a causality tracking system in real-world deployments for problem diagnosis and forensic analysis. However, it has poor performance. We perform a comprehensive analysis on the Linux Audit system and find that it suffers from high runtime and storage overheads due to the large volume of redundant events. To address these shortcomings, we propose an in-kernel cache-based online log-reduction system to enable high-performance audit logging. It features a multi-layer caching scheme distributed in various kernel data structures, and uses the caches to detect and suppress redundant events. Our technique is designed to reduce the runtime overhead caused by transferring, processing, and writing logs, as well as the space overhead caused by storing them on disk. Compared to existing log reduction techniques that first generate the huge raw logs before reduction, our technique avoids generating redundant events at the first place. Our experimental results of the prototype KCAL (Kernel-supported Cost-effective Audit Logging) on one-month real-world workloads show that KCAL can reduce the runtime overhead from 40+% to 15-%, and reduce space consumption by 90% on average. KCAL achieves such a large reduction with 4% CPU consumption on average, whereas a state-of-the-art user space log-reduction technique has to occupy a processor with 95+% CPU consumption all the time.

## EPTI: Efficient Defence against Meltdown Attack for Unpatched VMs

Zhichao Hua, Dong Du, Yubin Xia, Haibo Chen, and Binyu Zang, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University

Available Media

The Meltdown vulnerability, which exploits the inherent out-of-order execution in common processors like x86, ARM and PowerPC, has shown to break the fundamental isolation boundary between user and kernel space. This has stimulated a non-trivial patch to modern OS to separate page tables for user space and kernel space, namely, KPTI (kernel page table isolation). While this patch stops kernel memory leakages from rouge user processes, it mandates users to patch their kernels (usually requiring a reboot), and is currently only available on the latest versions of OS kernels. Further, it also introduces non-trivial performance overhead due to page table switching during user/kernel crossings. In this paper, we present EPTI, an alternative approach to defending against the Meltdown attack for unpatched VMs (virtual machines) in cloud, yet with better performance than KPTI. Specifically, instead of using two guest page tables, we use two EPTs (extended page tables) to isolate user space and kernel space, and unmap all the kernel space in user’s EPT to achieve the same effort of KPTI. The switching of EPTs is done through a hardware-support feature called EPT switching within guest VMs without hypervisor involvement. Meanwhile, EPT switching does not flush TLB since each EPT has its own TLB, which further reduces the overhead. We have implemented our design and evaluated it on Intel Kaby Lake CPU with different versions of Linux kernel. The results show that EPTI only introduces up to 13% overhead, which is around 45% less than KPTI.

Refereed Papers Track II

## Multicore

Session Chair: Changwoo Min, Virginia Tech

Essex Ballroom Center

## Effectively Mitigating I/O Inactivity in vCPU Scheduling

Weiwei Jia, The University of Hong Kong, New Jersey Institute of Technology; Cheng Wang and Xusheng Chen, The University of Hong Kong; Jianchen Shan and Xiaowei Shang, New Jersey Institute of Technology; Heming Cui, The University of Hong Kong; Xiaoning Ding, New Jersey Institute of Technology; Luwei Cheng, Facebook; Francis C. M. Lau and Yuexuan Wang, The University of Hong Kong; Yuangang Wang, Huawei

Available Media

In clouds where CPU cores are time-shared by virtual CPUs (vCPU), vCPUs are scheduled and descheduled by the virtual machine monitor (VMM) periodically. In each virtual machine (VM), when its vCPUs running I/O bound tasks are descheduled, no I/O requests can be made until the vCPUs are rescheduled. These inactivity periods of I/O tasks cause severe performance issues, one of them being the utilization of I/O resources in the guest OS tends to be low during I/O inactivity periods. Worse, the I/O scheduler in the host OS could suffer from low performance because the I/O scheduler assumes that I/O tasks make I/O requests constantly. Fairness among the VMs within a host can also be at stake. Existing works typically would adjust the time slices of vCPUs running I/O tasks, but vCPUs are still descheduled frequently and cause I/O inactivity.

Our idea is that since each VM often has active vCPUs, we can migrate I/O tasks to active vCPUs, thus mitigating the I/O inactivity periods and maintaining the fairness. We present vMigrater, which runs in the user level of each VM. It incorporates new mechanisms to efficiently monitor active vCPUs and to accurately detect I/O bound tasks. Evaluation on diverse real world applications shows that vMigrater can improve I/O performance by up to 4.42X compared with default Linux KVM. vMigrater can also improve I/O performance by 1.84X to 3.64X compared with two related systems.

## Placement of Virtual Containers on NUMA systems: A Practical and Comprehensive Model

Justin Funston, Maxime Lorrillere, and Alexandra Fedorova, University of British Columbia; Baptiste Lepers, EPFL; David Vengerov and Jean-Pierre Lozi, Oracle Labs; Vivien Quéma, IMAG

Available Media

Our work addresses the problem of placement of threads, or virtual cores, onto physical cores in a multicore NUMA system. Different placements result in varying degrees of contention for shared resources, so choosing the right placement can have a large effect on performance. Prior work has studied this problem, but either addressed hardware with specific properties, leaving us unable to generalize the models to other systems, or modeled much simpler effects than the actual performance in different placements.

Our contribution is a general framework for reasoning about workload placement on machines with shared resources. It enables us to build an accurate performance model for any machine with a hierarchy of known shared resources automatically, with only minimal input from the user. Using our methodology, data center operators can minimize the number of NUMA (CPU+memory) nodes allocated for an application or a service, while ensuring that it meets performance objectives.

## Getting to the Root of Concurrent Binary Search Tree Performance

Maya Arbel-Raviv, Technion; Trevor Brown, IST Austria; Adam Morrison, Tel Aviv University

Available Media

Many systems rely on optimistic concurrent search trees for multi-core scalability. In principle, optimistic trees have a simple performance story: searches are read-only and so run in parallel, with writes to shared memory occurring only when modifying the data structure. However, this paper shows that in practice, obtaining the full performance benefits of optimistic search trees is not so simple.

We focus on optimistic binary search trees (BSTs) and perform a detailed performance analysis of 10 state-of-the-art BSTs on large scale x86-64 hardware, using both microbenchmarks and an in-memory database system. We find and explain significant unexpected performance differences between BSTs with similar tree structure and search implementations, which we trace to subtle performance-degrading interactions of BSTs with systems software and hardware subsystems. We further derive a prescriptive approach to avoid this performance degradation, as well as algorithmic insights on optimistic BST design. Our work underlines the gap between the theory and practice of multi-core performance, and calls for further research to help bridge this gap.

## TerseCades: Efficient Data Compression in Stream Processing

Gennady Pekhimenko, University of Toronto; Chuanxiong Guo, Bytedance Inc.; Myeongjae Jeon, Microsoft Research; Peng Huang, Johns Hopkins University; Lidong Zhou, Microsoft Research

Available Media

This work is the first systematic investigation of stream processing with data compression: we have not only identified a set of factors that influence the benefits and overheads of compression, but have also demonstrated that compression can be effective for stream processing, both in the ability to process in larger windows and in throughput. This is done through a series of (i) optimizations on a stream engine itself to remove major sources of inefficiency, which leads to an order-of-magnitude improvement in throughput (ii) optimizations to reduce the cost of (de)compression, including hardware acceleration, and (iii) a new technique that allows direct execution on compressed data, that leads to a further 50% improvement in throughout. Our evaluation is performed on several real-world scenarios in cloud analytics and troubleshooting, with both microbenchmarks and production stream processing systems.

## Poster Session and Happy Hour

Essex Ballroom South

Posters of the papers presented in Wednesday's Technical Sessions, as well as invited posters, will be on display. View the list of accepted posters.

## Continental Breakfast

Essex Ballroom Foyer

### 9:00 am–10:15 am

Refereed Papers Track I

## Problem Determination

Session Chair: Keith Smith, NetApp

Essex Ballroom North

## Troubleshooting Transiently-Recurring Errors in Production Systems with Blame-Proportional Logging

Liang Luo, University of Washington; Suman Nath, Lenin Ravindranath Sivalingam, and Madan Musuvathi, Microsoft Research; Luis Ceze, University of Washington

Available Media

Many problems in production systems are transiently recurring— they occur rarely, but when they do, they recur for a short period of time. Troubleshooting these problems is hard as they are rare enough to be missed by sampling techniques and traditional postmortem analyses of runtime logs suffers either from low-fidelity of logging too little or from the overhead of logging too much.

This paper proposes AUDIT, a system specifically designed for troubleshooting transiently-recurring problems in cloud-based production systems. The key idea is to use lightweight triggers to identify the first occurrence of a problem and then to use its recurrences to perform blame-proportional logging. When a problem occurs, AUDIT automatically assigns a blame rank to methods in the application based on their likelihood of being relevant to the root-cause of the problem. Then AUDIT enables heavy-weight logging on highly-ranked methods for a short period of time. Over a period of time, logs generated by a method is proportional to how often it is blamed for various misbehaviors, allowing developers to quickly find the root-cause of the problem.

We have implemented AUDIT for cloud applications. We describe how to utilize system events to efficiently implement lightweight triggers and blame ranking algorithm, with negligible to < 1% common-case runtime overheads on real applications. We evaluate AUDIT with five mature open source and commercial applications, for which AUDIT identified previously unknown issues causing slow responses, inconsistent outputs, and application crashes. All the issues were reported to developers, who have acknowledged or fixed them.

## NanoLog: A Nanosecond Scale Logging System

Stephen Yang, Seo Jin Park, and John Ousterhout, Stanford University

Available Media

NanoLog is a nanosecond scale logging system that is 1-2 orders of magnitude faster than existing logging systems such as Log4j2, spdlog, Boost log or Event Tracing for Windows. The system achieves a throughput up to 80 million log messages per second for simple messages and has a typical log invocation overhead of 8 nanoseconds in microbenchmarks and 18 nanoseconds in applications, despite exposing a traditional printf-like API. NanoLog achieves this low latency and high throughput by shifting work out of the runtime hot path and into the compilation and post-execution phases of the application. More specifically, it slims down user log messages at compile-time by extracting static log components, outputs the log in a compacted, binary format at runtime, and utilizes an offline process to re-inflate the compacted logs. Additionally, log analytic applications can directly consume the compacted log and see a performance improvement of over 8x due to I/O savings. Overall, the lower cost of NanoLog allows developers to log more often, log in more detail, and use logging in low-latency production settings where traditional logging mechanisms are too expensive.

## Model Governance: Reducing the Anarchy of Production ML

Vinay Sridhar, Sriram Subramanian, Dulcardo Arteaga, Swaminathan Sundararaman, Drew Roselli, and Nisha Talagala, ParallelM

Available Media

As the influence of machine learning grows over decisions in businesses and human life, so grows the need for Model Governance. In this paper, we motivate the need for, define the problem of, and propose a solution for Model Governance in production ML. We show that through our approach one can meaningfully track and understand the who, where, what, when, and how an ML prediction came to be. To the best of our knowledge, this is the first work providing a comprehensive framework for production Model Governance, building upon previous work in developer-focused Model Management.

Refereed Papers Track II

## Consistency

Session Chair: Siddhartha Sen, Microsoft Research

Essex Ballroom Center

## Fine-grained consistency for geo-replicated systems

Cheng Li, University of Science and Technology of China; Nuno Preguica, NOVA LINCS & FCT, Univ. NOVA de Lisboa; Rodrigo Rodrigues, INESC-ID & Instituto Superior Técnico, Universidade de Lisboa

Available Media

To deliver fast responses to users worldwide, major Internet providers rely on geo-replication to serve requests at data centers close to users. This deployment leads to a fundamental tension between improving system performance and reducing costly cross-site coordination for maintaining service properties such as state convergence and invariant preservation. Previous proposals for managing this trade-off resorted to coarse-grained operations labeling or coordination strategies that were oblivious to the frequency of operations. In this paper, we present a novel fine-grained consistency definition, Partial Order-Restrictions consistency (or short, PoR consistency), generalizing the trade-off between performance and the amount of coordination paid to restrict the ordering of certain operations. To offer efficient PoR consistent replication, we implement Olisipo, a coordination service assigning different coordination policies to various restrictions by taking into account the relative frequency of the confined operations. Our experimental results show that PoR consistency significantly outperforms a state-of-the-art solution (RedBlue consistency) on a 3-data center RUBiS benchmark.

## Log-Free Concurrent Data Structures

Tudor David, IBM Research, Zurich; Aleksandar Dragojevic, MSR Cambridge; Rachid Guerraoui and Igor Zablotchi, EPFL

Available Media

Non-volatile RAM (NVRAM) makes it possible for data structures to tolerate transient failures, assuming however that programmers have designed these structures such that their consistency is preserved upon recovery. Previous approaches are typically transactional and inherently make heavy use of logging, resulting in implementations that are significantly slower than their DRAM counterparts. In this paper, we introduce a set of techniques aimed at lock-free data structures that, in the large majority of cases, remove the need for logging (and costly durable store instructions) both in the data structure algorithm and in the associated memory management scheme. Together, these generic techniques enable us to design what we call log-free concurrent data structures, which, as we illustrate on linked lists, hash tables, skip lists, and BSTs, can provide several-fold performance improvements over previous transaction-based implementations, with overheads of the order of milliseconds for recovery after a failure. We also highlight how our techniques can be integrated into practical systems, by presenting a durable version of Memcached that maintains the performance of its volatile counterpart.

## Stable and Consistent Membership at Scale with Rapid

Lalith Suresh, Dahlia Malkhi, and Parikshit Gopalan, VMware Research; Ivan Porto Carreiro, One Concern; Zeeshan Lokhandwala, VMware

Available Media

We present the design and evaluation of Rapid, a distributed membership service. At Rapid’s core is a scheme for multi-process cut detection (CD) that revolves around two key insights: (i) it suspects a failure of a process only after alerts arrive from multiple sources, and (ii) when a group of processes experience problems, it detects failures of the entire group, rather than conclude about each process individually. Implementing these insights translates into a simple membership algorithm with low communication overhead.

We present evidence that our strategy suffices to drive unanimous detection almost-everywhere, even when complex network conditions arise, such as one-way reachability problems, firewall misconfigurations, and high packet loss. Furthermore, we present both empirical evidence and analyses that proves that the almost-everywhere detection happens with high probability. To complete the design, Rapid contains a leaderless consensus protocol that converts multi-process cut detections into a view-change decision. The resulting membership service works both in fully decentralized as well as logically centralized modes. We present an evaluation of Rapid in moderately scalable cloud settings. Rapid bootstraps 2000 node clusters 2-5.8x faster than prevailing tools such as Memberlist and ZooKeeper, remains stable in face of complex failure scenarios, and provides strong consistency guarantees. It is easy to integrate Rapid into existing distributed applications, of which we demonstrate two.

## Break with Refreshments

Essex Ballroom Foyer

### 10:45 am–12:25 pm

Refereed Papers Track I

## Big Data Faster

Session Chair: Angela Demke Brown, University of Toronto

Essex Ballroom North

## On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

Arijit Khan, Nanyang Technological University, Singapore; Gustavo Segovia, ETH Zurich, Switzerland; Donald Kossmann, Microsoft Research, Redmond, USA

Available Media

We study online graph queries that retrieve nearby nodes of a query node in a large network. To answer such queries with high throughput and low latency, we partition the graph and process in parallel across a cluster of servers. Existing distributed graph systems place each partition on a separate server, where query answering over that partition takes place. This design has two major disadvantages. First, the router maintains a fixed routing table (or, policy), thus less flexible for query routing, fault tolerance, and graph updates. Second, the graph must be partitioned so that the workload across servers is balanced, and the inter machine communication is minimized. To maintain good-quality partitions, it is also required to update the existing partitions based on workload changes. However, graph partitioning, online monitoring of workloads, and dynamically updating the partitions are expensive.

We mitigate these problems by decoupling graph storage from query processors, and by developing smart routing strategies with graph landmarks and embedding. Since a query processor is no longer assigned any fixed part of the graph, it is equally capable of handling any request, thus facilitating load balancing and fault tolerance. Moreover, due to our smart routing strategies, query processors can effectively leverage their cache, reducing the impact of how the graph is partitioned across storage servers. Our experiments with several real-world, large graphs demonstrate that the proposed framework gRouting, even with simple hash partitioning, achieves up to an order of magnitude better query throughput compared to existing distributed graph systems that employ expensive graph partitioning and re-partitioning strategies.

## Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Yanhao Chen and Ari B. Hayes, Rutgers University; Chi Zhang, University of Pittsburgh; Timothy Salmon and Eddy Z. Zhang, Rutgers University

Available Media

This paper tackles the cache thrashing problem caused by the non-deterministic scheduling feature of bulk synchronous parallel (BSP) execution in GPUs. In the BSP model, threads can be executed and interleaved in any order before reaching a barrier synchronization point, which requires the entire working set to be in cache for maximum data reuse over time. However, it is not always possible to fit all the data in cache at once. Thus, we propose a locality-aware software throttling framework that throttles the number of active execution tasks, prevents cache thrashing, and enhances data reuse over time. Our locality-aware software throttling framework focuses on an important class of applications that operate on sparse matrices (graphs). These applications come from the domains of linear algebra, graph processing, machine learning and scientific simulation. Evaluated on over 200 real sparse matrices and graphs that suffer from cache thrashing in the Florida sparse matrix collection, our technique achieves an average of 2.01X speedup, a maximum of 6.45X speedup, and a maximum performance loss ≤5%.

## Accelerating PageRank using Partition-Centric Processing

Kartik Lakhotia, University of Southern California; Rajgopal Kannan, US Army Research Lab; Viktor Prasanna, University of Southern California

Available Media

PageRank is a fundamental link analysis algorithm that also functions as a key representative of the performance of Sparse Matrix-Vector (SpMV) multiplication. The traditional PageRank implementation generates fine granularity random memory accesses resulting in large amount of wasteful DRAM traffic and poor bandwidth utilization. In this paper, we present a novel Partition-Centric Processing Methodology (PCPM) to compute PageRank, that drastically reduces the amount of DRAM communication while achieving high sustained memory bandwidth. PCPM uses a Partition-centric abstraction coupled with the Gather-Apply-Scatter (GAS) programming model. By carefully examining how a PCPM based implementation impacts communication characteristics of the algorithm, we propose several system optimizations that improve the execution time substantially. More specifically, we develop (1) a new data layout that significantly reduces communication and random DRAM accesses, and (2) branch avoidance mechanisms to get rid of unpredictable data-dependent branches.

We perform detailed analytical and experimental evaluation of our approach using 6 large graphs and demonstrate an average 2.7x speedup in execution time and 1.7x reduction in communication volume, compared to the state-of-the-art. We also show that unlike other GAS based implementations, PCPM is able to further reduce main memory traffic by taking advantage of intelligent node labeling that enhances locality. Although we use PageRank as the target application in this paper, our approach can be applied to generic SpMV computation.

## CGraph: A Correlations-aware Approach for Efficient Concurrent Iterative Graph Processing

Yu Zhang, Xiaofei Liao, Hai Jin, and Lin Gu, Huazhong University of Science and Technology; Ligang He, University of Warwick; Bingsheng He, National University of Singapore; Haikun Liu, Huazhong University of Science and Technology

Available Media

With the fast growing of iterative graph analysis applications, the graph processing platform have to efficiently handle massive Concurrent iterative Graph Processing (CGP) jobs. Although it has been extensively studied to optimize the execution of a single job, existing solutions face high ratio of data access cost to computation for the CGP jobs due to significant cache interference and memory wall, which incurs low throughput. In this work, we observed that there are strong spatial and temporal correlations among the data accesses issued by different CGP jobs because these concurrently running jobs usually need to repeatedly traverse the shared graph structure for the iterative processing of each vertex. Based on this observation, this paper proposes a correlations-aware execution model, together with a core-subgraph based scheduling algorithm, to enable these CGP jobs to efficiently share the graph structure data in cache/memory and their accesses by fully exploiting such correlations. It is able to achieve the efficient execution of the CGP jobs by effectively reducing their average ratio of data access cost to computation and therefore delivers a much higher throughput. In order to demonstrate the efficiency of the proposed approaches, a system called CGraph is developed and extensive experiments have been conducted. The experimental results show that CGraph improves the throughput of the CGP jobs by up to 2.31 times in comparison with the existing solutions.

Refereed Papers Track II

## Availability

Session Chair: Tanakorn Leesatapornwongsa, Samsung Research America

Essex Ballroom Center

## Don't share, Don't lock: Large-scale Software Connection Tracking with Krononat

Fabien André, Stéphane Gouache, Nicolas Le Scouarnec, and Antoine Monsifrot, Technicolor

Available Media

To simplify software updates and provide new services, ISPs are interested in migrating network functions implemented in residential gateways (such as DSL or Cable modems) to the cloud. Two key functions of residential gateways are Network Address Translation (NAT) and stateful firewalling which both rely on connection tracking. To date, these functions cannot be efficiently implemented in the cloud: current OSes connection tracking is unable to meet the scale and reliability needs of ISPs, while hardware appliances are often too expensive.

In this paper, we present Krononat, a distributed software NAT that runs on a cluster of commodity servers, providing a cost-efficient solution with an excellent reliability. To achieve these results, Krononat relies on 3 key ideas: (i) sharding the connection tracking state across multiple servers, down to the core level; (ii) steering traffic exploiting the features of entry-level switches; and (iii) avoiding all locks and data sharing on the data path.

Krononat supports a rate of 77 million packets per second on only 12 cores, tracking up to 60M connections. Krononat is immune to single node failures, and accommodates elastic workloads through a fast reconfiguration mechanism (< 500ms).

## Accurate Timeout Detection Despite Arbitrary Processing Delays

Sixiang Ma and Yang Wang, The Ohio State University

Available Media

Timeout is widely used for failure detection. This paper proposes SafeTimer, a mechanism to enhance existing timeout detection protocols to tolerate long delays in the OS and the application: at the heartbeat receiver side, SafeTimer checks whether there are any pending heartbeats before reporting a failure; at the sender side, SafeTimer blocks the sender if the sender cannot send out heartbeats in time. We have proved that SafeTimer can prevent false failure report despite arbitrary delays in the OS and the application. This property allows existing protocols to relax their timing assumptions and use a shorter timeout interval for faster failure detection. Evaluation shows that the overhead of SafeTimer is negligible and applying SafeTimer to existing systems is easy.

## Improving Service Availability of Cloud Systems by Predicting Disk Error

Yong Xu and Kaixin Sui, Microsoft Research, China; Randolph Yao, Microsoft Azure, USA; Hongyu Zhang, The University of Newcastle, Australia; Qingwei Lin, Microsoft Research, China; Yingnong Dang, Microsoft Azure, USA; Peng Li, Nankai University, China; Keceng Jiang, Wenchi Zhang, and Jian-Guang Lou, Microsoft Research, China; Murali Chintalapati, Microsoft Azure, USA; Dongmei Zhang, Microsoft Research, China

Available Media

High service availability is crucial for cloud systems. A typical cloud system uses a large number of physical hard disk drives. Disk errors are one of the most important reasons that lead to service unavailability. Disk error (such as sector error and latency error) can be seen as a form of gray failure, which are fairly subtle failures that are hard to be detected, even when applications are afflicted by them. In this paper, we propose to predict disk errors proactively before they cause more severe damage to the cloud system. The ability to predict faulty disks enables the live migration of existing virtual machines and allocation of new virtual machines to the healthy disks, therefore improving service availability. To build an accurate online prediction model, we utilize both disk-level sensor (SMART) data as well as systemlevel signals. We develop a cost-sensitive ranking-based machine learning model that can learn the characteristics of faulty disks in the past and rank the disks based on their error-proneness in the near future. We evaluate our approach using real-world data collected from a production cloud system. The results confirm that the proposed approach is effective and outperforms related methods. Furthermore, we have successfully applied the proposed approach to improve service availability of Microsoft Azure.

## RAFI: Risk-Aware Failure Identification to Improve the RAS in Erasure-coded Data Centers

Juntao Fang, Wuhan National Laboratory for Optoelectronics, Huazhong University of Sci. and Tech.; Shenggang Wan, School of Computer Science and Technology, Huazhong University of Sci. and Tech.; Xubin He, Department of Computer and Information Sciences, Temple University

Available Media

Data reliability and availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. Compared to the recovery phase of the data repair, which is widely studied and well optimized, the failure identification phase of the data repair is less investigated. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, thus losing opportunities to further improve the RAS.

To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For those chunks in a high risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, thus improving the overall data reliability and availability. For those chunks in a low risk stripe (one with only a few failed chunks), a longer identification time is adopted, thus reducing the repair network traffic. Therefore, the RAS can be improved simultaneously.

We use both simulations and prototyping implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI on improving the RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.

## Conference Luncheon

Essex Ballroom South

### 2:00 pm–3:40 pm

Refereed Papers Track I

## Big Data 1

Session Chair: Yang Wang, Ohio State University

Essex Ballroom North

## Siphon: Expediting Inter-Datacenter Coflows in Wide-Area Data Analytics

Shuhao Liu, Li Chen, and Baochun Li, University of Toronto

Available Media

It is increasingly common that large volumes of production data originate from geographically distributed datacenters. Processing such datasets with existing data parallel frameworks may suffer from significant slowdowns due to the much lower availability of inter-datacenter bandwidth. Thus, it is critical to optimize the delivery of inter-datacenter traffic, especially coflows that imply application-level semantics, to improve the performance of such geo-distributed applications.

In this paper, we present Siphon, a building block integrated in existing data parallel frameworks (e.g., Apache Spark) to expedite their generated inter-datacenter coflows at runtime. Specifically, Siphon serves as a transport service that accelerates and schedules the inter-datacenter traffic with the awareness of workload-level dependencies and performance, while being completely transparent to analytics applications. Novel intra-coflow and inter-coflow scheduling and routing strategies have been designed and implemented in Siphon, based on a software-defined networking architecture.

On our cloud-based testbeds, we have extensively evaluated Siphon's performance in accelerating coflows generated by a broad range of workloads. With a variety of Spark jobs, Siphon can reduce the completion time of a single coflow by up to 76%. With respect to the average coflow completion time, Siphon outperforms the state-of-the-art scheme by 10%.

## PerfIso: Performance Isolation for Commercial Latency-Sensitive Services

Călin Iorgulescu, EPFL; Reza Azimi, Brown University; Youngjin Kwon, U. Texas at Austin; Sameh Elnikety, Manoj Syamala, and Vivek Narasayya, Microsoft Research; Herodotos Herodotou, Cyprus University of Technology; Paulo Tomita, Alex Chen, Jack Zhang, and Junhua Wang, Microsoft Bing

Available Media

Large commercial latency-sensitive services, such as web search, run on dedicated clusters provisioned for peak load to ensure responsiveness and tolerate data center outages. As a result, the average load is far lower than the peak load used for provisioning, leading to resource under-utilization. The idle resources can be used to run batch jobs, completing useful work and reducing overall data center provisioning costs. However, this is challenging in practice due to the complexity and stringent tail-latency requirements of latency-sensitive services. Left unmanaged, the competition for machine resources can lead to severe response-time degradation and unmet service-level objectives (SLOs).

This work describes PerfIso, a performance isolation framework which has been used for nearly three years in Microsoft Bing, a major search engine, to colocate batch jobs with production latency-sensitive services on over 90,000 servers. We discuss the design and implementation of PerfIso, and conduct an experimental evaluation in a production environment. We show that colocating CPU-intensive jobs with latency-sensitive services increases average CPU utilization from 21% to 66% for off-peak load without impacting tail latency.

## On the diversity of cluster workloads and its impact on research results

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, and Garth A. Gibson, Carnegie Mellon University; Elisabeth Baseman and Nathan DeBardeleben, Los Alamos National Laboratory

Available Media

Six years ago, Google released an invaluable set of scheduler logs which has already been used in more than 450 publications. We find that the scarcity of other data sources, however, is leading researchers to overfit their work to Google's dataset characteristics. We demonstrate this overfitting by introducing four new traces from two private and two High Performance Computing (HPC) clusters. Our analysis shows that the private cluster workloads, consisting of data analytics jobs expected to be more closely related to the Google workload, display more similarity to the HPC cluster workloads. This observation suggests that additional traces should be considered when evaluating the generality of new research.

To aid the community in moving forward, we release the four analyzed traces, including: the longest publicly available trace spanning all 61 months of an HPC cluster's lifetime and a trace from a 300,000-core HPC cluster, the largest cluster with a publicly available trace. We present an analysis of the private and HPC cluster traces that spans job characteristics, workload heterogeneity, resource utilization, and failure rates. We contrast our findings with the Google trace characteristics and identify affected work in the literature. Finally, we demonstrate the importance of dataset plurality and diversity by evaluating the performance of a job runtime predictor using all four of our traces and the Google trace.

## SLAOrchestrator: Reducing the Cost of Performance SLAs for Cloud Data Analytics

Jennifer Ortiz, Brendan Lee, and Magdalena Balazinska, University of Washington; Johannes Gehrke, Microsoft; Joseph L. Hellerstein, eScience Institute

Available Media

SLAOrchestrator is a new system designed to reduce the price increases necessary to support performance SLAs in cloud analytics systems. SLAOrchestrator is designed for SLAs that guarantee per-query execution times. Its core architecture consists of a double learning loop that improves both SLAs and resource management over time. It further utilizes an efficient combination of elastic query scheduling and multi-tenant resource provisioning algorithms to reduce the costs of performance guarantees.

Refereed Papers Track II

## Analyzing Code

Session Chair: Gala Yadgar, Technion Israel Institute of Technology

Essex Ballroom Center

## Spindle: Informed Memory Access Monitoring

Haojie Wang, Tsinghua University, Qatar Computing Research Institute; Jidong Zhai, Tsinghua University; Xiongchao Tang, Tsinghua University, Qatar Computing Research Institute; Bowen Yu, Tsinghua University; Xiaosong Ma, Qatar Computing Research Institute; Wenguang Chen, Tsinghua University

Available Media

Memory monitoring is of critical use in understanding applications and evaluating systems. Due to the dynamic nature in programs' memory accesses, common practice today leaves large amounts of address examination and data recording at runtime, at the cost of substantial performance overhead (and large storage time/space consumption if memory traces are collected).

Recognizing the memory access patterns available at compile time and redundancy in runtime checks, we propose a novel memory access monitoring and analysis framework, Spindle. Unlike methods delaying all checks to runtime or performing task-specific optimization at compile time, Spindle performs common static analysis to identify predictable memory access patterns into a compact program structure summary. Custom memory monitoring tools can then be developed on top of Spindle, leveraging the structural information extracted to dramatically reduce the amount of instrumentation that incurs heavy runtime memory address examination or recording. We implement Spindle in the popular LLVM compiler, supporting both single-thread and multi-threaded programs. Our evaluation demonstrated the effectiveness of two Spindle-based tools, performing memory bug detection and trace collection respectively, with a variety of programs from areas such as scientific computing, data analytics, graph processing, and key-value store. Results show that these tools are able to aggressively prune online memory monitoring processing, fulfilling desired tasks with performance overhead significantly reduced (2.54x on average for memory bug detection and over 200x on average for access tracing, over state-of-the-art solutions).

## Touchstone: Generating Enormous Query-Aware Test Databases

Yuming Li and Rong Zhang, East China Normal University; Xiaoyan Yang and Zhenjie Zhang, Singapore R&D, Yitu Technology Ltd.; Aoying Zhou, East China Normal University

Available Media

Query-aware synthetic data generation is an essential and highly challenging task, important for database management system (DBMS) testing, database application testing and application-driven benchmarking. Prior studies on query-aware data generation suffer common problems of limited parallelization, poor scalability, and excessive memory consumption, making these systems unsatisfactory to terabyte scale data generation. In order to fill the gap between the existing data generation techniques and the emerging demands of enormous query-aware test databases, we design and implement our new data generator, called {\em Touchstone}. {\em Touchstone} adopts the random sampling algorithm instantiating the query parameters and the new data generation schema generating the test database, to achieve fully parallel data generation, linear scalability and austere memory consumption. Our experimental results show that {\em Touchstone} consistently outperforms the state-of-the-art solution on TPC-H workload by a 1000$\times$ speedup without sacrificing accuracy.

## DSAC: Effective Static Analysis of Sleep-in-Atomic-Context Bugs in Kernel Modules

Jia-Ju Bai and Yu-Ping Wang, Tsinghua University; Julia Lawall, Sorbonne Université/Inria/LIP6; Shi-Min Hu, Tsinghua University

Available Media

In a modern OS, kernel modules often use spinlocks and interrupt handlers to monopolize a CPU core for executing concurrent code in atomic context. In this situation, if the kernel module performs an operation that can sleep at runtime, a system hang may occur in execution. We refer to this kind of concurrency bug as a sleep-in-atomic-context (SAC) bug. In practice, SAC bugs have received insufficient attention and are hard to find, as they do not always cause problems in real executions.

In this paper, we propose a practical static approach named DSAC, to effectively detect SAC bugs and automatically recommend patches to help fix them. DSAC uses four key techniques: (1) a hybrid of flow-sensitive and -insensitive analysis to perform accurate and efficient code analysis; (2) a heuristics-based method to accurately extract sleep-able kernel interfaces that can sleep at runtime; (3) a path-check method to effectively filter out repeated reports and false bugs; (4) a pattern-based method to automatically generate recommended patches to help fix the bugs.

We evaluate DSAC on kernel modules (drivers, file systems, and network modules) of the Linux kernel, and on the FreeBSD and NetBSD kernels, and in total find 401 new real bugs. 272 of these bugs have been confirmed by the relevant kernel maintainers, and 43 patches generated by DSAC have been applied by kernel maintainers.

## Coccinelle: 10 Years of Automated Evolution in the Linux Kernel

Julia Lawall and Gilles Muller, Sorbonne University/Inria/LIP6

Available Media

The Coccinelle C-program matching and transformation tool was first released in 2008 to facilitate specification and automation in the evolution of Linux kernel code. The novel contribution of Coccinelle was that it allows software developers to write code manipulation rules in terms of the code structure itself, via a generalization of the patch syntax. Over the years, Coccinelle has been extensively used in Linux kernel development, resulting in over 6000 commits to the Linux kernel, and has found its place as part of the Linux kernel development process. This paper studies the impact of Coccinelle on Linux kernel development and the features of Coccinelle that have made it possible. It can provide guidance on how other research-based tools can achieve practical impact in the open-source development community.

## Break with Refreshments

Essex Ballroom Foyer

### 4:10 pm–5:50 pm

Refereed Papers Track I

## Big Data 2

Session Chair: Patrick Stuedi, IBM Research

Essex Ballroom North

## Albis: High-Performance File Format for Big Data Systems

Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler, IBM Research, Zurich

Available Media

Over the last decade, a variety of external file formats such as Parquet, ORC, Arrow, etc., have been developed to store large volumes of relational data in the cloud. As high-performance networking and storage devices are used pervasively to process this data in frameworks like Spark and Hadoop, we observe that none of the popular file formats are capable of delivering data access rates close to the hardware. Our analysis suggests that multiple antiquated notions about the nature of I/O in a distributed setting, and the preference for the "storage efficiency" over performance is the key reason for this gap.

In this paper we present Albis, a high-performance file format for storing relational data on modern hardware. Albis is built upon two key principles: (i) reduce the CPU cost by keeping the data/metadata storage format simple; (ii) use a binary API for an efficient object management to avoid unnecessary object materialization. In our evaluation, we demonstrate that in micro-benchmarks Albis delivers 1.9-21.4x faster bandwidths than other formats. At the workload-level, Albis in Spark/SQL reduces the runtimes of TPC-DS queries up to a margin of 3x.

## Litz: Elastic Framework for High-Performance Distributed Machine Learning

Aurick Qiao, Petuum, Inc. and Carnegie Mellon University; Abutalib Aghayev, Carnegie Mellon University; Weiren Yu, Petuum, Inc. and Beihang University; Haoyang Chen and Qirong Ho, Petuum, Inc.; Garth A. Gibson, Carnegie Mellon University and Vector Institute; Eric P. Xing, Petuum, Inc. and Carnegie Mellon University

Available Media

Machine Learning (ML) is an increasingly popular application in the cloud and data-center, inspiring new algorithmic and systems techniques that leverage unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic, or offer programming models which are incompatible with the techniques employed by high-performance ML applications.

Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes --- stateful workers, model scheduling, and relaxed consistency --- which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently enables elasticity and low-overhead execution.

We implement several popular ML applications using Litz, and show that they can scale in and out quickly to adapt to changing resource availability, as well as how a scheduler can leverage elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, achieving competitive performance with state-of-the-art non-elastic ML frameworks.

## Putting the "Micro" Back in Microservice

Sol Boucher, Anuj Kalia, and David G. Andersen, Carnegie Mellon University; Michael Kaminsky, Intel Labs

Available Media

Modern cloud computing environments strive to provide users with fine-grained scheduling and accounting, as well as seamless scalability. The most recent face to this trend is the “serverless” model, in which individual functions, or microservices, are executed on demand. Popular implementations of this model, however, operate at a relatively coarse granularity, occupying resources for minutes at a time and requiring hundreds of milliseconds for a cold launch. In this paper, we describe a novel design for providing “functions as a service” (FaaS) that attempts to be truly micro: cold launch times in microseconds that enable even finer-grained resource accounting and support latency-critical applications. Our proposal is to eschew much of the traditional serverless infrastructure in favor of language-based isolation. The result is microsecond-granularity launch latency, and microsecond-scale preemptive scheduling using high-precision timers.

## Fast and Concurrent RDF Queries using RDMA-assisted GPU Graph Exploration

Siyuan Wang, Chang Lou, Rong Chen, and Haibo Chen, Shanghai Jiao Tong University

Available Media

RDF graph has been increasingly used to store and represent information shared over the Web, including social graphs and knowledge bases. With the increasing scale of RDF graphs and the concurrency level of SPARQL queries, current RDF systems are confronted with inefficient concurrent query processing on massive data parallelism, which usually leads to suboptimal response time (latency) as well as throughput.

In this paper, we present Wukong+G, the first graph-based distributed RDF query processing system that efficiently exploits the hybrid parallelism of CPU and GPU. Wukong+G is made fast and concurrent with three key designs. First, Wukong+G utilizes GPU to tame random memory accesses in graph exploration by efficiently mapping data between CPU and GPU for latency hiding, including a set of techniques like query-aware prefetching, pattern-aware pipelining and fine-grained swapping. Second, Wukong+G scales up by introducing a GPU-friendly RDF store to support RDF graphs exceeding GPU memory size, by using techniques like predicate- based grouping, pairwise caching and look-ahead replacing to narrow the gap between host and device memory scale. Third, Wukong+G scales out through a communication layer that decouples the transferring process for query metadata and intermediate results, and leverages both native and GPUDirect RDMA to enable efficient communication on a CPU/GPU cluster.

We have implemented Wukong+G by extending a state-of-the-art distributed RDF store (i.e., Wukong) with distributed GPU support. Evaluation on a 5-node CPU/GPU cluster (10 GPU cards) with RDMA-capable network shows that Wukong+G outperforms Wukong by 2.3X-9.0X in the single heavy query latency and improves latency and throughput by more than one order of magnitude when facing hybrid workloads.

Refereed Papers Track II

## SSDs

Session Chair: Cheng Li, VMware

Essex Ballroom Center

## MDev-NVMe: A NVMe Storage Virtualization Solution with Mediated Pass-Through

Bo Peng, Shanghai Jiao Tong University, Intel; Haozhong Zhang, Intel; Jianguo Yao, Shanghai Jiao Tong University; Yaozu Dong, Intel; Yu Xu and Haibing Guan, Shanghai Jiao Tong University

Available Media

The fast access to data and high parallel processing in high-performance computing instigates an urgent demand on the I/O improvement of the NVMe storage within datacenters. However, unsatisfactory performance of the former NVMe virtualization demonstrates that NVMe storage devices are often underutilized within cloud computing platforms. NVMe virtualization with high performance and device sharing has captured the attention of researchers. This paper introduces MDev-NVMe, a new virtualization implementation for NVMe storage device with: (1) full NVMe storage virtualization running native NVMe driver in guest, and (2) a mediated pass-through mechanism with an active polling mode which can achieve both high throughput, low latency performance and a good device scalability. This paper subsequently evaluates MDev-NVMe on Intel OPTANE and P3600 NVMe SSD by comparison with the mainstream virtualization mechanisms using application-level I/O benchmarks. With polling, MDev-NVMe can demonstrate a 142% improvement over native (interrupt-driven) throughput and over 2.5 times the Virtio throughput with only 70% native average latency and 31% Virtio average latency with a reliable scalability. Finally, the advantages of MDev-NVMe and the importance of polling are discussed, offering evidence that MDev-NVMe is a superior virtualization choice with high performance and promising levels of maintenance.

## AutoSSD: an Autonomic SSD Architecture

Bryan S. Kim, Seoul National University; Hyun Suk Yang, Hongik University; Sang Lyul Min, Seoul National University

Available Media

## Geriatrix: Aging what you see and what you don’t see. A file system aging approach for modern storage systems

Saurabh Kadekodi, Vaishnavh Nagarajan, and Gregory R. Ganger, Carnegie Mellon University; Garth A. Gibson, Carnegie Mellon University, Vector Institute

Available Media

File system performance on modern primary storage devices (Flash-based SSDs) is greatly affected by aging of the free space, much more so than were mechanical disk drives. We introduce Geriatrix, a simple-to-use profile driven file system aging tool that induces target levels of fragmentation in both allocated files (what you see) and remaining free space (what you don't see), unlike previous approaches that focus on just the former. This paper describes and evaluates the effectiveness of Geriatrix, showing that it recreates both fragmentation effects better than previous approaches. Using Geriatrix, we show that measurements presented in many recent file systems papers are higher than should be expected, by up to 30% on mechanical (HDD) and up to 75% on Flash (SSD) disks. Worse, in some cases, the performance rank ordering of file system designs being compared are different from the published results.

Geriatrix will be released as open source software with eight built-in aging profiles, in the hopes that it can address the need created by the increased performance impact of file system aging in modern SSD-based storage.

## Can’t We All Get Along? Redesigning Protection Storage for Modern Workloads

Yamini Allu, Fred Douglis, Mahesh Kamat, Ramya Prabhakar, Philip Shilane, and Rahul Ugale, Dell EMC

Available Media

Deduplication systems for traditional backups have optimized for large sequential writes and reads. Over time, new applications have resulted in nonsequential accesses, patterns reminiscent of primary storage systems. The Data Domain File System (\ddfs) needs to evolve to support these modern workloads by providing high performance for nonsequential accesses without degrading performance for traditional backup workloads.

Based on our experience with thousands of deployed systems, we have updated our storage software to distinguish user workloads and apply optimizations including leveraging solid-state disk (SSD) caches. Since SSDs are still significantly more expensive than magnetic disks, we make our system cost-effective by caching metadata and file data rather than moving everything to SSD. We dynamically detect access patterns to decide when to cache, prefetch, and perform numerous other optimizations. We find that on a workload with nonsequential accesses, with SSDs for caching metadata alone, we measured a 5.7$\times$ improvement on input/output operations per second (IOPS) when compared to a baseline without SSDs. Combining metadata and data caching in SSDs, we measured a further 1.7$\times$ IOPS increase. Adding software optimizations throughout our system added an additional 2.7$\times$ IOPS improvement for nonsequential workloads. Overall, we find that both hardware and software changes are necessary to support the new mix of sequential and nonsequential workloads at acceptable cost. Our updated system is sold to customers worldwide.

## Poster Session and Reception

Essex Ballroom South

Posters of the papers presented in the Technical Sessions on Thursday and Friday will be on display. View the list of accepted posters.

## Continental Breakfast

Essex Ballroom Foyer

### 9:00 am–10:15 am

Refereed Papers Track I

## The Network

Session Chair: Lalith Suresh, VMware Research

Essex Ballroom North

## STMS: Improving MPTCP Throughput Under Heterogeneous Networks

Hang Shi and Yong Cui, Tsinghua University; Xin Wang, Stony Brook University; Yuming Hu and Minglong Dai, Tsinghua University; Fanzhao Wang and Kai Zheng, Huawei Technologies

Available Media

Using multiple interfaces on mobile devices to get high throughput is promising to improve the user experience. However, Multipath TCP (MPTCP), the de-facto standardized solution, suffers when different paths have heterogeneous quality. This problem is especially severe when the difference is the path latency. Our experimental results show that it causes the burst sending of packets from the fast path, which requires the in-network buffer to be big to achieve the full benefit of the bandwidth aggregation. In addition, it also requires bigger host buffer to fully utilize the fast path. To solve these problems, we propose and implement a new scheduler, which pre-allocates packets to send over the fast path for in-order arrival. Instead of relying on the estimation of network path condition, our scheduler dynamically adapts the MPTCP-level send window based on the packets acknowledged. Our evaluation shows that our scheduler can improve the throughput by 30% when the in-network buffer is limited, 15% when the host buffer is limited.

## Pantheon: the training ground for Internet congestion-control research

Francis Y. Yan, Jestin Ma, and Greg D. Hill, Stanford University; Deepti Raghavan, Massachusetts Institute of Technology; Riad S. Wahby, Philip Levis, and Keith Winstein, Stanford University
Awarded Best Paper!

Available Media

Internet transport algorithms are foundational to the performance of network applications. But a number of practical challenges make it difficult to evaluate new ideas and algorithms in a reproducible manner. We present the Pantheon, a system that addresses this by serving as a community "training ground" for research on Internet transport protocols and congestion control (https://pantheon.stanford.edu). It allows network researchers to benefit from and contribute to a common set of benchmark algorithms, a shared evaluation platform, and a public archive of results.

We present three results showing the Pantheon's value as a research tool. First, we describe a measurement study from more than a year of data, indicating that congestion-control schemes vary dramatically in their relative performance as a function of path dynamics. Second, the Pantheon generates calibrated network emulators that capture the diverse performance of real Internet paths. These enable reproducible and rapid experiments that closely approximate real-world results. Finally, we describe the Pantheon's contribution to developing new congestion-control schemes, two of which were published at USENIX NSDI 2018, as well as data-driven neural-network-based congestion-control schemes that can be trained to achieve good performance over the real Internet.

## ClickNF: a Modular Stack for Custom Network Functions

Massimo Gallo and Rafael Laufer, Nokia Bell Labs

Available Media

Network function virtualization has recently allowed specialized equipment to be replaced with equivalent software implementation. The Click router was a first step in this direction, defining a modular platform for generalized packet processing. Despite its major impact, however, Click does not provides native L4 implementation and only uses nonblocking I/O, limiting its scope to L2-L3 network functions. To overcome these limitations we introduce ClickNF, which provides modular transport and application-layer building blocks for the development of middleboxes and server-side network functions. We evaluate ClickNF to highlight its state-of-the-art performance and showcase its modularity by composing complex functions from simple elements. ClickNF is open source and publicly available.

Refereed Papers Track II

## Storage 1

Session Chair: Xing Lin, NetApp

Essex Ballroom Center

## Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics

Ana Klimovic, Stanford University; Heiner Litz, UC Santa Cruz; Christos Kozyrakis, Stanford University

Available Media

Data analytics are an important class of data-intensive workloads on public cloud services. However, selecting the right compute and storage configuration for these applications is difficult as the space of available options is large and the interactions between options are complex. Moreover, the different data streams accessed by analytics workloads have distinct characteristics that may be better served by different types of storage devices.

We present Selecta, a tool that recommends near-optimal configurations of cloud compute and storage resources for data analytics workloads. Selecta uses latent factor collaborative filtering to predict how an application will perform across different configurations, based on sparse data collected by profiling training workloads. We evaluate Selecta with over one hundred Spark SQL and ML applications, showing that Selecta chooses a near-optimal performance configuration (within 10% of optimal) with 94% probability and a near-optimal cost configuration with 80% probability. We also use Selecta to draw significant insights about cloud storage systems, including the performance-cost efficiency of NVMe Flash devices, the need for cloud storage with support for fine-grain capacity and bandwidth allocation, and the motivation for end-to-end storage optimizations.

## Remote regions: a simple abstraction for remote memory

Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei, VMware

Available Media

We propose an intuitive abstraction for a process to export its memory to remote hosts, and to access the memory exported by others. This abstraction provides a simpler interface to RDMA and other remote memory technologies compared to the existing verbs interface. The key idea is that a process can export parts of its memory as files, called remote regions, that can be accessed through the usual file system operations (read, write, memory map, etc). We built this abstraction and evaluated it. We show that remote regions are easy to use and perform close to RDMA. We demonstrate it via micro-benchmarks and by modifying two in-memory single-host applications to use remote memory: R and Metis. These modifications amount to ≈100 lines of code; they allow R to exceed the physical memory of a host while running fat; and they allow Metis scale its performance across 8 hosts.

## Understanding Ephemeral Storage for Serverless Analytics

Ana Klimovic, Yawen Wang, and Christos Kozyrakis, Stanford University; Patrick Stuedi, Jonas Pfefferle, and Animesh Trivedi, IBM Research

Available Media

Serverless computing frameworks allow users to launch thousands of concurrent tasks with high elasticity and fine-grain resource billing without explicitly managing computing resources. While already successful for IoT and web microservices, there is increasing interest in leveraging serverless computing to run data-intensive jobs, such as interactive analytics. A key challenge in running analytics workloads on serverless platforms is enabling tasks in different execution stages to efficiently communicate data between each other via a shared data store. In this paper, we explore the suitability of different cloud storage services (e.g., object stores and distributed caches) as remote storage for serverless analytics. Our analysis leads to key insights to guide the design of an ephemeral cloud storage system, including the performance and cost efficiency of Flash storage for serverless application requirements and the need for a pay-what-you-use storage service that can support the high throughput demands of highly parallel applications.

## Break with Refreshments

Essex Ballroom Foyer

### 10:45 am–12:25 pm

Refereed Papers Track I

## Transactions

Essex Ballroom North

## Solar: Towards a Shared-Everything Database on Distributed Log-Structured Storage

Tao Zhu, East China Normal University; Zhuoyue Zhao and Feifei Li, University of Utah; Weining Qian and Aoying Zhou, East China Normal University; Dong Xie and Ryan Stutsman, University of Utah; Haining Li, Bank of Communications; Huiqi Hu, East China Normal University; Bank of Communications

Available Media

Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Though modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This paper presents Solar, a distributed relational database system that has been successfully deployed at a large commercial bank. The key features of Solar include: 1) a shared-everything architecture based on a two-layer log-structured merge-tree; 2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; 3) fine-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank and a real-world workload, Solar outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.

## Toward Coordination-free and Reconfigurable Mixed Concurrency Control

Dixin Tang and Aaron J. Elmore, University of Chicago

Available Media

Recent studies show that mixing concurrency control protocols within a single database can significantly outperform a single protocol. However, prior projects to mix concurrency control either are limited to specific pairs of protocols (e.g mixing two-phase locking (2PL) and optimistic concurrency control (OCC)) or introduce extra concurrency control overhead to guarantee their general applicability, which can be a performance bottleneck. In addition, due to unknown and shifting access patterns within a workload, candidate protocols should be chosen dynamically in response to workload changes. This requires changing candidate protocols online without having to stop the whole system, which prior work does not fully address. To resolve these two issues, we present CormCC, a general mixed concurrency control framework with no coordination overhead across candidate protocols while supporting the ability to change a protocol online with minimal overhead. Based on this framework, we build a prototype main-memory multi-core database to dynamically three popular protocols. Our experiments show CormCC has significantly higher throughput compared with single protocols and state-of-the-art mixed concurrency control approaches.

## Scaling Hardware Accelerated Network Monitoring to Concurrent and Dynamic Queries With *Flow

John Sonchack, University of Pennsylvania; Oliver Michel, University of Colorado Boulder; Adam J. Aviv, United States Naval Academy; Eric Keller, University of Colorado Boulder; Jonathan M. Smith, University of Pennsylvania

Available Media

Measurement plays a key role in network operation and management. An important but unaddressed practical requirement in high speed networks is supporting concurrent applications with diverse and potentially dynamic measurement objectives. We introduce Flow, a switch accelerated telemetry system for efficient, concurrent, and dynamic measurement. The design insight is to carefully partition processing between switch ASICs and application software. In Flow, the switch ASIC implements a pipeline that exports telemetry data in a flexible format that allows applications to efficiently compute many different statistics. Applications can operate concurrently and dynamically on identical streams without impacting each other. We implement *Flow as a line rate P4 program for a 3.2 Tb/s commodity switch and evaluate it with four example monitoring applications. The applications can operate concurrently and dynamically, while scaling to measure terabit rate traffic with a single commodity server.

## Applying Hardware Transactional Memory for Concurrency-Bug Failure Recovery in Production Runs

Yuxi Chen, Shu Wang, and Shan Lu, University of Chicago; Karthikeyan Sankaralingam, University of Wisconsin — Madison

Available Media

Concurrency bugs widely exist and severely threaten system availability. Techniques that help recover from concurrency-bug failures during production runs are highly desired. This paper proposes BugTM, an approach that leverages Hardware Transactional Memory (HTM) on commodity machines for production-run concurrency-bug recovery. Requiring no knowledge about where are concurrency bugs, BugTM uses static analysis and code transformation to insert HTM instructions into multi-threaded programs. These BugTM-transformed programs will then be able to recover from a concurrency-bug failure by rolling back and re-executing the recent history of a failure thread. BugTM greatly improves the recovery capability of state-of-the-art techniques with low run-time overhead and no changes to OS or hardware, while guarantees not to introduce new bugs.

Refereed Papers Track II

## Storage 2

Session Chair: George Amvrosiadis, Carnegie Mellon University

Essex Ballroom Center

## Tailwind: Fast and Atomic RDMA-based Replication

Yacine Taleb, Univ Rennes, Inria, CNRS, IRISA; Ryan Stutsman, University of Utah; Gabriel Antoniu, Univ Rennes, Inria, CNRS, IRISA; Toni Cortes, BSC, UPC

Available Media

Replication is essential for fault-tolerance. However, in in-memory systems, it is a source of high overhead. Remote direct memory access (RDMA) is attractive to create redundant copies of data, since it is low-latency and has no CPU overhead at the target. However, existing approaches still result in redundant data copying and active receivers. To ensure atomic data transfers, receivers check and apply only fully received messages. Tailwind is a zero-copy recovery-log replication protocol for scale-out in-memory databases. Tailwind is the first replication protocol that eliminates {\em all} CPU-driven data copying and fully bypasses target server CPUs, thus leaving backups idle. Tailwind ensures all writes are atomic by leveraging a protocol that detects incomplete RDMA transfers. Tailwind substantially improves replication throughput and response latency compared with conventional RPC-based replication. In symmetric systems where servers both serve requests and act as replicas, Tailwind also improves normal-case throughput by freeing server CPU resources for request processing. We implemented and evaluated Tailwind on RAMCloud, a low-latency in-memory storage system. Experiments show Tailwind improves RAMCloud's normal-case request processing throughput by 1.7$\times$. It also cuts down writes median and 99\textsuperscript{th} percentile latencies by 2x and 3x respectively.

## On Fault Tolerance, Locality, and Optimality in Locally Repairable Codes

Oleg Kolosov, School of Electrical Engineering, Tel Aviv University; Gala Yadgar, Computer Science Department, Technion and School of Electrical Engineering, Tel Aviv University; Matan Liram, Computer Science Department, Technion; Itzhak Tamo, School of Electrical Engineering, Tel Aviv University; Alexander Barg, Department of ECE/ISR, University of Maryland

Available Media

Erasure codes are used in large-scale storage systems to allow recovery of data from a failed node. A recently developed class of erasure codes, termed locally repairable codes (LRCs), offers tradeoffs between storage overhead and repair cost. LRCs facilitate more efficient recovery scenarios by storing additional parity blocks in the system, but these additional blocks may eventually increase the number of blocks that must be reconstructed. Existing codes differ in their use of the additional parity blocks, but also in their locality semantics and in the parameters for which they are defined. As a result, existing theoretical models cannot be used to directly compare different LRCs to determine which code will offer the best recovery performance, and at what cost.

In this study, we perform the first systematic comparison of existing LRC approaches. We analyze Xorbas, Azure’s LRCs, and the recently proposed Optimal-LRCs in light of two new metrics: the average degraded read cost, and the normalized repair cost. We show the tradeoff between these costs and the code’s fault tolerance, and that different approaches offer different choices in this tradeoff. Our experimental evaluation on a Ceph cluster deployed on Amazon EC2 further demonstrates the different effects of realistic network and storage bottlenecks on the benefit from each examined LRC approach. Despite these differences, the normalized repair cost metric can reliably identify the LRC approach that would achieve the lowest repair cost in each setup.

## TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions

Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, and Tianyu Cheng, The University of Texas at Austin; Vijay Chidambaram, The University of Texas at Austin and VMware Research; Emmett Witchel, The University of Texas at Austin
Awarded Best Paper!

Available Media

We introduce TxFS, a novel transactional file system that builds upon a file system’s atomic-update mechanism such as journaling. Though prior work has explored a number of transactional file systems, TxFS has a unique set of properties: a simple API, portability across different hardware, high performance, low complexity (by building on the journal), and full ACID transactions. We port SQLite and Git to use TxFS, and experimentally show that TxFS provides strong crash consistency while providing equal or better performance.

## Towards Better Understanding of Black-box Auto-Tuning: A Comparative Analysis for Storage Systems

Zhen Cao, Stony Brook University; Vasily Tarasov, IBM Research - Almaden; Sachin Tiwari and Erez Zadok, Stony Brook University

Available Media

Modern computer systems come with a large number of configurable parameters that control their behavior. Tuning system parameters can provide significant gains in performance but is challenging because of the immense number of configurations and complex, nonlinear system behavior. In recent years, several studies attempted to automate the tuning of system configurations; but they all applied only one or few optimization methods. In this paper, for the first time, we apply and then perform comparative analysis of multiple blackbox optimization techniques on storage systems, which are often the slowest components of computing systems. Our experiments were conducted on a parameter space consisting of nearly 25,000 unique configurations and over 450,000 data points. We compared these methods for their ability to find near-optimal configurations, convergence time, and instantaneous system throughput during auto-tuning. We found that optimal configurations differed by hardware, software, and workloads -- and that no one technique was superior to all others. Based on the results and domain expertise, we begin to explain the efficacy of these important automated blackbox optimization methods from a systems perspective.

### 2:00 pm–3:40 pm

Refereed Papers Track I

## Data Center/Machine Learning

Essex Ballroom North

## HeavyKeeper: An Accurate Algorithm for Finding Top-k Elephant Flows

Junzhi Gong, Tong Yang, Haowei Zhang, and Hao Li, Peking University; Steve Uhlig, Queen Mary, University of London; Shigang Chen, University of Florida; Lorna Uden, Staffordshire University; Xiaoming Li, Peking University

Available Media

Finding top-k elephant flows is a critical task in network traffic measurement, with many applications in congestion control, anomaly detection and traffic engineering. As the line rates keep increasing in today's networks, designing accurate and fast algorithms for online identification of elephant flows becomes more and more challenging. The prior algorithms are seriously limited in achieving accuracy under the constraints of heavy traffic and small on-chip memory in use. We observe that the basic strategies adopted by these algorithms either require significant space overhead to measure the sizes of all flows or incur significant inaccuracy when deciding which flows to keep track of. In this paper, we adopt a new strategy, called count-with-exponential-decay, to achieve space-accuracy balance by actively removing small flows through decaying, while minimizing the impact on large flows, so as to achieve high precision in finding top-k elephant flows. Moreover, the proposed algorithm called HeavyKeeper incurs small, constant processing overhead per packet and thus supports high line rates. Experimental results show that HeavyKeeper algorithm achieves 99.99% precision with a small memory size, and reduces the error by around 3 orders of magnitude on average compared to the state-of-the-art.

## SAND: Towards High-Performance Serverless Computing

Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt, Nokia Bell Labs

Available Media

Serverless computing has emerged as a new cloud computing paradigm, where an application consists of individual functions that can be separately managed and executed. However, existing serverless platforms normally isolate and execute functions in separate containers, and do not exploit the interactions among functions for performance. These practices lead to high startup delays for function executions and inefficient resource usage. This paper presents SAND, a new serverless computing system that provides lower latency, better resource efficiency and more elasticity than existing serverless platforms. To achieve these properties, SAND introduces two key techniques: 1) application-level sandboxing, and 2) a hierarchical message bus. We have implemented and deployed a complete SAND system. Our results show that SAND outperforms the state-of-the-art serverless platforms significantly. For example, in a commonly-used image processing application, SAND achieves a 43% speedup compared to Apache OpenWhisk.

## Cavs: An Efficient Runtime System for Dynamic Neural Networks

Shizhen Xu, Carnegie Mellon University, Tsinghua University; Hao Zhang, Graham Neubig, and Wei Dai, Carnegie Mellon University, Petuum Inc.; Jin Kyu Kim, Carnegie Mellon University; Zhijie Deng, Tsinghua University; Qirong Ho, Petuum Inc.; Guangwen Yang, Tsinghua University; Eric P. Xing, Petuum Inc.

Available Media

Recent deep learning (DL) models are moving more and more to dynamic neural network (NN) architectures, where the NN structure changes for every data sample. However, existing DL programming models are inefficient in handling dynamic network architectures because of: (1) substantial overhead caused by repeating dataflow graph construction and processing every example; (2) difficulties in batched execution of multiple samples; (3) inability to incorporate graph optimization techniques such as those used in static graphs. In this paper, we present Cavs'', a runtime system that overcomes these bottlenecks and achieves efficient training and inference of dynamic NNs. Cavs represents a dynamic NN as a static vertex function $\mathcal{F}$ and a dynamic instance-specific graph $\mathcal{G}$. It avoids the overhead of repeated graph construction by only declaring and constructing $\mathcal{F}$ once, and allows for the use of static graph optimization techniques on pre-defined operations in $\mathcal{F}$. Cavs performs training and inference by scheduling the execution of $\mathcal{F}$ following the dependencies in $\mathcal{G}$, hence naturally exposing batched execution opportunities over different samples. Experiments comparing Cavs to state-of-the-art frameworks for dynamic NNs (TensorFlow Fold, PyTorch and DyNet) demonstrate the efficacy of our approach: Cavs achieves a near one order of magnitude speedup on training of dynamic NN architectures, and ablations verify the effectiveness of our proposed design and optimizations.

## DeepCPU: Serving RNN-based Deep Learning Models 10x Faster

Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research

Available Media

Recurrent neural networks (RNNs) are an important class of deep learning (DL) models. Existing DL frameworks have unsatisfying performance for online serving: many RNN models suffer from long serving latency and high cost, preventing their deployment in production.

This work characterizes RNN performance and identifies low data reuse as a root cause. We develop novel techniques and an efficient search strategy to squeeze more data reuse out of this intrinsically challenging workload. We build DeepCPU, a fast serving library on CPUs, to integrate these optimizations for efficient RNN computation. Our evaluation on various RNN models shows that DeepCPU improves latency and efficiency by an order of magnitude on CPUs compared with existing DL frameworks such as TensorFlow. It also empowers CPUs to beat GPUs on RNN serving. In production services of Microsoft, DeepCPU transforms many models from non-shippable (due to latency SLA violation) to shippable (well-fitting latency requirements) and saves millions of dollars of infrastructure costs.

Refereed Papers Track II

## Key/Value Storage

Session Chair: Carlos Maltzahn, University of California, Santa Cruz

Essex Ballroom Center

## Closing the Performance Gap Between Volatile and Persistent Key-Value Stores Using Cross-Referencing Logs

Yihe Huang, Harvard University; Matej Pavlovic, EPFL; Virendra Marathe, Margo Seltzer, Tim Harris, and Steve Byan, Oracle Labs

Available Media

Key-Value (K-V) stores are an integral building block in modern datacenter applications. With byteaddressable persistent memory (PM) technologies, such as Intel/Micron’s 3D XPoint, on the horizon, there has been an influx of new high performance K-V stores that leverage PM for performance. However, there remains a significant performance gap between PM optimized K-V stores and DRAM resident ones, largely reflecting the gap between projected PM latency relative to that of DRAM. We address that performance gap with Bullet, a K-V store that leverages both the byte-addressability of PM and the lower latency of DRAM, using a technique called cross-referencing logs (CRLs) to keep PM updates off the critical path. Bullet delivers performance approaching that of DRAM resident K-V stores by maintaining two hash tables, one in the slower (backend) PM and the other in the faster (frontend) DRAM. CRLs are a scalable persistent logging mechanism that keeps the two copies mutually consistent. Bullet also incorporates several critical optimizations, such as dynamic load balancing between frontend and backend threads, support for nonblocking Gets, and opportunistic omission of stale updates in the backend. This combination of implementation techniques delivers performance within 5% of that of DRAM-only key-value stores for realistic (read-heavy) workloads. Our general approach, based on CRLs, is “universal” in that it can be used to turn any volatile K-V store into a persistent one (or vice-versa, provide a fast cache for a persistent K-V store).

## Metis: Robustly Tuning Tail Latencies of Cloud Systems

Zhao Lucis Li, USTC; Chieh-Jan Mike Liang, Microsoft Research; Wenjia He, USTC; Lianjie Zhu, Wenjun Dai, and Jin Jiang, Microsoft Bing Ads; Guangzhong Sun, USTC

Available Media

Tuning configurations is essential for operating modern cloud systems, but the difficulty arises from the cloud system’s diverse workloads, large system scale, and vast parameter space. The systems community has recently demonstrated the potential of predictive regression in minimizing the cost of searching for the optimal system configuration. However, we argue that cloud systems introduce challenges to the robustness of auto-tuning. First, system evaluation metrics such as tail latencies are typically sensitive to non-trivial noises. Second, while treating target systems as a black box promotes applicability, it complicates the process of selectively sample the unknown configuration-vs-performance space for modeling. To this end, Metis is an auto-tuning service used by several Microsoft services, and it implements customized Bayesian optimization to robustly improve auto-tuning: (1) the diagnostic model to find potential data outliers for re-sampling, and (2) a mixture of acquisition functions to balance exploitation, exploration and re-sampling. This paper uses the Bing Ads key- value store cluster as the running example – production results show that Metis has helped to lower the 99-percentile query lookup latency by more than 20.4%. In addition, Metis-tuned configurations outperform expert-tuned configurations, while reducing the tuning time from weeks to hours.

## Redesigning LSMs for Nonvolatile Memory with NoveLSM

Sudarsun Kannan, University of Wisconsin-Madison; Nitish Bhat and Ada Gavrilovska, Georgia Tech; Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau, University of Wisconsin-Madison

Available Media

We present NoveLSM, a persistent LSM-based key-value storage system designed to exploit non-volatile memories and deliver low latency and high throughput to applications. We utilize three key techniques – a byte- addressable skip list, direct mutability of persistent state, and opportunistic read parallelism – to deliver high performance across a range of workload scenarios. Our analysis with popular benchmarks and real-world workload reveal up to a 3.8x and 2x reduction in write and read access latency compared to LevelDB. Storing all the data in a persistent skip list and avoiding block I/O provides more than 5x and 1.9x higher write throughput over LevelDB and RocksDB. Recovery time improves substantially with NoveLSM’s persistent skip list.

## HashKV: Enabling Efficient Updates in KV Storage via Hashing

Helen H. W. Chan, The Chinese University of Hong Kong; Yongkun Li, University of Science and Technology of China; Patrick P. C. Lee, The Chinese University of Hong Kong; Yinlong Xu, University of Science and Technology of China

Available Media

Persistent key-value (KV) stores mostly build on the Log-Structured Merge (LSM) tree for high write performance, yet the LSM-tree suffers from the inherently high I/O amplification. KV separation mitigates I/O amplification by storing only keys in the LSM-tree and values in separate storage. However, the current KV separation design remains inefficient under update-intensive workloads due to its high garbage collection (GC) overhead in value storage.We propose HashKV, which aims for high update performance atop KV separation under update-intensive workloads. HashKV uses hash-based data grouping, which deterministically maps values to storage space so as to make both updates and GC efficient. We further relax the restriction of such deterministic mappings via simple but useful design extensions. We compare HashKV with state-of-the-art KV stores via extensive testbed experiments, and show that HashKV achieves 4.6× throughput and 53.4% less write traffic compared to the current KV separation design.