# USENIX ATC '19 Technical Sessions

### USENIX ATC '19 Program Grid

View the program in mobile-friendly grid format.

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

### This content is available to:

USENIX ATC '19 Attendee List (PDF)
USENIX ATC '19 Proceedings Web Archive (ZIP)
View mode:

## Continental Breakfast

Grand Ballroom Prefunction

## Opening Remarks and Awards

Grand Ballroom

Program Co-Chairs: Dahlia Malkhi, VMware Research and Calibra, and Dan Tsafrir, Technion—Israel Institute of Technology & VMware Research

Grand Ballroom

## Measure, Then Build

This talk will discuss an approach to systems research based upon an iterative, empirically-driven approach, loosely known as "measure, then build." By first carefully analyzing the state of the art, one can learn what the real problems in today's systems are; by then designing and implementing new systems to solve said problems, one can ensure that one's work is both relevant and important. The talk will draw on examples in research done at the University of Wisconsin-Madison over nearly two decades, including research into Linux file systems, enterprise storage, key-value storage systems, and distributed systems.

## Remzi Arpaci-Dusseau, University of Wisconsin—Madison

Remzi Arpaci-Dusseau is the Grace Wahba professor of Computer Sciences at UW-Madison. He co-leads a research group with Professor Andrea Arpaci-Dusseau. Together, they have graduated 24 Ph.D. students and won numerous best-paper awards; many of their innovations are used by commercial systems. For their work, Andrea and Remzi received the 2018 ACM-SIGOPS Weiser award for "outstanding leadership, innovation, and impact in storage and computer systems research." Remzi has won the SACM Professor-of-the-Year award six times, the Rosner "Excellent Educator" award, and the Chancellor's Distinguished Teaching Award. Andrea and Remzi's operating systems book (www.ostep.org) is downloaded millions of times yearly and used at numerous institutions worldwide.

## Wednesday Lightning Talks

Lightning Talks Co-Chairs: Deniz Altinbuken, Google, and Aasheesh Kolli, The Pennsylvania State University and VMware Research

Grand Ballroom

Available Media

Get a sneak peek of the USENIX ATC '19 papers being presented on Wednesday, July 10. This playlist includes video previews of Wednesday's lightning talks.

## Break with Refreshments

Grand Ballroom Prefunction

Track I

## Real-World, Deployed Systems

Session Chair: Yang Wang, Ohio State University

Grand Ballroom I–VI

## The Design and Operation of CloudLab

Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, and Kirk Webb, University of Utah; Aditya Akella, University of Wisconsin—Madison; Kuangching Wang, Clemson University; Glenn Ricart, US Ignite; Larry Landweber, University of Wisconsin—Madison; Chip Elliott, Raytheon; Michael Zink and Emmanuel Cecchet, University of Massachusetts Amherst; Snigdhaswin Kar and Prabodh Mishra, Clemson University

Available Media

Given the highly empirical nature of research in cloud computing, networked systems, and related fields, testbeds play an important role in the research ecosystem. In this paper, we cover one such facility, CloudLab, which supports systems research by providing raw access to programmable hardware, enabling research at large scales, and creating a shared platform for repeatable research.

We present our experiences designing CloudLab and operating it for four years, serving nearly 4,000 users who have run over 79,000 experiments on 2,250 servers, switches, and other pieces of datacenter equipment. From this experience, we draw lessons organized around two themes. The first set comes from analysis of data regarding the use of CloudLab: how users interact with it, what they use it for, and the implications for facility design and operation. Our second set of lessons comes from looking at the ways that algorithms used "under the hood," such as resource allocation, have important---and sometimes unexpected---effects on user experience and behavior. These lessons can be of value to the designers and operators of IaaS facilities in general, systems testbeds in particular, and users who have a stake in understanding how these systems are built.

## Everyone Loves File: File Storage Service (FSS) in Oracle Cloud Infrastructure

Bradley C. Kuszmaul, Matteo Frigo, Justin Mazzola Paluska, and Alexander (Sasha) Sandler, Oracle Corporation

Available Media

File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service in Oracle Cloud Infrastructure. Using a pipelined Paxos implementation, we implemented a scalable block store that provides linearizable multipage limited-size transactions. On top of the block store, we built a scalable B-tree that provides linearizable multikey limited-size transactions. By using self-validating B-tree nodes and performing all Btree housekeeping operations as separate transactions, each key in a B-tree transaction requires only one page in the underlying block transaction. The B-tree holds the filesystem metadata. The filesystem provides snapshots by using versioned key-value pairs. The entire system is programmed using a nonblocking lock-free programming style. The presentation servers maintain no persistent local state, with any state kept in the B-tree, making it easy to scale up and failover the presentation servers. We use a non-scalable Paxos-replicated hash table to store configuration information required to bootstrap the system. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than a Linux NFS server backed by NVMe, reflecting the cost of replication.

## Zanzibar: Google’s Consistent, Global Authorization System

Ruoming Pang, Ramon Caceres, Mike Burrows, Zhifeng Chen, Pratik Dave, Nathan Germer, Alexander Golynski, Kevin Graney, and Nina Kang, Google; Lea Kissner, Humu, Inc.; Jeffrey L. Korn, Google; Abhishek Parmar, Carbon, Inc.; Christopher D. Richards and Mengzhi Wang, Google

Available Media

Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.

## IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services

Biswaranjan Panda and Deepthi Srinivasan, Nutanix Inc.; Huan Ke, University of Chicago; Karan Gupta and Vinayak Khot, Nutanix Inc.; Haryadi S. Gunawi, University of Chicago

Available Media

We address the problem of “fail-slow” fault, a fault where a hardware or software component can still function (does not fail-stop) but in much lower performance than expected. To address this, we built IASO, a peer-based, non-intrusive fail-slow detection framework that has been deployed for more than 1.5 years across 39,000 nodes in our customer sites and helped our customers reduce major outages due to fail-slow incidents. IASO primarily works based on timeout signals (a negligible overhead of monitoring) and converts them into a stable and accurate fail-slow metric. IASO can quickly and accurately isolate a slow node within minutes. Within a 7-month period, IASO managed to catch 232 fail-slow incidents in our large deployment field. In this paper, we have also assembled a large dataset of 232 fail-slow incidents along with our analysis. We found that the fail-slow annual failure rate in our field is 1.02%.

Track II

## Runtimes

Session Chair: Mark Silberstein, Technion—Israel Institute of Technology

Grand Ballroom VII–IX

## PARTISAN: Scaling the Distributed Actor Runtime

Christopher S. Meiklejohn and Heather Miller, Carnegie Mellon University; Peter Alvaro, UC Santa Cruz

Available Media

We present the design of an alternative runtime system for improved scalability and reduced latency in actor applications called PARTISAN. PARTISAN provides higher scalability by allowing the application developer to specify the network overlay used at runtime without changing application semantics, thereby specializing the network communication patterns to the application. PARTISAN reduces message latency through a combination of three predominately automatic optimizations: parallelism, named channels, and affinitized scheduling. We implement a prototype of PARTISAN in Erlang and demonstrate that PARTISAN achieves up to an order of magnitude increase in the number of nodes the system can scale to through runtime overlay selection, up to a 38.07x increase in throughput, and up to a 13.5x reduction in latency over Distributed Erlang.

## Unleashing the Power of Learning: An Enhanced Learning-Based Approach for Dynamic Binary Translation

Changheng Song, Fudan University; Wenwen Wang, Pen-Chung Yew, and Antonia Zhai, University of Minnesota; Weihua Zhang, Fudan University

Available Media

Dynamic binary translation (DBT) is a key system technology that enables many important system applications such as system virtualization and emulation. To achieve good performance, it is important for a DBT system to be equipped with high-quality translation rules. However, most translation rules in existing DBT systems are created manually with high engineering efforts and poor quality. To solve this problem, a learning-based approach was recently proposed to automatically learn semantically-equivalent translation rules, and symbolic verification is used to prove the semantic equivalence of such rules. But, they still suffer from some shortcomings.

In this paper, we first give an in-depth analysis on the constraints of prior learning-based methods and observe that the equivalence requirements are often unduly restrictive. It excludes many potentially high-quality rule candidates from being included and applied. Based on this observation, we propose an enhanced learning-based approach that relaxes such equivalence requirements but supplements them with constraining conditions to make them semantically equivalent when such rules are applied. Experimental results on SPEC CINT2006 show that the proposed approach can improve the dynamic coverage of the translation from 55.7\% to 69.1\% and the static coverage from 52.2\% to 61.8\%, compared to the original approach. Moreover, up to 1.65X performance speedup with an average of 1.19X are observed.

## Transactuations: Where Transactions Meet the Physical World

Aritra Sengupta, Tanakorn Leesatapornwongsa, and Masoud Saeida Ardekani, Samsung Research; Cesar A. Stuardo, University of Chicago

Awarded Best Paper!

Available Media

A large class of IoT applications read sensors, execute application logic, and actuate actuators. However, the lack of high-level programming abstractions compromises correctness especially in presence of failures and unwanted interleaving between applications. A key problem arises when operations on IoT devices or the application itself fails, which leads to inconsistencies between the physical state and application state, breaking application semantics and causing undesired consequences. Transactions are a well-established abstraction for correctness, but assume properties that are absent in an IoT context. In this paper, we study one such environment, smart home, and establish inconsistencies manifesting out of failures. We propose an abstraction called transactuation that empowers developers to build reliable applications. Our runtime, Relacs, implements the abstraction atop a real smart-home platform. We evaluate programmability, performance, and effectiveness of transactuations to demonstrate its potential as a powerful abstraction and execution model.

## Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code

Abhinav Jangda, Bobby Powers, Emery D. Berger, and Arjun Guha, University of Massachusetts Amherst

Available Media

All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm , a significant extension to Browsix that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08$\times$ (Firefox) and 2.5$\times$ (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.

Track I

## Filesystems

Grand Ballroom I–VI

## Extension Framework for File Systems in User space

Ashish Bijlani and Umakishore Ramachandran, Georgia Institute of Technology

Available Media

User file systems offer numerous advantages over their in-kernel implementations, such as ease of development and better system reliability. However, they incur heavy performance penalty. We observe that existing user file system frameworks are highly general; they consist of a minimal interposition layer in the kernel that simply forwards all low-level requests to user space. While this design offers flexibility, it also severely degrades performance due to frequent kernel-user context switching.

This work introduces ExtFUSE, a framework for developing extensible user file systems that also allows applications to register "thin" specialized request handlers in the kernel to meet their specific operative needs, while retaining the complex functionality in user space. Our evaluation with two FUSE file systems shows that ExtFUSE can improve the performance of user file systems with less than a few hundred lines on average. ExtFUSE is available on GitHub.

## FlexGroup Volumes: A Distributed WAFL File System

Ram Kesavan, Google; Jason Hennessey, Richard Jernigan, Peter Macko, Keith A. Smith, Daniel Tennant, and Bharadwaj V. R., NetApp

Available Media

The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFL® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.

In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.

## EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices

Xiang Gao, Huawei Technologies Co., Ltd.; Mingkai Dong, Shanghai Jiao Tong University; Xie Miao, Wei Du, and Chao Yu, Huawei Technologies Co., Ltd.; Haibo Chen, Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.

Available Media

Smartphones usually have limited storage and runtime memory. Compressed read-only file systems can dramatically decrease the storage used by read-only system resources. However, existing compressed read-only file systems use fixed-sized input compression, which causes significant I/O amplification and unnecessary computation. They also consume excessive runtime memory during decompression and deteriorate the performance when the runtime memory is scarce. In this paper, we describe EROFS, a new compression-friendly read-only file system that leverages fixed-sized output compression and memory-efficient decompression to achieve high performance with little extra memory overhead. We also report our experience of deploying EROFS on tens of millions of smartphones. Evaluation results show that EROFS outperforms existing compressed read-only file systems with various micro-benchmarks and reduces the boot time of real-world applications by up to 22.9% while nearly halving the storage usage.

## QZFS: QAT Accelerated Compression in File System for Application Agnostic and Cost Efficient Data Storage

Xiaokang Hu and Fuzong Wang, Shanghai Jiao Tong University, Intel Asia-Pacific R&D Ltd.; Weigang Li, Intel Asia-Pacific R&D Ltd.; Jian Li and Haibing Guan, Shanghai Jiao Tong University

Available Media

Track II

## Big-Data Programming Models & Frameworks

Session Chair: Keith Winstein, Stanford University

Grand Ballroom VII–IX

## Apache Nemo: A Framework for Building Distributed Dataflow Optimization Policies

Youngseok Yang and Jeongyoon Eo, Seoul National University; Geon-Woo Kim, Viva Republica; Joo Yeon Kim, Samsung Electronics; Sanha Lee, Naver Corp.; Jangho Seo, Won Wook Song, and Byung-Gon Chun, Seoul National University

Available Media

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance, and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation, optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario.

## Tangram: Bridging Immutable and Mutable Abstractions for Distributed Data Analytics

Yuzhen Huang, Xiao Yan, Guanxian Jiang, Tatiana Jin, James Cheng, An Xu, Zhanhao Liu, and Shuo Tu, The Chinese University of Hong Kong

Available Media

Data analytics frameworks that adopt immutable data abstraction usually provide better support for failure recovery and straggler mitigation, while those that adopt mutable data abstraction are more efficient for iterative workloads thanks to their support for in-place state updates and asynchronous execution. Most existing frameworks adopt either one of the two data abstractions and do not enjoy the benefits of the other. In this paper, we propose a novel programming model named MapUpdate, which can determine whether a distributed dataset is mutable or immutable in an application. We show that MapUpdate not only offers good expressiveness, but also allows us to enjoy the benefits of both mutable and immutable abstractions. MapUpdate naturally supports iterative and asynchronous execution, and can use different recovery strategies adaptively according to failure scenarios. We implemented MapUpdate in a system, called Tangram, with novel system designs such as lightweight local task management, partition-based progress control, and context-aware failure recovery. Extensive experiments verified the benefits of Tangram on a variety of workloads including bulk processing, graph analytics, and iterative machine learning.

## STRADS-AP: Simplifying Distributed Machine Learning Programming without Introducing a New Programming Model

Jin Kyu Kim and Abutalib Aghayev, Carnegie Mellon University; Garth A. Gibson, Carnegie Mellon University, Vector Institute, University of Toronto; Eric P. Xing, Petuum Inc, Carnegie Mellon University

Available Media

It is a daunting task for a data scientist to convert sequential code for a Machine Learning (ML) model, published by an ML researcher, to a distributed framework that runs on a cluster and operates on massive datasets. The process of fitting the sequential code to an appropriate programming model and data abstractions determined by the framework of choice requires significant engineering and cognitive effort. Furthermore, inherent constraints of frameworks sometimes lead to inefficient implementations, delivering suboptimal performance.

We show that it is possible to achieve automatic and efficient distributed parallelization of familiar sequential ML code by making a few mechanical changes to it while hiding the details of concurrency control, data partitioning, task parallelization, and fault-tolerance. To this end, we design and implement a new distributed ML framework, STRADS-Automatic Parallelization (AP), and demonstrate that it simplifies distributed ML programming significantly, while outperforming a popular data-parallel framework with a non-familiar programming model, and achieving performance comparable to an ML-specialized framework.

## SOPHIA: Online Reconfiguration of Clustered NoSQL Databases for Time-Varying Workloads

Ashraf Mahgoub, Purdue University; Paul Wood, Johns Hopkins University; Alexander Medoff, Purdue University; Subrata Mitra, Adobe Research; Folker Meyer, Argonne National Lab; Somali Chaterji and Saurabh Bagchi, Purdue University

Available Media

## Break with Refreshments

Grand Ballroom Prefunction

Track I

## Invited Presentations #1

Session Chair: Chia-Che Tsai, Texas A&M University

Grand Ballroom I–VI

## DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching

Zaoxing Liu and Zhihao Bai, Johns Hopkins University; Zhenming Liu, College of William and Mary; Xiaozhou Li, Celer Network; Changhoon Kim, Barefoot Networks; Vladimir Braverman and Xin Jin, Johns Hopkins University; Ion Stoica, UC Berkeley

Best Paper at FAST '19: Link to Paper

Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to multiple clusters, the fast cache itself would become the bottleneck. Traditional mechanisms like cache partition and cache replication either result in load imbalance between cache nodes or have high overhead for cache coherence.

We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching.

## Protocol-Aware Recovery for Consensus-Based Storage

Ramnatthan Alagappan and Aishwarya Ganesan, University of Wisconsin—Madison; Eric Lee, University of Texas at Austin; Aws Albarghouthi, University of Wisconsin—Madison; Vijay Chidambaram, University of Texas at Austin; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

Best Paper at FAST '18: Link to Paper

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.

## Orca: Differential Bug Localization in Large-Scale Services

Ranjita Bhagwan, Rahul Kumar, Chandra Sekhar Maddila, and Adithya Abraham Philip, Microsoft Research India

Best Paper at OSDI '18: Link to Paper

Today, we depend on numerous large-scale services for basic operations such as email. These services are complex and extremely dynamic as developers continously commit code and introduce new features, fixes and, consequently, new bugs. Hundreds of commits may enter deployment simultaneously. Therefore one of the most time-critical, yet complex tasks towards mitigating service disruption is to localize the bug to the right commit.

This paper presents the concept of differential bug localization that uses a combination of differential code analysis and software provenance tracking to effectively pin-point buggy commits. We have built Orca, a customized code search-engine that implements differential bug localization. Orca is actively being used by the On-Call Engineers (OCEs) of a large enterprise email and collaboration service to localize bugs to the appropriate buggy commits. Our evaluation shows that Orca correctly localizes 77% of bugs for which it has been used. We also show that it causes a 4x reduction in the work done by the OCE.

## LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation

Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University

Best Paper at OSDI '18: Link to Paper

The monolithic server model where a server is the unit of deployment, operation, and failure is meeting its limits in the face of several recent hardware and application trends. To improve heterogeneity, elasticity, resource utilization, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OSes or software systems can properly manage it. We propose a new OS model called the splitkernel to manage disaggregated systems. Splitkernel disseminates traditional OS functionalities into loosely-coupled monitors, each of which runs on and manages a hardware component. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, LegoOS cleanly separates processor, memory, and storage devices both at the hardware level and the OS level. We implemented LegoOS from scratch and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS’s performance is comparable to monolithic Linux servers, while largely improving resource packing and failure rate over monolithic clusters.

Track II

## Security #1: Kernel

Session Chair: John Criswell, University of Rochester

Grand Ballroom VII–IX

## libmpk: Software Abstraction for Intel Memory Protection Keys (Intel MPK)

Soyeon Park, Georgia Institute of Technology; Sangho Lee, Microsoft Research; Wen Xu, Georgia Institute of Technology; Hyungon Moon, Ulsan National Institute of Science and Technology; Taesoo Kim, Georgia Institute of Technology

Available Media

Intel Memory Protection Keys (MPK) is a new hardware primitive to support thread-local permission control on groups of pages without requiring modification of page tables. Unfortunately, its current hardware implementation and software support suffer from security, scalability, and semantic problems: (1) vulnerable to protection-key-use-after-free; (2) providing the limited number of protection keys; and (3) incompatible with mprotect()’s process-based permission model.

In this paper, we propose libmpk, a software abstraction for MPK. It virtualizes the hardware protection keys to eliminate the protection-key-use-after-free problem while providing accesses to an unlimited number of virtualized keys. To support legacy applications, it also provides a lazy inter-thread key synchronization. To enhance the security of MPK itself, libmpk restricts unauthorized writes to its metadata. We apply libmpk to three real-world applications: OpenSSL, JavaScript JIT compiler, and Memcached for memory protection and isolation. Our evaluation shows that it introduces negligible performance overhead (<1%) compared with the original, unprotected versions and improves performance by 8.1× compared with the secure equivalents using mprotect(). The source code of libmpk is publicly available and maintained as an open source project.

## Effective Static Analysis of Concurrency Use-After-Free Bugs in Linux Device Drivers

Jia-Ju Bai, Tsinghua University; Julia Lawall, Sorbonne Université/Inria/LIP6; Qiu-Liang Chen and Shi-Min Hu, Tsinghua University

Available Media

In Linux device drivers, use-after-free (UAF) bugs can cause system crashes and serious security problems. According to our study of Linux kernel commits, 42% of the driver commits fixing use-after-free bugs involve driver concurrency. We refer to these use-after-free bugs as concurrency use-after-free bugs. Due to the non-determinism of concurrent execution, concurrency use-after-free bugs are often more difficult to reproduce and detect than sequential use-after-free bugs.

In this paper, we propose a practical static analysis approach named DCUAF, to effectively detect concurrency use-after-free bugs in Linux device drivers. DCUAF combines a local analysis analyzing the source code of each driver with a global analysis statistically analyzing the local results of all drivers, forming a local-global analysis, to extract the pairs of driver interface functions that may be concurrently executed. Then, with these pairs, DCUAF performs a summary-based lockset analysis to detect concurrency use-after-free bugs. We have evaluated DCUAF on the driver code of Linux 4.19, and found 640 real concurrency use-after-free bugs. We have randomly selected 130 of the real bugs and reported them to Linux kernel developers, and 95 have been confirmed.

## LXDs: Towards Isolation of Kernel Subsystems

Vikram Narayanan, University of California, Irvine; Abhiram Balasubramanian, Charlie Jacobsen, Sarah Spall, Scott Bauer, and Michael Quigley, University of Utah; Aftab Hussain, Abdullah Younis, Junjie Shen, Moinak Bhattacharyya, and Anton Burtsev, University of California, Irvine

Available Media

Modern operating systems are monolithic. Today, however, lack of isolation is one of the main factors undermining security of the kernel. Inherent complexity of the kernel code and rapid development pace combined with the use of unsafe, low-level programming language results in a steady stream of errors. Even after decades of efforts to make commodity kernels more secure, i.e., development of numerous static and dynamic approaches aimed to prevent exploitation of most common errors, several hundreds of serious kernel vulnerabilities are reported every year. Unfortunately, in a monolithic kernel a single exploitable vulnerability potentially provides an attacker with access to the entire kernel.

Modern kernels need isolation as a practical means of confining the effects of exploits to individual kernel subsystems. Historically, introducing isolation in the kernel is hard. First, commodity hardware interfaces provide no support for efficient, fine-grained isolation. Second, the complexity of a modern kernel prevents a naive decomposition effort. Our work on Lightweight Execution Domains (LXDs) takes a step towards enabling isolation in a full-featured operating system kernel. LXDs allow one to take an existing kernel subsystem and run it inside an isolated domain with minimal or no modifications and with a minimal overhead. We evaluate our approach by developing isolated versions of several performance-critical device drivers in the Linux kernel.

## JumpSwitches: Restoring the Performance of Indirect Branches In the Era of Spectre

Nadav Amit, VMware Research; Fred Jacobs, VMware; Michael Wei, VMware Research

Available Media

The Spectre family of security vulnerabilities show that speculative execution attacks on modern processors are practical. Spectre variant 2 attacks exploit the speculation of indirect branches, which enable processors to execute code from arbitrary locations in memory. Retpolines serve as the state-of-the-art defense, effectively disabling speculative execution for indirect branches. While retpolines succeed in protecting against Spectre, they come with a significant penalty — 20% on some workloads.

In this paper, we describe and implement an alternative mechanism: the JumpSwitch, which enables speculative execution of indirect branches on safe targets by leveraging indirect call promotion, transforming indirect calls into direct calls. Unlike traditional inlining techniques which apply call promotion at compile time, JumpSwitches aggressively learn targets at runtime and leverages an optimized patching infrastructure to perform just-in-time promotion without the overhead of binary translation.

We designed and optimized JumpSwitches for common patterns we have observed. If a JumpSwitch cannot learn safe targets, we fall back to the safety of retpolines. JumpSwitches seamlessly integrate into Linux, and are evaluated in Linux v4.18. In addition, we have submitted patches enabling JumpSwitch upstream to the Linux community. We show that JumpSwitches can improve performance over retpolines by up to 20% for a range of workloads. In some cases,JumpSwitches even show improvement over a system without retpolines by directing speculative execution into direct calls just-in-time and reducing mispredictions.

## Poster Session and Happy Hour

Lake Washington Ballroom

Take part in discussions with your colleagues over complimentary food and drinks. Posters of all papers presented in the Technical Sessions on Wednesday and on Thursday morning will be on display. View the list of accepted posters.

## Continental Breakfast

Grand Ballroom Prefunction

## Thursday Lightning Talks

Lightning Talks Co-Chairs: Deniz Altinbuken, Google, and Aasheesh Kolli, The Pennsylvania State University and VMware Research

Grand Ballroom

Available Media

Get a sneak peek of the USENIX ATC '19 papers being presented on Thursday, July 11. This playlist includes video previews of Thursday's lightning talks.

Track I

## Invited Presentations #2

Session Chair: Amy Tai, VMware Research

Grand Ballroom I–VI

## Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly

Yatish Turakhia and Gill Bejerano, Stanford University; William J. Dally, Stanford University and NVIDIA Research

Best Paper at ASPLOS '18: Link to Paper (external site)

## Cross-dataset Time Series Anomaly Detection for Cloud Systems

Xu Zhang, Microsoft Research, Nanjing University; Qingwei Lin, Yong Xu, and Si Qin, Microsoft Research; Hongyu Zhang, The University of Newcastle; Bo Qiao, Microsoft Research; Yingnong Dang, Xinsheng Yang, Qian Cheng, Murali Chintalapati, Youjiang Wu, and Ken Hsieh, Microsoft; Kaixin Sui, Xin Meng, Yaohai Xu, and Wenchi Zhang, Microsoft Research; Furao Shen, Nanjing University; Dongmei Zhang, Microsoft Research

Available Media

In recent years, software applications are increasingly deployed as online services on cloud computing platforms. It is important to detect anomalies in cloud systems in order to maintain high service availability. However, given the velocity, volume, and diversified nature of cloud monitoring data, it is difficult to obtain sufficient labelled data to build an accurate anomaly detection model. In this paper, we propose cross-dataset anomaly detection: detect anomalies in a new unlabelled dataset (the target) by training an anomaly detection model on existing labelled datasets (the source). Our approach, called ATAD (Active Transfer Anomaly Detection), integrates both transfer learning and active learning techniques. Transfer learning is applied to transfer knowledge from the source dataset to the target dataset, and active learning is applied to determine informative labels of a small part of samples from unlabelled datasets. Through experiments, we show that ATAD is effective in cross-dataset time series anomaly detection. Furthermore, we only need to label about 1%-5% of unlabelled data and can still achieve significant performance improvement.