Selling Stuff That's Free: the Commercial Side of Free Software

Jidong Xiao; Andrew Wang; Lei Lu; Tej Chajed; Hai Huang; Taesoo Kim; Haining Wang; Pavel Emelyanov; Anish Arora; Nickolai Zeldovich; Jennifer Rexford; Laurent Vanbever; Haibing Guan; Leif Walsh; Michael A. Bender; Martin Farach-Colton; Rob Johnson; Bradley C. Kuszmaul; Donald E. Porter; Luis Oviedo; Matteo Migliavacca; Derek McAuley

USENIX ATC '16

Technical Sessions

The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
USENIX ATC '16 Full Proceedings (PDF)
USENIX ATC '16 Proceedings Interior (PDF, best for mobile devices)
USENIX ATC '16 Errata Slip (PDF)

Full Proceedings ePub (for iPad and most eReaders)
USENIX ATC '16 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
USENIX ATC '16 Full Proceedings (Mobi)

Downloads for Registered Conference Attendees

Attendee Files

USENIX ATC '16 Web Proceedings Archive (ZIP, includes attendee list)

USENIX ATC '16 Attendee List

Wednesday, June 22, 2016

7:30 am–9:00 am	Wednesday
Continental Breakfast Ballroom Foyer
8:45 am–9:00 am	Wednesday
Opening Remarks and Awards Colorado Ballroom F–J Program Co-Chairs: Ajay Gulati, ZeroStack, Inc., and Hakim Weatherspoon, Cornell University
9:00 am–10:00 am	Wednesday
Keynote Address Colorado Ballroom F–J Session Chair: Ajay Gulati, ZeroStack, Inc. The Future of Infrastructure Martin Casado, General Partner, Andreessen Horowitz
Martin Casado is a general partner at the venture capital firm Andreessen Horowitz. He was previously the cofounder and chief technology officer at Nicira, which was acquired by VMware in 2012. While at VMware, Martin served as senior vice president and general manager of the Networking and Security Business Unit. Martin started his career at Lawrence Livermore National Laboratory where he worked on large-scale simulations for the Department of Defense before moving over to work with the intelligence community on networking and cybersecurity. These experiences inspired his work at Stanford where he created the software-defined networking (SDN) movement, leading to a new paradigm of network virtualization. While at Stanford he also cofounded Illuminics Systems, an IP analytics company, which was acquired by Quova Inc. in 2006. For his work, Martin was awarded both the ACM Grace Murray Hopper award and the NEC C&C award, and he’s an inductee of the Lawrence Livermore Lab’s Entrepreneur’s Hall of Fame. He holds both a PhD and a master's degree in Computer Science from Stanford University.
10:00 am–10:30 am	Wednesday
Break with Refreshments Ballroom Foyer

10:30 am–12:10 pm

Wednesday

Datacenter Networking

Colorado Ballroom F

Session Chair: Rodrigo Fonseca, Brown University

FLICK: Developing and Running Application-Specific Network Services

Abdul Alim, Richard G. Clegg, Luo Mai, Lukas Rupprecht, and Eric Seckler, Imperial College London; Paolo Costa, Microsoft Research and Imperial College London; Peter Pietzuch and Alexander L. Wolf, Imperial College London; Nik Sultana, Jon Crowcroft, Anil Madhavapeddy, Andrew W. Moore, and Richard Mortier, University of Cambridge; Masoud Koleni, Luis Oviedo, and Derek McAuley, University of Nottingham; Matteo Migliavacca, University of Kent

Data centre networks are increasingly programmable, with application-specific network services proliferating, from custom load-balancers to middleboxes providing caching and aggregation. Developers must currently implement these services using traditional low-level APIs, which neither support natural operations on application data nor provide efficient performance isolation.

We describe FLICK, a framework for the programming and execution of application-specific network services on multi-core CPUs. Developers write network services in the FLICK language, which offers high-level processing constructs and application-relevant data types. FLICK programs are translated automatically to efficient, parallel task graphs, implemented in C++ on top of a user-space TCP stack. Task graphs have bounded resource usage at runtime, which means that the graphs of multiple services can execute concurrently without interference using cooperative scheduling. We evaluate FLICK with several services (an HTTP load-balancer, a Memcached router and a Hadoop data aggregator), showing that it achieves good performance while reducing development effort.

Available Media

SoftFlow: A Middlebox Architecture for Open vSwitch

Ethan J. Jackson, University of California, Berkeley; Melvin Walls, Penn State Harrisburg and University of California, Berkeley; Aurojit Panda, University of California, Berkeley; Justin Pettit, Ben Pfaff, and Jarno Rajahalme, VMware, Inc.; Teemu Koponen, Styra, Inc.; Scott Shenker, University of California, Berkeley, and International Computer Science Institute

Open vSwitch is a high-performance multi-layer virtual switch that serves as a flexible foundation for building virtualized, stateless Layer 2 and 3 network services in multitenant datacenters. As workloads become more sophisticated, providing tenants with virtualized middlebox services is an increasingly important and recurring theme, yet it remains difficult to integrate these stateful services efficiently into Open vSwitch and its OpenFlow forwarding model: middleboxes perform complex operations that depend on internal state and inspection of packet payloads – functionality which is impossible to express in OpenFlow. In this paper, we present SoftFlow, an extension of Open vSwitch that seamlessly integrates middlebox functionality while maintaining the familiar OpenFlow forwarding model and performing significantly better than alternative techniques for middlebox integration.

Available Media

Fast and Cautious: Leveraging Multi-path Diversity for Transport Loss Recovery in Data Centers

Guo Chen, Tsinghua University and Microsoft Research; Yuanwei Lu, University of Science and Technology of China and Microsoft Research; Yuan Meng, Tsinghua University; Bojie Li, University of Science and Technology of China and Microsoft Research; Kun Tan, Microsoft Research; Dan Pei, Tsinghua University; Peng Cheng, Layong (Larry) Luo, and Yongqiang Xiong, Microsoft Research; Xiaoliang Wang, Nanjing University; Youjian Zhao, Tsinghua University

To achieve low TCP flow completion time (FCT) in data center networks (DCNs), it is critical and challenging to rapidly recover loss without adding extra congestion. Therefore, in this paper we propose a novel loss recovery approach, FUSO, that exploits multi-path diversity in DCNs for transport loss recovery. In FUSO, when a multi-path transport sender suspects loss on one sub-flow, recovery packets are immediately sent over another sub-flow that is not lossy, or is less lossy, and has spare congestion window slots. FUSO is fast in that it does not need to wait for a timeout on the lossy sub-flow, and it is cautious in that it does not violate the congestion control algorithm. Testbed experiments and simulations show that FUSO decreases the latency-sensitive flows’ 99th percentile FCT by up to ~82.3% in a 1 Gbps testbed, and up to ~87.9% in a 10 Gbps large-scale simulated network.
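
The sub-flow selection rule stated in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Subflow` fields, the `pick_recovery_subflow` helper, and the tie-breaking order are hypothetical and are not taken from the FUSO paper or its implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Subflow:
    """Hypothetical per-sub-flow state; field names are illustrative."""
    name: str
    cwnd: int         # congestion window, in packets
    in_flight: int    # packets sent but not yet acknowledged
    loss_rate: float  # recent loss estimate in [0, 1]

    @property
    def spare_slots(self) -> int:
        # Window slots that can be used without exceeding cwnd.
        return max(0, self.cwnd - self.in_flight)


def pick_recovery_subflow(
    suspect: Subflow, subflows: Sequence[Subflow]
) -> Optional[Subflow]:
    """Pick another sub-flow that is not lossy (or less lossy) and has spare slots.

    Returns None when no sub-flow qualifies, in which case the sender
    would fall back to ordinary recovery on the suspect sub-flow.
    """
    candidates = [
        s for s in subflows
        if s is not suspect
        and s.spare_slots > 0  # "cautious": never exceed the congestion window
        and (s.loss_rate == 0.0 or s.loss_rate < suspect.loss_rate)
    ]
    if not candidates:
        return None
    # Assumed tie-break: least lossy first, then most spare window.
    return min(candidates, key=lambda s: (s.loss_rate, -s.spare_slots))
```

The check on `spare_slots` is what keeps the sketch within the congestion window of the chosen sub-flow; how the real system estimates loss and schedules recovery packets is not described in the abstract.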

Available Media

StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs

Kenichi Yasukata, Keio University; Michio Honda, Douglas Santry, and Lars Eggert, NetApp

StackMap leverages the best aspects of kernel-bypass networking into a new low-latency OS network service based on the full-featured TCP kernel implementation, by dedicating network interfaces to applications and offering an extended version of the netmap API for a zero-copy, low-overhead data path alongside a control path based on the socket API. For small-message, transactional workloads, StackMap outperforms baseline Linux by 4 to 78% in latency and 42 to 133% in throughput. It also achieves comparable performance with Seastar, a highly-optimized user-level TCP/IP stack that runs on top of DPDK.

Available Media

Industry Talks

Colorado Ballroom G–J

Session Chair: Benjamin Reed, Facebook

Opportunities and Challenges in Adopting Microservice Architectures for Enterprise Workloads

Shriram Rajagopalan, Hani Jamjoom, Tamar Eilam, and Priya Nagpurkar, IBM T. J. Watson Research Center

Available Media

Jenkins Job Builder - What? Why? How?

Chris Lee, DreamHost

Bring Your Own Dilemma: OEM Laptops and Windows 10 Issues

Mark Loveless, Duo Security, Inc.

Available Media

In-place Resumable Partial Decompression Technique and Its Application to A Real Deduplicated Storage System

Guanlin Lu, EMC Corporation

Available Media

12:10 pm–1:40 pm	Wednesday
Lunch (on your own)

1:40 pm–3:20 pm

Wednesday

File and Key-Value Systems

Colorado Ballroom F

Session Chair: Angela Demke Brown, University of Toronto

SLIK: Scalable Low-Latency Indexes for a Key-Value Store

Ankita Kejriwal, Arjun Gopalan, Ashish Gupta, Zhihao Jia, Stephen Yang, and John Ousterhout, Stanford University

Many large-scale key-value storage systems sacrifice features like secondary indexing and/or consistency in favor of scalability or performance. This limits the ease and efficiency of application development on such systems. Implementing secondary indexing in a large-scale memory based system is challenging because the goals for low latency, high scalability, consistency and high availability often conflict with each other. This paper shows how a large-scale key-value storage system can be extended to provide secondary indexes while meeting those goals. The architecture, called SLIK, enables multiple secondary indexes for each table. SLIK represents index B+ trees using objects in the underlying key-value store. It allows indexes to be partitioned and distributed independently of the data in tables while providing reasonable consistency guarantees using a lightweight ordered write approach. Our implementation of this design on RAMCloud (a main memory key-value store) performs indexed reads in 11 μs and writes in 30 μs. The architecture supports indexes spanning thousands of nodes, and provides linear scalability for throughput.

Available Media

Understanding Manycore Scalability of File Systems

Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim, Georgia Institute of Technology

We analyze the manycore scalability of five widely deployed file systems, namely, ext4, XFS, btrfs, F2FS, and tmpfs, by using our open source benchmark suite, FXMARK. FXMARK implements 19 microbenchmarks to stress specific components of each file system and includes three application benchmarks to measure the macroscopic scalability behavior. We observe that file systems are hidden scalability bottlenecks in many I/O-intensive applications even when there is no apparent contention at the application level. We found 25 scalability bottlenecks in file systems, many of which are unexpected or counterintuitive. We draw a set of observations on file system scalability behavior and unveil several core aspects of file system design that systems researchers must address.

Available Media

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

Jiacheng Zhang, Jiwu Shu, and Youyou Lu, Tsinghua University

File system designs are undergoing rapid evolution to exploit the potential of flash memory. However, the internal parallelism, a key feature of flash devices, is hard to leverage at the file system level, due to the semantic gap caused by the flash translation layer (FTL). We observe that even flash-optimized file systems have serious garbage collection problems, which lead to significant performance degradation for write-intensive workloads on multi-channel flash devices.

In this paper, we propose ParaFS to exploit the internal parallelism while ensuring efficient garbage collection. ParaFS is a log-structured file system over a simplified block-level FTL that exposes the physical layout. With the knowledge of device information, ParaFS first proposes 2-D data allocation, to maintain the hot/cold data grouping in flash memory while exploiting channel-level parallelism. ParaFS then coordinates the garbage collection in both FS and FTL levels, to make garbage collection more efficient. In addition, ParaFS schedules read/write/erase requests over multiple channels to achieve consistent performance. Evaluations show that ParaFS effectively improves system performance for write-intensive workloads by 1.6x to 3.1x, compared to the flash-optimized F2FS file system.

Available Media

FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication

Wen Xia, Huazhong University of Science and Technology and Sangfor Technologies Co., Ltd.; Yukun Zhou, Huazhong University of Science and Technology; Hong Jiang, University of Texas at Arlington; Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, and Qing Liu, Huazhong University of Science and Technology

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems in the past 15 years or so due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this paper, we propose FastCDC, a Fast and efficient CDC approach, that builds and improves on the latest Gear-based CDC approach, one of the fastest CDC methods to our knowledge. The key idea behind FastCDC is the combined use of three key techniques, namely, simplifying and enhancing the hash judgment to address our observed challenges facing Gear-based CDC, skipping sub-minimum chunk cut-point to further speed up CDC, and normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping. Our evaluation results show that, by using a combination of the three techniques, FastCDC is about 10x faster than the best of open-source Rabin-based CDC, and about 3x faster than the state-of-the-art Gear- and AE-based CDC, while achieving nearly the same deduplication ratio as the classic Rabin-based approach.
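
As a rough illustration of two of the ideas named above (a Gear-style rolling hash and skipping the sub-minimum region before looking for a cut-point), here is a minimal sketch. It is not the FastCDC algorithm itself: the chunk-size normalization step is omitted, and the constants `MIN_SIZE`, `AVG_MASK`, `MAX_SIZE`, the seeded `GEAR` table, and the function name `chunk_boundaries` are assumptions chosen for this example.

```python
import random

# Hypothetical parameters for illustration only, not values from the paper.
MIN_SIZE = 2048            # no cut-point is considered before this many bytes
AVG_MASK = (1 << 13) - 1   # a cut is expected about every 8 KiB past the minimum
MAX_SIZE = 65536           # hard upper bound on chunk size

# Gear table: one pseudo-random 64-bit value per byte value (fixed seed).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks of `data`."""
    start = 0
    n = len(data)
    while start < n:
        end = min(start + MAX_SIZE, n)
        cut = end          # default: cut at the maximum size or end of data
        fp = 0
        # Skip the sub-minimum region: hashing starts at start + MIN_SIZE.
        i = start + MIN_SIZE
        while i < end:
            # Gear hash update: one shift and one add per input byte.
            fp = ((fp << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
            if fp & AVG_MASK == 0:
                cut = i + 1
                break
            i += 1
        yield start, cut
        start = cut
```

Because the hash is only evaluated after the first `MIN_SIZE` bytes of each chunk, every chunk except possibly the last is at least `MIN_SIZE` bytes long, which is the speed-up the cut-point skipping is after; the abstract notes that this skipping lowers the deduplication ratio unless the chunk-size distribution is normalized, a step this sketch leaves out.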

Available Media

Mobile and Apps

Colorado Ballroom G–J

Session Chair: Rodrigo Fonseca, Brown University

Unsafe Time Handling in Smartphones

Abhilash Jindal, Prahlad Joshi, Y. Charlie Hu, and Samuel Midkiff, Purdue University

Time manipulation, typically done using gettime() and settime(), happens extensively across all software layers in smartphones, from the kernel, to the framework, to millions of apps. This paper presents the first study of a new class of software bugs on smartphones called sleep-induced time bugs (SITB). SITB happens when the phone is suspended, due to the aggressive sleeping policy adopted in smartphones, in the middle of a time critical section where time is being manipulated and delay caused by unexpected phone suspension alters the intended program behavior.

We first characterize time usages in the Android kernel, framework, and 978 apps into four categories and study their vulnerabilities to system suspension. Our study shows time manipulation happens extensively in all three software layers, totaling 1047, 1737 and 7798 times, respectively, and all four usage patterns are vulnerable to SITBs. We then present a tool called KLOCK, that makes use of a set of static analyses to systematically identify sleep-induced time bugs in three of the four time usage categories. When applied to five different Android Linux kernels, KLOCK correctly flagged 63 SITB-vulnerable time manipulation instances as time bugs.

Available Media

Energy Discounted Computing on Multicore Smartphones

Meng Zhu and Kai Shen, University of Rochester

Multicore processors are not energy proportional: the first running CPU core that activates shared resources incurs much higher power cost than each additional core does. On the other hand, typical smartphone applications exhibit little parallelism and therefore when one core is activated by an interactive application, computing resources at other cores are available at a deep energy discount. By non-work-conserving scheduling, we exploit energy-discounted co-run opportunities to process best-effort smartphone tasks that involve no direct user interaction (e.g., data compression / encryption for cloud backup, background sensing, and offline bytecode compilation). We show that, for optimal co-run energy discount, the best-effort processing must not elevate the overall system power state (specifically, no reduction of the multicore CPU idle state, no increase of the core frequency, and no impact on the system suspension period). In addition, we use available ARM performance counters to identify co-run resource contention on the multicore processor and throttle best-effort task when it interferes with interactivity. Experimental results on a multicore smartphone show that we can reach up to 63% energy discount in the best-effort task processing with little performance impact on the interactive applications.

Available Media

Beam: Ending Monolithic Applications for Connected Devices

Chenguang Shen, University of California, Los Angeles; Rayman Preet Singh, Samsung Research; Amar Phanishayee, Aman Kansal, and Ratul Mahajan, Microsoft Research

The proliferation of connected sensing devices (or Internet of Things) can in theory enable a range of applications that make rich inferences about users and their environment. But in practice developing such applications today is arduous because they must implement all data sensing and inference logic, even as devices move or are temporarily disconnected. We develop Beam, a framework that simplifies IoT applications by letting them specify “what should be sensed or inferred,” without worrying about “how it is sensed or inferred.” Beam introduces the key abstraction of an inference graph to decouple applications from the mechanics of sensing and drawing inferences. The inference graph allows Beam to address three important challenges: (1) device selection in heterogeneous environments, (2) efficient resource usage, and (3) handling device disconnections. Using Beam we develop two diverse applications that use several different types of devices and show that their implementations required up to 12x fewer source lines of code while resulting in up to 3x higher inference accuracy.

Available Media

Caching Doesn't Improve Mobile Web Performance (Much)

Jamshed Vesuna and Colin Scott, University of California, Berkeley; Michael Buettner and Michael Piatek, Google; Arvind Krishnamurthy, University of Washington; Scott Shenker, University of California, Berkeley, and International Computer Science Institute

A recent NSDI paper [1] reported that increasing the cache hit ratio for an HTTP proxy from 22% to 32% improved median page load time (PLT) for mobile clients by less than 2%. We argue that there are two main causes for this weak improvement: objects on the critical path are often not cached, and the limited computational power of mobile devices causes computational delays to comprise a large portion of the critical path.

Both of these factors were, in fact, outlined by a previous analysis of desktop web performance [2]. However, we (as the authors of the HTTP proxy [1]) did not properly understand the analysis and could have saved ourselves substantial engineering costs if we had. We therefore argue for the need to highlight this prior analysis, and extend the analysis to include mobile devices with slow CPUs, precise cache hit ratios, and a controlled reproduction of the HTTP proxy caching results [1]. In the extreme case of a perfect cache hit ratio, desktop page load times are improved notably by 34% compared to no caching, but mobile page load times only improve by 13% in the median case. We extract a back-of-envelope performance model from these results to help understand their underlying causes.

Available Media

3:20 pm–3:50 pm

Wednesday

Break with Refreshments

Ballroom Foyer

3:50 pm–5:30 pm Wednesday

Systems and Network Security

Colorado Ballroom F

Session Chair: Mohit Aron, Cohesity

Secure and Efficient Application Monitoring and Replication

Stijn Volckaert, University of California, Irvine, and Ghent University; Bart Coppens, Ghent University; Alexios Voulimeneas, University of California, Irvine; Andrei Homescu, Immunant, Inc.; Per Larsen, University of California, Irvine, and Immunant, Inc.; Bjorn De Sutter, Ghent University; Michael Franz, University of California, Irvine

Memory corruption vulnerabilities remain a grave threat to systems software written in C/C++. Current best practices dictate compiling programs with exploit mitigations such as stack canaries, address space layout randomization, and control-flow integrity. However, adversaries quickly find ways to circumvent such mitigations, sometimes even before these mitigations are widely deployed.

In this paper, we focus on an “orthogonal” defense that amplifies the effectiveness of traditional exploit mitigations. The key idea is to create multiple diversified replicas of a vulnerable program and then execute these replicas in lockstep on identical inputs while simultaneously monitoring their behavior. A malicious input that causes the diversified replicas to diverge in their behavior will be detected by the monitor; this allows discovery of previously unknown attacks such as zero-day exploits.

So far, such multi-variant execution environments (MVEEs) have been held back by substantial runtime overheads. This paper presents a new design, ReMon, that is non-intrusive, secure, and highly efficient. Whereas previous schemes either monitor every system call or none at all, our system enforces cross-checking only for security critical system calls while supporting more relaxed monitoring policies for system calls that are not security critical. We achieve this by splitting the monitoring and replication logic into an in-process component and a cross-process component. Our evaluation shows that ReMon offers the same level of security as conservative MVEEs and runs realistic server benchmarks at near-native speeds.
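
The idea of cross-checking only security-critical system calls across replicas can be illustrated with a small sketch. This is a simplified offline comparison of recorded call traces, not ReMon's design: a real monitor intercepts calls as the replicas run. The `SENSITIVE` policy set, the trace format, and the `first_divergence` helper are all hypothetical.

```python
from itertools import zip_longest

# Hypothetical policy: which system calls count as security critical.
SENSITIVE = {"execve", "open", "mmap", "write"}


def first_divergence(traces):
    """Return the index of the first security-critical call on which the
    replicas disagree, or None if all replicas agree.

    Each trace is a list of (syscall_name, args) tuples recorded from one
    replica. Calls outside SENSITIVE are ignored, which stands in for the
    relaxed monitoring policy applied to non-critical calls.
    """
    filtered = [[call for call in t if call[0] in SENSITIVE] for t in traces]
    # zip_longest pads shorter traces with None, so a replica that issues
    # fewer sensitive calls than the others is also reported as diverging.
    for step, calls in enumerate(zip_longest(*filtered)):
        if any(c != calls[0] for c in calls[1:]):
            return step
    return None
```

For example, two replicas that differ only in calls such as `getpid` or `gettimeofday` compare as equal under this policy, while a replica that issues an unexpected `execve` is flagged at the position of that call.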

Available Media

Blockstack: A Global Naming and Storage System Secured by Blockchains

Muneeb Ali and Jude Nelson, Princeton University and Blockstack Labs; Ryan Shea, Blockstack Labs; Michael J. Freedman, Princeton University

Blockchains like Bitcoin and Namecoin and their respective P2P networks have seen significant adoption in the past few years and show promise as naming systems with no trusted parties. Users can register human meaningful names and securely associate data with them, and only the owner of the particular private keys that registered them can write or update the name-value pair. In theory, many decentralized systems can be built using these blockchain networks, such as new, decentralized versions of DNS and PKI. As the technology is relatively new and evolving rapidly, however, little production data or experience is available to guide design tradeoffs.

In this paper, we describe our experiences operating a large deployment of a decentralized PKI service built on top of the Namecoin blockchain. We present various challenges pertaining to network reliability, throughput, and security that we needed to overcome while registering and updating over 33,000 entries and 200,000 transactions on the Namecoin blockchain. Further, we discuss how our experience informed the design of a new blockchain-based naming and storage system called Blockstack. We detail why we switched from the Namecoin network to the Bitcoin network for the new system, and present operational lessons from this migration. Blockstack is released as open source software and currently powers a production PKI system for 55,000 users.

Available Media

Satellite: Joint Analysis of CDNs and Network-Level Interference

Will Scott, Thomas Anderson, Tadayoshi Kohno, and Arvind Krishnamurthy, University of Washington

Awarded Best Student Paper

Satellite is a methodology, tool chain, and data-set for understanding global trends in website deployment and accessibility using only a single or small number of standard measurement nodes. Satellite collects information on DNS resolution and resource availability around the Internet by probing the IPv4 address space. These measurements are valuable in their breadth and sustainability - they do not require the use of a distributed measurement infrastructure, and therefore can be run at low cost and by multiple organizations. We demonstrate a clustering procedure which accurately captures the IP footprints of CDN deployments, and then show how this technique allows for more accurate determination of correct and incorrect IP resolutions. Satellite has multiple applications. It reveals the prevalence of CDNs by showing that 20% of the top 10,000 Alexa domains are hosted on shared infrastructure, and that CloudFlare alone accounts for nearly 10% of these sites. The same data-set detects 4,819 instances of ISP level DNS hijacking in 117 countries.

Available Media

Subversive-C: Abusing and Protecting Dynamic Message Dispatch

Julian Lettner, University of California, Irvine; Benjamin Kollenda, Ruhr-Universität Bochum; Andrei Homescu, Immunant, Inc.; Per Larsen, University of California, Irvine, and Immunant, Inc.; Felix Schuster, Microsoft Research; Lucas Davi and Ahmad-Reza Sadeghi, Technische Universität Darmstadt; Thorsten Holz, Ruhr-Universität Bochum; Michael Franz, University of California, Irvine

The lower layers in the modern computing infrastructure are written in languages threatened by exploitation of memory management errors. Recently deployed exploit mitigations such as control-flow integrity (CFI) can prevent traditional return-oriented programming (ROP) exploits but are much less effective against newer techniques such as Counterfeit Object-Oriented Programming (COOP) that execute a chain of C++ virtual methods. Since these methods are valid control-flow targets, COOP attacks are hard to distinguish from benign computations. Code randomization is likewise ineffective against COOP. Until now, however, COOP attacks have been limited to vulnerable C++ applications which makes it unclear whether COOP is as general and portable a threat as ROP.

This paper demonstrates the first COOP-style exploit for Objective-C, the predominant programming language on Apple’s OS X and iOS platforms. We also retrofit the Objective-C runtime with the first practical and efficient defense against our novel attack. Our defense is able to protect complex, real-world software such as iTunes without recompilation. Our performance experiments show that the overhead of our defense is low in practice.

Available Media

Industry Talk

Colorado Ballroom G–J

Session Chair: Geoff Kuenning, Harvey Mudd College

The Next Generation of Apache Hadoop: Open Problems in Distributed Storage and Resource Management

Karthik Kambatla and Andrew Wang, Cloudera, Inc.

Available Media

Best of the Rest I

Colorado Ballroom G–J

Session Chair: Geoff Kuenning, Harvey Mudd College

Optimizing Every Operation in a Write-optimized File System

Jun Yuan, Yang Zhan, William Jannen, Prashant Pandey, Amogh Akshintala, Kanchan Chandnani, and Pooja Deo, Stony Brook University; Zardosht Kasheff, Facebook; Leif Walsh, Two Sigma; Michael A. Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; Rob Johnson, Stony Brook University; Bradley C. Kuszmaul, Massachusetts Institute of Technology; Donald E. Porter, Stony Brook University

Best Paper at FAST '16: Link to Paper

File systems that employ write-optimized dictionaries (WODs) can perform random-writes, metadata updates, and recursive directory traversals orders of magnitude faster than conventional file systems. However, previous WOD-based file systems have not obtained all of these performance gains without sacrificing performance on other operations, such as file deletion, file or directory renaming, or sequential writes.

Using three techniques, late-binding journaling, zoning, and range deletion, we show that there is no fundamental trade-off in write-optimization. These dramatic improvements can be retained while matching conventional file systems on all other operations.

BetrFS 0.2 delivers order-of-magnitude better performance than conventional file systems on directory scans and small random writes and matches the performance of conventional file systems on rename, delete, and sequential I/O. For example, BetrFS 0.2 performs directory scans 2.2x faster, and small random writes over two orders of magnitude faster, than the fastest conventional file system. But unlike BetrFS 0.1, it renames and deletes files commensurate with conventional file systems and performs large sequential I/O at nearly disk bandwidth. The performance benefits of these techniques extend to applications as well. BetrFS 0.2 continues to outperform conventional file systems on many applications, such as rsync, git-diff, and tar, but improves git-clone performance by 35% over BetrFS 0.1, yielding performance comparable to other file systems.

Available Media

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca, Brown University

Presented at SOSP '15: Link to Paper

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today – logs, counters, and metrics – have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
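
One way to picture the happened-before join described above is as a join between an event recorded early in a request and a metric emitted later on the same request. The sketch below is a toy, single-process illustration only: the `Request` class, its `context` dictionary, and both tracepoint functions are hypothetical names invented here, not the Pivot Tracing API.

```python
from collections import defaultdict


class Request:
    """A request flowing through the system (hypothetical helper).

    `context` stands in for whatever per-request state travels with the
    request across component boundaries; it is an assumption of this sketch.
    """

    def __init__(self):
        self.context = {}


def client_tracepoint(request, proc_name):
    # Upstream event: record which client procedure issued the request.
    request.context["procName"] = proc_name


def datanode_tracepoint(request, bytes_read, results):
    # Downstream event: join with the upstream tuple that happened before it
    # on the same request, then group the metric by the upstream field.
    proc_name = request.context.get("procName")
    if proc_name is not None:  # no preceding upstream event means no join result
        results[proc_name] += bytes_read


results = defaultdict(int)
scan = Request()
client_tracepoint(scan, "scan")
datanode_tracepoint(scan, 100, results)
datanode_tracepoint(scan, 50, results)
print(dict(results))
```

The metric (bytes read) is defined at one component, while the grouping key comes from a different component that the request passed through earlier.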

Available Media

Environmental Conditions and Disk Reliability in Free-cooled Datacenters

Ioannis Manousakis, Rutgers University; Sriram Sankar, GoDaddy; Gregg McKnight, Microsoft; Thu D. Nguyen, Rutgers University; Ricardo Bianchini, Microsoft

Best Paper at FAST '16: Link to Paper

Free cooling lowers datacenter costs significantly, but may also expose servers to higher and more variable temperatures and relative humidities. It is currently unclear whether these environmental conditions have a significant impact on hardware component reliability. Thus, in this paper, we use data from nine hyperscale datacenters to study the impact of environmental conditions on the reliability of server hardware, with a particular focus on disk drives and free cooling. Based on this study, we derive and validate a new model of disk lifetime as a function of environmental conditions. Furthermore, we quantify the tradeoffs between energy consumption, environmental conditions, component reliability, and datacenter costs. Finally, based on our analyses and model, we derive server and datacenter design lessons.

We draw many interesting observations, including (1) relative humidity seems to have a dominant impact on component failures; (2) disk failures increase significantly when operating at high relative humidity, due to controller/adaptor malfunction; and (3) though higher relative humidity increases component failures, software availability techniques can mask them and enable free-cooled operation, resulting in significantly lower infrastructure and energy costs that far outweigh the cost of the extra component failures.

Available Media

6:30 pm–8:00 pm

Wednesday

USENIX ATC '16 Conference Reception

Note: Due to inclement weather, the ATC '16 Reception will be moved indoors to the Penrose Room, which is one level up from the sessions, across from Prospect's Urban Kitchen and Bar.

Mingle with fellow attendees at the Conference Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers.

Thursday, June 23, 2016

8:00 am–9:00 am	Thursday
Continental Breakfast Ballroom Foyer
9:00 am–10:00 am	Thursday
Keynote Address Colorado Ballroom F–J Session Chair: Hakim Weatherspoon, Cornell University A Wardrobe for the Emperor: Stitching Practical Bias into Systems Software Research Bryan Cantrill, CTO, Joyent Bryan Cantrill is the CTO at Joyent, where he oversees worldwide development of the SmartOS and SmartDataCenter platforms, and the Node.js platform. Prior to joining Joyent, Bryan served as a Distinguished Engineer at Sun Microsystems, where he spent over a decade working on system software, from the guts of the kernel to client-code on the browser. In particular, he co-designed and implemented DTrace, a facility for dynamic instrumentation of production systems that won the Wall Street Journal's top Technology Innovation Award in 2006 and the USENIX Software Tools User Group Award in 2008. Bryan also co-founded the Fishworks group at Sun, where he designed and implemented the DTrace-based analytics facility for the Sun Storage 7000 series of appliances. Bryan received the Sc.B. magna cum laude with honors in Computer Science from Brown University.
Available Media

10:00 am–10:30 am

Thursday

Break with Refreshments

Ballroom Foyer

10:30 am–12:10 pm

Thursday

Cloud, Coordination, and Consensus

Colorado Ballroom F

Session Chair: Dilma Da Silva, Texas A&M University

Callinicos: Robust Transactional Storage for Distributed Data Structures

Ricardo Padilha, Enrique Fynn, Robert Soulé, and Fernando Pedone, Università della Svizzera italiana (USI)

This paper presents Callinicos, a robust storage system with a novel transaction protocol that generalizes minitransactions. This protocol allows Callinicos to cope with Byzantine failures, support cross-partition communication with transactions, and implement on-demand contention management. We have evaluated Callinicos with a set of micro-benchmarks, and two realistic applications: a Twitter-like social network and a distributed message queue. Our experiments show that: (i) cross-partition communication improves performance by reducing the number of aborts, and (ii) the conflict resolution protocol results in no aborts in the presence of contention and no overhead in the absence of contention.

Available Media

Filo: Consolidated Consensus as a Cloud Service

Parisa Jalili Marandi, Christos Gkantsidis, Flavio Junqueira, and Dushyanth Narayanan, Microsoft Research

Consensus is at the core of many production-grade distributed systems. Given the prevalence of these systems, it is important to offer consensus as a cloud service. To match the multi-tenant requirements of the cloud, consensus as a service must provide performance guarantees, and prevent aggressive tenants from disrupting the others. Fulfilling this goal is not trivial without overprovisioning and under-utilizing resources.

We present Filo, the first system to provide consensus as a multi-tenant cloud service with throughput guarantees and efficient utilization of cloud resources. Tenants request an SLA by specifying their target throughput and degree of fault-tolerance. Filo then efficiently consolidates tenants on a shared set of servers using a novel placement algorithm that respects constraints imposed by the consensus problem. To respond to the load variations at runtime, Filo proposes a novel distributed controller that piggybacks on the consensus protocol to coordinate resource allocations across the servers and distribute the unused capacity fairly. Using a real testbed and simulations, we show that our placement algorithm is efficient at consolidating tenants, and while obtaining comparable efficiency and fairness, our distributed controller is ~5x faster than the centralized baseline approach.

Available Media

Modular Composition of Coordination Services

Kfir Lev-Ari, Technion—Israel Institute of Technology; Edward Bortnikov, Yahoo Research; Idit Keidar, Technion—Israel Institute of Technology and Yahoo Research; Alexander Shraer, Google

Coordination services like ZooKeeper, etcd, Doozer, and Consul are increasingly used by distributed applications for consistent, reliable, and high-speed coordination. When applications execute in multiple geographic regions, coordination service deployments must trade off performance (achieved by using independent services in separate regions) against consistency.

We present a system design for modular composition of services that addresses this trade-off. We implement ZooNet, a prototype of this concept over ZooKeeper. ZooNet allows users to compose multiple instances of the service in a consistent fashion, facilitating applications that execute in multiple regions. In ZooNet, clients that access only local data suffer no performance penalty compared to working with a standard single ZooKeeper. Clients that use remote and local ZooKeepers show up to 7x performance improvement compared to consistent solutions available today.

Available Media

Cheap and Available State Machine Replication

Rong Shi and Yang Wang, The Ohio State University

This paper shows that, by combining on-demand instantiation and lazy recovery, we can reduce the cost of asynchronous state machine replication protocols, such as Paxos and UpRight, while maintaining their high availability. To reduce cost, we incorporate on-demand instantiation, which activates a subset of replicas first and activates backup ones when active ones fail. To address its key limitation, namely that the system can halt for a long time while a backup replica is activated, we apply lazy recovery, allowing the system to proceed while recovering backup nodes in the background. The key contribution of this paper is to identify that, when agreement nodes and execution nodes are logically separated, each presents a unique property that enables lazy recovery. We have applied this idea to Paxos and built ThriftyPaxos, which, as shown in the evaluation, achieves higher throughput than and similar availability to standard Paxos, even though ThriftyPaxos activates fewer replicas.
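
The on-demand instantiation idea above can be sketched as simple bookkeeping over a replica group. This is a minimal illustration under stated assumptions: the `ReplicaGroup` class is hypothetical, the choice of f+1 initially active replicas out of 2f+1 is an assumption of this sketch, and lazy recovery (state transfer in the background) is not modeled at all.

```python
class ReplicaGroup:
    """Toy bookkeeping for on-demand instantiation (hypothetical helper).

    Assumes 2f+1 replicas in total, of which f+1 start active and f are
    kept as inactive backups.
    """

    def __init__(self, f):
        self.f = f
        replicas = [f"r{i}" for i in range(2 * f + 1)]
        self.active = set(replicas[: f + 1])
        self.backups = replicas[f + 1:]
        self.failed = set()

    def fail(self, replica):
        """Mark a replica as failed; activate a backup if an active one died."""
        if replica in self.backups:
            self.backups.remove(replica)
        was_active = replica in self.active
        self.active.discard(replica)
        self.failed.add(replica)
        if was_active and self.backups:
            self.active.add(self.backups.pop(0))

    def has_quorum(self):
        # A majority of 2f+1 replicas is f+1.
        return len(self.active) >= self.f + 1


group = ReplicaGroup(f=1)
group.fail("r0")
print(sorted(group.active), group.has_quorum())
```

With f = 1 the group keeps a quorum after one failure by activating its single backup, and loses it after a second failure.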

Available Media

Architectural Interaction

Colorado Ballroom G–J

Session Chair: Nisha Talagala, Parallel Machines

Horton Tables: Fast Hash Tables for In-Memory Data-Intensive Computing

Alex D. Breslow, AMD Research and University of California, San Diego; Dong Ping Zhang, Joseph L. Greathouse, and Nuwan Jayasena, AMD Research; Dean M. Tullsen, University of California, San Diego

Hash tables are important data structures that lie at the heart of important applications such as key-value stores and relational databases. Typically bucketized cuckoo hash tables (BCHTs) are used because they provide high-throughput lookups and load factors that exceed 95%. Unfortunately, this performance comes at the cost of reduced memory access efficiency. Positive lookups (key is in the table) and negative lookups (where it is not) on average access 1.5 and 2.0 buckets, respectively, which results in 50 to 100% more table-containing cache lines being accessed than is minimally necessary.

To reduce these surplus accesses, this paper presents the Horton table, a revamped BCHT that reduces the expected cost of positive and negative lookups to fewer than 1.18 and 1.06 buckets, respectively, while still achieving load factors of 95%. The key innovation is remap entries, small in-bucket records that allow (1) more elements to be hashed using a single, primary hash function, (2) items that overflow buckets to be tracked and rehashed with one of many alternate functions while maintaining a worst-case lookup cost of 2 buckets, and (3) shortening the vast majority of negative searches to 1 bucket access. With these advancements, Horton tables outperform BCHTs by 17% to 89%.
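
To make the baseline lookup cost concrete, the sketch below counts bucket probes in a simplified two-choice bucketized table. It is not a Horton table and not a full BCHT: it has no cuckoo eviction and no remap entries, and the class name, the slot count, and the SHA-256-based hash functions are all assumptions chosen for illustration.

```python
import hashlib

SLOTS_PER_BUCKET = 4  # assumed bucket width, for illustration only


def _bucket(key, seed, num_buckets):
    """Deterministic hash of (seed, key) to a bucket index."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


class TwoChoiceTable:
    """Bucketized table with two candidate buckets per key (hypothetical)."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _candidates(self, key):
        n = len(self.buckets)
        return _bucket(key, 0, n), _bucket(key, 1, n)

    def insert(self, key):
        for b in self._candidates(key):
            if len(self.buckets[b]) < SLOTS_PER_BUCKET:
                self.buckets[b].append(key)
                return True
        return False  # a real BCHT would evict and relocate an item here

    def lookup(self, key):
        """Return (found, buckets_probed)."""
        probed = 0
        for b in self._candidates(key):
            probed += 1
            if key in self.buckets[b]:
                return True, probed
        return False, probed


table = TwoChoiceTable(64)
table.insert("k1")
print(table.lookup("k1"), table.lookup("missing"))
```

In this layout a negative lookup has to probe both candidate buckets before it can report a miss, which is the cost that the remap entries described above are meant to avoid in most cases.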

Available Media

Ginseng: Market-Driven LLC Allocation

Liran Funaro, Orna Agmon Ben-Yehuda, and Assaf Schuster, Technion—Israel Institute of Technology

Cloud providers must dynamically allocate their physical resources to the right client to maximize the benefit that they can get out of given hardware. Cache Allocation Technology (CAT) makes it possible for the provider to allocate last level cache to virtual machines to prevent cache pollution. The provider can also allocate the cache to optimize client benefit. But how should it optimize client benefit, when it does not even know what the client plans to do?

We present an auction-based mechanism that dynamically allocates cache while optimizing client benefit and improving hardware utilization. We evaluate our mechanism on benchmarks from the Phoronix Test Suite. Experimental results show that Ginseng for cache allocation improved clients’ aggregated benefit by up to 42.8x compared with state-of-the-art static and dynamic algorithms.

Available Media

Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading

Xi Yang and Stephen M. Blackburn, Australian National University; Kathryn S. McKinley, Microsoft Research

Web services from search to games to stock trading impose strict Service Level Objectives (SLOs) on tail latency. Meeting these objectives is challenging because the computational demand of each request is highly variable and load is bursty. Consequently, many servers run at low utilization (10 to 45%); turn off simultaneous multithreading (SMT); and execute only a single service—wasting hardware, energy, and money. Although co-running batch jobs with latency-critical requests to utilize multiple SMT hardware contexts (lanes) is appealing, unmitigated sharing of core resources induces non-linear effects on tail latency and SLO violations.

We introduce principled borrowing to control SMT hardware execution in which batch threads borrow core resources. A batch thread executes in a reserved batch SMT lane when no latency-critical thread is executing in the partner request lane. We instrument batch threads to quickly detect execution in the request lane, step out of the way, and promptly return the borrowed resources. We introduce the nanonap system call to stop the batch thread’s execution without yielding its lane to the OS scheduler, ensuring that requests have exclusive use of the core’s resources. We evaluate our approach for colocating batch workloads with latency-critical requests using the Apache Lucene search engine. A conservative policy that executes batch threads only when the request lane is idle improves utilization between 90% and 25% on one core depending on load, without compromising request SLOs. Our approach is straightforward, robust, and unobtrusive, opening the way to substantially improved resource utilization in datacenters running latency-critical workloads.

Available Media

Coherence Stalls or Latency Tolerance: Informed CPU Scheduling for Socket and Core Sharing

Sharanyan Srikanthan, Sandhya Dwarkadas, and Kai Shen, University of Rochester

The efficiency of modern multiprogrammed multicore machines is heavily impacted by traffic due to data sharing and contention due to competition for shared resources. In this paper, we demonstrate the importance of identifying latency tolerance coupled with instruction-level parallelism on the benefits of colocating threads on the same socket or physical core for parallel efficiency. By adding hardware-counted CPU stall cycles due to cache misses to the measured statistics, we show that it is possible to infer latency tolerance at low cost. We develop and evaluate SAM-MPH, a multicore CPU scheduler that combines information on sources of traffic with tolerance for latency and need for computational resources. We also show the benefits of using a history of past intervals to introduce hysteresis when making mapping decisions, thereby avoiding oscillatory mappings and transient migrations that would impact performance. Experiments with a broad range of multiprogrammed parallel, graph processing, and data management workloads on 40-CPU and 80-CPU machines show that SAM-MPH obtains ideal performance for standalone applications and improves performance by up to 61% over the default Linux scheduler for mixed workloads.

Available Media

12:10 pm–1:40 pm

Thursday

Conference Luncheon

Colorado Ballroom E

1:40 pm–3:20 pm

Thursday

Caching and Indexing

Colorado Ballroom F

Session Chair: Jon Howell, Google

Replex: A Scalable, Highly Available Multi-Index Data Store

Amy Tai, VMware Research and Princeton University; Michael Wei, VMware Research and University of California, San Diego; Michael J. Freedman, Princeton University; Ittai Abraham and Dahlia Malkhi, VMware Research

Awarded Best Paper

The need for scalable, high-performance datastores has led to the development of NoSQL databases, which achieve scalability by partitioning data over a single key. However, programmers often need to query data with other keys, which data stores provide by either querying every partition, eliminating the benefits of partitioning, or replicating additional indexes, wasting the benefits of data replication.

In this paper, we show there is no need to compromise scalability for functionality. We present Replex, a datastore that enables efficient querying on multiple keys by rethinking data placement during replication. Traditionally, a data store is first globally partitioned, then each partition is replicated identically to multiple nodes. Instead, Replex relies on a novel replication unit, termed replex, which partitions a full copy of the data based on its unique key. Replexes eliminate any additional overhead to maintaining indices, at the cost of increasing recovery complexity. To address this issue, we also introduce hybrid replexes, which enable a rich design space for trading off steady-state performance with faster recovery. We build, parameterize, and evaluate Replex on multiple dimensions and find that Replex surpasses the steady-state and failure recovery performance of HyperDex, a state-of-the-art multi-key data store.
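
The replex idea, one full copy of the data per key with each copy partitioned by its own key, can be sketched in a few lines. This is a toy in-memory illustration only: `Replex` and `ToyStore` are hypothetical classes, hash partitioning is an assumption, and replication protocol details, recovery, and hybrid replexes are not modeled.

```python
class Replex:
    """One full copy of the table, hash-partitioned by a single column."""

    def __init__(self, key_column, num_partitions):
        self.key_column = key_column
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition_of(self, value):
        return hash(value) % len(self.partitions)

    def insert(self, row):
        self.partitions[self._partition_of(row[self.key_column])].append(row)

    def lookup(self, value):
        # Only the single partition that owns `value` is consulted.
        part = self.partitions[self._partition_of(value)]
        return [r for r in part if r[self.key_column] == value]


class ToyStore:
    """Keeps one replex per queryable column (hypothetical helper)."""

    def __init__(self, key_columns, num_partitions=4):
        self.replexes = {c: Replex(c, num_partitions) for c in key_columns}

    def insert(self, row):
        # Every replex receives its own copy of the row, so each replex
        # doubles as a replica and as an index on its column.
        for replex in self.replexes.values():
            replex.insert(dict(row))

    def query(self, column, value):
        return self.replexes[column].lookup(value)


store = ToyStore(["id", "city"])
store.insert({"id": 1, "city": "Oslo"})
store.insert({"id": 2, "city": "Lima"})
print(store.query("city", "Oslo"))
```

A query on either column touches one partition of the matching replex instead of every partition of a single-key layout.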

Available Media

Kinetic Modeling of Data Eviction in Cache

Xiameng Hu, Xiaolin Wang, Lan Zhou, Yingwei Luo, Peking University; Chen Ding, University of Rochester; Zhenlin Wang, Michigan Technological University

The reuse distance (LRU stack distance) is an essential metric for performance prediction and optimization of storage and CPU cache. Over the last four decades, there have been steady improvements in the algorithmic efficiency of reuse distance measurement. This progress has accelerated in recent years, both in theory and in practical implementation.

In this paper, we present a kinetic model of LRU cache memory, based on the average eviction time (AET) of the cached data. The AET model enables fast measurement and low-cost sampling. It can produce the miss ratio curve (MRC) in linear time with extremely low space costs. On both CPU and storage benchmarks, AET reduces the time and space costs compared to previous techniques. Furthermore, AET is a composable model that can characterize shared cache behavior through modeling individual programs.
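
For reference, the reuse distance and miss ratio curve mentioned above can be computed exactly, if slowly, by simulating an LRU stack. The sketch below is that naive exact baseline, not the AET model; the function names are chosen here for illustration.

```python
from collections import Counter


def reuse_distances(trace):
    """Return the LRU stack distance of each access (None for a first access)."""
    stack = []  # least recently used first, most recently used last
    distances = []
    for item in trace:
        if item in stack:
            pos = stack.index(item)
            # Depth of the item counted from the most recently used end (1-based).
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(item)
    return distances


def miss_ratio_curve(trace, max_size):
    """Miss ratio of a fully associative LRU cache for sizes 1..max_size."""
    hist = Counter(reuse_distances(trace))
    total = len(trace)
    curve = []
    hits = 0
    for size in range(1, max_size + 1):
        # An access hits in a cache of `size` entries iff its distance <= size.
        hits += hist.get(size, 0)
        curve.append((total - hits) / total)
    return curve


print(miss_ratio_curve(list("abcabc"), 3))
```

This simulation takes time proportional to the trace length times the number of distinct items, which is the kind of cost that sampling-based models aim to avoid.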

Available Media

Scalable In-Memory Transaction Processing with HTM

Yingjun Wu and Kian-Lee Tan, National University of Singapore

We propose a new HTM-assisted concurrency control protocol, called HTCC, that achieves high scalability and robustness when processing OLTP workloads. HTCC attains its goal using a two-pronged strategy that exploits the strengths of HTM. First, it distinguishes between hot and cold records, and deals with each type differently – while accesses to highly contended data are protected using conventional fine-grained locks, accesses to cold data are HTM-guarded. This remarkably reduces the database transaction abort rate and exploits HTM’s effectiveness in executing low-contention critical sections. Second, to minimize the overhead inherited from successive restarts of aborted database transactions, HTCC caches the internal execution states of a transaction for performing delta-restoration, which partially updates the maintained read/write set and bypasses redundant index lookups during transaction re-execution at best effort. This approach is greatly facilitated by HTM’s speedy hardware mechanism for ensuring atomicity and isolation. We evaluated HTCC in a main-memory database prototype running on a 4-socket machine (40 cores in total), and confirmed that HTCC can scale near-linearly, yielding a high transaction rate even under highly contended workloads.

Available Media

Erasing Belady’s Limitations: In Search of Flash Cache Offline Optimality

Yue Cheng, Virginia Polytechnic Institute and State University; Fred Douglis, Philip Shilane, Michael Trachtman, and Grant Wallace, EMC Corporation; Peter Desnoyers, Northeastern University; Kai Li, Princeton University

NAND-based solid-state (flash) drives are known for providing better performance than magnetic disk drives, but they have limits on endurance, the number of times data can be erased and overwritten. Furthermore, the unit of erasure can be many times larger than the basic unit of I/O; this leads to complexity with respect to consolidating live data and erasing obsolete data. When flash drives are used as a cache for a larger, disk-based storage system, the choice of a cache replacement algorithm can make a significant difference in both performance and endurance. While there are many cache replacement algorithms, their effectiveness is hard to judge due to the lack of a baseline against which to compare them: Belady’s MIN, the usual offline best-case algorithm, considers read hit ratio but not endurance.
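
As a reminder of what the MIN baseline does, the sketch below simulates Belady's MIN on a trace of block IDs: on a miss with a full cache, it evicts the block whose next access lies farthest in the future. The function name and the sample trace are chosen here for illustration, and the sketch counts only hits and misses, which is exactly the limitation noted above: it says nothing about erasures.

```python
def belady_min(trace, capacity):
    """Simulate Belady's MIN on a trace of block IDs; return (hits, misses)."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    # For each position, find the next position at which the same block recurs.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block -> position of its next access
    hits = misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
        else:
            misses += 1
            if len(cache) >= capacity:
                # Evict the block that will be needed farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = next_use[i]
    return hits, misses


print(belady_min([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], capacity=3))
```

Every miss in this simulation also inserts a block into the cache, so on flash each miss would imply a write; the hit-ratio-optimal policy does nothing to limit those writes.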

We explore offline algorithms for flash caching in terms of both hit ratio and flash lifespan. We design and implement a multi-stage heuristic by synthesizing several techniques that manage data at the granularity of a flash erasure unit (which we call a container) to approximate the offline optimal algorithm. We find that simple techniques contribute most of the available erasure savings. Our evaluation shows that the container-optimized offline heuristic is able to provide the same optimal read hit ratio as MIN with 67% fewer flash erasures. More fundamentally, our investigation provides a useful approximate baseline for evaluating any online algorithm, highlighting the importance of comparing new policies for caching compound blocks in flash.

Available Media

Energy vs. Performance

Colorado Ballroom G–J

Session Chair: Benjamin Reed, Facebook

Unlocking Energy

Babak Falsafi, Rachid Guerraoui, Javier Picorel, and Vasileios Trigonakis, École Polytechnique Fédérale de Lausanne (EPFL)

Locks are a natural place for improving the energy efficiency of software systems. First, concurrent systems are mainstream and when their threads synchronize, they typically do it with locks. Second, locks are well-defined abstractions, hence changing the algorithm implementing them can be achieved without modifying the system. Third, some locking strategies consume more power than others, thus the strategy choice can have a real effect. Last but not least, as we show in this paper, improving the energy efficiency of locks goes hand in hand with improving their throughput. It is a win-win situation.

We make our case for this throughput/energy-efficiency correlation through a series of observations obtained from an exhaustive analysis of the energy efficiency of locks on two modern processors and six software systems: Memcached, MySQL, SQLite, RocksDB, HamsterDB, and Kyoto Cabinet. We propose simple lock-based techniques for improving the energy efficiency of these systems by 33% on average, driven by higher throughput, and without modifying the systems.

Available Media

Greening the Video Transcoding Service with Low-Cost Hardware Transcoders

Peng Liu, University of Wisconsin—Madison; Jongwon Yoon, Hanyang University; Lance Johnson, University of Minnesota; Suman Banerjee, University of Wisconsin—Madison

Video transcoding plays a critical role in a video streaming service. Content owners and publishers need video transcoders to adapt their videos to different formats, bitrates, and qualities before streaming them to end users with the best quality of service. In this paper, we report our experience developing and deploying VideoCoreCluster, a low-cost, highly efficient video transcoder cluster for live video streaming services. We implemented the video transcoder cluster with low-cost single board computers, specifically the Raspberry Pi Model B. The quality of the transcoded video delivered by our cluster is comparable with the best open source software-based video transcoder, and our video transcoders consume much less energy. We designed a scheduling algorithm based on priority and capacity so that the cluster manager can leverage the characteristics of adaptive bitrate video streaming technologies to provide a reliable and scalable service for the video streaming infrastructure. We have replaced the software-based transcoders for some TV channels in a live TV streaming service deployment on our university campus with this cluster.

Available Media

MEANTIME: Achieving Both Minimal Energy and Timeliness with Approximate Computing

Anne Farrell and Henry Hoffmann, University of Chicago

Energy efficiency and timeliness (i.e., predictable job latency) are two essential – yet opposing – concerns for embedded systems. Hard timing guarantees require conservative resource allocation while energy minimization requires aggressively releasing resources and occasionally violating timing constraints. Recent work on approximate computing, however, opens up a new dimension of optimization: application accuracy. In this paper, we use approximate computing to achieve both hard timing guarantees and energy efficiency. Specifically, we propose MEANTIME: a runtime system that delivers hard latency guarantees and energy-minimal resource usage through small accuracy reductions. We test MEANTIME on a real Linux/ARM system with six applications. Overall, we find that MEANTIME never violates real-time deadlines and sacrifices a small amount (typically less than 2%) of accuracy while reducing energy to 54% of a conservative, full accuracy approach.

Available Media

3:20 pm–3:50 pm

Thursday

Break with Refreshments

Ballroom Foyer

3:50 pm–5:30 pm

Thursday

Network Design and Usage Studies

Colorado Ballroom F

Session Chair: Fred Douglis, EMC

Design Guidelines for High Performance RDMA Systems

Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David G. Andersen, Carnegie Mellon University

Awarded Best Student Paper

Modern RDMA hardware offers the potential for exceptional performance, but design choices including which RDMA operations to use and how to use them significantly affect observed performance. This paper lays out guidelines that can be used by system designers to navigate the RDMA design space. Our guidelines emphasize paying attention to low-level details such as individual PCIe transactions and NIC architecture. We empirically demonstrate how these guidelines can be used to improve the performance of RDMA-based systems: we design a networked sequencer that outperforms an existing design by 50x, and improve the CPU efficiency of a prior high-performance key-value store by 83%. We also present and evaluate several new RDMA optimizations and pitfalls, and discuss how they affect the design of RDMA systems.

Available Media

Balancing CPU and Network in the Cell Distributed B-Tree Store

Christopher Mitchell, Kate Montgomery, and Lamont Nelson, New York University; Siddhartha Sen, Microsoft Research; Jinyang Li, New York University

In traditional client-server designs, all requests are processed at the server storing the state, thereby maintaining strict locality between computation and state. The adoption of RDMA (Remote Direct Memory Access) makes it practical to relax locality by letting clients fetch server state and process requests themselves. Such client-side processing improves performance when the server CPU, instead of the network, is the bottleneck. We observe that combining server-side and client-side processing allows systems to balance and adapt to the available CPU and network resources with minimal configuration, and can free resources for other CPU-intensive work.

We present Cell, a distributed B-tree store that combines client-side and server-side processing. Cell distributes a global B-tree of “fat” (64MB) nodes across machines for server-side searches. Within each fat node, Cell organizes keys as a local B-tree of RDMA-friendly small nodes for client-side searches. Cell clients dynamically select whether to use client-side or server-side processing in response to available resources and the current workload. Our evaluation on a large RDMA-capable cluster shows that Cell scales well and that its dynamic selector effectively responds to resource availability and workload properties.

Available Media

An Evolutionary Study of Linux Memory Management for Fun and Profit

Jian Huang, Moinuddin K. Qureshi, and Karsten Schwan, Georgia Institute of Technology

We present a comprehensive and quantitative study on the development of the Linux memory manager. The study examines 4587 committed patches over the last five years (2009-2015) since Linux version 2.6.32. Insights derived from this study concern the development process of the virtual memory system, including its patch distribution and patterns, and techniques for memory optimizations and semantics. Specifically, we find that changes to the memory manager are highly concentrated around key functionalities, such as the memory allocator, page fault handler, and memory resource controller. Unexpectedly, the well-developed memory manager still suffers from an increasing number of bugs. The memory optimizations mainly focus on data structures, memory policies, and the fast path. To the best of our knowledge, this is the first such study on the virtual memory system.

Available Media

Getting Back Up: Understanding How Enterprise Data Backups Fail

George Amvrosiadis, University of Toronto; Medha Bhadkamkar, Veritas Labs

In the enterprise world, retaining data backups is the de facto solution against data loss in the event of catastrophic failures. As backup software evolves to achieve faster backup and recovery times, however, backup systems deploying it become increasingly complex to administer. This complexity stems from optimizations targeted to specific applications, which increase the number of configuration parameters for the system. Still, there is no work in the literature that attempts to study the error characteristics of enterprise backup systems, despite our reliance on the guarantees they provide.

With this study we aim to help researchers and practitioners understand how backup system jobs fail, and identify factors that can be used to predict these failures. Our results are derived from an analysis of data on 775 million jobs, collected from more than 20,000 backup software installations over a span of 3 years. We confirm that trends reported in the software reliability literature also hold for backup systems, such as that the majority of job errors are due to misconfigurations. For the systems in our dataset, we find that error rates remain stable across software versions and over time. To better understand these errors, we investigate the effect of several factors on the system’s error rate, such as job sizes and policy complexity, and demonstrate their predictive power for future errors.

Available Media

Best of the Rest II

Colorado Ballroom G–J

Session Chair: Peter Pietzuch, Imperial College London

Using Crash Hoare Logic for Certifying the FSCQ File System

Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich, MIT CSAIL

Presented at SOSP '15: Link to Paper

FSCQ is the first file system with a machine-checkable proof (using the Coq proof assistant) that its implementation meets its specification and whose specification includes crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file system correctly without losing data.

To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ running as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.

Available Media

COZ: Finding Code that Counts with Causal Profiling

Charlie Curtsinger, Grinnell College; Emery D. Berger, University of Massachusetts Amherst

Presented at SOSP '15: Link to Paper

Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities.

This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus “virtually” speeding it up.

We present COZ, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. COZ identifies previously unknown optimization opportunities that are both significant and targeted. Guided by COZ, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.
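The virtual speedup idea described in this abstract can be illustrated with a toy calculation. The sketch below is not COZ's implementation; it assumes a hypothetical program with two concurrent threads, where thread A spends `line_time` seconds in the profiled line plus `rest_a` seconds elsewhere, and thread B does `work_b` seconds of independent work. All function and parameter names are illustrative assumptions.

```python
def actual_runtime(line_time, rest_a, work_b, speedup):
    """Runtime if the profiled line really ran `speedup` (0..1) faster."""
    return max(line_time * (1 - speedup) + rest_a, work_b)


def virtual_runtime(line_time, rest_a, work_b, speedup):
    """Runtime estimated by pausing the other thread instead of speeding up the line."""
    delay = line_time * speedup                      # total pause inserted into thread B
    measured = max(line_time + rest_a, work_b + delay)
    return measured - delay                          # subtract the inserted pauses


# Toy workload (assumed numbers): 4 s in the line, 2 s elsewhere in A, 5 s in B.
for s in (0.0, 0.25, 0.5, 1.0):
    a = actual_runtime(4.0, 2.0, 5.0, s)
    v = virtual_runtime(4.0, 2.0, 5.0, s)
    print(f"speedup={s:.2f} actual={a:.2f} virtual={v:.2f}")
```

In this toy model the two estimates agree for every speedup, and speedups beyond 25% stop helping because thread B becomes the bottleneck, which is the kind of result an ordinary time-based profile would not reveal.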

Available Media

All Your Biases Belong to Us: Breaking RC4 in WPA-TKIP and TLS

Mathy Vanhoef and Frank Piessens, Katholieke Universiteit Leuven

Best Student Paper at USENIX Security '15: Link to Paper

We present new biases in RC4, break the Wi-Fi Protected Access Temporal Key Integrity Protocol (WPA-TKIP), and design a practical plaintext recovery attack against the Transport Layer Security (TLS) protocol. To empirically find new biases in the RC4 keystream we use statistical hypothesis tests. This reveals many new biases in the initial keystream bytes, as well as several new long-term biases. Our fixed-plaintext recovery algorithms are capable of using multiple types of biases, and return a list of plaintext candidates in decreasing likelihood. To break WPA-TKIP we introduce a method to generate a large number of identical packets. This packet is decrypted by generating its plaintext candidate list, and using redundant packet structure to prune bad candidates. From the decrypted packet we derive the TKIP MIC key, which can be used to inject and decrypt packets. In practice the attack can be executed within an hour. We also attack TLS as used by HTTPS, where we show how to decrypt a secure cookie with a success rate of 94% using 9·2²⁷ ciphertexts. This is done by injecting known data around the cookie, abusing this using Mantin’s ABSAB bias, and brute-forcing the cookie by traversing the plaintext candidates. Using our traffic generation technique, we are able to execute the attack in merely 75 hours.
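As a small illustration of what a keystream bias looks like, the sketch below measures a long-known one rather than any of the new biases from this paper: the second RC4 output byte is zero about twice as often as a uniform byte would be (the Mantin-Shamir bias). It uses a plain frequency count instead of the hypothesis tests the abstract mentions; the sample size, key length, and seed are arbitrary assumptions.

```python
import random


def rc4_keystream(key: bytes, n: int) -> bytes:
    """Return the first n keystream bytes of RC4 for the given key."""
    S = list(range(256))
    j = 0
    for i in range(256):                       # key-scheduling algorithm (KSA)
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):                         # pseudo-random generation (PRGA)
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)


def second_byte_zero_rate(samples: int, seed: int = 0) -> float:
    """Fraction of random 16-byte keys whose second keystream byte is 0."""
    rng = random.Random(seed)
    zeros = 0
    for _ in range(samples):
        key = bytes(rng.randrange(256) for _ in range(16))
        if rc4_keystream(key, 2)[1] == 0:
            zeros += 1
    return zeros / samples


rate = second_byte_zero_rate(20000)
print(f"P(Z2 = 0) ~ {rate:.5f}; a uniform byte would give {1 / 256:.5f}")
```

An attacker who sees the same plaintext byte encrypted under many keys can exploit such a skew by voting for the most likely plaintext value, which is the general principle behind the candidate lists described above.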

Available Media

Under-Constrained Symbolic Execution: Correctness Checking for Real Code

David A. Ramos and Dawson Engler, Stanford University

Best Paper at USENIX Security '15: Link to Paper

Software bugs are a well-known source of security vulnerabilities. One technique for finding bugs, symbolic execution, considers all possible inputs to a program but suffers from scalability limitations. This paper uses a variant, under-constrained symbolic execution, that improves scalability by directly checking individual functions, rather than whole programs. We present UC-KLEE, a novel, scalable framework for checking C/C++ systems code, along with two use cases. First, we use UC-KLEE to check whether patches introduce crashes. We check over 800 patches from BIND and OpenSSL and find 12 bugs, including two OpenSSL denial-of-service vulnerabilities. We also verify (with caveats) that 115 patches do not introduce crashes. Second, we use UC-KLEE as a generalized checking framework and implement checkers to find memory leaks, uninitialized data, and unsafe user input. We evaluate the checkers on over 20,000 functions from BIND, OpenSSL, and the Linux kernel, find 67 bugs, and verify that hundreds of functions are leak free and that thousands of functions do not access uninitialized data.

Available Media

6:30 pm–8:00 pm

Thursday

Poster Session and Happy Hour

Colorado Ballroom A–E

Check out the cool new ideas and the latest preliminary work on display at the Poster Session and Happy Hour. Take advantage of an opportunity to mingle with colleagues who may be interested in the same area while enjoying complimentary food and drinks. The list of accepted posters is now available.

Friday, June 24, 2016

8:00 am–9:00 am

Friday

Continental Breakfast

Ballroom Foyer

9:00 am–10:40 am

Friday

Data Is Now Big Data

Colorado Ballroom F

Session Chair: Peter Pietzuch, Imperial College London

SplitJoin: A Scalable, Low-latency Stream Join Architecture with Adjustable Ordering Precision

Mohammadreza Najafi, Technische Universität München; Mohammad Sadoghi, IBM T. J. Watson Research Center; Hans-Arno Jacobsen, Middleware Systems Research Group

Paper presented by Kaiwen Zhang, Technische Universität München

There is rising interest in accelerating stream processing through modern parallel hardware, yet it remains a challenge to exploit the available resources to achieve higher throughput without sacrificing latency, given the increased length of the processing pipeline and communication path and the need for central coordination. To achieve these objectives, we introduce a novel top-down data flow model for stream join processing (arguably one of the most resource-intensive operators in stream processing), called SplitJoin, that operates by splitting the join operation into independent storing and processing steps that gracefully scale with respect to the number of cores. Furthermore, SplitJoin eliminates the need for global coordination while preserving the order of input streams by rethinking how streams are channeled into distributed join computation cores, and it maintains the order of output streams through a novel distributed punctuation technique. Throughout our experimental analysis, SplitJoin offered up to a 60% improvement in throughput while reducing latency by up to 3.3X compared to state-of-the-art solutions.

Available Media

Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing

Keval Vora, University of California, Riverside; Guoqing Xu, University of California, Irvine; Rajiv Gupta, University of California, Riverside

Single-PC, disk-based processing of big graphs has recently gained much popularity. At the core of an efficient disk-based system is a well-designed partition structure that can minimize random disk accesses. All existing systems use static partitions that are created before processing starts. These partitions have static layouts and are loaded entirely into memory in every single iteration even though much of the edge data is not changed across many iterations, causing these unchanged edges to have zero new impact on the computation of vertex values.

This work provides a general optimization that removes this I/O inefficiency by employing dynamic partitions whose layouts are dynamically adjustable. Our implementation of this optimization in GraphChi, a representative out-of-core vertex-centric graph system, yielded speedups of 1.5–2.8× on six large graphs. Our idea is generally applicable to other systems as well.

Available Media

Version Traveler: Fast and Memory-Efficient Version Switching in Graph Processing Systems

Xiaoen Ju, University of Michigan; Dan Williams and Hani Jamjoom, IBM T. J. Watson Research Center; Kang G. Shin, University of Michigan

Multi-version graph processing, where each version corresponds to a snapshot of an evolving graph, is a common scenario in large-scale graph processing. Straightforward application of existing graph processing systems often yields suboptimal performance due to high version-switching cost. We present Version Traveler (VT), a graph processing system featuring fast and memory-efficient version switching. VT achieves fast version switching by (i) representing differences among versions as deltas and (ii) constructing the next version by integrating the in-memory graph representation of the current version with the delta(s) relating the two versions. Furthermore, VT maintains high computation performance and memory compactness. Our evaluation using multi-version processing workloads with realistic datasets shows that VT outperforms PowerGraph, running 23x faster with a 15% memory overhead. VT is also superior to four multi-version processing systems, achieving up to 90% improvement when jointly considering processing time and resource consumption.
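The delta-based switching in points (i) and (ii) can be sketched roughly as follows. This is only an illustration under simple assumptions (a graph stored as a set of directed edges, a delta stored as added and removed edge sets); `Delta`, `diff`, and `apply_delta` are hypothetical names, and VT's actual in-memory representation is not described in this abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    """Edges to add and remove when moving from one version to another."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)

    def inverse(self) -> "Delta":
        """Delta that travels in the opposite direction."""
        return Delta(added=set(self.removed), removed=set(self.added))


def diff(old: set, new: set) -> Delta:
    """Compute the delta relating two edge-set versions."""
    return Delta(added=new - old, removed=old - new)


def apply_delta(edges: set, delta: Delta) -> None:
    """Mutate the in-memory edge set in place to reach the target version."""
    edges.difference_update(delta.removed)
    edges.update(delta.added)


v1 = {(1, 2), (2, 3), (3, 1)}
v2 = {(1, 2), (2, 3), (3, 4)}
d12 = diff(v1, v2)

graph = set(v1)                   # load version 1 once
apply_delta(graph, d12)           # switch to version 2 without reloading
switched_to_v2 = graph == v2
apply_delta(graph, d12.inverse()) # travel back to version 1
switched_back = graph == v1
print(switched_to_v2, switched_back)
```

The point of the design is that a switch touches only the edges that differ between versions, so its cost scales with the size of the delta and not with the size of the graph.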

Available Media

Tucana: Design and Implementation of a Fast and Efficient Scale-up Key-value Store

Anastasios Papagiannis, Foundation of Research and Technology-Hellas (FORTH) and University of Crete; Giorgos Saloustros, Foundation of Research and Technology-Hellas (FORTH); Pilar González-Férez, Foundation of Research and Technology-Hellas (FORTH) and University of Murcia; Angelos Bilas, Foundation of Research and Technology-Hellas (FORTH) and University of Crete

Given current technology trends towards fast storage devices and the need for increasing data processing density, it is important to examine key-value store designs that reduce CPU overhead. However, current key-value stores are still designed mostly for hard disk drives (HDDs) that exhibit a large difference between sequential and random access performance, and they incur high CPU overheads.

In this paper we present Tucana, a feature-rich key-value store that achieves low CPU overhead. Our design starts from a Bε-tree approach to maintain asymptotic properties for inserts and uses three techniques to reduce overheads: copy-on-write, private allocation, and direct device management. In our design we favor choices that reduce overheads compared to sequential device accesses and large I/Os.

We evaluate our approach against RocksDB, a state-of-the-art key-value store, and show that our approach improves CPU efficiency by up to 9.2x and an average of 6x across all workloads we examine. In addition, Tucana improves throughput compared to RocksDB by up to 7x. Then, we use Tucana to replace the storage engine of HBase and compare it to native HBase and Cassandra, two of the most popular NoSQL stores. Our results show that Tucana outperforms HBase by up to 8x in CPU efficiency and by up to 10x in throughput. Tucana’s improvements are even higher when compared to Cassandra.

Available Media

Virtualization

Colorado Ballroom G–J

Session Chair: Dilma Da Silva, Texas A&M University

Samsara: Efficient Deterministic Replay in Multiprocessor Environments with Hardware Virtualization Extensions

Shiru Ren, Le Tan, Chunqi Li, and Zhen Xiao, Peking University; Weijia Song, Cornell University

Deterministic replay, which provides the ability to travel backward in time and reconstruct the past execution flow of a multiprocessor system, has many prominent applications. Prior research in this area can be classified into two categories: hardware-only schemes and software-only schemes. While hardware-only schemes deliver high performance, they require significant modifications to the existing hardware which makes them difficult to deploy in real systems. In contrast, software-only schemes work on commodity hardware, but suffer from excessive performance overhead and huge logs caused by tracing every single memory access in the software layer.

In this paper, we present the design and implementation of a novel system, Samsara, which uses the hardware-assisted virtualization (HAV) extensions to achieve efficient and practical deterministic replay without requiring any hardware modification. Unlike prior software schemes which trace every single memory access to record interleaving, Samsara leverages the HAV extensions on commodity processors to track the read-set and write-set for implementing a chunk-based recording scheme in software. By doing so, we avoid all memory access detection, which is a major source of overhead in prior work. We implement and evaluate our system in KVM on a commodity Intel Haswell processor. Evaluation results show that compared with prior software-only schemes, Samsara significantly reduces the log file size to 1/70th on average, and further reduces the recording overhead from about 10x, reported by state-of-the-art works, to 2.3x on average.

Available Media

Hardware-Assisted On-Demand Hypervisor Activation for Efficient Security Critical Code Execution on Mobile Devices

Yeongpil Cho, Seoul National University; Junbum Shin, Samsung Electronics; Donghyun Kwon, Seoul National University; MyungJoo Ham and Yuna Kim, Samsung Electronics; Yunheung Paek, Seoul National University

As more and more mobile applications need to run security critical codes (SCCs) for secure transactions and critical information handling, the demand for a Trusted Execution Environment (TEE) to ensure safe execution of SCCs is rapidly escalating. Although a number of studies have implemented TEEs using TrustZone or hypervisors and have evinced the effectiveness in terms of security, they face major challenges when considering deployment in mobile devices. TrustZone-based approaches bloat the TCB of the system as they must increase the code base size of the most privileged software. Hypervisor-based approaches incur performance overhead on mobile devices that are already suffering from resource restrictions.

To alleviate these problems, in this paper, we propose a hybrid approach that utilizes both TrustZone and a hypervisor. Our approach implements a TEE using a hypervisor, while mitigating performance overhead by activating the hypervisor only when the TEE is demanded by SCCs. This scheme, called on-demand hypervisor activation, has been efficiently and securely implemented by leveraging the memory protection capability of TrustZone. We have implemented our system and evaluated it with real-world applications. The results show that our system can successfully protect SCCs without any noticeable delay (< 100 μs), while keeping the overhead added by our hypervisor during its hibernation near 0%.

Available Media

gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space

Mochi Xue, Shanghai Jiao Tong University and Intel Corporation; Kun Tian, Intel Corporation; Yaozu Dong, Shanghai Jiao Tong University and Intel Corporation; Jiacheng Ma, Jiajun Wang, and Zhengwei Qi, Shanghai Jiao Tong University; Bingsheng He, National University of Singapore; Haibing Guan, Shanghai Jiao Tong University

With increasingly GPU-intensive workloads deployed in the cloud, cloud service providers are seeking practical and efficient GPU virtualization solutions. However, cutting-edge GPU virtualization techniques such as gVirt still suffer from limited scalability, which constrains the number of guest virtual GPU instances.

This paper introduces gScale, a scalable GPU virtualization solution. By taking advantage of the GPU programming model, gScale presents a dynamic sharing mechanism which combines partition and sharing together to break the hardware limitation of global graphics memory space. Particularly, we propose three approaches for gScale: (1) the private shadow graphics translation table, which enables global graphics memory space sharing among virtual GPU instances, (2) ladder mapping and fence memory space pool, which allows the CPU to access host physical memory space (serving the graphics memory) bypassing global graphics memory space, (3) slot sharing, which improves the performance of vGPU under a high density of instances.

The evaluation shows that gScale scales up to 15 guest virtual GPU instances in Linux or 12 guest virtual GPU instances in Windows, which is 5x and 4x scalability, respectively, compared to gVirt. At the same time, gScale incurs a slight runtime overhead on the performance of gVirt when hosting multiple virtual GPU instances.

Available Media

A General Persistent Code Caching Framework for Dynamic Binary Translation (DBT)

Wenwen Wang, Pen-Chung Yew, Antonia Zhai, and Stephen McCamant, University of Minnesota, Twin Cities

Dynamic binary translation (DBT) translates binary code from one instruction set architecture (ISA) to another (same or different) ISA at runtime, which makes it very useful in many applications such as system virtualization, whole program analysis, system debugging, and system security. Many techniques have been proposed to improve the efficiency of DBT systems for long-running and loop-intensive applications. However, for applications with short running time or long-running but with few hot code regions such as JavaScript and C# applications in web services, such techniques have difficulty in amortizing the overhead incurred during binary translation.

To reduce the translation overhead for such applications, this paper presents a general persistent code caching framework, which allows the reuse of translated binary code across different executions for the same or different applications. Compared to existing approaches, the proposed approach can seamlessly handle even dynamically generated code, which is very popular in script applications today. A prototype of the proposed framework has been implemented in an existing retargetable DBT system. Experimental results on a list of applications, including C/C++ and JavaScript, demonstrate that it can achieve 76.4% performance improvement on average compared to the original DBT system without helper threads for dynamic binary translation, and 9% performance improvement on average over the same DBT system with helper threads when code reuse is combined with helper threads.

Available Media

10:40 am–11:00 am

Friday

Break with Refreshments

Ballroom Foyer

11:00 am–12:40 pm

Friday

Operating Systems

Colorado Ballroom F

Session Chair: Geoff Kuenning, Harvey Mudd College

Instant OS Updates via Userspace Checkpoint-and-Restart

Sanidhya Kashyap, Changwoo Min, Byoungyoung Lee, and Taesoo Kim, Georgia Institute of Technology; Pavel Emelyanov, CRIU and Odin, Inc.

In recent years, operating systems have become increasingly complex and thus prone to security and performance issues. Accordingly, system updates to address these issues have become more frequently available and increasingly important. To complete such updates, users must reboot their systems, resulting in unavoidable downtime and further loss of the states of running applications.

We present KUP, a practical OS update mechanism that uses a userspace checkpoint-and-restart mechanism with an optimized data structure for checkpointing and a memory persistence mechanism across updates, combined with a fast in-place kernel switch. This allows for instant kernel updates spanning major kernel versions without any kernel modifications.

Our evaluation shows that KUP can support any type of real kernel patches (e.g., security, minor or even major releases) with large-scale applications that include memcached, mysql, or in the middle of the Linux kernel compilation, unlike well-known dynamic hot-patching techniques (e.g., ksplice). Not only that, KUP can update a running Linux kernel in 3 seconds (overall downtime).

Available Media

Apps with Hardware: Enabling Run-time Architectural Customization in Smart Phones

Michael Coughlin, Ali Ismail, and Eric Keller, University of Colorado, Boulder

In this paper we present a novel system which incorporates programmable hardware (an FPGA) into a smartphone to enable a vision where apps can include both software and hardware components, or apps with hardware. We introduce a novel mechanism to enable sharing the FPGA in a practical manner by leveraging the unique deployment model of mobile applications - namely that deployment is via an app store, where we introduce a new cloud-based compilation. We present our prototype smart phone using the Zedboard, which pairs a Xilinx Zynq FPGA with an embedded Cortex A9, running an Android-based system which we extended to provide run-time system support for dynamically managing apps with hardware and providing a secure loading system. With this prototype, our evaluation demonstrates the performance gains for an AES encryption module (representing cryptography), a QAM modulation module (representing software-defined radio) of 3x to several orders of magnitude, with room for improvement and a hardware-based memory scanner (representing custom co-processors). We demonstrate the feasibility of our cloud-based compilation within the context of real app store statistics. Finally, we present a case study of a complete integration of hardware into an existing application (the Orbot Tor client).

Available Media

Testing Error Handling Code in Device Drivers Using Characteristic Fault Injection

Jia-Ju Bai, Yu-Ping Wang, Jie Yin, and Shi-Min Hu, Tsinghua University

Device drivers may encounter errors when communicating with the OS kernel and hardware. However, error handling code often gets insufficient attention in driver development and testing, because these errors rarely occur in real execution. For this reason, many bugs are hidden in error handling code. Previous approaches for testing error handling code often neglect the characteristics of device drivers, which limits their efficiency and accuracy. In this paper, we first study the source code of Linux drivers to find useful characteristics of error handling code. Then we use these characteristics in fault injection testing, and propose a practical approach named EH-Test, which can automatically and efficiently test error handling code in drivers. To improve the representativeness of injected faults, we design a pattern-based extraction strategy to automatically and accurately extract target functions which can actually fail and trigger error handling code. During execution, we use a monitor to record runtime information and pair checkers to check resource usage. We have evaluated EH-Test on 15 real Linux device drivers and found 50 new bugs in Linux 3.17.2. The code coverage is also effectively increased. Comparison experiments with some previous approaches also show the availability of EH-Test.
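The general combination of fault injection and paired resource checking described above can be sketched in a few lines. This is a toy model and not EH-Test itself: `FaultInjector`, `buggy_probe`, `fixed_probe`, and the resource names are all hypothetical, and a real driver test would inject faults into kernel functions rather than Python methods.

```python
class FaultInjector:
    """Makes the k-th call to alloc() fail and tracks unreleased resources."""

    def __init__(self, fail_at):
        self.fail_at = fail_at
        self.calls = 0
        self.live = set()                  # allocated but not yet freed

    def alloc(self, name):
        self.calls += 1
        if self.calls == self.fail_at:
            return None                    # injected fault
        self.live.add(name)
        return name

    def free(self, name):
        self.live.discard(name)


def buggy_probe(env):
    """Hypothetical driver init with a flawed error path."""
    if env.alloc("dma_buffer") is None:
        return -1
    if env.alloc("irq_line") is None:
        return -1                          # bug: "dma_buffer" is never freed
    return 0


def fixed_probe(env):
    """Same init, but the error path releases what was already acquired."""
    if env.alloc("dma_buffer") is None:
        return -1
    if env.alloc("irq_line") is None:
        env.free("dma_buffer")
        return -1
    return 0


def leaks_under_faults(probe, max_allocs=2):
    """Run probe once per injection point; report resources leaked on failure."""
    report = {}
    for k in range(1, max_allocs + 1):
        env = FaultInjector(fail_at=k)
        if probe(env) != 0 and env.live:
            report[k] = sorted(env.live)
    return report


print(leaks_under_faults(buggy_probe))     # {2: ['dma_buffer']}
print(leaks_under_faults(fixed_probe))     # {}
```

The leak only shows up when the second allocation fails, a path that rarely executes in normal runs, which is why injecting faults at each call that can actually fail is useful.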

Available Media

Multicore Locks: The Case Is Not Closed Yet

Hugo Guiroux and Renaud Lachaize, Université Grenoble Alpes and Laboratoire d'Informatique de Grenoble; Vivien Quéma, Université Grenoble Alpes, Grenoble Institute of Technology, and Laboratoire d'Informatique de Grenoble

NUMA multicore machines are pervasive and many multithreaded applications are suffering from lock contention. To mitigate this issue, application and library developers can choose from the plethora of optimized mutex lock algorithms that have been designed over the past 25 years. Unfortunately, there is currently no broad study of the behavior of these optimized lock algorithms on realistic applications. In this paper, we fill this gap. We perform a performance study of 19 state-of-the-art mutex lock algorithms on 36 realistic applications. Our study shows that regarding locking on multicore machines, the case is not closed yet. Indeed, our conclusions include the following findings: (i) no single lock is the best for more than 50% of the studied workloads; (ii) every lock is harmful for several applications, even if the application parallelism is properly tuned; (iii) for several applications, the optimal lock changes when varying the number of applications or the workload. These findings call for further research on optimized lock algorithms and dynamic adaptation of contention management.

Available Media

Best of the Rest III

Colorado Ballroom G–J

Session Chair: Jon Howell, Google

Passive Wi-Fi: Bringing Low Power to Wi-Fi Transmissions

Bryce Kellogg, Vamsi Talla, Shyamnath Gollakota, and Joshua R. Smith, University of Washington

Best Paper at NSDI '16: Read the Paper

Wi-Fi has traditionally been considered a power-consuming communication system and has not been widely adopted in the sensor network and IoT space. We introduce Passive Wi-Fi that demonstrates for the first time that one can generate 802.11b transmissions using backscatter communication, while consuming 3–4 orders of magnitude lower power than existing Wi-Fi chipsets. Passive Wi-Fi transmissions can be decoded on any Wi-Fi device including routers, mobile phones and tablets. Building on this, we also present a network stack design that enables passive Wi-Fi transmitters to coexist with other devices in the ISM band, without incurring the power consumption of carrier sense and medium access control operations. We build prototype hardware and implement all four 802.11b bit rates on an FPGA platform. Our experimental evaluation shows that passive Wi-Fi transmissions can be decoded on off-the-shelf smartphones and Wi-Fi chipsets over distances of 30–100 feet in various line-of-sight and through-the-wall scenarios. Finally, we design a passive Wi-Fi IC that shows that 1 and 11 Mbps transmissions consume 14.5 and 59.2 µW respectively. This translates to 10000x lower power than existing Wi-Fi chipsets and 1000x lower power than Bluetooth LE and ZigBee.

Available Media

An Industrial-Scale Software Defined Internet Exchange Point

Arpit Gupta and Robert MacDavid, Princeton University; Rüdiger Birkner, ETH Zürich; Marco Canini, Université catholique de Louvain; Nick Feamster and Jennifer Rexford, Princeton University; Laurent Vanbever, ETH Zürich

Community Award Winner at NSDI '16: Read the Paper

Software-Defined Internet Exchange Points (SDXes) promise to significantly increase the flexibility and function of interdomain traffic delivery on the Internet. Unfortunately, current SDX designs cannot yet achieve the scale required for large Internet exchange points (IXPs), which can host hundreds of participants exchanging traffic for hundreds of thousands of prefixes. Existing platforms are indeed too slow and inefficient to operate at this scale, typically requiring minutes to compile policies and millions of forwarding rules in the data plane.

We motivate, design, and implement iSDX, the first SDX architecture that can operate at the scale of the largest IXPs. We show that iSDX reduces both policy compilation time and forwarding table size by two orders of magnitude compared to current state-of-the-art SDX controllers. Our evaluation against a trace from one of the largest IXPs in the world found that iSDX can compile a realistic set of policies for 500 IXP participants in less than three seconds. Our public release of iSDX, complete with tutorials and documentation, is already spurring early adoption in operational networks.

Available Media

PhyCloak: Obfuscating Sensing from Communication Signals

Yue Qiao, Ouyang Zhang, Wenjie Zhou, Kannan Srinivasan, and Anish Arora, The Ohio State University

Best Student Paper at NSDI '16: Link to Paper

Recognition of human activities and gestures using preexisting WiFi signals has been shown to be feasible in recent studies. Given the pervasiveness of WiFi signals, this emerging sort of sensing poses a serious privacy threat. This paper is the first to counter the threat of unwanted or even malicious communication-based sensing: it proposes a blackbox sensor obfuscation technique, PhyCloak, which distorts only the physical information in the communication signal that leaks privacy. The data in the communication signal is preserved and, in fact, the throughput of the link is increased with careful design. Moreover, the design allows coupling of the PhyCloak module with legitimate sensors, so that their sensing is preserved, while that of illegitimate sensors is obfuscated. The effectiveness of the design is validated via a prototype implementation on an SDR platform.

Available Media

Hyperprobe: Towards Virtual Machine Extrospection

Jidong Xiao, College of William and Mary; Lei Lu, VMware Inc.; Hai Huang, IBM T. J. Watson Research Center; Haining Wang, University of Delaware

Best Paper at LISA15: Link to Paper

In a virtualized environment, it is not difficult to retrieve guest OS information from its hypervisor. However, it is very challenging to retrieve information in the reverse direction, i.e., retrieve the hypervisor information from within a guest OS, which remains an open problem and has not yet been comprehensively studied. In this paper, we take the initiative and study this reverse information retrieval problem. In particular, we investigate how to determine the host OS kernel version from within a guest OS. We observe that modern commodity hypervisors introduce new features and bug fixes in almost every new release. Thus, by carefully analyzing the seven-year evolution of Linux KVM development (including 3485 patches), we can identify 19 features and 20 bugs in the hypervisor detectable from within a guest OS. Building on our detection of these features and bugs, we present a novel framework called Hyperprobe that for the first time enables users in a guest OS to automatically detect the underlying host OS kernel version in a few minutes. We implement a prototype of Hyperprobe and evaluate its effectiveness in five real-world clouds, including Google Compute Engine (a.k.a. Google Cloud), HP Helion Public Cloud, ElasticHosts, Joyent Cloud, and CloudSigma, as well as in a controlled testbed environment, all yielding promising results.
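One plausible way to turn feature and bug observations into a version estimate is interval narrowing, sketched below. This is an assumption about the general approach, not Hyperprobe's actual algorithm: the feature names, bug names, and version numbers are made up for illustration, and the sketch assumes an absent bug means its fix has been applied (it ignores when the bug was first introduced).

```python
def infer_version_range(feature_intro, bug_fixed, observed_features, observed_bugs):
    """Return (low, high): host version is >= low and < high (None = unbounded)."""
    low, high = None, None
    for name, version in feature_intro.items():
        if name in observed_features:      # feature exists: at least this version
            low = version if low is None else max(low, version)
        else:                              # feature missing: older than this version
            high = version if high is None else min(high, version)
    for name, version in bug_fixed.items():
        if name in observed_bugs:          # bug still present: older than the fix
            high = version if high is None else min(high, version)
        else:                              # bug gone: at least the fix version
            low = version if low is None else max(low, version)
    return low, high


# Hypothetical knowledge base: version in which each feature appeared / bug was fixed.
feature_intro = {"feat_a": (3, 0), "feat_b": (3, 6), "feat_c": (3, 12)}
bug_fixed = {"bug_x": (3, 2), "bug_y": (3, 10)}

# Hypothetical probe results gathered from inside the guest.
result = infer_version_range(
    feature_intro, bug_fixed,
    observed_features={"feat_a", "feat_b"},
    observed_bugs={"bug_y"},
)
print(result)   # ((3, 6), (3, 10))
```

Each additional detectable feature or bug can only tighten the interval, which is consistent with the abstract's emphasis on mining many features and bugs from the hypervisor's history.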

Available Media
