Technical Sessions

The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter 
Proceedings Cover | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
 USENIX ATC '15 Full Proceedings (PDF)
 USENIX ATC '15 Proceedings Interior (PDF, best for mobile devices)
 USENIX ATC '15 Errata Slip (PDF) (Updated 6/29/15)
 USENIX ATC '15 Errata Slip (PDF) (Updated 7/9/15)
 USENIX ATC '15 Errata Slip (PDF) (Updated 12/23/15)

Full Proceedings ePub (for iPad and most eReaders)
 USENIX ATC '15 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
 USENIX ATC '15 Full Proceedings (Mobi)

Downloads for Registered Conference Attendees

Attendee Files 

(Registered attendees: Sign in to your USENIX account to download these files.)

USENIX ATC '15 Proceedings Web Archive (7z compressed file)
USENIX ATC '15 Attendee List

 

Wednesday, July 8, 2015

7:30 am–8:30 am Wednesday

Continental Breakfast

8:30 am–8:45 am Wednesday

Introduction and Awards

Program Co-Chairs: Shan Lu, University of Chicago, and Erik Riedel, EMC

8:45 am–9:50 am Wednesday

Keynote Address

Network Protocols: Myths, Missteps, and Mysteries

Radia Perlman, EMC Corporation

Available Media

9:50 am–10:15 am Wednesday

Break with Refreshments

10:15 am–11:55 am Wednesday

Parallel & Distributed Systems

Track 1: Refereed Papers

Session Chair: Hakim Weatherspoon, Cornell University

Spartan: A Distributed Array Framework with Smart Tiling

Chien-Chin Huang, New York University; Qi Chen, Peking University; Zhaoguo Wang and Russell Power, New York University; Jorge Ortiz, IBM T.J. Watson Research Center; Jinyang Li, New York University; Zhen Xiao, Peking University

Application programmers in domains like machine learning, scientific computing, and computational biology are accustomed to using powerful, high-productivity array languages such as MatLab, R, and NumPy. Distributed array frameworks aim to scale array programs across machines. However, maximizing the locality of access to distributed arrays is an unsolved problem; such locality is critical for high performance. This paper presents Spartan, a distributed array framework that automatically determines how to best partition (aka “tile”) n-dimensional arrays and to co-locate data with computation to maximize locality. Spartan combines a lazy-evaluation based, optimizing frontend with a distributed tiled array backend. Central to Spartan’s design is a small number of carefully chosen parallel high-level operators, which form the expression graph captured by Spartan’s frontend during runtime. These operators simplify the programming of distributed applications. More importantly, their well-defined semantics allow Spartan’s runtime to calculate the costs of different tiling strategies and pick the best one for evaluating the entire expression graph.

Using Spartan, we have implemented 12 applications from a variety of domains including machine learning and scientific computing. Our evaluations show that Spartan’s automatic tiling mechanism leads to good and scalable performance while eliminating the need for manual tiling.
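
As a rough, hypothetical illustration of the cost-based tiling choice (this is not Spartan's API, just a sketch of the idea): for each candidate tiling of an expression's result, estimate the bytes that would have to be shuffled to re-tile mismatched operands, and keep the cheapest candidate.

    # Minimal, hypothetical sketch of cost-based tiling (not Spartan's actual API):
    # pick the result tiling that minimizes estimated bytes shuffled to re-tile
    # operands whose current tiling does not match.

    def retile_cost(result_tiling, operand_tilings, shape, elem_bytes=8):
        n_elems = shape[0] * shape[1]
        # Every operand tiled differently from the result must be moved once.
        return sum(n_elems * elem_bytes
                   for t in operand_tilings if t != result_tiling)

    def choose_tiling(operand_tilings, shape, candidates=("row", "col")):
        return min(candidates,
                   key=lambda c: retile_cost(c, operand_tilings, shape))

    # Element-wise add of a row-tiled and a column-tiled 10,000 x 10,000 array:
    # either choice moves exactly one operand, so the tie resolves to "row".
    print(choose_tiling(["row", "col"], (10000, 10000)))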

Available Media

Experience with Rules-Based Programming for Distributed, Concurrent, Fault-Tolerant Code

Ryan Stutsman, University of Utah; Collin Lee and John Ousterhout, Stanford University

This paper describes how a rules-based approach allowed us to solve a broad class of challenging distributed system problems in the RAMCloud storage system. In the rules-based approach, behavior is described with small sections of code that trigger independently based on system state; this provides a clean separation between the deterministic and nondeterministic parts of an algorithm. To simplify the implementation of rules-based modules, we developed a task abstraction for information hiding and complexity management, pools for grouping tasks and minimizing the cost of rule evaluation, and a polling-based asynchronous RPC system. The rules-based approach is a special case of an event-based state machine, but it encourages a cleaner factoring of code.
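
A minimal sketch of the rules-based style, with illustrative names rather than RAMCloud's actual task interface: each rule pairs a condition over system state with an action, and a task simply polls its rules and fires whichever apply.

    # Hypothetical sketch of the rules-based style described above: rules fire
    # independently whenever their condition over the current state holds, and
    # a task groups related rules for polling.

    class Task:
        def __init__(self, state):
            self.state = state
            self.rules = []          # list of (condition, action) pairs

        def rule(self, condition, action):
            self.rules.append((condition, action))

        def poll(self):
            # Evaluate every rule against the current state; fire those that apply.
            for condition, action in self.rules:
                if condition(self.state):
                    action(self.state)

    # Example: re-send an RPC whenever it is outstanding but has timed out.
    state = {"rpc_outstanding": True, "rpc_timed_out": True, "retries": 0}
    task = Task(state)
    task.rule(lambda s: s["rpc_outstanding"] and s["rpc_timed_out"],
              lambda s: s.update(rpc_timed_out=False, retries=s["retries"] + 1))
    task.poll()
    print(state["retries"])   # 1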

Available Media

Tiered Replication: A Cost-effective Alternative to Full Cluster Geo-replication

Asaf Cidon, Stanford University; Robert Escriva, Cornell University; Sachin Katti and Mendel Rosenblum, Stanford University; Emin Gün Sirer, Cornell University

Cloud storage systems typically use three-way random replication to guard against data loss within the cluster, and utilize cluster geo-replication to protect against correlated failures. This paper presents a much lower cost alternative to full cluster geo-replication. We demonstrate that in practical settings, using two replicas is sufficient for protecting against independent node failures, while using three random replicas is inadequate for protecting against correlated node failures.

We present Tiered Replication, a replication scheme that splits the cluster into a primary and backup tier. The first two replicas are stored on the primary tier and are used to recover data in the case of independent node failures, while the third replica is stored on the backup tier and is used to protect against correlated failures. The key insight of our paper is that, since the third replicas are rarely read, we can place the backup tier on separate physical infrastructure or a remote location without affecting performance. This separation significantly increases the resilience of the storage system to correlated failures and presents a low cost alternative to geo-replication of an entire cluster. In addition, the Tiered Replication algorithm optimally minimizes the probability of data loss under correlated failures. Tiered Replication can be executed incrementally for each cluster change, which allows it to support dynamic environments in which nodes join and leave the cluster, and it facilitates additional data placement constraints required by the storage designer, such as network and rack awareness. We have implemented Tiered Replication on HyperDex, an open-source cloud storage system, and demonstrate that it incurs a small performance overhead. Tiered Replication improves the cluster-wide MTTF by a factor of 20,000 compared to random replication and by a factor of 20 compared to previous non-random replication schemes, without increasing the amount of storage.
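
The tiering idea can be sketched in a few lines (illustrative only; the paper's algorithm additionally minimizes the probability of correlated data loss and updates placement incrementally as nodes join and leave):

    # Hypothetical sketch of tiered placement (not the paper's algorithm or the
    # HyperDex implementation): the first two replicas go to the primary tier,
    # the rarely read third replica goes to the backup tier.

    import random

    def place_replicas(primary_nodes, backup_nodes, k_primary=2):
        """Return replica locations: k_primary copies in the primary tier,
        plus one copy in the backup tier."""
        primaries = random.sample(primary_nodes, k_primary)
        backup = random.choice(backup_nodes)
        return primaries + [backup]

    primary_tier = [f"primary-{i}" for i in range(20)]
    backup_tier = [f"backup-{i}" for i in range(5)]
    print(place_replicas(primary_tier, backup_tier))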

Available Media

Callisto-RTS: Fine-Grain Parallel Loops

Tim Harris, Oracle Labs; Stefan Kaestle, ETH Zürich

We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops—down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need for tuning workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts), and on an 8-socket SPARC machine (1024 h/w contexts). In addition to reducing the need for tuning, on the SPARC machines we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
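
A toy sketch of the scheduling scheme using Python threads (illustrative only; Callisto-RTS targets batches of roughly 1K cycles with an asynchronous, lock-free combining mechanism that a lock-based sketch cannot reproduce): each worker starts on a pre-assigned batch and then pulls further small batches from a shared index when it runs dry.

    import threading

    def parallel_for(n_iters, n_workers, body, batch=64):
        # Batches 0..n_workers-1 are pre-assigned; the rest are handed out on demand.
        next_index = [min(n_workers * batch, n_iters)]
        lock = threading.Lock()

        def worker(wid):
            # Initial per-worker batch, handed out with no synchronization at all.
            start = wid * batch
            end = min(start + batch, n_iters)
            while start < end:
                for i in range(start, end):
                    body(i)
                with lock:                   # grab the next small batch of work
                    start = next_index[0]
                    end = next_index[0] = min(start + batch, n_iters)
            # Workers whose initial batch is empty simply exit.

        threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    results = [0] * 10_000
    parallel_for(10_000, 8, lambda i: results.__setitem__(i, i * i))
    print(sum(results) == sum(i * i for i in range(10_000)))   # True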

Available Media

At Small Scale

Track 2: Industry Talks

Session Chair: Dilma Da Silva, Texas A&M University

Simba: Building Data-Centric Applications for Mobile Devices

Nitin Agrawal, NEC Labs

Personal smart devices have become ubiquitous. Millions of innovative apps run on our tablets, smart phones, and smart watches, bringing these devices to life. Whether the app provides photo sharing, heart-rate monitoring, collaborative editing, or even gaming, managing data locally on the devices and remotely through various cloud-based services is crucial in defining the user experience. In the near future, the needs of the apps running on tens of billions of networked smart devices will further exacerbate the challenges with managing data. Much of the value proposition, and the burden, will also rest with the cloud infrastructure to provide quick responses to queries, personalize and recommend through analytics, and allow seamless access to content. To build “data-centric” applications spanning the devices and the cloud, developers need high-level abstractions for managing data. My talk will focus on the systems infrastructure that powers such applications.

In this talk, I will present evidence as to why existing data abstractions, for local storage, are counter-productive for performance, and for network transfer, are inadequate for consistency, efficiency, and programmability. I will present a study of several popular mobile apps on Android, including ones that use commercial sync services, where we found the apps to be unreliable, losing and corrupting user data, and inconsistent, under concurrent use. I will then present Simba, a data-management platform built for mobile apps with tunable end-to-end consistency. Simba’s table abstraction unifies both structured and unstructured data, and enables developers to write and deploy quality apps with ease.


A New Paradigm for Power Management

MyungJoo Ham, Geunsik Lim, and Dongyun Jin, Samsung Electronics

The paradigm of power management has been moving from device-centric to user-centric as device-centric approaches appear to be near the wall of saturation. With the emergence of user-centric approaches, academia is seeing new research topics and the industry is seeing new chances to evolve further. However, being simultaneously designers of an operating system, Tizen, and developers of commercial products at Samsung, including the Gear series and the Z1, we are already observing drawbacks and limitations of several user-centric approaches. Such issues are usually induced by not properly understanding other stakeholders in the ecosystem: application developers (in-house, 2nd party, and 3rd party, with characteristics much different from each other) and customers (end users, operators, and 2nd/3rd party manufacturers). In this talk, we want to share experiences and perspectives on device-centric and user-centric approaches from both operating system designers and their customers, the commercial product developers. Then, to address significant points missed by the conventional paradigms, we propose the next paradigm of power management: Eco-Centric Power Management.

Available Media


Faster Booting in Consumer Electronics

Geunsik Lim, Samsung Electronics

In the consumer electronics market, the need for fast boot has become paramount; think of a TV set taking 20 seconds to boot up next to another TV set taking 3 seconds on the store shelf. Suspend-to-RAM might be an appealing solution; however, many customers unplug devices from the outlet to save energy, making suspend-to-RAM useless. Besides, with the emergence of smart electronics, especially smart phones and smart TVs, the initial state of a device has become nondeterministic, making snapshot booting techniques based on suspend-to-disk extremely difficult. Thus, vendors of such devices depend on the performance of a normal booting sequence: a.k.a. "cold boot".

Recently, the importance of this area has been dramatically highlighted by the increasing boot time of mobile platforms. We have been developing general-purpose-OS-based consumer electronics with the requirement of showing content within a few seconds from a "cold boot", where even tens of milliseconds appear significant to engineers. In such optimization practices, the worst enemy lies in free page scanning over physical memory. In this talk, we explain why we cannot meet the requirements of a lightweight kernel without studying the memory allocator in embedded devices.

Available Media
11:55 am–1:25 pm Wednesday

Lunch (on your own)

1:25 pm–3:30 pm Wednesday

Cloud Storage

Track 1: Refereed Papers

Session Chair: Ajay Gulati, Zerostack, Inc.

LAMA: Optimized Locality-aware Memory Allocation for Key-value Cache

Xiameng Hu, Xiaolin Wang, Yechen Li, Lan Zhou, and Yingwei Luo, Peking University; Chen Ding, University of Rochester; Song Jiang, Wayne State University; Zhenlin Wang, Michigan Technological University

The in-memory cache system is a performance-critical layer in today’s web server architecture. Memcached is one of the most effective, representative, and prevalent among such systems. An important problem is memory allocation. The default design does not make the best use of the memory. It fails to adapt when the demand changes, a problem known as slab calcification.

This paper introduces locality-aware memory allocation (LAMA), which solves the problem by first analyzing the locality of the Memcached requests and then repartitioning the memory to minimize the miss ratio and the average response time. By evaluating LAMA using various industry and academic workloads, the paper shows that LAMA outperforms existing techniques in the steady-state performance, the speed of convergence, and the ability to adapt to request pattern changes and overcome slab calcification. The new solution is close to optimal, achieving over 98% of the theoretical potential.
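
One way to picture locality-aware repartitioning (a hypothetical sketch, not LAMA's actual algorithm, which also accounts for average response time): given a miss-ratio curve per slab class, repeatedly hand the next unit of memory to whichever class gains the largest miss reduction.

    # Hypothetical sketch of miss-ratio-driven memory partitioning across slab
    # classes (illustrative only).

    def partition(miss_curves, total_units):
        """miss_curves[c][u] = expected misses for class c when given u memory units."""
        alloc = {c: 0 for c in miss_curves}
        for _ in range(total_units):
            # Marginal benefit (miss reduction) of one more unit for each class.
            best = max(miss_curves,
                       key=lambda c: miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1])
            alloc[best] += 1
        return alloc

    curves = {
        "64B": [100, 40, 20, 15, 14, 14],   # hot, small objects: large early gains
        "1KB": [80, 70, 60, 50, 45, 44],    # steadier, smaller gains
    }
    print(partition(curves, 4))             # {'64B': 2, '1KB': 2}

A greedy split like this is only sensible when the curves show diminishing returns; it is meant purely to show how miss-ratio curves can drive the memory partition.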

Available Media

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Items

Xingbo Wu and Yuehai Xu, Wayne State University; Zili Shao, The Hong Kong Polytechnic University; Song Jiang, Wayne State University

Key-value (KV) stores have become a backbone of large-scale applications in today’s data centers. The data set of the store on a single server can grow to billions of KV items or many terabytes, while individual data items are often small (with their values as small as a couple of bytes). It is a daunting task to efficiently organize such an ultra-large KV store to support fast access. Current KV storage systems have one or more of the following inadequacies: (1) very high data write amplifications, (2) large index set, and (3) dramatic degradation of read performance with overspill index out of memory.

To address these issues, we propose LSM-trie, a KV storage system that substantially reduces metadata for locating KV items, reduces write amplification by an order of magnitude, and needs only two disk accesses for each KV read even when less than 10% of metadata (Bloom filters) can be held in memory. To this end, LSM-trie constructs a trie, or a prefix tree, that stores data in a hierarchical structure and keeps re-organizing them using a compaction method much more efficient than that adopted for LSM-tree. Our experiments show that LSM-trie can improve write and read throughput of LevelDB, a state-of-the-art KV system, by up to 20 times and up to 10 times, respectively.
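
To make the prefix-tree idea concrete, here is a tiny stand-alone sketch (illustrative only; it ignores LSM-trie's on-disk layout, Bloom filters, and compaction): each trie level consumes a few bits of the key's hash, so a lookup follows a single root-to-leaf path and different sub-tries can be reorganized independently.

    # Hypothetical sketch of prefix-tree placement by key hash.

    import hashlib

    BITS_PER_LEVEL = 4

    def hash_bits(key):
        return bin(int(hashlib.sha1(key.encode()).hexdigest(), 16))[2:].zfill(160)

    def path_for(key, depth):
        """Return the trie path (one node label per level) for a key."""
        bits = hash_bits(key)
        return [bits[i * BITS_PER_LEVEL:(i + 1) * BITS_PER_LEVEL] for i in range(depth)]

    # Two different keys normally diverge after a few levels, so their data ends
    # up in different sub-tries that can be compacted independently.
    print(path_for("user:42", depth=3))
    print(path_for("user:43", depth=3))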

Available Media

MetaSync: File Synchronization Across Multiple Untrusted Storage Services

Seungyeop Han and Haichen Shen, University of Washington; Taesoo Kim, Georgia Institute of Technology; Arvind Krishnamurthy, Thomas Anderson, and David Wetherall, University of Washington

Cloud-based file synchronization services, such as Dropbox, are a worldwide resource for many millions of users. However, individual services often have tight resource limits, suffer from temporary outages or even shutdowns, and sometimes silently corrupt or leak user data.

We design, implement, and evaluate MetaSync, a secure and reliable file synchronization service that uses multiple cloud synchronization services as untrusted storage providers. To make MetaSync work correctly, we devise a novel variant of Paxos that provides efficient and consistent updates on top of the unmodified APIs exported by existing services. Our system automatically redistributes files upon reconfiguration of providers.

Our evaluation shows that MetaSync provides low update latency and high update throughput while being more trustworthy and available. MetaSync outperforms its underlying cloud services by 1.2–10× on three realistic workloads.

Available Media

Pyro: A Spatial-Temporal Big-Data Storage System

Shen Li and Shaohan Hu, University of Illinois at Urbana-Champaign; Raghu Ganti and Mudhakar Srivatsa, IBM Research; Tarek Abdelzaher, University of Illinois at Urbana-Champaign

With the rapid growth of mobile devices and applications, geo-tagged data has become a major workload for big data storage systems. In order to achieve scalability, existing solutions build an additional index layer above general purpose distributed data stores. While fulfilling the semantic-level need, this approach leaves a lot to be desired for execution efficiency, especially when users query for moving objects within a high-resolution geometric area, which we call geometry queries. Such geometry queries translate to a much larger set of range scans, forcing the backend to handle orders of magnitude more requests. Moreover, spatial-temporal applications naturally create dynamic workload hotspots, which pushes beyond the design scope of existing solutions. This paper presents Pyro, a spatial-temporal big-data storage system tailored for high-resolution geometry queries and dynamic hotspots. Pyro understands geometries internally, which allows range scans of a geometry query to be aggregately optimized. Moreover, Pyro employs a novel replica placement policy in the DFS layer that allows Pyro to split a region without losing data locality benefits. Our evaluations use NYC taxi trace data and an 80-server cluster. Results show that Pyro reduces the response time by 60× on 1km×1km rectangle geometries compared to the state-of-the-art solutions. Pyro further achieves 10× throughput improvement on 100m×100m rectangle geometries.

Available Media

CDStore: Toward Reliable, Secure, and Cost-Efficient Cloud Storage via Convergent Dispersal

Mingqiang Li, Chuan Qin, and Patrick P. C. Lee, The Chinese University of Hong Kong

We present CDStore, which disperses users’ backup data across multiple clouds and provides a unified multi-cloud storage solution with reliability, security, and cost efficiency guarantees. CDStore builds on an augmented secret sharing scheme called convergent dispersal, which supports deduplication by using deterministic content-derived hashes as inputs to secret sharing. We present the design of CDStore, and in particular, describe how it combines convergent dispersal with two-stage deduplication to achieve both bandwidth and storage savings and be robust against side-channel attacks. We evaluate the performance of our CDStore prototype using real-world workloads on LAN and commercial cloud testbeds. Our cost analysis also demonstrates that CDStore achieves a monetary cost saving of 70% over a baseline cloud storage solution using state-of-the-art secret sharing.
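
The convergent part of convergent dispersal can be illustrated with a deliberately simplified sketch, an n-out-of-n XOR scheme that is far weaker than CDStore's actual CAONT-RS construction: the randomness fed into secret sharing is derived from a hash of the content itself, so identical plaintext always yields identical shares, and the shares deduplicate across users.

    # Toy illustration of convergent dispersal (NOT CDStore's scheme): the pads
    # are derived deterministically from the content, so equal data gives equal
    # shares. Unlike real (k, n) schemes, all n shares are needed for recovery.

    import hashlib

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def convergent_shares(secret, n=3):
        # Deterministic "random" pads derived from the content itself.
        pads = [hashlib.sha256(secret + bytes([i])).digest()[:len(secret)]
                for i in range(n - 1)]
        last = secret
        for p in pads:
            last = xor_bytes(last, p)
        return pads + [last]

    def recover(shares):
        out = shares[0]
        for s in shares[1:]:
            out = xor_bytes(out, s)
        return out

    data = b"backup chunk contents"
    shares1 = convergent_shares(data)
    shares2 = convergent_shares(data)
    print(shares1 == shares2)          # True: identical data -> identical shares
    print(recover(shares1) == data)    # True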

Available Media

OS, Security, and Storage

Track 2: Best of the Rest

Session Chair: Dan Tsafrir, Technion—Israel Institute of Technology

Arrakis: The Operating System is the Control Plane

Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Timothy Roscoe, ETH Zurich
Best Paper at OSDI '14: Link to Paper

Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2–5× in latency and 9× in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.

Shielding Applications from an Untrusted Cloud with Haven

Andrew Baumann, Marcus Peinado, and Galen Hunt, Microsoft Research
Best Paper at OSDI '14: Link to Paper

Today's cloud computing infrastructure requires substantial trust. Cloud users rely on both the provider's staff and its globally-distributed software/hardware platform not to expose any of their private data.

We introduce the notion of shielded execution, which protects the confidentiality and integrity of a program and its data from the platform on which it runs (i.e., the cloud operator's OS, VM and firmware). Our prototype, Haven, is the first system to achieve shielded execution of unmodified legacy applications, including SQL Server and Apache, on a commodity OS (Windows) and commodity hardware. Haven leverages the hardware protection of Intel SGX to defend against privileged code and physical attacks such as memory probes, but also addresses the dual challenges of executing unmodified legacy binaries and protecting them from a malicious host.

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Adam Belay, Stanford University; George Prekas, École Polytechnique Fédérale de Lausanne (EPFL); Ana Klimovic, Samuel Grossman, and Christos Kozyrakis, Stanford University; Edouard Bugnion, École Polytechnique Fédérale de Lausanne (EPFL)
Best Paper at OSDI '14: Link to Paper

The conventional wisdom is that aggressive networking requirements, such as high packet rates for small messages and microsecond-scale tail latency, are best addressed outside the kernel, in a user-level networking stack. We present IX, a dataplane operating system that provides high I/O performance, while maintaining the key advantage of strong protection offered by existing kernels. IX uses hardware virtualization to separate management and scheduling functions of the kernel (control plane) from network processing (dataplane). The dataplane architecture builds upon a native, zero-copy API and optimizes for both bandwidth and latency by dedicating hardware threads and networking queues to dataplane instances, processing bounded batches of packets to completion, and by eliminating coherence traffic and multi-core synchronization. We demonstrate that IX outperforms Linux and state-of-the-art, user-space network stacks significantly in both throughput and end-to-end latency. Moreover, IX improves the throughput of a widely deployed, key-value store by up to 3.6× and reduces tail latency by more than 2×.

Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing

Matthew Fredrikson, Eric Lantz, and Somesh Jha, University of Wisconsin—Madison; Simon Lin, Marshfield Clinic Research Foundation; David Page and Thomas Ristenpart, University of Wisconsin—Madison
Best Paper at USENIX Security '14: Link to Paper

We initiate the study of privacy in pharmacogenetics, wherein machine learning models are used to guide medical treatments based on a patient’s genotype and background. Performing an in-depth case study on privacy in personalized warfarin dosing, we show that suggested models carry privacy risks, in particular because attackers can perform what we call model inversion: an attacker, given the model and some demographic information about a patient, can predict the patient’s genetic markers.

As differential privacy (DP) is an oft-proposed solution for medical settings such as this, we evaluate its effectiveness for building private versions of pharmacogenetic models. We show that DP mechanisms prevent our model inversion attacks when the privacy budget is carefully selected. We go on to analyze the impact on utility by performing simulated clinical trials with DP dosing models. We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality. We conclude that current DP mechanisms do not simultaneously improve genomic privacy while retaining desirable clinical efficacy, highlighting the need for new mechanisms that should be evaluated in situ using the general methodology introduced by our work.

Skylight—A Window on Shingled Disk Operation

Abutalib Aghayev and Peter Desnoyers, Northeastern University
Best Paper at FAST '15: Link to Paper

We introduce Skylight, a novel methodology that combines software and hardware techniques to reverse engineer key properties of drive-managed Shingled Magnetic Recording (SMR) drives. The software part of Skylight measures the latency of controlled I/O operations to infer important properties of drive-managed SMR, including type, structure, and size of the persistent cache; type of cleaning algorithm; type of block mapping; and size of bands. The hardware part of Skylight tracks drive head movements during these tests, using a high-speed camera through an observation window drilled through the cover of the drive. These observations not only confirm inferences from measurements, but resolve ambiguities that arise from the use of latency measurements alone. We show the generality and efficacy of our techniques by running them on top of three emulated and two real SMR drives, discovering valuable performance-relevant details of the behavior of the real SMR drives.

3:30 pm–3:55 pm Wednesday

Break with Refreshments

3:55 pm–6:00 pm Wednesday

Dependability

Track 1: Refereed Papers

Session Chair: Xi Wang, University of Washington

Surviving Peripheral Failures in Embedded Systems

Rebecca Smith and Scott Rixner, Rice University

Peripherals fail. Yet, modern embedded systems largely leave the burden of tolerating peripheral failures to the programmer. This paper presents Phoenix, a semi-automated peripheral recovery system for resource-constrained embedded systems. Phoenix introduces lightweight checkpointing mechanisms that transparently track both the internal program state and the external peripheral state. These mechanisms enable rollback to the precise point at which any failed peripheral access occurred using as little as 6 KB of memory, minimizing both recovery latency and memory utilization.

Available Media

Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis

Rui Ding, Hucheng Zhou, Jian-Guang Lou, Hongyu Zhang, and Qingwei Lin, Microsoft Research; Qiang Fu, Microsoft; Dongmei Zhang, Microsoft Research; Tao Xie, University of Illinois at Urbana-Champaign

Logging has been a common practice for monitoring and diagnosing performance issues. However, logging comes at a cost, especially for large-scale online service systems. First, the overhead incurred by intensive logging is non-negligible. Second, it is costly to diagnose a performance issue if there is a tremendous number of redundant logs. Therefore, we believe that it is important to limit the overhead incurred by logging, without sacrificing the logging effectiveness. In this paper we propose Log2, a cost-aware logging mechanism. Given a “budget” (defined as the maximum volume of logs allowed to be output in a time interval), Log2 makes the "whether to log" decision through a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and output while complying with the logging budget. In this way, Log2 keeps the useful logs and discards the less useful ones. We have implemented Log2 and evaluated it on an open source system as well as a real-world online service system from Microsoft. The experimental results show that Log2 can control logging overhead while preserving logging effectiveness.
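
A minimal sketch of a two-phase, budgeted logging filter in the spirit described above (names and heuristics are illustrative, not Log2's): phase one cheaply discards entries that look irrelevant, and phase two buffers candidates and emits at most a budgeted number per interval.

    # Hypothetical two-phase, budgeted logging filter (illustrative only).

    class BudgetedLogger:
        def __init__(self, budget_per_interval):
            self.budget = budget_per_interval
            self.buffer = []

        def log(self, message, latency_ms, threshold_ms=100):
            # Phase 1: cheap local filter - keep only entries that look relevant
            # to a performance issue (here: slow operations).
            if latency_ms < threshold_ms:
                return
            # Phase 2: buffer candidates; the flush decides what fits the budget.
            self.buffer.append((latency_ms, message))

        def flush(self):
            # Emit the most useful entries (slowest first) up to the budget.
            self.buffer.sort(reverse=True)
            emitted, self.buffer = self.buffer[:self.budget], []
            for latency, msg in emitted:
                print(f"{latency}ms {msg}")

    logger = BudgetedLogger(budget_per_interval=2)
    for ms in (12, 250, 103, 999, 40):
        logger.log(f"request took {ms}ms", ms)
    logger.flush()   # prints only the 999ms and 250ms entries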

Available Media

Identifying Trends in Enterprise Data Protection Systems

George Amvrosiadis, University of Toronto; Medha Bhadkamkar, Symantec Research Labs

Enterprises routinely use data protection techniques to achieve business continuity in the event of failures. To ensure that backup and recovery goals are met in the face of the steep data growth rates of modern workloads, data protection systems need to constantly evolve. Recent studies show that these systems routinely miss their goals today. However, there is little work in the literature to understand why this is the case.

In this paper, we present a study of 40,000 enterprise data protection systems deploying Symantec NetBackup, a commercial backup product. In total, we analyze over a million weekly reports which have been collected over a period of three years. We discover that the main reason behind inefficiencies in data protection systems is misconfigurations. Furthermore, our analysis shows that these systems grow in bursts, leaving clients unprotected at times, and are often configured using the default parameter values. As a result, we believe there is potential in developing automated, self-healing data protection systems that achieve higher efficiency standards. To aid researchers in the development of such systems, we use our dataset to identify trends characterizing data protection systems with regards to configuration, job scheduling, and data growth.

Available Media

Systematically Exploring the Behavior of Control Programs

Jason Croft, University of Illinois at Urbana-Champaign; Ratul Mahajan, Microsoft Research; Matthew Caesar, University of Illinois at Urbana-Champaign; Madan Musuvathi, Microsoft Research

Many networked systems today, ranging from home automation networks to global wide-area networks, are operated using centralized control programs. Bugs in such programs pose serious risks to system security and stability. We develop a new technique to systematically explore the behavior of control programs. Because control programs depend intimately on absolute and relative timing of inputs, a key challenge that we face is to systematically handle time. We develop an approach that models programs as timed automata and incorporates novel mechanisms to enable scalable and comprehensive exploration. We implement our approach in a tool called DeLorean and apply it to real control programs for home automation and software-defined networks. DeLorean is able to find bugs in these programs as well as provide significantly better code coverage—up to 94% compared to 76% for existing techniques.

Available Media

Fence: Protecting Device Availability With Uniform Resource Control

Tao Li and Albert Rafetseder, New York University; Rodrigo Fonseca, Brown University; Justin Cappos, New York University

Applications such as software updaters or runaway web apps, even if low priority, can cause performance degradation, loss of battery life, or other issues that reduce a computing device’s availability. The core problem is that OS resource control mechanisms unevenly apply uncoordinated policies across different resources. This paper shows how handling resources – e.g., CPU, memory, sockets, and bandwidth – in coordination, through a unifying abstraction, can be both simpler and more effective. We abstract resources along the two dimensions of fungibility and renewability, to enable resource-agnostic algorithms to provide resource limits for a diverse set of applications.

We demonstrate the power of our resource abstraction with a prototype resource control subsystem, Fence, which we implement for two sandbox environments running on a wide variety of operating systems (Windows, Linux, the BSDs, Mac OS X, iOS, Android, OLPC, and Nokia) and device types (servers, desktops, tablets, laptops, and smartphones). We use Fence to provide systemwide protection against resource hogging processes that include limiting battery drain, preventing overheating, and isolating performance. Even when there is interference, Fence can double the battery life and improve the responsiveness of other applications by an order of magnitude. Fence is publicly available and has been deployed in practice for five years, protecting tens of thousands of users.
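
A hypothetical sketch of a uniform resource-limit abstraction along the renewability dimension mentioned above (the class names and policy are illustrative, not Fence's implementation): renewable resources such as CPU time or bandwidth are rate-limited, while held resources such as memory or sockets are capped.

    import time

    class RenewableLimit:
        """Rate-limits resources that replenish over time (CPU time, bandwidth)."""
        def __init__(self, units_per_second):
            self.rate = units_per_second
            self.available = units_per_second
            self.last = time.monotonic()

        def consume(self, units):
            now = time.monotonic()
            self.available = min(self.rate,
                                 self.available + (now - self.last) * self.rate)
            self.last = now
            if units > self.available:
                time.sleep((units - self.available) / self.rate)   # throttle
                self.available = 0
            else:
                self.available -= units

    class QuantityLimit:
        """Caps resources that are held rather than spent (memory, sockets, threads)."""
        def __init__(self, cap):
            self.cap, self.used = cap, 0

        def acquire(self, units):
            if self.used + units > self.cap:
                raise RuntimeError("resource limit exceeded")
            self.used += units

        def release(self, units):
            self.used -= units

    net = RenewableLimit(units_per_second=1_000_000)   # ~1 MB/s
    mem = QuantityLimit(cap=64 * 1024 * 1024)          # 64 MiB
    net.consume(200_000)
    mem.acquire(1024 * 1024)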

Available Media

Storage & Scalability

Track 2: Industry Talks

Session Chair: Theodore Ts’o, Google

Scalability and Load in Online Games

James Gwertzman, PlayFab, Inc.

The huge growth in online games over the last few years presents unique challenges in terms of scalability and backend infrastructure. Launch day can mean an immediate increase in load from a few hundred to millions of users, each of whom may be conducting dozens of transactions per minute as they purchase extra weapons or gear. Games often process many more writes than reads, some of which are low-latency and compute-intensive while others are high-latency but must be shared among hundreds of global servers. Multiplayer games require enough global player state to allow for matchmaking, while not allowing for a single point of failure. Games require both centralized servers to allow for frequent updates and decentralized ones to allow for low-latency play worldwide. And to prevent cheating, authoritative logic needs to stay server-side. With $24 billion in revenue in 2014 (and projected to grow to 4x that by 2020), the online game industry is in need of all the creative solutions that researchers can offer. This talk will outline specific high-performance computing challenges faced by games and the pros and cons of the current solutions being used.

Available Media

Erasure Code Foundations

W. David Schwaderer

Erasure Code storage applications (RAID 6, Object Storage, Disbursed Storage, etc.) are all the rage, and deservedly so. They have intrinsic, engineering beauty and elegance that merit front-row seats in deep, advanced-technology discussions. But mastering Erasure Code principles can quickly prove challenging, if not impossible, because Erasure Coding's simple principles are typically steeped in academic obfuscation. This has historically presented impenetrable obstacles to uncounted intrepid, serious, and competent engineers - maybe even you. Luckily, that's totally unnecessary.

This significantly updated presentation's goal is to arm aspiring, inquisitive engineers with Erasure Code foundational insights, intuition, and fundamental understandings that enable them to totally dominate Erasure Code discussions, both on their home court and on their own terms.


Experiences with Scaling Blockchain-based Data Stores

Muneeb Ali, Onename

In the past few years, cryptocurrency blockchains (like Bitcoin and Namecoin) have seen significant adoption, along with the promise of using such blockchains as general-purpose databases and/or key-value stores. Cryptocurrency blockchains provide a zero-trust infrastructure, where users can securely store and retrieve information while providing security guarantees that only the owner of a particular private key can write/modify the data. In theory, many decentralized services/applications can be built using cryptocurrency blockchains as key-value stores. However, the area is relatively new and rapidly evolving, with little production experience/data available to guide design tradeoffs. In this talk, we describe our experience of operating a large real-world deployment of a decentralized naming service, called Blockchain Name System, built on top of a cryptocurrency blockchain (Namecoin). We present the various challenges that we had to overcome while registering/updating over 30,000 users on the blockchain and discuss how our experience informed the design of a new blockchain-based key-value store, called Blockstore. All of our code is available as open source at http://github.com/blockstack.

Available Media

It's Not Your Father’s Scale-out Storage Architecture!

Sandeep Uttamchandani, VMware

Compute, Network, Memory, and Storage hardware are undergoing a disruptive transformation within the data center. In addition to these hardware shifts, the next generation of applications is non-POSIX, with different consistency, performance, and scaling tradeoffs. This talk explores the implications of these shifts in the context of designing the next-generation scale-out architecture for enterprise storage.

In this talk, we focus on ten fundamental building blocks of a scale-out storage architecture, and analyze the evolution of implementation design patterns as a function of hardware and application evolution. The analysis is based on examples from different generations of distributed storage and database architectures, namely Network File-systems, Shared-disk HPC file-systems, Shared Nothing solutions (general-purpose and Big Data specific), Cloud Storage systems, NoSQL data management, and the recent emergence of in-memory storage solutions.

6:30 pm–8:30 pm Wednesday

Conference Reception

Mingle with fellow attendees in the Terra Courtyard for the Conference Reception. Enjoy dinner, drinks, and the summer evening by the pool.

 

Thursday, July 9, 2015

7:30 am–8:30 am Thursday

Continental Breakfast

8:30 am–10:35 am Thursday

File Systems & Flash

Track 1: Refereed Papers

Session Chair: Haryadi Gunawi, University of Chicago

Request-Oriented Durable Write Caching for Application Performance

Sangwook Kim, Sungkyunkwan University; Hwanju Kim, University of Cambridge; Sang-Hoon Kim, Korea Advanced Institute of Science and Technology (KAIST); Joonwon Lee and Jinkyu Jeong, Sungkyunkwan University

Non-volatile write cache (NVWC) can help to improve the performance of I/O-intensive tasks, especially write-dominated tasks. The benefit of NVWC, however, cannot be fully exploited if an admission policy blindly caches all writes without differentiating the criticality of each write in terms of application performance. We propose a request-oriented admission policy, which caches only writes awaited in the context of request execution. To accurately detect such writes, a critical process, which is involved in handling requests, is identified by application-level hints. Then, we devise criticality inheritance protocols in order to handle process and I/O dependencies to a critical process. The proposed scheme is implemented on the Linux kernel and is evaluated with PostgreSQL relational database and Redis NoSQL store. The evaluation results show that our scheme outperforms the policy that blindly caches all writes by up to 2.2× while reducing write traffic to NVWC by up to 87%.
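
A minimal sketch of a request-oriented admission decision (illustrative only, not the paper's kernel mechanism): a write is admitted to the durable cache only if the issuing process is handling a request, or is doing work on behalf of one via criticality inheritance.

    # Hypothetical admission policy for a non-volatile write cache (NVWC).

    critical_pids = set()          # processes identified by application hints
    inherited = {}                 # pid -> pid it is doing work for (dependency)

    def is_request_critical(pid):
        # Follow criticality inheritance: a helper working for a critical
        # process is treated as critical too.
        while pid in inherited:
            pid = inherited[pid]
        return pid in critical_pids

    def admit_to_nvwc(pid):
        """Return True if this write should go to the non-volatile write cache."""
        return is_request_critical(pid)

    critical_pids.add(1001)        # e.g. the database backend serving requests
    inherited[2002] = 1001         # e.g. a background flusher acting for pid 1001
    print(admit_to_nvwc(2002))     # True  - awaited in request context
    print(admit_to_nvwc(3003))     # False - background write bypasses the cache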

Available Media

NVMKV: A Scalable, Lightweight, FTL-aware Key-Value Store

Leonardo Marmol, Florida International University; Swaminathan Sundararaman and Nisha Talagala, SanDisk; Raju Rangaswami, Florida International University

Key-value stores are ubiquitous in high performance data-intensive, scale out, and NoSQL environments. Many KV stores use flash devices for meeting their performance needs. However, by using flash as a simple block device, these KV stores are unable to fully leverage the powerful capabilities that exist within Flash Translation Layers (FTLs). NVMKV is a lightweight KV store that leverages native FTL capabilities such as sparse addressing, dynamic mapping, transactional persistence, and support for high-levels of lock free parallelism. Our evaluation of NVMKV demonstrates that it provides scalable, high-performance, and ACID compliant KV operations at close to raw device speeds.
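
The hash-to-sparse-address idea can be sketched as follows (illustrative only, not NVMKV's code): the key hashes to a slot in a very large, sparse logical address space and the value is written there, so the FTL's own mapping does the indexing; a Python dict stands in for the sparsely addressed device.

    import hashlib

    SPARSE_SLOTS = 2 ** 48          # huge virtual address space, mostly unused
    device = {}                     # stand-in for a sparsely addressed flash device

    def slot_for(key, probe=0):
        h = hashlib.sha256(key + probe.to_bytes(2, "big")).digest()
        return int.from_bytes(h[:8], "big") % SPARSE_SLOTS

    def put(key, value):
        # Linear probing over hash slots resolves the (rare) collisions.
        probe = 0
        while True:
            s = slot_for(key, probe)
            if s not in device or device[s][0] == key:
                device[s] = (key, value)
                return
            probe += 1

    def get(key):
        probe = 0
        while True:
            s = slot_for(key, probe)
            if s not in device:
                return None
            if device[s][0] == key:
                return device[s][1]
            probe += 1

    put(b"user:42", b"alice")
    print(get(b"user:42"))   # b'alice'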

Available Media

Lightweight Application-Level Crash Consistency on Transactional Flash Storage

Changwoo Min, Georgia Institute of Technology; Woon-Hak Kang, Sungkyunkwan University; Taesoo Kim, Georgia Institute of Technology; Sang-Won Lee and Young Ik Eom, Sungkyunkwan University

Applications implement their own update protocols to ensure consistency of data on the file system. However, since current file systems provide only a preliminary ordering guarantee, notably fsync(), these update protocols become complex, slow, and error-prone.

We present a new file system, CFS, that supports a native interface for applications to maintain crash consistency of their data. Using CFS, applications can achieve crash consistency of data by declaring code regions that must operate atomically. By utilizing transactional flash storage (SSD/X-FTL), CFS implements a lightweight mechanism for crash consistency. Without using any heavyweight mechanisms based on redundant writes and ordering, CFS can atomically write multiple data pages and their relevant metadata to storage.

We made three technical contributions to develop a crash consistency interface with SSD/X-FTL in CFS: selective atomic propagation of dirty pages, in-memory metadata logging, and delayed deallocation. Our evaluation of five real-world applications shows that CFS-based applications significantly outperform ordering versions: 2–5x faster by reducing disk writes 1.9–4.1x and disk cache flushing 1.1–17.6x. Importantly, our porting effort is minimal: CFS requires 317 lines of modifications from 3.5 million lines of ported applications.
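
The programming model, declaring a region whose writes must apply atomically, can be sketched with a toy interface (interface shape only; CFS gets its atomicity and performance from transactional flash, whereas this toy version merely stages writes and applies them together):

    import contextlib

    class ToyAtomicFS:
        def __init__(self):
            self.files = {}

        @contextlib.contextmanager
        def atomic(self):
            staged = {}
            try:
                yield staged               # the application buffers writes here
            except Exception:
                raise                      # nothing is applied: all-or-nothing
            self.files.update(staged)      # "commit" every staged write at once

    fs = ToyAtomicFS()
    with fs.atomic() as tx:
        tx["wal.log"] = b"begin;insert;commit"
        tx["table.db"] = b"row data"
    print(sorted(fs.files))    # both files appear, or neither would have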

Available Media

WALDIO: Eliminating the Filesystem Journaling in Resolving the Journaling of Journal Anomaly

Wongun Lee, Keonwoo Lee, and Hankeun Son, Hanyang University; Wook-Hee Kim and Beomseok Nam, Ulsan National Institute of Science and Technology; Youjip Won, Hanyang University

This work is dedicated to resolving the Journaling of Journal Anomaly in the Android IO stack. We orchestrate SQLite and the EXT4 filesystem so that SQLite’s file-backed journaling activity can dispense with the expensive filesystem intervention, the journaling, without compromising file integrity under unexpected filesystem failure. In storing the logs, we exploit direct IO to suppress filesystem interference. This work consists of three key ingredients: (i) Preallocation with Explicit Journaling, (ii) Header Embedding, and (iii) Group Synchronization. Preallocation with Explicit Journaling eliminates the filesystem journaling while properly protecting the file metadata against an unexpected system crash. We redesign the SQLite B-tree structure with Header Embedding to make it direct-IO compatible and block-IO friendly. With Group Synchronization, we minimize the synchronization overhead of direct IO and make the SQLite operation NAND flash friendly. Combining the three technical ingredients, we develop a new journal mode in SQLite, WALDIO. We implement it on a commercially available smartphone. WALDIO mode achieves 5.1x performance (inserts/sec) against WAL mode, which is the fastest journaling mode in SQLite. It yields 2.7x performance (inserts/sec) against LS-MVBT, the fastest SQLite journaling mode known to the public. WALDIO mode achieves 7.4x performance (inserts/sec) against WAL mode when it is relieved from the overhead of explicitly synchronizing individual log-commit operations. WALDIO mode reduces the IO volume to 1/6 compared with WAL mode.

Available Media

SpanFS: A Scalable File System on Fast Storage Devices

Junbin Kang, Benlong Zhang, Tianyu Wo, Weiren Yu, Lian Du, Shuai Ma, and Jinpeng Huai, Beihang University

Most recent storage devices, such as NAND flash-based solid state drives (SSDs), provide low access latency and high degree of parallelism. However, conventional file systems, which are designed for slow hard disk drives, often encounter severe scalability bottlenecks in exploiting the advances of these fast storage devices on manycore architectures. To scale file systems to many cores, we propose SpanFS, a novel file system which consists of a collection of micro file system services called domains. SpanFS distributes files and directories among the domains, provides a global file system view on top of the domains and maintains consistency in case of system crashes.

SpanFS is implemented based on the Ext4 file system. Experimental results evaluating SpanFS against Ext4 on a modern PCI-E SSD show that SpanFS scales much better than Ext4 on a 32-core machine. In micro-benchmarks SpanFS outperforms Ext4 by up to 1226%. In application-level benchmarks SpanFS improves the performance by up to 73% relative to Ext4.

Available Media

Clusters & Containers

Track 2: Industry Talks

Session Chair: Edouard Bugnion, École Polytechnique Fédérale de Lausanne (EPFL)

Hypervisor-based Memory Introspection at the Next Level: User-Mode Memory Introspection and Protection of Live VMs

Andrei Vlad Lutas, Bitdefender

We are living in an era when advanced malware and APTs try day by day to steal our money, get away with our confidential data, or allow unknown foreign state-sponsored entities to take full control over our systems. With the growing ineffectiveness of traditional anti-malware solutions, it has become more than obvious that the industry needs to employ game-changing technologies: we need to take security to the next level. While support for hardware virtualization is becoming generally available on a large variety of platforms, security software taking advantage of it still needs to evolve to be ready for wide-scale adoption. While kernel memory introspection, capable of providing rootkit protection, is well known in academia, we have taken the idea beyond the current state of the art, providing synchronous, real-time protection for live VMs against a wide range of threats. We also provide advanced protection for user-mode processes, while running our solution below the OS, securely isolated against kernel-mode attacks. Among others, our approach features stack and heap execution prevention, detours prevention, and code-injection prevention inside protected processes. I will talk about the challenges we faced to get there, some of the key results we obtained, the remaining roadblocks, and finally how I see the next few years.

Available Media

Terminal.com: Full System Linux Containers with Bare-metal Speed

Dr. Varun Ganapathi, Terminal.com

Terminal is a new kind of Linux Cloud. Think of it as a complete software defined datacenter delivering full system containers that can do anything your local machine can do.

Terminal is fast. It allows anyone to provision full Linux machines of almost any size in under 5 seconds and resize them at will without rebooting. Terminal is scalable. It supports tens of thousands of users in the public cloud and many installations in private clouds. Terminal is resilient. It can create RAM-perfect snapshots on every file system delta and store those snapshots in a geographically diverse manner.

Terminal features a software defined distributed virtual file system, networking layer, and transparent migration across physical hosts.

This talk will cover some of the design considerations that went into building Terminal and some of the use cases. There will be a short demonstration of how Terminal works as well.


In a World of Ephemeral Containers, How Do We Keep Track of Things?

Brian Dorsey, Google

Emerging patterns for keeping state in clusters of containers: a review of the challenges, solutions, and tradeoffs involved in keeping track of things when running a cluster of ephemeral containers. We will cover both configuration management and data storage (databases, etc.), surveying the different ways people are dealing with these challenges on the Google Cloud Platform. We will focus on systems using Kubernetes, an open-source cluster manager and scheduler that simplifies deploying and managing applications that run large numbers of containers. These patterns are applicable in a wide range of situations, including when using PaaS or VMs directly.

  • Read more about In a World of Ephemeral Containers, How Do We Keep Track of Things?
10:35 am–11:00 am Thursday

Break with Refreshments

11:00 am–12:40 pm Thursday

Memory

Track 1: Refereed Papers

Session Chair: Keith Adams, Facebook AI Research

Shoal: Smart Allocation and Replication of Memory For Parallel Programs

Stefan Kaestle, Reto Achermann, and Timothy Roscoe, ETH Zürich; Tim Harris, Oracle Labs, Cambridge

Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program’s access patterns. However, sub-optimal allocation can significantly impact performance of parallel programs.

We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance.
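
As a rough illustration of this annotation idea (not Shoal's actual API; the shoal_array helper and hint names below are hypothetical), an access-pattern hint attached at allocation time can drive the choice between replication, distribution, and partitioning:

    NUM_NODES = 4  # assumed number of NUMA domains

    def shoal_array(size, hint):
        """Pick a placement strategy from a declared access pattern (hypothetical API)."""
        if hint == "read_mostly_shared":      # read often by every thread
            return {"placement": "replicate", "copies": NUM_NODES}
        if hint == "streamed":                # scanned sequentially, bandwidth-bound
            return {"placement": "distribute", "stripe": size // NUM_NODES}
        if hint == "partitioned_by_thread":   # each thread touches its own slice
            return {"placement": "partition", "chunk": size // NUM_NODES}
        return {"placement": "first_touch"}   # fall back to the default OS policy

    # Example: a vertex-property array read by all threads is replicated per node
    # so that no thread has to cross the interconnect to reach it.
    print(shoal_array(1 << 20, "read_mostly_shared"))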

We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case-studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program and (ii) a manually-annotated version of the PARSEC Streamcluster benchmark.

Available Media

Thread and Memory Placement on NUMA Systems: Asymmetry Matters

Baptiste Lepers, Simon Fraser University; Vivien Quéma, Grenoble INP; Alexandra Fedorova, Simon Fraser University
Awarded Best Paper!

It is well known that the placement of threads and memory plays a crucial role for performance on NUMA (Non-Uniform Memory-Access) systems. The conventional wisdom is to place threads close to their memory, to collocate on the same node threads that share data, and to segregate on different nodes threads that compete for memory bandwidth or cache resources. While many studies addressed thread and data placement, none of them considered a crucial property of modern NUMA systems that is likely to prevail in the future: asymmetric interconnect. When the nodes are connected by links of different bandwidth, we must consider not only whether the threads and data are placed on the same or different nodes, but how these nodes are connected.

We study the effects of asymmetry on a widely available x86 system and find that performance can vary by more than 2x under the same distribution of thread and data across the nodes but different inter-node connectivity. The key new insight is that the best-performing connectivity is the one with the greatest total bandwidth as opposed to the smallest number of hops. Based on our findings we designed and implemented a dynamic thread and memory placement algorithm in Linux that delivers similar or better performance than the best static placement and up to 218% better performance than when the placement is chosen randomly.

Available Media

Latency-Tolerant Software Distributed Shared Memory

Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin, University of Washington
Awarded Best Paper!

We present Grappa, a modern take on software distributed shared memory (DSM) for in-memory data-intensive applications. Grappa enables users to program a cluster as if it were a single, large, non-uniform memory access (NUMA) machine. Performance scales up even for applications that have poor locality and input-dependent load distribution. Grappa addresses deficiencies of previous DSM systems by exploiting application parallelism, trading off latency for throughput. We evaluate Grappa with an in-memory MapReduce framework (10x faster than Spark); a vertex-centric framework inspired by GraphLab (1.33x faster than native GraphLab); and a relational query execution engine (12.5x faster than Shark). All these frameworks required only 60-690 lines of Grappa code.

Available Media

NightWatch: Integrating Lightweight and Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

Rentong Guo, Xiaofei Liao, and Hai Jin, Huazhong University of Science and Technology; Jianhui Yue, Auburn University; Guang Tan, Chinese Academy of Sciences

Cache pollution, by which weak-locality data unduly replaces strong-locality data, may notably degrade application performance on a shared-cache multicore machine. This paper presents NightWatch, a cache management subsystem that provides general, transparent, and low-overhead pollution control to applications. NightWatch is based on the observation that data within the same memory chunk, or chunks within the same allocation context, often share similar locality properties. NightWatch embodies this observation by monitoring current cache locality online to predict future behavior and restricting potential cache polluters proactively. We have integrated NightWatch into two popular allocators, tcmalloc and ptmalloc2. Experiments with SPEC CPU2006 show that NightWatch improves application performance by up to 45% (18% on average), with an average monitoring overhead of 0.57% (up to 3.02%).

Available Media

Big Data

Track 2: Industry Talks

Session Chair: Fred Douglis, EMC

Mambo: Running Analytics on Enterprise Storage

Gokul Soundararajan and Jingxin Feng, NetApp; Xing Lin, University of Utah

Big data is broadly defined as large datasets, often in unstructured formats, that cannot be processed by traditional database systems. Businesses have turned to big data analytics tools such as Apache Hadoop to help store and analyze these datasets. Apache Hadoop is a framework that enables the distributed processing of large and varied datasets across clusters of computers using simple programming models. The Hadoop Distributed File System (HDFS) provides high-throughput access to application data. Hadoop offers integration points to enhance specific workloads, storage efficiency, and data management.

Hadoop has been used primarily on incoming, external data; however, there is a growing need to use Hadoop on existing, internal data, typically stored in network-attached storage (NAS). Typically, this requires setting up another storage silo to host HDFS and then running the Hadoop analytics on that storage. This, in turn, results in additional data management, more inefficiency, and the additional cost of moving the data. In this talk, we will describe NFS Connector for Hadoop, an open-source project that allows Hadoop to run natively over NFS, without needing to move the data or create a separate data silo. We will describe the underlying architecture, use cases, and integration with Hadoop.

The work was published earlier at FAST '13 in a paper titled "MixApart: Decoupled Analytics for Shared Storage Systems." This talk will describe the journey from the original research to its current form and will highlight the customer use cases that led to its productization. We will use the forum to gather additional feedback and to glean additional use cases.

Available Media
  • Read more about Mambo: Running Analytics on Enterprise Storage

A Reference Architecture for Securing Big Data Infrastructures

Abhay Raman and Shezan Chagani, Ernst & Young LP

Enterprise information repositories contain sensitive, and in some cases personally identifiable, information that allows organizations to understand the lifetime value of their customers, enable journey mapping to deliver better and more targeted products and services, and improve their internal operations by reducing operating costs and driving profitability. Analyzing such large quantities of disparate data is possible today by leveraging Hadoop and other big data technologies. These large data sets are a very attractive target for hackers and are worth a lot of money on the cyber black market. Recent data breaches, such as those at Anthem, Sony, and others, drive the need to secure these infrastructures consistently and effectively.

In the form of a case study, we introduce the security requirements for such ecosystems, elaborate on the inherent security gaps and challenges in these environments, and propose a reference architecture for securing big data (Hadoop) infrastructures in financial institutions. We will explore in detail the high-level requirements that drive secure Hadoop implementations for the enterprise, including privacy, authentication, authorization, marking, key management, multi-tenancy, and monitoring, as well as the technologies currently available to the end user. Furthermore, the case study will elaborate on how these requirements, enabled by the reference architecture, are applied to actual system implementations.

Available Media
  • Read more about A Reference Architecture for Securing Big Data Infrastructures

Power Your Big Data Analytics With Pivotal Greenplum Database

Kuien Liu and Yandong Yao, Pivotal Software, Inc.

Big data and analytics have been receiving attention for several years, but many long-standing customers are unclear about how to move from traditional databases to modern data analytics. These customers worry about practical issues: a lack of available skills for deploying modern infrastructure that remains compatible with existing systems, difficulty finding approaches to processing large-scale data, and other concerns that make data analytics jobs slow and sometimes painful. Most of today's general-purpose relational databases (e.g., Oracle, Microsoft SQL Server) originated as OLTP systems. Their shared-disk or shared-everything architectures are optimized for high transaction rates at the expense of analytical query performance and concurrency. In contrast, Pivotal offers the Greenplum Database (GPDB), an extensible relational database platform that uses a shared-nothing, massively parallel processing (MPP) architecture built atop commodity hardware to vastly accelerate the analytical processing of big data. Recent reports from Gartner scored Pivotal GPDB highly based on existing customer implementations and their experiences with data warehouse DBMS products. This talk will briefly introduce (1) the architecture of Pivotal GPDB, which provides automatic, high-performance parallelization of data loading and data processing, (2) GPDB's extensive and growing library of in-database analytic functions, and (3) the capability to build a comprehensive big data platform around Pivotal GPDB. I will provide examples of how data science teams can transform billions of customer records to tackle the real-world problem of identity resolution in one minute. I will also discuss our plan to make Pivotal Greenplum Database open source in the coming quarters.

  • Read more about Power Your Big Data Analytics With Pivotal Greenplum Database
12:40 pm–2:10 pm Thursday

Conference Luncheon

2:10 pm–3:50 pm Thursday

Security

Track 1: Refereed Papers

Session Chair: Xi Wang, University of Washington

Secure Deduplication of General Computations

Yang Tang and Junfeng Yang, Columbia University

The world’s fast-growing data has become highly concentrated on enterprise or cloud storage servers. Data deduplication reduces redundancy in this data, saving storage and simplifying management. While existing systems can deduplicate computations on this data by memoizing and reusing computation results, they are insecure, not general, or slow.

This paper presents UNIC, a system that securely deduplicates general computations. It exports a cache service that allows applications running on behalf of mutually distrusting users on local or remote hosts to memoize and reuse computation results. Key in UNIC are three new ideas. First, through a novel use of code attestation, UNIC achieves both integrity and secrecy. Second, it provides a simple yet expressive API that enables applications to deduplicate their own rich computations. This design is much more general and flexible than existing systems that can deduplicate only specific types of computations. Third, UNIC explores a cross-layer design that allows the underlying storage system to expose data deduplication information to the applications for better performance.
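
For intuition, a toy sketch of content-keyed memoization is shown below: the cache key binds the computation's code to its inputs, so identical computations share one result. UNIC's code attestation, secrecy guarantees, and cross-layer design are not modeled here, and the helper names are hypothetical.

    import hashlib, inspect, json

    _cache = {}

    def memoize(func, *args):
        # The key binds the computation's code to its inputs, so two users running
        # the identical computation share a single cached result.
        key_material = inspect.getsource(func) + json.dumps(args, sort_keys=True)
        key = hashlib.sha256(key_material.encode()).hexdigest()
        if key not in _cache:
            _cache[key] = func(*args)   # computed once
        return _cache[key]              # reused on identical code + inputs

    def word_count(text):
        return len(text.split())

    print(memoize(word_count, "the quick brown fox"))   # computed
    print(memoize(word_count, "the quick brown fox"))   # served from the cache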

Evaluation of UNIC on four popular open-source applications shows that UNIC is easy to use, fast, and with little storage overhead.

Available Media

Lamassu: Storage-Efficient Host-Side Encryption

Peter Shah and Won So, NetApp Inc.

Many storage customers are adopting encryption solutions to protect critical data. Most existing encryption solutions sit in, or near, the application that is the source of critical data, upstream of the primary storage system. Placing encryption near the source ensures that data remains encrypted throughout the storage stack, making it easier to use untrusted storage, such as public clouds.

Unfortunately, such a strategy also prevents downstream storage systems from applying content-based features, such as deduplication, to the data. In this paper, we present Lamassu, an encryption solution that uses block-oriented, host-based, convergent encryption to secure data, while preserving storage-based data deduplication. Unlike past convergent encryption systems, which typically store encryption metadata in a dedicated store, our system transparently inserts its metadata into each file’s data stream. This allows us to add Lamassu to an application stack without modifying either the client application or the storage controller.

In this paper, we lay out the architecture and security model used in our system, and present a new model for maintaining metadata consistency and data integrity in a convergent encryption environment. We also evaluate its storage efficiency and I/O performance by using a variety of microbenchmarks, showing that Lamassu provides excellent storage efficiency, while achieving I/O throughput on par with similar conventional encryption systems.
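
A minimal, standard-library-only sketch of the convergent-encryption idea underlying systems like Lamassu: the key is derived from the block's own contents and encryption is deterministic, so identical plaintext blocks produce identical ciphertext and still deduplicate downstream. The toy XOR keystream below is for illustration only and is not Lamassu's cipher or key-management scheme.

    import hashlib

    def keystream(key, length):
        # Toy deterministic keystream (illustration only, not Lamassu's cipher).
        out, counter = b"", 0
        while len(out) < length:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:length]

    def convergent_encrypt(block):
        key = hashlib.sha256(block).digest()   # key depends only on the content
        cipher = bytes(a ^ b for a, b in zip(block, keystream(key, len(block))))
        return key, cipher

    k1, c1 = convergent_encrypt(b"same 4KB block contents")
    k2, c2 = convergent_encrypt(b"same 4KB block contents")
    assert c1 == c2   # identical blocks encrypt identically, so they still deduplicate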

Available Media

SecPod: a Framework for Virtualization-based Security Systems

Xiaoguang Wang, Xi'an Jiaotong University and Florida State University; Yue Chen and Zhi Wang, Florida State University; Yong Qi, Xi'an Jiaotong University; Yajin Zhou, Qihoo 360

The OS kernel is critical to the security of a computer system. Many systems have been proposed to improve its security. A fundamental weakness of those systems is that page tables, the data structures that control memory protection, are not isolated from the vulnerable kernel and are thus subject to tampering. To address that, researchers have relied on virtualization for reliable kernel memory protection. Unfortunately, such memory protection requires monitoring every update to the guest's page tables. This fundamentally conflicts with recent advances in hardware virtualization support. In this paper, we propose SecPod, an extensible framework for virtualization-based security systems that can provide both strong isolation and compatibility with modern hardware. SecPod has two key techniques: paging delegation delegates and audits the kernel's paging operations to a secure space; execution trapping intercepts the (compromised) kernel's attempts to subvert SecPod by misusing privileged instructions. We have implemented a prototype of SecPod based on KVM. Our experiments show that SecPod is both effective and efficient.

Available Media

Between Mutual Trust and Mutual Distrust: Practical Fine-grained Privilege Separation in Multithreaded Applications

Jun Wang, The Pennsylvania State University; Xi Xiong, Facebook Inc. and The Pennsylvania State University; Peng Liu, The Pennsylvania State University

Threads in a multithreaded process share the same address space and thus are implicitly assumed to be mutually trusted. However, one (compromised) thread attacking another is a real world threat. It remains challenging to achieve privilege separation for multithreaded applications so that the compromise or malfunction of one thread does not lead to data contamination or data leakage of other threads.

The Arbiter system proposed in this paper explores the solution space. In particular, we find that page table protection bits can be leveraged to do efficient reference monitoring if data objects with the same accessibility stay in the same page. We design and implement Arbiter which consists of a new memory allocation mechanism, a policy manager, and a set of APIs. Programmers specify security policy through annotating the source code. We apply Arbiter to three applications, an in-memory key/value store, a web server, and a userspace file system, and show how they can benefit from Arbiter in terms of security. Our experiments on the three applications show that Arbiter reduces application throughput by less than 10% and increases CPU utilization by 1.37-1.55x.

Available Media

At Large Scale

Track 2: Best of the Rest

Session Chair: Shan Lu, University of Chicago

The Design and Implementation of Open vSwitch

Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Joe Stringer, and Pravin Shelar, VMware, Inc.; Keith Amidon, Awake Networks; Martín Casado, VMware, Inc.
Best Paper at NSDI '15: Link to Paper

We describe the design and implementation of Open vSwitch, a multi-layer, open source virtual switch for all major hypervisor platforms. Open vSwitch was designed de novo for networking in virtual environments, resulting in major design departures from traditional software switching architectures. We detail the advanced flow classification and caching techniques that Open vSwitch uses to optimize its operations and conserve hypervisor resources. We evaluate Open vSwitch performance, drawing from our deployment experiences over the past seven years of using and improving Open vSwitch.

Available Media

Designing Distributed Systems Using Approximate Synchrony in Data Center Networks

Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy, University of Washington

Best Paper at NSDI '15: Link to Paper

Distributed systems are traditionally designed independently from the underlying network, making worst-case assumptions (e.g., complete asynchrony) about its behavior. However, many of today’s distributed applications are deployed in data centers, where the network is more reliable, predictable, and extensible. In these environments, it is possible to co-design distributed systems with their network layer, and doing so can offer substantial benefits.

This paper explores network-level mechanisms for providing Mostly-Ordered Multicast (MOM): a best-effort ordering property for concurrent multicast operations. Using this primitive, we design Speculative Paxos, a state machine replication protocol that relies on the network to order requests in the normal case. This approach leads to substantial performance benefits: under realistic data center conditions, Speculative Paxos can provide 40% lower latency and 2.6x higher throughput than the standard Paxos protocol. It offers lower latency than a latency-optimized protocol (Fast Paxos) with the same throughput as a throughput-optimized protocol (batching).

Queues Don’t Matter When You Can JUMP Them!

Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft, University of Cambridge
Best Paper at NSDI '15: Link to Paper

QJUMP is a simple and immediately deployable approach to controlling network interference in datacenter networks. Network interference occurs when congestion from throughput-intensive applications causes queueing that delays traffic from latency-sensitive applications. To mitigate network interference, QJUMP applies Internet QoS-inspired techniques to datacenter applications. Each application is assigned to a latency sensitivity level (or class). Packets from higher levels are rate-limited in the end host, but once allowed into the network can “jump-the-queue” over packets from lower levels. In settings with known node counts and link speeds, QJUMP can support service levels ranging from strictly bounded latency (but with low rate) through to line-rate throughput (but with high latency variance).

We have implemented QJUMP as a Linux Traffic Control module. We show that QJUMP achieves bounded latency and reduces in-network interference by up to 300x, outperforming Ethernet Flow Control (802.3x), ECN (WRED) and DCTCP. We also show that QJUMP improves average flow completion times, performing close to or better than DCTCP and pFabric.
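
As a rough sketch of the end-host side of this idea, each latency-sensitivity class can be policed by a token bucket, with more latency-sensitive classes given smaller rate budgets. The rates and burst sizes below are invented; QJUMP itself is implemented as a Linux Traffic Control module, not in Python.

    import time

    class TokenBucket:
        def __init__(self, rate_bytes_per_s, burst_bytes):
            self.rate, self.burst = rate_bytes_per_s, burst_bytes
            self.tokens, self.last = burst_bytes, time.monotonic()

        def allow(self, packet_bytes):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_bytes:
                self.tokens -= packet_bytes
                return True    # admit now; in the network this class may jump the queue
            return False       # hold back: the class is over its rate budget

    # Higher (more latency-sensitive) classes get smaller rate budgets (values invented).
    classes = {7: TokenBucket(1_000_000, 1500), 0: TokenBucket(1_000_000_000, 64_000)}
    print(classes[7].allow(1500), classes[0].allow(1500))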

Realtime High-Speed Network Traffic Monitoring Using ntopng

Luca Deri, ntop / IIT-CNR; Maurizio Martinelli, IIT-CNR; Alfredo Cardigliano, ntop
Best Paper at LISA14: Link to Paper

Monitoring network traffic has become increasingly challenging in terms of number of hosts, protocol proliferation and probe placement topologies. Virtualised environments and cloud services have shifted the focus from dedicated hardware monitoring devices to virtual machine based, software traffic monitoring applications. This paper covers the design and implementation of ntopng, an open-source traffic monitoring application designed for high-speed networks. ntopng's key features are real-time analytics for large networks and the ability to characterise application protocols and user traffic behaviour. ntopng has been extensively validated in various monitoring environments ranging from small networks to .it ccTLD traffic analysis.

3:50 pm–4:15 pm Thursday

Break with Refreshments

4:15 pm–5:05 pm Thursday

Graph Processing

Track 1: Refereed Papers

Session Chair: Keith Adams, Facebook AI Research

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning

Xiaowei Zhu, Wentao Han, and Wenguang Chen, Tsinghua University

In this paper, we present GridGraph, a system for processing large-scale graphs on a single machine. GridGraph breaks graphs into 1D-partitioned vertex chunks and 2D-partitioned edge blocks using a fine-grained first-level partitioning in preprocessing. A coarse-grained second-level partitioning is applied at runtime. Through a novel dual sliding windows method, GridGraph can stream the edges and apply on-the-fly vertex updates, thus reducing the I/O required for computation. The partitioning of edges also enables selective scheduling, so that some of the blocks can be skipped to reduce unnecessary I/O. This is very effective when the active vertex set shrinks with convergence.
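
A toy sketch of the 2-D edge partitioning and block streaming, assuming a small in-memory edge list; real GridGraph streams blocks from disk, and the dual sliding windows and selective scheduling are omitted here.

    P = 2                                      # vertex chunks per dimension
    num_vertices = 4
    chunk = num_vertices // P
    edges = [(0, 3), (1, 2), (2, 0), (3, 1)]   # (src, dst) pairs

    # Partitioning: assign each edge to a (source chunk, destination chunk) block.
    blocks = {}
    for src, dst in edges:
        blocks.setdefault((src // chunk, dst // chunk), []).append((src, dst))

    # Example computation (in-degree): stream blocks column by column so that all
    # updates to one destination chunk are applied before moving to the next.
    indegree = [0] * num_vertices
    for j in range(P):                         # destination chunk
        for i in range(P):                     # source chunk
            for src, dst in blocks.get((i, j), []):
                indegree[dst] += 1
    print(indegree)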

Our evaluation results show that GridGraph scales seamlessly with memory capacity and disk bandwidth, and outperforms state-of-the-art out-of-core systems, including GraphChi and X-Stream. Furthermore, we show that the performance of GridGraph is even competitive with distributed systems, and it also provides significant cost efficiency in cloud environments.

Available Media

GraphQ: Graph Query Processing with Abstraction Refinement—Scalable and Programmable Analytics over Very Large Graphs on a Single PC

Kai Wang and Guoqing Xu, University of California, Irvine; Zhendong Su, University of California, Davis; Yu David Liu, SUNY at Binghamton

This paper introduces GraphQ, a scalable querying framework for very large graphs. GraphQ is built on a key insight that many interesting graph properties—such as finding cliques of a certain size, or finding vertices with a certain page rank—can be effectively computed by exploring only a small fraction of the graph, and traversing the complete graph is an overkill. The centerpiece of our framework is the novel idea of abstraction refinement, where the very large graph is represented as multiple levels of abstractions, and a query is processed through iterative refinement across graph abstraction levels. As a result, GraphQ enjoys several distinctive traits unseen in existing graph processing systems: query processing is naturally budget-aware, friendly for out-of-core processing when “Big Graphs” cannot entirely fit into memory, and endowed with strong correctness properties on query answers. With GraphQ, a wide range of complex analytical queries over very large graphs can be answered with resources affordable to a single PC, which complies with the recent trend advocating single-machine-based Big Data processing.

Experiments show GraphQ can answer queries in graphs 4-6 times bigger than the memory capacity, only in several seconds to minutes. In contrast, GraphChi, a state-of-the-art graph processing system, takes hours to days to compute a whole-graph solution. An additional comparison with a modified version of GraphChi that terminates immediately when a query is answered shows that GraphQ is on average 1.6–13.4x faster due to its ability to process partial graphs.

Available Media

Networking

Track 2: Industry Talks

Session Chair: Theodore Ts’o, Google

Experience with B4: Google's Private SDN Backbone

Subhasree Mandal, Google

In this talk, we will describe B4, the SDN-based network that connects Google's data centers to each other, and some of the lessons we have learned over several years of operational experience. Background: B4 is one of Google's two backbone networks. Originally designed as a low-cost backend backbone, specifically for carrying high-volume copy traffic, B4 has grown into critical infrastructure for Google. It currently carries more traffic than Google's public-facing backbone. B4 is built with a home-grown software stack and home-grown switch hardware based on merchant-silicon ASICs.

Available Media
  • Read more about Experience with B4: Google's Private SDN Backbone
5:05 pm–6:00 pm Thursday

Work-in-Progress Reports (WiPs) and Crazy Ideas Session

This is an informal session for short and engaging presentations on recent unpublished results, work in progress, or other discussion topics of interest to USENIX ATC attendees. Talks do not always need to be serious, and funny talks are encouraged!

6:30 pm–8:00 pm Thursday

Poster Session and Happy Hour

Check out the cool new ideas and the latest preliminary work on display at the Poster Session and Happy Hour. Take advantage of an opportunity to mingle with colleagues who may be interested in the same area while enjoying complimentary food and drinks. The list of accepted posters is now available.

 

Friday, July 10, 2015

7:30 am–8:00 am Friday

Continental Breakfast

8:00 am–9:45 am Friday

Networking

Track 1: Refereed Papers

Session Chair: Simon Peter, The University of Texas at Austin

Accurate Latency-based Congestion Feedback for Datacenters

Changhyun Lee and Chunjong Park, Korea Advanced Institute of Science and Technology (KAIST); Keon Jang, Intel Labs; Sue Moon and Dongsu Han, Korea Advanced Institute of Science and Technology (KAIST)

The nature of congestion feedback largely governs the behavior of congestion control. In datacenter networks, where RTTs are in hundreds of microseconds, accurate feedback is crucial to achieve both high utilization and low queueing delay. Proposals for datacenter congestion control predominantly leverage ECN or even explicit in-network feedback (e.g., RCP-type feedback) to minimize the queuing delay. In this work we explore latency-based feedback as an alternative and show its advantages over ECN. Against the common belief that such implicit feedback is noisy and inaccurate, we demonstrate that latency-based implicit feedback is accurate enough to signal a single packet’s queuing delay in 10 Gbps networks.

DX enables accurate queuing delay measurements whose error falls within 1.98 and 0.53 microseconds using software-based and hardware-based latency measurements, respectively. This enables us to design a new congestion control algorithm that performs fine-grained control to adjust the congestion window just enough to achieve very low queuing delay while attaining full utilization. Our extensive evaluation shows that 1) the latency measurement accurately reflects the one-way queuing delay at the single-packet level; 2) the latency feedback can be used to perform practical and fine-grained congestion control in high-speed datacenter networks; and 3) DX outperforms DCTCP with 5.33x smaller median queueing delay at 1 Gbps and 1.57x at 10 Gbps.
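
For intuition, a sketch of delay-based window adjustment in this spirit: grow the window while the measured one-way queuing delay stays under a target, and back off in proportion to the excess otherwise. The target and constants below are illustrative assumptions, not DX's actual algorithm.

    TARGET_US = 20.0   # assumed queuing-delay target, in microseconds

    def next_cwnd(cwnd, queuing_delay_us):
        if queuing_delay_us < TARGET_US:
            return cwnd + 1.0                               # additive increase
        overshoot = (queuing_delay_us - TARGET_US) / queuing_delay_us
        return max(1.0, cwnd * (1.0 - 0.5 * overshoot))     # back off with the excess

    cwnd = 10.0
    for delay in [5, 8, 30, 60, 12]:                        # measured per-RTT delays
        cwnd = next_cwnd(cwnd, delay)
        print(f"delay={delay}us -> cwnd={cwnd:.1f}")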

Available Media

Mahimahi: Accurate Record-and-Replay for HTTP

Ravi Netravali, Anirudh Sivaraman, Somak Das, and Ameesh Goyal, MIT CSAIL; Keith Winstein, Stanford University; James Mickens, Harvard University; Hari Balakrishnan, MIT CSAIL

This paper presents Mahimahi, a framework to record traffic from HTTP-based applications, and later replay it under emulated network conditions. Mahimahi improves upon prior record-and-replay frameworks in three ways. First, it is more accurate because it carefully emulates the multi-server nature of Web applications, present in 98% of the Alexa US Top 500 Web pages. Second, it isolates its own network traffic, allowing multiple Mahimahi instances emulating different networks to run concurrently without mutual interference. And third, it is designed as a set of composable shells, providing ease-of-use and extensibility.

We evaluate Mahimahi by: (1) analyzing the performance of HTTP/1.1, SPDY, and QUIC on a corpus of 500 sites, (2) using Mahimahi to understand the reasons why these protocols are suboptimal, (3) developing Cumulus, a cloud-based browser designed to overcome these problems, using Mahimahi both to implement Cumulus by extending one of its shells, and to evaluate it, (4) using Mahimahi to evaluate HTTP multiplexing protocols on multiple performance metrics (page load time and speed index), and (5) describing how others have used Mahimahi.

Available Media

Slipstream: Automatic Interprocess Communication Optimization

Will Dietz, Joshua Cranmer, Nathan Dautenhahn, and Vikram Adve, University of Illinois at Urbana-Champaign

We present Slipstream, a userspace solution for transparently selecting efficient local transports in distributed applications written to use TCP/IP, when such applications communicate between local processes. Slipstream is easy to deploy because it is language-agnostic, automatic, and transparent. Our design in particular (1) requires no changes to the kernel or applications, (2) correctly identifies (and optimizes) pairs of communicating local endpoints, without knowledge of routing performed locally or by the network, and (3) imposes little or no overhead when optimization is not possible, including communication with parties not using our technology. Slipstream is sufficiently general that it can not only optimize traffic between local applications, but can also be used between Docker containers residing on the same host. Our results show that Slipstream significantly improves throughput and latency, 16-100% faster throughput for server applications (and 100-200% with Docker), while imposing an overhead of around 1-3% when not in use. Overall, Slipstream enables programmers to write simpler code using TCP/IP "everywhere" and yet obtain the significant benefits of faster local transports whenever available.
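
A toy version of the underlying idea, with the transport choice made explicit rather than transparent: if both endpoints of a connection live on the same host, use a faster local transport such as a Unix domain socket. The loopback check and socket path below are simplifications, not Slipstream's interception mechanism.

    import socket

    LOCAL_ADDRESSES = {"127.0.0.1", "::1"}   # simplification: loopback only

    def connect(host, port, local_socket_path="/tmp/local-transport-demo.sock"):
        # If the peer resolves to this host, use a Unix domain socket; otherwise TCP.
        if socket.gethostbyname(host) in LOCAL_ADDRESSES:
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            s.connect(local_socket_path)                  # fast local transport
        else:
            s = socket.create_connection((host, port))    # ordinary TCP/IP
        return s

    # connect("localhost", 8080) would pick the Unix socket, provided a server is
    # listening on local_socket_path; Slipstream makes this choice transparently.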

Available Media

FloSIS: A Highly Scalable Network Flow Capture System for Fast Retrieval and Storage Efficiency

Jihyung Lee, Korea Advanced Institute of Science and Technology (KAIST); Sungryoul Lee and Junghee Lee, The Attached Institute of ETRI; Yung Yi and KyoungSoo Park, Korea Advanced Institute of Science and Technology (KAIST)

Network packet capture performs essential functions in network management such as attack analysis, network troubleshooting, and performance debugging. As the network edge bandwidth exceeds 10 Gbps, the demand for scalable packet capture and retrieval is rapidly increasing. However, existing software-based packet capture systems neither provide high performance nor support flow-level indexing for fast query response. This would either prevent important packets from being stored or make it too slow to retrieve relevant flows.

In this paper, we present FloSIS, a highly scalable, software-based flow storing and indexing system. FloSIS is characterized by the following three aspects. First, it exercises full parallelism across multiple CPU cores and disks at all stages of packet processing. Second, it constructs two-stage flow-level indexes, which helps minimize expensive disk access for user queries. It also stores the packets of the same flow at a contiguous disk location, which maximizes disk read throughput. Third, we optimize storage usage by flow-level content deduplication in real time. Our evaluation shows that FloSIS on a dual octa-core CPU machine with 24 HDDs achieves 30 Gbps of zero-drop performance with real traffic, consuming only 0.25% of the space for indexing.

Available Media

Scheduling at Large Scale

Track 2: Refereed Papers

Session Chair: Liane Praza, Oracle

Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems

Andrey Goder, Alexey Spiridonov, and Yin Wang, Facebook Inc.

Data-intensive batch jobs increasingly compete for resources with customer-facing online workloads in modern data centers. Today, the two classes of workloads run on separate infrastructures using different resource managers that pursue different objectives. Batch processing systems strive for coarse-grained throughput whereas online systems must keep the latency of fine-grained end-user requests low. Better resource management would allow both batch and online workloads to share infrastructure, reducing hardware and eliminating the inefficient and error-prone chore of creating and maintaining copies of data. This paper describes Facebook’s Bistro, a scheduler that runs data-intensive batch jobs on live, customer-facing production systems without degrading the end-user experience. Bistro employs a novel hierarchical model of data and computational resources. The model enables Bistro to schedule workloads efficiently and adapt rapidly to changing configurations. At Facebook, Bistro is replacing Hadoop and custom-built batch schedulers, allowing batch jobs to share infrastructure with online workloads without harming the performance of either.
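
As a rough sketch of hierarchical resource scheduling in this spirit, a task can be required to reserve capacity at every level on the path to its data before it runs. The resource names and capacities below are made up; Bistro's actual model and scheduler are considerably richer.

    capacity = {"host1": 2, "host1/shardA": 1, "host1/shardB": 1}
    in_use = {name: 0 for name in capacity}

    def try_schedule(path):
        """path lists nested resources, outermost first, e.g. ["host1", "host1/shardA"]."""
        if all(in_use[r] < capacity[r] for r in path):
            for r in path:
                in_use[r] += 1     # reserve capacity at every level along the path
            return True
        return False               # some level is saturated; the task must wait

    print(try_schedule(["host1", "host1/shardA"]))   # True: both levels have room
    print(try_schedule(["host1", "host1/shardA"]))   # False: shardA's slot is taken
    print(try_schedule(["host1", "host1/shardB"]))   # True: host1 still has one slot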

Available Media

Rubik: Unlocking the Power of Locality and End-point Flexibility in Cloud Scale Load Balancing

Rohan Gandhi, Y. Charlie Hu and Cheng-kok Koh, Purdue University; Hongqiang (Harry) Liu and Ming Zhang, Microsoft Research

Cloud scale load balancers, such as Ananta and Duet are critical components of the data center (DC) infrastructure, and are vital to the performance of the hosted online services. In this paper, using traffic traces from a production DC, we show that prior load balancer designs incur substantial overhead in the DC network bandwidth usage, due to the intrinsic nature of traffic redirection. Moreover, in Duet, traffic redirection results in extra bandwidth consumption in the core network and breaks the full-bisection bandwidth guarantees offered by the underlying networks such as Clos and FatTree.

We present RUBIK, a load balancer that significantly lowers the DC network bandwidth usage while providing all the performance and availability benefits of Duet. RUBIK achieves its goals by applying two principles in the scale-out load balancer design—exploiting locality and applying end-point flexibility in placing the servers. We show how to jointly exploit these two principles to maximally contain the traffic load balanced to be within individual ToRs while satisfying service-specific failure domain constraints. Our evaluation using a testbed prototype and DC-scale simulation using real traffic traces shows that compared to the prior art Duet, RUBIK can reduce the bandwidth usage by over 3x and the maximum link utilization of the DC network by 4x, while providing all the performance, scalability, and availability benefits.

Available Media

Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters

Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas, Kishore Chaliparambil, Giovanni Matteo Fumarola, Solom Heddaya, Raghu Ramakrishnan, and Sarvesh Sakalanaga, Microsoft Corporation

Datacenter-scale computing for analytics workloads is increasingly common. High operational costs force heterogeneous applications to share cluster resources for achieving economy of scale. Scheduling such large and diverse workloads is inherently hard, and existing approaches tackle this in two alternative ways: 1) centralized solutions offer strict, secure enforcement of scheduling invariants (e.g., fairness, capacity) for heterogeneous applications, 2) distributed solutions offer scalable, efficient scheduling for homogeneous applications.

We argue that these solutions are complementary, and advocate a blended approach. Concretely, we propose Mercury, a hybrid resource management framework that supports the full spectrum of scheduling, from centralized to distributed. Mercury exposes a programmatic interface that allows applications to trade-off between scheduling overhead and execution guarantees. Our framework harnesses this flexibility by opportunistically utilizing resources to improve task throughput. Experimental results on production-derived workloads show gains of over 35% in task throughput. These benefits can be translated by appropriate application and framework policies into job throughput or job latency improvements. We have implemented and contributed Mercury as an extension of Apache Hadoop / YARN.

Available Media

Hawk: Hybrid Datacenter Scheduling

Pamela Delgado and Florin Dinu, École Polytechnique Fédérale de Lausanne (EPFL); Anne-Marie Kermarrec, Inria; Willy Zwaenepoel, École Polytechnique Fédérale de Lausanne (EPFL)

This paper addresses the problem of efficient scheduling of large clusters under high load and heterogeneous workloads. A heterogeneous workload typically consists of many short jobs and a small number of large jobs that consume the bulk of the cluster’s resources.

Recent work advocates distributed scheduling to overcome the limitations of centralized schedulers for large clusters with many competing jobs. Such distributed schedulers are inherently scalable, but may make poor scheduling decisions because of limited visibility into the overall resource usage in the cluster. In particular, we demonstrate that under high load, short jobs can fare poorly with such a distributed scheduler.

We propose instead a new hybrid centralized/distributed scheduler, called Hawk. In Hawk, long jobs are scheduled using a centralized scheduler, while short ones are scheduled in a fully distributed way. Moreover, a small portion of the cluster is reserved for the use of short jobs. In order to compensate for the occasional poor decisions made by the distributed scheduler, we propose a novel and efficient randomized work-stealing algorithm.
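
A minimal sketch of the hybrid split, assuming an invented runtime threshold, probe count, and reserved-partition size; the randomized work-stealing component is omitted.

    import random

    LONG_JOB_THRESHOLD_S = 90                 # invented cutoff between short and long jobs
    NODES = [f"node{i}" for i in range(20)]
    RESERVED_FOR_SHORT = set(NODES[:2])       # small partition kept free of long jobs

    def queue_length(node):
        return random.randint(0, 5)           # stand-in for a real load probe

    def schedule(est_runtime_s):
        if est_runtime_s >= LONG_JOB_THRESHOLD_S:
            # Centralized path: full cluster visibility, but avoid the reserved partition.
            candidates = [n for n in NODES if n not in RESERVED_FOR_SHORT]
        else:
            # Distributed path: probe a couple of random nodes and take the shorter queue.
            candidates = random.sample(NODES, 2)
        return min(candidates, key=queue_length)

    print(schedule(600), schedule(3))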

We evaluate Hawk using a trace-driven simulation and a prototype implementation in Spark. In particular, using a Google trace, we show that under high load, compared to the purely distributed Sparrow scheduler, Hawk improves the 50th and 90th percentile runtimes by 80% and 90% for short jobs and by 35% and 10% for long jobs, respectively. Measurements of a prototype implementation using Spark on a 100-node cluster confirm the results of the simulation.

Available Media

9:45 am–10:00 am Friday

Break with Refreshments

10:00 am–12:05 pm Friday

OS & Hardware

Track 1: Refereed Papers

Session Chair: Jeremy Andrus, Apple

Bolt: Faster Reconfiguration in Operating Systems

Sankaralingam Panneerselvam, Michael M. Swift, and Nam Sung Kim, University of Wisconsin—Madison

Dynamic resource scaling enables provisioning extra resources during peak loads and saving energy by reclaiming those resources during off-peak times. Scaling the number of CPU cores is particularly valuable as it allows power savings during low-usage periods. Current systems perform scaling with a slow hotplug mechanism, which was primarily designed to remove or replace faulty cores. The high cost of scaling is reflected in power management policies that perform scaling at coarser time scales to amortize the high reconfiguration latency.

We describe Bolt, a new mechanism built on existing hotplug infrastructure to reduce scaling latency. Bolt also supports a new bulk interface to add or remove multiple cores at once. We implemented Bolt for x86 and ARM architectures. Our evaluation shows that Bolt can achieve over 20x speedup for entering offline state. While turning on CPUs, Bolt achieves speedup up to 10x and 21x for x86 and ARM respectively.
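
For reference, the stock Linux hotplug path that Bolt builds on is exposed through sysfs: a core is taken offline or brought back by writing to its online file. The sketch below only exercises that baseline interface (Linux-only, requires root); it is not Bolt's bulk mechanism.

    def set_cpu_online(cpu, online):
        # Stock Linux hotplug interface: write 0/1 to the core's sysfs "online" file.
        path = f"/sys/devices/system/cpu/cpu{cpu}/online"
        with open(path, "w") as f:
            f.write("1" if online else "0")

    # Example (run as root on Linux; cpu0 typically cannot be taken offline):
    # set_cpu_online(3, False)   # park core 3 during an off-peak period
    # set_cpu_online(3, True)    # bring it back for the next load peak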

Available Media

Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables

Yaozu Dong and Mochi Xue, Shanghai Jiao Tong University and Intel Corporation; Xiao Zheng, Intel Corporation; Jiajun Wang, Shanghai Jiao Tong University and Intel Corporation; Zhengwei Qi and Haibing Guan, Shanghai Jiao Tong University

The increasing adoption of the Graphics Processing Unit (GPU) for computation-intensive workloads has stimulated a new computing paradigm called GPU cloud (e.g., Amazon's GPU Cloud), which necessitates sharing GPU resources among multiple tenants in a cloud. However, state-of-the-art GPU virtualization techniques such as gVirt still suffer from non-trivial performance overhead for graphics memory-intensive workloads involving frequent page table updates.

To understand such overhead, this paper first presents GMedia, a media benchmark, and uses it to analyze the causes of the overhead. Our analysis shows that frequent updates to the guest VM's page tables cause excessive updates to the shadow page table in the hypervisor, due to the need to guarantee consistency between the guest page table and the shadow page table. To this end, this paper proposes gHyvi, an optimized GPU virtualization scheme based on gVirt, which uses adaptive hybrid page table shadowing that combines strict and relaxed page table schemes. By significantly reducing trap-and-emulation due to page table updates, gHyvi significantly improves gVirt's performance for memory-intensive GPU workloads. Evaluation using GMedia shows that gHyvi can achieve up to 13x performance improvement compared to gVirt, and up to 85% of native performance for multithreaded media transcoding.

Available Media

Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems

Sharanyan Srikanthan, Sandhya Dwarkadas, and Kai Shen, University of Rochester

Modern multicore platforms suffer from inefficiencies due to contention and communication caused by sharing resources or accessing shared data. In this paper, we demonstrate that information from low-cost hardware performance counters commonly available on modern processors is sufficient to identify and separate the causes of communication traffic and performance degradation. We have developed SAM, a Sharing-Aware Mapper that uses the aggregated coherence and bandwidth event counts to separate traffic caused by data sharing from that due to memory accesses. When these counts exceed pre-determined thresholds, SAM effects task-to-core assignments that colocate tasks that share data and distribute tasks with high demand for cache capacity and memory bandwidth. Our new mapping policies automatically improve execution speed by up to 72% for individual parallel applications compared to the default Linux scheduler, while reducing performance disparities across applications in multiprogrammed workloads.
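
A toy decision rule in this spirit, with placeholder counter names and thresholds; SAM's actual policies and counter selection are more involved.

    COHERENCE_THRESHOLD = 1_000_000   # events/sec, placeholder value
    BANDWIDTH_THRESHOLD = 5_000_000   # events/sec, placeholder value

    def placement_decision(coherence_events, bandwidth_events):
        if coherence_events > COHERENCE_THRESHOLD:
            return "colocate the sharing tasks on the same socket"
        if bandwidth_events > BANDWIDTH_THRESHOLD:
            return "distribute tasks across sockets and memory controllers"
        return "leave the placement unchanged"

    print(placement_decision(2_500_000, 100_000))   # sharing dominates -> colocate
    print(placement_decision(10_000, 9_000_000))    # bandwidth dominates -> distribute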

Available Media

Establishing a Base of Trust with Performance Counters for Enterprise Workloads

Andrzej Nowak, CERN openlab and École Polytechnique Fédérale de Lausanne (EPFL); Ahmad Yasin, Intel; Avi Mendelson, Technion—Israel Institute of Technology; Willy Zwaenepoel, École Polytechnique Fédérale de Lausanne (EPFL)

Understanding the performance of large, complex enterprise-class applications is an important, yet nontrivial task. Methods using hardware performance counters, such as profiling through event-based sampling, are often favored over instrumentation for analyzing such large codes, but rarely provide good accuracy at the instruction level.

This work evaluates the accuracy of multiple event-based sampling techniques and quantifies the impact of a range of improvements suggested in recent years. The evaluation is performed on instances of three modern CPU architectures, using designated kernels and full applications. We conclude that precisely distributed events considerably improve accuracy, with further improvements possible when using Last Branch Records. We also present practical recommendations for hardware architects, tool developers and performance engineers, aimed at improving the quality of results.

Available Media

Utilizing the IOMMU Scalably

Omer Peleg and Adam Morrison, Technion—Israel Institute of Technology; Benjamin Serebrin, Google; Dan Tsafrir, Technion—Israel Institute of Technology

IOMMUs provided by modern hardware allow the OS to enforce memory protection controls on the DMA operations of its I/O devices. An IOMMU translation management design must scalably handle frequent concurrent updates of IOMMU translations made by multiple cores, which occur in high throughput I/O workloads such as multi-Gb/s networking. Today, however, OSes experience performance meltdowns when using the IOMMU in such workloads.

This paper explores scalable IOMMU management designs and addresses the two main bottlenecks we find in current OSes: (1) assignment of I/O virtual addresses (IOVAs), and (2) management of the IOMMU’s TLB.

We propose three approaches for scalable IOVA assignment: (1) dynamic identity mappings, which eschew IOVA allocation altogether, (2) allocating IOVAs using the kernel’s kmalloc, and (3) per-core caching of IOVAs allocated by a globally-locked IOVA allocator. We further describe a scalable IOMMU TLB management scheme that is compatible with all these approaches.
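
A sketch of the per-core caching approach: each core keeps a small private stash of IOVAs and touches the globally locked allocator only to refill or drain it in bulk. The magazine size and the range allocator below are simplified stand-ins, not the paper's design.

    import threading

    _lock = threading.Lock()
    _next_iova = 0x1000_0000
    MAGAZINE = 32          # IOVAs fetched per trip to the global allocator

    def _global_alloc(n):  # stand-in for the globally locked range allocator
        global _next_iova
        with _lock:
            base = _next_iova
            _next_iova += n * 4096
            return [base + i * 4096 for i in range(n)]

    class PerCoreIovaCache:
        def __init__(self):
            self.stash = []
        def alloc(self):
            if not self.stash:                    # refill in bulk: one lock acquisition
                self.stash = _global_alloc(MAGAZINE)
            return self.stash.pop()
        def free(self, iova):
            if len(self.stash) < 2 * MAGAZINE:    # keep frees local and lock-free
                self.stash.append(iova)
            # else: return a batch to the global allocator (omitted)

    core0 = PerCoreIovaCache()
    print(hex(core0.alloc()))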

Evaluation of our designs under Linux shows that (1) they achieve 88.5%–100% of the performance obtained without an IOMMU, (2) they achieve similar latency to that obtained without an IOMMU, (3) scalable IOVA allocation and dynamic identity mappings perform comparably, and (4) kmalloc provides a simple solution with high performance, but can suffer from unbounded page table blowup.

Available Media

At Small Scale

Track 2: Refereed Papers

Session Chair: Fred Douglis, EMC

Selectively Taming Background Android Apps to Improve Battery Lifetime

Marcelo Martins, Brown University; Justin Cappos, New York University; Rodrigo Fonseca, Brown University

Background activities on mobile devices can cause significant battery drain with little visibility or recourse to the user. They can range from useful but sometimes overly aggressive tasks, such as polling for messages or updates from sensors and online services, to outright bugs that cause resources to be held unnecessarily. In this paper we instrument the Android OS to characterize background activities that prevent the device from sleeping. We present TAMER, an OS mechanism that interposes on events and signals that cause task wakeups, and allows for their detailed monitoring, filtering, and rate-limiting. We demonstrate how TAMER can help reduce battery drain in scenarios involving popular Android apps with background tasks. We also show how TAMER can mitigate the effects of well-known energy bugs while maintaining most of the apps’ functionality. Finally, we elaborate on how developers and users can devise their own application-control policies for TAMER to maximize battery lifetime.

Available Media

U-root: A Go-based, Firmware Embeddable Root File System with On-demand Compilation

Ronald G. Minnich, Google; Andrey Mirtchovski, Cisco

U-root is an embeddable root file system intended to be placed in a FLASH device as part of the firmware image, along with a Linux kernel. The program source code is installed in the root file system contained in the firmware FLASH part and compiled on demand. All the u-root utilities, roughly corresponding to standard Unix utilities, are written in Go, a modern, type-safe language with garbage collection and language-level support for concurrency and inter-process communication.

Unlike most embedded root file systems, which consist largely of binaries, U-root has only five: an init program and 4 Go compiler binaries. When a program is first run, it and any not-yet-built packages it uses are compiled to a RAM-based file system. The first invocation of a program takes a fraction of a second, as it is compiled. Packages are only compiled once, so the slowest build is always the first one, on boot, which takes about 3 seconds. Subsequent invocations are very fast, usually a millisecond or so.

U-root blurs the line between script-based distros such as Perl Linux and binary-based distros such as BusyBox; it has the flexibility of Perl Linux and the performance of BusyBox. Scripts and builtins are written in Go, not a shell scripting language. U-root is a new way to package and distribute file systems for embedded systems, and the use of Go promises a dramatic improvement in their security.

Available Media

LPD: Low Power Display Mechanism for Mobile and Wearable Devices

MyungJoo Ham, Inki Dae, and Chanwoo Choi, Samsung Electronics

A plethora of mobile devices such as smartphones, wearables, and tablets have penetrated the market explosively in the last decade. In battery-powered mobile devices, energy is a scarce resource that must be carefully managed. A mobile device consists of many components, and each of them contributes to the overall power consumption. This paper focuses on the energy conservation problem in display components, the importance of which is growing as contemporary mobile devices are equipped with higher display resolutions. Prior approaches to saving energy in display units either critically deteriorate user perception or depend on additional hardware. We propose a novel display energy conservation scheme called LPD (Low Power Display) that preserves display quality without requiring specialized hardware. LPD utilizes the display update information available in the X Window System and eliminates expensive memory copies of unvaried parts. LPD is directly applicable to devices based on Linux and the X Window System. Numerous experimental analyses show that LPD saves up to 7.87% of total device power consumption. Several commercial products such as the Samsung Gear S employ LPD, whose source code is available to the public as open-source software at http://opensource.samsung.com and http://review.tizen.org.

Available Media

Memory-Centric Data Storage for Mobile Systems

Jinglei Ren, Tsinghua University; Chieh-Jan Mike Liang, Microsoft Research; Yongwei Wu, Tsinghua University; Thomas Moscibroda, Microsoft Research

Current data storage on smartphones mostly inherits from desktop/server systems a flash-centric design: The memory (DRAM) effectively acts as an I/O cache for the relatively slow flash. To improve both app responsiveness and energy efficiency, this paper proposes MobiFS, a memory-centric design for smartphone data storage. This design no longer exercises cache writeback at short fixed periods or on file synchronization calls. Instead, it incrementally checkpoints app data into flash at appropriate times, as calculated by a set of app/user-adaptive policies. MobiFS also introduces transactions into the cache to guarantee data consistency. This design trades off data staleness for better app responsiveness and energy efficiency, in a quantitative manner. Evaluations show that MobiFS achieves 18.8× higher write throughput and 11.2× more database transactions per second than the default Ext4 filesystem in Android. Popular real-world apps show improvements in response time and energy consumption by 51.6% and 35.8% on average, respectively.
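
As a rough illustration of staleness-bounded write-back, writes can be absorbed in memory and checkpointed to flash only once they have been dirty longer than a budget. The budget value and flush hook below are invented; MobiFS's transactions and adaptive app/user policies are not modeled.

    import time

    STALENESS_BUDGET_S = 30.0   # invented; MobiFS adapts this per app/user
    _dirty = {}                 # path -> (data, time it first became dirty)

    def write(path, data):
        first = _dirty.get(path, (None, time.monotonic()))[1]
        _dirty[path] = (data, first)     # absorb the write in memory
        maybe_checkpoint()

    def maybe_checkpoint():
        now = time.monotonic()
        stale = [p for p, (_, t) in _dirty.items() if now - t > STALENESS_BUDGET_S]
        for path in stale:
            flush_to_flash(path, _dirty.pop(path)[0])

    def flush_to_flash(path, data):
        print(f"checkpoint {path}: {len(data)} bytes")   # stand-in for real flash I/O

    write("notes.txt", b"draft 1")   # buffered; flushed once its staleness budget expires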

Available Media

WearDrive: Fast and Energy-Efficient Storage for Wearables

Jian Huang, Georgia Institute of Technology; Anirudh Badam, Ranveer Chandra, and Edmund B. Nightingale, Microsoft Research
Awarded Best Paper!

Size and weight constraints on wearables limit their battery capacity and restrict them from providing rich functionality. The need for durable and secure storage for personal data further compounds this problem as these features incur energy-intensive operations. This paper presents WearDrive, a fast storage system for wearables based on battery-backed RAM and an efficient means to offload energy intensive tasks to the phone. WearDrive leverages low-power network connectivity available on wearables to trade the phone’s battery for the wearable’s by performing large and energy-intensive tasks on the phone while performing small and energy-efficient tasks locally using battery-backed RAM. WearDrive improves the performance of wearable applications by up to 8.85x and improves battery life up to 3.69x with negligible impact to the phone’s battery life.

Available Media

12:05 pm–12:15 pm Friday

Break with Refreshments

12:15 pm–1:30 pm Friday

Closing Panel Discussion

Selling Stuff That's Free: the Commercial Side of Free Software

Moderator: Mary Baker, Hewlett-Packard

Panelists: Jim Zemlin, Executive Director, Linux Foundation; Allison Randal, President, Open Source Initiative (OSI) and Distinguished Engineer, Hewlett-Packard; Jordan Hubbard, CTO, iXsystems; Sarah Guo, Investor, Greylock Partners; Eli Collins, Chief Technologist, Cloudera; Scott McClellan, Sr. Director, Red Hat

The rise of open source and open systems over the past 20 years has been nothing short of world-changing. Open source is the unrivaled winner in data center operating systems and software, and many of the world's most successful technology companies would be hobbled without the deep and wide benefits of shared and community-developed infrastructure. New industry giants are able to stand on the shoulders of existing giants to expand and extend for the benefit of technology both enterprise and consumer worldwide.

This is true in the sense of both technology and business.

Our panelists span the technology and business aspects of open source, with diverse expertise and experience.

The first run of a panel with this same topic took place in San Diego at the 1996 USENIX Annual Technical Conference when it was much less clear that "this open stuff" would prove to become commercially viable.

Please join us for a lively discussion of the technical and business aspects of open source and what the next 20 years might have in store. We want this to be an interactive session, so please bring your insightful questions for our panelists.

1996 Moderator: Mary Baker, Stanford University
2015 Moderator: Mary Baker, Hewlett-Packard

1996 Panelists: Bob Bruce, Walnut Creek CD-ROM (now part of iXSystems); Bill Davidow, Mohr, Davidow Ventures; Michael Tiemann, Cygnus Support (now part of Red Hat); Linus Torvalds, University of Helsinki

2015 Panelists: Jim Zemlin, Executive Director, Linux Foundation; Allison Randal, President, Open Source Initiative (OSI) and Distinguished Engineer, Hewlett-Packard; Jordan Hubbard, CTO, iXsystems; Sarah Guo, Investor, Greylock Partners

  • Read more about Selling Stuff That's Free: the Commercial Side of Free Software
