All sessions will be held in Grand Ballroom I–III unless otherwise noted.
Papers are available for download below to registered attendees now and to everyone beginning July 8, 2019. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].
Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)
Monday, July 8, 2019
7:30 am–9:00 am
Grand Ballroom Prefunction
9:00 am–10:30 am
Session Chairs: Gala Yadgar, Technion—Israel Institute of Technology, and Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology)
Sang-Hoon Kim, Ajou University; Jinhong Kim and Kisik Jeong, Sungkyunkwan University; Jin-Soo Kim, Seoul National University
Recently proposed key-value SSD (KVSSD) provides the popular and versatile key-value interface at the device level, promising high performance and simplified storage management with the minimal involvement of the host software. However, its I/O command set over NVMe is defined on a per key-value pair basis, enforcing the host to post key-value operations to KVSSD independently. This not only incurs high interfacing overhead for small key-value operations but also makes it subtle to support transactions in KVSSDs without a software support.
In this paper, we propose compound commands for KVSSDs. The compound command allows the host to specify multiple key-value pairs in a single NVMe operation, thereby effectively amortizing I/O interfacing overhead. In addition, it provides an effective way for defining a transaction comprised of multiple key-value pairs. Our evaluation using a prototype KVSSD and an in-house KVSSD emulator shows promising benefits of the compound command, with improving the performance by up to 55%.
Fenggang Wu, Bingzhe Li, Zhichao Cao, Baoquan Zhang, Ming-Hong Yang, Hao Wen, and David H.C. Du, University of Minnesota, Twin Cities
The emergence of Hybrid Shingled Magnetic Recording (H-SMR) allows dynamic conversion of the recording format between Conventional Magnetic Recording (CMR) and SMR on a single disk drive. H-SMR is promising for its ability to manage the performance/capacity trade-off on the disk platters and to adaptively support different application scenarios in large-scale storage systems. However, there is little research on how to efficiently manage data and space in such H-SMR drives.
In this paper, we present ZoneAlloy, an elastic data and space management scheme for H-SMR drives, to explore the benefit of using such drives. ZoneAlloy initially allocates CMR space for the application and then gradually converts the disk format from CMR to SMR to create more space for the application. ZoneAlloy controls the overhead of the format conversion on the application I/O with our quantized migration mechanism. When data is stored in an SMR area, ZoneAlloy reduces the SMR update overhead using H-Buffer and Zone-Swap. H-Buffer is a small host-controlled CMR space that absorbs the SMR updates and migrates those updates back to the SMR space in batches to bring down the SMR update cost. Zone-Swap dynamically swaps ``hot'' data from the SMR space to the CMR space to further alleviate the SMR update problem. Evaluation results based on MSR-Cambridge traces demonstrate that ZoneAlloy can reduce the average I/O latency and limit the performance degradation of the application I/O during format conversion.
Kan Wu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau, University of Wisconsin—Madison
New non-volatile memory technologies offer unprecedented performance levels for persistent storage. However, to exploit their full potential, a deeper performance characterization of such devices is required. In this paper, we analyze a NVM-based block device -- the Intel Optane SSD -- and formalize an "unwritten contract'' of the Optane SSD. We show that violating this contract can result in 11x worse read latency and limited throughput (only 20% of peak bandwidth) regardless of parallelism. We present that this contract is relevant to features of 3D XPoint memory and Intel Optane SSD's controller/interconnect design. Finally, we discuss the implications of the contract.
10:30 am–11:00 am
Break with Refreshments
Grand Ballroom Prefunction
11:00 am–12:30 pm
Session Chairs: Ethan Miller, University of California, Santa Cruz, and Pure Storage, and Bradley Kuszmaul, Oracle
Aashaka Shah, University of Texas at Austin; Vinay Banakar, Hewlett Packard Enterprise; Supreeth Shastri, Melissa Wasserman, and Vijay Chidambaram, University of Texas at Austin
The recently introduced General Data Protection Regulation (GDPR) is forcing several companies to make significant changes to their systems to achieve compliance. Motivated by the finding that more than 30% of GDPR articles are related to storage, we investigate the impact of GDPR compliance on storage systems. We illustrate the challenges of retrofitting existing systems into compliance by modifying GDPR-compliant Redis. We show that despite needing to introduce a small set of new features, a strict real-time compliance (e.g., logging every user request synchronously) lowers Redis’ throughput by 20x. Our work reveals how GDPR allows compliance to be a spectrum, and what its implications are for system designers. We discuss the technical challenges that need to be solved before strict compliance can be efficiently achieved.
Zhen Cao, Stony Brook University; Geoff Kuenning, Harvey Mudd College; Klaus Mueller, Anjul Tyagi, and Erez Zadok, Stony Brook University
Storage researchers have always been interested in understanding the complex behavior of storage systems with the help of statistics, machine learning, and simple visualization techniques. However, when a system's behavior is affected by hundreds or even thousands of factors, existing approaches break down. Results are often difficult to interpret, and it can be challenging for humans to apply domain knowledge to a complex system. We propose to enhance storage system analysis by applying "interactive visual analytics," which can address the aforementioned limitations. We have devised a suitable Interactive Configuration Explorer (ICE), and conducted several case studies on a typical storage system, to demonstrate its benefits for storage system researchers and designers. We found that ICE makes it easy to explore a large parameter space, identify critical parameters, and quickly zero in on optimal parameter settings.
Yuhan Peng and Peter Varman, Rice University
We present Fair-EDF, a framework for latency guarantees in shared storage servers. It provides fairness control while supporting latency guarantees. Fair-EDF extends the pure earliest deadline first (EDF) scheduler by adding a controller to shape the workloads. Under overload it selects a minimal number of requests to drop and to choose the dropped requests in a fair manner. The evaluation results show Fair-EDF provides steady fairness control among a set of clients with different runtime behaviors.
12:30 pm–2:00 pm
Luncheon for Workshop Attendees
2:00 pm–3:30 pm
Mobile, Consistent, and Flexible
Session Chairs: Peter Macko, NetApp, and Geoff Kuenning, Harvey Mudd College
Yu Liang, Qiao Li, and Chun Jason Xue, Department of Computer Science, City University of Hong Kong
Current Linux memory management algorithms have been applied for many years. Android inherits Linux kernel, and thus the memory management algorithms of Linux are transplanted to Android smartphones. To evaluate the efficiency of the memory management algorithms of Android, page re-fault is applied as the target metric in this paper. Through carefully designed experiments, this paper shows that current memory management algorithms are not working well on Android smartphones. For example, page re-fault is up to 37% when running a set of popular apps, which means a large proportion of pages evicted by the existing memory management algorithms are accessed again in the near future. Furthermore, the causes of the high page re-fault ratio are analyzed. Based on the analysis, a tradeoff between the reclaim size and the overall performance is uncovered. By exploiting this tradeoff, a preliminary idea is proposed to improve the performance of Android smartphones.
Aleksey Charapko, University at Buffalo, SUNY; Microsoft, Redmond, WA; Ailidani Ailijiang, Microsoft, Redmond, WA; Murat Demirbas, University at Buffalo, SUNY; Microsoft, Redmond, WA
Many distributed systems/databases rely on Paxos for providing linearizable reads. Linearizable reads in Paxos are achieved either through running a full read round with followers, or via reading from a stable leader which holds leases on followers. We introduce a third method for performing linearizable reads by eschewing the leader and only reading from a quorum of followers. For guaranteeing linearizability, a bare quorum read is insufficient and it needs to be amended with a rinse phase to account for pending update operations. We present our Paxos Quorum Read (PQR) protocol that implements this. Our evaluations show that PQR significantly improves throughput compared to the other methods. The evaluations also show that PQR achieves comparable latency to the read from stable Paxos leader optimization.
Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree
Jung-Sang Ahn, Mohiuddin Abdul Qader, Woon-Hak Kang, Hieu Nguyen, Guogen Zhang, and Sami Ben-Romdhane, eBay Inc.
Designing key-value stores based on log-structured merge-tree (LSM-tree) encounters a well-known trade-off between the I/O cost of update and that of lookup as well as of space usage. It is generally believed that they cannot be improved at the same time; reducing update cost will increase lookup cost and space usage, and vice versa. Recent works have been addressing this issue, but they focus on probabilistic approaches or reducing amortized cost only, which may not be helpful for tail latency that is critical to server applications. This paper suggests a novel approach that transplants copy-on-write B+-tree into LSM-tree, aiming at reducing update cost without sacrificing lookup cost. In addition to that, our scheme provides a simple and practical way to adjust the index between update-optimized form and space-optimized form. The evaluation results show that it significantly reduces update cost with consistent lookup cost.
3:30 pm–4:00 pm
Break with Refreshments
Grand Ballroom Prefunction
4:00 pm–5:30 pm
Session Chairs: Mike Mesnier, Intel Labs, and Bill Jannen, Williams College
Ian F. Adams, John Keys, and Michael P. Mesnier, Intel Labs
Computational storage has remained an elusive goal. Though minimizing data movement by placing computation close to storage has quantifiable benefits, many of the previous attempts failed to take root in industry. They either require a departure from the widespread block protocol to one that is more computationally-friendly (e.g., file, object, or key-value), or they introduce significant complexity (state) on top of the block protocol.
We participated in many of these attempts and have since concluded that neither a departure from nor a significant addition to the block protocol is needed. Here we introduce a block-compatible design based on virtual objects. Like a real object (e.g., a file), a virtual object contains the metadata that is needed to process the data. We show how numerous offloads are possible using virtual objects and, as one example, demonstrate a 99% reduction in the data movement required to “scrub” object storage for bitrot. We also present our early work with erasure coded data which, unlike RAID, can be easily adapted to computational storage using virtual objects.
Daniel Bittman, Peter Alvaro, Darrell D. E. Long, and Ethan L. Miller, UC Santa Cruz
The increasing availability of byte-addressable non-volatile memory on the system bus provides an opportunity to dramatically simplify application interaction with persistent data. However, software and hardware leverage different abstractions: software operating on persistent data structures requires “global” pointers that remain valid after a process terminates, while hardware requires that a diverse set of devices all have the same mappings they need for bulk transfers to and from memory, and that they be able to do so for a potentially heterogeneous memory system. Both abstractions must be implemented in a way that is efficient using existing hardware.
We propose to abstract physical memory into an object space, which maps objects to physical memory, while providing applications with a way to refer to data that may have a lifetime longer than the processes accessing it. This approach reduces the coordination required for access to multiple types of memory while improving hardware security and enabling more hardware autonomy. We describe how we can use existing hardware support to implement these abstractions, both for applications and for the OS and devices, and show that the performance penalty for this approach is minimal.
An Ounce of Prevention is Worth a Pound of Cure: Ahead-of-time Preparation for Safe High-level Container Interfaces
Ricardo Koller and Dan Williams, IBM T. J. Watson Research Center
Containers continue to gain traction in the cloud as lightweight alternatives to virtual machines (VMs). This is partially due to their use of host filesystem abstractions, which play a role in startup times, memory utilization, crash consistency, file sharing, host introspection, and image management. However, the filesystem interface is high-level and wide, presenting a large attack surface to the host. Emerging secure container efforts focus on lowering the level of abstraction of the interface to the host through deprivileged functionality recreation (e.g., VMs, userspace kernels). However, the filesystem abstraction is so important that some have resorted to directly exposing it from the host instead of suffering the resulting semantic gap. In this paper, we suggest that through careful ahead-of-time metadata preparation, secure containers can maintain a small attack surface while simultaneously alleviating the semantic gap.
5:30 pm–6:30 pm
Poster Session and Happy Hour
Lake Washington Ballroom
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Happy Hour. Take part in discussions with your colleagues over complimentary food and drinks.
Tuesday, July 9, 2019
7:30 am–8:30 am
Grand Ballroom Prefunction
8:30 am–9:40 am
9:40 am–10:10 am
Break with Refreshments
Grand Ballroom Prefunction
10:10 am–12:10 pm
Session Chairs: Dan Peek, Facebook, and Erez Zadok, Stony Brook University
Kunal Lillaney, Johns Hopkins University; Vasily Tarasov and David Pease, IBM Research-Almaden; Randal Burns, Johns Hopkins University
Object storage has emerged as a low-cost and hyper-scalable alternative to distributed file systems. However, interface incompatibilities and performance limitations often compel users to either transfer data between a file system and object storage or use inefficient file connectors over object stores. The result is growing storage sprawl, unacceptably low performance, and an increase in associated storage costs. One promising solution to this problem is providing dual access, the ability to transparently read and write the same data through both file system interfaces and object storage APIs. In this position paper we argue that there is a need for dual-access file systems over object storage, and examine some representative use cases which benefit from such systems. Based on our conversations with end users, we discuss features which we believe are essential or desirable in a dual-access object storage file system (OSFS). Further, we design and implement an early prototype of Agni, an efficient dual-access OSFS which overcomes the shortcomings of existing approaches. Our preliminary experiments demonstrate that for some representative workloads Agni can improve performance by 20%--60% compared to either S3FS, a popular OSFS, or the prevalent approach of manually copying data between different storage systems.
Jing Liu, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison; Sudarsun Kannan, Rutgers University
We introduce file systems as processes (FSP), a storage architecture designed for modern ultra-fast storage devices. By building a direct-access file system as a standalone user-level process, FSP accelerates file system development velocity without compromising essential file system properties. FSP promises to deliver raw device-level performance via highly tuned inter-process communication mechanisms; FSP also ensures protection and metadata integrity by design. To study the potential advantages and disadvantages of the FSP approach, we develop DashFS, a prototype user-level file system. We discuss its architecture and show preliminary performance benefits.
Alex Conway and Eric Knorr, Rutgers University; Yizheng Jiao, The University of North Carolina at Chapel Hill; Michael A. Bender, Stony Brook University; William Jannen, Williams College; Rob Johnson, VMware Research; Donald Porter, The University of North Carolina at Chapel Hill; Martin Farach-Colton, Rutgers University
Filesystem fragmentation is a first-order performance problem that has been the target of many heuristic and algorithmic approaches. Real-world application benchmarks show that common filesystem operations cause many filesystems to fragment over time, a phenomenon known as filesystem aging.
This paper examines the common assumption that space pressure will exacerbate fragmentation. Our microbenchmarks show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, on a “real-world” application benchmark, space pressure causes fragmentation that slows subsequent reads by only 20% on ext4, relative to the amount of fragmentation that would occur on a file system with abundant space. The other file systems show negligible additional degradation under space pressure.
Our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the filesystem aging process. The effect on write performance is non-existent in some cases, and, in most cases, an order of magnitude smaller than the read degradation from fragmentation cause by normal usage.
Takeshi Yoshimura, Tatsuhiro Chiba, and Hiroshi Horii, IBM Research–Tokyo
The extremely low latency of non-volatile memory (NVM) raises issues of latency in file systems. In particular, user-kernel context switches caused by system calls and hardware interrupts become a non-negligible performance penalty. A solution to this problem is using direct-access file systems, but existing work focuses on optimizing their non-POSIX user interfaces. In this work, we propose EvFS, our new user-level POSIX file system that directly manages NVM in user applications. EvFS minimizes the latency by building a user-level storage stack and introducing asynchronous processing of complex file I/O with page cache and direct I/O. We report that the event-driven architecture of EvFS leads to a 700-ns latency for 64-byte non-blocking file writes and reduces the latency for 4-Kbyte blocking file I/O by 20 us compared to a kernel file system with journaling disabled.
12:10 pm–2:00 pm
Luncheon for Workshop Attendees
2:00 pm–3:30 pm
Session Chairs: Avani Wildani, Emory University, and Angela Demke Brown, University of Toronto
Lucas Kuhring, IMDEA Software Institute, Madrid, Spain; Eva Garcia, Universidad Autónoma de Madrid, Spain; Zsolt István, IMDEA Software Institute, Madrid, Spain
In order to keep up with big data workloads, distributed storage needs to offer low latency, high bandwidth and energy efficient access to data. To achieve these properties, most state of the art solutions focus either exclusively on software or on hardware-based implementation. FPGAs are an example of the latter and a promising platform for building storage nodes but they are more cumbersome to program and less flexible than software, which limits their adoption.
We make the case that, in order to be feasible in the cloud, solutions designed around programmable hardware, such as FPGAs, have to follow a service provider-centric methodology: the hardware should only provide functionality that is useful across all tenants and rarely changes. Conversely, application-specific functionality should be delivered through software that, in a cloud setting, is under the provider's control. Deploying FPGAs this way is less cumbersome, requires less hardware programming and flexibility increases overall.
We demonstrate the benefits of this approach by building an application-aware storage for Parquet files, a columnar data format widely used in big data frameworks. Our prototype offers transparent 10Gbps deduplication in hardware without sacrificing low latency operation and specializes to Parquet files using a companion library. This work paves the way for in-storage filtering of columnar data without having to implement file-type and tenant-specific parsing in the FPGA.
Automating Context-Based Access Pattern Hint Injection for System Performance and Swap Storage Durability
SeongJae Park, Yunjae Lee, Moonsub Kim, and Heon Y. Yeom, Seoul National University
Memory pressure is inevitable as the size of working sets is rapidly growing while the capacity of dynamic random access memory (DRAM) is not. Meanwhile, storage devices have evolved so that their speed is comparable to the speed of DRAM while their capacity scales are comparable to that of hard disk drives (HDD). Thus, hierarchial memory systems configuring DRAM as the main memory and high-end storages as swap devices will be common.
Due to the unique characteristics of these modern storage devices, the swap target decision should be optimal. It is essential to know the exact data access patterns of workloads for such an optimal decision, although underlying systems cannot accurately estimate such complex and dynamic patterns. For this reason, memory systems allow programs to voluntarily hint their data access pattern. Nevertheless, it is exhausting for a human to manually figure out the patterns and embed optimal hints if the workloads are huge and complex.
This paper introduces a compiler extension that automatically optimizes a program to voluntarily hint its dynamic data access patterns to the underlying swap system using a static/dynamic analysis based profiling result. To our best knowledge, this is the first profile-guided optimization (PGO) for modern swap devices. Our empirical evaluation of the scheme using realistic workloads shows consistent improvement in performance and swap device lifetime up to 2.65 times and 2.98 times, respectively.
Mania Abdi, Northeastern University; Amin Mosayyebzadeh, Boston University; Mohammad Hossein Hajkazemi, Northeastern University; Ata Turk, State Street; Orran Krieger, Boston University; Peter Desnoyers, Northeastern University
To get good performance for data stored in Object storage services like S3, data analysis clusters need to cache data locally. Recently these caches have started taking into account higher-level information from analysis framework, allowing prefetching based on predictions of future data accesses. There is, however, a broader opportunity; rather than using this information to predict one future, we can use it to select a future that is best for caching. This paper provides preliminary evidence that we can exploit the directed acyclic graph (DAG) of inter-task dependencies used by data-parallel frameworks such as Spark, Pig, and Hive to improve application performance, by optimizing caching for the critical path through the DAG for the application. We present experimental results for PIG running TPC-H queries, showing completion time improvements of up to 23% vs our implementation of MRD, a state-of-the-art DAG-based prefetching system, and improvements of up to 2.5x vs LRU caching. We then discuss the broader opportunity for building a system based on this opportunity.
3:30 pm–4:00 pm
Break with Refreshments
4:00 pm–5:30 pm
Flash & SSD
Session Chairs: Peter Desnoyers, Northeastern University, and Dean Hildebrand, Google
Pan Yang, Ni Xue, Yuqi Zhang, Yangxu Zhou, Li Sun, Wenwen Chen, Zhonggang Chen, Wei Xia, Junke Li, and Kihyoun Kwon, Samsung R&D Institute China Xi'an, Samsung Electronics
In solid-state drives (SSDs), garbage collection (GC) plays a key role in making free NAND blocks for newly coming data. The data copied from one block to another by GC affects both the performance and lifetime of SSD significantly. Placing the data with different “temperature” into different NAND blocks can reduce data copy overhead in GC. This paper proposes a scheme to place data according to its predicted future temperature. A neural network called LSTM is applied to increase the accuracy of temperature prediction in both temporal and spatial dimensions. And it also uses K-Means to do clustering and automatically dispatch similar “future temperature” data to the same NAND blocks. The results obtained show that performance and write amplification factor (WAF) are improved in various applications. In the best case, the WAF and 99.99% of the write latency are reduced by up to 43.5% and 79.3% respectively.
Qiao Li, Department of Computer Science, City University of Hong Kong; Min Ye, YEESTOR Microelectronics Co., Ltd; Yufei Cui, Department of Computer Science, City University of Hong Kong; Liang Shi, School of Computer Science and Software Engineering, East China Normal University; Xiaoqiang Li, YEESTOR Microelectronics Co., Ltd; Chun Jason Xue, Department of Computer Science, City University of Hong Kong
With latest development, NAND flash is experiencing increased errors. The read reference voltages are the key factor for RBER seen by ECC. The limited error correction capability of ECC determines a value range that the read voltages should fall into, otherwise a read failure followed by a read retry with tuned read voltage, would happen. Therefore, finding a correct read voltage with the smallest number of read failures has been a hot research problem. Previous methods in the literature are designed to either progressively tune the voltage value or empirically predict a read voltage based on error models. However, straightforward tuning leads to unpredictable large number of read retries, whereas complex modeling brings large overhead. This paper proposes a novel approach, by reserving a small set of cells as sentinels, which directly tell us the optimal voltage, as drifting caused errors exhibits strong locality. Experiments demonstrate the proposed technique is both efficient and effective.
Wonil Choi, Bhuvan Urgaonkar, and Mahmut Kandemir, Pennsylvania State University; Myoungsoo Jung, KAIST
We argue that, along with bandwidth and capacity, lifetime of flash devices is also a critical resource that needs to be explicitly and carefully managed, especially in emerging consolidated environments. We study the resulting multi-resource allocation problem in a setting where "fairness" across consolidated workloads is desired. Towards this, we propose to adapt the well-known notion of dominant resource fairness (DRF). We empirically show that using DRF with only bandwidth and capacity (and ignoring lifetime) may result in poor device lifetime. Incorporating lifetime, however, turns out to be non-trivial. We identify key challenges in this adaptation and present simple heuristics. We also discuss possible design choices which will be fully explored in future work.