FAST '17 Technical Sessions

All sessions will be held in Grand Ballroom ABCFGH unless otherwise noted.

The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
 FAST '17 Full Proceedings (PDF)
 FAST '17 Proceedings Interior (PDF, best for mobile devices)

Full Proceedings ePub (for iPad and most eReaders)
 FAST '17 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
 FAST '17 Full Proceedings (Mobi)

Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)

Attendee Files 
FAST '17 Attendee List (PDF)
FAST '17 Web Archive (ZIP)

Tuesday, February 28

7:30 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

8:45 am–9:15 am

Opening Remarks and Presentation of Best Paper and Test of Time Awards

Program Co-Chairs: Geoff Kuenning, Harvey Mudd College, and Carl Waldspurger, CloudPhysics

View all Test of Time award winners.

9:15 am–10:45 am

Keynote Address

Memory-Driven Computing

Kimberly Keeton, Hewlett Packard Labs

Available Media

Data growth and data analytics requirements are outpacing the compute and storage technologies that have provided the foundation of processor-driven architectures for the last five decades. This divergence requires a deep rethinking of how we build systems, and points towards a memory-driven architecture where memory is the key resource and everything else, including processing, revolves around it. Memory-driven computing (MDC) brings together byte-addressable persistent memory, a fast memory fabric, task-specific processing, and a new software stack to address these data growth and analysis challenges. At Hewlett Packard Labs, we are exploring MDC hardware and software design through The Machine. This talk will review the trends that motivate MDC, illustrate how MDC benefits applications, provide highlights from our Machine-related work in data management and programming models, and outline challenges that MDC presents for the FAST community.

Kimberly Keeton, Hewlett Packard Labs

Dr. Kimberly Keeton is a Distinguished Technologist at Hewlett Packard Labs. She holds a Ph.D. and an M.S. in Computer Science from the University of California, Berkeley, and a B.S. in Computer Engineering and Engineering and Public Policy from Carnegie Mellon University. Her recent research is in the areas of NVM-aware data stores and data analytics frameworks. She has also worked in the areas of storage and information management, NoSQL databases, storage dependability, intelligent storage, and workload characterization. She was a co-architect of the Express Query database, which provides metadata services for HPE's StoreAll archiving solution. She is an ACM Distinguished Scientist and a Senior Member of the IEEE, and has served as Technical Program Committee Chair for multiple USENIX, ACM, IEEE and IFIP sponsored conferences.

10:45 am–11:15 am

Break with Refreshments

Grand Ballroom Foyer

11:15 am–12:30 pm

Garbage

Session Chair: Nitin Agrawal, Samsung Research

Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL

Ram Kesavan, Rohit Singh, and Travis Grusecki, NetApp; Yuvraj Patel, University of Wisconsin—Madison
Awarded Best Paper!

Available Media

NetApp® WAFL® is a transactional file system that uses the copy-on-write mechanism to support fast write performance and efficient snapshot creation. However, copy-on-write increases the demand on the file system to find free blocks quickly; failure to do so may impede allocations for incoming writes. Efficiency is also important, because the task may consume CPU and other resources. In this paper, we describe the evolution (over more than a decade) of WAFL’s algorithms and data structures for reclaiming space with minimal impact on the overall storage appliance performance.

Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs

Shiqin Yan, Huaicheng Li, Mingzhe Hao, and Michael Hao Tong, University of Chicago; Swaminathan Sundararaman, Parallel Machines; Andrew A. Chien and Haryadi S. Gunawi, University of Chicago

Available Media

TTFLASH is a “tiny-tail” flash drive (SSD) that eliminates GC-induced tail latencies by circumventing GC-blocked I/Os with four novel strategies: plane-blocking GC, rotating GC, GC-tolerant read, and GC-tolerant flush. It is built on three SSD internal advancements: powerful controllers, parity-based RAIN, and capacitor-backed RAM, but is dependent on the use of intra-plane copyback operations. We show that TTFLASH comes close to a “no-GC” scenario. Specifically, between the 99th and 99.99th percentiles, TTFLASH is only 1.0 to 2.6× slower than the no-GC case, while a base approach suffers from 5–138× GC-induced slowdowns.
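
The abstract lists four strategies; the sketch below illustrates just one of them, a GC-tolerant read, assuming a RAIN stripe with a single XOR parity plane. The stripe layout and all names are assumptions for illustration, not TTFLASH internals: a read that would block on a garbage-collecting plane is instead rebuilt from the other planes in the stripe.

```python
from functools import reduce

PLANES_PER_STRIPE = 4   # assumption: 3 data planes + 1 XOR parity plane (RAIN)

def xor_pages(pages):
    """Byte-wise XOR of equal-length pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))

class Plane:
    def __init__(self):
        self.pages = {}              # page index -> bytes
        self.gc_in_progress = False

def gc_tolerant_read(planes, plane_id, idx):
    """Serve a read without waiting for GC: if the target plane is busy
    collecting garbage, rebuild its page from the other planes in the stripe."""
    if not planes[plane_id].gc_in_progress:
        return planes[plane_id].pages[idx]
    peers = [p.pages[idx] for i, p in enumerate(planes) if i != plane_id]
    return xor_pages(peers)          # data ^ ... ^ parity recovers the page

# Demo: three data pages plus their parity, striped across four planes.
data = [bytes([i] * 8) for i in (1, 2, 3)]
planes = [Plane() for _ in range(PLANES_PER_STRIPE)]
for plane, page in zip(planes, data):
    plane.pages[0] = page
planes[3].pages[0] = xor_pages(data)     # parity plane
planes[1].gc_in_progress = True          # plane 1 is garbage collecting
assert gc_tolerant_read(planes, 1, 0) == data[1]
print("reconstructed without blocking on GC")
```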

The Logic of Physical Garbage Collection in Deduplicating Storage

Fred Douglis, Abhinav Duggal, Philip Shilane, and Tony Wong, Dell EMC; Shiqin Yan, Dell EMC and University of Chicago; Fabiano Botelho, Rubrik, Inc.

Available Media

Most storage systems that write in a log-structured manner need a mechanism for garbage collection (GC), reclaiming and consolidating space by identifying unused areas on disk. In a deduplicating storage system, GC is complicated by the possibility of numerous references to the same underlying data. We describe two variants of garbage collection in a commercial deduplicating storage system, a logical GC that operates on the files containing deduplicated data and a physical GC that performs sequential I/O on the underlying data. The need for the second approach arises from a shift in the underlying workloads, in which exceptionally high duplication ratios or the existence of millions of individual small files result in unacceptably slow GC using the file-level approach. Under such workloads, determining the liveness of chunks becomes a slow phase of logical GC. We find that physical GC decreases the execution time of this phase by up to two orders of magnitude in the case of extreme workloads and improves it by approximately 10–60% in the common case, but only after additional optimizations to compensate for its higher initialization overheads.
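
The contrast between the two enumeration orders can be sketched as follows; this is deliberately simplified (the production system uses multiple phases and perfect-hash structures not shown here). Logical GC walks every file recipe, while physical GC makes one sequential pass over the on-disk containers and checks membership in a prebuilt fingerprint set.

```python
# Files are modeled as lists of chunk fingerprints ("recipes") and chunks live
# in on-disk containers; these structures are illustrative only.

def live_fingerprints_logical(files):
    """Logical GC: walk every file recipe. Work grows with the number of
    references, so extreme dedup ratios or millions of tiny files hurt."""
    live = set()
    for recipe in files.values():
        live.update(recipe)
    return live

def live_fingerprints_physical(containers, referenced):
    """Physical GC: one sequential pass over the containers themselves, testing
    each stored chunk against a prebuilt set of referenced fingerprints."""
    live = set()
    for container in containers:
        live.update(fp for fp in container if fp in referenced)
    return live

files = {"a.vmdk": ["fp1", "fp2", "fp2", "fp3"], "b.vmdk": ["fp2", "fp3"]}
containers = [["fp1", "fp2"], ["fp3", "fp9"]]        # fp9 is no longer referenced
referenced = {fp for recipe in files.values() for fp in recipe}
assert live_fingerprints_logical(files) == live_fingerprints_physical(containers, referenced)
print("both approaches agree on the live set; only the I/O pattern differs")
```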

12:30 pm–2:00 pm

Conference Luncheon

Terra Courtyard, Sponsored by NetApp

2:00 pm–3:40 pm

The System

Session Chair: Ming Zhao, Arizona State University

File Systems Fated for Senescence? Nonsense, Says Science!

Alex Conway and Ainesh Bakshi, Rutgers University; Yizheng Jiao and Yang Zhan, The University of North Carolina at Chapel Hill; Michael A. Bender, William Jannen, and Rob Johnson, Stony Brook University; Bradley C. Kuszmaul, Oracle Corporation and Massachusetts Institute of Technology; Donald E. Porter, The University of North Carolina at Chapel Hill; Jun Yuan, Farmingdale State College of SUNY; Martin Farach-Colton, Rutgers University

Available Media

File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions which eventually lead to slower performance, or aging. Traditional file systems employ heuristics, such as collocating related files and data blocks, to avoid aging, and many file system implementors treat aging as a solved problem.

However, this paper describes realistic as well as synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging. For example, on ext4 and ZFS, a few hundred git pull operations can reduce read performance by a factor of 2; performing a thousand pulls can reduce performance by up to a factor of 30. We further present microbenchmarks demonstrating that common placement strategies are extremely sensitive to file-creation order; varying the creation order of a few thousand small files in a real-world directory structure can slow down reads by 15–175x, depending on the file system.

We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between read performance of a directory scan and the locality within a file system’s access patterns, using a dynamic layout score.

In short, many file systems are exquisitely prone to read aging for a variety of write workloads. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. BetrFS typically outperforms the other file systems in our benchmarks; aged BetrFS even outperforms the unaged versions of these file systems, excepting Btrfs. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.
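
One way to read the dynamic layout score mentioned above (a simplified interpretation, not necessarily the paper's exact definition) is the fraction of blocks in an access sequence that immediately follow the previously accessed block on disk:

```python
def dynamic_layout_score(block_sequence):
    """Fraction of accessed blocks whose on-disk address immediately follows the
    previously accessed block: 1.0 is a perfectly sequential scan, 0.0 has no
    locality. Simplified reading of the paper's dynamic layout score."""
    if len(block_sequence) < 2:
        return 1.0
    contiguous = sum(1 for prev, cur in zip(block_sequence, block_sequence[1:])
                     if cur == prev + 1)
    return contiguous / (len(block_sequence) - 1)

# An unaged directory scan touches mostly consecutive blocks...
print(dynamic_layout_score([100, 101, 102, 103, 500, 501]))    # 0.8
# ...while an aged one jumps around the disk.
print(dynamic_layout_score([100, 9200, 101, 43, 7000, 7100]))  # 0.0
```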

To FUSE or Not to FUSE: Performance of User-Space File Systems

Bharath Kumar Reddy Vangoor, Stony Brook University; Vasily Tarasov, IBM Research-Almaden; Erez Zadok, Stony Brook University

Note: Due to technical difficulties, there is no audio of this talk.

Available Media

Traditionally, file systems were implemented as part of OS kernels. However, as the complexity of file systems grew, many new file systems began to be developed in user space. Nowadays, user-space file systems are often used to prototype and evaluate new approaches to file system design. Low performance is considered the main disadvantage of user-space file systems, but the extent of this problem has never been explored systematically. As a result, the topic of user-space file systems remains rather controversial: while some consider user-space file systems a toy not to be used in production, others develop full-fledged production file systems in user space. In this paper we analyze the design and implementation of the most widely known user-space file system framework—FUSE—and characterize its performance for a wide range of workloads. We instrumented FUSE to extract useful statistics and traces, which helped us analyze its performance bottlenecks, and we present our analysis results. Our experiments indicate that, depending on the workload and hardware used, the performance degradation caused by FUSE can be imperceptible or as high as 83% even when optimized, and that relative CPU utilization can increase by 31%.

Knockoff: Cheap Versions in the Cloud

Xianzheng Dou, Peter M. Chen, and Jason Flinn, University of Michigan

Note: Due to technical difficulties, there is no audio of this talk.

Available Media

Cloud-based storage provides reliability and ease-of-management. Unfortunately, it can also incur significant costs for both storing and communicating data, even after using techniques such as chunk-based deduplication and delta compression. The current trend of providing access to past versions of data exacerbates both costs.

In this paper, we show that deterministic recomputation of data can substantially reduce the cost of cloud storage. Borrowing a well-known dualism from the fault-tolerance community, we note that any data can be equivalently represented by a log of the nondeterministic inputs needed to produce that data. We design a file system, called Knockoff, that selectively substitutes nondeterministic inputs for file data to reduce communication and storage costs. Knockoff compresses both data and computation logs: it uses chunk-based deduplication for file data and delta compression for logs of nondeterminism. In two studies, Knockoff reduces the average cost of sending files to the cloud without versioning by 21% and 24%; the relative benefit increases as versions are retained more frequently.
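
A hedged sketch of the dualism: a file version can be shipped either by value (its bytes) or by operation (a log of nondeterministic inputs that, replayed through the producing program, regenerates the bytes), and the cheaper representation wins. The cost model and all names below are placeholders, not Knockoff's implementation.

```python
import zlib

def by_value_cost(data: bytes) -> int:
    return len(zlib.compress(data))        # stand-in for chunk dedup + compression

def by_operation_cost(nondet_log: bytes) -> int:
    return len(zlib.compress(nondet_log))  # stand-in for delta-compressed log

def choose_representation(data: bytes, nondet_log: bytes, replay_cost_ok: bool):
    """Send whichever representation is cheaper; fall back to raw data when
    recomputation would be too expensive to replay on read."""
    if replay_cost_ok and by_operation_cost(nondet_log) < by_value_cost(data):
        return ("operation", nondet_log)
    return ("value", data)

# A large generated file often has a tiny log of nondeterminism behind it.
output = b"0123456789" * 100_000
log = b"seed=42;argv=gen --rows 1000000;clock=1488300000"
kind, payload = choose_representation(output, log, replay_cost_ok=True)
print(kind, len(payload))   # "operation", with a payload far smaller than the file
```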

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Salman Niazi, Mahmoud Ismail, Seif Haridi, and Jim Dowling, KTH Royal Institute of Technology; Steffen Grohsschmiedt, Spotify AB; Mikael Ronström, Oracle

Note: Due to technical difficulties, there is no audio of this talk.

Available Media

Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS' single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables clusters that are an order of magnitude larger and have higher throughput than HDFS. Metadata capacity is increased to at least 37 times HDFS' capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.
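
HopsFS builds on MySQL Cluster (NDB), not SQLite; the sqlite3 sketch below is only meant to convey the relational, transactional flavor of a database-backed metadata service, with an assumed inode table keyed by (parent, name). Schema and operation names are illustrative, not the HopsFS data model.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE inodes (
                  id        INTEGER PRIMARY KEY,
                  parent_id INTEGER NOT NULL,
                  name      TEXT    NOT NULL,
                  is_dir    INTEGER NOT NULL,
                  size      INTEGER NOT NULL DEFAULT 0,
                  UNIQUE (parent_id, name))""")

db.execute("INSERT INTO inodes VALUES (1, 0, '/', 1, 0)")
db.execute("INSERT INTO inodes VALUES (2, 1, 'logs', 1, 0)")
db.execute("INSERT INTO inodes VALUES (3, 2, 'app.log', 0, 4096)")

def rename(conn, inode_id, new_parent, new_name):
    """A metadata operation is just a transaction over the inode table."""
    with conn:   # commits on success, rolls back on any exception
        conn.execute("UPDATE inodes SET parent_id = ?, name = ? WHERE id = ?",
                     (new_parent, new_name, inode_id))

rename(db, 3, 1, "app.log.old")
print(db.execute("SELECT parent_id, name FROM inodes WHERE id = 3").fetchone())
# (1, 'app.log.old')
```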

3:40 pm–4:10 pm

Break with Refreshments

TusCA Courtyard

4:10 pm–5:25 pm

Edward Sharpe and the Magnetic Zeros

Session Chair: Swami Sundararaman, Parallel Machines

Evolving Ext4 for Shingled Disks

Abutalib Aghayev, Carnegie Mellon University; Theodore Ts’o, Google, Inc.; Garth Gibson, Carnegie Mellon University; Peter Desnoyers, Northeastern University

Note: Due to technical difficulties, there is no audio of this talk.

Available Media

Drive-Managed SMR (Shingled Magnetic Recording) disks offer a plug-compatible, higher-capacity replacement for conventional disks. For non-sequential workloads, these disks show bimodal behavior: after a short period of high throughput, they enter a continuous period of low throughput.

We introduce ext4-lazy, a small change to the Linux ext4 file system that significantly improves the throughput in both modes. We present benchmarks on four different drive-managed SMR disks from two vendors, showing that ext4-lazy achieves a 1.7–5.4× improvement over ext4 on a metadata-light file server benchmark. On metadata-heavy benchmarks it achieves a 2–13× improvement over ext4 on drive-managed SMR disks as well as on conventional disks.

SMaRT: An Approach to Shingled Magnetic Recording Translation

Weiping He and David H.C. Du, University of Minnesota

Available Media

Shingled Magnetic Recording (SMR) is a new technique for increasing areal data density in hard drives. Drive-managed SMR (DM-SMR) drives employ a shingled translation layer to mask internal data management and support a block interface to the host software. Two major challenges in designing an efficient shingled translation layer for DM-SMR drives are metadata overhead and garbage collection overhead.

In this paper we introduce SMaRT, an approach to Shingled Magnetic Recording Translation that adapts its data management scheme as the drive utilization changes. SMaRT uses a hybrid update strategy that performs in-place updates for the qualified tracks and out-of-place updates for the unqualified tracks. Background Garbage Collection (GC) operations and on-demand GC operations are used when the free space becomes too fragmented. SMaRT also has a specially crafted space allocation and track migration scheme that supports automatic cold data progression to minimize GC overhead in the long term.

We implement SMaRT and compare it with a regular Hard Disk Drive (HDD) and a simulated Seagate DM-SMR drive. The experiments with several block I/O traces demonstrate that SMaRT performs better than the Seagate drive and even provides performance comparable to regular HDDs when drive space usage is below a certain threshold.
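
A toy rendering of the hybrid update strategy described above. The qualification test used here (the downstream track holds no valid data, so a shingled overwrite is harmless) is an assumption for illustration, as is the rest of the bookkeeping; it is not SMaRT's actual policy.

```python
class SMRTranslation:
    """Track-level translation with hybrid updates (illustrative bookkeeping)."""

    def __init__(self, num_tracks):
        self.valid = [False] * num_tracks     # does this physical track hold data?
        self.mapping = {}                     # logical track -> physical track

    def _free_track(self):
        return next(i for i, used in enumerate(self.valid) if not used)

    def write(self, logical):
        phys = self.mapping.get(logical)
        downstream_empty = (phys is not None and
                            (phys + 1 >= len(self.valid) or not self.valid[phys + 1]))
        if downstream_empty:
            return ("in-place", phys)         # overwriting shingles over nothing valid
        new_phys = self._free_track()         # otherwise relocate and remap
        if phys is not None:
            self.valid[phys] = False          # old copy becomes garbage for later GC
        self.mapping[logical] = new_phys
        self.valid[new_phys] = True
        return ("out-of-place", new_phys)

smr = SMRTranslation(num_tracks=8)
print(smr.write(0))   # ('out-of-place', 0): first write allocates a track
print(smr.write(0))   # ('in-place', 0): the downstream track is empty, so overwrite
```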

Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-Level Transparent Local Erasure Coding

Yin Li and Hao Wang, Rensselaer Polytechnic Institute; Xuebin Zhang, Dell EMC/DSSD; Ning Zheng, Rensselaer Polytechnic Institute; Shafa Dahandeh, Western Digital; Tong Zhang, Rensselaer Polytechnic Institute

Available Media

This paper presents a simple yet effective design solution to facilitate technology scaling for hard disk drives (HDDs) being deployed in data centers. Emerging magnetic recording technologies improve storage areal density mainly through reducing the track pitch, which, however, makes HDDs subject to higher read retry rates. More frequent HDD read retries could cause intolerable tail latency for large-scale systems such as data centers. To reduce the occurrence of costly read retries, one intuitive solution is to apply erasure coding locally on each HDD or JBOD (just a bunch of disks). To be practically viable, local erasure coding must have very low coding redundancy, which demands a very long codeword length (e.g., one codeword spanning hundreds of 4 KB sectors) and hence large files. This makes local erasure coding mainly suitable for data center applications. This paper contends that local erasure coding should be implemented transparently within filesystems, and accordingly presents a basic design framework and elaborates on important design issues. Meanwhile, this paper derives mathematical formulations for estimating its effect on reducing HDD read tail latency. Using Reed-Solomon (RS) based erasure codes as test vehicles, we carried out detailed analysis and experiments to evaluate its implementation feasibility and effectiveness. We integrated the developed design solution into ext4 to further demonstrate its feasibility and quantitatively measure its impact on the average performance of various big data benchmarks.
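
The paper's own formulations are more detailed; as a rough stand-in, the sketch below estimates, under an independence assumption, how often a long codeword with m parity sectors can be served without any re-read. The values of k, m, and p are illustrative only.

```python
from math import comb

def retry_free_probability(k, m, p):
    """Probability that one (k data + m parity)-sector codeword is readable with
    no re-read, assuming each sector independently needs a retry with
    probability p and up to m unreadable sectors can be reconstructed."""
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(m + 1))

p = 1e-3                                   # assumed per-sector first-pass failure rate
uncoded = (1 - p) ** 256                   # 256 plain sectors: any retry stalls the read
coded = retry_free_probability(k=256, m=2, p=p)
print(f"retry-free read probability: uncoded {uncoded:.4f} vs coded {coded:.4f}")
# roughly 0.77 vs 0.998 with these (illustrative) numbers
```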

6:00 pm–8:00 pm

Poster Session and Reception I

Santa Clara Ballroom

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks.

View the list of accepted posters.

Wednesday, March 1

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:40 am

Corruption

Session Chair: Vasily Tarasov, IBM Research

Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions

Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

Available Media

We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous bugs related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. Our results have implications for the design of next generation fault-tolerant distributed and cloud storage systems.

Omid, Reloaded: Scalable and Highly-Available Transaction Processing

Ohad Shacham, Yahoo Research; Francisco Perez-Sorrosal, Yahoo; Edward Bortnikov and Eshcar Hillel, Yahoo Research; Idit Keidar, Technion—Israel Institute of Technology and Yahoo Research; Ivan Kelly, Midokura; Matthieu Morel, Skyscanner; Sameer Paranjpye, Arimo

Available Media

We present Omid—a transaction processing service that powers web-scale production systems at Yahoo. Omid provides ACID transaction semantics on top of traditional key-value storage; its implementation over Apache HBase is open sourced as part of Apache Incubator. Omid can serve hundreds of thousands of transactions per second on standard mid-range hardware, while incurring minimal impact on the speed of data access in the underlying key-value store. Additionally, as expected from always-on production services, Omid is highly available.

Application Crash Consistency and Performance with CCFS

Thanumalayan Sankaranarayana Pillai, Ramnatthan Alagappan, and Lanyue Lu, University of Wisconsin—Madison; Vijay Chidambaram, The University of Texas at Austin; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison
Awarded Best Paper!

Available Media

Recent research has shown that applications often incorrectly implement crash consistency. We present ccfs, a file system that improves the correctness of application-level crash consistency protocols while maintaining high performance. A key idea in ccfs is the abstraction of a stream. Within a stream, updates are committed in program order, thus helping correctness; across streams, there are no ordering restrictions, thus enabling scheduling flexibility and high performance. We empirically demonstrate that applications running atop ccfs achieve high levels of crash consistency. Further, we show that ccfs performance under standard filesystem benchmarks is excellent, in the worst case on par with the highest performing modes of Linux ext4, and in some cases notably better. Overall, we demonstrate that both application correctness and high performance can be realized in a modern file system.
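
A rough sketch of the stream abstraction as described above: updates within a stream commit in program order, while separate streams impose no ordering on each other. The class and method names are hypothetical, not the ccfs interface.

```python
from collections import defaultdict

class StreamingFS:
    """Toy model: each stream keeps its own ordered queue of pending updates,
    so committing one stream never has to flush another application's updates."""

    def __init__(self):
        self.pending = defaultdict(list)     # stream name -> updates in program order
        self.durable = []                    # what has reached "disk", in commit order

    def write(self, stream, update):
        self.pending[stream].append(update)

    def fsync(self, stream):
        # Program order is preserved *within* the stream being committed;
        # other streams' updates stay pending and impose no ordering cost.
        self.durable.extend(self.pending.pop(stream, []))

fs = StreamingFS()
fs.write("sqlite", "journal-write")
fs.write("sqlite", "db-write")
fs.write("logger", "append line 1")      # unrelated app, separate stream
fs.fsync("sqlite")                       # commits only sqlite's updates, in order
print(fs.durable)                        # ['journal-write', 'db-write']
print(fs.pending)                        # logger's update is still pending
```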

High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System

Harendra Kumar and Yuvraj Patel, University of Wisconsin—Madison; Ram Kesavan and Sumith Makam, NetApp

Available Media

We introduce a low-cost incremental checksum technique that protects metadata blocks against in-memory scribbles, and a lightweight digest-based transaction auditing mechanism that enforces file system consistency invariants. Compared with previous work, our techniques reduce performance overhead by an order of magnitude. They also help distinguish scribbles from logic bugs. We also present a mechanism to pinpoint the cause of scribbles on production systems. Our techniques have been productized in the NetApp® WAFL® (Write Anywhere File Layout) file system with negligible performance overhead, greatly reducing corruption-related incidents over the past five years, based on millions of runtime hours.
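
The abstract names two mechanisms; the sketch below illustrates only the first, an incremental block checksum, using a simple additive checksum that is an assumption for illustration rather than WAFL's actual function. A legitimate update adjusts the checksum in O(1), while a stray in-memory write ("scribble") leaves it stale and is caught before write-out.

```python
# Toy incremental checksum: the block checksum is the sum of its 64-bit words
# modulo 2**64, so a legitimate field update can adjust it in constant time.
MASK = (1 << 64) - 1

class ProtectedBlock:
    def __init__(self, nwords):
        self.words = [0] * nwords
        self.checksum = 0

    def update_word(self, i, value):          # the legitimate update path
        self.checksum = (self.checksum - self.words[i] + value) & MASK
        self.words[i] = value

    def verify(self):
        return self.checksum == sum(self.words) & MASK

blk = ProtectedBlock(512)
blk.update_word(7, 0xDEADBEEF)
assert blk.verify()
blk.words[42] = 0x1BADB002                    # scribble: memory changed directly
assert not blk.verify()                       # detected before the block is written out
print("scribble detected")
```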

10:40 am–11:10 am

Break with Refreshments

Grand Ballroom Foyer

11:10 am–12:25 pm

Frameworks

Session Chair: Danny Harnik, IBM Research—Haifa

Mirador: An Active Control Plane for Datacenter Storage

Jake Wires and Andrew Warfield, Coho Data

Available Media

This paper describes Mirador, a dynamic placement service implemented as part of an enterprise scale-out storage product. Mirador is able to encode multidimensional placement goals relating to the performance, failure response, and workload adaptation of the storage system. Using approaches from dynamic constraint satisfaction, Mirador migrates both data and client network connections in order to continuously adapt and improve the configuration of the storage system.
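
As a loose illustration of placement driven by constraint-style objectives (the rules, weights, and greedy search below are inventions, not Mirador's algorithm): each rule scores a placement, and replicas are migrated whenever a move lowers the total penalty.

```python
def fault_domain_penalty(placement):
    # penalize replicas of the same object stored on the same node
    return sum(len(nodes) - len(set(nodes)) for nodes in placement.values())

def balance_penalty(placement, all_nodes):
    load = {n: 0 for n in all_nodes}
    for nodes in placement.values():
        for n in nodes:
            load[n] += 1
    return max(load.values()) - min(load.values())

def total_penalty(placement, all_nodes):
    return 10 * fault_domain_penalty(placement) + balance_penalty(placement, all_nodes)

def improve(placement, all_nodes):
    """Greedy pass: try moving each replica to every node, keep improving moves."""
    for obj in list(placement):
        for i in range(len(placement[obj])):
            for dst in all_nodes:
                candidate = {o: list(r) for o, r in placement.items()}
                candidate[obj][i] = dst
                if total_penalty(candidate, all_nodes) < total_penalty(placement, all_nodes):
                    placement = candidate
    return placement

nodes = ["n1", "n2", "n3"]
placement = {"objA": ["n1", "n1"], "objB": ["n1", "n2"]}   # bad: co-located replicas
print(improve(placement, nodes))                            # replicas spread out
```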

Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data

Florian Lautenschlager, QAware GmbH; Michael Philippsen and Andreas Kumlehn, Friedrich-Alexander-Universität Erlangen-Nürnberg; Josef Adersberger, QAware GmbH

Available Media

Anomalies in the runtime behavior of software systems, especially in distributed systems, are inevitable, expensive, and hard to locate. To detect and correct such anomalies (like instability due to a growing memory consumption, failure due to load spikes, etc.) one has to automatically collect, store, and analyze the operational data of the runtime behavior, often represented as time series. There are efficient means both to collect and analyze the runtime behavior. But traditional time series databases do not yet focus on the specific needs of anomaly detection (generic data model, specific built-in functions, storage efficiency, and fast query execution).

The paper presents Chronix, a domain specific time series database targeted at anomaly detection in operational data. Chronix uses an ideal compression and chunking of the time series data, a methodology for commissioning Chronix’ parameters to a sweet spot, a way of enhancing the data with attributes, an expandable set of analysis functions, and other techniques to achieve both faster query times and a significantly smaller memory footprint. On benchmarks Chronix saves 20%–68% of the space that other time series databases need to store the data and saves 80%–92% of the data retrieval time and 73%–97% of the runtime of analyzing functions.
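
A rough sketch of the chunk-then-compress idea: cut the series into chunks, delta-encode timestamps, compress each chunk, and keep searchable attributes beside the blob. The chunk size, encoding, and attribute names are stand-ins for Chronix's tuned parameters, not its actual format.

```python
import json, zlib

def make_chunks(points, chunk_size=128):
    """points: list of (timestamp_ms, value) pairs, assumed sorted by time."""
    chunks = []
    for start in range(0, len(points), chunk_size):
        part = points[start:start + chunk_size]
        deltas = [t - prev for (t, _), (prev, _) in zip(part[1:], part)]
        blob = zlib.compress(json.dumps({"deltas": deltas,
                                         "values": [v for _, v in part]}).encode())
        chunks.append({"start": part[0][0], "end": part[-1][0],
                       "metric": "heap.used", "host": "web-01",   # query attributes
                       "data": blob})
    return chunks

points = [(1488300000000 + 1000 * i, 512 + (i % 7)) for i in range(1000)]
chunks = make_chunks(points)
raw = len(json.dumps(points).encode())
stored = sum(len(c["data"]) for c in chunks)
print(f"{len(chunks)} chunks, {stored} bytes stored vs {raw} bytes raw")
```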

Crystal: Software-Defined Storage for Multi-Tenant Object Stores

Raúl Gracia-Tinedo, Josep Sampé, Edgar Zamora, Marc Sánchez-Artigas, and Pedro García-López, Universitat Rovira i Virgili; Yosef Moatti and Eran Rom, IBM Research—Haifa

Available Media

Object stores are becoming pervasive due to their scalability and simplicity. Their broad adoption, however, contrasts with their rigidity for handling heterogeneous workloads and applications with evolving requirements, which prevents the adaptation of the system to such varied needs. In this work, we present Crystal, the first Software-Defined Storage (SDS) architecture whose core objective is to efficiently support multi-tenancy in object stores. Crystal adds a filtering abstraction at the data plane and exposes it to the control plane to enable high-level policies at the tenant, container and object granularities. Crystal translates these policies into a set of distributed controllers that can orchestrate filters at the data plane based on real-time workload information. We demonstrate Crystal through two use cases on top of OpenStack Swift: One that proves its storage automation capabilities, and another that differentiates IO bandwidth in a multi-tenant scenario. We show that Crystal is an extensible platform to deploy new SDS services for object stores with small overhead.
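
A toy rendering of the filtering abstraction: the control plane turns per-tenant policies into an ordered filter pipeline that the data plane applies on each PUT. The policy syntax and filter names below are invented for illustration and are not Crystal's API.

```python
import gzip

FILTERS = {
    "compress": lambda data, meta: gzip.compress(data),
    "tag-pii": lambda data, meta: (meta.update({"pii": "scanned"}) or data),
}

POLICIES = {
    # tenant -> filters to run, in order, each with a simple applicability test
    "tenant-analytics": [("compress", lambda meta: meta["size"] > 1024)],
    "tenant-medical":   [("tag-pii", lambda meta: True),
                         ("compress", lambda meta: True)],
}

def on_put(tenant, data, meta):
    """Data-plane hook: run the tenant's filter pipeline before storing."""
    for name, applies in POLICIES.get(tenant, []):
        if applies(meta):
            data = FILTERS[name](data, meta)
    return data, meta

body = b"x" * 4096
stored, meta = on_put("tenant-analytics", body, {"size": len(body)})
print(len(stored), meta)     # compressed payload, unchanged metadata
stored, meta = on_put("tenant-medical", body, {"size": len(body)})
print(len(stored), meta)     # tagged and compressed
```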

12:25 pm–2:00 pm

Lunch (on your own)

2:00 pm–3:15 pm

Solid State Records

Session Chair: Philip Shilane, Dell EMC

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

Se Kwon Lee, UNIST (Ulsan National Institute of Science and Technology); K. Hyun Lim, Hongik University; Hyunsub Song, Beomseok Nam, and Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology)

Available Media

Recent interest in persistent memory (PM) has stirred development of index structures that are efficient in PM. Recent such developments have all focused on variations of the B-tree. In this paper, we show that the radix tree, a less popular indexing structure, can be more appropriate as an efficient PM indexing structure. This is because the radix tree structure is determined by the prefix of the inserted keys and also does not require tree rebalancing operations or node-granularity updates. However, the radix tree as-is cannot be used in PM. As another contribution, we present three radix tree variants, namely WORT (Write Optimal Radix Tree), WOART (Write Optimal Adaptive Radix Tree), and ART+CoW. Of these, the first two are optimal for PM in the sense that they use only one 8-byte failure-atomic write per update to guarantee the consistency of the structure and do not require any duplicate copies for logging or CoW. Extensive performance studies show that our proposed radix tree variants perform considerably better than recently proposed B-tree variants for PM, such as NVTree, wB+Tree, and FPTree, for synthetic workloads as well as in implementations within Memcached.
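
Python cannot express persistent-memory flushes or fences, but the sketch below shows the structural idea behind a write-optimal insert: build any missing nodes off to the side, then publish them with a single pointer-sized store, so the tree is always in either the old or the new consistent state. The 4-bit radix is arbitrary, persistence details are omitted, and the path compression that WORT adds is not shown.

```python
RADIX_BITS = 4
FANOUT = 1 << RADIX_BITS
KEY_NIBBLES = 16                       # 64-bit keys

class Node:
    __slots__ = ("children",)
    def __init__(self):
        self.children = [None] * FANOUT

def nibble(key, depth):
    shift = (KEY_NIBBLES - 1 - depth) * RADIX_BITS
    return (key >> shift) & (FANOUT - 1)

def insert(root, key, value):
    node, depth = root, 0
    while depth < KEY_NIBBLES - 1:
        idx = nibble(key, depth)
        child = node.children[idx]
        if child is None:
            subtree = Node()           # build the entire missing path privately...
            cur = subtree
            for d in range(depth + 1, KEY_NIBBLES - 1):
                nxt = Node()
                cur.children[nibble(key, d)] = nxt
                cur = nxt
            cur.children[nibble(key, KEY_NIBBLES - 1)] = value
            node.children[idx] = subtree   # ...then publish with ONE store
            return
        node, depth = child, depth + 1
    node.children[nibble(key, depth)] = value  # existing path: also a single store

root = Node()
insert(root, 0x1234_5678_9ABC_DEF0, "v1")
n = root                               # lookup to confirm the published path
for d in range(KEY_NIBBLES - 1):
    n = n.children[nibble(0x1234_5678_9ABC_DEF0, d)]
print(n.children[nibble(0x1234_5678_9ABC_DEF0, KEY_NIBBLES - 1)])  # v1
```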

SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device

Hyukjoong Kim and Dongkun Shin, Sungkyunkwan University; Yun Ho Jeong and Kyung Ho Kim, Samsung Electronics

Available Media

Recent advances in flash memory technology have reduced the cost-per-bit of flash storage devices such as solid-state drives (SSDs), thereby enabling the development of large-capacity SSDs for enterprise-scale storage. However, two major concerns arise in designing SSDs. The first concern is the poor performance of random writes in an SSD. Server workloads such as databases generate many random writes; therefore, this problem must be resolved to enable the usage of SSDs in enterprise systems. The second concern is that the size of the internal DRAM of an SSD is proportional to the capacity of the SSD. The peculiarities of flash memory require an address translation layer, called the flash translation layer (FTL), to be implemented within an SSD. The FTL must maintain the address mapping table in the internal DRAM. Although the previously proposed demand map loading technique can reduce the required DRAM size, the technique aggravates the poor random performance. We propose a novel address reshaping technique called sequentializing in host and randomizing in device (SHRD), which transforms random write requests into sequential write requests in the block device driver by assigning the address space of the reserved log area in the SSD. Unlike previous approaches, SHRD can restore the sequentially written data to the original location without requiring explicit copy operations by utilizing the address mapping scheme of the FTL. We implement SHRD in a real SSD device and demonstrate the improved performance resulting from SHRD for various workloads.
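
A toy model of the host-side half of the idea: random writes are redirected to sequentially increasing addresses in a reserved log area, and a small remap table remembers the original LBAs. The in-device restoration step, which SHRD performs through the FTL mapping without data copies, is not modeled; names and the log layout are assumptions.

```python
LOG_START = 1_000_000          # assumed start LBA of the reserved log area

class Sequentializer:
    def __init__(self):
        self.next_log_lba = LOG_START
        self.remap = {}                    # original LBA -> log LBA

    def submit_write(self, original_lba, data):
        log_lba = self.next_log_lba        # pack writes into a sequential stream
        self.next_log_lba += 1
        self.remap[original_lba] = log_lba
        return ("write", log_lba, data)    # what is actually sent to the device

    def submit_read(self, original_lba):
        lba = self.remap.get(original_lba, original_lba)
        return ("read", lba)

seq = Sequentializer()
for lba in (88_121, 3_502, 61_997, 40_004):   # random write pattern from the host
    print(seq.submit_write(lba, b"..."))      # device sees LBAs 1000000..1000003
print(seq.submit_read(3_502))                 # redirected to 1000001
```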

Graphene: Fine-Grained IO Management for Graph Computing

Hang Liu and H. Howie Huang, The George Washington University

Available Media

As graphs continue to grow, external memory graph processing systems serve as a promising alternative to in-memory solutions for low cost and high scalability. Unfortunately, not only does this approach require considerable effort in programming and IO management, but its performance also lags behind, in some cases by an order of magnitude. In this work, we strive for an ambitious goal: ease of programming and high IO performance (as in in-memory processing) while maintaining graph data on disks (as in external memory processing). To this end, we have designed and developed Graphene, which consists of four new techniques: an IO request centric programming model, bitmap-based asynchronous IO, direct hugepage support, and data and workload balancing. The evaluation shows that Graphene not only runs several times faster than existing external-memory processing systems, but also performs comparably with in-memory processing on large graphs.
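
A small sketch in the spirit of bitmap-driven IO: the vertices an iteration needs are mapped to a bitmap of blocks, and runs of set bits become single large reads. The record and block sizes are assumptions, and this is not Graphene's implementation.

```python
BLOCK_SIZE = 4096
EDGES_PER_VERTEX_BYTES = 64        # assumed fixed-size on-disk adjacency record

def blocks_needed(vertices):
    bitmap = set()
    for v in vertices:
        bitmap.add((v * EDGES_PER_VERTEX_BYTES) // BLOCK_SIZE)
    return bitmap

def coalesce(bitmap):
    """Turn set bits into (offset, length) IO requests, merging adjacent blocks."""
    requests = []
    for b in sorted(bitmap):
        if requests and requests[-1][0] + requests[-1][1] == b * BLOCK_SIZE:
            off, length = requests[-1]
            requests[-1] = (off, length + BLOCK_SIZE)
        else:
            requests.append((b * BLOCK_SIZE, BLOCK_SIZE))
    return requests

active = [0, 1, 2, 3, 64, 65, 66, 200, 201]      # vertices touched this iteration
print(coalesce(blocks_needed(active)))           # [(0, 8192), (12288, 4096)]: 2 reads, not 9
```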

3:15 pm–3:45 pm

Break with Refreshments

TusCA Courtyard

3:45 pm–5:25 pm

6:00 pm–8:00 pm

Poster Session and Reception II

Santa Clara Ballroom

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks.

View the list of accepted posters.

Thursday, March 2

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:40 am

Faster Faster

Session Chair: Tudor Marian, Google

vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O

Ming Chen, Stony Brook University; Dean Hildebrand, IBM Research-Almaden; Henry Nelson, Ward Melville High School; Jasmit Saluja, Ashok Sankar Harihara Subramony, and Erez Zadok, Stony Brook University

Available Media

Modern systems use networks extensively, accessing both services and storage across local and remote networks. Latency is a key performance challenge, and packing multiple small operations into fewer large ones is an effective way to amortize that cost, especially after years of significant improvement in bandwidth but not latency. To this end, the NFSv4 protocol supports a compounding feature to combine multiple operations. Yet compounding has been underused since its conception because the synchronous POSIX file-system API issues only one (small) request at a time.

We propose vNFS, an NFSv4.1-compliant client that exposes a vectorized high-level API and leverages NFS compound procedures to maximize performance. We designed and implemented vNFS as a user-space RPC library that supports an assortment of bulk operations on multiple files and directories. We found it easy to modify several UNIX utilities, an HTTP/2 server, and Filebench to use vNFS. We evaluated vNFS under a wide range of workloads and network latency conditions, showing that vNFS improves performance even for low-latency networks. On high-latency networks, vNFS can improve performance by as much as two orders of magnitude.
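
The signatures below are hypothetical, intended only to show the shape of a vectorized API that a compound-friendly client could batch into few RPCs; they are not the actual vNFS library interface, and the reads are served locally just to keep the sketch runnable.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import List

@dataclass
class ReadOp:
    path: str
    offset: int
    length: int

def vread(ops: List[ReadOp]) -> List[bytes]:
    """Accept a whole batch of reads at once; a real client would pack them
    into a handful of NFSv4 compound procedures instead of N round trips."""
    results = []
    for op in ops:
        with open(op.path, "rb") as f:
            f.seek(op.offset)
            results.append(f.read(op.length))
    return results

# Demo files created on the fly so the example is self-contained.
paths = []
for i in range(2):
    fd, p = tempfile.mkstemp()
    os.write(fd, f"file {i} contents".encode())
    os.close(fd)
    paths.append(p)

# POSIX style would be N open/seek/read round trips; the vectorized call is one.
print(vread([ReadOp(paths[0], 0, 4), ReadOp(paths[1], 5, 10)]))  # [b'file', b'1 contents']
```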

On the Accuracy and Scalability of Intensive I/O Workload Replay

Alireza Haghdoost and Weiping He, University of Minnesota; Jerry Fredin, NetApp; David H.C. Du, University of Minnesota

Available Media

We introduce a replay tool that can be used to replay captured I/O workloads for performance evaluation of high-performance storage systems. We study several sources in the stock operating system that introduce uncertainty when replaying a workload. Based on these findings, we design and develop a new replay tool called hfplayer that can more accurately replay intensive block I/O workloads in a similar unscaled environment. However, to replay a given workload trace in a scaled environment, the dependency between I/O requests becomes crucial. Therefore, we propose a heuristic way of speculating I/O dependencies in a block I/O trace. Using the generated dependency graph, hfplayer is capable of replaying the I/O workload in a scaled environment. We evaluate hfplayer with a wide range of workloads using several accuracy metrics and find that it produces better accuracy when compared with two existing replay tools.
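
A simplified form of such dependency speculation is sketched below, under the assumption that request B may depend on request A only if A completed before B was issued (otherwise the application could not have observed A's result). The exact rule used by hfplayer is more elaborate.

```python
from dataclasses import dataclass

@dataclass
class IORecord:
    rid: int
    issue: float        # time the request was issued (seconds)
    complete: float     # time the request completed

def speculate_dependencies(trace):
    """For each request, keep only the latest-completing earlier request as its
    speculated parent, to avoid producing a needlessly dense graph."""
    edges = []
    for b in trace:
        candidates = [a for a in trace if a.complete <= b.issue]
        if candidates:
            parent = max(candidates, key=lambda a: a.complete)
            edges.append((parent.rid, b.rid))
    return edges

trace = [IORecord(1, 0.000, 0.004),
         IORecord(2, 0.001, 0.006),   # overlaps request 1: independent
         IORecord(3, 0.007, 0.010)]   # issued after 1 and 2 completed
print(speculate_dependencies(trace))  # [(2, 3)]
```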

On the Performance Variation in Modern Storage Stacks

Zhen Cao, Stony Brook University; Vasily Tarasov, IBM Research-Almaden; Hari Prasath Raman, Stony Brook University; Dean Hildebrand, IBM Research-Almaden; Erez Zadok, Stony Brook University

Available Media

Ensuring stable performance for storage stacks is important, especially with the growth in popularity of hosted services where customers expect QoS guarantees. The same requirement arises from benchmarking settings as well. One would expect that repeated, carefully controlled experiments might yield nearly identical performance results—but we found otherwise. We therefore undertook a study to characterize the amount of variability in benchmarking modern storage stacks. In this paper we report on the techniques used and the results of this study. We conducted many experiments using several popular workloads, file systems, and storage devices—and varied many parameters across the entire storage stack. In over 25% of the sampled configurations, we uncovered variations higher than 10% in storage performance between runs. We analyzed these variations and found that there was no single root cause: it often changed with the workload, hardware, or software configuration in the storage stack. In several of those cases we were able to fix the cause of variation and reduce it to acceptable levels. We believe our observations in benchmarking will also shed some light on addressing stability issues in production systems.

Enlightening the I/O Path: A Holistic Approach for Application Performance

Sangwook Kim, Apposha and Sungkyunkwan University; Hwanju Kim, Sungkyunkwan University and Dell EMC; Joonwon Lee and Jinkyu Jeong, Sungkyunkwan University

Available Media

In data-intensive applications, such as databases and key-value stores, reducing the request handling latency is important for providing better data services. In such applications, I/O-intensive background tasks, such as checkpointing, are the major culprit in worsening the latency due to contention in the shared I/O stack and storage. To minimize the contention, properly prioritizing I/Os is crucial, but the effectiveness of existing approaches is limited for two reasons. First, statically deciding the priority of an I/O is insufficient, since high-priority tasks can wait for low-priority I/Os due to I/O priority inversion. Second, multiple independent layers in modern storage stacks are not holistically considered by existing approaches, which thereby fail to effectively prioritize I/Os throughout the I/O path.

In this paper, we propose a request-centric I/O prioritization that dynamically detects and prioritizes I/Os delaying request handling at all layers in the I/O path. The proposed scheme is implemented on Linux and is evaluated with three applications, PostgreSQL, MongoDB, and Redis. The evaluation results show that our scheme achieves up to 53% better request throughput and 42× better 99th percentile request latency (84 ms vs. 3581 ms), compared to the default configuration in Linux.
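
A toy model of the request-centric idea, not the paper's Linux implementation: an I/O's priority is not fixed by its submitter, and when a foreground request is found waiting on a background task's I/O, that I/O inherits foreground priority. The queueing model and priority values are illustrative assumptions.

```python
import heapq

FOREGROUND, BACKGROUND = 0, 1        # lower value = dispatched first

class IOQueue:
    def __init__(self):
        self._heap = []              # [priority, seq, io_id] entries
        self._entry = {}             # io_id -> entry, for in-place boosting
        self._seq = 0

    def submit(self, io_id, priority):
        entry = [priority, self._seq, io_id]
        self._seq += 1
        self._entry[io_id] = entry
        heapq.heappush(self._heap, entry)

    def boost(self, io_id):
        """A foreground request is blocked on io_id: it inherits foreground priority."""
        self._entry[io_id][0] = FOREGROUND
        heapq.heapify(self._heap)    # simple O(n) re-establishment of heap order

    def dispatch_order(self):
        order = []
        while self._heap:
            _, _, io_id = heapq.heappop(self._heap)
            order.append(io_id)
        return order

q = IOQueue()
q.submit("checkpoint-write-1", BACKGROUND)
q.submit("checkpoint-write-2", BACKGROUND)
q.submit("query-read", FOREGROUND)
q.boost("checkpoint-write-2")        # the query blocks on a page this write covers
print(q.dispatch_order())
# ['checkpoint-write-2', 'query-read', 'checkpoint-write-1']:
# the write the query waits on jumps ahead of the other background I/O
```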

10:40 am–11:10 am

Break with Refreshments

Grand Ballroom Foyer

11:10 am–12:25 pm

Open Channel D

Session Chair: Irfan Ahmad, CachePhysics

LightNVM: The Linux Open-Channel SSD Subsystem

Matias Bjørling, CNEX Labs, Inc. and IT University of Copenhagen; Javier Gonzalez, CNEX Labs, Inc.; Philippe Bonnet, IT University of Copenhagen

Available Media

As Solid-State Drives (SSDs) become commonplace in data centers and storage arrays, there is a growing demand for predictable latency. Traditional SSDs, serving block I/Os, fail to meet this demand. They offer a high level of abstraction at the cost of unpredictable performance and suboptimal resource utilization. We propose that SSD management trade-offs should be handled through Open-Channel SSDs, a new class of SSDs that give hosts control over their internals. We present our experience building LightNVM, the Linux Open-Channel SSD subsystem. We introduce a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. Our experimental results demonstrate that LightNVM has modest host overhead, that it can be tuned to limit read latency variability, and that it can be customized to achieve predictable I/O latencies.
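
The sketch below only conveys the flavor of a physical page address: one integer whose bit fields name the channel, LUN, plane, block, and page. The field widths are arbitrary; real devices report their geometry to the host, and this is not the LightNVM kernel API.

```python
FIELDS = [("page", 12), ("block", 12), ("plane", 2), ("lun", 6), ("channel", 6)]

def ppa_encode(**coords):
    """Pack device coordinates into one integer address, low fields first."""
    addr, shift = 0, 0
    for name, width in FIELDS:
        value = coords.get(name, 0)
        assert value < (1 << width), f"{name} out of range"
        addr |= value << shift
        shift += width
    return addr

def ppa_decode(addr):
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (addr >> shift) & ((1 << width) - 1)
        shift += width
    return out

ppa = ppa_encode(channel=3, lun=1, plane=0, block=517, page=42)
print(hex(ppa))
print(ppa_decode(ppa))   # {'page': 42, 'block': 517, 'plane': 0, 'lun': 1, 'channel': 3}
```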

FlashBlox: Achieving Both Performance Isolation and Uniform Lifetime for Virtualized SSDs

Jian Huang, Georgia Institute of Technology; Anirudh Badam, Laura Caulfield, Suman Nath, Sudipta Sengupta, and Bikash Sharma, Microsoft; Moinuddin K. Qureshi, Georgia Institute of Technology

Available Media

A longstanding goal of SSD virtualization has been to provide performance isolation between multiple tenants sharing the device. Virtualizing SSDs, however, has traditionally been a challenge because of the fundamental tussle between resource isolation and the lifetime of the device—existing SSDs aim to age all the regions of flash uniformly, and this hurts isolation. We propose utilizing flash parallelism to improve isolation between virtual SSDs by running them on dedicated channels and dies. Furthermore, we offer a complete solution by also managing the wear. We propose allowing the wear of different channels and dies to diverge at fine time granularities in favor of isolation and adjusting that imbalance at a coarse time granularity in a principled manner. Our experiments show that the new SSD wears uniformly while the 99th percentile latencies of storage operations in a variety of multi-tenant settings are reduced by up to 3.1× compared to software-isolated virtual SSDs.
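
A sketch of the coarse-grained wear-balancing policy described above, with invented thresholds and bookkeeping: wear may diverge between rebalance intervals, after which the tenants on the most- and least-worn channels trade places. The data-migration mechanics are omitted.

```python
class ChannelIsolatedSSD:
    def __init__(self, num_channels):
        self.erase_counts = [0] * num_channels
        self.tenant_of = {ch: None for ch in range(num_channels)}

    def attach(self, tenant, channel):
        self.tenant_of[channel] = tenant

    def record_erases(self, channel, n):
        self.erase_counts[channel] += n

    def rebalance(self, max_skew):
        """Swap the tenants on the most- and least-worn channels once the
        erase-count gap exceeds max_skew (data migration itself not modeled)."""
        hi = max(range(len(self.erase_counts)), key=self.erase_counts.__getitem__)
        lo = min(range(len(self.erase_counts)), key=self.erase_counts.__getitem__)
        if self.erase_counts[hi] - self.erase_counts[lo] > max_skew:
            self.tenant_of[hi], self.tenant_of[lo] = self.tenant_of[lo], self.tenant_of[hi]
            return (hi, lo)
        return None

ssd = ChannelIsolatedSSD(num_channels=2)
ssd.attach("tenant-A", 0)                 # write-heavy tenant on channel 0
ssd.attach("tenant-B", 1)
ssd.record_erases(0, 5000)
ssd.record_erases(1, 300)
print(ssd.rebalance(max_skew=1000))       # (0, 1): the tenants trade channels
print(ssd.tenant_of)                      # {0: 'tenant-B', 1: 'tenant-A'}
```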

DIDACache: A Deep Integration of Device and Application for Flash Based Key-Value Caching

Zhaoyan Shen, Hong Kong Polytechnic University; Feng Chen and Yichen Jia, Louisiana State University; Zili Shao, Hong Kong Polytechnic University

Available Media

In recent years, flash-based key-value cache systems, such as Facebook’s McDipper and Twitter’s Fatcache, have attracted significant interest in industry. These cache systems typically use commercial SSDs to store and manage key-value cache data in flash. Such a practice, though simple, is inefficient due to the huge semantic gap between the key-value cache manager and the underlying flash devices. In this paper, we advocate reconsidering the cache system design and directly opening device-level details of the underlying flash storage for key-value caching. This co-design approach bridges the semantic gap and connects the two layers well, which allows us to leverage both the domain knowledge of key-value caches and the unique device properties. In this way, we can maximize the efficiency of key-value caching on flash devices while minimizing its weakness. We implemented a prototype, called DIDACache, based on the Open-Channel SSD platform. Our experiments on real hardware show that we can significantly increase the throughput by 35.5%, reduce the latency by 23.6%, and remove unnecessary erase operations by 28%.