FAST '18 Technical Sessions

All sessions will be held in East Hall unless otherwise noted.

The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

Full Proceedings PDFs
 FAST '18 Full Proceedings (PDF)
 FAST '18 Proceedings Interior (PDF, best for mobile devices)
 FAST '18 Errata Slip (PDF)

Full Proceedings ePub (for iPad and most eReaders)
 FAST '18 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
 FAST '18 Full Proceedings (Mobi)

Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)

Attendee Files 
FAST '18 Attendee List (PDF)
FAST '18 Web Archive (ZIP)

Tuesday, February 13

8:00 am–9:00 am

Continental Breakfast

East Hall Foyer

9:00 am–9:15 am

Opening Remarks and Best Paper Awards

Program Co-Chairs: Nitin Agrawal, Samsung Research, and Raju Rangaswami, Florida International University

9:15 am–10:30 am

Keynote Address

Session Chair: Greg Ganger, Carnegie Mellon University

RAID to ML: A Big Data Evolution

Garth Gibson, The Vector Institute, University of Toronto, and Carnegie Mellon University

Garth Gibson is President and CEO of the Vector Institute for AI, a new research institute in Toronto, Canada, specializing in machine learning and deep learning. Created in 2017 with a five-year commitment of $135M (CDN), Vector comprises a growing community of 20 faculty, 20 staff, 100 students, and 35 sponsor companies.

Garth holds an academic appointment as professor of Computer Science at Carnegie Mellon University (CMU), and courtesy appointments in the Department of Electrical and Computer Engineering at CMU and the Department of Computer Science at the University of Toronto. He holds a PhD (1991) and an MS from the University of California, Berkeley, and a Bachelor of Mathematics from the University of Waterloo. In 1988, with David Patterson and Randy Katz, Garth published “A Case for Redundant Arrays of Inexpensive Disks (RAID),” which became the core of his dissertation and has been cited almost 4000 times according to Google Scholar.

Garth founded CMU’s Parallel Data Laboratory in 1992, whose consortium now includes 19 companies, and a community of more than 25 faculty, 10 staff, and 50 students, under the leadership of Professor Greg Ganger. He was a founding member of the steering committees for the USENIX File and Storage Technologies Conference (FAST) and the ACM SIGHPC Parallel Data Storage Workshop (PDSW), having recently resigned to focus on machine learning at Vector. Garth also founded and acted as Chief Technology Officer, then Chief Scientist, for a scalable storage and file system company, Panasas, whose technology was described in FAST ’08. In the last decade Garth co-created CMU’s Master of Computational Data Science degree, led a Systems major within that degree, and served as the School of Computer Science’s Associate Dean for Master’s Programs, a population of about 1000 students in 20 programs.

Garth Gibson is a Fellow of the ACM and the IEEE, and a holder of research awards including the SIGOPS Hall of Fame Award, the Jean-Claude Laprie Award in Dependable Computing, the SIGMOD Test of Time Award, the IEEE Reynold B. Johnson Information Storage Award, the J. Wesley Graham Medal in Computing and Innovation, and a Los Alamos National Laboratory Outstanding Innovation Award.

10:30 am–11:00 am

Break with Refreshments

East Hall Foyer, Sponsored by Alibaba Group

11:00 am–12:30 pm

Failing and Recovering

Session Chair: Vasily Tarasov, IBM Research

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; H. Birali Runesha, Mingzhe Hao, and Huaicheng Li, University of Chicago

Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 12 institutions. We show that all hardware types, including disks, SSDs, CPUs, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Protocol-Aware Recovery for Consensus-Based Storage

Ramnatthan Alagappan and Aishwarya Ganesan, University of Wisconsin—Madison; Eric Lee, University of Texas at Austin; Aws Albarghouthi, University of Wisconsin—Madison; Vijay Chidambaram, University of Texas at Austin; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison
Awarded Best Paper!

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.

WAFL Iron: Repairing Live Enterprise File Systems

Ram Kesavan, NetApp, Inc.; Harendra Kumar, Composewell Technologies; Sushrut Bhowmik, NetApp, Inc.

Consistent and timely access to an arbitrarily damaged file system is an important requirement of enterprise-class systems. Repairing file system inconsistencies is accomplished most simply when file system access is limited to the repair tool. Checking and repairing a file system while it is open for general access, however, presents unique challenges. In this paper, we explore these challenges, present our online repair tool for the NetApp® WAFL® file system, and show how it achieves the same results as offline repair even while client access is enabled. We present some implementation details and evaluate its performance. To the best of our knowledge, this publication is the first to describe a fully functional online repair tool.

12:30 pm–2:00 pm

Conference Luncheon and Test of Time Award Presentation

Grand Ballroom, Sponsored by NetApp

View all Test of Time award winners.

2:00 pm–3:30 pm

Revealing Flashy Secrets

Session Chair: Ming Zhao, Arizona State University

MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices

Arash Tavakkol, Juan Gómez-Luna, and Mohammad Sadrosadati, ETH Zürich; Saugata Ghose, Carnegie Mellon University; Onur Mutlu, ETH Zürich and Carnegie Mellon University

Solid-state drives (SSDs) are used in a wide array of computer systems today, including in datacenters and enterprise servers. As the I/O demands of these systems have increased, manufacturers have evolved SSD design to keep up with this demand. For example, manufacturers have introduced new high-bandwidth interfaces to replace the traditional SATA protocol. These new interfaces, such as the NVMe protocol, are designed specifically to enable the high amount of concurrent I/O that SSDs are capable of delivering.

While modern SSDs with sophisticated features such as the NVMe protocol are already on the market, SSD simulation tools have fallen behind, as they do not capture these new features. We compare the outputs of these simulators to the performance measured from real off-the-shelf SSDs, and find three major shortcomings in state-of-the-art SSD simulators. First, existing simulators do not model critical features of new protocols like NVMe, such as their use of multiple application-level queues for requests and the elimination of OS intervention. Second, existing simulators do not capture the effects of advanced SSD maintenance algorithms (e.g., garbage collection), as they do not properly emulate the steady-state conditions that exist in real SSDs. Third, existing simulators do not capture the full end-to-end latency of I/O requests, which can incorrectly skew the simulated behavior of SSDs that use emerging memory technologies, such as 3D XPoint. We show that without the accurate modeling of these features, results from existing simulators deviate significantly from real SSD performance.

In this work, we introduce a new simulator, called MQSim, that accurately models the performance of both modern SSDs and conventional SATA-based SSDs. MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency for requests in modern SSDs. We validate MQSim using several modern SSDs, and show that MQSim uncovers several real and important issues that were not captured by existing simulators, such as the performance impact of inter-flow interference in modern SSDs. We plan to release MQSim as an open-source tool, and we hope that it can enable several new research directions in the future.

PEN: Design and Evaluation of Partial-Erase for 3D NAND-Based High Density SSDs

Chun-yi Liu and Jagadish Kotra, The Pennsylvania State University; Myoungsoo Jung, Yonsei University; Mahmut Kandemir, The Pennsylvania State University

3D NAND flash memories promise unprecedented flash storage capacities, which can be extremely important in application domains where both storage capacity and performance are first-class target metrics. However, a block of 3D NAND flash contains many more pages than its 2D counterpart. This increased number of pages per block has numerous ramifications, such as longer erase latency, higher garbage collection costs, and increased write amplification, which can collectively prevent 3D NAND flash products from becoming mainstream in the high-performance storage domain. In this paper, we introduce PEN, an architecture-level mechanism that enables partial-erase of flash blocks. Using our proposed partial-erase support, we also discuss how one can build a custom garbage collector for two types of flash translation layers (FTLs), namely, block-level FTL and hybrid FTL. Our experimental evaluations of PEN with a set of diverse real storage workloads indicate that the proposed approach can shorten the write latency by 44.3% and 47.9% for block-level FTL and hybrid FTL, respectively.

The CASE of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator

Huaicheng Li, Mingzhe Hao, and Michael Hao Tong, University of Chicago; Swaminathan Sundararaman, Parallel Machines; Matias Bjørling, CNEX Labs; Haryadi S. Gunawi, University of Chicago

We present FEMU, a QEMU-based flash emulator for fostering future full-stack software/hardware SSD research, with the following four "CASE" benefits. FEMU is cheap ($0), as it will be released as open-source software; FEMU is relatively accurate, with only 0.5-38% variance from an OpenChannel SSD in our tests; FEMU is scalable, on top of our optimized QEMU stack, supporting up to 32 parallel channels/chips without unintended queueing delays; and FEMU is extensible, enabling various types of SSD research, such as internal-SSD, kernel-only, and split-level research.

3:30 pm–4:00 pm

Break with Refreshments

East Hall Foyer

4:00 pm–5:30 pm

Understanding the Meta(data) Story

Session Chair: Dean Hildebrand, Google

Spiffy: Enabling File-System Aware Storage Applications

Kuei Sun, Daniel Fryer, Joseph Chu, Matthew Lakier, Angela Demke Brown, and Ashvin Goel, University of Toronto

Many file system applications, such as defragmentation tools, file system checkers, and data recovery tools, operate at the storage layer. Today, developers of these storage applications require detailed knowledge of the file system format, which takes a significant amount of time to learn, often by trial and error, due to insufficient documentation or specification of the format. Furthermore, these applications perform ad hoc processing of the file-system metadata, leading to bugs and vulnerabilities.

We propose Spiffy, an annotation language for specifying the on-disk format of a file system. File system developers annotate the data structures of a file system, and we use these annotations to generate a library that allows identifying, parsing and traversing file-system metadata, providing support for both offline and online storage applications. This approach simplifies the development of storage applications that work across different file systems because it reduces the amount of file-system specific code that needs to be written.

We have written annotations for the Linux Ext4, Btrfs and F2FS file systems, and developed several applications for these file systems, including a type-specific metadata corruptor, a file system converter, and an online storage layer cache that preferentially caches files for certain users. Our experiments show that applications that use the library to access file system metadata can achieve good performance and are robust against file system corruption errors.
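
To make the annotation idea concrete, the sketch below derives a generic metadata parser from a declarative layout description. This is a hypothetical illustration in Python, not Spiffy's actual annotation syntax (which annotates the file system's C data structures); the layout, field names, and parse helper are invented.

    import struct

    # Hypothetical layout description for a simplified ext2-style superblock.
    # Each entry: (field name, struct format). Offsets follow declaration order.
    SUPERBLOCK_LAYOUT = [
        ("inodes_count", "<I"),    # total inodes, little-endian u32
        ("blocks_count", "<I"),    # total blocks
        ("first_data_block", "<I"),
        ("log_block_size", "<I"),  # block size = 1024 << log_block_size
    ]

    def parse(layout, raw):
        """Generic parser derived from a layout description."""
        out, off = {}, 0
        for name, fmt in layout:
            (out[name],) = struct.unpack_from(fmt, raw, off)
            off += struct.calcsize(fmt)
        return out

    # A checker, converter, or cache built on such a generated library needs
    # no hand-written, file-system-specific parsing code.
    sb = parse(SUPERBLOCK_LAYOUT, bytes(16))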

Towards Robust File System Checkers

Om Rameshwar Gatla, Muhammad Hameed, and Mai Zheng, New Mexico State University; Viacheslav Dubeyko, Adam Manzanares, Filip Blagojevic, Cyril Guyot, and Robert Mateescu, Western Digital Research

File systems may become corrupted for many reasons despite various protection techniques. Therefore, most file systems come with a checker to recover the file system to a consistent state. However, existing checkers are commonly assumed to be able to complete the repair without interruption, which may not be true in practice.

In this work, we demonstrate via fault injection experiments that checkers of widely used file systems may leave the file system in an uncorrectable state if the repair procedure is interrupted unexpectedly. To address the problem, we first fix the ordering issue in the undo logging of e2fsck, and then build a general logging library (i.e., rfsck-lib) for strengthening checkers. To demonstrate the practicality, we integrate rfsck-lib with existing checkers and create two new checkers: (1) rfsck-ext, a robust checker for Ext-family file systems, and (2) rfsck-xfs, a robust checker for the XFS file system, both of which require only tens of lines of modification to the original versions. Both rfsck-ext and rfsck-xfs are resilient to faults in our experiments. Also, both checkers incur reasonable performance overhead (i.e., up to 12%) compared to the original unreliable versions. Moreover, rfsck-ext outperforms the patched e2fsck by up to nine times while achieving the same level of robustness.
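
The logging discipline such a library must enforce can be sketched in a few lines. This is a minimal illustration of undo logging, with the log record forced to disk before the in-place write, assuming plain file descriptors; it is not rfsck-lib's actual API.

    import os

    def logged_repair(fs_fd, log_fd, offset, new_block, block_size=4096):
        """Undo-log a block repair; a sketch, not rfsck-lib's interface."""
        # 1. Read and log the original contents of the block.
        os.lseek(fs_fd, offset, os.SEEK_SET)
        old_block = os.read(fs_fd, block_size)
        os.write(log_fd, offset.to_bytes(8, "little") + old_block)
        # 2. Force the undo record to stable storage *before* the overwrite,
        #    so an interrupted repair can always be rolled back. (The paper
        #    reports fixing an ordering issue of this kind in e2fsck.)
        os.fsync(log_fd)
        # 3. Only now apply the repair in place.
        os.lseek(fs_fd, offset, os.SEEK_SET)
        os.write(fs_fd, new_block)
        os.fsync(fs_fd)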

The Full Path to Full-Path Indexing

Yang Zhan, The University of North Carolina at Chapel Hill; Alex Conway, Rutgers University; Yizheng Jiao, The University of North Carolina at Chapel Hill; Eric Knorr, Rutgers University; Michael A. Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; William Jannen, Williams College; Rob Johnson, VMware Research; Donald E. Porter, The University of North Carolina at Chapel Hill; Jun Yuan, Stony Brook University

This paper shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. Prior results indicated that renames are prohibitively expensive in full-path indexing.

The paper introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value-store API, and simplifies the overall design of the file system.

We implemented this mechanism in ArborFS, an extension of the BetrFS in-kernel, local file system for Linux. ArborFS performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS, while renames remain competitive with standard, indirection-based file systems for a range of sizes. ArborFS outperforms relative-path file systems such as BetrFS as well as traditional file systems such as ext4, xfs and zfs across a variety of workloads.
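
For intuition, the sketch below shows the semantics of range-rename on a full-path index: one key-space operation moves an entire directory subtree. A plain Python dict stands in for the sorted, write-optimized dictionary; the real mechanism changes the key space inside the dictionary rather than rewriting every key-value pair.

    # Full-path index: every file is keyed by its full path. In the real
    # system the keys are kept sorted, so one directory's entries form a
    # single contiguous key range (fast scans); a plain dict is enough to
    # show the rename semantics.
    index = {
        "/home/a/x.txt": b"...",
        "/home/a/y.txt": b"...",
        "/home/b/z.txt": b"...",
    }

    def range_rename(index, src_prefix, dst_prefix):
        """Semantic sketch of range-rename: move the whole key range under
        src_prefix to dst_prefix. The actual mechanism performs this as an
        efficient key-space change, not a key-by-key rewrite."""
        moved = {k: index.pop(k) for k in list(index) if k.startswith(src_prefix)}
        for k, v in moved.items():
            index[dst_prefix + k[len(src_prefix):]] = v

    range_rename(index, "/home/a/", "/home/c/")
    assert sorted(index) == ["/home/b/z.txt", "/home/c/x.txt", "/home/c/y.txt"]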

6:00 pm–7:30 pm

Poster Session and Happy Hour

Grand Ballroom

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Happy Hour. Take part in discussions with your colleagues over complimentary food and drinks. View the list of accepted posters.

Wednesday, February 14

8:00 am–9:00 am

Continental Breakfast

East Hall Foyer

9:00 am–10:30 am

Coding, Hashing, Hiding

Session Chair: Ethan L. Miller, University of California, Santa Cruz, and Pure Storage

Clay Codes: Moulding MDS Codes to Yield an MSR Code

Myna Vajha, Vinayak Ramkumar, Bhagyashree Puranik, Ganesh Kini, Elita Lobo, Birenjith Sasidharan, and P. Vijay Kumar, Indian Institute of Science, Bangalore; Alexander Barg and Min Ye, University of Maryland; Srinivasan Narayanamurthy, Syed Hussain, and Siddhartha Nandi, NetApp ATG, Bangalore

As data centers grow in scale, the number of node failures increases sharply. To ensure availability of data, failure-tolerance schemes such as Reed-Solomon (RS) or, more generally, Maximum Distance Separable (MDS) erasure codes are used. However, while MDS codes offer minimum storage overhead for a given amount of failure tolerance, they do not meet other practical needs of today's data centers. Although modern codes such as Minimum Storage Regenerating (MSR) codes are designed to meet these practical needs, they are available only in highly constrained theoretical constructions that are not mature enough for practical implementation. We present Clay codes that extract the best from both worlds. Clay (short for Coupled-Layer) codes are MSR codes that offer a simplified construction for decoding/repair by using pairwise coupling across multiple stacked layers of any single MDS code. In addition, Clay codes provide the first practical implementation of an MSR code that offers (a) low storage overhead, (b) simultaneous optimality in terms of three key parameters: repair bandwidth, sub-packetization level and disk I/O, (c) uniform repair performance of data and parity nodes, and (d) support for both single and multiple-node repairs, while permitting faster and more efficient repair.

While all MSR codes are vector codes, none of the distributed storage systems support vector codes. We have modified Ceph to support any vector code, and our contribution is now a part of Ceph's master codebase. We have implemented Clay codes and integrated them as a plugin to Ceph. Six example Clay codes were evaluated on a cluster of Amazon EC2 instances, with code parameters carefully chosen to match known erasure-code deployments in practice. A particular example code, with storage overhead 1.25x, is shown to reduce repair network traffic by a factor of 2.9 in comparison with RS codes; similar reductions are obtained for both repair time and disk reads.
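
For context, the reported reduction is consistent with the standard MSR repair bound: repairing one node downloads d/(d-k+1) node-sized units from d helpers, versus k units for an RS repair. A quick check, with (n, k, d) = (20, 16, 19) assumed here as the parameters of the 1.25x-overhead example:

    # MSR repair bound: traffic = d / (d - k + 1) node-sized units, vs. k for RS.
    # Parameters assumed for the 1.25x-overhead example code (n/k = 20/16).
    n, k, d = 20, 16, 19
    rs_traffic = k                      # RS reads k full nodes to rebuild one
    msr_traffic = d / (d - k + 1)       # 19/4 = 4.75 node-sized units
    print(n / k, rs_traffic / msr_traffic)   # 1.25x overhead, ~3.4x less traffic
    # The 2.9x measured on Ceph/EC2 is in line with this theoretical bound.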

Towards Web-based Delta Synchronization for Cloud Storage Services

He Xiao and Zhenhua Li, Tsinghua University; Ennan Zhai, Yale University; Tianyin Xu, UIUC; Yang Li and Yunhao Liu, Tsinghua University; Quanlu Zhang, Microsoft Research; Yao Liu, SUNY Binghamton

Delta synchronization (sync) is crucial for network-level efficiency of cloud storage services. Practical delta sync techniques are, however, only available for PC clients and mobile apps, but not web browsers, the most pervasive and OS-independent access method. To understand the obstacles of web-based delta sync, we implement a delta sync solution, WebRsync, using state-of-the-art web techniques based on rsync, the de facto delta sync protocol for PC clients. Our measurements show that WebRsync severely suffers from the inefficiency of JavaScript execution inside web browsers, thus leading to frequent stagnation and even hanging. Given that the computation burden on the web browser mainly stems from data chunk search and comparison, we reverse the traditional delta sync approach by lifting all chunk search and comparison operations from the client side to the server side. Inevitably, this brings considerable computation overhead to the servers. Hence, we further leverage locality-aware chunk matching and lightweight checksum algorithms to reduce the overhead. The resulting solution, WebR2sync+, outpaces WebRsync by an order of magnitude, and is able to simultaneously support 6800-8500 web clients' delta sync using a standard VM server instance based on a Dropbox-like system architecture.
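
A minimal sketch of the reversed protocol, using fixed-size chunks and standard checksums (the actual system adds rolling checksums, the locality-aware matching, and the lighter hashing described above):

    import hashlib, zlib

    CHUNK = 4096

    def fingerprints(data):
        """Client side: compute cheap per-chunk fingerprints only; no matching."""
        return [(zlib.adler32(data[i:i + CHUNK]),
                 hashlib.md5(data[i:i + CHUNK]).hexdigest())
                for i in range(0, len(data), CHUNK)]

    def server_delta(server_data, client_fps):
        """Server side: all chunk search and comparison happens here."""
        known = {fp: idx for idx, fp in enumerate(client_fps)}
        delta = []
        for i in range(0, len(server_data), CHUNK):
            chunk = server_data[i:i + CHUNK]
            fp = (zlib.adler32(chunk), hashlib.md5(chunk).hexdigest())
            # Reference a chunk the client already has, else send literal bytes.
            delta.append(("copy", known[fp]) if fp in known else ("data", chunk))
        return delta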

Stash in a Flash

Aviad Zuck, Technion—Israel Institute of Technology; Yue Li and Jehoshua Bruck, California Institute of Technology; Donald E. Porter, The University of North Carolina at Chapel Hill; Dan Tsafrir, Technion—Israel Institute of Technology and VMware Research Group

Encryption is a useful tool to protect data confidentiality. Yet it is still challenging to hide the very presence of encrypted, secret data from a powerful adversary. This paper presents a new technique to hide data in flash by manipulating the voltage level of pseudo-randomly selected flash cells to encode two bits (rather than one) in the cell. In this model, we have one “public” bit interpreted using an SLC-style encoding, and extract a private bit using an MLC-style encoding. The locations of cells that encode hidden data are based on a secret key known only to the hiding user.

Intuitively, this technique requires that the voltage level in a cell encoding hidden data be (1) statistically indistinguishable from that of a cell storing only public data, and (2) reliably readable by the hiding user. Our key insight is that there is a wide enough variation in the range of voltage levels in a typical flash device to obscure the presence of fine-grained changes to a small fraction of the cells, and that the variation is wide enough to support reliably re-reading hidden data. We demonstrate that our hidden data and underlying voltage manipulations go undetected by supervised learning based on support vector machines, which performs no better than a random guess. The error rates of our scheme are low enough that the data is recoverable months after being stored. Compared to prior work, our technique provides 24x and 50x higher encoding and decoding throughput, respectively, doubles the capacity, and is 37x more power efficient.
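
The encoding model can be illustrated with four nominal voltage levels. The values below are invented for the sketch; real cells have wide, noisy voltage distributions, which is exactly what hides the manipulation.

    # Four nominal voltage levels encode (public, hidden) bit pairs so that a
    # coarse SLC-style read (one threshold) recovers only the public bit.
    # Voltage values are illustrative, not the paper's measured thresholds.
    LEVELS = {(1, 1): 0.0, (1, 0): 1.0,   # public 1 -> lower half (erased-ish)
              (0, 1): 2.0, (0, 0): 3.0}   # public 0 -> upper half (programmed)

    def read_slc(v):
        """Anyone: one threshold at the midpoint yields the public bit."""
        return 1 if v < 1.5 else 0

    def read_mlc(v):
        """Key holder: finer MLC-style thresholds also yield the hidden bit."""
        return 1 if (v % 2.0) < 1.0 else 0

    for (public, hidden), v in LEVELS.items():
        assert read_slc(v) == public and read_mlc(v) == hidden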

10:30 am–11:00 am

Break with Refreshments

East Hall Foyer

11:00 am–12:30 pm

New Media and Old

Session Chair: Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology)

Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree

Deukyeon Hwang and Wook-Hee Kim, UNIST; Youjip Won, Hanyang University; Beomseok Nam, UNIST

With the emergence of byte-addressable persistent memory (PM), a cache line, instead of a page, is expected to be the unit of data transfer between volatile and nonvolatile devices, but the failure-atomicity of write operations is guaranteed at the granularity of 8 bytes rather than cache lines. This granularity mismatch problem has generated interest in redesigning block-based data structures such as B+-trees, and attempts have been made to use in-memory data structures for PM. However, various methods of modifying B+-trees for PM degrade the efficiency of B+-trees due to the additional metadata and high rebalancing overhead caused by logging methods.

In this study, we develop the Failure-Atomic ShifT (FAST) and Failure-Atomic In-place Rebalance (FAIR) algorithms. FAST and FAIR modify legacy B+-trees in a byte-addressable fashion and solve the granularity mismatch problem. Every 8-byte store instruction used in the FAST and FAIR algorithms transforms a B+-tree into either another consistent state or a transient inconsistent state that read operations can tolerate. By making read operations tolerate transient inconsistency, we can eliminate expensive copy-on-write, logging, and even the need for read latches, so that read transactions can be non-blocking. Our experimental results show that legacy B+-trees with the FAST and FAIR schemes outperform the state-of-the-art persistent indexing structures by a large margin.
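
The FAST idea in miniature: entries shift right to left using only single-word stores, so every intermediate state is either the old array or the old array with one duplicate, which readers simply skip. Below is a simplified sketch; the real algorithm operates on persistent-memory node segments with cache-line flushes and memory fences between the 8-byte stores.

    def fast_insert(slots, key):
        """FAST-style insert into a sorted slot array that has a spare None
        slot at the end. Each assignment models one 8-byte store; after any
        single store the array is still readable, containing at most one
        duplicated entry and never a torn one."""
        i = sum(s is not None for s in slots)     # index of first free slot
        while i > 0 and slots[i - 1] > key:
            slots[i] = slots[i - 1]               # shift right; duplicate is ok
            i -= 1                                # on real PM: fence after store
        slots[i] = key                            # final store publishes the key

    def read_all(slots):
        """Latch-free reader: tolerate a transient duplicate by skipping it."""
        out = []
        for s in slots:
            if s is not None and (not out or s != out[-1]):
                out.append(s)
        return out

    node = [10, 20, 40, None]
    fast_insert(node, 30)
    assert read_all(node) == [10, 20, 30, 40]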

RFLUSH: Rethink the Flush

Jeseong Yeon and Minseong Jeong, Chungbuk National University; Sungjin Lee, DGIST; Eunji Lee, Chungbuk National University and University of Wisconsin—Madison

A FLUSH command has been used for decades to enforce persistence and ordering of updates in a storage device. The command forces all the data in the volatile buffer to non-volatile media to achieve persistence. This lump-sum approach to flushing has two performance consequences. First, it slows down non-volatile materialization of the writes that actually need to be flushed. Second, it deprives the writes that need not be flushed of an opportunity to absorb future writes and coalesce. We characterize the problems of this semantic gap of flushing in storage devices and propose RFLUSH, which allows fine-grained control over flushing. The RFLUSH command delivers a range of LBAs that need to be flushed, enabling the storage device to force only a subset of the data in its buffer. We implemented this fine-grained flush command in a storage device using an open-source flash development platform and, as a case study, modified the F2FS file system to make use of the command in processing fsync requests. Performance evaluation using the prototype implementation shows that RFLUSH improves throughput by up to 5.6x, reduces write traffic by up to 43%, and eliminates the long tail in the response time.
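
A toy model of the command's semantics (a hypothetical device-side sketch, not the prototype's firmware): a ranged flush forces only buffered writes whose LBAs fall inside the range, while other dirty data stays in the buffer to keep absorbing and coalescing writes.

    class WriteBuffer:
        """Toy model of a device write buffer with FLUSH vs. RFLUSH."""
        def __init__(self, media):
            self.media = media       # persistent store: dict lba -> data
            self.dirty = {}          # volatile buffer: dict lba -> data

        def write(self, lba, data):
            self.dirty[lba] = data   # absorbed in the volatile buffer

        def flush(self):             # legacy FLUSH: everything goes down
            self.media.update(self.dirty)
            self.dirty.clear()

        def rflush(self, lba_start, lba_end):
            """Force only [lba_start, lba_end]; the rest keeps coalescing."""
            for lba in [l for l in self.dirty if lba_start <= l <= lba_end]:
                self.media[lba] = self.dirty.pop(lba)

    # fsync(file) can now translate to an RFLUSH over just that file's LBAs.
    buf = WriteBuffer(media={})
    buf.write(100, b"journal"); buf.write(900, b"unrelated")
    buf.rflush(0, 499)
    assert 100 in buf.media and 900 not in buf.media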

Barrier-Enabled IO Stack for Flash Storage

Youjip Won, Hanyang University; Jaemin Jung, Texas A&M University; Gyeongyeol Choi, Joontaek Oh, and Seongbae Son, Hanyang University; Jooyoung Hwang and Sangyeun Cho, Samsung Electronics
Awarded Best Paper!

This work is dedicated to eliminating the overhead required for guaranteeing the storage order in the modern IO stack. The existing block device adopts a prohibitively expensive approach to ensuring the storage order among write requests: interleaving the write requests with Transfer-and-Flush. Exploiting the cache barrier command of Flash storage, we overhaul the IO scheduler, the dispatch module, and the filesystem so that these layers are orchestrated to preserve the ordering condition imposed by the application under which the associated data blocks are made durable. The key ingredients of the Barrier-Enabled IO stack are Epoch-based IO scheduling, Order-Preserving Dispatch, and Dual-Mode Journaling. The barrier-enabled IO stack can control the storage order without Transfer-and-Flush overhead. We implement the barrier-enabled IO stack on server as well as mobile platforms. SQLite performance increases by 270% and 75% on the server and the smartphone, respectively. In server storage, BarrierFS brings as much as 43x and 73x performance gains in MySQL and SQLite, respectively, against EXT4, by relaxing the durability of a transaction.
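
Conceptually, epoch-based IO scheduling replaces Transfer-and-Flush with ordering constraints: requests between barriers form an epoch, any order is allowed within an epoch, and epochs are dispatched in order. A toy sketch of that idea (not the kernel implementation):

    from collections import deque

    class EpochScheduler:
        """Conceptual sketch: ordering via epochs instead of Transfer-and-Flush."""
        def __init__(self):
            self.epochs = deque([[]])          # oldest epoch on the left

        def submit(self, req):
            self.epochs[-1].append(req)        # request joins the current epoch

        def barrier(self):
            self.epochs.append([])             # seal the epoch; no flush, no wait

        def dispatch_all(self, device):
            while self.epochs:
                for req in self.epochs.popleft():   # any order within an epoch
                    device.append(req)
                # a cache-barrier command (cheaper than a flush) keeps the
                # device from persisting later epochs ahead of this one
                device.append("CACHE_BARRIER")

    dev = []
    s = EpochScheduler()
    s.submit("journal blocks"); s.barrier(); s.submit("commit block")
    s.dispatch_all(dev)    # the commit block can never be persisted first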

12:30 pm–2:00 pm

Lunch (on your own)

2:00 pm–3:30 pm

Long Live the File System!

Session Chair: Andrew Warfield, University of British Columbia

High-Performance Transaction Processing in Journaling File Systems

Yongseok Son, Sunggon Kim, and Heon Young Yeom, Seoul National University; Hyuck Han, Dongduk Women's University

Journaling file systems provide crash-consistency to applications by keeping track of uncommitted changes in the journal area (journaling) and writing committed changes to their original area at a certain point (checkpointing). They generally use coarse-grained locking to access shared data structures and perform I/O operations with a single thread. For these reasons, journaling file systems often face the problems of lock contention and underutilization of I/O bandwidth on multi-cores with high-performance storage. To address these issues, we design journaling and checkpointing schemes that enable concurrent updates on data structures and parallelize I/O operations. We implement our schemes in EXT4/JBD2 and evaluate them on a 72-core machine with a high-performance NVMe SSD. The experimental results show that our optimized file system improves performance by up to about 2.2x and 1.5x compared to the existing EXT4 file system and a state-of-the-art scalable file system, respectively.

Designing a True Direct-Access File System with DevFS

Sudarsun Kannan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison; Yuangang Wang, Jun Xu, and Gopinath Palani, Huawei Technologies

We present DevFS, a direct-access file system embedded completely within a storage device. DevFS provides direct access without compromising file system integrity, concurrency, crash consistency, and security. A novel reverse-caching mechanism enables the usage of host memory for inactive objects, thus reducing memory load upon the device. Evaluation of an emulated DevFS prototype shows more than 2x higher I/O throughput with direct access and up to a 78% reduction in device RAM utilization.

FStream: Managing Flash Streams in the File System

Eunhee Rho, Kanchan Joshi, Seung-Uk Shin, Nitesh Jagadeesh Shetty, Joo-Young Hwang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong, Samsung Electronics Co., Ltd.

The performance and lifespan of a solid-state drive (SSD) depend not only on the current input workload but also on its internal media fragmentation formed over time. The recently proposed streams feature gives the host system a means to control how data are placed on the physical media (abstracted by a stream) and effectively reduce media fragmentation. This work proposes FStream, a file system approach to taking advantage of this facility. FStream extracts streams at the file system level and avoids complex application-level mapping of data to streams. Experimental results show that FStream enhances filebench performance by 5%-35% and reduces WAF (Write Amplification Factor) by 7%-46%. For a NoSQL database benchmark, performance is improved by up to 38% and WAF is reduced by up to 81%.
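
The core policy can be sketched as a small mapping from the file-system role of a write to a stream ID. The IDs and role names below are invented for illustration; FStream separates, for example, journal, metadata, and user data writes, with optional file-name-based separation for database files.

    # Hypothetical stream assignment: group writes by expected lifetime so the
    # SSD can segregate them into different erase blocks (streams), cutting
    # garbage-collection copying and thus write amplification.
    STREAM_POLICY = {"journal": 1, "inode": 2, "dir": 3, "data": 4}

    def stream_for(role, path=""):
        """Pick a stream ID for a write based on the role of the data."""
        if role in STREAM_POLICY:
            return STREAM_POLICY[role]
        if path.endswith(".log"):        # e.g., separate database redo logs
            return 5
        return STREAM_POLICY["data"]

    # The chosen ID is attached to each write command, hinting to the device
    # that data with similar lifetimes should share flash blocks.
    assert stream_for("journal") == 1
    assert stream_for("file", "/db/redo.log") == 5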

3:30 pm–4:00 pm

Break with Refreshments

East Hall Foyer

4:00 pm–5:30 pm

6:00 pm–8:00 pm

Conference Reception

Grand Ballroom

Mingle with fellow attendees at the Conference Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers.

Thursday, February 15

8:00 am–9:00 am

Continental Breakfast

East Hall Foyer

9:00 am–10:30 am

Distribute and Conquer

Session Chair: Angela Demke Brown, University of Toronto

Improving Docker Registry Design Based on Production Workload Analysis

Ali Anwar, Virginia Tech; Mohamed Mohamed and Vasily Tarasov, IBM Research—Almaden; Michael Littley, Virginia Tech; Lukas Rupprecht, IBM Research—Almaden; Yue Cheng, George Mason University; Nannan Zhao, Virginia Tech; Dimitrios Skourtis, Amit S. Warke, Heiko Ludwig, and Dean Hildebrand, IBM Research—Almaden; Ali R. Butt, Virginia Tech

Containers offer an efficient way to run workloads as independent microservices that can be developed, tested and deployed in an agile manner. To facilitate this process, container frameworks offer a registry service that enables users to publish and version container images and share them with others. The registry service plays a critical role in the startup time of containers since many container starts entail the retrieval of container images from a registry. To support research efforts on optimizing the registry service, large-scale and realistic traces are required. In this paper, we perform a comprehensive characterization of a large-scale registry workload based on traces that we collected over the course of 75 days from five IBM data centers hosting production-level registries. We present a trace replayer to perform our analysis and infer a number of crucial insights about container workloads, such as request type distribution, access patterns, and response times. Based on these insights, we derive design implications for the registry and demonstrate their ability to improve performance. Both the traces and the replayer are open-sourced to facilitate further research.

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures

Guangyan Zhang and Zican Huang, Tsinghua University; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Songlin Yang, Zhufan Wang, and Weimin Zheng, Tsinghua University

Existing RAID solutions partition large disk enclosures so that each RAID group uses its own disks exclusively. This achieves good performance isolation across underlying disk groups, at the cost of disk under-utilization and slow RAID reconstruction from disk failures.

We propose RAID+, a new RAID construction mechanism that spreads both normal I/O and reconstruction workloads to a larger disk pool in a balanced manner. Unlike systems conducting randomized placement, RAID+ employs deterministic addressing enabled by the mathematical properties of mutually orthogonal Latin squares, based on which it constructs 3-D data templates mapping a logical data volume to uniformly distributed disk blocks across all disks. While the total read/write volume remains unchanged, with or without disk failures, many more disk drives participate in data service and disk reconstruction. Our evaluation with a 60-drive disk enclosure using both synthetic and real-world workloads shows that RAID+ significantly speeds up data recovery while delivering better normal I/O performance and higher multi-tenant system throughput.
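
The mathematical ingredient is compact enough to show directly: for a prime p, the squares L_a(i, j) = (a*i + j) mod p, for a = 1..p-1, are mutually orthogonal, and RAID+'s 3-D templates build on this property. A toy check with p = 5 (real pools are larger):

    from itertools import combinations

    def latin_square(a, p):
        """L_a(i, j) = (a*i + j) mod p; a Latin square for prime p, a != 0."""
        return [[(a * i + j) % p for j in range(p)] for i in range(p)]

    p = 5                                    # toy pool; p must be prime here
    squares = [latin_square(a, p) for a in range(1, p)]

    # Mutual orthogonality: superimposing any two squares yields every ordered
    # pair exactly once. This is what lets RAID+ map a logical volume so that
    # any one disk's blocks, and hence its rebuild reads, spread uniformly
    # over the rest of the pool.
    for A, B in combinations(squares, 2):
        pairs = {(A[i][j], B[i][j]) for i in range(p) for j in range(p)}
        assert len(pairs) == p * p           # all p^2 ordered pairs appear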

Logical Synchronous Replication in the Tintri VMstore File System

Gideon Glass, Arjun Gopalan, Dattatraya Koujalagi, Abhinand Palicherla, and Sumedh Sakdeo, Tintri, Inc.

A standard feature of enterprise data storage systems is synchronous replication: updates received from clients by one storage system are replicated to a remote storage system and are only acknowledged to clients after having been stored persistently on both storage systems. Traditionally these replication schemes require configuration at a coarse granularity, e.g., on a LUN, filesystem volume, or whole-system basis. In contrast, we present a new architecture which operates at a fine granularity: individual files and directories. To implement this, we use a combination of novel per-file capabilities and existing techniques to solve the following problems: tracking parallel writes in flight on independent storage systems; replicating arbitrary filesystem operations; efficiently resynchronizing after a disconnect; and verifying the integrity of replicated data between two storage systems.
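
The basic invariant can be sketched in a few lines (hypothetical persist() interface, not Tintri's code): a client write is acknowledged only after both the local and the remote system have persisted it.

    from concurrent.futures import ThreadPoolExecutor

    def replicated_write(local, remote, offset, data):
        """Sketch of the synchronous-replication invariant: acknowledge the
        client only after the update is durable on both storage systems, so
        either copy can serve a consistent view after a failure."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            writes = [pool.submit(side.persist, offset, data)
                      for side in (local, remote)]
            for w in writes:
                w.result()        # block until durable; raises on failure
        return "ack"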

10:30 am–11:00 am

Break with Refreshments

East Hall Foyer

11:00 am–12:00 pm

Dedup: Last but Not Least

Session Chair: Youjip Won, Hanyang University

ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

Zhichao Cao, Hao Wen, Fenggang Wu, and David H.C. Du, Department of Computer Science, University of Minnesota, Twin Cities

Data deduplication has been widely applied in storage systems to improve the efficiency of space utilization. In data deduplication systems, restore performance is seriously hindered by read amplification, since the accessed data chunks are scattered over many containers. A container, consisting of hundreds or thousands of data chunks, is the unit of data read from or written to storage. Several schemes, such as forward assembly, container-based caching, and chunk-based caching, are used to reduce the number of container reads during the restore process. However, how to effectively use these schemes to get the best restore performance is still unclear.

In this paper, we first study the trade-offs of using these schemes in terms of read amplification and computing time. We then propose ALACC (Adaptive Look-Ahead Chunk Caching), a combined data chunk caching and forward assembly scheme that improves restore performance and adapts to different deduplication workloads within a fixed total amount of memory. This is accomplished by extending and shrinking the look-ahead window adaptively to cover an appropriate range of the data recipe, and by dynamically deciding how much memory to allocate to the forward assembly area and to chunk-based caching. Our evaluations show that the restore throughput of ALACC is higher than that of the best-performing configuration with a fixed split of the same amount of memory between forward assembly and chunk-based caching.
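
A simplified, non-adaptive sketch of look-ahead-assisted chunk caching during restore (ALACC additionally maintains a forward assembly area and resizes both the window and the memory split at run time):

    def restore(recipe, containers, cache_cap, lookahead):
        """Count container reads while restoring a deduplicated stream.
        recipe: ordered (container_id, chunk_id) pairs to write back.
        containers: container_id -> iterable of chunk_ids it holds."""
        cache, reads = set(), 0
        for i, ref in enumerate(recipe):
            if ref in cache:
                continue                           # served from the chunk cache
            reads += 1                             # read the whole container
            cid = ref[0]
            future = set(recipe[i + 1 : i + 1 + lookahead])
            for c in containers[cid]:
                # cache only chunks the look-ahead window proves will recur;
                # this avoids polluting the cache with never-reused chunks
                if (cid, c) in future and len(cache) < cache_cap:
                    cache.add((cid, c))
        return reads

    reads = restore(
        recipe=[("A", 1), ("B", 7), ("A", 1), ("A", 2)],
        containers={"A": [1, 2, 3], "B": [7]},
        cache_cap=64, lookahead=8)
    assert reads == 2          # both re-reads of container A are avoided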

UKSM: Swift Memory Deduplication via Hierarchical and Adaptive Memory Region Distilling

Nai Xia and Chen Tian, State Key Laboratory for Novel Software Technology, Nanjing University, China; Yan Luo and Hang Liu, Department of Electrical and Computer Engineering, University of Massachusetts Lowell, USA; Xiaoliang Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China

In cloud computing, deduplication can reduce memory footprint by eliminating redundant pages. The responsiveness of a deduplication process to newly generated memory pages is critical. State-of-the-art Content Based Page Sharing (CBPS) approaches lack responsiveness as they scan every page equally while finding redundancies. We propose a new deduplication system, UKSM, which prioritizes different memory regions to accelerate the deduplication process and minimize application penalty. With UKSM, memory regions are organized as a distilling hierarchy, where a region in a higher level receives more CPU cycles. UKSM adaptively promotes/demotes a region among levels according to the region’s estimated deduplication benefit and penalty. UKSM further introduces an adaptive partial-page hashing scheme which adjusts a global page-hashing strength parameter according to the global degree of page similarity. Experiments demonstrate that, with the same amount of CPU cycles in the same time envelope, UKSM can achieve up to 12.6x and 5x more memory saving than CBPS approaches on static and dynamic workloads, respectively.
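
A toy version of the distilling hierarchy (the promotion rule here is invented; UKSM weighs estimated benefit against CPU penalty per level): regions that keep yielding duplicate pages are scanned with a larger CPU quota, while cold regions sink to levels that cost almost nothing.

    class Region:
        """A memory region plus its level in the distilling hierarchy."""
        def __init__(self, pages):
            self.pages, self.level = pages, 0   # every region starts cold

    def scan_round(regions, max_level=3, base_quota=8):
        """One scan pass: a region at level L gets base_quota * 2**L pages
        scanned. Regions yielding duplicates are promoted so new redundancy
        is found quickly; cold regions sink and consume little CPU."""
        seen = set()                            # per-round page-content table
        for r in regions:
            dups = 0
            for page in r.pages[:base_quota * (2 ** r.level)]:
                if page in seen:
                    dups += 1                   # page would be merged (deduped)
                seen.add(page)
            r.level = min(max_level, r.level + 1) if dups else max(0, r.level - 1)

    hot = Region([b"\x00" * 4096] * 32)         # highly redundant region
    cold = Region([bytes([i]) * 4096 for i in range(32)])   # all-distinct pages
    for _ in range(3):
        scan_round([hot, cold])
    assert hot.level > cold.level               # hot region climbs the hierarchy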