# FAST '18 Technical Sessions

All sessions will be held in Hall East unless otherwise noted.

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

FAST '18 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
FAST '18 Full Proceedings (Mobi)

### This content is available to:

FAST '18 Attendee List (PDF)
FAST '18 Web Archive (ZIP)

East Hall Foyer

## Opening Remarks and Best Paper Awards

Program Co-Chairs: Nitin Agrawal, Samsung Research and Raju Rangaswami, Florida International University

### 9:15 am–10:30 am

Session Chair: Greg Ganger, Carnegie Mellon University

## RAID to ML: A Big Data Evolution

Garth Gibson, The Vector Institute, University of Toronto, and Carnegie Mellon University

East Hall Foyer

## Spiffy: Enabling File-System Aware Storage Applications

Kuei Sun, Daniel Fryer, Joseph Chu, Matthew Lakier, Angela Demke Brown, and Ashvin Goel, University of Toronto

Available Media

Many file system applications such as defragmentation tools, file system checkers or data recovery tools, operate at the storage layer. Today, developers of these storage applications require detailed knowledge of the file system format, which takes a significant amount of time to learn, often by trial and error, due to insufficient documentation or specification of the format. Furthermore, these applications perform ad-hoc processing of the file-system metadata, leading to bugs and vulnerabilities.

We propose Spiffy, an annotation language for specifying the on-disk format of a file system. File system developers annotate the data structures of a file system, and we use these annotations to generate a library that allows identifying, parsing and traversing file-system metadata, providing support for both offline and online storage applications. This approach simplifies the development of storage applications that work across different file systems because it reduces the amount of file-system specific code that needs to be written.

We have written annotations for the Linux Ext4, Btrfs and F2FS file systems, and developed several applications for these file systems, including a type-specific metadata corruptor, a file system converter, and an online storage layer cache that preferentially caches files for certain users. Our experiments show that applications that use the library to access file system metadata can achieve good performance and are robust against file system corruption errors.

## Towards Robust File System Checkers

Om Rameshwar Gatla, Muhammad Hameed, and Mai Zheng, New Mexico State University; Viacheslav Dubeyko, Adam Manzanares, Filip Blagojevic, Cyril Guyot, and Robert Mateescu, Western Digital Research

Available Media

File systems may become corrupted for many reasons despite various protection techniques. Therefore, most file systems come with a checker to recover the file system to a consistent state. However, existing checkers are commonly assumed to be able to complete the repair without interruption, which may not be true in practice.

In this work, we demonstrate via fault injection experiments that checkers of widely used file systems may leave the file system in an uncorrectable state if the repair procedure is interrupted unexpectedly. To address the problem, we first fix the ordering issue in the undo logging of e2fsck, and then build a general logging library (i.e., rfsck-lib) for strengthening checkers. To demonstrate the practicality, we integrate rfsck-lib with existing checkers and create two new checkers: (1) rfsck-ext, a robust checker for Ext-family file systems, and (2) rfsck-xfs, a robust checker for XFS file system, both of which require only tens of lines of modification to the original versions. Both rfsck-ext and rfsck-xfs are resilient to faults in our experiments. Also, both checkers incur reasonable performance overhead (i.e., up to 12%) comparing to the original unreliable versions. Moreover, rfsck-ext outperforms the patched e2fsck by up to nine times while achieving the same level of robustness.

## The Full Path to Full-Path Indexing

Yang Zhan, The University of North Carolina at Chapel Hill; Alex Conway, Rutgers University; Yizheng Jiao, The University of North Carolina at Chapel Hill; Eric Knorr, Rutgers University; Michael A. Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; William Jannen, Williams College; Rob Johnson, VMware Research; Donald E. Porter, The University of North Carolina at Chapel Hill; Jun Yuan, Stony Brook University

Available Media

This paper shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. Prior results indicated that renames are prohibitively expensive in full-path indexing.

The paper introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value-store API, and simplifies the overall design of the file system.

We implemented this mechanism in ArborFS, an extension of the BetrFS in-kernel, local file system for Linux. For instance, ArborFS performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS, but renames are competitive with standard, indirection-based file systems for a range of sizes. ArborFS outperforms relative-path file systems such as BetrFS as well as traditional file systems such as ext4, xfs and zfs across a variety of workloads.

## Poster Session and Happy Hour

Grand Ballroom

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Happy Hour. Take part in discussions with your colleagues over complimentary food and drinks. View the list of accepted posters.

East Hall Foyer

## Coding, Hashing, Hiding

Session Chair: Ethan L. Miller, University of California, Santa Cruz, and Pure Storage

## Clay Codes: Moulding MDS Codes to Yield an MSR Code

Myna Vajha, Vinayak Ramkumar, Bhagyashree Puranik, Ganesh Kini, Elita Lobo, Birenjith Sasidharan, and P. Vijay Kumar, Indian Institute of Science, Bangalore; Alexandar Barg and Min Ye, University of Maryland; Srinivasan Narayanamurthy, Syed Hussain, and Siddhartha Nandi, NetApp ATG, Bangalore

Available Media

With increase in scale, the number of node failures in a data center increases sharply. To ensure availability of data, failure-tolerance schemes such as Reed-Solomon (RS) or more generally, Maximum Distance Separable (MDS) erasure codes are used. However, while MDS codes offer minimum storage overhead for a given amount of failure tolerance, they do not meet other practical needs of today's data centers. Although modern codes such as Minimum Storage Regenerating (MSR) codes are designed to meet these practical needs, they are available only in highly-constrained theoretical constructions, that are not sufficiently mature enough for practical implementation. We present {\em Clay codes} that extract the best from both worlds. Clay (short for Coupled-Layer) codes are MSR codes that offer a simplified construction for decoding/repair by using pairwise coupling across multiple stacked layers of any single MDS code. In addition, Clay codes provide the first practical implementation of an MSR code that offers (a) low storage overhead, (b) simultaneous optimality in terms of three key parameters: repair bandwidth, sub-packetization level and disk I/O, (c) uniform repair performance of data and parity nodes and (d) support for both single and multiple-node repairs, while permitting faster and more efficient repair.

While all MSR codes are vector codes, none of the distributed storage systems support vector codes. We have modified Ceph to support any vector code, and our contribution is now a part of Ceph's master codebase. We have implemented Clay codes, and integrated it as a plugin to Ceph. Six example Clay codes were evaluated on a cluster of Amazon EC2 instances and code parameters were carefully chosen to match known erasure-code deployments in practice. A particular example code, with storage overhead $1.25$x, is shown to reduce repair network traffic by a factor of $2.9$ in comparison with RS codes and similar reductions are obtained for both repair time and disk read.

## Towards Web-based Delta Synchronization for Cloud Storage Services

He Xiao and Zhenhua Li, Tsinghua University; Ennan Zhai, Yale University; Tianyin Xu, UIUC; Yang Li and Yunhao Liu, Tsinghua University; Quanlu Zhang, Microsoft Research; Yao Liu, SUNY Binghamton

Available Media

Delta synchronization (sync) is crucial for network-level efficiency of cloud storage services. Practical delta sync techniques are, however, only available for PC clients and mobile apps, but not web browsers---the most pervasive and OS-independent access method. To understand the obstacles of web-based delta sync, we implement a delta sync solution, WebRsync, using state-of-the-art web techniques based on rsync, the de facto delta sync protocol for PC clients. Our measurements show that WebRsync severely suffers from the inefficiency of JavaScript execution inside web browsers, thus leading to frequent stagnation and even hanging. Given that the computation burden on the web browser mainly stems from data chunk search and comparison, we reverse the traditional delta sync approach by lifting all chunk search and comparison operations from the client side to the server side. Inevitably, this brings considerable computation overhead to the servers. Hence, we further leverage locality-aware chunk matching and lightweight checksum algorithms to reduce the overhead. The resulting solution, WebR2sync+, outpaces WebRsync by an order of magnitude, and is able to simultaneously support 6800--8500 web clients' delta sync using a standard VM server instance based on a Dropbox-like system architecture.

## Stash in a Flash

Aviad Zuck, Technion—Israel Institute of Technology; Yue Li and Jehoshua Bruck, California Institute of Technology; Donald E. Porter, The University of North Carolina at Chapel Hill; Dan Tsafrir, Technion—Israel Institute of Technology and VMware Research Group

Available Media

Encryption is a useful tool to protect data confidentiality. Yet it is still challenging to hide the very presence of encrypted, secret data from a powerful adversary. This paper presents a new technique to hide data in flash by manipulating the voltage level of pseudo-randomlyselected flash cells to encode two bits (rather than one) in the cell. In this model, we have one “public” bit interpreted using an SLC-style encoding, and extract a private bit using an MLC-style encoding. The locations of cells that encode hidden data is based on a secret key known only to the hiding user.

Intuitively, this technique requires that the voltage level in a cell encoding data must be (1) not statistically distinguishable from a cell only storing public data, and (2) the user must be able to reliably read the hidden data from this cell. Our key insight is that there is a wide enough variation in the range of voltage levels in a typical flash device to obscure the presence of fine-grained changes to a small fraction of the cells, and that the variation is wide enough to support reliably re-reading hidden data. We demonstrate that our hidden data and underlying voltage manipulations go undetected by support vector machine based supervised learning which performs similarly to a random guess. The error rates of our scheme are low enough that the data is recoverable months after being stored. Compared to prior work, our technique provides 24x and 50x higher encoding and decoding throughput and doubles the capacity, while being 37x more power efficient.

East Hall Foyer

## New Media and Old

Session Chair: Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology)

## Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree

Deukyeon Hwang and Wook-Hee Kim, UNIST; Youjip Won, Hanyang University; Beomseok Nam, UNIST

Available Media

With the emergence of byte-addressable persistent memory (PM), a cache line, instead of a page, is expected to be the unit of data transfer between volatile and nonvolatile devices, but the failure-atomicity of write operations is guaranteed in the granularity of 8 bytes rather than cache lines. This granularity mismatch problem has generated interest in redesigning block-based data structures such as B+-trees, and attempts have been made to use in-memory data structures for PM. However, various methods of modifying B+-trees for PM degrade the efficiency of B+-trees due to the additional metadata and high rebalancing overhead caused by logging methods.

In this study, we develop Failure-Atomic ShifT (FAST) and Failure-Atomic In-place Rebalance (FAIR) algorithms. FAST and FAIR modify legacy B+-trees in a byte-addressable fashion but solves the granularity mismatch problem. Every 8-byte store instruction used in the FAST and FAIR algorithms transforms a B+-tree into another consistent state or a transient inconsistent state that read operations can tolerate. By making read operations transient inconsistency, we can eliminate expensive copy-on-write, logging, and even the necessity of read latches so that read transactions can be non-blocking. Our experimental results show that legacy B+-trees with FAST and FAIR schemes outperform the state-of-the-art persistent indexing structures by a large margin.

## RFLUSH: Rethink the Flush

Jeseong Yeon and Minseong Jeong, Chungbuk National University; Sungjin Lee, DGIST; Eunji Lee, Chungbuk National University, University of Wisconsin—Madison

Available Media

A FLUSH command has been used for decades to enforce persistence and ordering of updates in a storage device. The command forces all the data in the volatile buffer to non-volatile media to achieve persistency. This lumpsum approach to flushing has two performance consequences. First, it slows down non-volatile materialization of the writes that actually need to be flushed. Second, it deprives the writes that need not to be flushed of an opportunity for absorbing future writes and coalescing. We attempt to characterize the problems of this semantic gap of flushing in storage devices and propose RFLUSH that allows a fine-grained control over flushing in them. The RFLUSH command delivers a range of LBAs that need to be flushed and thus enables the storage device to force only a subset of data in its buffer. We implemented this fine-grained flush command in a storage device using an open-source flash development platform and modified the F2FS file system to make use of the command in processing fsync requests as a case study. Performance evaluation using the prototype implementation shows that the inclusion of RFLUSH improves the throughput by up to 5.6x; reduces the write traffic by up to 43%; and eliminates the long tail in the response time.

## Barrier-Enabled IO Stack for Flash Storage

Youjip Won, Hanyang University; Jaemin Jung, Texas A&M University; Gyeongyeol Choi, Joontaek Oh, and Seongbae Son, Hanyang University; Jooyoung Hwang and Sangyeun Cho, Samsung Electronics
Awarded Best Paper!

Available Media

This work is dedicated to eliminating the overhead required for guaranteeing the storage order in the modern IO stack. The existing block device adopts a prohibitively expensive approach in ensuring the storage order among write requests: interleaving the write requests with Transfer-and-Flush. Exploiting the cache barrier command for Flash storage, we overhaul the IO scheduler, the dispatch module, and the filesystem so that these layers are orchestrated to preserve the ordering condition imposed by the application with which the associated data blocks are made durable. The key ingredients of Barrier-Enabled IO stack are Epoch-based IO scheduling, Order-Preserving Dispatch, and Dual-Mode Journaling. Barrier-enabled IO stack can control the storage order without Transfer-and-Flush overhead. We implement the barrier-enabled IO stack in server as well as in mobile platforms. SQLite performance increases by 270% and 75%, in server and in smartphone, respectively. In a server storage, BarrierFS brings as much as by 43$\times$ and by 73$\times$ performance gain in MySQL and SQLite, respectively, against EXT4 via relaxing the durability of a transaction.

## Long Live the File System!

Session Chair: Andrew Warfield, University of British Columbia

## High-Performance Transaction Processing in Journaling File Systems

Yongseok Son, Sunggon Kim, and Heon Young Yeom, Seoul National University; Hyuck Han, Dongduk Women's University

Available Media

Journaling file systems provide crash-consistency to applications by keeping track of uncommitted changes in the journal area (journaling) and writing committed changes to their original area at a certain point (checkpointing). They generally use coarse-grained locking to access shared data structures and perform I/O operations by a single thread. For these reasons, journaling file systems often face the problem of lock contention and underutilization of I/O bandwidth on multi-cores with high-performance storage. To address these issues, we design journaling and checkpointing schemes that enable concurrent updates on data structures and parallelize I/O operations. We implement our schemes in EXT4/JBD2 and evaluate them on a 72-core machine with a high-performance NVMe SSD. The experimental results show that our optimized file system improves the performance by up to about 2.2x and 1.5x compared to the existing EXT4 file system and a state-of-art scalable file system, respectively.

## Designing a True Direct-Access File System with DevFS

Sudarsun Kannan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison; Yuangang Wang, Jun Xu, and Gopinath Palani, Huawei Technologies

Available Media

We present DevFS, a direct-access file system embedded completely within a storage device. DevFS provides direct access without compromising file system integrity, concurrency, crash consistency, and security. A novel reverse-caching mechanism enables the usage of host memory for inactive objects, thus reducing memory load upon the device. Evaluation of an emulated DevFS prototype shows more than 2x higher I/O throughput with direct access and up to a 78% reduction in device RAM utilization.

## FStream: Managing Flash Streams in the File System

Eunhee Rho, Kanchan Joshi, Seung-Uk Shin, Nitesh Jagadeesh Shetty, Joo-Young Hwang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong, Samsung Electronics. Co., Ltd.

Available Media

The performance and lifespan of a solid-state drive (SSD) depend not only on the current input workload but also on its internal media fragmentation formed over time. The recently proposed streams gives a means for the host system to control how data are placed on the physical media (abstrated by a stream) and effectively reduce the media fragmentation. This work proposes FStream, a file system approach to taking advantage of this facility. FStream extracts streams at the file system level and avoids complex application level data mapping to streams. Experimental results showed that FStream enhances the filebench performance by 5%-35% and reduces WAF (Write Amplification Factor) by 7%-46%. For a NoSQL database benchmark, performance is improved by up to 38% and WAF is reduced by up to 81%.

East Hall Foyer

## Work-in-Progress Reports (WiPs)

Session Chairs: Anirudh Badam, Microsoft Research, and Gala Yadgar, Technion—Israel Institute of Technology

## Conference Reception

Grand Ballroom

Mingle with fellow attendees at the the Conference Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers.

East Hall Foyer

## Distribute and Conquer

Session Chair: Angela Demke Brown, University of Toronto

## Improving Docker Registry Design Based on Production Workload Analysis

Ali Anwar, Virginia Tech; Mohamed Mohamed and Vasily Tarasov, IBM Research—Almaden; Michael Littley, Virginia Tech; Lukas Rupprecht, IBM Research—Almaden; Yue Cheng, George Mason University; Nannan Zhao, Virginia Tech; Dimitrios Skourtis, Amit S. Warke, and Heiko Ludwig, and Dean Hildebrand, IBM Research—Almaden; Ali R. Butt, Virginia Tech

Available Media

Containers offer an efficient way to run workloads as independent microservices that can be developed, tested and deployed in an agile manner. To facilitate this process, container frameworks offer a registry service that enables users to publish and version container images and share them with others. The registry service plays a critical role in the startup time of containers since many container starts entail the retrieval of container images from a registry. To support research efforts on optimizing the registry service, large-scale and realistic traces are required. In this paper, we perform a comprehensive characterization of a large-scale registry workload based on traces that we collected over the course of 75 days from five IBM data centers hosting production-level registries. We present a trace replayer to perform our analysis and infer a number of crucial insights about container workloads, such as request type distribution, access patterns, and response times. Based on these insights, we derive design implications for the registry and demonstrate their ability to improve performance. Both the traces and the replayer are open-sourced to facilitate further research.

## RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures

Guangyan Zhang and Zican Huang, Tsinghua University; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Songlin Yang, Zhufan Wang, and Weimin Zheng, Tsinghua University

Available Media

Existing RAID solutions partition large disk enclosures so that each RAID group uses its own disks exclusively. This achieves good performance isolation across underlying disk groups, at the cost of disk under-utilization and slow RAID reconstruction from disk failures.

We propose RAID+, a new RAID construction mechanism that spreads both normal I/O and reconstruction workloads to a larger disk pool in a balanced manner. Unlike systems conducting randomized placement, RAID+ employs deterministic addressing enabled by the mathematical properties of mutually orthogonal Latin squares, based on which it constructs 3-D data templates mapping a logical data volume to uniformly distributed disk blocks across all disks. While the total read/write volume remains unchanged, with or without disk failures, many more disk drives participate in data service and disk reconstruction. Our evaluation with a 60-drive disk enclosure using both synthetic and real-world workloads shows that RAID+ significantly speeds up data recovery while delivering better normal I/O performance and higher multi-tenant system throughput.

## Logical Synchronous Replication in the Tintri VMstore File System

Gideon Glass, Arjun Gopalan, Dattatraya Koujalagi, Abhinand Palicherla, and Sumedh Sakdeo, Tintri, Inc

Available Media

A standard feature of enterprise data storage systems is synchronous replication: updates received from clients by one storage system are replicated to a remote storage system and are only acknowledged to clients after having been stored persistently on both storage systems. Traditionally these replication schemes require configuration on a coarse granularity, e.g. on a LUN, filesystem volume, or whole-system basis. In contrast to this, we present a new architecture which operates on a fine granularity---individual files and directories. To implement this, we use a combination of novel per-file capabilities and existing techniques to solve the following problems: tracking parallel writes in flight on independent storage systems; replicating arbitrary filesystem operations; efficiently resynchronizing after a disconnect; and verifying the integrity of replicated data between two storage systems.

East Hall Foyer

## Dedup: Last but Not Least

Session Chair: Youjip Won, Hanyang University

## ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

Zhichao Cao, Hao Wen, Fenggang Wu, and David H.C. Du, Department of Computer Science, University of Minnesota, Twin Cities

Available Media

Data deduplication has been widely applied in storage systems to improve the efficiency of space utilization. In data deduplication systems, the data restore performance is seriously hindered by read amplification since the accessed data chunks are scattered over many containers. A container consisting of hundreds or thousands data chunks is the data unit to be read from or write to the storage. Several schemes such as forward assembly, container-based caching, and chunk-based caching are used to reduce the number of container-reads during the restore process. However, how to effectively use these schemes to get the best restore performance is still unclear.

In this paper, we first study the trade-offs of using these schemes in terms of read amplification and computing time. We then propose a combined data chunk caching and forward assembly scheme called ALACC (Adaptive Look-Ahead Chunk Caching) for improving restore performance which can adapt to different deduplication workloads with a fixed total amount of memory. This is accomplished by extending and shrinking the look-ahead window adaptively to cover an appropriate data recipe range and dynamically deciding the memory to be allocated to forward assembly area and chunk-based caching. Our evaluations show the restore throughput of ALACC is higher than that of the optimum case of various configurations using the fixed amount of memory allocated to forward assembly and to chunk-based caching.

## UKSM: Swift Memory Deduplication via Hierarchical and Adaptive Memory Region Distilling

Nai Xia and Chen Tian, State Key Laboratory for Novel Software Technology, Nanjing University, China; Yan Luo and Hang Liu, Department of Electrical and Computer Engineering, University of Massachusetts Lowell, USA; Xiaoliang Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China

Available Media

In cloud computing, deduplication can reduce memory footprint by eliminating redundant pages. The responsiveness of a deduplication process to newly generated memory pages is critical. State-of-the-art Content Based Page Sharing (CBPS) approaches lack responsiveness as they equally scan every page while finding redundancies. We propose a new deduplication system UKSM, which prioritizes different memory regions to accelerate the deduplication process and minimize application penalty. With UKSM, memory regions are organized as a distilling hierarchy, where a region in a higher level receives more CPU cycles. UKSM adaptively promotes/demotes a region among levels according to the region’s estimated deduplication benefit and penalty. UKSM further introduces an adaptive partial-page hashing scheme which adjusts a global page hashing strength parameter according to the global degree of page similarity. Experiments demonstrate that, with the same amount of CPU cycles in the same time envelop, UKSM can achieve up to 12.6and 5 more memory saving than CBPS approaches on static and dynamic workloads, respectively.