# FAST '23 Technical Sessions

## Opening Remarks and Awards

Program Co-Chairs: Ashvin Goel, University of Toronto, and Dalit Naor, The Academic College of Tel Aviv–Yaffo

## Building and Operating a Pretty Big Storage System (My Adventures in Amazon S3)

Andy Warfield, Amazon

## Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)

Saurabh Kadekodi, Shashwat Silas, Arif Merchant, and David Clausen, Google

Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.

We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon which is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
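
To make the locality idea concrete, the following minimal Python sketch shows how a single lost block can be rebuilt from its local group alone. It is an illustration only, not the Uniform Cauchy LRC construction from the paper: it assumes one XOR parity per local group and omits the global parities that real LRCs add to survive multiple failures. The names `encode_local_parities`, `repair_block`, and `group_size` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_local_parities(data_blocks, group_size):
    """Split the data blocks into local groups and compute one XOR parity per group."""
    groups = [data_blocks[i:i + group_size]
              for i in range(0, len(data_blocks), group_size)]
    return [xor_blocks(group) for group in groups]


def repair_block(data_blocks, local_parities, lost_index, group_size):
    """Rebuild one lost data block using only the survivors of its local group."""
    group = lost_index // group_size
    start = group * group_size
    survivors = [block
                 for i, block in enumerate(data_blocks[start:start + group_size], start)
                 if i != lost_index]
    return xor_blocks(survivors + [local_parities[group]])


# Hypothetical stripe: 12 data blocks of 4 bytes each, local groups of 4 blocks.
data = [bytes([i] * 4) for i in range(1, 13)]
parities = encode_local_parities(data, group_size=4)
rebuilt = repair_block(data, parities, lost_index=5, group_size=4)
```

In this toy stripe the repair reads 4 blocks (3 survivors plus the local parity), whereas a conventional Reed-Solomon repair over 12 data blocks would read 12. A wider stripe lowers storage overhead but, as the abstract notes, exposes each stripe to more simultaneous failures.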

## ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage

Xiaolu Li, Huazhong University of Science and Technology; Keyun Cheng, Kaicheng Tang, and Patrick P. C. Lee, The Chinese University of Hong Kong; Yuchong Hu and Dan Feng, Huazhong University of Science and Technology; Jie Li and Ting-Yi Wu, Huawei Technologies Co., Ltd., Hong Kong

Minimum-storage regenerating (MSR) codes are provably optimal erasure codes that minimize the repair bandwidth (i.e., the amount of traffic being transferred during a repair operation), with the minimum storage redundancy, in distributed storage systems. However, the practical repair performance of MSR codes still has significant room to improve, as the mathematical structure of MSR codes makes their repair operations difficult to parallelize. We present ParaRC, a parallel repair framework for MSR codes. ParaRC exploits the sub-packetization nature of MSR codes to parallelize the repair of sub-blocks and balance the repair load (i.e., the amount of traffic transferred at a node) across the available nodes. We show that there exists a trade-off between the repair bandwidth and the maximum repair load, and further propose a fast heuristic that approximately minimizes the maximum repair load with limited search time for large coding parameters. We prototype our heuristic in ParaRC and show that ParaRC reduces the degraded read and full-node recovery times over the traditional centralized repair approach in MSR codes by up to 59.3% and 39.2%, respectively.
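
As a rough illustration of the repair-bandwidth saving, the sketch below compares the traffic needed to rebuild one lost block under conventional Reed-Solomon repair and at the MSR point. The formula d / (d - k + 1) is the standard result from the regenerating-codes literature rather than something stated in the abstract, and the function names and the (14, 10) parameters are chosen only for illustration.

```python
from fractions import Fraction


def rs_repair_traffic(k: int) -> Fraction:
    """Blocks read to rebuild one lost block with conventional Reed-Solomon repair."""
    return Fraction(k)


def msr_repair_traffic(n: int, k: int, d: int) -> Fraction:
    """Block-equivalents transferred to rebuild one block at the MSR point.

    Each of the d helper nodes sends 1 / (d - k + 1) of a block.
    """
    if not (k <= d <= n - 1):
        raise ValueError("need k <= d <= n - 1 helper nodes")
    return Fraction(d, d - k + 1)


# Illustrative parameters: n = 14 nodes, k = 10 data blocks, d = 13 helpers.
print(rs_repair_traffic(10))           # 10
print(msr_repair_traffic(14, 10, 13))  # 13/4
```

In a centralized repair all of this traffic converges on the single node that rebuilds the block, which is the per-node repair load that the abstract describes ParaRC as balancing.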

## InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication

Iwona Kotlarska, Andrzej Jackowski, Krzysztof Lichota, Michal Welnicki, and Cezary Dubnicki, 9 Lives Data; Konrad Iwanicki, University of Warsaw

Backup solutions increasingly offer cloud tiering, that is, moving selected data from on-premise storage to the cloud. As subsequent backups usually contain repeating data, it is reasonable to pair cloud tiering with deduplication to significantly reduce the cloud storage utilization, and hence the associated costs. However, solutions for cloud tiering with deduplication that would harness the scaling potential of the cloud tier and minimize the total expenditures due to this tier are essentially still lacking. This paper aims to bridge this gap. First, it introduces InftyDedup, a novel system for cloud tiering with deduplication, which aims to maximize scalability by utilizing different cloud services, not only for storage but also computation. Accordingly, it performs deduplication in the cloud, using distributed batch algorithms, in effect allowing for processing multi-petabyte backups for a couple of dollars. Second, the paper presents algorithms for InftyDedup that employ multiple types of cloud storage to further reduce the costs of the cloud tier. They take into account the characteristics of each data chunk to efficiently select between cloud services providing hot and cold data stores, thereby reducing the overall costs by up to 26%-44%. The solutions are implemented in a state-of-the-art commercial backup system and evaluated in the cloud of a hyperscaler.
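
The hot-versus-cold choice described above can be pictured with a simple per-chunk cost comparison. The sketch below is an assumption-laden illustration, not InftyDedup's algorithm: the `Tier` fields, the cost model, and every price are hypothetical values invented for the example and do not correspond to any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    storage_per_gb_month: float  # hypothetical price
    retrieval_per_gb: float      # hypothetical price
    min_months: float            # minimum billed storage duration


def tier_cost(tier: Tier, size_gb: float, months: float, expected_reads: float) -> float:
    """Estimated cost of keeping a chunk in one tier for its whole lifetime."""
    billed_months = max(months, tier.min_months)
    storage = tier.storage_per_gb_month * billed_months
    retrieval = tier.retrieval_per_gb * expected_reads
    return size_gb * (storage + retrieval)


def pick_tier(tiers, size_gb: float, months: float, expected_reads: float) -> Tier:
    """Choose the tier with the lowest estimated cost for this chunk."""
    return min(tiers, key=lambda t: tier_cost(t, size_gb, months, expected_reads))


# Made-up tiers: cold storage is cheaper to hold but charges for reads
# and bills at least three months.
HOT = Tier("hot", storage_per_gb_month=0.02, retrieval_per_gb=0.0, min_months=0)
COLD = Tier("cold", storage_per_gb_month=0.004, retrieval_per_gb=0.03, min_months=3)

long_lived = pick_tier([HOT, COLD], size_gb=1, months=12, expected_reads=0)
short_lived = pick_tier([HOT, COLD], size_gb=1, months=1, expected_reads=2)
```

Under these made-up prices a chunk retained for a year and never read lands in the cold tier, while a chunk kept for one month and read twice is cheaper in the hot tier.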

## Deployed System: Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, PDL; Yiming Zhang, Xiamen University; Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Tsinghua University and Xiamen University; Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.

The newly emerging "fail-slow" failures plague both software and hardware, where the victim components are still functioning yet with degraded performance. To address this problem, this paper presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48.05%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis on fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We commit to releasing the dataset to the public for fail-slow study.

## ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance

Jinghuan Yu, City University of Hong Kong; Sam H. Noh, Ulsan National Institute of Science & Technology; Young-ri Choi, UNIST; Chun Jason Xue, City University of Hong Kong

Log-Structured Merge-tree (LSM) based Key-Value (KV) systems are widely deployed. A widely acknowledged problem with LSM-KVs is write stalls, which refer to sudden performance drops under heavy write pressure. Prior studies have attributed write stalls to a particular cause such as a resource shortage or a scheduling issue. In this paper, we conduct a systematic study on the causes of write stalls by evaluating RocksDB with a variety of storage devices and show that the conclusions that focus on the individual aspects, though valid, are not generally applicable. Through a thorough review and further experiments with RocksDB, we show that data overflow, which refers to the rapid expansion of one or more components in an LSM-KV system due to a surge in data flow into one of the components, is able to explain the formation of write stalls. We contend that by balancing and harmonizing data flow among components, we will be able to reduce data overflow and thus, write stalls. As evidence, we propose a tuning framework called ADOC (Automatic Data Overflow Control) that automatically adjusts the system configurations, specifically, the number of threads and the batch size, to minimize data overflow in RocksDB. Our extensive experimental evaluations with RocksDB show that ADOC reduces the duration of write stalls by as much as 87.9% and improves performance by as much as 322.8% compared with the auto-tuned RocksDB. Compared to the manually optimized state-of-the-art SILK, ADOC achieves up to 66% higher throughput for the synthetic write-intensive workload that we used, while achieving comparable performance for the real-world YCSB workloads. However, SILK has to use over 20% more DRAM on average.

## FUSES: A Fully Memory-Disaggregated Key-Value Store

Jiacheng Shen, The Chinese University of Hong Kong; Pengfei Zuo, Huawei Cloud; Xuchuan Luo, Fudan University; Tianyi Yang, The Chinese University of Hong Kong; Yuxin Su, Sun Yat-sen University; Yangfan Zhou, Fudan University; Michael Lyu, The Chinese University of Hong Kong

## ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems

Pengfei Li, Yu Hua, Pengfei Zuo, Zhangyu Chen, and Jiajie Sheng, Huazhong University of Science and Technology

Disaggregated memory systems separate monolithic servers into different components, including compute and memory nodes, to enjoy the benefits of high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting the high-performance RDMA (Remote Direct Memory Access), the compute nodes directly access the remote memory pool without involving remote CPUs. Hence, the ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide range query service via the high-performance network. However, existing ordered KVs fail to work well on the disaggregated memory systems, due to either consuming multiple network roundtrips to search the remote data or heavily relying on the memory nodes equipped with insufficient computing resources to process data modifications. In this paper, we propose a scalable RDMA-oriented KV store with learned indexes, called ROLEX, to coalesce the ordered KV store in the disaggregated systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme to dissociate the model retraining from data modification operations via adding a bias and some data-movement constraints to learned models. Based on the operation decoupling, data modifications are directly executed in compute nodes via one-sided RDMA verbs with high scalability. The model retraining is hence removed from the critical path of data modification and asynchronously executed in memory nodes by using dedicated computing resources. Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on the static workloads, and improves the performance on dynamic workloads by up to 2.2 times over state-of-the-art schemes on the disaggregated memory systems. We have released the open-source code for public use on GitHub.

## GL-Cache: Group-level learning for efficient and high-performance caching

Juncheng Yang, Carnegie Mellon University; Ziming Mao, Yale University; Yao Yue, Twitter; Rashmi Vinayak, Carnegie Mellon University

Web applications rely heavily on software caches to achieve low-latency, high-throughput services. To adapt to changing workloads, three types of learned caches (learned evictions) have been designed in recent years: object-level learning, learning-from-distribution, and learning-from-simple-experts. However, we argue that the learning granularity in existing approaches is either too fine (object-level), incurring significant computation and storage overheads, or too coarse (workload or expert-level) to capture the differences between objects, leaving a considerable efficiency gap.

In this work, we propose a new approach for learning in caches (group-level learning), which clusters similar objects into groups and performs learning and eviction at the group level. Learning at the group level accumulates more signals for learning, leverages more features with adaptive weights, and amortizes overheads over objects, thereby achieving both high efficiency and high throughput.

We designed and implemented GL-Cache on an open-source production cache to demonstrate group-level learning. Evaluations on 118 production block I/O and CDN cache traces show that GL-Cache has a higher hit ratio and throughput than state-of-the-art designs. Compared to LRB (object-level learning), GL-Cache improves throughput by 228 times and hit ratio by 7% on average across cache sizes. For 10% of the traces (P90), GL-Cache provides a 25% hit ratio increase from LRB. Compared to the best of all learned caches, GL-Cache achieves a 64% higher throughput, a 3% higher hit ratio on average, and a 13% hit ratio increase at the P90.

## SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

Redwan Ibne Seraj Khan and Ahmad Hossein Yazdani, Virginia Tech; Yuqi Fu, University of Virginia; Arnab K. Paul, BITS Pilani, Goa, India; Bo Ji and Xun Jian, Virginia Tech; Yue Cheng, University of Virginia; Ali R. Butt, Virginia Tech

Emerging deep learning (DL) applications exhibit unique characteristics that pose new challenges when deriving higher performance and efficiency. DL training is I/O intensive, as data samples need to be fetched continuously from remote storage. At the same time, GPUs have been extensively employed to support these applications. As GPUs become more powerful, I/O performance lags behind, creating a major bottleneck, especially in distributed DL (DDL). Exponentially growing datasets, such as those for visual analysis, preclude storing datasets entirely in memory. However, as resource demands keep changing at different stages of training, efficiently matching DL job needs with available system resources is one of the main challenges. The recent observation that while DL treats all data samples equally, some samples contribute more towards building the accuracy of a model, and hence have more importance, has created an opportunity for optimizations by exploiting data locality.

In this work, we propose SHADE, a system that detects fine-grained importance among data samples and leverages it to make caching and eviction decisions during the runtime of DL jobs. The key strength of SHADE is that it manages to increase the hit rate in cache and thus increases training performance by adopting a novel rank-based approach to detect relative importance of samples coupled with a technique to dynamically update importance score throughout training. Evaluations on real-world datasets using representative CV (Computer Vision) models show that SHADE ensures higher throughput, higher hit rates, and reduced minibatch load times, and increases memory utilization by up to 4.5x compared to advanced caching policies, even when these state-of-the-art techniques cache up to 7.5x more data than SHADE.

## Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach

Lei Liu, Beihang University; Xinglei Dou and Yuetao Chen, ICT, CAS

Latency-critical services have been widely deployed in cloud environments. For cost-efficiency, multiple services are usually co-located on a server. Thus, run-time resource scheduling becomes the pivot for QoS control in these complicated co-location cases. However, the scheduling exploration space grows rapidly as server resources increase, making it hard for schedulers to provide ideal solutions quickly. More importantly, we observe that there are "resource cliffs" in the scheduling exploration space. They affect the exploration efficiency and always lead to severe QoS fluctuations in previous schedulers. To address these problems, we propose a novel ML-based intelligent scheduler, OSML. It learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint, etc.), scheduling solutions and the QoS demands based on a data set we collected from 11 widely deployed services running on off-the-shelf servers. OSML employs multiple ML models to work collaboratively to predict QoS variations, shepherd the scheduling, and recover from QoS violations in complicated co-location cases. OSML can intelligently avoid resource cliffs during scheduling and reach an optimal solution much faster than previous approaches for co-located LC services. Experimental results show that OSML supports higher loads and meets QoS targets with lower scheduling overheads and shorter convergence time than previous studies.

## FAST '23 Poster Session and Reception

See the Call for Posters and WiPs for information on how to submit your poster. The submission deadline is Thursday, January 19, 2023.

## CJFS: Concurrent Journaling for Better Scalability

Joontaek Oh, Seung Won Yoo, and Hojin Nam, KAIST; Changwoo Min, Virginia Tech; Youjip Won, KAIST

In this paper, we propose CJFS, the Concurrent Journaling Filesystem. CJFS extends the EXT4 filesystem and addresses the fundamental limitations of the EXT4 journaling design, which are the main cause of the poor scalability of the EXT4 filesystem. The heavy-weight EXT4 journal suffers from two limitations. First, the journal commit is a strictly serial activity. Second, the journal commit uses the original page cache entry, not a copy of it, and subsequently any access to the in-flight page cache entry is blocked. To address these limitations, we propose four techniques, namely Dual Thread Journaling, Multi-version Shadow Paging, Opportunistic Coalescing, and Compound Flush. With the Dual Thread design, CJFS can commit a transaction before the preceding journal commit finishes. With Multi-version Shadow Paging, CJFS can be free from the transaction conflict even though there can exist multiple committing transactions. With Opportunistic Coalescing, CJFS can mitigate the transaction lock-up overhead in journal commit so that it can increase the coalescing degree (i.e., the number of system calls associated with a single transaction) of a running transaction. With Compound Flush, CJFS minimizes the number of flush calls. CJFS improves the throughput by 81%, 68% and 125% in filebench varmail, dbench, and OLTP-Insert on MySQL, respectively, against EXT4 by removing the transaction conflict and lock-up overhead.

## Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities

Aditya Basu and Jack Sampson, Pennsylvania State University; Zhiyun Qian, UC Riverside; Trent Jaeger, Pennsylvania State University

File name confusion attacks, such as malicious symlinks and file squatting, have long been studied as sources of security vulnerabilities. However, a recently emerged type, i.e., *case-sensitivity-induced name collisions*, has not been scrutinized. These collisions are introduced by differences in name resolution under case-sensitive and case-insensitive file systems or directories. A prominent example is the recent Git vulnerability (CVE-2021-21300) which can lead to code execution on a victim client when it clones a maliciously crafted repository onto a case-insensitive file system. With trends including ext4 adding support for per-directory case-insensitivity and the broad deployment of the Windows Subsystem for Linux, the prerequisites for such vulnerabilities are increasingly likely to exist even in a single system.

In this paper, we make a first effort to investigate how and where the lack of any uniform approach to handling name collisions leads to a diffusion of responsibility and resultant vulnerabilities. Interestingly, we demonstrate the existence of a range of novel security challenges arising from name collisions and their inconsistent handling by low-level utilities and applications. Specifically, our experiments show that utilities handle many name collision scenarios unsafely, leaving the responsibility to applications whose developers are unfortunately not yet aware of the threats. We examine three case studies as a first step towards systematically understanding the emerging type of name collision vulnerability.

## ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit

Tabassum Mahmud, Om Rameshwar Gatla, Duo Zhang, Carson Love, Ryan Bumann, and Mai Zheng, Iowa State University

File systems play an essential role in modern society for managing precious data. To meet diverse needs, they often support many configuration parameters. Such flexibility comes at the price of additional complexity which could lead to subtle configuration-related issues. To address the challenge, we study the configuration-related issues of two major file systems (i.e., Ext4 and XFS) in depth, and identify a prevalent pattern called multilevel configuration dependencies. Based on the study, we build an extensible tool called ConfD to extract the dependencies automatically, and create six plugins to address different configuration-related issues. We apply the prototype to analyze Ext4 and XFS and extract more than 100 configuration dependencies for the two file systems with a low false positive rate. Moreover, we have identified various configuration-related issues including 17 specification issues, 18 configuration handling issues, and 10 regression test failures induced by valid configurations.

## HadaFS: A File System Bridging the Local and Shared Burst Buffer for Exascale Supercomputers

Xiaobin He, National Supercomputer Center, Wuxi; Bin Yang, Tsinghua University, National Supercomputer Center, Wuxi; Jie Gao and Wei Xiao, National Supercomputer Center, Wuxi; Qi Chen, Tsinghua University; Shupeng Shi and Dexun Chen, National Supercomputer Center, Wuxi; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University; Zuo-Ning Chen, Chinese Academy of Engineering

Current supercomputers introduce SSDs to form a Burst Buffer (BB) layer to meet the HPC application's growing I/O requirements. BBs can be divided into two types by deployment location. One is the local BB, which is known for its scalability and performance. The other is the shared BB, which has the advantage of data sharing and deployment costs. How to unify the advantages of the local BB and the shared BB is a key issue in the HPC community.

## Fisc: A Large-scale Cloud-native-oriented File System

Qiang Li, Alibaba Group; Lulu Chen, Fudan University; Xiaoliang Wang, Nanjing University; Shuo Huang, Alibaba Group; Qiao Xiang, Xiamen University; Yuanyuan Dong, Wenhui Yao, Minfei Huang, Puyuan Yang, Shanyang Liu, Zhaosheng Zhu, Huayong Wang, Haonan Qiu, Derui Liu, Shaozong Liu, Yujie Zhou, Yaohui Wu, Zhiwu Wu, Shang Gao, Chao Han, Zicheng Luo, Yuchao Shao, Gexiao Tian, Zhongjie Wu, Zheng Cao, and Jinbo Wu, Alibaba Group; Jiwu Shu, Xiamen University; Jie Wu, Fudan University; Jiesheng Wu, Alibaba Group

The wide adoption of Cloud Native shifts the boundary between cloud users and CSPs (Cloud Service Providers) from VM-based infrastructure to container-based applications. However, traditional file systems face challenges. First, the traditional file system (e.g., Tectonic, Colossus, HDFS) clients are sophisticated and compete with the scarce resources in the application containers. Second, it is challenging for CSPs to help the I/O pass from the containers to the storage clusters while guaranteeing their security, availability, and performance.

To provide file system service for cloud-native applications, we design Fisc, a cloud-native-oriented file system. Fisc introduces four key designs: 1) a lightweight file system client in the container, 2) a DPU-based virtio-Fisc device to implement the hardware offloading, 3) a storage-aware mechanism that routes I/O to the storage node to improve I/O availability and enable local reads, and 4) a full-path QoS mechanism to guarantee the QoS of hybrid deployed applications. Fisc has been deployed in production for over three years. It now serves cloud-native applications running over 3 million cores. Results show that the Fisc client consumes only 80% of the CPU resources of the traditional file system client. The production environment shows that the online searching task's latency is less than 500 µs when accessing the remote storage cluster.

## TENET: Memory Safe and Fault tolerant Persistent Transactional Memory

R. Madhava Krishnan, Virginia Tech; Diyu Zhou, École Polytechnique Fédérale de Lausanne (EPFL); Wook-Hee Kim, Konkuk University; Sudarsun Kannan, Rutgers University; Sanidhya Kashyap, EPFL; Changwoo Min, Virginia Tech

Byte-addressable Non-Volatile Memory (NVM) allows programs to directly access storage using a memory interface without going through the expensive conventional storage stack. However, direct access to NVM makes the NVM data vulnerable to software memory bugs (memory safety) and hardware errors (fault tolerance). This issue is critical because, unlike DRAM, corrupted data can persist forever, even after the system restarts. Despite the plethora of research on NVM programs and systems, little attention has been paid to protecting NVM data from software bugs and hardware errors.

In this paper, we propose TENET, a new NVM programming framework, which guarantees memory safety and fault tolerance to protect NVM data against software bugs and hardware errors. TENET provides the most popular Persistent Transactional Memory (PTM) programming model. TENET leverages the concurrency and commit-time guarantees of a PTM to provide performant and cost-efficient memory safety and fault tolerance. Our evaluations show that TENET offers the protection for NVM data at a modest performance overhead and storage cost, as compared to other PTMs with partial or no memory safety and fault-tolerance support.

## MEFS: Per-File Virtualization for Userspace Persistent Memory Filesystems

Shawn Zhong, Chenhao Ye, Guanzhou Hu, Suyan Qu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Michael Swift, University of Wisconsin-Madison

Persistent memory (PM) can be directly accessed from userspace without kernel involvement, but most PM filesystems still perform metadata operations in the kernel for security and rely on the kernel for cross-process synchronization.

We present per-file virtualization, where a virtualization layer implements a complete set of file functionalities, including metadata management, crash consistency, and concurrency control, fully in userspace. We observe that not all file metadata has to be maintained by the kernel and propose embedding insensitive metadata into file data for userspace management. For crash consistency, copy-on-write (CoW) benefits from the embedding of the block mapping since the mapping can be efficiently updated without kernel involvement. For cross-process synchronization, we introduce lock-free optimistic concurrency control (OCC) at the user level, which tolerates process crashes and brings better scalability.

Based on per-file virtualization, we implement MEFS, a library PM filesystem that maintains the embedded metadata as a compact log. Experimental results show that on concurrent workloads, MEFS achieves up to 8.3 times throughput compared to SplitFS. For real-world applications, MEFS provides up to 53% speedup for YCSB on LevelDB and 90% for TPC-C on SQLite compared to NOVA.

## On Stacking a Persistent Memory File System on Legacy File Systems

Hobin Woo, Samsung Electronics; Daegyu Han, Sungkyunkwan University; Seungjoon Ha, Samsung Electronics; Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology); Beomseok Nam, Sungkyunkwan University

In this work, we design and implement a Stackable Persistent memory File System (SPFS). SPFS can be stacked on any disk-optimized file system to improve I/O performance by absorbing frequent order-preserving small synchronous writes in NVMM while also exploiting the VFS cache of the underlying disk-optimized file system for non-synchronous writes. A stackable file system must be lightweight in that it manages only NVMM and not the disk or VFS cache. Therefore, SPFS manages all file system metadata including extents using simple but highly efficient dynamic hash tables. To manage extents using hash tables, we design a novel Extent Hashing algorithm that exhibits fast insertion as well as fast scan performance. Our extensive performance study shows that SPFS effectively improves I/O performance of the lower file system by up to 9.9x.

## Citron: Distributed Range Lock Management with One-sided RDMA

Jian Gao, Youyou Lu, Minhui Xie, Qing Wang, and Jiwu Shu, Tsinghua University

Range locks enable concurrent access to disjoint parts of shared storage. However, existing range lock managers rely on centralized CPU resources to process lock requests, which results in a server-side CPU bottleneck and suboptimal performance when placed in a distributed scenario.

We propose Citron, an RDMA-enabled distributed range lock manager that bypasses the server-side CPU by using only one-sided RDMA in range lock acquisition and release paths. Citron manages range locks with a static data structure called a segment tree, which effectively accommodates dynamically located and sized ranges but only requires limited and nearly constant synchronization costs from the clients. Citron can also scale up itself in microseconds to adapt to a shared storage of a growing size at runtime. Evaluation shows that under various workloads, Citron delivers up to 3.35x throughput and 76.4% lower tail latency than CPU-based approaches.
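
To show why a static segment tree suits range locking, here is a minimal single-process Python sketch. It is not Citron's protocol: the abstract does not describe the node layout or the one-sided RDMA operations, so the class `SegmentTreeRangeLock`, its counters, and the conflict rule are assumptions chosen to illustrate the general idea that any range maps to O(log N) fixed tree nodes.

```python
class SegmentTreeRangeLock:
    """Toy in-memory range lock over [0, size), where size is a power of two."""

    def __init__(self, size: int):
        if size < 1 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.held = [0] * (2 * size)   # 1 if this node is locked as a whole
        self.below = [0] * (2 * size)  # number of locked nodes strictly beneath

    def _cover(self, lo: int, hi: int):
        """Canonical nodes covering [lo, hi); heap indexing with the root at 1."""
        nodes = []
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                nodes.append(lo)
                lo += 1
            if hi & 1:
                hi -= 1
                nodes.append(hi)
            lo >>= 1
            hi >>= 1
        return nodes

    def _ancestors(self, node: int):
        node >>= 1
        while node >= 1:
            yield node
            node >>= 1

    def try_lock(self, lo: int, hi: int) -> bool:
        nodes = self._cover(lo, hi)
        for n in nodes:
            # Conflict if the node, a descendant, or an ancestor is already held.
            if self.held[n] or self.below[n]:
                return False
            if any(self.held[a] for a in self._ancestors(n)):
                return False
        for n in nodes:
            self.held[n] = 1
            for a in self._ancestors(n):
                self.below[a] += 1
        return True

    def unlock(self, lo: int, hi: int) -> None:
        """Release a range; it must match a range previously locked exactly."""
        for n in self._cover(lo, hi):
            self.held[n] = 0
            for a in self._ancestors(n):
                self.below[a] -= 1
```

Two ranges overlap exactly when a covering node of one is equal to, or an ancestor of, a covering node of the other, so each lock attempt touches only a logarithmic number of fixed locations. A distributed design would replace the plain list updates here with remote atomic operations, which this sketch does not attempt to model.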

## Patronus: High-Performance and Protective Remote Memory

Bin Yan, Youyou Lu, Qing Wang, Minhui Xie, and Jiwu Shu, Tsinghua University

RDMA-enabled remote memory (RM) systems are gaining popularity with improved memory utilization and elasticity. However, since it is commonly believed that fine-grained RDMA permission management is impractical, existing RM systems forgo memory protection, an indispensable property in a real-world deployment. In this paper, we propose PATRONUS, an RM system that can simultaneously offer protection and high performance. PATRONUS introduces a fast permission management mechanism by exploiting advanced RDMA hardware features with a set of elaborate software techniques. Moreover, to retain the high performance under exception scenarios (e.g., client failures, illegal access), PATRONUS attaches microsecond-scale leases to permissions and reserves spare RDMA resources for fast recovery. We evaluate PATRONUS over two one-sided data structures and two function-as-a-service (FaaS) applications. The experiment shows that the protection only brings 2.4% to 27.7% overhead among all the workloads, and our system outperforms the best competitor by up to 5.2×.

## Deployed System: More Than Capacity, Performance-oriented Evolution of Pangu in Alibaba

Qiang Li, Alibaba Group; Qiao Xiang, Haohao Song, Ridi Wen, and Yuxin Wang, Xiamen University; Wenhui Yao, Yuanyuan Dong, Shuqi Zhao, Shuo Huang, Zhaosheng Zhu, Huanyong Wang, and Shanyang Liu, Alibaba Group; Lulu Chen, Fudan University; Zhiwu Wu, Haonan Qiu, Derui Liu, Gexiao Tian, Chao Han, Shaozong Liu, Yaohui Wu, Zicheng Luo, Yuchao Shao, Junping Wu, Zheng Cao, Zhongjie Wu, Jiaji Zhu, and Jinbo Wu, Alibaba Group; Jiwu Shu, Xiamen University; Jiesheng Wu, Alibaba Group

This paper presents how the Pangu storage system continuously evolves with hardware technologies and the business model to provide high-performance, reliable storage services with a 100-microsecond level of I/O latency. Pangu’s evolution includes two phases. In the first phase, Pangu embraces the emergence of the solid-state drive (SSD) storage and remote direct memory access (RDMA) network technologies by innovating its file system and designing a user-space storage operating system to substantially reduce the I/O latency while providing high throughput and IOPS. In the second phase, Pangu evolves from a volume-oriented storage provider to a performance-oriented one. To adapt to this change of business model, Pangu upgrades its infrastructure with storage servers of much higher SSD volume and RDMA bandwidth from 25Gbps to 100Gbps. It introduces a series of key designs, including traffic amplification reduction, remote direct cache access, and CPU computation offloading, to ensure Pangu fully harvests the performance improvement brought by hardware upgrade. Other than introducing these technology innovations, we also share our operating experiences during Pangu’s evolution, and discuss important lessons learned from them.

## Work-in-Progress Reports (WiPs)

See the Call for Posters and WiPs for information on how to submit your work-in-progress report. The submission deadline is Thursday, January 19, 2023.

## λ-IO: A Unified IO Stack across the Host and Computational Storage Device

Zhe Yang, Youyou Lu, Xiaojian Liao, Youmin Chen, Junru Li, Siyu He, and Jiwu Shu, Tsinghua University

The emerging computational storage device offers an opportunity for in-storage computing. It alleviates the overhead of data movement between the host and the device, and thus accelerates data-intensive applications. In this paper, we present λ-IO, a unified IO stack managing both computation and storage resources across the host and the device. We propose a set of designs (interface, runtime, and scheduling) to tackle three critical issues. We implement λ-IO in a full-stack software and hardware environment, and evaluate it with synthetic and production applications against Linux IO, showing up to 5.12× performance improvement.

## Revitalizing the Forgotten On-Chip DMA to Expedite Data Movement in NVM-based Storage Systems

Jingbo Su, Jiahao Li, Luofan Chen, and Cheng Li, University of Science and Technology of China; Kai Zhang and Liang Yang, SmartX; Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology); Yinlong Xu, University of Science and Technology of China

Data-intensive applications executing on NVM-based storage systems experience serious bottlenecks when moving data between DRAM and NVM. We advocate for the use of the long-existing but recently neglected on-chip DMA to expedite data movement with three contributions. First, we explore new latency-oriented optimization directions, driven by a comprehensive DMA study, to design a high-performance DMA module, which significantly lowers the I/O size threshold to observe benefits. Second, we propose a new data movement engine, Fastmove, that coordinates the use of the DMA along with the CPU with judicious scheduling and load splitting such that the DMA’s limitations are compensated, and the overall gains are maximized. Finally, with a general kernel-based design, simple APIs, and DAX file system integration, Fastmove allows applications to transparently exploit the DMA and its new features without code change. We run three data-intensive applications (MySQL, GraphWalker, and Filebench) atop NOVA, ext4-DAX, and XFS-DAX, with standard benchmarks like TPC-C and popular graph algorithms like PageRank. Across single- and multi-socket settings, compared to conventional CPU-only NVM accesses, Fastmove improves the peak throughput of TPC-C with MySQL by 1.13-2.16×, reduces the average latency by 17.7-60.8%, and saves 37.1-68.9% of the CPU usage spent in data movement. It also shortens the execution time of graph algorithms with GraphWalker by 39.7-53.4%, and improves Filebench throughput by 1.12-1.27×.
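
The abstract mentions load splitting between the DMA and the CPU but does not spell out the policy. As a rough illustration only, the sketch below assumes a proportional split by relative bandwidth, with small copies kept entirely on the CPU. The function name `split_copy`, the 16 KiB threshold, the 4 KiB alignment, and the bandwidth arguments are all hypothetical and are not taken from Fastmove.

```python
def split_copy(size, dma_bw, cpu_bw, dma_min=16 * 1024, align=4096):
    """Return (dma_bytes, cpu_bytes) for one copy of `size` bytes.

    Hypothetical policy, not Fastmove's: copies below `dma_min` stay on
    the CPU; larger copies are divided in proportion to the assumed
    bandwidths, with the DMA share rounded down to `align` bytes.
    """
    if size < dma_min:
        return 0, size
    dma_bytes = size * dma_bw // (dma_bw + cpu_bw)
    dma_bytes -= dma_bytes % align  # keep the DMA part aligned
    return dma_bytes, size - dma_bytes


if __name__ == "__main__":
    print(split_copy(4096, 3, 1))     # small copy: CPU only
    print(split_copy(1 << 20, 3, 1))  # large copy: shared 3:1
```

Whatever the real policy is, the two shares must always add up to the original size, which this sketch guarantees by computing the CPU share as the remainder.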

## NVMeVirt: A Versatile Software-defined Virtual NVMe Device

Sang-Hoon Kim, Ajou University; Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, and Jin-Soo Kim, Seoul National University

There have been drastic changes in the storage device landscape recently. Many novel storage device types and concepts are actively introduced and commercialized. At the center of the diverse storage landscape lies the NVMe interface, which allows high-performance and flexible communication models that these next-generation device types require. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new revolutionary storage devices.

In this paper, we present NVMeVirt, a novel approach to facilitate software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt bridges the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned namespace (ZNS) SSDs, and key-value SSDs with the support of PCI peer-to-peer DMA and NVMe-oF target offloading. We also make cases for storage research with NVMeVirt such as studying the performance characteristics of database engines and extending the NVMe specification for improved key-value SSD performance.

## Deployed System: SMRSTORE: A Storage Engine for Cloud Object Storage on HM-SMR Drives

Su Zhou, Erci Xu, Hao Wu, Yu Du, Jiacheng Cui, Wanyu Fu, Chang Liu, Yingni Wang, Wenbo Wang, Shouqu Sun, Xianfei Wang, Bo Feng, Biyun Zhu, Xin Tong, Weikang Kong, Linyan Liu, Zhongjie Wu, Jinbo Wu, Qingchao Luo, and Jiesheng Wu, Alibaba Cloud

Cloud object storage vendors are in constant pursuit of better cost efficiency. The emerging Shingled Magnetic Recording (SMR) drive has become an economically favorable choice due to its significantly improved areal density. However, SMR drives have mostly been deployed in archival-class object storage because they require zone-level sequential writes and erasure in bulk. For standard-class object storage, previous studies and our preliminary exploration reveal that existing SMR drive solutions can suffer from severe performance penalties and unpredictability.

In this paper, we propose SMRSTORE, an SMR-based storage engine for standard-class object storage without compromising performance or durability. The key features of SMRSTORE include directly bridging the semantics of the distributed file system with the zoned namespace in SMR drives, using a complete log-structured design, and applying guided data placement to reduce GC activities and achieve consistent performance. The evaluation shows that SMRSTORE delivers performance comparable to Ext4 on Conventional Magnetic Recording (CMR) drives, and can be up to 2.16× faster than F2FS on SMR drives. Currently, we have deployed SMRSTORE in Alibaba Cloud Object Storage Service (OSS) to store hundreds of PBs of data in standard class. We plan to use SMR drives for all classes of OSS in the near future.

## Multi-view Feature-based SSD Failure Prediction: What, When, and Why

Yuqi Zhang and Wenwen Hao, Samsung R&D Institute China Xi'an, Samsung Electronics; Ben Niu and Kangkang Liu, Tencent; Shuyang Wang, Na Liu, and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics; Yongwong Gwon and Chankyu Koh, Samsung Electronics

Solid state drives (SSDs) play an important role in large-scale data centers. SSD failures affect the stability of storage systems and cause additional maintenance overhead. To predict and handle SSD failures in advance, this paper proposes a multi-view and multi-task random forest (MVTRF) scheme. MVTRF predicts SSD failures based on multi-view features extracted from both long-term and short-term monitoring data of SSDs. In particular, multi-task learning is adopted to simultaneously predict what type of failure it is and when it will occur through the same model. We also extract the key decisions of MVTRF to analyze why the failure will occur. These failure details are useful for verifying and handling SSD failures. The proposed MVTRF is evaluated on large-scale real data from data centers. The experimental results show that MVTRF has higher failure prediction accuracy and improves precision by 24.4% and recall by 18.6% on average compared to existing schemes. The results also demonstrate the effectiveness of MVTRF on failure type and time prediction and failure cause identification, which helps to improve the efficiency of failure handling.

## Fast Application Launch on Personal Computing/Communication Devices

Junhee Ryu, SK hynix; Dongeun Lee, Texas A&M University - Commerce; Kang G. Shin, University of Michigan; Kyungtae Kang, Hanyang University

We present Paralfetch, a novel prefetcher to speed up app launches on personal computing/communication devices by: 1) accurate collection of launch-related disk read requests, 2) pre-scheduling of these requests to improve I/O throughput during prefetching, and 3) overlapping app execution with disk prefetching for hiding disk access time from the app execution. We have implemented Paralfetch under Linux kernels on a desktop/laptop PC, a Raspberry Pi 3 board, and an Android smartphone. Tests with popular apps show that Paralfetch significantly reduces app launch times on flash-based drives, and outperforms GSoC Prefetch and FAST, which are representative app prefetchers available for Linux-based systems.
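
The abstract says only that launch-related read requests are pre-scheduled to improve I/O throughput. One plausible reading, shown here purely as an assumption and not as Paralfetch's actual algorithm, is to sort the collected requests by block address and merge ranges that touch or overlap, so that the prefetcher issues fewer and larger reads. The function name `preschedule` and the `(start_block, num_blocks)` request format are hypothetical.

```python
def preschedule(requests):
    """Sort (start_block, num_blocks) requests and merge adjacent ones.

    Illustrative assumption only: the real prefetcher may order and
    merge requests differently.
    """
    merged = []
    for start, length in sorted(requests):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This range touches or overlaps the previous one: extend it.
            prev_start, prev_len = merged[-1]
            end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, length))
    return merged


if __name__ == "__main__":
    launch_reads = [(100, 8), (0, 4), (4, 4), (104, 8)]
    print(preschedule(launch_reads))
```

With the sample input, the four recorded reads collapse into two contiguous ranges.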

## Integrated Host-SSD Mapping Table Management for Improving User Experience of Smartphones

Yoona Kim and Inhyuk Choi, Seoul National University; Juhyung Park, Jaeheon Lee, and Sungjin Lee, DGIST; Jihong Kim, Seoul National University

Host Performance Booster (HPB) was proposed to improve the performance of high-capacity mobile flash storage systems by utilizing unused host DRAM. In this paper, we investigate how HPB should be managed so that HPB-enabled high-performance mobile storage systems can enhance the user experience of smartphones. From our empirical study on Android environments, we identified two requirements for an efficient HPB management scheme in smartphones. First, HPB should be managed in a foreground app-centric manner so that the user-perceived latency can be greatly reduced. Second, the capacity of the HPB memory should be dynamically adjusted so as not to degrade the user experience of the foreground app. As an efficient HPB management solution that meets the identified requirements, we propose an integrated host-SSD mapping table management scheme, HPBvalve, for smartphones. HPBvalve prioritizes the foreground app in managing mapping table entries in the HPB memory. HPBvalve dynamically resizes the overall capacity of the HPB memory depending on the memory pressure status of the smartphone. Our experimental results using the prototype implementation demonstrate that HPBvalve improves UX-critical app launching time by up to 43% (250 ms) over the existing HPB management scheme, without negatively affecting memory pressure. Meanwhile, the L2P mapping misses are alleviated by up to 78%.
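
The two requirements above (foreground-app priority and a dynamically adjustable capacity) can be pictured with a small mapping cache. This is a minimal sketch under stated assumptions, not HPBvalve's design: the class `MappingCache`, its LRU ordering, and the rule "evict background entries first" are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class MappingCache:
    """Toy L2P cache: LRU order, background entries evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps logical page number -> (physical page number, is_foreground).
        self.entries = OrderedDict()

    def put(self, lpn, ppn, foreground):
        self.entries.pop(lpn, None)
        self.entries[lpn] = (ppn, foreground)
        self._evict()

    def get(self, lpn):
        if lpn not in self.entries:
            return None  # mapping miss
        self.entries.move_to_end(lpn)  # mark as most recently used
        return self.entries[lpn][0]

    def resize(self, capacity):
        """Shrink or grow the cache, e.g. in response to memory pressure."""
        self.capacity = capacity
        self._evict()

    def _evict(self):
        while len(self.entries) > self.capacity:
            # Prefer the least recently used background entry.
            victim = next(
                (k for k, (_, fg) in self.entries.items() if not fg), None
            )
            if victim is None:
                victim = next(iter(self.entries))  # oldest foreground entry
            del self.entries[victim]


if __name__ == "__main__":
    cache = MappingCache(capacity=3)
    cache.put(1, 10, foreground=True)
    cache.put(2, 20, foreground=False)
    cache.put(3, 30, foreground=False)
    cache.put(4, 40, foreground=True)  # evicts background entry 2
    cache.resize(2)                    # evicts background entry 3
    print(sorted(cache.entries))
```

In this sketch, shrinking the cache removes background-app mappings before any foreground mapping, so the foreground app keeps its hits for as long as capacity allows.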