All sessions will be held in the Santa Clara Ballroom unless otherwise noted.
All the times listed below are in Pacific Standard Time (UTC-8).
Papers and Proceedings
Papers are available for download below to registered attendees now and to everyone beginning on Tuesday, February 24. Paper abstracts and proceedings front matter are available to everyone now. Copyright to the individual works is retained by the author(s).
Proceedings Front Matter
Proceedings Cover |
Title Page and List of Organizers |
Message from the Program Co-Chairs |
Table of Contents

9:00 am–9:15 am
Opening Remarks and Erik Riedel Best Paper Awards
Program Co-Chairs: André Brinkmann, Johannes Gutenberg-Universität Mainz, and Philip Shilane, Dell Technologies
9:15 am–10:35 am
Cloud Technologies I
Session Chair: Ming Zhao, Arizona State University
Here, There and Everywhere: The Past, the Present and the Future of Local Storage in Cloud
Leping Yang, Shanghai Jiao Tong University; Yanbo Zhou, Gong Zeng, Li Zhang, Saisai Zhang, Ruilin Wu, Chaoyang Sun, Shiyi Luo, Wenrui Li, Keqiang Niu, Xiaolu Zhang, Junping Wu, Jiaji Zhu, and Jiesheng Wu, Alibaba Group; Mariusz Barczak and Wayne Gao, Solidigm; Ruiming Lu, Erci Xu, and Guangtao Xue, Shanghai Jiao Tong University
Awarded Best Paper!
Cloud local storage has been a popular service among many vendors thanks to its near-physical performance and affordable price. In this paper, we revisit the evolution of the cloud local storage at Alibaba Cloud. We systematically analyze and evaluate the motivations, architectures and pros/cons of different methodologies from user space stack to hardware offloading. We also explore the future of local storage including a hybrid solution by integrating Elastic Block Storage to achieve better performance, availability and cost efficiency.
Cost-efficient Archive Cloud Storage with Tape: Design and Deployment
Qing Wang, Tsinghua University; Fan Yang, Qiang Liu, and Geng Xiao, Huawei Cloud; Yongpeng Chen and Hao Lan, Tsinghua University; Leiming Chen, Bangzhu Chen, Chenrui Liu, Pingchang Bai, Bin Huang, Zigan Luo, Mingyu Xie, and Yu Wang, Huawei Cloud; Youyou Lu, Tsinghua University; Huatao Wu, Huawei Cloud; Jiwu Shu, Tsinghua University and Minjiang University
TapeOBS is an archive storage service offered by Huawei Cloud, which delivers high cost-efficiency by leveraging tape to store large volumes of archived data. Although tape boasts a low total cost of ownership, its inherent characteristics (e.g., a limited number of drives within a tape library) pose unique challenges when developing a large-scale distributed storage system. To address these challenges, we take a holistic approach in designing TapeOBS. At the high level, we introduce a fully asynchronous tape pool, which supports data scheduling and erasure coding in a batched manner, aligning with the features of tape hardware. Within a tape library, we design a tape-tailored local storage engine and incorporate techniques such as dedicated drives to optimize performance. TapeOBS began its gradual rollout at the end of 2022 and officially started serving customers in 2024. As of this writing, TapeOBS has stored hundreds of petabytes of raw user data.
Towards Condensed and Efficient Read-Only File System via Sort-Enhanced Compression
Hao Huang, Yifeng Zhang, Yanqi Pan, Wen Xia, Xiangyu Zou, and Darong Yang, Harbin Institute of Technology, Shenzhen; Jubin Zhong and Hua Liao, Huawei Technologies Co., Ltd
Read-only compressed file systems have become increasingly popular in space-sensitive scenarios, such as IoT and Docker containers. To construct condensed images, they divide the data into blocks (e.g., 1 MB) and compress blocks separately. However, we observe that block-based compression cannot fully utilize the compression benefits due to the data mixture problem, while its performance issues hinder practical usage.
We propose RubikFS, a sort-enhanced read-only file system. Our key idea is to solve data mixture by sorting and clustering similar data chunks in a file system-favored block granularity. This is achieved by a similarity sorter, which builds a similarity graph to measure the similarity of data chunks and clusters similar chunks by subgraph partitioning. Moreover, sorting can also group data with the same hotness to minimize read amplification. We then introduce an array of techniques, including a data grouper, a data chunker, and a hotness grouper, to make RubikFS both condensed and efficient. Experiments suggest that, compared to existing read-only compressed file systems, RubikFS increases the compression ratio by up to 42.60% and reduces unnecessary reads by up to 70.70%.
ACOS: Apple’s Geo-Distributed Object Store at Exabyte Scale
Benjamin Baron, Aline Bousquet, Eric Metens, Swapnil Pimpale, Nick Puz, Marc de Saint Sauveur, Varsha Muzumdar, and Vinay Ari, Apple
Over the last two decades, with the advent of mobile computing and Internet streaming, Apple has expanded its user base and services significantly. With this growth, we have seen an increased volume and diversity of data storage, ranging from backups, personal photos and videos, to music libraries, TV shows, and live streaming. In this paper, we present ACOS, Apple's object store designed to meet the specific requirements of user-facing and internal services, by accommodating a wide range of content and access patterns. ACOS has been in production for over a decade, storing several exabytes of objects and serving billions of requests per day. With its geo-replicated architecture using both local and regional replication mechanisms, ACOS is cost-efficient and highly scalable, available, and durable. The evaluation of our production deployment shows its throughput and latency performance as well as its resilience to hardware and data center failures. This paper presents the design and evolution of ACOS, evaluates its performance in production, and demonstrates its capacity to scale and support Apple's current storage needs and future growth.
Note: This paper's title and abstract have been updated per the errata slip published in the conference proceedings.
10:35 am–11:05 am
Coffee and Tea Break
Mezzanine East/West
11:05 am–12:25 pm
AI and LLMs I
Session Chair: Bingzhe Li, The University of Texas at Dallas
SolidAttention: Low-Latency SSD-based Serving on Memory-Constrained PCs
Xinrui Zheng, Dongliang Wei, Jianxiang Gao, Yixin Song, Zeyu Mi, and Haibo Chen, Shanghai Jiao Tong University
AI personal computers (AIPCs) enable the local deployment of large language model (LLM) inference, offering enhanced privacy guarantees and customizable serving. However, such deployments are constrained by limited memory capacity, primarily due to the substantial key-value (KV) cache overhead. This paper introduces SolidAttention, an LLM inference engine which addresses these limitations through a tight co-design of dynamic attention sparsity algorithms and SSD-based storage management. Specifically, to maximize SSD bandwidth utilization, SolidAttention consolidates multiple KV pairs into coarse-grained blocks and implements speculative prefetching mechanisms that exploit temporal locality in sparse attention. By fine-grained orchestration of computation and I/O operations while reusing synchronization points, SolidAttention further minimizes SSD-induced blocking latency. With a 128k-token context, SolidAttention improves the inference speed by up to 3.1× and reduces the KV cache memory footprint by up to 98% without compromising inference accuracy.
CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving
Yang Liu and Yunfei Gu, Shanghai Jiao Tong University; Liqiang Zhang, Jinan Inspur Data Technology Co., Ltd; Chentao Wu, Guangtao Xue, Jie Li, and Minyi Guo, Shanghai Jiao Tong University; Junhao Hu, Peking University; Jie Meng, Huawei Cloud
Large Language Models (LLMs) are increasingly deployed in agent-based applications with complex prompt structures comprising both invariant and dynamic segments. Existing KV cache reuse strategies, Position-Dependent Caching (PDC) and Position-Independent Caching (PIC), inadequately address these scenarios, either imposing strict positional constraints or introducing significant computational overhead due to Positionally Misaligned KV Drift (PMKD) and window padding problems. We identify a distinct pattern in agent workflows termed Relative-Position-Dependent Caching (RPDC), where reusable segments maintain consistent relative ordering despite absolute position shifts. To address this pattern, we propose CacheSlide, a novel KV cache management system that enhances positional-encoding similarity for fixed segments, computes attention for only a minimal subset of tokens, combines new and cached KVs using learned weights, and implements layer-wise and spill-aware KV-cache optimizations. Our implementation extends vLLM’s KV cache management with Chunked Contextual Position Encoding and Weighted Correction Attention. Experimental evaluation across multiple LLMs and agent benchmarks demonstrates that CacheSlide significantly outperforms state-of-the-art baselines, achieving 3.11–4.3× reduction in latency and 3.5–5.8× improvement in throughput, establishing a new efficiency frontier for agent-based LLM applications.
Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional Computation–Storage Awareness
Shipeng Hu and Guangyan Zhang, Tsinghua University; Yuqi Zhou, China University of Geosciences Beijing; Yaya Wei and Ziyan Zhong, China Telecom Omni-channel Operation Center; Jike Chen, Tsinghua University
In interactive LLM serving, historical key–value tensors (KVs) of multi-round conversations are often cached in a two-tier storage system consisting of host memory and SSDs, which provides large capacity at low cost. However, loading KVs from two-tier storage in existing approaches increases serving latency by up to 3.8× and decreases throughput by up to 2.0× compared to an ideal large-memory setting on our interactive conversation workload. This inefficiency arises from poor coordination between the compute engine and the two-tier storage.
This paper proposes Bidaw, an efficient KV caching approach with two-tier storage that enables bidirectional awareness between compute and storage. Bidaw introduces two key mechanisms. First, the compute engine schedules requests with KV-loading latency awareness by separating requests whose KVs reside in different storage layers and reordering them by KV size to reduce blocking. Second, the storage system improves host memory hit rates by leveraging LLM-generated responses to predict user access patterns during KV eviction. For further optimization, Bidaw balances storage footprint against computational savings by selectively caching storage-efficient history tensors.
Experiments on our interactive conversation workload and a public multi-round conversation workload of interactive LLM serving show that Bidaw reduces response latency by up to 3.58× and improves throughput by up to 1.83× over state-of-the-art approaches, approaching the theoretical upper bound achieved when all KVs reside entirely in host memory.
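As a rough illustration of the first mechanism (latency-aware request scheduling), the following minimal Python sketch separates requests by the storage tier holding their KVs and orders each group by KV size. It is not Bidaw's implementation: the `Request` type, the tier labels, the choice to run memory-resident requests first, and the smallest-first ordering are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: str
    tier: str       # hypothetical labels: "memory" or "ssd"
    kv_bytes: int   # size of the cached KVs that must be loaded


def schedule(requests):
    """Return requests grouped by storage tier, smallest KVs first.

    Assumption: memory-resident requests run first so they are not
    blocked behind slower SSD loads; within a tier, smaller KVs load
    sooner, so they are placed earlier.
    """
    in_memory = sorted((r for r in requests if r.tier == "memory"),
                       key=lambda r: r.kv_bytes)
    on_ssd = sorted((r for r in requests if r.tier == "ssd"),
                    key=lambda r: r.kv_bytes)
    return in_memory + on_ssd


pending = [
    Request("a", "ssd", 800),
    Request("b", "memory", 300),
    Request("c", "ssd", 200),
    Request("d", "memory", 100),
]
order = [r.req_id for r in schedule(pending)]
```

Under these assumptions, the short memory-resident requests are dispatched while the SSD loads for the remaining requests could proceed in the background.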
Accelerating Model Loading in LLM Inference by Programmable Page Cache
Yubo Liu, Hongbo Li, Xiaojia Huang, Yongfeng Wang, Hanjun Guo, Hui Chen, Yuxin Ren, and Ning Jia, Huawei Technologies Co., Ltd.
This paper examines the model loading bottleneck during the LLM inference startup. Existing solutions often optimize model loading performance at the expense of compatibility. However, compatibility is a crucial factor determining whether a technology can be widely applied in real-world scenarios. This work achieves both high performance and strong compatibility by optimizing the cache policy of the kernel file system. We design PPC, a programmable page cache framework that allows users to customize page cache policies in a non-intrusive, flexible, and lightweight manner. Furthermore, we design MAIO, a cache policy implemented based on PPC, to optimize model loading. MAIO introduces an I/O template-based mechanism to fully utilize SSD bandwidth, XPU affinity, and data locality to enhance the efficiency of prefetching and eviction. Our evaluation shows that MAIO reduces the model loading latency by up to 79% compared to existing optimizations. In a real-world application, MAIO achieves up to 36% improvement in inference throughput over other tested solutions in the elastic deployment scenario.
12:25 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:40 pm
Indexing
Session Chair: Zhichao Cao, Arizona State University
OdinANN: Direct Insert for Consistently Stable Performance in Billion-Scale Graph-Based Vector Search
Hao Guo and Youyou Lu, Tsinghua University
Approximate Nearest Neighbor Search (ANNS) is widely used in various scenarios. For billion-scale ANNS, on-disk graph-based indexes, which organize the vectors as a graph and store them on disk, are favored for their performance and cost-efficiency. However, existing indexes cannot maintain a stable search performance while inserting new vectors.
In this paper, we propose to use direct insert, which directly inserts vectors into the on-disk index, rather than buffering them in memory and merging them to disk in batches like existing systems. This approach can even out the interference of insert with frontend search, thus stabilizing the performance. We evaluate direct insert by integrating it into a billion-scale graph-based ANNS index named OdinANN. With a fixed insert rate, OdinANN outperforms state-of-the-art ANNS indexes in search latency and throughput, and it consistently shows stable performance in billion-scale vector datasets.
DMTree: Towards Efficient Tree Indexing on Disaggregated Memory via Compute-side Collaborative Design
Guoli Wei, University of Science and Technology of China; Yongkun Li, University of Science and Technology of China and Anhui Provincial Key Laboratory of High Performance Computing, USTC; Haoze Song, The University of Hong Kong; Tao Li and Lulu Yao, University of Science and Technology of China; Yinlong Xu, University of Science and Technology of China and Anhui Provincial Key Laboratory of High Performance Computing, USTC; Heming Cui, The University of Hong Kong
Distinguished Artifact Award Winner
Disaggregated memory (DM) separates computing and memory resources into distinct resource pools, enhancing resource utilization and scalability. However, this new architecture presents fundamental design challenges for range indexes. Existing works fail to achieve high performance: they either suffer from the network bandwidth bottleneck or are fragile due to high RDMA IOPS demands. The key reason is that they all follow a typical design paradigm that uses private compute-side caching, where each compute server holds a private cache space and aggressively consumes the bandwidth and IOPS between compute servers and memory servers.
We propose a new compute-side collaborative design. It offloads data locating and locking operations from memory servers to compute servers and thus fully utilizes unsaturated RDMA resources between compute servers to mitigate bottlenecks on memory servers. We implement a prototype called DMTree. Experiments show that DMTree outperforms existing state-of-the-art range indexes on DM for both point operations (i.e., searches, inserts, and updates) and range operations (i.e., scans) under various workloads and parameter settings.
An Efficient Cloud Storage Model with Compacted Metadata Management for Performance Monitoring Timeseries Systems
Kai Zhang, The Chinese University of Hong Kong; Tianyu Wang, Shenzhen University; Zili Shao, The Chinese University of Hong Kong
Cloud-based performance monitoring timeseries systems are emerging due to their flexibility and pay-as-you-go capabilities. However, these systems encounter a major bottleneck in query performance, mainly attributed to the prolonged access latency of cloud storage and the metadata redundancy of a large number of timeseries. Thus, it is critical to optimize query performance within cloud environments and reduce metadata redundancy.
In this paper, we propose CloudTS, which is a novel timeseries data storage model with query optimization for cloud storage. CloudTS separately manages metadata and data, and introduces an efficient global metadata management for both space saving and query speedup. CloudTS also transparently supports the time-partitioned tag-based query model in performance monitoring timeseries systems. For metadata, a global tag dictionary is built to reduce metadata redundancy and a novel timeseries-tag mapping technique with a two-dimension bitmap is designed so the mapping of timeseries and tags can be efficiently accomplished to support tag-based queries. For data, the compressed data chunks are put into objects by timeseries group. We have implemented a fully functional prototype of CloudTS and evaluated it with production timeseries data and synthetic workloads based on Amazon S3. For comparison, the evaluation uses Cortex, a cloud-based timeseries system widely adopted in industry, and Apache Parquet and JSON Time Series, two representative cloud storage formats. Experimental results show that CloudTS can improve query performance by 1.37× on average compared with Cortex, and outperforms Apache Parquet and JSON Time Series as well.
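To make the metadata idea concrete, here is a minimal Python sketch of a global tag dictionary paired with a two-dimensional bitmap (one row per tag, one bit per timeseries). It is only an illustration under assumed names (`TagIndex`, `add_series`, `query`) and an assumed layout; the abstract does not specify CloudTS's actual data structures.

```python
class TagIndex:
    """Toy tag dictionary plus tag-by-series bitmap (illustrative only)."""

    def __init__(self):
        self.tag_ids = {}    # "key=value" string -> row number (global dictionary)
        self.rows = []       # rows[tag_id] is an int used as a bitset over series ids
        self.num_series = 0

    def add_series(self, tags):
        """Register a timeseries with the given tag dict; return its id."""
        sid = self.num_series
        self.num_series += 1
        for key, value in tags.items():
            tag = f"{key}={value}"
            # Each distinct tag string is stored once, however many series use it.
            tid = self.tag_ids.setdefault(tag, len(self.rows))
            if tid == len(self.rows):
                self.rows.append(0)
            self.rows[tid] |= 1 << sid
        return sid

    def query(self, tags):
        """Return ids of series that carry every requested tag."""
        result = (1 << self.num_series) - 1   # start with all series selected
        for key, value in tags.items():
            tid = self.tag_ids.get(f"{key}={value}")
            if tid is None:
                return []
            result &= self.rows[tid]          # intersect bitmap rows
        return [sid for sid in range(self.num_series) if (result >> sid) & 1]


index = TagIndex()
index.add_series({"host": "web1", "metric": "cpu"})
index.add_series({"host": "web1", "metric": "mem"})
index.add_series({"host": "web2", "metric": "cpu"})
cpu_on_web1 = index.query({"host": "web1", "metric": "cpu"})
all_cpu = index.query({"metric": "cpu"})
```

In this sketch a tag-based query reduces to bitwise ANDs over bitmap rows, and the three series share four dictionary entries instead of storing six tag strings.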
"Range as a Key" is the Key! Fast and Compact Cloud Block Store Index with RASK
Haoru Zhao, Mingkai Dong, and Erci Xu, Shanghai Jiao Tong University; Zhongyu Wang, Alibaba Group; Haibo Chen, Shanghai Jiao Tong University
In a cloud block store, indexing is on the critical path of I/O operations and typically resides in memory. With the scaling of users and the emergence of denser storage media, the index has become a primary memory consumer, causing memory strain. Our extensive analysis of production traces reveals that write requests exhibit a strong tendency to target continuous block ranges in cloud storage systems. Thus, compared to current per-block indexing, our insight is that we should directly index block ranges (i.e., range-as-a-key) to save memory.
In this paper, we propose RASK, a memory-efficient and high-performance tree-structured index that natively indexes ranges. While range-as-a-key offers the potential to save memory and improve performance, realizing this idea is challenging due to the range overlap and range fragmentation issues. To handle range overlap efficiently, RASK introduces the log-structured leaf, combined with range-tailored search and garbage collection. To reduce range fragmentation, RASK employs range-aware split and merge mechanisms. Our evaluations on four production traces show that RASK reduces memory footprint by up to 98.9% and increases throughput by up to 31.0× compared to ten state-of-the-art indexes.
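The range-as-a-key idea, and the range overlap problem it raises, can be sketched in a few lines of Python. The example below is a plain sorted-list index, not RASK's tree or its log-structured leaf; the class name `RangeIndex`, the half-open `[start, end)` convention, and the integer "location" values are assumptions made for illustration.

```python
import bisect


class RangeIndex:
    """Maps non-overlapping half-open block ranges [start, end) to locations."""

    def __init__(self):
        self.starts = []    # sorted range starts, parallel to self.entries
        self.entries = []   # (start, end, location of the first block)

    def insert(self, start, end, location):
        """Write a range; overlapped parts of older ranges are trimmed or split."""
        kept = []
        for s, e, loc in self.entries:
            if e <= start or s >= end:       # no overlap: keep as-is
                kept.append((s, e, loc))
                continue
            if s < start:                    # surviving left remainder
                kept.append((s, start, loc))
            if e > end:                      # surviving right remainder
                kept.append((end, e, loc + (end - s)))
        kept.append((start, end, location))
        kept.sort()
        self.entries = kept
        self.starts = [s for s, _, _ in kept]

    def lookup(self, block):
        """Return the location of one block, or None if it is unmapped."""
        i = bisect.bisect_right(self.starts, block) - 1
        if i < 0:
            return None
        s, e, loc = self.entries[i]
        return loc + (block - s) if block < e else None


idx = RangeIndex()
idx.insert(0, 100, 1000)     # 100 blocks indexed by a single entry
idx.insert(40, 60, 5000)     # an overwrite splits the older range in three
```

One entry covers 100 blocks where a per-block index would need 100 entries. The overwrite also shows why overlap and fragmentation need dedicated handling: a single overlapping write turns one entry into three.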
Holistic and Automated Task Scheduling for Distributed LSM-tree-based Storage
Yuanming Ren, Siyuan Sheng, and Zhang Cao, The Chinese University of Hong Kong; Yongkun Li, University of Science and Technology of China; Patrick P. C. Lee, The Chinese University of Hong Kong
Mitigating latency fluctuations for distributed key-value (KV) stores is critical, yet it is often hindered by the tight coupling of foreground and background tasks related to data distribution and storage management. Using Cassandra, a widely deployed distributed LSM-tree-based KV store, as a case study, we observe that foreground read tasks are often interfered with by background compaction tasks, yet compaction tasks are critical for achieving high read performance. We propose HATS, a holistic and automated task scheduling framework that judiciously co-schedules read and compaction tasks, so as to mitigate latency fluctuations and achieve load balancing. HATS features coarse-grained and fine-grained replica selection for reads as well as adaptive rate control for compaction. We implement HATS atop Cassandra and demonstrate its improved latency and throughput performance over state-of-the-art distributed LSM-tree-based KV stores.
3:40 pm–4:10 pm
Coffee and Tea Break
Mezzanine East/West
4:10 pm–5:10 pm
Work-in-Progress Reports (WiPs)
Short, pithy, and fun, Work-in-Progress reports introduce interesting new or ongoing work. View the list of accepted Work-in-Progress Reports.
5:30 pm–7:00 pm
FAST '26 Poster Session and Reception
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over food and beverages. View the complete list of accepted posters.
9:00 am–10:00 am
Keynote Address
Everything You Always Wanted to Know about Storage Analysis (But Were Afraid to Ask)
Erez Zadok, Stony Brook University
Storage, often the slowest component of modern systems, has evolved into a complex and challenging field. Over the years, the storage stack has grown deeper, making it difficult to work on storage systems, which are unforgiving of errors. Therefore, understanding their behavior and performance under various conditions is crucial. This talk delves into my 30-year-long journey in characterizing storage and file system performance, highlighting mistakes made along the way, insights gained from benchmarking, and strategies to optimize storage system performance (hint: "it depends"). Given the ubiquitous nature of stored data, storage researchers have a diverse range of fields and areas to explore. This talk also weaves in my experiences collaborating with researchers in different fields, including Formal Methods, NLP, AI/ML, Cryptography, and Visual Analytics.

Erez Zadok is a Professor of Computer Science at Stony Brook University. He received his PhD degree from Columbia University. He directs the File Systems and Storage Lab (FSL) at the Computer Science Department at Stony Brook University, where he joined as faculty in 2001. His current research interests include file systems and storage, operating systems, performance and benchmarking, energy efficiency, security and privacy, networking, and applied ML. He received two SUNY Chancellor's Awards for Excellence, the NSF CAREER Award, and several industry awards. Zadok's funding exceeds $25M and includes federal, state, and industry awards. Zadok has published over 150 refereed journal/article papers and one book. Zadok has graduated 14 PhDs and over 170 MS students, and has advised over 45 undergraduates. Zadok's service includes chairing several major conferences (FAST and USENIX ATC) and serving on several steering committees. Zadok is the Editor-in-Chief of ACM Transactions on Storage (TOS). He is a member of USENIX, ACM SIGOPS, and the IEEE Computer Society.
10:00 am–10:20 am
FAST '26 Test of Time Award Presentation
Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho
Read the original paper:
F2FS: A New File System for Flash Storage
Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho
Published in the Proceedings of the 13th USENIX Conference on File and Storage Technologies, February 2015
10:20 am–10:50 am
Coffee and Tea Break
Mezzanine East/West
10:50 am–12:30 pm
AI and LLMs II
Session Chair: Jian Zhang, The Hong Kong University of Science and Technology
Preparation Meets Opportunity: Enhancing Data Preprocessing for ML Training With Seneca
Omkar Desai, Syracuse University; Ziyang Jiao, Huaibei Normal University; Shuyi Pei, Samsung Semiconductor; Janki Bhimani, Florida International University; Bryan S. Kim, Syracuse University
Input data preprocessing is a common bottleneck when concurrently training multimedia machine learning (ML) models in modern systems. To alleviate these bottlenecks and reduce the training time for concurrent jobs, we present Seneca, a data loading system that optimizes cache partitioning and data sampling for the data storage and ingestion (DSI) pipeline. The design of Seneca contains two key techniques. First, Seneca uses a performance model for the data pipeline to optimally partition the cache for three different forms of data (encoded, decoded, and augmented). Second, Seneca opportunistically serves cached data over uncached ones during random batch sampling so that concurrent jobs benefit from each other. We implement Seneca by modifying PyTorch and demonstrate its effectiveness by comparing it against several state-of-the-art caching systems for DNN training. Seneca reduces the makespan by 45.23% compared to PyTorch and increases data processing throughput by up to 3.45× compared to the next best dataloader.
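The second technique (opportunistically preferring cached samples during random batch sampling) might look roughly like the Python sketch below. It is not Seneca's sampler: the function name `sample_batch`, the per-epoch `remaining` set, and the strict hits-before-misses policy are assumptions chosen to keep the example small.

```python
import random


def sample_batch(remaining, cached, batch_size, rng):
    """Draw a random batch, serving cached sample ids before uncached ones.

    `remaining` holds the ids not yet used this epoch and is updated in
    place, so every sample is still visited exactly once per epoch.
    """
    hits = [i for i in sorted(remaining) if i in cached]
    misses = [i for i in sorted(remaining) if i not in cached]
    rng.shuffle(hits)      # keep the order random within each group
    rng.shuffle(misses)
    batch = (hits + misses)[:batch_size]
    remaining.difference_update(batch)
    return batch


rng = random.Random(0)
remaining = set(range(10))       # one epoch over ten samples
cached = {1, 3, 5, 7}            # ids another job has already cached
first = sample_batch(remaining, cached, 4, rng)
second = sample_batch(remaining, cached, 4, rng)
third = sample_batch(remaining, cached, 4, rng)
```

Here the first batch is served entirely from the cache, and the later batches cover the uncached samples, so the epoch still sees each sample once.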
GPU Checkpoint/Restore Made Fast and Lightweight
Shaoxun Zeng, Tingxu Ren, Jiwu Shu, and Youyou Lu, Tsinghua University
Distinguished Artifact Award Winner
System-level GPU checkpoint/restore (C/R) enables several critical features such as elastic scaling, task switching, and fault tolerance, for modern GPU workloads in a unified and application-transparent manner. However, existing approaches present fundamental limitations: they fail to simultaneously achieve low C/R latency and low overhead imposed on normal GPU execution, while also lacking efficient support for incremental checkpointing. We propose GCR, a GPU checkpoint/restore system that addresses all these limitations simultaneously. GCR employs a hybrid C/R scheme through control/data separation to deliver low C/R latency and negligible overhead imposed on normal GPU execution. To efficiently support incremental checkpointing, GCR introduces shadow execution on the CPU to reduce the overhead of dirty buffer identification, utilizing dirty templates for both lightweight CPU shadow execution and identification at a fine-grained instruction level.
Our evaluations demonstrate that GCR reduces GPU checkpointing latency by 72.1% and 63.6% compared to cuda-ckpt (NVIDIA’s official solution) and PhOS (the current state-of-the-art), respectively, and restoration latency by 54.2% and 87.1%, while imposing negligible overhead (less than 1%). GCR also supports efficient incremental checkpointing, which reduces checkpoint sizes by 86.6% and latency by 43.8%.
Fast Cloud Storage for AI Jobs via Grouped I/O API with Transparent Read/Write Optimizations
Yingyi Hao, Shanghai Jiao Tong University; Ting Yao, Huawei Cloud; Xingda Wei, Dingyan Zhang, and Tianle Sun, Shanghai Jiao Tong University; Yiwen Zhang, Zhiyong Fu, and Huatao Wu, Huawei Cloud; Rong Chen, Shanghai Jiao Tong University
The emergence of AI workloads has placed rigorous bandwidth requirements on cloud storage, which are challenging to meet due to inherent hardware restrictions in cost-efficient disaggregated storage architectures, as well as the non-triviality of implementing application-tailored optimizations.
This paper presents AITURBO, a cloud storage system for AI jobs with high bandwidth demands. AITURBO first utilizes the high-bandwidth compute fabric between accelerators to meet AI applications’ bandwidth demands without incurring additional storage cost. AITURBO further introduces a simple yet powerful grouped I/O API that allows AITURBO to automatically derive optimized read and write plans at the storage layer. These plans enable optimizations that are comparable to or better than application-level ones, because they capture common I/O patterns in AI workloads and have a holistic view from the storage layer’s perspective. Under common AI workloads such as checkpoint reads and writes and KV-cache reads, AITURBO achieves comparable or better performance than state-of-the-art systems, with and without application-level optimizations, including systems such as Megatron, Gemini, and Mooncake, typically with minimal application-level code changes. AITURBO has been deployed in training jobs in HUAWEI’s production cloud to support efficient training workloads.
AdaCheck: An Adaptive Checkpointing System for Efficient LLM Training with Redundancy Utilization
Weijie Liu, Shengwei Li, Zhiquan Lai, and Keshi Ge, National University of Defense Technology; Qiaoling Chen, Nanyang Technological University; Peng Sun, Shanghai AI Laboratory; Dongsheng Li and Kai Lu, National University of Defense Technology
The development of large language models (LLMs) relies on sophisticated parallel training techniques, involving prolonged training runs with thousands of workers. Checkpointing systems are essential for handling failures in large-scale training. However, existing checkpointing systems are mostly offline solutions tailored to specific parallelisms or model architectures. They lack adaptability to diverse parallel strategies and fail to recognize that most model states can be excluded from checkpoints, missing optimization opportunities.
In this paper, we present AdaCheck, an adaptive checkpointing system that achieves minimized checkpoint size by characterizing and exploiting state redundancy across various parallelisms, model architectures, and training iterations. We model the state redundancy induced by parallelisms and model architectures using the abstraction of tensor redundancy, and propose an offline redundancy utilization method to create checkpoints with a reduced set of states. To fully identify tensor redundancy, we design an efficient redundancy detector, which employs a hash-based data consistency check method and a ring-based communication algorithm. Besides, we introduce a novel online redundancy utilization method, which further reduces checkpoint size by exploiting the state redundancy across training iterations.
Experimental results demonstrate that AdaCheck is adaptable to various parallelisms, including irregular parallelisms generated by automatic planners, as well as diverse model architectures, encompassing both dense and sparse architectures. Compared with state-of-the-art checkpointing approaches, AdaCheck can reduce checkpoint size by 6.00–896×, increase the checkpointing frequency by 1.46–111×, and incur almost no overhead on training throughput for LLM training.
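A hash-based redundancy check of the kind described above can be illustrated with a short Python sketch. It covers only the hashing idea: the ring-based communication and the online, cross-iteration method are not modeled. The function `plan_checkpoint`, the use of SHA-256, and the "first worker that holds a tensor saves it" rule are assumptions for this example, not AdaCheck's design.

```python
import hashlib


def plan_checkpoint(worker_states):
    """Decide which worker saves which tensor, skipping identical copies.

    worker_states: {worker_id: {tensor_name: raw_bytes}}
    Returns {worker_id: [tensor names that worker must write]}.
    """
    seen = set()    # (tensor name, content digest) pairs already assigned
    plan = {}
    for worker in sorted(worker_states):
        plan[worker] = []
        for name in sorted(worker_states[worker]):
            digest = hashlib.sha256(worker_states[worker][name]).hexdigest()
            if (name, digest) not in seen:
                # First time this exact content appears: this worker saves it.
                seen.add((name, digest))
                plan[worker].append(name)
    return plan


states = {
    0: {"layer0.weight": b"\x01\x02\x03", "optimizer.shard": b"shard-0"},
    1: {"layer0.weight": b"\x01\x02\x03", "optimizer.shard": b"shard-1"},
}
plan = plan_checkpoint(states)
```

In this toy setup the replicated weight is written once by worker 0, while each worker still saves its own distinct optimizer shard.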
Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC
Qingyuan Liu, Mo Zou, Hengbin Zhang, Dong Du, Yubin Xia, and Haibo Chen, Shanghai Jiao Tong University
Awarded Best Paper!
Distinguished Artifact Award Winner
File systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs significant overhead, especially as systems grow in complexity. This paper proposes a new paradigm, generative file systems, which leverages Large Language Models (LLMs) to generate and evolve a file system from prompts, effectively addressing the need for robust evolution. Despite the widespread success of LLMs in code generation, attempts to create a functional file system have thus far been unsuccessful, mainly due to the ambiguity of natural language prompts.
This paper introduces SYSSPEC, a framework for developing generative file systems. Its key insight is to replace ambiguous natural language with principles adapted from formal methods. Instead of imprecise prompts, SYSSPEC employs a multi-part specification that accurately describes a file system's functionality, modularity, and concurrency. The specification acts as an unambiguous blueprint, guiding LLMs to generate expected code flexibly. To manage evolution, we develop a DAG-structured patch that operates on the specification itself, enabling new features to be added without violating existing invariants. Moreover, the SYSSPEC toolchain features a set of LLM-based agents with mechanisms to mitigate hallucination during construction and evolution. We demonstrate our approach by generating SPECFS, a concurrent file system. SPECFS demonstrates an equivalent level of correctness to that of a manually coded baseline across hundreds of regression tests. We further confirm its evolvability by seamlessly integrating 10 real-world features from Ext4. Our work shows that a specification-guided approach makes generating and evolving complex systems not only feasible but also highly effective.
2:00 pm–3:40 pm
SSDs and CXL
Session Chair: Ethan Miller, University of California, Santa Cruz, and Pure Storage
Cylon: Fast and Accurate Full-System Emulation of CXL-SSDs
Dongha Yoon, Hansen Idden, Jinshu Liu, Berkay Inceisci, Sam H. Noh, and Huaicheng Li, Virginia Tech
We present Cylon, a fast and extensible full-system emulator for CXL-SSDs built on FEMU. Cylon bridges the gap between closed hardware prototypes and slow software simulators by faithfully reproducing sub-μs cache hits and tens-of-μs misses that fall to NAND through a hybrid execution path that mitigates hypervisor trap overheads. Cylon supports configurable caching policies and provides an application-level interface for hardware-software co-design. Validated against a real CXL-SSD prototype, Cylon accurately models performance across a wide range of applications, from microbenchmarks to full-scale workloads. Our evaluation shows that Cylon reproduces realistic latency distributions, executes unmodified applications at near bare-metal speed, and scales to system-level studies. By combining speed, fidelity, and extensibility, Cylon fills a critical gap for evaluating today’s CXL-SSDs and exploring next-generation architectures that blend CXL-enabled memory and storage semantics.
Xerxes: Extensive Exploration of Scalable Hardware Systems with CXL-Based Simulation Framework
Yuda An and Shushu Yi, Peking University; Bo Mao, Xiamen University; Qiao Li, Mohamed bin Zayed University of Artificial Intelligence; Mingzhe Zhang, Institute of Information Engineering, Chinese Academy of Sciences; Diyu Zhou, Peking University; Ke Zhou, Huazhong University of Science and Technology (HUST); Nong Xiao, Sun Yat-sen University; Guangyu Sun, Yingwei Luo, and Jie Zhang, Peking University
Compute Express Link (CXL) is an emerging industry standard that offers high-performance cache-coherent interconnects to heterogeneous devices, including host CPUs, computation accelerators, and memory devices. It aims to support high system scalability, peer-to-peer communication, and high-speed data transmission. To this end, the latest version of the CXL protocol introduces several new features, including port-based routing, device-managed coherence, and PCIe 6.0 support. However, the absence of CXL hardware and the methodological limitations of existing simulators hinder the exploration of these new architectures. To bridge this gap, we propose Xerxes, a novel simulation framework designed from the ground up to faithfully model the emerging features in the latest CXL protocol. It employs a dedicated interconnect layer to support interconnection within diverse system topologies. It also implements important components to conduct specific functions required by these features. Utilizing Xerxes, we comprehensively explore multiple aspects of CXL systems, including system topologies, device-managed coherence, and impacts of PCIe characteristics, and derive key observations that can inspire new designs of high-performance CXL systems. The Xerxes code is open source and available at https://github.com/ChaseLab-PKU/Xerxes.
Characterizing and Emulating FDP SSDs with WARP
Inho Song and Shoaib Asif Qazi, Virginia Tech; Javier González, Samsung Electronics; Matias Bjørling, Western Digital; Sam H. Noh and Huaicheng Li, Virginia Tech
Flexible Data Placement (FDP) promises to reduce write amplification by steering writes across reclaim unit handles (RUHs), yet outcomes vary widely across devices. This paper presents WARP, the first open emulator and comprehensive study of FDP SSDs. Our cross-device, cross-workload characterization shows that FDP sustains near-1 WAF when RUH isolation aligns with object lifetimes, but fails under misclassification, RUH interference, or adversarial invalidations. WARP reproduces hardware WAF trends while exposing per-RUH dynamics and configurable policies hidden in real devices. With WARP, we explore the firmware design space for FDP and demonstrate policies that reduce WAF beyond current hardware. By combining empirical characterization with a transparent emulator, this work advances FDP research from anecdotal reports to principled understanding and provides a platform for future FDP-aware system design.
ScaleSwap: A Scalable OS Swap System for All-Flash Swap Arrays
Taehwan Ahn, Chanhyeong Yu, Sangjin Lee, and Yongseok Son, Chung-Ang University
This paper presents a scalable OS swap system, ScaleSwap, designed to enhance core and SSD scalability on all-flash swap arrays. Specifically, ScaleSwap first adopts a one-to-one swap model in which each core exclusively manages its own swap resources, enabling core-centric swap in/out operations. Second, ScaleSwap devises opportunistic inter-core swap assistance, allowing each core to delegate swap metadata access to other cores as needed. Finally, ScaleSwap adopts core-affinity page and LRU management to mitigate LRU lock contention during swap in/out operations. We implement ScaleSwap in the Linux kernel and evaluate its performance on a 128-core machine with an all-flash swap array comprising eight NVMe SSDs. Our evaluation shows that ScaleSwap achieves up to 3.4× higher throughput and up to 11.5× lower average latency than the Linux swap system. Furthermore, ScaleSwap outperforms two prior systems, TMO and ExtMEM, by up to 64% and 5×, respectively.
DPAS: A Prompt, Accurate and Safe I/O Completion Method for SSDs
Dongjoo Seo, University of California, Irvine; Jihyeon Jung and Yeohwan Yoon, Kookmin University; Ping-Xiang Chen, University of California, Irvine; Yongsoo Joo and Sung-Soo Lim, Kookmin University; Nikil Dutt, University of California, Irvine
Modern SSDs demand faster I/O completion methods. While polling is a potential alternative to interrupts, it suffers under CPU contention. Hybrid polling mitigates this by sleeping early and polling later, yet it cannot keep up with rapidly varying I/O latencies and incurs context-switch overheads. We introduce PAS, an accurate latency tracking method for hybrid polling that adjusts sleep duration using the two most recent I/Os, and DPAS, which dynamically switches among polling, interrupts, and PAS to overcome the inherent drawbacks of hybrid polling. Experiments show that PAS reduces CPU usage by 21 percentage points compared to Linux hybrid polling for 4 KB random reads, and DPAS improves YCSB performance by 9% on a 3D XPoint SSD and 5% on a TLC NAND SSD, even under simultaneous CPU contention and I/O interference.
3:40 pm–4:10 pm
Coffee and Tea Break
Mezzanine East/West
4:10 pm–5:30 pm
Virtualization
Session Chair: Mai Zheng, Iowa State University
How Soon is Now? Preloading Images for Virtual Disks with ThinkAhead
Xinqi Chen, Shanghai Jiao Tong University; Yu Zhang, Alibaba Group; Erci Xu, Shanghai Jiao Tong University; Changhong Wang, Jifei Yi, Qiuping Wang, Shizhuo Sun, and Zhongyu Wang, Alibaba Group; Haonan Wu, Shanghai Jiao Tong University; Junping Wu, Hailin Peng, Rong Liu, Yinhu Wang, Jiaji Zhu, and Jiesheng Wu, Alibaba Group; Guangtao Xue, Shanghai Jiao Tong University; Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3; Patrick P. C. Lee, The Chinese University of Hong Kong
Efficient cloud computing relies on high-performance Elastic Block Storage (EBS) services, where virtual disk (VD) image loading significantly affects user experience. While the commonly used "lazy loading" approach reduces cold-start time from minutes to sub-seconds, our trace analysis of around 160,000 real-world image loading events in Alibaba EBS reveals that slow I/Os during initial block access constitute the primary performance bottleneck, accounting for 40% of all slow I/Os. We propose ThinkAhead, a data-driven image preloading system for VDs. ThinkAhead comprises various techniques to predict efficient block preloading sequences for images with historical traces based on runtime conditions and address corner cases with limited or no historical traces. Trace-driven simulation and cluster experiments show that ThinkAhead improves the data block hit rate by up to 7.27× and reduces tail waiting time by up to 98.7% across various image types compared to lazy loading and various baselines.
CoFS: A Filesystem for Fast Container Startup
Li Wang, Jinxu Du, Yang Yang, Qingbo Wu, Tao Liu, and Haoze Wu, KylinSoft
Running applications in containers has become a popular trend in the industry. A container cold start involves a sequential, time-consuming process of image downloading and image unpacking. The high cold-start latency significantly prolongs the startup time of containerized applications and could potentially violate responsiveness SLAs in serverless computing or during automatic service scaling to handle bursts of requests. To accelerate container startup, state-of-the-art systems pull the container image on demand. Unfortunately, they suffer from userspace I/O interposition overhead, poor maintainability, and/or performance fluctuation.
This paper presents CoFS, a novel filesystem based on extended FUSE for fast container startup. The insight is that the container image is built only once with a fixed read-only filesystem tree from the perspective of containers. This motivates CoFS to construct a minimal perfect hash function (MPHF) at image building time and to store metadata of files in a container image in a dense array indexed by their hash value. MPHF is collision-free and space-optimal. Leveraging these properties, CoFS completes a lookup request with less than one I/O operation in most cases (unless the filename is excessively long) from kernel space, effectively avoiding the costly userspace lookup process in FUSE. Furthermore, CoFS constructs another MPHF that enables parallel lookup based on full path hashing, further accelerating path resolution. For data access, CoFS leverages sparse files provided by the in-kernel host filesystem to implement fine-grained data caching, and accesses cached data from kernel space. The evaluation shows that CoFS outperforms state-of-the-art systems that achieve fast container startup, and compared to fuse-loopback, a FUSE-based loopback filesystem, the lookup performance improves by up to 86%.
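The idea of a minimal perfect hash over a fixed set of file names, with metadata stored in a dense array indexed by the hash value, can be sketched as follows. This is only an illustration under stated assumptions: the hash-and-displace construction, the function names (`build_mphf`, `lookup_slot`, `lookup`), and the sample paths and inode numbers are all hypothetical and are not taken from CoFS, and this toy version is not space-optimal.

```python
import hashlib


def _h(key: str, seed: int, n: int) -> int:
    """Seeded hash of a file name into the range [0, n)."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % n


def build_mphf(keys):
    """Hash-and-displace: find one seed per bucket so every key gets a unique slot."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[_h(key, 0, n)].append(key)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first; they are the hardest to fit.
    for b in sorted(range(n), key=lambda i: -len(buckets[i])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [_h(key, seed, n) for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            taken[s] = True
    return seeds


def lookup_slot(seeds, key: str) -> int:
    """Two hash evaluations map a key to its slot; no probing or chaining."""
    n = len(seeds)
    return _h(key, seeds[_h(key, 0, n)], n)


# Hypothetical read-only image contents: path -> inode number.
files = {"/bin/sh": 101, "/etc/passwd": 102, "/usr/lib/libc.so": 103,
         "/etc/hosts": 104, "/app/main": 105}
seeds = build_mphf(list(files))

# Dense metadata array, built once at "image build time".
table = [None] * len(files)
for name, inode in files.items():
    table[lookup_slot(seeds, name)] = (name, inode)


def lookup(name: str):
    """Return the inode, or None if the name is not part of the image."""
    stored_name, inode = table[lookup_slot(seeds, name)]
    return inode if stored_name == name else None
```

Because an MPHF maps names outside the original set to arbitrary slots, the sketch stores the name next to the metadata and compares it on lookup.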
RosenBridge: A Framework for Enabling Express I/O Paths Across the Virtualization Boundary
Shi Qiu, Xiamen University; Li Wang, KylinSoft; Jianqin Yan and Ruofan Xiong, Xiamen University; Leping Yang, Shanghai Jiao Tong University; Xin Yao, Renhai Chen, and Gong Zhang, Huawei; Dongsheng Li, The National University of Defense Technology; Jiwu Shu, Tsinghua University; Yiming Zhang, Shanghai Jiao Tong University and Xiamen University
With the emergence of high-performance storage devices, the overhead of the storage stack has become a major I/O performance bottleneck. To alleviate this problem, a number of express I/O paths have been proposed based on the concept of near-data processing (NDP), such as query I/O resubmission (XRP) and GPU-direct storage (GDS). Unfortunately, in virtualized environments, none of these bare-metal express I/O paths can cross the virtualization boundary between the guest virtual machines (VMs) and the host, leaving applications inside guest VMs unable to benefit from them.
This paper presents RosenBridge, a framework for enabling express I/O paths across the virtualization boundary. At the core of RosenBridge is a new paravirtualized I/O device called virtio-ndp, which allows the guest to offload NDP optimizations to the hypervisor in the host userspace based on uBPF (userspace Berkeley Packet Filter). We connect the virtio-ndp backend with the host kernel’s asynchronous I/O stack for efficient I/O scheduling, and provide a set of helper functions for convenient guest-host address translation. We strictly limit the memory access scope of uBPF programs to guarantee security and collaboratively throttle the multi-path I/O to ensure fairness among guest VMs. We demonstrate the effectiveness of RosenBridge through two use cases, respectively supporting XRP and GDS in virtualized environments. Evaluation shows that RosenBridge significantly outperforms the state-of-the-art I/O paravirtualization frameworks (virtio and vhost) in I/O performance, while effectively reducing CPU usage. Compared to the bare-metal express I/O paths (XRP and GDS), RosenBridge only incurs a slight performance degradation.
MlsDisk: Trusted Block Storage for TEEs Based on Layered Secure Logging
Erci Xu, Shanghai Jiao Tong University; Xinyi Yu, Lujia Yin, and Xinyuan Luo, NICE Lab, Xiamen University; Shaowei Song, Qingsong Chen, and Shoumeng Yan, Ant Group; Jiwu Shu, Tsinghua University; Hongliang Tian, Ant Group; Yiming Zhang, Shanghai Jiao Tong University and NICE Lab, Xiamen University
Trusted Execution Environments (TEEs) enable users to run sensitive applications in private memory regions. SGX-PFS is the state-of-the-art secure storage solution for TEEs that ensures data confidentiality, integrity, freshness, and consistency (CIFC). Unfortunately, SGX-PFS uses Merkle Hash Trees to protect data persisted in place, suffers from poor I/O performance, and is thus of limited use in practice.
This paper presents MlsDisk, a secure virtual disk that adopts out-of-place logging to provide efficient trusted block storage for TEEs. The challenge is that the complexity of indexing and garbage collection (GC) in log-structured storage makes it difficult to ensure security. We therefore adopt a layered design to break down the indexing and GC into four layers of abstractions, which facilitates reasoning about CIFC properties. Evaluation shows that MlsDisk, with CIFC guarantees, outperforms SGX-PFS by 7.3×–21.1× on microbenchmarks and 1.4×–3.6× on trace-driven workloads.
9:00 am–10:20 am
Cloud Technologies II
Session Chair: Uday Kiran Jonnala, Dell Technologies
SkySync: Accelerating File Synchronization with Collaborative Delta Generation
Zhihao Zhang, Xiamen University and Alibaba Cloud; Huiba Li, Alibaba Cloud; Lu Tang, Xiamen University; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Tsinghua University; Yiming Zhang, Shanghai Jiao Tong University and Xiamen University
File synchronization (sync) is of increasing significance for not only intra-cloud but also inter-cloud applications and services, as cloud computing is evolving into the Sky computing paradigm with the illusion of utility computing on an infrastructure of multiple geographically-distributed clouds. However, existing file sync schemes, mainly including fixed-sized chunking (FSC) based sync and content-defined-chunking (CDC) based sync, heavily rely on complex algorithms for generating the delta data. These algorithms perform costly processing operations including (i) file chunking, (ii) chunk checksum computation, and (iii) chunk searching, which incur high computational overhead, thus lowering sync performance. This paper presents SkySync, a novel file sync scheme based on collaborative delta generation. Our insight is that the conventional storage layer already maintains rich metadata (like checksums and cryptographic digests) for management purposes, e.g., to verify integrity and detect errors. Therefore, we leverage the existing metadata of the storage layer to obtain the chunk checksums with simple adaptation and combination, thus effectively reducing the computational overhead. We further streamline the chunk searching process by reusing checksum data produced during prior computations. We have implemented the FSC-based and CDC-based SkySync schemes by enhancing the communication protocol of the state-of-the-art rsync and dsync, respectively. Evaluation results show that compared to the existing file sync schemes (rsync and dsync), SkySync significantly reduces the computational overhead by up to 89.3% and improves the client and server sync performance by 1.1×–2×, while maintaining a consistent level of network traffic.
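To illustrate how a chunk checksum might be obtained by combining checksums that already exist for smaller pieces, here is a minimal sketch using an rsync-style weak checksum. The abstract does not say which storage-layer checksums SkySync reuses or how it combines them, so the checksum choice and the names `weak_checksum`, `combine`, and `combine_all` are assumptions made purely for illustration.

```python
from functools import reduce

M = 1 << 16  # both checksum halves are kept modulo 2^16


def weak_checksum(data: bytes):
    """rsync-style weak checksum, returned as (a, b, length).

    a = sum of bytes; b = sum of bytes weighted by distance from the block end.
    """
    n = len(data)
    a = sum(data) % M
    b = sum((n - i) * x for i, x in enumerate(data)) % M
    return a, b, n


def combine(left, right):
    """Checksum of left||right computed from the two checksums alone.

    Appending n2 bytes raises the weight of every byte in the left block by n2,
    which adds n2 * a1 to the weighted sum.
    """
    a1, b1, n1 = left
    a2, b2, n2 = right
    return (a1 + a2) % M, (b1 + n2 * a1 + b2) % M, n1 + n2


def combine_all(parts):
    """Fold per-piece checksums, given in file order, into one chunk checksum."""
    return reduce(combine, parts)
```

If the storage layer already kept `weak_checksum` values for, say, 4 KB pieces, the checksum of a larger chunk made of whole pieces would follow from `combine_all` without rereading the data.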
ParaSync: Exploiting Fine-Grained Parallelism for Efficient File Synchronization
Zhihao Zhang, NICE Lab, Xiamen University; and Alibaba Cloud; Lu Tang, NICE Lab, Xiamen University; Huiba Li, Alibaba Cloud; Yue Yu, Sun Yat-sen University; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Tsinghua University; Yiming Zhang, Shanghai Jiao Tong University and NICE Lab, Xiamen University
File synchronization (sync) based on Content-Defined Chunking (CDC) is gaining increasing importance for data migration over networks owing to its effectiveness in detecting and eliminating duplicate data within synchronized files. CDC-based sync schemes typically comprise three phases: file chunking, chunk matching, and delta reconstruction. Unfortunately, existing sync schemes fail to exploit parallelism inherent in these phases due to two dependencies: a sequential bottleneck in chunking, where checksums are computed only after boundaries are finalized, and rigid client-server stalls that serialize matching and reconstruction. This paper presents ParaSync, a novel CDC-based file sync scheme that breaks these dependencies to exploit fine-grained parallelism. First, ParaSync’s multi-threaded chunking algorithm reduces checksum computation to a lightweight combination step, decoupling it from boundary identification while preserving invariability. Second, ParaSync designs a streaming chunk matching method that removes the all-or-nothing exchange dependency on both the client and the server sides. Finally, ParaSync introduces an efficient absolute-offset-based pipelined delta reconstruction process that maximizes the overlap between network and disk I/O operations. We conduct extensive experiments over both WANs and LANs using diverse real-world datasets. The results show that compared to the state-of-the-art file sync schemes, ParaSync achieves up to 7.6× speedup for file chunking and significantly improves the overall sync performance by up to 3.7×, while maintaining a consistent level of network traffic.
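For readers unfamiliar with the baseline, the sequential dependency described above can be sketched with a small gear-hash chunker in which a chunk's checksum is computed only after its boundary is known. This is the conventional single-threaded pattern, not ParaSync's algorithm; the gear table, mask, size limits, and function names are illustrative assumptions.

```python
import random
import zlib

# Fixed pseudo-random table so chunk boundaries are reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = (1 << 6) - 1  # a boundary fires on roughly 1 in 64 positions


def cdc_chunks(data: bytes, min_size: int = 16, max_size: int = 256):
    """Split data where the rolling gear hash matches MASK, within size limits."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than min_size
    return chunks


def chunk_then_checksum(data: bytes):
    """Baseline order: find each boundary first, then checksum that chunk."""
    out, offset = [], 0
    for chunk in cdc_chunks(data):
        out.append((offset, len(chunk), zlib.adler32(chunk)))
        offset += len(chunk)
    return out
```

Because boundaries depend on content rather than on fixed offsets, an edit near the start of a file tends to leave later chunks unchanged, which is what makes CDC effective at finding duplicate data.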
Discard-Based Garbage Collection for Distributed Log-Structured Storage Systems in ByteDance
Runhua Bian, ByteDance and Tsinghua University; Liqiang Zhang, Jinxin Liu, Jiacheng Zhang, Jianong Zhong, and Jiahao Gu, ByteDance; Hao Guo, Tsinghua University; Zhihong Guo, Yunhao Li, Fenghao Zhang, Jiangkun Zhao, Yangming Chen, and Guojun Li, ByteDance; Ruwen Fan, Tsinghua University; Haijia Shen, Chengyu Dong, Yao Wang, and Rui Shi, ByteDance; Jiwu Shu and Youyou Lu, Tsinghua University
ByteStore is a distributed append-only storage system that serves as the foundational storage layer of the ByteDance infrastructure. Initially, storage services on ByteStore use compaction for garbage collection (GC). Additional writes induced by compaction and the SSD space occupied by stale data result in millions of dollars in extra Total Cost of Ownership (TCO) per month. Aggressive compaction releases the SSD space, but at the cost of more write operations and faster SSD wear, thus failing to reduce TCO.
Based on our analysis of the traces from the block storage service (ByteDrive) deployed on ByteStore, we propose DisCoGC, a Discard-and-Compaction combined Garbage Collection scheme, which employs a discard mechanism to reclaim the space occupied by stale data without moving valid data. Production cluster monitoring metrics and offline experiments demonstrate that DisCoGC achieves an approximately 20% reduction in TCO, without sacrificing performance.
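The difference between the two reclamation styles can be made concrete with a small accounting sketch. It models one log segment as a list of valid/stale flags and counts how many blocks each approach rewrites and reclaims. The block-level model, the `min_run` discard granularity, and the function names are hypothetical and are not drawn from DisCoGC's actual design.

```python
from itertools import groupby


def stale_runs(valid):
    """Return (start, length) for each contiguous run of stale blocks."""
    runs, pos = [], 0
    for is_valid, group in groupby(valid):
        n = len(list(group))
        if not is_valid:
            runs.append((pos, n))
        pos += n
    return runs


def compaction_gc(valid):
    """Compaction copies every valid block elsewhere, then frees the segment."""
    live = sum(valid)
    return {"blocks_rewritten": live, "blocks_reclaimed": len(valid) - live}


def discard_gc(valid, min_run=1):
    """Discard frees stale runs in place and moves no valid data.

    min_run models a minimum discard granularity (an assumption here):
    stale runs shorter than it are left for a later compaction.
    """
    reclaimed = sum(n for _, n in stale_runs(valid) if n >= min_run)
    return {"blocks_rewritten": 0, "blocks_reclaimed": reclaimed}
```

For a segment such as `[True, False, False, False, True, False, True, True]`, compaction rewrites four blocks to reclaim four, while discard with `min_run=2` rewrites none and reclaims three.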
PolarStore: High-Performance Data Compression for Large-Scale Cloud-Native Databases
Qingda Hu, Xinjun (Jimmy) Yang, Feifei Li, Junru Li, Ya Lin, Yuqi Zhou, Yicong Zhu, Junwei Zhang, Rongbiao Xie, Ling Zhou, Bin Wu, and Wenchao Zhou, Alibaba Cloud Computing
In recent years, resource elasticity and cost optimization have become essential for RDBMSs. While cloud-native RDBMSs provide elastic computing resources via disaggregated computing and storage, storage costs remain a critical user concern. Consequently, data compression emerges as an effective strategy to reduce storage costs. However, existing compression approaches in RDBMSs present a stark trade-off: software-based approaches incur significant performance overheads, while hardware-based alternatives lack the flexibility required for diverse database workloads.
In this paper, we present PolarStore, a compressed shared storage system for cloud-native RDBMSs. PolarStore employs a dual-layer compression mechanism that combines in-storage compression in PolarCSD hardware with lightweight compression in software. This design leverages the strengths of both approaches. PolarStore also incorporates database-oriented optimizations to maintain high performance on critical I/O paths. Drawing from large-scale deployment experiences, we also introduce hardware improvements for PolarCSD to ensure host-level stability and propose a compression-aware scheduling scheme to improve cluster-level space efficiency. PolarStore is currently deployed on thousands of storage servers within PolarDB, managing over 100 PB of data. It achieves a compression ratio of 3.55 and reduces storage costs by approximately 60%. Remarkably, these savings are achieved while maintaining performance comparable to uncompressed clusters.
10:20 am–10:50 am
Coffee and Tea Break
Mezzanine East/West
10:50 am–12:30 pm
Tiering, Mobile, and SSDs
Session Chair: Vasily Tarasov, IBM Research
Unleashing Zoned UFS: Cross-Layer Optimizations for Next-Generation Mobile Storage
Jungae Kim, SK hynix Inc.; Jaegeuk Kim, Google; Kyu-Jin Cho, Seoul National University; Sungjin Park, Jinwoo Kim, Jieun Kim, and Iksung Oh, SK hynix Inc.; Chul Lee, Bart Van Assche, Daeho Jeong, and Konstantin Vyshetsky, Google; Jin-Soo Kim, Seoul National University
Zoned UFS (ZUFS) has emerged as a next-generation mobile storage technology that reduces the logical-to-physical (L2P) mapping overhead of conventional UFS (CUFS) by enforcing sequential writes within fixed-size zones. While the concept appears straightforward, deploying ZUFS in commercial smartphones introduces non-trivial challenges across the mobile storage stack. In this paper, we identify three key obstacles: managing limited SRAM across multiple open zones, ensuring end-to-end write ordering guarantees, and mitigating severe garbage collection overhead caused by large zones.
We address these challenges through cross-layer optimizations spanning device firmware, the SCSI/UFS driver, the block layer, F2FS, and the Android framework: a dynamic device-side buffer management scheme that opportunistically shares SRAM, a write ordering mechanism that eliminates reordering hazards, and a proactive garbage collection framework that reclaims free space in the background. Evaluation on a commercial smartphone shows that ZUFS sustains over 2× higher write throughput under fragmentation and reduces mobile game loading time by 14% compared to CUFS, while maintaining stable read performance. These results demonstrate that ZUFS’s full potential can only be realized through coordinated redesign across the entire mobile storage stack.
DOGI: Data Placement with Oracle-Guided Insights for Log-Structured Systems
Jeeyun Kim, Pohang University of Science and Technology (POSTECH); Seonggyun Oh and Jungwoo Kim, Daegu Gyeongbuk Institute of Science and Technology; Jisung Park, Pohang University of Science and Technology (POSTECH); Jaeho Kim, Gyeongsang National University; Sungjin Lee, Pohang University of Science and Technology (POSTECH); Sam H. Noh, Virginia Tech
Log-structured systems have become the backbone of modern data-intensive applications thanks to their high write throughput. Their efficiency, however, is degraded by the write amplification factor (WAF) induced by garbage collection. Despite extensive studies, there still exists a wide gap between practice and optimality. In this paper, we bridge this gap with two key contributions. We first design NoDaP, a near-optimal oracle baseline that sets the upper bound for WAF reduction. Then, guided by insights from NoDaP, we propose DOGI, an oracle-inspired data placement technique that combines simple yet effective heuristics with lightweight machine learning. DOGI predicts invalidation times for data blocks with high accuracy, dynamically tunes group configurations, and finds the sweet spot between fine-grained data placement and misprediction penalty. Our experiments, using simulations and a prototype on a zoned device, show that DOGI reduces WAF by up to 23.2% while improving write throughput by up to 13.3% over the best-performing baseline.
Getting the MOST out of your Storage Hierarchy with Mirror-Optimized Storage Tiering
Kaiwei Tu, University of Wisconsin–Madison; Kan Wu, Google; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin–Madison
We present Mirror-Optimized Storage Tiering (MOST), a novel tiering-based approach optimized for modern storage hierarchies. The key idea of MOST is to combine the load-balancing advantages of mirroring with the space-efficiency advantages of tiering. Specifically, MOST dynamically mirrors a small amount of hot data across storage tiers to efficiently balance load, avoiding costly migrations. As a result, MOST is as space-efficient as classic tiering while achieving better bandwidth utilization under I/O-intensive workloads. We implement MOST in Cerberus, a user-level storage management layer based on CacheLib. We show the efficacy of Cerberus through a comprehensive empirical study: across a range of static and dynamic workloads, Cerberus achieves better throughput than competing approaches on modern storage hierarchies, especially under I/O-intensive and dynamic workloads.
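A rough sketch of the routing idea: most blocks live on exactly one tier, a small hot set is mirrored on both, and reads of mirrored blocks go to whichever tier currently has less load. The class name, the two tier labels, and the use of a simple request counter as the load signal are assumptions for illustration only and are not taken from MOST or Cerberus.

```python
from collections import Counter


class MirrorTieringSketch:
    """Toy read router for a two-tier store with a small mirrored hot set."""

    TIERS = ("fast", "slow")

    def __init__(self, home_tier, mirrored):
        self.home_tier = dict(home_tier)  # block -> tier holding its only copy
        self.mirrored = set(mirrored)     # hot blocks with a copy on both tiers
        self.load = Counter()             # reads routed to each tier so far

    def route_read(self, block):
        if block in self.mirrored:
            # Either copy can serve the read: pick the less-loaded tier.
            # Ties go to the first tier listed in TIERS.
            tier = min(self.TIERS, key=lambda t: self.load[t])
        else:
            tier = self.home_tier[block]
        self.load[tier] += 1
        return tier
```

With only block `"a"` mirrored, four consecutive reads of `"a"` alternate between the two tiers, while reads of unmirrored blocks always go to their home tier. A real system would likely weight the load signal by each device's bandwidth rather than count requests equally.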
Rearchitecting Buffered I/O in the Era of High-Bandwidth SSDs
Yekang Zhan, Tianze Wang, Zheng Peng, Haichuan Hu, Jiahao Wu, Xiangrui Yang, and Qiang Cao, Huazhong University of Science and Technology; Hong Jiang, University of Texas at Arlington; Jie Yao, Huazhong University of Science and Technology
Buffered I/O via page cache has been prevalently used by applications for decades due to its user-friendliness and high performance. However, the existing buffered I/O architecture fails to effectively utilize high-bandwidth Solid-State Drives (SSDs) because of 1) costly page caching overused for buffering all incoming writes in the critical path, 2) the limited concurrency of page management, and 3) the high read-before-write penalty for partial-page writes.
This paper rearchitects buffered I/O and proposes a write-scrap buffering approach (WSBuffer) to remove the aforementioned shackles of buffered I/O on writes to proactively exploit fast SSDs while retaining all the advantages of buffered I/O on reads. WSBuffer first presents a novel memory-page buffering structure, scrap buffer, to efficiently buffer SSD-I/O unfriendly writes and expensive partial-page writes. WSBuffer further proposes a buffer-minimized data access mechanism to partially buffer small and unaligned parts of user writes via the scrap buffer while directly sending large and aligned parts to underlying SSDs. Finally, WSBuffer devises an opportunistic two-stage dirty-data flushing mechanism and a concurrent page management mechanism to achieve fluent and fast dirty-data flushing. The experimental results show that WSBuffer outperforms the Linux file systems EXT4, F2FS, BTRFS, and XFS, as well as the state-of-the-art buffered I/O optimization ScaleCache, by up to 3.91× and 82.80× in throughput and latency, respectively.
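The split between buffered and directly written parts of a user write can be sketched as simple range arithmetic. The example assumes a 4 KB page and treats any run of whole pages as "large and aligned"; the real size threshold, the function name `split_write`, and the returned dictionary layout are hypothetical.

```python
PAGE = 4096  # assumed page size in bytes


def split_write(offset: int, length: int, page: int = PAGE):
    """Split a write into unaligned head/tail scraps and a page-aligned middle.

    Returns {"scrap": [(offset, length), ...], "direct": [(offset, length)]}.
    """
    end = offset + length
    first_aligned = -(-offset // page) * page  # round offset up to a page boundary
    last_aligned = (end // page) * page        # round end down to a page boundary

    if first_aligned >= last_aligned:
        # The write covers no complete page: buffer all of it.
        return {"scrap": [(offset, length)] if length else [], "direct": []}

    scrap = []
    if offset < first_aligned:
        scrap.append((offset, first_aligned - offset))  # unaligned head
    if last_aligned < end:
        scrap.append((last_aligned, end - last_aligned))  # unaligned tail
    direct = [(first_aligned, last_aligned - first_aligned)]
    return {"scrap": scrap, "direct": direct}
```

A 10,000-byte write at offset 100, for example, yields a 3,996-byte head scrap, one full page sent directly, and a 1,908-byte tail scrap.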
FailureMiner: A Joint Key Decision Mining Scheme for Practical SSD Failure Prediction and Analysis
Shuyang Wang, Yuqi Zhang, and Haonan Luo, Samsung R&D Institute China Xi'an, Samsung Electronics; Kangkang Liu, Tencent; Gil Kim, JongSung Na, Claude Kim, Geunrok Oh, and Kyle Choi, Samsung Electronics; Ni Xue and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics
As SSDs become increasingly popular in enterprise data centers, SSD failures have become a key concern for storage system reliability. In this paper, we propose FailureMiner, a joint key decision mining scheme based on SSD monitoring attributes to accurately and clearly identify SSD failure patterns in production environments. First, to address the imbalance between healthy and failed samples caused by the limited number of failed SSDs, FailureMiner introduces selective downsampling to carefully remove non-critical healthy samples, thereby focusing more on the subtle differences between easily confused failure patterns and health patterns. Second, FailureMiner streamlines the decision-making process of the machine learning model in failure prediction by capturing key decision steps based on their joint contribution. By filtering out redundant and noisy information, FailureMiner can capture joint key decisions, i.e., the simplified attribute combinations and value ranges relevant to failures, thus enabling accurate and interpretable identification of failure patterns.
FailureMiner is evaluated on real-world datasets, and the results show that our scheme improves precision and recall by an average of 38.6% and 80.5% respectively, compared with the existing failure prediction methods. The extracted joint key decisions have been deployed in Tencent's data centers to predict failures across more than 350,000 SSDs over a year, enhancing SSD reliability. The joint key decisions also reveal the failure patterns and factors affecting SSD health, which further helps operators handle failures and manufacturers improve product reliability.
12:30 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:20 pm
Integrity
Session Chair: Gala Yadgar, Technion—Israel Institute of Technology
DRBoost: Boosting Degraded Read Performance in MSR-Coded Storage Clusters
Xiao Niu, Guangyan Zhang, Zhiyue Li, and Sijie Cai, Tsinghua University
Minimum Storage Regenerating (MSR) codes have strong potential for building efficient and reliable storage systems due to their excellent fault tolerance and low repair bandwidth. However, to meet MSR code constraints and optimize storage performance, systems often adopt large chunk sizes. This leads to significant I/O amplification during degraded reads, as entire chunks must be reconstructed to access a single object.
In this paper, we propose DRBoost, an approach that boosts degraded read performance in MSR-coded storage clusters by reducing repair bandwidth and eliminating access fragmentation for healthy data. DRBoost introduces three key techniques: (1) a partial-chunk reconstruction algorithm that reduces repair bandwidth by leveraging two forms of data reuse; (2) a reconstruction-friendly coding layout that improves reuse efficiency and accommodates objects of diverse sizes; and (3) a fragmentation-free storage layout that avoids unnecessary request splitting. Extensive experiments under various conditions and workloads show that DRBoost reduces degraded read latency by one to two orders of magnitude, significantly improving system responsiveness.
LESS is More for I/O-Efficient Repairs in Erasure-Coded Storage
Keyun Cheng, The Chinese University of Hong Kong; Guodong Li, Shandong University; Xiaolu Li, Huazhong University of Science and Technology; Sihuang Hu, Shandong University; Patrick P. C. Lee, The Chinese University of Hong Kong
I/O efficiency is critical for erasure-coded repair performance in modern distributed storage. We propose LESS, a family of repair-friendly erasure code constructions that reduces both the amount of data accessed and the number of I/O seeks in single-block repairs, while ensuring balanced reductions across blocks. LESS layers multiple extended sub-stripes formed by widely deployed Reed-Solomon coding, and is configurable to balance the trade-off between the amount of data accessed and I/O seeks. Evaluation shows that LESS on HDFS reduces both single-block repair and full-node recovery times compared to state-of-the-art I/O-optimal erasure codes.
Advancing Data Integrity in Linux
Anuj Gupta, Samsung Semiconductor; Christoph Hellwig, unaffiliated; Kanchan Joshi, Vikash Kumar, and Javier González, Samsung Semiconductor; Roshan R Nair, EPFL; Jinyoung Choi, Samsung Semiconductor
Standalone hardware-only or software-only methods fail to provide comprehensive coverage in detecting data corruption. End-to-end data protection (E2EDP) addresses this by carrying per-block protection information (PI) throughout the I/O stack, from the application through the system software to the storage device. Although devices have supported PI for more than a decade, Linux's support for and use of these capabilities remain incomplete.
This paper closes two fundamental gaps in the mainline Linux kernel and introduces a PI-aware filesystem design with implementation and evaluation. First, we add a new io_uring interface that allows applications to exchange integrity metadata alongside the data. Second, we add flexible PI placement to Linux’s block-integrity layer, enabling support for a device configuration that was previously rejected. Finally, we introduce Filesystem Protection Information (FS-PI), a PI-aware design direction in which filesystems generate and verify integrity metadata directly while leveraging PI-capable hardware. We implement FS-PI in two major filesystems: in XFS, FS-PI introduces native data checksumming, extending data-integrity guarantees for the first time; in BTRFS, FS-PI replaces the checksum tree with a lightweight path that reduces metadata traffic, write amplification, and device wear.
We evaluate the costs and gains of FS-PI in BTRFS and XFS. The evaluation shows that FS-PI improves BTRFS performance by 26%, reduces host CPU utilization by 58%, and reduces device writes by 52%, extending SSD lifetime by 23%.
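As background on what per-block protection information looks like, here is a minimal sketch of the classic 8-byte T10 DIF tuple: a 16-bit guard tag (CRC of the block data), a 16-bit application tag, and a 32-bit reference tag, all stored big-endian. It is an illustration only and does not represent the paper's io_uring interface or FS-PI design. The helper names are hypothetical, and checking the reference tag against an expected value is an assumption modeled on Type 1 protection, where the tag follows the logical block address.

```python
import struct

T10_DIF_POLY = 0x8BB7  # generator polynomial of the CRC-16 used for the guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: initial value 0, no bit reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_pi(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Pack the 8-byte tuple: guard tag, application tag, reference tag."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag)


def verify_pi(block: bytes, pi: bytes, expected_ref_tag: int) -> bool:
    """Check that the data matches the guard tag and the block sits where expected."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(block) and ref_tag == expected_ref_tag
```

Any component that carries both the block and its tuple can rerun `verify_pi`, which is what allows corruption or misdirected writes to be caught at several points along the I/O path rather than only at the device.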
CETOFS: A High-Performance File System with Host-Server Collaboration for Remote Storage
Wenqing Jia, Dejun Jiang, and Jin Xiong, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; and University of Chinese Academy of Sciences
With the development of high-performance RDMA networks and modern storage devices, disaggregated NVMe SSDs have become increasingly popular due to their high resource utilization and superior performance. Although existing kernel file systems (e.g., Ext4) can directly access a disaggregated SSD via the NVMe-over-RDMA protocol, the data path suffers from heavy kernel stack overhead and thus degraded performance. The extra networking latency introduced in accessing remote storage also prevents file systems from achieving scalable concurrent accesses and efficient failure-atomic IO. In this paper, we present CETOFS, a high-performance file system with host-server collaboration for disaggregated NVMe SSDs. CETOFS adopts a userspace-kernel collaborative architecture that places the data plane entirely in userspace while separating permission checking from the in-kernel control plane. CETOFS then exploits the processing capability of the remote storage server to offload three tasks: permission checking, concurrency control, and failure-atomic IO guarantees. These offloading mechanisms greatly reduce networking overhead. We implement CETOFS and evaluate it against both kernel and userspace file systems. The evaluation shows that CETOFS achieves a high-performance data path that reduces latency by up to 52% for single-threaded file access and improves throughput by up to 19× for concurrent accesses.
3:20 pm–3:50 pm
Coffee and Tea Break
Mezzanine East/West
3:50 pm–5:10 pm
OS
Session Chair: Yu Liang, ETH Zurich
Cache-Centric Multi-Resource Allocation for Storage Services
Chenhao Ye, Shawn (Wanxiang) Zhong, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin–Madison
We present HARE, a cache-centric multi-resource allocation algorithm for storage services. HARE introduces a holistic allocation model that captures the demand correlation between cache size and other resources (e.g., I/O, network), and uses a novel two-phase harvest/redistribute method to optimize resource allocation across tenants, maximizing the throughput of each while maintaining fairness. To demonstrate that HARE is widely applicable, we built two systems. The first, HopperKV, is a cloud-native key-value store that modifies Redis to cache data from DynamoDB. The second, BunnyFS, is a microkernel-style local filesystem for NVMe SSDs. Our evaluation shows that HARE is effective for multi-resource allocation in storage. Both systems are scalable and adaptive: HopperKV achieves up to a 1.9x performance improvement, and BunnyFS achieves up to 1.4x.
Lockify: Understanding Linux Distributed Lock Management Overheads in Shared Storage
Taeyoung Park, Yunjae Jo, Daegyu Han, Beomseok Nam, and Jaehyun Hwang, Sungkyunkwan University
This paper presents Lockify, a novel distributed lock manager (DLM) for shared-disk file systems. Our key observation in shared-storage scenarios is that, for file or directory creation, lock acquisition overhead in the Linux kernel DLM increases with the number of clients, even in low-contention scenarios. Lockify minimizes this lock acquisition latency by avoiding unnecessary communication with remote directory nodes through self-owner notifications and asynchronous ownership management. We implement Lockify in the Linux kernel and evaluate its performance on real-world workloads using two representative shared-disk file systems, GFS2 and OCFS2. Our experimental results demonstrate that Lockify improves overall throughput by ~6.4× compared to the kernel DLM and O2CB, consistently across different numbers of clients.
uCache: A Customizable Unikernel-based IO Cache
Ilya Meignan--Masson, Masanori Misono, Viktor Leis, and Pramod Bhatotia, Technical University of Munich
Data-intensive cloud applications require high-performance IO caching to fully leverage modern storage systems like NVMe SSDs and cloud storage. Today’s developers face a dilemma: choose a simple, but often slow, OS-level IO cache, or a fast, but complex, userspace cache. This trade-off forces developers to sacrifice either performance or simplicity. This paper introduces uCache, a novel IO cache that resolves this fundamental tension. Leveraging a unikernel-based libOS architecture, uCache seamlessly integrates application-specific knowledge directly into the OS-level cache. It combines an mmap-like memory surface with an explicit, conventional interface, offering fine-grained control over cache behavior. This design brings application-specific semantics into the OS-level cache itself, a capability previously confined to complex userspace solutions, thereby ensuring scalability, performance, and adaptability. The core of uCache’s flexibility is the uVFS abstraction, which enables direct adaptation to diverse IO backends while maintaining filesystem compatibility without the performance overheads of the traditional OS IO stack. Our evaluation demonstrates that uCache effectively merges the simplicity of OS-level caches with the performance and flexibility of userspace solutions. For out-of-memory workloads, uCache achieves performance on par with kernel-bypass IO libraries, proving it can eliminate the IO caching bottleneck for data-intensive applications.
UnICom: A Universally High-Performant I/O Completion Mechanism for Modern Computer Systems
Riwei Pan, City University of Hong Kong; Yu Liang, ETH Zurich and Inria-Paris; Sam H. Noh, Virginia Tech; Lei Li and Nan Guan, City University of Hong Kong; Tei-Wei Kuo, Delta Electronics and National Taiwan University; Chun Jason Xue, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Modern computer systems are increasingly equipped with dozens to hundreds of cores, while high-performance Solid-State Drives (SSDs), enabled by NVMe and emerging technologies such as CXL-SSDs, provide massive I/O bandwidth and microsecond-scale latency. Yet, software overhead in the I/O stack remains a critical bottleneck, often contributing up to 50% of total I/O latency. Existing I/O completion mechanisms fall short: polling achieves low latency but wastes CPU cycles, whereas interrupts conserve CPU resources but incur significant wake-up overhead. This paper presents UnICom (Universal I/O Completion), a new I/O completion mechanism that unifies the benefits of polling and interrupts while avoiding their drawbacks. The key insight is that a kernel trap is negligible compared to disk I/O latency, yet enables access to kernel infrastructure for efficiency and security. Building on this, UnICom introduces three core techniques: TagSched, a lightweight tag-guided scheduling mechanism that minimizes sleep and wake-up overhead; TagPoll, a centralized kernel-level I/O completion thread that consolidates polling across threads and processes; and SKIP, a kernel-assisted direct-access mechanism that eliminates complex user-space permission management. Together, these techniques enable efficient multi-process support and direct SSD access while bypassing much of the kernel I/O stack. We implement UnICom in the Linux kernel and evaluate it against ext4, BypassD, and io_uring. Across all experiments, UnICom consistently delivers high I/O performance, matching or exceeding the best of polling and interrupts under both low and high CPU utilization.
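The polling-versus-interrupt trade-off described above can be illustrated with a generic spin-then-block wait, a common compromise between the two. This sketch is not UnICom's TagSched, TagPoll, or SKIP. It models an I/O completion as a `threading.Event`, and the function name, parameters, and return labels are all hypothetical.

```python
import threading


def wait_for_completion(done: threading.Event,
                        spin_iterations: int,
                        block_timeout_s: float) -> str:
    """Poll briefly for low latency, then sleep so the CPU is not wasted.

    Returns "polled" if the completion was seen while spinning, "blocked" if
    it was seen on the sleeping path, and "timeout" if it never arrived.
    """
    # Polling phase: cheap checks catch fast completions without a wake-up.
    for _ in range(spin_iterations):
        if done.is_set():
            return "polled"
    # Interrupt-like phase: give up the CPU until signaled or timed out.
    if done.wait(block_timeout_s):
        return "blocked"
    return "timeout"
```

A larger `spin_iterations` budget favors latency at the cost of CPU cycles, while a budget of zero degenerates to a purely blocking wait, which mirrors the two extremes the abstract contrasts.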
