Papers are available for download below to registered attendees now and to everyone beginning on Monday, July 7. Paper abstracts and proceedings front matter are available to everyone now. Copyright to the individual works is retained by the author(s).
All sessions will be held in Grand and Liberty Ballroom unless otherwise noted.
Proceedings Front Matter
Proceedings Cover |
Title Page and List of Organizers |
Message from the Program Co-Chairs |
Table of Contents

9:00 am–10:00 am
Presentation of the USENIX Lifetime Achievement Award ("The Flame")
The USENIX Lifetime Achievement Award ("The Flame") recognizes and celebrates singular contributions to the USENIX community of both intellectual achievement and service that are not recognized in any other forum.
OSDI '25 and USENIX ATC '25 Joint Keynote Address
Accelerating Software Development: The LLM (R)evolution
Emery Berger, University of Massachusetts Amherst and Amazon Web Services
Large language models are achieving state-of-the-art results across a wide variety of domains, eclipsing past work in well-studied areas like auto-completion. I argue that they also presage a "Cambrian explosion"—a wave of radically new AI-powered software development tools that will make all our lives easier. I propose a paradigm for how we can best rethink existing tools to leverage a combination of LLMs and PL technologies like static and dynamic analysis. This approach promises to evolve our software tools far beyond their current capacities, including profilers that suggest optimizations, debuggers that identify and propose fixes using real-world knowledge, coverage analyzers that synthesize new tests, compilers that propose fixes for compile-time errors, and data analysis frameworks that analyze your data.

Emery Berger is a Professor of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system, and an Amazon Scholar at Amazon Web Services. At UMass, Professor Berger leads the PLASMA lab, whose research has led to numerous impactful software systems (see https://github.com/plasma-umass). Professor Berger is also the developer and sole maintainer of the influential CSrankings.org site, which has served over 3 million users. He served six years as an elected member of the SIGPLAN Executive Committee and a decade as Associate Editor of TOPLAS; he served as Program Chair for PLDI 2016 and co-Program Chair of ASPLOS 2021, and received the ACM SIGPLAN Distinguished Service Award in 2024. His honors include an NSF CAREER Award, Most Influential Paper Awards at OOPSLA, PLDI, and ASPLOS, five CACM Research Highlights, and Best Paper Awards at FAST, OOPSLA, SOSP, and OSDI; he is an ACM Fellow.
10:00 am–10:30 am
Coffee and Tea Break
Grand and Liberty Foyer
10:45 am–12:45 pm
Distributed Systems and Data Centers I
Session Chair: Yu Hua, Huazhong University of Science and Technology
Basilisk: Using Provenance Invariants to Automate Proofs of Undecidable Protocols
Tony Nuda Zhang and Keshav Singh, University of Michigan; Tej Chajed, University of Wisconsin-Madison; Manos Kapritsos, University of Michigan; Bryan Parno, Carnegie Mellon University
Awarded Best Paper!
Distributed protocols are challenging to design correctly. One promising approach to improve their reliability uses formal verification to prove that a protocol satisfies a desired safety property. These proofs require finding an inductive invariant that holds initially, implies safety, and is inductive. Devising an inductive invariant is a difficult task that prior work has either automated by requiring the protocol to be expressed in a decidable but restrictive fragment of logic, or required the developer to find by a painful search process.
In this work we aim to automatically find inductive invariants without restricting the logic. We do so using two key insights. Our first insight is that many of the complex inter-host properties that prior work required the developer to provide can instead be derived using Provenance Invariants, a class of invariants that relate a local variable in a host to its provenance, i.e., the protocol step that caused it to have its current value. By tracing the provenance of one host variable back to another host, we can derive an invariant relating the two hosts' states. Second, we develop an algorithm called Atomic Sharding to derive Provenance Invariants automatically by statically analyzing the protocol's steps.
We implement these ideas in a tool called Basilisk and apply it to 16 distributed protocols. Basilisk automatically finds a set of invariants and proves their inductiveness, with little or no developer assistance. In all cases, these generated invariants are sufficient for us to prove safety without needing to identify any new inductive invariants.
Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems
Chang Lou, University of Virginia; Dimas Shidqi Parikesit, University of Virginia and Bandung Institute of Technology; Yujin Huang, The Pennsylvania State University; Zhewen Yang and Senapati Diwangkara, Johns Hopkins University; Yuzhuo Jing, University of Michigan; Achmad Imam Kistijantoro, Bandung Institute of Technology; Ding Yuan, University of Toronto; Suman Nath, Microsoft Research; Peng Huang, University of Michigan
Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without explicit errors. Such failures cause serious consequences. Yet, they are extremely challenging to detect, as it requires deep domain knowledge and substantial manual efforts to write good checkers.
In this paper, we explore a novel approach that directly derives semantic checkers from system test code. We first present a large-scale study on existing system test cases. Guided by the study findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C on four large, popular distributed systems and successfully derive tens to hundreds of checkers. These checkers detect 15 out of 20 real-world silent failures we reproduce and incur small runtime overhead.
Picsou: Enabling Replicated State Machines to Communicate Efficiently
Reginald Frank, Micah Murray, Chawinphat Tankuranand, Junseo Yoo, Ethan Xu, and Natacha Crooks, UC Berkeley; Suyash Gupta, University of Oregon; Manos Kapritsos, University of Michigan
Replicated state machines (RSMs) cannot communicate effectively today as there is no formal framework or efficient protocol to do so. To address this issue, we introduce a new primitive, Cross-Cluster Consistent Broadcast (C3B) and present Picsou, a practical C3B implementation. Picsou draws inspiration from networking and TCP to allow two RSMs to communicate with constant metadata overhead in the failure-free case and a minimal number of message resends in the case of failures. Picsou is flexible and allows both crash fault tolerant and Byzantine fault tolerant protocols to communicate. At the heart of Picsou's good performance and generality is the concept of Quacks (quorum acknowledgments). Quacks allow nodes in each RSM to precisely determine when messages have definitely been received, or likely lost. Our results are promising: we obtain up to 24× better performance than prior solutions on microbenchmarks and applications, ranging from disaster recovery to data reconciliation.
FineMem: Breaking the Allocation Overhead vs. Memory Waste Dilemma in Fine-Grained Disaggregated Memory Management
Xiaoyang Wang and Yongkun Li, University of Science and Technology of China; Kan Wu, Google; Wenzhe Zhu and Yuqi Li, University of Science and Technology of China; Yinlong Xu, University of Science and Technology of China and Anhui Provincial Key Laboratory of High Performance Computing
RDMA-enabled memory disaggregation has emerged as an attractive approach to reducing memory costs in modern data centers. While RDMA enables efficient remote read/write operations, it presents challenges in remote memory (de)allocation. Consequently, existing systems adopt coarse-grained allocations (in GBs), leading to memory waste.
We introduce FineMem, an RDMA-connected remote memory management system that enables high-performance, fine-grained memory allocation. FineMem addresses latency and scalability challenges related to fine-grained allocations. It removes RDMA memory region (MR) registration costs from allocation paths through per-compute node MR pre-registration, while ensuring remote memory isolation using RDMA memory windows and a trusted allocation service on each compute node. It employs a lock-free, one-sided RDMA-based protocol to allocate memory chunks (e.g., 4KB, 2MB) without involving the memory node's CPU and maintains metadata consistency during compute node failures via logging. We show that FineMem reduces remote memory allocation latency by as much as 95% compared to state-of-the-art remote memory management systems. It enables memory malloc systems, key-value stores systems, and swap systems running on FineMem to achieve low memory waste with minimal overhead.
To PRI or Not To PRI, That's the question
Yun Wang, Shanghai Jiao Tong University; Liang Chen, Jie Ji, Xianting Tian, and Ben Luo, Alibaba Group; Zhixiang Wei, Zhibai Huang, and Kailiang Xu, Shanghai Jiao Tong University; Kaihuan Peng, Kaijie Guo, Ning Luo, Guangjian Wang, Shengdong Dai, Yibin Shen, and Jiesheng Wu, Alibaba Group; Zhengwei Qi, Shanghai Jiao Tong University
SR-IOV and I/O device passthrough enable network and storage devices to be shared among multiple tenants with high density using virtual functions (VFs), achieving near-native performance. However, passthrough does not support page faults, requiring the hypervisor to statically pin the VM-allocated memory. This approach is unacceptable for cloud service providers (CSPs) that rely on oversubscription to enhance memory utilization and reduce costs. The Page Request Interface (PRI) was designed to support device-side I/O page faults (IOPFs) through collaboration among devices, Input-Output Memory Management Units (IOMMU), and the OS. But PRI has not seen broad adoption in devices like NICs and storage.
We propose VIO, a novel dynamic I/O device passthrough approach that achieves near-native performance and is hardware-independent. By leveraging a shadow available queue, VIO can dynamically and transparently switch devices between VIO and passthrough modes based on I/O operations per second (IOPS) pressure, balancing resource utilization and performance. Each DMA request is probed via IOPA-snooping in the virtio data plane to eliminate IOPFs, while device interrupts are directly passed through to the VM guest, enabling performance close to passthrough. VIO is extensively tested and deployed by a leading global CSP across 300K VMs, supporting both legacy and new instances while reclaiming up to the equivalent of 30K VM memory daily without compromising user Service Level Objectives (SLOs). As the scale grows, the benefits continue to increase.
Enabling Efficient GPU Communication over Multiple NICs with FuseLink
Zhenghang Ren, Yuxuan Li, Zilong Wang, Xinyang Huang, Wenxue Li, Kaiqiang Xu, Xudong Liao, Yijun Sun, and Bowen Liu, Hong Kong University of Science and Technology; Han Tian, University of Science and Technology of China; Junxue Zhang, Hong Kong University of Science and Technology; Mingfei Wang, MetaX Integrated Circuits; Zhizhen Zhong, Massachusetts Institute of Technology; Guyue Liu, Peking University; Ying Zhang, Meta; Kai Chen, Hong Kong University of Science and Technology
Machine learning (ML) clusters stack multiple network interface cards (NICs) within each server to improve inter-server GPU communication bandwidth. However, existing systems fall short in fully utilizing NICs because of static GPU-NIC bindings. This leads to bottlenecks at hot-spot NICs when handling imbalanced communication in ML tasks. For example, large language model serving instances may have different communication demands across NICs; expert-parallel training tasks have imbalanced all-to-all traffic; and the embedding transmission volumes during recommendation model training vary across GPUs. To fully utilize all NICs, we propose FuseLink to enable efficient GPU communication over multiple NICs. FuseLink extends inter-server network by integrating high-speed intra-server connections, and leverages GPUs to efficiently relay traffic to idle NICs. We implement FuseLink and integrate it into NCCL, so that ML applications can benefit from FuseLink seamlessly without code modifications. Compared to NCCL, we demonstrate that FuseLink achieves up to 212GBps bandwidth between two inter-server GPUs and accelerates ML tasks with dynamic traffic patterns. Specifically, it reduces the latencies of first-token generation in LLM model servings by 1.04-2.73×, improves the training throughput of mixture-of-experts model by up to 1.3×, and accelerates deep learning recommendation model training by up to 1.2×.
12:45 pm–2:00 pm
Symposium Luncheon
Back Bay Ballroom
2:00 pm–3:40 pm
Database Systems
Session Chair: Kiran-Kumar Muniswamy-Reddy, Amazon
Tigon: A Distributed Database for a CXL Pod
Yibo Huang, Haowei Chen, and Newton Ni, The University of Texas at Austin; Yan Sun, University of Illinois Urbana–Champaign; Vijay Chidambaram, Dixin Tang, and Emmett Witchel, The University of Texas at Austin
Building efficient distributed transactional databases remains a challenging problem despite decades of research. Existing distributed databases synchronize cross-host concurrent data accesses over a network, which requires numerous message exchanges and introduces performance overhead.
We describe Tigon, the first distributed in-memory database that synchronizes cross-host concurrent data accesses using atomic operations on CXL memory. Using CXL memory is more efficient than network-based approaches, however, Tigon must address the limitations of CXL memory. The limitations are CXL’s higher latency and lower bandwidth relative to local DRAM, and its limited hardware support for cross-host cache coherence. For TPC-C and a variant of YCSB, Tigon achieves up to 2.5× higher throughput compared with two optimized shared-nothing databases that use CXL memory as a transport and up to 18.5× higher throughput compared with an RDMA-based distributed database.
Mako: Speculative Distributed Transactions with Geo-Replication
Weihai Shen, Stony Brook University; Yang Cui, Google; Siddhartha Sen, Microsoft Research; Sebastian Angel, University of Pennsylvania; Shuai Mu, Stony Brook University
This paper introduces Mako, a highly available, high-throughput, and horizontally scalable transactional key-value store. Mako performs strongly consistent geo-replication to maintain availability despite entire datacenter failures, uses multi-core machines for fast serializable transaction processing, and shards data to scale out. To achieve these properties, especially to overcome the overheads of distributed transactions in geo-replicated settings, Mako decouples transaction execution and replication. This enables Mako to run transactions speculatively and very fast, and replicate transactions in the background to make them fault-tolerant. The key innovation in Mako is the use of two-phase commit (2PC) speculatively to allow distributed transactions to proceed without having to wait for their decisions to be replicated, while also preventing unbounded cascading aborts if shards fail prior to the end of replication. Our experimental evaluation on Azure shows that Mako processes 3.66M TPC-C transactions per second when data is split across 10 shards, each of which runs with 24 threads. This is an 8.6× higher throughput than state-of-the-art systems optimized for geo-replication.
Quake: Adaptive Indexing for Vector Search
Jason Mohoney, Devesh Sarda, and Mengze Tang, University of Wisconsin–Madison; Shihabur Rahman Chowdhury and Anil Pacaci, Apple; Ihab F. Ilyas, University of Waterloo; Theodoros Rekatsinas, Apple; Shivaram Venkataraman, University of Wisconsin–Madison
Vector search, the task of finding the k-nearest neighbors of a query vector against a database of high-dimensional vectors, underpins many machine learning applications, including retrieval-augmented generation, recommendation systems, and information retrieval. However, existing approximate nearest neighbor (ANN) methods perform poorly under dynamic and skewed workloads where data distributions evolve. We introduce Quake, an adaptive indexing system that maintains low latency and high recall in such environments. Quake employs a multi-level partitioning scheme that adjusts to updates and changing access patterns, guided by a cost model that predicts query latency based on partition sizes and access frequencies. Quake also dynamically sets query execution parameters to meet recall targets using a novel recall estimation model. Furthermore, Quake utilizes NUMA-aware intra-query parallelism for improved memory bandwidth utilization during search. To evaluate Quake, we prepare a Wikipedia vector search workload and develop a workload generator to create vector search workloads with configurable access patterns. Our evaluation shows that on dynamic workloads, Quake achieves query latency reductions of 1.5–38× and update latency reductions of 4.5–126× compared to state-of-the-art indexes such as SVS, DiskANN, HNSW, and SCANN.
Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD
Hao Guo and Youyou Lu, Tsinghua University
We propose PipeANN, an on-disk graph-based approximate nearest neighbor search (ANNS) system, which significantly bridges the latency gap with in-memory ones. We achieve this by aligning the best-first search algorithm with SSD characteristics, avoiding strict compute-I/O order across search steps. Experiments show that PipeANN has 1.14×--2.02× search latency compared to in-memory Vamana, and 35.0% of the latency of on-disk DiskANN in billion-scale datasets, without sacrificing search accuracy.
Skybridge: Bounded Staleness for Distributed Caches
Robert Lyerly, Meta Platforms Inc.; Scott Pruett, unaffiliated; Kevin Doherty and Greg Rogers, Meta Platforms Inc.; Nathan Bronson, OpenAI; John Hugg, Meta Platforms Inc.
Meta Platforms Inc. is a social media company whose products require high availability and low latency. Meta’s services run in multiple geographic locations around the world and use asynchronous replication to keep the numerous cached copies of the datastore in sync. This setup reduces consistency in order to meet availability and latency requirements. Eventual consistency due to asynchronous replication causes issues for Meta’s services, ranging from minor annoyances to product-breaking bugs. Therefore, we ask: can we put meaningful bounds on how long it takes writes to be visible while maintaining the scalability afforded by eventual consistency?
In this work we present Skybridge, an out-of-band replication stream for providing bounded staleness for distributed caches. Skybridge takes advantage of the fact that Meta’s systems already have a reliable delivery stream and instead focuses on real-time delivery of updates. Skybridge is complementary to the main replication pipeline and avoids correlated failures while being lightweight. We show that Skybridge helps provide 2-second bounded staleness for 99.99998% of writes, while the main replication pipeline only achieves this 99.993% of the time. Skybridge is able to achieve this while only being 0.54% the size of cache deployments.
3:40 pm–4:10 pm
Coffee and Tea Break
Grand and Liberty Foyer
4:10 pm–5:30 pm
AI + Systems I
Session Chair: Cheng Tan, Northeastern University
KPerfIR: Towards a Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
Yue Guan, University of California, San Diego; Yuanwei Fang, Meta; Keren Zhou, George Mason University and OpenAI; Corbin Robeck and Manman Ren, Meta; Zhongkai Yu, University of California, San Diego; Yufei Ding, University of California, San Diego, and Meta; Adnan Aziz, Meta
In this work, we propose KPerfIR, a novel multi-level compiler-centric infrastructure designed to enable the development of customizable, extendable, and portable performance tools tailored for modern artificial intelligence (AI) workloads on modern GPUs. Our approach integrates profiling capabilities directly into the compiler workflow, allowing profiling functionalities to be implemented as compiler passes, offering a programmable and reusable framework for performance analysis. This design bridges the gap between compilers and profilers, enabling fine-grained insights into complex optimization challenges, such as overlapping the execution of fine-grained function units on GPUs. KPerfIR is integrated into the Triton infrastructure to highlight the power of a compiler-centric approach for advancing performance analysis and optimization in the ever-evolving landscape of AI compilers. Our evaluation shows that our tool incurs low overhead (8.2%), provides accurate measurements (2% relative error), and delivers actionable insights into complicated GPU intra-kernel events.
Mirage: A Multi-Level Superoptimizer for Tensor Programs
Mengdi Wu and Xinhao Cheng, Carnegie Mellon University; Shengyu Liu and Chunan Shi, Peking University; Jianan Ji and Man Kit Ao, Carnegie Mellon University; Praveen Velliengiri, Pennsylvania State University; Xupeng Miao, Purdue University; Oded Padon, Weizmann Institute of Science; Zhihao Jia, Carnegie Mellon University
We introduce Mirage, the first multi-level superoptimizer for tensor programs. A key idea in Mirage is µGraphs, a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy. µGraphs enable Mirage to discover novel optimizations that combine algebraic transformations, schedule transformations, and generation of new custom kernels. To navigate the large search space, Mirage introduces a pruning technique based on abstraction that significantly reduces the search space and provides a certain optimality guarantee. To ensure that the optimized µGraph is equivalent to the input program, Mirage introduces a probabilistic equivalence verification procedure with strong theoretical guarantees. Our evaluation shows that Mirage significantly outperforms existing approaches even for DNNs that are widely used and heavily optimized. Mirage is publicly available at https://github.com/mirage-project/mirage.
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
Shouyang Dong, University of Science and Technology of China, Cambricon Technologies, and Institute of Computing Technology, Chinese Academy of Sciences; Yuanbo Wen, Jun Bi, Di Huang, and Jiaming Guo, Institute of Computing Technology, Chinese Academy of Sciences; Jianxing Xu and Ruibai Xu, University of Science and Technology of China, Cambricon Technologies, and Institute of Computing Technology, Chinese Academy of Sciences; Xinkai Song and Yifan Hao, Institute of Computing Technology, Chinese Academy of Sciences; Ling Li, Institute of Software, Chinese Academy of Sciences, and University of Chinese Academy of Sciences; Xuehai Zhou, University of Science and Technology of China; Tianshi Chen, Cambricon Technologies; Qi Guo, Institute of Computing Technology, Chinese Academy of Sciences; Yunji Chen, Institute of Computing Technology, Chinese Academy of Sciences, and University of Chinese Academy of Sciences
Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires to develop multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering “Write Once, Run Anywhere” of tensor programs an open question.
We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLM to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs at the accuracy of 95% on average.
WaferLLM: Large Language Model Inference at Wafer Scale
Congjie He, Yeqi Huang, and Pei Mu, University of Edinburgh; Ziming Miao, Jilong Xue, Lingxiao Ma, and Fan Yang, Microsoft Research; Luo Mai, University of Edinburgh
Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully.
We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators.
Evaluations show that WaferLLM achieves up to 200× higher accelerator utilization than state-of-the-art methods. Leveraging a wafer-scale accelerator (Cerebras WSE2), WaferLLM delivers GEMV operations 606× faster and 16× more energy-efficient than on an NVIDIA A100 GPU. For full LLM inference, WaferLLM achieves 10-20× speedups over A100 GPU clusters running SGLang and vLLM. These advantages are expected to grow as wafer-scale AI models, software, and hardware continue to mature. WaferLLM is open-sourced at https://github.com/MeshInfra/WaferLLM.
6:00 pm–7:30 pm
OSDI '25 Poster Session and Reception
Sponsored by Amazon
Would you like to share a provocative opinion, interesting preliminary work, or a cool idea that will spark discussion at this year's OSDI? The poster session is the perfect venue to introduce such new or ongoing work. Poster presenters will have the opportunity to discuss their work, get exposure, and receive feedback from other attendees during the in-person evening reception. View the list of accepted posters.
Back Bay Ballroom
7:30 pm–8:30 pm
USENIX 50th Anniversary Celebration
Commonwealth Room
9:00 am–10:40 am
AI + Systems II
Session Chair: Yang Zhou, University of California, Davis, and University of California, Berkeley
BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
Dingyan Zhang, Haotian Wang, Yang Liu, and Xingda Wei, Shanghai Jiao Tong University; Yizhou Shan, Huawei Cloud; Rong Chen and Haibo Chen, Shanghai Jiao Tong University
Model autoscaling is the key mechanism to achieve serverless model-as-a-service, but it faces a fundamental trade-off between scaling speed and storage/memory usage to cache parameters, and cannot meet frequent scaling requirements across multiple hosts. The key problem is that data plane performance is slow, and scaled instances remain stopped while parameters are loading.
In this paper, we first show that the data plane can be made fast with no or O(1) caching by loading parameters through the compute network between GPUs because: (1) its speed is comparable to host cache and is underutilized, and (2) scaling multiple instances requires no or O(1) caching with network-optimized multicast. Second, autoscaling can be made live by breaking the scaling abstraction for inference from a coarse-grained instance-level to a fine-grained layer-level. This allows us to offload the layer computation from the overloaded serving instances to the scaled ones without waiting for the parameters to be fully loaded.
Under real-world workloads, our system BLITZSCALE achieves up to 94 % lower tail latency reductions compared to state-of-the-art autoscaling system (ServerlessLLM), and it reduces the GPU time used for serving by 49 % when compared with serving systems that do not support autoscaling like DistServe and vLLM with the same service-level-agreement.Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization
Isu Jeong and Seulki Lee, Ulsan National Institute of Science and Technology
We introduce Bayesian code diffusion, a new deep learning program optimization strategy devised to accelerate the auto-tuning process of deep learning compilers. By using the concepts of prior and posterior distributions in the Bayesian framework and reformulating them to the context of deep learning program optimization, the proposed approach efficiently searches for optimal program code in a significantly reduced search space through an iterative diffusion of program code. To further enhance the efficiency of program optimization, we propose pre-training and fine-tuning of the cost model, which improves both the model's predictive accuracy and training efficiency. We implement Bayesian code diffusion in Ansor and evaluate its performance on a wide range of deep learning models on both CPUs and GPUs. Existing approaches struggle to reliably generate high-performing deep learning programs, i.e., achieving low program execution latency, across various configurations, including diverse deep learning model architectures and hardware platforms (CPU and GPU). In contrast, Bayesian code diffusion reduces the end-to-end compilation (optimization) time required to generate the equivalent program execution latency on various setups, e.g., achieving up to 3.31× optimization speedup. This substantial improvement demonstrates that Bayesian code diffusion performs efficient and principled deep learning program optimization across a wide range of deep learning models, operators, and hardware (CPU and GPU).
Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
Yuxuan Jiang, Ziming Zhou, Boyu Xu, Beijie Liu, Runhui Xu, and Peng Huang, University of Michigan
Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TRAINCHECK, a framework that takes a proactive checking approach to address silent training errors. TRAINCHECK automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the training process while providing debugging help. To evaluate TRAINCHECK, we reproduce 20 real-world silent training errors with diverse root causes. TRAINCHECK successfully detects 18 errors within a single training iteration. It also uncovers 6 unknown bugs in popular training libraries that lead to silent errors.
Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing
Songlin Huang and Chenshu Wu, The University of Hong Kong
As GPUs play an increasingly important role in computer systems in the scaling laws era, understanding fine-grained GPU runtime behavior is more crucial than ever. However, existing GPU kernel profilers, typically kernel-exclusive or hardware-dependent, often fail to capture fine-grained measurements. This paper presents NEUTRINO, a programmable interface for GPU kernel profiling that leverages assembly-layer probing to achieve instruction-level fine granularity, profiling versatility across time and value domains, and hardware independence. To better visualize the rich details captured by NEUTRINO, we introduce the Densified Memory Access Timeline (DMAT), a novel representation that offers new insights into GPU runtime behavior. We implement NEUTRINO in Linux for both NVIDIA and AMD GPUs and conduct extensive evaluations and analyses. The results demonstrate NEUTRINO’s superior capabilities in GPU kernel profiling with low overhead. We envision NEUTRINO as a valuable tool for the community and have open-sourced it to facilitate future research at https://github.com/open-neutrino/neutrino.
Principles and Methodologies for Serial Performance Optimization
Sujin Park, Mingyu Guan, Xiang Cheng, and Taesoo Kim, Georgia Institute of Technology
Throughout the history of computer science, optimizing existing systems to achieve higher performance has been a longstanding aspiration. While the primary emphasis of this endeavor lies in reducing latency and increasing throughput, these two are closely intertwined, and answering the how question has remained a challenge, often relying on intuition and experience.
This paper introduces a systematic approach to optimizing sequential tasks, which are fundamental for overall performance. We define three principles—task removal, replacement, and reordering—and distill them into eight actionable methodologies: batching, caching, precomputing, deferring, relaxation, contextualization, hardware specialization, and layering. Our review of OSDI and SOSP papers over the past decade shows that these techniques, when taken together, comprehensively account for the observed sequential optimization strategies.
To illustrate the framework’s practical value, we present two case studies: one on file and storage systems, and another analyzing kernel synchronization to uncover missed optimization opportunities. Furthermore, we introduce SysGPT, a fine-tuned GPT model trained on curated literature analysis, which offer context-aware performance suggestions. SysGPT’s outputs are more specific and feasible than GPT-4’s, aligning with core strategies from recent research without direct exposure, demonstrating its utility as an optimization assistant.
10:40 am–11:10 am
Coffee and Tea Break
Grand and Liberty Foyer
11:10 am–12:50 pm
Scheduling and Resource Management
Session Chair: Anand Iyer, Georgia Institute of Technology
Söze: One Network Telemetry Is All You Need for Per-flow Weighted Bandwidth Allocation at Scale
Weitao Wang and T. S. Eugene Ng, Rice University
Weighted bandwidth allocation is a powerful abstraction that has a wide range of use cases in modern data center networks. However, realizing highly agile and precise weighted bandwidth allocation for large-scale cloud environments is fundamentally challenging. In this paper, we propose Söze, a lightweight decentralized weighted bandwidth allocation system that leverages simple network telemetry features of commodity Ethernet switches. Given the flow weights, Söze can effectively use the telemetry information to compute and enforce the weighted bandwidth allocations without per-flow, topology, or routing knowledge. We demonstrate the effectiveness of Söze through simulations and testbed experiments, improving TPC-H jobs completion time by up to 0.59× and 0.79× on average.
Decouple and Decompose: Scaling Resource Allocation with DeDe
Zhiying Xu and Minlan Yu, Harvard University; Francis Y. Yan, University of Illinois Urbana-Champaign
Efficient resource allocation is essential in cloud systems to facilitate resource sharing among tenants. However, the growing scale of these optimization problems have outpaced commercial solvers commonly employed in production. To accelerate resource allocation, prior approaches either customize solutions for narrow domains or impose workload-specific assumptions. In this work, we revisit real-world resource allocation problems and uncover a common underlying structure: the vast majority of these problems are inherently separable, i.e., they optimize the aggregate utility of individual resource and demand allocations, under separate constraints for each resource and each demand. Building on this observation, we develop DeDe, a scalable and theoretically rooted optimization framework for large-scale resource allocation. At the core of DeDe is a decouple-and-decompose approach: it decouples entangled resource and demand constraints and thereby decomposes the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel. We have implemented and released DeDe as a Python package with a familiar modeling interface. Our experiments on three representative resource allocation tasks — cluster scheduling, traffic engineering, and load balancing — demonstrate that DeDe delivers significant speedups while generating higher-quality allocations.
Quantum Virtual Machines
Runzhou Tao and Hongzheng Zhu, University of Maryland, College Park; Jason Nieh, Columbia University; Jianan Yao, University of Toronto; Ronghui Gu, Columbia University
Cloud computing services offer time on quantum computers, but users are forced to each use the entire quantum computer to run their programs as there is no way to multiplex a quantum computer among multiple programs at the same time. We present HyperQ, a system that introduces virtual machines for quantum computers to provide fault isolation, better resource utilization, and lower latency for quantum cloud computing. A quantum virtual machine is defined in terms of quantum computer hardware, specifically its quantum gates and qubits arranged in a hardware-specific topology. HyperQ enables quantum virtual machines to be simultaneously executed together on a quantum computer by multiplexing them in time and space on the hardware and ensuring that they are isolated from one another. HyperQ works with existing quantum programs and compiler frameworks; programs are simply compiled to run in virtual machines without the programs or compilers needing to know what else might be executed at the same time. We have implemented HyperQ for the IBM quantum computing service, the largest quantum computing fleet in the world. Our experimental results running quantum programs in virtual machines using the IBM service demonstrate that HyperQ can increase utilization and throughput while reducing program latency, by up to an order of magnitude, without sacrificing, and in some cases improving, fidelity in the results of quantum program execution.
QOS: Quantum Operating System
Emmanouil Giortamis, Francisco Romão, Nathaniel Tornow, and Pramod Bhatotia, TU Munich
Quantum computers face challenges due to hardware constraints, noise errors, and heterogeneity, and face fundamental design tradeoffs between key performance metrics such as quantum fidelity and system utilization. This substantially complicates managing quantum resources to scale the size and number of quantum algorithms that can be executed reliably in a given time.
We introduce QOS, a modular quantum operating system that holistically addresses the challenges of quantum resource management by systematically exploring key design tradeoffs across the stack.QOS exposes a hardware-agnostic API for transparent quantum job execution, mitigates hardware errors, and systematically multi-programs and schedules the jobs across space and time to achieve high quantum fidelity in a resource-efficient manner. QOS's modular design enables synergistic cross- and intra-layer optimizations, while introducing new concepts such as compatibility-based multi-programming and effective utilization.
We evaluate QOS on real quantum devices hosted by IBM, using 7000 real quantum runs of more than 70.000 benchmark instances. We show that the QOS achieves 2.6--456.5× higher fidelity, increases resource utilization by up to 9.6×, and reduces waiting times by up to 5× while sacrificing only 1--3% fidelity, on average, compared to the baselines.
Scalio: Scaling up DPU-based JBOF Key-value Store with NVMe-oF Target Offload
Xun Sun, Mingxing Zhang, Yingdi Shan, Kang Chen, and Jinlei Jiang, Tsinghua University; Yongwei Wu, Tsinghua University and Quan Cheng Laboratory
The rapid growth of data-intensive applications has created a demand for high-density storage systems. Data-Processing-Unit-based (DPU-based) Just a Bunch of Flash (JBOF) solutions provide an energy-efficient and cost-effective architecture to meet this need. However, existing JBOF solutions struggle with scalability when handling an increasing number of attached SSDs, due to their heavy reliance on the DPU's CPU for SSD I/O operations.
In this paper, we introduce Scalio, a scalable disaggregated key-value store designed to address the limitations of current DPU-based JBOF systems. Scalio offloads as many SSD I/O operations as possible to the DPU's network I/O capabilities, including traditional RDMA verbs and a recent hardware optimization, NVMe over Fabrics Target Offload. Additionally, Scalio incorporates a two-layer design with compact in-memory data structures to handle hot read traffic and manage bursty writes. One of the key challenges in this design is ensuring consistency between the DRAM states in the DPU and the SSD states, which, unlike CPU L1/L2 caches, are not automatically synchronized through hardware cache coherence protocols. To address this, Scalio introduces an RDMA-based cache consistency protocol that guarantees linearizability across the system, despite the disaggregated nature of the architecture.
Our experiments show that Scalio significantly improves both scalability and throughput, achieving up to 3.3× higher throughput compared to existing systems, especially in high-density SSD configurations.
12:50 pm–2:00 pm
Symposium Luncheon
Back Bay Ballroom
2:00 pm–3:40 pm
Distributed Systems and Data Centers II
Session Chair: Natacha Crooks, University of California, Berkeley
Low End-to-End Latency atop a Speculative Shared Log with Fix-Ante Ordering
Shreesha G. Bhat, Tony Hong, Xuhao Luo, Jiyu Hu, Aishwarya Ganesan, and Ramnatthan Alagappan, University of Illinois Urbana-Champaign
Today’s shared logs incur expensive coordination to globally order records across storage shards before they can deliver records to applications. This makes them unsuitable for many modern applications that must process ingested data as early as possible and realize low end-to-end (e2e) latencies. We propose SpecLog, a new shared log abstraction that delivers records by speculating the global order, allowing the application’s computation and shared-log coordination to be overlapped, thus reducing e2e latency. To enable accurate speculations, we introduce fix-ante ordering, a novel ordering mechanism that predetermines the global order and makes the shards adhere to the predetermined order. With fix-ante ordering, shards, except in rare cases, can accurately predict where their records will sit in the total order before global coordination. We build Belfast, an implementation of the SpecLog abstraction and fix-ante ordering. Our experiments show that Belfast offers lower e2e latencies than current shared logs while preserving their elasticity, flexibility, and scalability.
Understanding Stragglers in Large Model Training Using What-if Analysis
Jinkun Lin, New York University; Ziheng Jiang, Zuquan Song, Sida Zhao, and Menghan Yu, ByteDance Seed; Zhanghan Wang, New York University; Chenyuan Wang, ByteDance Seed; Zuocheng Shi, Zhejiang University; Xiang Shi, ByteDance; Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, and Xin Liu, ByteDance Seed; Aurojit Panda and Jinyang Li, New York University
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where the training can be stalled by few slow workers. At ByteDance we find stragglers are not trivially always caused by hardware failures, but can arise from multiple complex factors. This work aims to present a comprehensive study on the straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis that simulates the scenario without any stragglers and contrasts with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes for stragglers?
Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems
Xiaohu Chai, Tsinghua University and Ant Group; Tianyu Zhou, Ant Group; Keyang Hu, Tsinghua University; Jianfeng Tan, Tiwei Bie, Anqi Shen, Dawei Shen, Qi Xing, Shun Song, Tongkai Yang, Le Gao, Feng Yu, and Zhengyu He, Ant Group; Dong Du and Yubin Xia, Shanghai Jiao Tong University; Kang Chen, Tsinghua University; Yu Chen, Quan Cheng Laboratory and Tsinghua University
Serverless computing has seen widespread adoption in public cloud environments. However, it continues to suffer from long cold start latency, which remains a key performance bottleneck. We have conducted an in-depth investigation of existing cold start optimizations and evaluated their effectiveness in large-scale industrial deployments. Our study reveals several common limitations in prior research: (1) reliance on simplified assumptions that overlook the complexities of large-scale systems; (2) a narrow focus on optimizing isolated components of the cold start process, while ignoring end-to-end workflow interactions; and (3) insufficient attention to the challenges introduced by concurrent execution environments. As a result, despite incorporating prior techniques, cold start latency on the Ant Group serverless platform remains in the range of hundreds of milliseconds to several seconds.
This paper identifies three previously overlooked sources of latency: (1) control path latency, stemming from interactions within the serverless runtime; (2) resource contention latency, arising under high concurrency and sustained execution; and (3) user code initialization latency, which reflects the trade-off between resource efficiency and startup performance. To address these challenges, we propose a suite of novel techniques that overcome key limitations in existing approaches. These techniques are designed to be both adaptable to real-world workloads and scalable to large deployments. Our system, AFaaS (short for Ant FaaS), reduces cold start latency to the millisecond level. AFaaS has been deployed in production for over 18 months and has consistently demonstrated stable performance at scale.
Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling
David Domingo, Rutgers University; Hugo Barbalho and Marco Molinaro, Microsoft Research; Kuan Liu, Abhisek Pan, David Dion, and Thomas Moscibroda, Microsoft Azure; Sudarsun Kannan, Rutgers University; Ishai Menache, Microsoft Research
In virtual machine (VM) allocation systems, caching repetitive and similar VM allocation requests and associated resolution rules is crucial for reducing computational costs and meeting strict latency requirements. While modern allocation systems distribute requests among multiple allocator agents and use caching to improve performance, current schedulers often neglect the cache state and latency considerations when assigning each new request to an agent. Due to the high variance in costs of cache hits and misses and the associated processing overheads of updating the caches, simple load-balancing and cache-aware mechanisms result in high latencies. We introduce Kamino, a high-performance, latency-driven and cache-aware request scheduling system aimed at minimizing end-to-end latencies. Kamino employs a novel scheduling algorithm grounded in theory which uses partial indicators from the cache state to assign each new request to the agent with the lowest estimated latency. Evaluation of Kamino using a high-fidelity simulator on large-scale production workloads shows a 42% reduction in average request latencies. Our deployment of Kamino in the control plane of a large public cloud confirms these improvements, with a 33% decrease in cache miss rates and 17% reduction in memory usage.
ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization
Zhuang Wang, Rice University; Zhaozhuo Xu, Stevens Institute of Technology; Jingyi Xi, unaffiliated; Yuke Wang, Anshumali Shrivastava, and T. S. Eugene Ng, Rice University
Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs. Its performance bottleneck lies in communications for gradient synchronization. Although high tensor sparsity is widely observed, the optimal communication scheme to fully leverage sparsity is still missing. This paper aims to bridge this gap. We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones. These findings give a new understanding and inspire us to develop a holistic gradient synchronization system for sparse tensors called ZEN. We demonstrate that ZEN can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput compared to the state-of-the-art methods.
3:40 pm–4:10 pm
Coffee and Tea Break
Grand and Liberty Foyer
4:10 pm–5:50 pm
Kernel and Operating Systems I
Session Chair: Emmett Witchel, The University of Texas at Austin
Extending Applications Safely and Efficiently
Yusheng Zheng, UC Santa Cruz; Tong Yu, eunomia-bpf Community; Yiwei Yang, UC Santa Cruz; Yanpeng Hu, ShanghaiTech University; Xiaozheng Lai, South China University of Technology; Dan Williams, Virginia Tech; Andi Quinn, UC Santa Cruz
This paper presents the Extension Interface Model (EIM) and bpftime, which together enable safer and more efficient extension of userspace applications than the current state-of-the-art. EIM is a new model that treats each required feature of an extension as a resource, including concrete hardware resources (e.g., memory) and abstract ones (e.g., the ability to invoke a function from the extended application). An extension manager, i.e., the person who manages a deployment, uses EIM to specify only the resources an extension needs to perform its task. bpftime is a new extension framework that enforces an EIM specification. Compared to prior systems, bpftime is efficient because it uses extended Berkeley Packet Filter (eBPF)-style verification, hardware-supported isolation features (e.g., Intel MPK), and dynamic binary rewriting. Moreover, bpftime is easy to adopt into existing workflows since it is compatible with the current eBPF ecosystem. We demonstrate the usefulness of EIM and bpftime across 6 use cases that improve security, monitor and enhance performance, and explore configuration trade-offs.
Tintin: A Unified Hardware Performance Profiling Infrastructure to Uncover and Manage Uncertainty
Ao Li, Marion Sudvarg, Zihan Li, Sanjoy Baruah, Chris Gill, and Ning Zhang, Washington University in St. Louis
Hardware performance counters (HPCs) enable the measurement of microarchitectural events, which are crucial for tracking and predicting program behavior. High-fidelity measurement and precise attribution are essential for accurate profiling. However, existing profiling tools have fundamental challenges in both aspects. In measurement, numerous events compete for limited hardware monitoring resources; while for attribution, applications have diverse requirements, but systems provide limited support. Existing tools mitigate the former limitation through event multiplexing, but this approach introduces non-trivial errors. The latter limitation, however, remains largely unaddressed.
This paper introduces Tintin, an HPC profiling infrastructure with a modular three-component design that addresses both challenges. Tintin introduces mechanisms to mitigate multiplexing errors by characterizing uncertainty at runtime, scheduling events to minimize it, and reporting uncertainty to applications. It also proposes the Event Profiling Context (ePX) as a new OS primitive to unify diverse profiling requirements. Tintin is evaluated using benchmarks as well as real-world resource orchestration, performance debugging, and intrusion detection systems, to demonstrate its ability to improve hardware profiling with low runtime overhead.
Building Bridges: Safe Interactions with Foreign Languages through Omniglot
Leon Schuermann and Jack Toubes, Princeton University; Tyler Potyondy and Pat Pannuto, University of California San Diego; Mae Milano and Amit Levy, Princeton University
Awarded Best Paper!
Memory- and type-safe languages promise to eliminate entire classes of systems vulnerabilities by construction. In practice, though, even clean-slate systems often need to incorporate libraries written in other languages with fewer safety guarantees. Because these interactions threaten the soundness of safe languages, they can reintroduce the exact vulnerabilities that safe languages prevent in the first place.
This paper presents Omniglot: the first framework to efficiently uphold safety and soundness of Rust in the presence of unmodified and untrusted foreign libraries. Omniglot facilitates interactions with foreign code by integrating with a memory isolation primitive and validation infrastructure, and avoids expensive operations such as copying or serialization.
We implement Omniglot for two systems: we use it to integrate kernel components in a highly-constrained embedded operating system kernel, as well as to interface with conventional Linux userspace libraries. Omniglot performs comparably to approaches that deliver weaker guarantees and significantly better than those with similar safety guarantees.
KRR: Efficient and Scalable Kernel Record Replay
Tianren Zhang, SmartX; Sishuai Gong and Pedro Fonseca, Purdue University
Modern kernels are large, complex, and plagued with bugs. Unfortunately, their large size and complexity make kernel failures very challenging for developers to diagnose since failures encountered in deployment are often notoriously difficult to reproduce. Although record-replay techniques provide the powerful ability to accurately record a failed execution and deterministically replay it, enabling advanced manual and automated analysis techniques, they are inefficient and do not scale with modern I/O-intensive, concurrent workloads.
This paper introduces KRR, a kernel record-replay framework that provides a highly efficient execution recording mechanism by narrowing the scope of the record and replay boundary to the kernel. Unlike previous record-replay whole-stack approaches, KRR adopts a split-recorder design that employs the guest and the host to jointly record the kernel execution. Our evaluation demonstrates that KRR scales efficiently up to 8 cores, across a range of different workloads, including kernel compilation, RocksDB, and Nginx. When recording 8-core VMs that run RocksDB and kernel compilation, KRR incurs only a 1.52× ~ 2.79× slowdown compared to native execution, while traditional whole-VM RR suffers from 8.97× ~ 29.94× slowdown. We validate that KRR is practical and has a broad recording scope by reproducing 17 bugs across different Linux versions, including 6 non-deterministic bugs and 5 high-risk CVEs; KRR was able to record and reproduce all but one non-deterministic bug.
Deterministic Client: Enforcing Determinism on Untrusted Machine Code
Zachary Yedidia, Stanford University; Geoffrey Ramseyer, Stanford University and Stellar Development Foundation; David Mazières, Stanford University
This paper presents Deterministic Client (DeCl), a software-based sandboxing system for enforcing deterministic behavior on untrusted machine code, either x86-64 or Arm64. DeCl adapts techniques from Software Fault Isolation (SFI) traditionally used to guarantee memory isolation to instead enforce the stronger property of determinism. By using a simple and efficient machine code verifier that can guarantee that a program behaves deterministically, DeCl does not rely on a trusted compiler/interpreter for correctness. This allows the use of LLVM without compromising the size of the trusted code base. We also describe how to implement two efficient metering mechanisms that enforce deterministic preemption of sandboxed programs, and how DeCl can be implemented in combination with traditional software-based isolation, by making the sandboxed code position-oblivious. DeCl is able to combine and improve upon the benefits of both interpreters and JIT compilers at once, with low CPU overhead, fast startup time, and strong security via a small trusted code base. We evaluate DeCl's effectiveness on general-purpose CPU benchmarks, as well as in an application-specific context by integrating with the Groundhog smart contract engine, and using DeCl for zero-knowledge-proof verification.
6:00 pm–7:30 pm
USENIX ATC '25 Poster Session and Reception
The USENIX ATC '25 poster session and reception will feature posters by authors presenting their work at the conference. View the list of accepted posters.
Back Bay Ballroom
7:30 pm–8:30 pm
Tribute to USENIX ATC
Commonwealth Room
9:00 am–10:40 am
Kernel and Operating Systems II
Session Chair: Sudarsun Kannan, Rutgers University
Disentangling the Dual Role of NIC Receive Rings
Boris Pismenny, EPFL & NVIDIA; Adam Morrison, Tel Aviv University; Dan Tsafrir, Technion — Israel Institute of Technology
CPUs parallelize packet processing across cores via per-core receive (Rx) rings, which are typically sized to absorb bursts with >=1Ki entries by default. The combined I/O working set (packet buffers pointed to by all Rx rings) easily exceeds the LLC capacity, thus degrading performance due to high memory bandwidth pressure. Recent work has reduced the I/O working set size by sharing Rx rings among cores with the "shRing" system. But this approach suffers from a bottleneck under imbalanced loads, which are common.
We contend that the bottleneck stems from an unnecessary entanglement of two orthogonal producer-consumer structures: (1) memory allocation, where the core produces empty buffers that the NIC consumes to store packets; and (2) packet delivery, where the NIC produces incoming packets that the core consumes. We propose rxBisect, a new CPU-NIC interface that decouples these structures. RxBisect replaces each Rx ring with two separate rings corresponding to the two structures, allowing memory allocation to be performed independently of packet reception. RxBisect can thus pass empty buffers efficiently between cores upon imbalance, thereby eliminating the aforementioned bottleneck. We implement rxBisect with software emulation and find that it improves throughput by up to 20% and 37% relative to the state-of-the-art (shRing) and state-of-the-practice (per-core Rx rings).
XSched: Preemptive Scheduling for Diverse XPUs
Weihang Shen, Mingcong Han, Jialong Liu, Rong Chen, and Haibo Chen, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
XPUs, such as GPUs, NPUs, ASICs, and FPGAs, lack flexible scheduling capabilities, failing to meet rich application requirements (e.g., priority and fairness) in multitasking environments. This paper presents XSched, a scheduling framework that enables preemptive scheduling on diverse XPUs with flexible policies. XSched provides unified interfaces for scheduling XPU tasks through a preemptible command queue abstraction (XQueue). The key challenge in implementing the abstraction is adapting to XPUs with diverse and evolving hardware capabilities and software stacks. XSched proposes a multi-level hardware model that enables mature, advanced XPUs to achieve optimal scheduling performance while maintaining compatibility with emerging, wimpy XPUs. To demonstrate the generalizability of XSched, we adapted it to ten XPUs of different types, brands, and generations across seven software platforms and implemented two hardware-agnostic scheduling policies. We further evaluated XSched through three case studies of multitasking workloads on XPUs. XSched effectively achieves various scheduling objectives using its efficient and flexible preemption mechanisms.
OS Rendering Service Made Parallel with Out-of-Order Execution and In-Order Commit
Yuanpei Wu and Dong Du, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Chao Xu, Fields Lab, Huawei Central Software Institute; Yubin Xia and Yang Yu, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Ming Fu, Fields Lab, Huawei Central Software Institute; Binyu Zang and Haibo Chen, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education
Rendering service is an indispensable OS service on smart-device OSes like Android, iOS and OpenHarmony. However, the recent shift towards highly scalable display scenarios, such as foldable and multiple screens, has notably amplified the rendering workload, leading to low frame rates that degrade user experience. Yet, rendering services predominantly follow a sequential model, which is notoriously hard to parallelize due to the complex state dependency, drawing order dependency, and interface dependency. This paper observes that a significant portion of the rendering procedure is potentially parallelizable through proper state pre-untangling and drawing order post-preserving. To this end, this paper introduces Spars, a scalable parallelized OS rendering service inspired by the out-of-order execution with in-order commit in computer architecture. Spars revolutionizes the rendering procedure by initially generating self-contained rendering tasks through in-order preparation, executing such tasks in an out-of-order manner to maximize multi-core parallelism, and subsequently committing the tasks in-order to enforce drawing order dependencies. Evaluation results on state-of-the-art single-screen, dual-fold, and tri-fold smartphones (Mate 70, X5, XT) as well as one-chip-multiple-screen configurations show an average frame rate improvement of 1.76×–1.91×. Moreover, Spars is able to decrease the device power consumption by 3.0% or increase the budget of graphics primitives by 2.31× for more appealing visual effects with the same stable frame rate.
EMT: An OS Framework for New Memory Translation Architectures
Siyuan Chai, Jiyuan Zhang, Jongyul Kim, Alan Wang, Fan Chung, and Jovan Stojkovic, University of Illinois Urbana-Champaign; Weiwei Jia, University of Rhode Island; Dimitrios Skarlatos, Carnegie Mellon University; Josep Torrellas and Tianyin Xu, University of Illinois Urbana-Champaign
With terabyte-scale memory capacity and memory-intensive workloads, memory translation has become a major performance bottleneck. Many novel hardware schemes are developed to speed up memory translation, but few are experimented with commodity OSes. A main reason is that memory management in major OSes, like Linux, does not have the extensibility to empower emerging hardware schemes.
We develop EMT, a pragmatic framework atop Linux to empower different hardware schemes of memory translation such as radix tree and hash table. EMT provides an architecture neutral interface that 1) supports diverse memory translation architectures, 2) enables hardware-specific optimizations, 3) accommodates modern hardware and OS complexity, and 4) has negligible overhead over hardwired implementations. We port Linux’s memory management onto EMT and show that EMT enables extensibility without sacrificing performance. We use EMT to implement OS support for ECPT and FPT, two recent experimental translation schemes for fast translation; EMT enables us to understand the OS perspective of these architectures and further optimize their designs.
Tiered Memory Management Beyond Hotness
Jinshu Liu, Hamid Hadian, Hanchen Xu, and Huaicheng Li, Virginia Tech
Tiered memory systems often rely on access frequency (''hotness'') to guide data placement. However, hot data is not always performance-critical, limiting the effectiveness of hotness-based policies. We introduce amortized offcore latency (AOL), a novel metric that precisely captures the true performance impact of memory accesses by accounting for memory access latency and memory-level parallelism (MLP). Leveraging AOL, we present two powerful tiering mechanisms: SOAR, a profile-guided allocation policy that places objects based on their performance contribution, and ALTO, a lightweight page migration regulation policy to eliminate unnecessary migrations. SOAR and ALTO outperform four state-of-the-art tiering designs across a diverse set of workloads by up to 12.4×, while underperforming in a few cases by no more than 3%.
10:40 am–11:10 am
Coffee and Tea Break
Grand and Liberty Foyer
11:10 am–12:30 pm
AI + Systems III
Session Chair: Deepti Raghavan, Brown University
NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu, University of Washington; Yufei Gao, Tsinghua University and University of Washington; Yilong Zhao, University of Washington and University of California, Berkeley; Liangyu Zhao, University of Washington; Gefei Zuo, University of Michigan; Yile Gu and Dedong Xie, University of Washington; Tian Tang and Qinyu Xu, Tsinghua University and University of Washington; Zihao Ye, Keisuke Kamahori, and Chien-Yu Lin, University of Washington; Ziren Wang, Tsinghua University and University of Washington; Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci, University of Washington
Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems’ performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute bound for most common workloads and LLMs. Alas, most existing serving engines fall short from optimal compute utilization, because the heterogeneous operations that comprise LLM serving—compute, memory, networking—are executed sequentially within a device.
We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of heterogeneous resources within a single device. NanoFlow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. NanoFlow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize the execution time, while considering the interference of concurrent operations. We evaluate NanoFlow's end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8×7B, LLaMA-3-8B, etc. With practical workloads, NanoFlow provides 1.91× throughput boost compared to state-of-the-art serving systems achieving 50% to 72 % of optimal throughput across popular models.
PipeThreader: Software-Defined Pipelining for Efficient DNN Execution
Yu Cheng, Lei Wang, and Yining Shi, School of Computer Science, Peking University; Yuqing Xia, Lingxiao Ma, Jilong Xue, and Yang Wang, Microsoft Research; Zhiwen Mo, Imperial College London and Microsoft Research; Feiyang Chen, Shanghai Jiao Tong University and Microsoft Research; Fan Yang and Mao Yang, Microsoft Research; Zhi Yang, School of Computer Science, Peking University
To effectively utilize heterogeneous specialized hardware units in modern GPUs, such as TensorCores and Tensor Memory Accelerators, this paper introduces PipeThreader, a new DNN compiler. PipeThreader proposes shifting scheduling functionality from hardware to software so as to enable more efficient and sophisticated computation pipelining with minimal manual effort. This is achieved through sTask-graph, a new DNN computation abstraction, a hierarchical hardware abstraction that captures the capabilities of specialized units, and new scheduling primitives. As a result, PipeThreader can discover efficient pipeline scheduling for well-studied DNN architectures like FlashAttention, achieving comparable or even superior performance. Additionally, it can uncover novel pipeline schemes for emerging models like Mamba2, delivering significantly better performance compared to state-of-the-art hand-crafted implementations. The code is open-sourced at https://github.com/tile-ai/tilelang.
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
Zheng Wang, University of California, San Diego, and Meta; Anna Cai and Xinfeng Xie, Meta; Zaifeng Pan and Yue Guan, University of California, San Diego; Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, and Yuchen Hao, Meta; Yufei Ding, University of California, San Diego, and Meta
In this work, we present WLB-LLM, a WorkLoad-Balanced 4D Parallelism for Large Language Model Training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method to balance the computation and communication workload across micro-batches. Additionally, at the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23× when applying WLB-LLM in our internal LLM training framework.
DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
Yeonhong Park, Jake Hyun, Hojoon Kim, and Jae W. Lee, Seoul National University
Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose DecDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and latency reduction. DecDEC stores the residual matrix—the difference between full-precision and quantized weights—in CPU, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, with the fetched residuals helping to correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by analyzing the input activations---this enables adaptation to the dynamic nature of activation distribution, thus maximizing the effectiveness of error compensation. We demonstrate the effectiveness of DecDEC by augmenting state-of-the-art quantization methods. For example, DecDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model from 10.15 to 9.12—outperforming its 3.5-bit counterpart—while adding less than 0.0003% to GPU memory usage and incurring only a 1.7% inference slowdown on NVIDIA RTX 4050 Mobile.
12:30 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:40 pm
File and Storage Systems
Session Chair: Hakim Weatherspoon, Cornell University and Exosteller, Inc.
Stripeless Data Placement for Erasure-Coded In-Memory Storage
Jian Gao, Jiwu Shu, Bin Yan, and Yuhao Zhang, Tsinghua University; Keji Huang, Huawei Technologies Co., Ltd
Erasure coding plays a crucial role in distributed storage systems to provide fault tolerance at a low storage cost. Conventional erasure coding schemes are based on stripes. However, placing data into stripes can incur non-negligible performance overheads that will manifest in emerging fast in-memory storage systems, making conventional erasure coding schemes suboptimal in such scenarios.
Aiming to eliminate such overheads, we present Nos, a stripeless erasure coding scheme. It lets each node in the storage system independently replicate data to other nodes and encode received data replica into parities with XOR. Thus, Nos avoids the overheads caused by stripes. To enable failure recovery, Nos uses a combinatoric structure called symmetric balanced incomplete block design (SBIBD) to decide primary-to-backup node affinities during replication. Atop Nos, we further build Nostor, a distributed in-memory key-value store. Evaluations demonstrate that Nostor achieves 1.61x and 2.60x throughputs with similar or lower latencies than stripe-based erasure coding baselines.
PoWER Never Corrupts: Tool-Agnostic Verification of Crash Consistency and Corruption Detection
Hayley LeBlanc, University of Texas at Austin; Jacob R. Lorch and Chris Hawblitzel, Microsoft Research; Cheng Huang and Yiheng Tao, Microsoft; Nickolai Zeldovich, MIT CSAIL and Microsoft Research; Vijay Chidambaram, University of Texas at Austin
Awarded Distinguished Artifact!
Storage systems must maintain integrity even after rare and difficult-to-test-for conditions like power losses and media errors. Formal verification presents a promising avenue to ensure storage systems are resilient, but current approaches involve significant complexity and rely on verification constructs or forms of logic beyond what most verifiers natively support. In this paper, we present two new verification techniques that rely only on standard constructs provided by most verification tools such as Hoare logic, ghost variables, and quantifiers. First, we introduce PoWER (Preconditions on Writes Enforcing Recoverability), a novel approach to verifying crash consistency that encodes its requirements in the preconditions of storage API methods. Second, we present a new model of media corruption for provable corruption detection on any type of storage device. To demonstrate the power of these new techniques, we use them to build two verified storage systems using two different verification frameworks. We build and verify the key-value (KV) store CapybaraKV using Verus and the notary server CapybaraNS using Dafny. Both systems are built for persistent memory (PM), which we target due to new challenges it presents to building resilient storage systems. We develop new techniques to address these challenges, including the corruption-detecting Boolean, a new primitive for atomic checksum updates. Both systems verify in under a minute, and CapybaraKV achieves performance competitive with similar unverified PM KV stores.
Fast and Synchronous Crash Consistency with Metadata Write-Once File System
Yanqi Pan, Wen Xia, Yifeng Zhang, Xiangyu Zou, and Hao Huang, Harbin institute of Technology, Shenzhen; Zhenhua Li, Tsinghua University; Chentao Wu, Shanghai Jiao Tong University
Low-latency persistent memory (PM) encourages file systems to pursue synchronous crash consistency. However, existing crash consistency approaches, such as journaling and log structure file system, incur many small, random, and ordered metadata I/Os, failing to exploit PM I/O potential. We propose a new file system model called metadata write-once file system (WOFS) to achieve fast and synchronous crash consistency. The key idea is to generate specific metadata for each file operation as a checksum-protected package and write it once with a single ordering point. The package is then managed to provide file abstractions through a package translation layer without extra writes. Using an array of techniques to generate, organize, and recover from packages, WOFS can provide practical, efficient, and reliable file system services. We implement WOLVES as a WOFS prototype in Linux kernel. Experiments using benchmarks and applications suggest that WOLVES can recover from crashes, improve operation throughput, and potentially reach PM I/O bandwidth limits.
Decentralized, Epoch-based F2FS Journaling with Fine-grained Crash Recovery
Yaotian Cui and Zhiqi Wang, The Chinese University of Hong Kong, China; Renhai Chen, College of Intelligence and Computing, Tianjin University, China; Zili Shao, The Chinese University of Hong Kong, China
F2FS, a log-structured filesystem, has gained widespread adoption in Android systems. However, F2FS relies on coarse-grained checkpointing for crash recovery. When triggered, this mechanism significantly degrades system performance by blocking file writes. Additionally, F2FS’s checkpointing approach may not fully recover file data and metadata to a consistent state after a crash. Given these limitations, it is crucial to design a new journaling mechanism for F2FS that provides fine-grained crash recovery. While journaling methods are well-studied for in-place-update filesystems (such as JBD2 for EXT4), directly applying these state-of-the-art techniques to F2FS - an out-of-place-update filesystem - does not yield similar benefits.
In this paper, we propose a novel journaling technique, called F2FSJ, for F2FS with ordered journal mode. Catering to the out-of-place update features of F2FS, F2FSJ incorporate several innovative designs. First, in F2FSJ, only metadata changes are journaled and committed after data flushing, by which I/O and storage overheads can be mitigated. Second, we propose a decentralized journal design by embedding journal logs into inodes, which significantly reduces lock contention and interference when recording metadata changes. Third, we propose an epoch-based approach with a novel data/control-plane decoupling mechanism, which eliminates waiting times during journal period transfers. Finally, for journal apply, we propose a fast-forward-to-latest approach to consolidate multiple small updates into one update for reducing small writes. We have implemented a fully functional prototype of F2FSJ and conducted extensive experiments. Our experimental results demonstrate that F2FSJ can effectively reduce the checkpointing time by up to 4.9x and reduce the latency by up to 35% compared with F2FS. F2FSJ is open-sourced for public access.
Okapi: Decoupling Data Striping and Redundancy Grouping in Cluster File Systems
Sanjith Athlur and Timothy Kim, Carnegie Mellon University; Saurabh Kadekodi, Google; Francisco Maturana and Xavier Ramos, Carnegie Mellon University; Arif Merchant, Google; K. V. Rashmi and Gregory R. Ganger, Carnegie Mellon University
The Okapi cluster file system decouples how data is spread across disks (data striping) for IO efficiency from how data is erasure coded together (redundancy grouping) for durability. Existing systems couple these two mechanisms’ configurations, inducing significant inefficiencies. Decoupling allows grouping to be configured based on reliability and space efficiency goals, while simultaneously allowing striping to be configured based on performance goals. Decoupling also allows redundancy scheme changes from one EC scheme to another (e.g., to react to data temperature or disk failure rate changes) to occur without having to re-write data. Evaluation of an Okapi prototype shows that decoupling can be accomplished with <1% increase in metadata size and file manager memory, and minimal file creation and degraded read resource increase. Experiments demonstrate that decoupling can improve read throughput by 80% and reduce seeks per second by up to 70%, without yielding any data reliability, and reduce the overhead of redundancy transitions by up to 70%.
3:40 pm–4:10 pm
Coffee and Tea Break
Grand and Liberty Foyer
4:10 pm–5:30 pm
Privacy and Security
Session Chair: Heming Cui, The University of Hong Kong
Compass: Encrypted Semantic Search with High Accuracy
Jinhao Zhu, UC Berkeley; Liana Patel, Stanford University; Matei Zaharia and Raluca Ada Popa, UC Berkeley
We present Compass, a semantic search system for encrypted data that achieves high accuracy, matching state-of-the-art plaintext search quality, while ensuring the privacy of data, queries, and results, even if the server is compromised. Compass contributes a novel way to traverse a state-of-the-art graph-based semantic search index and a white-box co-design with Oblivious RAM, a cryptographic primitive that hides access patterns, to enable efficient search over encrypted embeddings. With our techniques, Directional Neighbor Filtering, Speculative Neighbor Prefetch, and Graph-Traversal Tailored ORAM, Compass achieves user-perceived latencies within or around a second and is orders of magnitude faster than baselines under various network conditions.
Weave: Efficient and Expressive Oblivious Analytics at Scale
Mahdi Soleimani, Grace Jia, and Anurag Khandelwal, Yale University
Many distributed analytics applications that are offloaded to the cloud operate on sensitive data. Even when the computations for such analytics workloads are confined to trusted hardware enclaves and all stored data and network communications are encrypted, several studies have shown that they are still vulnerable to access pattern attacks. Prior efforts towards preventing access pattern leakage often incur network and compute overheads that are logarithmic in dataset size, while also limiting the functionality of supported analytics jobs.
We present Weave, an efficient, expressive, and secure analytics platform that scales to large datasets. Weaveemploys a combination of noise injection and hardware memory isolation via enclave page caches to reduce the network and compute overheads for oblivious analytics to a constant factor. Weave also employs several optimizations and extensions that exploit dataset and workload-specific properties to ensure performance at scale without compromising on functionality. Our evaluations show that Weave reduces the end-to-end execution time for a wide range of analytics jobs on large real-world datasets by 4--10× compared to prior state-of-the-art while providing strong obliviousness guarantees.
Paralegal: Practical Static Analysis for Privacy Bugs
Justus Adam, Carolyn Zech, Livia Zhu, Sreshtaa Rajesh, Nathan Harbison, Mithi Jethwa, Will Crichton, Shriram Krishnamurthi, and Malte Schwarzkopf, Brown University
Finding privacy bugs in software today usually requires onerous manual audits. Code analysis tools could help, but existing tools aren’t sufficiently practical and ergonomic to be used.
Paralegal is a static analysis tool to find privacy bugs in Rust programs. Key to Paralegal’s practicality is its distribution of work between the program analyzer, privacy engineers, and application developers. Privacy engineers express a high-level privacy policy over markers, which application developers then apply to source code entities. Paralegal extracts a Program Dependence Graph (PDG) from the program, leveraging Rust’s ownership type system to model the behavior of library code. Paralegal augments the PDG with the developers’ markers and checks privacy policies against the marked PDG.
In an evaluation on eight real-world applications, Paralegal found real privacy bugs, including two previously unknown ones. Paralegal supports a broader range of policies than information flow control (IFC) and CodeQL, a widely-used code analysis engine. Paralegal is fast enough to deploy interactively, and its markers are easy to maintain as code evolves.
MettEagle: Costs and Benefits of Implementing Containers on Microkernels
Till Miemietz, Viktor Reusch, and Matthias Hille, Barkhausen Institut; Lars Wrenger, Leibniz-Universität Hannover; Jana Eisoldt, Barkhausen Institut; Jan Klötzke, Kernkonzept GmbH; Max Kurze, Technische Universität Dresden; Adam Lackorzynski, Technische Universität Dresden and Kernkonzept GmbH; Michael Roitzsch, Barkhausen Institut; Hermann Härtig, Barkhausen Institut and Technische Universität Dresden
Today, many applications are hosted by cloud providers. In order to isolate the workloads of different clients, cloud enterprises mostly rely on containers rather than standard processes, since the latter are able to exercise a lot of ambient authority. Containers counter this deficiency by sandboxing processes. To this end, they use dedicated security mechanisms such as seccomp-bpf. However, these mechanisms add complexity to the kernel and increase its attack surface, thus prompting new security challenges. Processes in microkernel-based systems do not have ambient authority. Thus, they do not require additional security mechanisms to build sandboxes. In this paper, we try to answer the question whether a microkernel-based OS architecture enables a leaner and more secure container infrastructure. Based on a CVE analysis, we show that the conceptual simplicity of containers on microkernels results in a better security posture than that typically found on monolithic systems. We furthermore demonstrate the practical feasibility of implementing containers on state-of-the-art microkernels by building MettEagle, a prototype container service running on L4Re. We found that applications running in containers on L4Re expose performance characteristics comparable to that of containers on Linux for both synthetic and real-world benchmarks. In some cases, the container implementation of L4Re even outperforms Linux, accelerating container startup latency and improving network performance.
5:30 pm–5:35 pm
Closing Remarks
Program Co-Chairs: Lidong Zhou, Microsoft, and Yuanyuan Zhou, University of California, San Diego
Grand and Liberty Ballroom
