NSDI '26 offers authors the choice of two submission deadlines. The list of accepted papers from the spring deadline is available below.
From Source to Solution: Tackling Packet Losses in Large-scale Cloud Gaming Systematically and Precisely
Jing Wang, Xiao Kong, Yunzhe Ni, Nian Wen, Jiaxing Zhang, Congcong Miao, and Honghao Liu, Tencent Inc.
Cloud gaming requires all video frames to be delivered before a stringent delay deadline to ensure seamless gaming experience. However, meeting this requirement is challenging due to packet losses, which greatly magnify the frame delay. Various FEC-based loss recovery schemes were recently proposed to address the packet loss issue. However, the source of such packet losses remains unrevealed. Our production measurement results from Tencent START cloud gaming platform have shown that 66.5% of packet losses are caused by network infrastructure’s preferences against UDP and network congestion. Moreover, off-the-shelf video streaming systems like WebRTC could not detect retransmission loss efficiently. These issues completely nullified the performance gain of loss recovery schemes. To address this, we design and implement LADR, which combines loss avoidance, detection, and recovery to tackle packet losses. LADR incorporates the loss-based and delay-based congestion control algorithms and adopts RACK-TLP for loss avoidance and detection. Furthermore, LADR adopts an opportunistic FEC scheme to perform loss recovery. LADR has been rolled out at Tencent START cloud gaming platform, a large-scale cloud gaming provider, for one year. Production measurement results show that LADR suffers only a 0.049% packet loss rate (-59.8% vs. existing solutions) and delivers 99.87% of video frames within 100 milliseconds.
Mortise: Auto-tuning Congestion Control to Optimize QoE via Network-Aware Parameter Optimization
Yixin Shen, Tsinghua University, Bytedance Inc., and Zhongguancun Laboratory; Ruihua Chen, Tsinghua University; Bo Wang, Tsinghua University and Zhongguancun Laboratory; Jing Chen, Haochen Zhang, and Minhu Wang, Tsinghua University; Yan Liu, Bytedance Inc.; Mingwei Xu, Tsinghua University and Zhongguancun Laboratory; Zili Meng, Hong Kong University of Science and Technology
Congestion control algorithms (CCAs) critically shape the tradeoff among throughput, latency, and loss, directly impacting user Quality of Experience (QoE).
However, most existing CCAs use static, heuristically chosen parameter settings that fail to adapt to dynamic network states, resulting in suboptimal QoE. Our key observation is that the optimal CCA parameter configuration depends on real-time network states.
To bridge this gap, we propose Mortise, a real-time, network-aware adaptation framework that dynamically tunes rule-based CCA parameters to maximize QoE.
To address the challenges in modeling the complex parameter-QoE relationship, Mortise introduces a QoS tradeoff proxy to decompose parameter optimization into two steps: it first infers the application's preferred QoS tradeoff from real-time QoE gradients and then derives the corresponding parameter settings via control-theoretic analysis.
Implemented atop TCP and evaluated in both emulated and production environments, Mortise outperforms state-of-the-art solutions, enhancing the QoE of file downloading service by up to 73% and QoE of video streaming service by up to 167% in real-world scenarios, with minimal deployment overhead.
HEDGE: Traffic Engineering with Probabilistic Link Capacities
Arjun Devraj, Cornell University; Bill Owens, NYSERNet; Umesh Krishnaswamy, Microsoft; Ying Zhang, Meta; Rachee Singh, Cornell University
Cloud providers have adopted higher modulation formats to achieve higher data-rate wavelengths in their optical wide-area networks. However, higher modulation formats reduce signal quality margins, making wavelengths more susceptible to wavelength-specific faults (WSFs)—temporary faults that selectively affect certain wavelengths while others remain unaffected, even though they all share the same optical fiber and equipment. WSFs cause the capacity of inter-datacenter links to fluctuate, frequently disrupting traffic engineering systems. We propose HEDGE, a system that mitigates the effects of WSFs by implementing link-local resilience and global network-wide resilience against WSFs. For local resilience, HEDGE provisions inter-datacenter links with a guaranteed minimum capacity and availability target, in spite of WSFs, while using the fewest possible constituent wavelengths. For global resilience, HEDGE optimally balances throughput and availability while allocating flows on a stochastic wide-area network with fluctuating link capacities. HEDGE sustains equivalent throughput with state-of-the-art traffic engineering systems, while dropping 12.2× less network flow in worst-case scenarios and reducing disruptions to tunnel allocations by 622× in spite of a rapidly changing topology.
In Link We Trust: BFT at the Speed of CFT using Switches
Lior Zeno, Naama Ben-David, and Mark Silberstein, Technion – Israel Institute of Technology
We introduce SwitchBFT, a novel BFT consensus protocol for data centers that matches the performance and fault tolerance guarantees of the fastest Crash Fault Tolerance (CFT) protocols. We take advantage of several unique properties of the trusted network that have emerged in modern data centers. SwitchBFT leverages packet source authentication to eliminate the overheads of cryptographic signatures, thus speeding up the fault-free scenario, and utilizes network switch programmability to enforce agreement decisions and to verify that safety is not violated, thereby offering robust performance even when some replicas are faulty. Designing a practical BFT that makes the most of these properties requires solving several challenges, such as packet losses and switch crash faults, all within the tight switch resource budget. We show that SwitchBFT outperforms state-of-the-art BFTs in scalability and performance, attaining the speed of NOPaxos, an in-switch CFT implementation.
HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds
Chiheng Lou, Sheng Qi, and Chao Jin, School of Computer Science, Peking University; Dapeng Nie, Haoran Yang, and Yu Ding, Alibaba Group; Xuanzhe Liu and Xin Jin, School of Computer Science, Peking University
With the proliferation of large language model (LLM) variants, developers are turning to serverless computing for cost-efficient LLM deployment. However, public cloud providers often struggle to provide performance guarantees for serverless LLM serving due to significant cold start latency caused by substantial model sizes and complex runtime dependencies. To address this problem, we present HydraServe, a serverless LLM serving system designed to minimize cold start latency in public clouds. HydraServe proactively distributes models across servers to quickly fetch them, and overlaps cold-start stages within workers to reduce startup latency. Additionally, HydraServe strategically places workers across GPUs to avoid network contention among cold-start instances. To minimize resource consumption during cold starts, HydraServe further introduces pipeline consolidation that can merge groups of workers into individual serving endpoints. Our comprehensive evaluations under diverse settings demonstrate that HydraServe reduces the cold start latency by 1.7×–4.7× and improves service level objective attainment by 1.43×–1.74× compared to baselines.
FRCC: Towards Provably Fair and Robust Congestion Control
Anup Agarwal, Carnegie Mellon University; Venkat Arun, University of Texas at Austin; Srinivasan Seshan, Carnegie Mellon University
Congestion control algorithms (CCAs) play a critical role in network bandwidth allocation. Recent work (from SIGCOMM 2022) showed that a large class of CCAs, including BBR, Copa, and Reno, starve flows in the presence of network jitter. Starvation occurs because CCAs coordinate fairness by encoding fair rates into congestion signals. For example, Reno's throughput scales as 1/√loss rate. Even a small amount of noise in these signals leads to large errors in inferring fair rates.
We present FRCC (Fair and Robust Congestion Controller), the first CCA that provably bounds unfairness (avoids starvation) even under network jitter. Our key insight is to encode only the flow count (or equivalently, the fair link fraction) into the congestion signals, and independently estimate the link capacity to calculate the fair rate. In this way, we bound jitter's impact on fairness. We implement FRCC in the Linux kernel and evaluate it in a variety of network conditions, including synthetic jitter, heterogeneous RTTs, and multi-bottleneck settings. FRCC closely matches the bounds predicted by our theoretical analysis, and consistently achieves fairness, even when state-of-the-art CCAs exhibit starvation.
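The sensitivity of Reno to noisy congestion signals mentioned in the abstract can be illustrated with a small calculation. This is a minimal sketch, not taken from the paper: it assumes the widely used steady-state model in which Reno's rate is (MSS/RTT)·√(3/2)/√p, and the function names, MSS, and RTT values are hypothetical choices for illustration only.

```python
import math

def reno_rate_bps(loss_rate, mss_bytes=1500, rtt_s=0.05):
    """Steady-state Reno throughput under the assumed 1/sqrt(p) model."""
    # sqrt(3/2) is the constant from the classic steady-state approximation.
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)

def rate_ratio(true_loss, noise):
    """Factor by which the rate drops when `noise` extra loss is observed."""
    return reno_rate_bps(true_loss) / reno_rate_bps(true_loss + noise)

if __name__ == "__main__":
    # At a very low true loss rate, a tiny absolute error in the observed
    # loss rate changes the inferred rate by a large factor.
    for noise in (3e-6, 1e-4):
        print(f"noise={noise:g}: rate drops by {rate_ratio(1e-6, noise):.1f}x")
```

Because the rate depends on the square root of the ratio of loss rates, the factor grows without bound as the true loss rate shrinks relative to the noise, which is one way to read the abstract's point about large errors in inferred fair rates.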
CCEval: Accurately and Confidently Evaluating Performance Metrics of Congestion Control Algorithms for Datacenter Networks
Tianfeng Liu, Kaihui Gao, and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Jin Guang and Xinyun Chen, The Chinese University of Hong Kong, Shenzhen; Vincent Liu, University of Pennsylvania; Zhiyong Chen and Yiwei Zhang, Tsinghua University; Ni Jin, Zhongguancun Laboratory and Beijing University of Posts and Telecommunications; Ran Zhang, Zhongguancun Laboratory
Congestion control in datacenter networks (DCNs) is a highly active research area. Typical CCA evaluation workflows contain three steps: generate experimental configurations, execute the experiments, and estimate performance metrics using results from multiple trials. However, due to variability brought by random traffic workloads and single-digit trial counts, common experimental methodologies fail to provide enough confidence to properly evaluate CCA performance.
We propose CCEval, an evaluation framework for accurately and confidently estimating performance metrics of CCAs in DCNs. The key idea is using confidence intervals and more trials to quantify and improve the accuracy and confidence of performance metrics. To this end, we propose a model-free estimation algorithm to calculate the confidence intervals and forecast the required trial count for a given accuracy, confidence level, metric, and CCA. We further design a model-based tail quantile estimation algorithm to reduce the needed trial counts significantly without losing accuracy and confidence. Extensive experiments on simulators and real-world testbeds with four CCAs on typical topologies and flow distributions show that CCEval can produce estimations of performance metrics accurately and confidently, with 1% relative margin of error and 95% confidence level, and can reduce trial counts by 75%~80% for tail quantile estimation.
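The idea of attaching a confidence interval to a metric and forecasting how many trials a target accuracy needs can be sketched as follows. This is not CCEval's algorithm: it swaps in a generic percentile bootstrap (one common model-free technique) and assumes the margin of error shrinks roughly as 1/√n. All function names and the sample values are hypothetical.

```python
import math
import random
import statistics

def bootstrap_ci(samples, stat=statistics.mean, level=0.95,
                 n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for stat(samples)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    estimates = sorted(
        stat(rng.choices(samples, k=len(samples))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + level) / 2 * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]

def relative_margin(ci, point_estimate):
    """Half-width of the interval relative to the point estimate."""
    lo, hi = ci
    return (hi - lo) / 2 / abs(point_estimate)

def trials_needed(current_trials, current_margin, target_margin):
    """Forecast trial count, assuming margin scales as 1/sqrt(trials)."""
    return math.ceil(current_trials * (current_margin / target_margin) ** 2)

if __name__ == "__main__":
    # Hypothetical per-trial results for one metric (e.g., mean FCT in ms).
    trials = [10.2, 9.8, 10.5, 9.9, 10.1, 10.4, 9.7, 10.0]
    ci = bootstrap_ci(trials)
    margin = relative_margin(ci, statistics.mean(trials))
    print(ci, margin, trials_needed(len(trials), margin, 0.01))
```

A tail metric such as a high quantile typically needs far more trials than a mean under this kind of model-free estimate, which is the cost the abstract's model-based tail quantile estimator is said to reduce.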
Matryoshka: Realizing Hyperscale Data Center Network Design for the AI Era
Yan Cai, Meta; Jialong Li, Max Planck Institute for Informatics; Kutalmis Akpinar, Tianxiang Li, Hany Morsy, Jason Wilson, and Sunil Khaunte, Meta; Yiting Xia, Max Planck Institute for Informatics; Ying Zhang, Meta
Over the past decade, data center networking (DCN) has undergone substantial transformation in terms of both scale and complexity. Developing a DCN entails multiple intricate steps, such as establishing physical connections, configuring logical network addressing, and defining high-level routing policies. While extensive work has focused on logical DCN design and physical deployment, a critical gap remains: materializing these designs into concrete switch configurations—a necessary step to realize the development procedure. This problem is especially acute in the AI era, as hyperscale, rapidly evolving, and highly heterogeneous AI-driven clusters place unprecedented demands on DCN design and implementation.
This paper presents Matryoshka, Meta’s production-scale DCN design system that bridges this gap. Matryoshka employs an intent-based, model-driven approach to systematically compile high-level DCN design intents into working switch configurations. Operational for over six years, Matryoshka has supported orders-of-magnitude growth in Meta’s DCN infrastructure, guiding the design of nearly 900 DCNs across 18 distinct types, including the latest 100K-GPU supercluster for AI training. We share our experience in building and operating Matryoshka, highlighting how it empowers the rapid design and evolution of AI clusters nowadays.
R-TCP: A Framework to Optimize TCP Performance Over Rate-Limiting Networks
Shengtong Zhu, The Chinese University of Hong Kong; Yan Liu and Lingfeng Guo, Independent Researcher; Jack Yiu Bun Lee, The Chinese University of Hong Kong
Many mobile operators provide subscription plans that include a data quota for full-speed access, beyond which the service will be throttled to a low data rate (rate-limited service). This is designed to control costs and to motivate users to upgrade. Our recent measurements in a country-scale service suggested that the proportion of TCP flows subjected to rate limiting can be as high as 28%. More importantly, TCP flows under rate limiting can exhibit excessive retransmission rates, exceeding 20% in many cases. The extra bandwidth costs incurred by the retransmissions for large service providers are very significant, not to mention bandwidth wastage. This work develops a novel R-TCP framework to mitigate the excessive retransmission problem in various TCP designs (e.g., Cubic and BBR) under rate-limiting networks. R-TCP is specifically designed and optimized for sender-side kernel implementation with minimal overheads. It has been implemented in Linux, where extensive experiments in real-world networks and applications show that it can substantially reduce excessive retransmissions by up to 88% with negligible tradeoff in goodput and application-layer performance.
Learning to Tune Optical WANs: A Field Deployment of Noise Models in Optical Networks
Bhaskar Kataria and Howard Hua, Cornell University; Andrea D Amico, NEC Labs; Bill Owens, NYSERNet; Rachee Singh, Cornell University
Accurately modeling optical signal transmission is critical for optimizing network performance, particularly in large-scale fiber optic networks operated by Internet Service Providers. In this work, we develop a Gaussian Noise model for a New York state ISP's optical backbone. Our model accounts for all major network components, including amplifiers, fiber spans, reconfigurable optical add-drop multiplexers, and transceivers. By accurately predicting end-to-end signal-to-noise ratio, our model provides a foundation for network performance analysis and optimization.
Then, we leverage hyperparameter search techniques—commonly used in machine learning—to identify amplifier gain settings that improve signal quality. By treating the model as an opaque box, we systematically search for amplifier configurations that maximize the predicted end-to-end SNR while maintaining practical network constraints. We validate our approach through a field deployment by applying optimized amplifier gain settings in a live ISP network. Our results show a significant improvement in optical signal quality, achieving a 2 dB increase in SNR on a single wavelength.
FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline
Jingwei Xu, Shanghai Jiao Tong University and Huawei Technologies; Junbin Kang, Huawei Technologies; Mingkai Dong, Shanghai Jiao Tong University; Mingyu Liu, Lu Zhang, Shaohong Guo, and Ziyan Qiu, Huawei Technologies; Mingzhen You and Ziyi Tian, Shanghai Jiao Tong University; Anqi Yu, Tianhong Ding, and Xinwei Hu, Huawei Technologies; Haibo Chen, Shanghai Jiao Tong University and Huawei Technologies
Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources in the deep learning pipelines. We thus propose FalconFS, a DFS optimized for deep learning pipelines with the stateless-client architecture. Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72× throughput for small file read/write and up to 12.81× throughput for deep learning model training. FalconFS has been running in Huawei autonomous driving system's production environment with 10,000 NPUs for one year and has been open-sourced.
MirrorNet: High-fidelity and Scalable Network Emulation for Software-defined WAN
Congcong Miao, Tencent; Yuejie Wang, Peking University; Jianming Wang, Xuefeng Ji, Guozhi Shan, Sirui Li, Pan Fang, and Yanke Zhang, Tencent; Jialin Li, National University of Singapore; Xianneng Zou, Tencent; Guyue Liu, Peking University
Operating a large-scale WAN reliably is becoming increasingly challenging due to the surge in traffic volumes and the growing complexity of both software and hardware. In this paper, we introduce MirrorNet, our production-grade emulation framework designed to mirror a software-based WAN. Unlike traditional emulators and simulators that access only a partial set of network information, MirrorNet functions as a comprehensive twin of the production network, encompassing the controller, data plane, and network traffic. Our key challenge lies in striking a balance between the requirements for a fine-grained and high-fidelity emulation, scalability, and resource efficiency. To address these, we have developed a multi-faceted approach: i) we employ an incremental storage and replay method to reconstruct the historical production network at a second-by-second level; ii) we propose a network update strategy that maintains consistent alignment between the emulation and production networks; and iii) we design a custom orchestrator capable of rapidly deploying one or more large-scale emulation networks, which can operate concurrently to expedite testing. MirrorNet has been deployed in TWAN for over 2 years and is integral to our daily WAN management tasks, aiding in troubleshooting, parameter tuning, testing, and capacity assessment.
CascadeNet: Generating Network Traffic with High-Fidelity Temporal Patterns
Runwei Lu, Yanran Deng, Ruixuan Li, and Jinting Liu, New York University Shanghai; Yuejie Wang, Peking University; Xinyu Li, Carnegie Mellon University; Deming Xu, New York University Shanghai; Han Tian, University of Science and Technology of China; Kai Chen, Hong Kong University of Science and Technology; Guyue Liu, Peking University
Facing the challenge of limited network trace access, the exploration of synthetic trace generation has become crucial for research. Although current methods manage to replicate the statistical characteristics of network traffic accurately, they fail to capture the temporal dynamics of network activities. This gap stems primarily from their approach to data representation. To address this issue, we propose a novel representation of network traces by aggregating network flows into time series. Built upon this data representation, we propose CascadeNet, an end-to-end framework embedded with CascadeGAN, a hierarchical generative model, to generate network traffic with high-fidelity temporal patterns while learning complex flow structures and dependencies. We also develop several techniques to facilitate the transformation from aggregated time series to timestamps. Our evaluations across four diverse IPv4 header traces show (1) CascadeNet surpasses baselines by 41%~76% on temporal distance metrics; (2) CascadeNet outperforms baselines in downstream tasks; (3) it offers remarkable scalability, reducing training time by 7.3×~25× compared to the state-of-the-art method.
FENIX: Enabling In-Network DNN Inference with FPGA-Enhanced Programmable Switches
Xiangyu Gao, Tsinghua University; Tong Li, Renmin University of China; Yinchao Zhang, Tsinghua University; Ziqiang Wang, Southeast University and Tsinghua University; Xiangsheng Zeng, Huazhong University of Science and Technology; Su Yao, Tsinghua University and BNRist; Ke Xu, Tsinghua University and Zhongguancun Laboratory
Machine learning (ML) is increasingly used in network data planes for advanced traffic analysis, but existing solutions (such as FlowLens, N3IC, BoS) still struggle to simultaneously achieve low latency, high throughput, and high accuracy. To address these challenges, we present FENIX, a hybrid in-network ML system that performs feature extraction on programmable switch ASICs and deep neural network inference on FPGAs. FENIX introduces a Data Engine that leverages a probabilistic token bucket algorithm to control the sending rate of feature streams, effectively addressing the throughput gap between programmable switch ASICs and FPGAs. In addition, FENIX designs a Model Engine to enable high-accuracy deep neural network inference in the network, overcoming the difficulty of deploying complex models on resource-constrained switch chips. We implement FENIX on a programmable switch platform that integrates a Tofino ASIC and a ZU19EG FPGA directly, and evaluate it on real-world network traffic datasets. Our results show that FENIX achieves microsecond-level inference latency and multi-terabit throughput with low hardware overhead, and delivers over 90% accuracy on mainstream network traffic classification tasks, outperforming the state of the art.
Bridging Storage and Execution: A Semantic Virtual Bus for On-Demand Application Streaming
Jun Lu, Central South University; Jialin Li, National University of Singapore; Yaoxue Zhang and Ju Ren, Tsinghua University
Traditional application delivery requires full local installation, incurring persistent security risks from outdated versions and significant download delays. Despite advances in network throughput and latency, existing dynamic loading solutions such as Web applications and network filesystems like NFS suffer from performance degradation, functionality limitations, and intrusive application modifications. We introduce STREAMBUS, a transparent application streaming system that redefines the network as a semantic-aware virtual storage bus beneath the file system layer. Supporting deployment across diverse environments, including WiFi-dependent mobile devices, it addresses two key challenges: maintaining microsecond-level latency comparable to local storage and bridging the semantic gap between stateless remote storage and stateful execution. To achieve this, STREAMBUS combines a dual-mode transmission mechanism that synchronously serves requested blocks and asynchronously prefetches predicted blocks, with a thread-aware Markov-chain model that captures fine-grained access patterns. Evaluation shows STREAMBUS delivers near-native performance across diverse networks. On desktops, it achieves 15–40% better per-page access latency than local NVMe in common cases. On mobile devices, it typically sustains startup overheads below 40% relative to local storage, even over variable Wi-Fi connectivity. Robustness experiments demonstrate stable performance under emulated network conditions with realistic delay patterns, supporting intra-city deployments.
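A thread-aware Markov-chain predictor of the kind the abstract mentions could look roughly like the sketch below. The abstract does not describe the model's details, so this assumes a first-order chain kept separately per thread, where the prefetch candidate is the block that most often followed the current one. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict

class ThreadAwareMarkovPrefetcher:
    """First-order Markov chain over block IDs, one chain per thread."""

    def __init__(self):
        # transitions[thread_id][block] counts the blocks seen right after it.
        self.transitions = defaultdict(lambda: defaultdict(Counter))
        self.last_block = {}

    def record(self, thread_id, block):
        """Update the chain with one observed block access."""
        prev = self.last_block.get(thread_id)
        if prev is not None:
            self.transitions[thread_id][prev][block] += 1
        self.last_block[thread_id] = block

    def predict(self, thread_id, block, top_k=1):
        """Return up to top_k most likely next blocks (prefetch candidates)."""
        successors = self.transitions.get(thread_id, {}).get(block, Counter())
        return [b for b, _ in successors.most_common(top_k)]

if __name__ == "__main__":
    p = ThreadAwareMarkovPrefetcher()
    for blk in (10, 11, 12, 10, 11):
        p.record(1, blk)           # thread 1 reads blocks sequentially
    for blk in (10, 99):
        p.record(2, blk)           # thread 2 jumps from block 10 to block 99
    print(p.predict(1, 10), p.predict(2, 10))
```

Keeping one chain per thread means interleaved accesses from different threads do not pollute each other's transition counts, which is the intuition behind making the model thread-aware.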
FastServe: Iteration-Level Preemptive Scheduling for Large Language Model Inference
Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin, School of Computer Science, Peking University
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency.
We present FastServe, a distributed LLM serving system which exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. Queues with higher priority than the one the job joins are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. Evaluation shows that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 6.1×.
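The skip-join behavior described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not FastServe's scheduler: it assumes queue quanta that double per level and a first-iteration time that is linear in input length, and every name and constant below is hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    input_len: int   # prompt length in tokens, known when the job arrives
    level: int = 0   # current queue index (0 is the highest priority)

class SkipJoinMLFQ:
    def __init__(self, num_queues=4, base_quantum=1.0, time_per_token=0.01):
        # Assumed scheme: queue i has a quantum of base_quantum * 2**i.
        self.quanta = [base_quantum * 2 ** i for i in range(num_queues)]
        self.queues = [deque() for _ in range(num_queues)]
        self.time_per_token = time_per_token  # assumed linear cost model

    def initial_queue(self, input_len):
        """Skip every queue whose quantum cannot cover the first iteration."""
        first_iter_time = input_len * self.time_per_token
        for level, quantum in enumerate(self.quanta):
            if first_iter_time <= quantum:
                return level
        return len(self.quanta) - 1

    def submit(self, job):
        job.level = self.initial_queue(job.input_len)
        self.queues[job.level].append(job)

    def next_job(self):
        """Pick the oldest job from the highest-priority non-empty queue."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

    def demote(self, job):
        """Called when a job exhausts its quantum without finishing."""
        job.level = min(job.level + 1, len(self.queues) - 1)
        self.queues[job.level].append(job)
```

In a plain MLFQ a long-prompt job would enter the top queue, exhaust its quantum immediately, and be demoted step by step; joining the matching queue directly avoids those wasted demotions while short jobs still get served first.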
Managing Congestion Control Heterogeneity on the Internet with Approximate Performance Isolation
Ayush Mishra, ETH Zurich; Archit Bhatnagar, University of Michigan; Yixuan Zhang, Tsinghua University; Ben Leong, National University of Singapore; Ya Gao, Wuxi Institute of Technology; Raj Joshi, Harvard University
The Internet hosts a diverse mix of congestion control algorithms (CCAs) optimized for specific throughput-delay tradeoffs. However, traditional queuing disciplines and AQMs struggle to manage this heterogeneity and often lead to unfairness and suboptimal performance. In this paper, we explore isolation techniques that can allow competing CCAs to make their desired throughput-delay trade-offs independent of who they compete with. More specifically, we motivate Approximate Performance Isolation between competing flows by grouping flows with similar desired throughput-delay trade-offs in the same queue. We also present Santa, a new practical and scalable multi-queue AQM built on the principles of approximate performance isolation. Santa infers each flow’s throughput-delay preferences by comparing their buffer occupancy, and shuffles aggressive ("naughty") and passive ("nice") flows into appropriate queues over time. We prototype Santa on a programmable switch to demonstrate that it is practical, scalable, and can approximate the isolation benefits of Fair Queuing (FQ) with a handful of queues.
Checkmate: Zero Performance Overhead Model Checkpointing via Network Gradient Replication
Ankit Bhardwaj, Tufts University; Weiyang Wang, Jeremy Carin, Adam Belay, and Manya Ghobadi, Massachusetts Institute of Technology
This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate location, allowing the state to be restored in the event of failure. This approach fundamentally has a tradeoff between the frequency of checkpoints and the cost of a failure. We avoid this tradeoff; our key insight is that in data-parallel training, all information necessary to create a checkpoint already exists in the network as gradients. Our core contribution is a new multicast abstraction that simultaneously delivers gradients to a separate CPU-based shadow cluster. The shadow maintains a checkpoint by applying those gradients to a copy of the model. Our evaluation shows that Checkmate performs per-iteration checkpointing with training throughput comparable to an ideal no-checkpoint baseline. Checkmate achieves 5 to 34.5× more frequent checkpointing compared to state-of-the-art checkpointing systems, resulting in 80% to 97.1% reduction in repeated work per failure. At the same checkpointing frequency, Checkmate delivers 1.3× to 6.5× throughput compared to other systems.
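The key insight, that a replica fed the same gradients stays in step with the training model, can be illustrated with a toy example. This is a minimal sketch and not Checkmate's implementation: it assumes plain SGD with a fixed learning rate and no optimizer state, models the multicast as simply handing the same gradient list to both copies, and uses hypothetical function names.

```python
def sgd_step(weights, grads, lr):
    """One plain SGD update (assumed optimizer; no momentum or other state)."""
    return [w - lr * g for w, g in zip(weights, grads)]

def train_with_shadow(initial_weights, gradient_stream, lr=0.1):
    """Apply each iteration's gradients to both the trainer and the shadow."""
    trainer = list(initial_weights)
    shadow = list(initial_weights)   # the shadow starts from the same model
    for grads in gradient_stream:
        trainer = sgd_step(trainer, grads, lr)
        # Stand-in for the multicast: the shadow sees the same gradients.
        shadow = sgd_step(shadow, grads, lr)
    return trainer, shadow

if __name__ == "__main__":
    trainer, shadow = train_with_shadow(
        [1.0, 2.0], [[0.5, -1.0], [0.25, 0.0]]
    )
    # The shadow copy is an up-to-date checkpoint after every iteration.
    print(trainer == shadow, shadow)
```

Since the shadow performs the same arithmetic on the same inputs, its copy matches the trainer after every step, so no pause is needed to copy model state out of the training cluster.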
Diagnosing and Repairing Distributed Routing Configurations Using Selective Symbolic Simulation
Rulan Yang, Gao Han, Hanyang Shao, Xiaoqiang Zheng, Xing Fang, Ziyi Wang, and Lizhao You, Xiamen University; Ruiting Zhou, Southeast University; Linghe Kong, Shanghai Jiao Tong University; Ennan Zhai, Alibaba Cloud; Qiao Xiang and Jiwu Shu, Xiamen University
Although substantial progress has been made in automatically verifying whether distributed routing configurations comply with certain intents, diagnosing and repairing configuration errors remains manual and time-consuming. To fill this gap, we propose S2Sim, a novel system for automatic routing configuration diagnosis and repair. Our key insight is that by deriving a set of contracts that guarantees an intent-compliant variant of the erroneous configuration, we can systematically check for all contract violations in the configuration via symbolic simulation to pinpoint and repair the errors. S2Sim also introduces a series of extensions to support complex configurations (e.g., ACL, route aggregation and multi-path routing), networks (e.g., underlay and overlay networks), and intents (e.g., k-link failure tolerance). We fully implement S2Sim and evaluate its performance using real configurations from two major providers and synthesized configurations composed from their real errors and real-world topologies with different scales O(10) to O(1000). Results show that S2Sim accurately and efficiently diagnoses and repairs real configuration errors (i.e., up to 20 seconds in real networks of O(100) nodes and up to 15 minutes in synthesized networks of O(1000) nodes).
QCON: Seamless QoE-Aware 5G Streaming via Multi-Connectivity
Goodsol Lee, Seoul National University; Junhong Min, University of Colorado Boulder; Seyeon Kim, Korea University; Juheon Yi, Seoul National University; Kwang Taik Kim and Mung Chiang, Purdue University; Sangtae Ha, University of Colorado Boulder; Kyunghan Lee and Saewoong Bahk, Seoul National University
Mobile real-time video streaming (RTS) applications—cloud gaming and AR/VR—require consistent high throughput and low latency to satisfy user Quality of Experience (QoE), yet today’s wireless links fluctuate wildly. While multi-path solutions seem promising to tackle such single-link fluctuations, existing transport-level solutions require multiple cellular subscriptions, which most users don’t have. In this paper, we leverage 5G multi-connectivity, which allows simultaneous connection to multiple base stations (e.g., 5G and 4G) and is already deployed in commercial networks. However, our measurements show RTS applications still suffer from single-link fluctuations due to operators’ deliberate policies restricting multi-connectivity to conserve 4G backup links regardless of application demands. To optimize application QoE while respecting operator policies, we present QCON, a QoE-driven multi-connectivity solution that efficiently utilizes backup links based on precise application QoE. For practical deployment, we design QoE Monitor to infer application QoE within the RAN and develop multi-link scheduling to optimize both QoE and radio resource efficiency. We also design priority-based re-injection utilizing RAN link recovery mechanism to prevent video stalls. Our prototype implementation of QCON on a RAN intelligent controller within an Open-RAN testbed demonstrates 2.1× improvements of bitrates, enhancing tail frame rates by 4-5× with efficient backup link use compared to existing multi-link scheduling schemes.
Queue-Mem: Energy-Efficient Hardware Storage for Advanced Network Function Acceleration
Mariano Scazzariello, RISE Research Institutes of Sweden; Tommaso Caiazzi, Roma Tre University; Hamid Ghasemirahni, Dejan Kostić, and Marco Chiesa, KTH Royal Institute of Technology
General-purpose CPU servers have been widely used to deploy Network Functions (NFs) thanks to their high flexibility and simplicity of deployment. Due to their high energy consumption, best practices advocate for only processing packet headers on CPU cores while temporarily storing the corresponding packet payloads on either network interface cards or external RDMA-enabled memory.
We show that the seemingly minor decision of where to store a packet payload greatly impacts overall energy consumption in state-of-the-art NF systems operating at terabit-per-second speeds. In fact, we show that if one could ideally store packet payloads on today's hardware switches, while processing headers externally, one could reduce energy use by 1.8× to 10.9× compared to current practices.
In this paper, we introduce Queue-Mem, a general-purpose, energy-efficient storage solution to enhance NF deployment that is amenable to implementation with various existing hardware switches. Building Queue-Mem involves addressing significant challenges associated with payload storage, as hardware switches lack such functionality. By carefully exploiting the buffer queues of existing switches, we are the first to build and showcase a robust, energy-efficient packet processing pipeline capable of handling terabit-per-second speeds and supporting advanced per-flow network functions, all while using just a single commodity server connected to an ASIC switch.
Decoding RSSI Compression in RFID: Dynamic RCS Modeling and Tag-Intrinsic Power Metrics for Reliable Backscatter Networks
Jia Liu, Yifei Ma, Xingyu Chen, and Haipeng Dai, Nanjing University; He Huang, Soochow University; Zihao Lin and Wei Zheng, Nanjing University; Junzhao Du, Xidian University; Guihai Chen, Nanjing University
Radio Frequency Identification (RFID) is a foundational element of modern IoT and backscatter networks, powering inventory, localization, and battery-free sensing at scale. In this paper, we uncover RSSI compression, a power-dependent bias in reader-measured RSSI, as a critical physical-layer problem that propagates upward in the network stack, degrading MAC-layer collision resolution, network-layer link estimation, and application-layer reliability. Through carefully designed experiments, we trace this distortion to dynamic tag Radar Cross Section (RCS) behavior and introduce two novel physical-layer metrics: Interrogation Threshold Power (ITP), a channel-specific metric for accurate link-quality estimation, and Backscatter Power Index (BPI), a tag-intrinsic, environment-agnostic signature. These metrics provide high-fidelity signal information that higher layers can directly exploit for more robust collision detection, power control, localization and sensing tasks. Finally, an in-situ single-query method further reduces measurement overhead by 99.8%, while cutting channel-estimation error by 64.7%, delivering significant cross-layer performance gains in real-world backscatter networks.
FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees
Gabriele Oliaro, Carnegie Mellon University; Xupeng Miao, Purdue University; Xinhao Cheng, Carnegie Mellon University; Vineeth Kada, Anthropic PBC; Mengdi Wu, Ruohan Gao, and Yingyi Huang, Carnegie Mellon University; Remi Delacourt, Mistral AI; April Yang, Carnegie Mellon University; Yingcheng Wang, Purdue University; Colin Unger, Stanford University; Zhihao Jia, Carnegie Mellon University and Amazon Web Services
Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters, wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations (dependent parallelization and graph pruning) significantly shrink activation memory, leading to end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by 1.9-4.8× under heavy inference workloads and 2.5-6.8× under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management
Jiawei Tyler Gu, Zhen Tang, Yiming Su, Bogdan Alexandru Stoica, Xudong Sun, and William X. Zheng, University of Illinois Urbana-Champaign; Yue Zhang and Akond Rahman, Auburn University; Chen Wang, IBM Research; Tianyin Xu, University of Illinois Urbana-Champaign
Modern cloud applications are increasingly managed by software programs, often named “operators,” which automate laborious, human-based operations. While operator programs largely prevent human mistakes, their own reliability has unprecedented impact on managed applications. This paper discusses the emerging challenges of operator program reliability on cloud-native platforms like Kubernetes. Our work is grounded in a rigorous analysis of 412 real-world failures of thirteen Kubernetes operators. We find that challenges of operator reliability come from the multifold complexity of an operator’s interactions with its managed applications, environment, and user interface. Among these, operators’ interactions with managed applications are the largest contributor to real-world operator failures, but they are largely overlooked: these interactions are often ad hoc and lack well-defined interfaces. We advocate rethinking the management interface of cloud applications and demonstrate this urgent need by showing the prevalence of defects in existing operators. Specifically, we develop a simple testing tool to exercise interactions between operators and the managed cloud applications, which discovered 86 new bugs in six popular Kubernetes operators.
Feedback-guided Adaptive Testing of Distributed Systems Designs
Ao Li, Carnegie Mellon University; Ankush Desai, Amazon Web Services; Rohan Padhye, Carnegie Mellon University
Validating distributed systems for correctness poses significant challenges. Practitioners often rely on formal models of core system designs, which are then tested by exploring possible component interactions. Unfortunately, standard testing approaches based on random sampling of the state space are inefficient and prone to missing subtle bugs, as they lack guidance from the system's behavior.
To address this, we present Fest, a new testing system for formal models of distributed systems. Fest incorporates feedback-guided adaptive schedule generation, drawing inspiration from grey-box fuzzing, to steer exploration towards maximizing behavioral coverage and uncovering bugs more effectively. Our implementation in the P programming framework demonstrates significant improvements across 94 distributed system model configurations: up to 41× (1.5× average) improvement in behavioral coverage, 278× (15× average) improvement in scenario coverage, and 33% more bugs detected compared to existing methods. These results highlight Fest's effectiveness in ensuring the robustness of distributed systems through improved testing efficiency.
Observability Is Eating Your Cores: Fine-Grained Analysis of Microservice Metrics with IPU-Hosted Sketches
Alessandro Cornacchia, King Abdullah University of Science and Technology; Theophilus A. Benson, Carnegie Mellon University; Muhammad Bilal and Marco Canini, King Abdullah University of Science and Technology
Observability has become mission-critical for troubleshooting cloud-native technology. However, today's observability fails to meet the demands of cloud-native environments, either resulting in crippling complexity and high costs for collecting and storing huge data volumes, or sacrificing event coverage by sampling at coarse time granularity. We present μView, which stands out from conventional cloud monitors by incorporating a lightweight observability data-plane on Infrastructure Processing Units (IPUs). Our novel architecture leverages the proximity of IPUs to the monitored services to tackle observability bloat. Crucially, μView's data-plane applies streaming data sketching techniques to continuously process and analyze microservice metrics at fine time resolution, without hurting application performance. We show for several use cases that, by anticipating SLO violations, μView can help (i) narrow the focus on informative observability data, and (ii) trigger useful signals about service performance, thus enabling timely proactive actions. Our code and artifacts are available at: https://github.com/sands-lab/uview.
Attack of the Bubbles: Straggler-Resilient Pipeline Parallelism for Large Model Training
Tianyuan Wu, Lunxi Cao, Hanfeng Lu, and Xiaoxiao Jiang, Hong Kong University of Science and Technology; Yinghao Yu, Siran Yang, Guodong Yang, Jiamang Wang, Lin Qu, and Liping Zhang, Alibaba Group; Wei Wang, Hong Kong University of Science and Technology
Awarded Outstanding Paper!
Training large Deep Neural Network (DNN) models at scale often encounters straggler issues, mostly in communications due to network congestion, RNIC/switch defects, or topological asymmetry. Under advanced pipeline parallelism, even minor communication delays can induce significant training slowdowns. This occurs because (1) slow communication disrupts the pipeline schedule, creating cascading “bubbles” in a domino effect, and (2) current GPU kernel scheduling is susceptible to head-of-line blocking, where slow communication blocks subsequent computations, further adding to these bubbles. To address these challenges, we present PIPEMORPH, a straggler-resilient training system with two key optimizations. First, it optimally adapts the pipeline schedule in the presence of stragglers to absorb communication delays without inducing cascading bubbles, using a simple yet effective algorithm guided by an analytical model. Second, upon detecting slow communication, PIPEMORPH offloads communication operations from GPU to host memory and utilizes CPU-side RDMA for data transfer. This eliminates head-of-line blocking as subsequent computation kernels can be scheduled immediately on GPUs. Together, these optimizations effectively reduce pipeline stalls in the presence of communication stragglers, improving the training iteration time by 1.2-3.5× in our experiments under various settings.
Law: Towards Consistent Low Latency in 802.11 Home Networks
Yibin Shen and Zili Meng, Hong Kong University of Science and Technology
Wireless ultra-low-latency video streaming over 802.11 Wi-Fi networks is increasingly popular, but the latency on the Wi-Fi link is always fluctuating. With the development of CDNs and edge servers, the fluctuation of the wireless last hop increasingly dominates the fluctuation of the end-to-end latency. In this paper, we investigate, from a systematic perspective, why the existing Wi-Fi link layer exhibits fluctuating latency. We find that the hierarchical queueing structure, queue-agnostic rate adaptation, and delay-insensitive retry management of the existing link layer design are the main causes of latency spikes when the channel fluctuates. Thus, we propose LAtency-bounded Wi-Fi (Law), an 802.11 link layer architecture to provide a consistent low latency for the application. Law exploits the loss tolerance of the upper-layer video streaming application and largely avoids the latency spikes caused by blockage in the link layer, at the cost of an acceptable amount of additional packet loss. Law maintains a high goodput by carefully redesigning the queueing structure and introducing fine-grained control for each transmission opportunity. We implement the prototype of Law on OpenWiFi and test it with WebRTC: both the tail frame latency and stall rate can be significantly reduced over existing baselines.
eXpressSFU: Toward Super-Scalable Video Conferencing with SmartNICs
Tuan Tran and S. M. H. Hosseini, University of Colorado Boulder; Seyeon Kim, Korea University; Kyunghan Lee, Seoul National University; Nam Bui, University of Colorado Denver; Dirk Grunwald and Sangtae Ha, University of Colorado Boulder
Video conferencing has emerged as a critical Internet application. Unlike video-on-demand services, high-quality video conferencing necessitates minimal latency, as media streams are generated, transmitted, processed, forwarded and received in real time. Our empirical analysis reveals that processing latency at the media server, particularly Selective Forwarding Units (SFUs), is the dominant barrier to scalability. Notably, cryptography, memory, and I/O operations account for approximately 79% of the media packet processing latency.
In this paper, we introduce eXpressSFU, a re-architected video conferencing system designed to significantly enhance scalability for large-scale and high-quality video conferences. By decoupling the fast-control and data planes from the slow-control plane, eXpressSFU accelerates the media packet processing pipeline through the use of emerging network technologies, such as Smart Network Interface Cards (SmartNICs). Experimental results show that our system reduces packet processing latency by a factor of 8. This improvement allows it to support 3× more concurrent users while cutting computational power consumption by up to 60%.
AVA: Towards Agentic Video Analytics with Vision Language Models
Yuxuan Yan, Zhejiang University; Shiqi Jiang, Microsoft Research; Ting Cao, Tsinghua University; Yifan Yang, Microsoft Research; Qianqian Yang and Yuanchao Shu, Zhejiang University; Yuqing Yang and Lili Qiu, Microsoft Research
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.
The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
SLATE: Service Layer Traffic Engineering
Gangmuk Lim, University of Illinois Urbana–Champaign; Aditya Prerepa, University of Illinois Urbana–Champaign and xAI; Brighten Godfrey and Radhika Mittal, University of Illinois Urbana–Champaign
In microservice-based applications, requests flow between many microservice instances across potentially multiple geodistributed clusters. Today, the routing of requests is limited to simple load balancing, or extensions that spill requests to nearby clusters. We argue that the problem is more subtle, and that there are significant opportunities for improvement by viewing microservice request routing as a global traffic engineering problem. We present Service Layer Traffic Engineering (SLATE), a system that optimizes request routing in microservice deployments that span multiple clusters to minimize average latency and bandwidth cost. SLATE tackles challenges unique to the service layer, including multiple request traffic classes, multi-hop call trees, and service latency profiles. To achieve this, SLATE takes a unique hybrid approach combining global optimization and local exploration. SLATE outperforms state-of-the-art global load balancing by up to 18.3× in average latency and reduces egress bandwidth cost by up to 2.64× in a Kubernetes deployment of an open-source benchmark application, and shows resilient performance against dynamic changes due to its hybrid optimization approach. Our system is completely transparent to the application and can be seamlessly plugged into existing L7 proxy deployments, specifically Envoy.
Predict, Prune, Play: Efficient Video Playback Optimization Under Device Diversity and Drift
Harsha Sharma, Massachusetts Institute of Technology and Amazon; Pouya Hamadanian and Arash Nasr-Esfahany, Massachusetts Institute of Technology; Zahaib Akhtar, Amazon and North Carolina State University; Mohammad Alizadeh, Massachusetts Institute of Technology
Video-streaming platforms tune dozens of playback parameters across thousands of client devices. Our measurements from Amazon Prime Video show that device-specific tuning can enhance stream quality. Yet traditional tuning techniques like Bayesian optimization become prohibitively expensive due to the large configuration space and the constant emergence of new device types.
We introduce AZEEM, a scalable recommendation system leveraging few-shot prediction to rapidly identify promising configurations for new devices. The key insight behind AZEEM is that devices exhibit performance similarities that enable predictions from limited observations. Trained on offline data of device-playback configuration interactions, AZEEM efficiently narrows down the search space to a small set of configurations likely to contain optimal or near-optimal candidates. Additionally, AZEEM addresses temporal distribution shift—where the best-performing configurations change over time—by recommending a small, robust set of candidates rather than a single configuration. Evaluations using large-scale real-world datasets show that AZEEM reduces exploration cost by 5.8−13.6× and improves stream quality compared to state-of-the-art Bayesian optimization and multi-armed bandit approaches, enabling effective device-specific optimization at scale. We deploy AZEEM on a subset of Amazon Prime Video’s production traffic, where it achieved a relative QoE improvement of 2.7% on average and 10.6% at the 90th percentile over an existing treatment tuning system.
OneSidedMW: Managing Disaggregated Memory Efficiently, Flexibly, and Securely with RNIC Offloading
Zixuan Wang, Jinyu Gu, Xingda Wei, and Yubin Xia, Shanghai Jiao Tong University
RDMA-based memory disaggregation is gaining popularity in modern datacenters to improve memory efficiency. However, existing memory management approaches for the disaggregated memory (DM) architecture face a critical tradeoff: they either suffer from poor memory utilization due to coarse-grained allocation, or encounter significant challenges in terms of performance, memory node CPU overhead, security vulnerabilities, and limited flexibility.
In this paper, we present OneSidedMW, a novel system that combines two advanced RDMA features—RDMA NIC (RNIC) offloading and memory windows—to provide fine-grained and highly efficient one-sided memory management primitives for DM. Specifically, it leverages RNIC offloading to perform MW binding and unbinding operations, achieving remote memory allocation and deallocation without involving the memory node’s CPU. To demonstrate the efficiency of OneSidedMW, we evaluate it over two representative DM systems: swap-based systems and disaggregated key-value stores. OneSidedMW achieves up to 10.6× better performance in disaggregated key-value stores and up to 32.3% performance improvement in swap-based systems, compared with the state-of-the-art approaches.
Remembrall: Leaning into Memory for Accurate Video Analytics on System-on-Chip GPUs
Murali Ramanujam, Yinwei Dai, Kyle Jamieson, and Ravi Netravali, Princeton University
Continually retraining models has emerged as a primary technique to enable high-accuracy video analytics on edge devices. Yet, existing systems employ such adaptation by relying on the spare compute resources that traditional (memory-constrained) edge servers afford. In contrast, mobile edge devices such as drones and dashcams offer a fundamentally different resource profile: weak(er) compute with abundant unified memory pools. We present Remembrall, a continuous learning system for the mobile edge's System-on-Chip GPUs. Our driving insight is that visually distinct scenes that require retraining exhibit substantial overlap in model embeddings; if captured into a base model on device memory, specializing to each new scene can become lightweight, requiring very few samples. To practically realize this approach, Remembrall presents new, compute-efficient techniques to (1) select high-utility data samples for retraining specialized models, (2) update the base model without complete retraining, and (3) time-share compute resources between retraining and live inference for maximal accuracy. Across diverse workloads, Remembrall lowers retraining costs by 2.8-10× compared to existing systems, resulting in 18-45% higher accuracies.
A Fast Solver-Free Algorithm for Traffic Engineering in Large-Scale Data Center Network
Yingming Mao, Xi'an Jiaotong University and Shanghai Innovation Institute; Qiaozhu Zhai, Xi'an Jiaotong University; Ximeng Liu, Shanghai Jiao Tong University; Zhen Yao and Xia Zhu, Huawei; Yuzhou Zhou, Xi'an Jiaotong University
Rapid growth of data center networks (DCNs) poses significant challenges for large-scale traffic engineering (TE). Existing acceleration strategies, which rely on commercial solvers or deep learning, face scalability issues and struggle with degrading performance or long computational time.
Unlike existing algorithms adopting parallel strategies, we propose Sequential Source-Destination Optimization (SSDO), a sequential solver-free algorithm for intra-DCN TE. SSDO decomposes the problem into subproblems, each focused on adjusting the split ratios for a specific source-destination (SD) demand while keeping others fixed. To enhance the efficiency of subproblem optimization, we design a Balanced Binary Search Method (BBSM), which identifies the most balanced split ratios among multiple solutions that minimize Maximum Link Utilization (MLU). SSDO dynamically updates the sequence of SDs based on real-time utilization, which accelerates convergence and enhances solution quality.
We evaluate SSDO primarily on Meta DCNs, and additionally on two WAN topologies as auxiliary demonstrations of generality. In a Meta topology, SSDO achieves a 65% and 60% reduction in normalized MLU compared to TEAL and POP, two state-of-the-art TE acceleration methods, while delivering a 12× speedup over POP. These results demonstrate the superior performance of SSDO in large-scale TE.
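The per-demand subproblem described above can be illustrated with a small sketch. It assumes a hypothetical three-node topology, and it swaps in a plain grid search for the paper's Balanced Binary Search Method, which the abstract does not specify in detail. The function names `mlu` and `resplit_one_sd` are illustrative, not taken from SSDO.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]
Path = List[Link]
SD = Tuple[str, str]


def mlu(capacity: Dict[Link, float], demands: Dict[SD, float],
        paths: Dict[SD, List[Path]], ratios: Dict[SD, List[float]]) -> float:
    """Maximum Link Utilization for the given per-demand split ratios."""
    loads = {link: 0.0 for link in capacity}
    for sd, volume in demands.items():
        for path, ratio in zip(paths[sd], ratios[sd]):
            for link in path:
                loads[link] += volume * ratio
    return max(loads[link] / capacity[link] for link in capacity)


def resplit_one_sd(sd: SD, capacity, demands, paths, ratios, steps: int = 30):
    """Re-split ONE two-path demand while all other demands stay fixed.

    Assumption: a coarse grid search stands in for the paper's BBSM.
    """
    best_ratios = list(ratios[sd])
    best_mlu = mlu(capacity, demands, paths, ratios)
    for i in range(steps + 1):
        x = i / steps
        trial = {**ratios, sd: [x, 1.0 - x]}
        trial_mlu = mlu(capacity, demands, paths, trial)
        if trial_mlu < best_mlu - 1e-12:
            best_ratios, best_mlu = trial[sd], trial_mlu
    return best_ratios, best_mlu


# Hypothetical topology: a direct A->B link and a detour via C.
capacity = {("A", "B"): 10.0, ("A", "C"): 10.0, ("C", "B"): 10.0}
demands = {("A", "B"): 12.0, ("A", "C"): 4.0}
paths = {
    ("A", "B"): [[("A", "B")], [("A", "C"), ("C", "B")]],
    ("A", "C"): [[("A", "C")]],
}
ratios = {("A", "B"): [1.0, 0.0], ("A", "C"): [1.0]}

before = mlu(capacity, demands, paths, ratios)
new_ratios, after = resplit_one_sd(("A", "B"), capacity, demands, paths, ratios)
print(before, new_ratios, after)
```

A full sequential scheme would repeat this step over the demands in some order until the MLU stops improving. How SSDO orders the demands and selects among equally good splits is described in the paper, not here.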
Net-P4ct: Enhanced WAN Bandwidth Fair Sharing Using P4 Programmable Switches
Haoran Chen and Mingwei Cui, ByteDance; Yihan Zou, Yihang Miao, Suhan Jiang, Damu Ding, Lirong Lai, Ming Gao, Rui Jiang, Shengyuan He, Anjian Chen, Jiaming Shi, Junjie Wan, Yandong Duan, Ruomin Fang, Hongyu Wu, and Yongping Tang, ByteDance; Qiao Kang, unaffiliated; Guangrui Wu and Xiyun Xu, ByteDance
At growing internet companies like ByteDance, Wide Area Network (WAN) bandwidth sharing across diverse services with varying SLO requirements is a fundamental challenge. Conventional host-based enforcement systems, where agents identify and throttle traffic at the server end, face practical challenges such as "blind spot" traffic, kernel-dependent operational complexity, and significant server resource overhead. To address these issues, we present Net-P4ct, an in-network bandwidth enforcement system using P4 programmable switches. Net-P4ct improves both bandwidth guarantees and fair sharing by shifting dynamic QoS control into the switch data plane. Specifically, it achieves broader traffic coverage by combining host-side traffic tagging with a P4-switch pipeline, where service classification and QoS class assignment are performed. Based on observed traffic metrics, a centralized control plane determines real-time policy updates according to the max-min fair bandwidth allocation. We demonstrate the system's benefits including improved bandwidth utilization, reduced operational complexity, and lower per-byte processing cost. Net-P4ct has been deployed in ByteDance's production WAN for nearly a year, and we hope to share our experience with the community.
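Max-min fair allocation, which the abstract names as the control plane's policy target, can be sketched with the textbook progressive-filling procedure. This is a minimal single-link illustration. The service names and the `max_min_fair` function are hypothetical, and the deployed system presumably also handles SLO classes and multiple links, which are not modeled here.

```python
from typing import Dict


def max_min_fair(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Progressive filling on one link.

    Demands are served smallest-first; once a demand exceeds the equal
    share of what is left, all remaining demands get that equal share.
    """
    alloc: Dict[str, float] = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda kv: kv[1])
    while pending:
        share = remaining / len(pending)
        name, demand = pending[0]
        if demand <= share:
            # Fully satisfy the smallest demand and redistribute the rest.
            alloc[name] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Nobody left can be fully satisfied: split evenly.
            for other, _ in pending:
                alloc[other] = share
            break
    return alloc


# Hypothetical services sharing a 10-unit link.
demo = max_min_fair(10.0, {"video": 8.0, "search": 2.0, "backup": 5.0})
print(demo)
```

In this example the small `search` demand is met in full, and the two larger services split the remaining capacity equally.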
SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling
Athinagoras Skiadopoulos, Stanford University; Mark Zhao, University of Colorado Boulder; Swapnil Gandhi, Stanford University and NVIDIA; Thomas Norrie, OpenAI; Shrijeet Mukherjee, NVIDIA; Christos Kozyrakis, Stanford University and NVIDIA
Mixture-of-Experts (MoE) models have become a widely-adopted solution to continue scaling model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts—sparsely-activated feed-forward networks—within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle the wide load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance resources allocated to each expert based on popularity, incurring high state migration overheads.
To break this performance-accuracy tradeoff, we introduce SYMI, an adaptive MoE training system. The key insight of SYMI is to decouple the placement of expert parameters from their large optimizer state. SYMI statically partitions the optimizer of each expert across all training nodes. Meanwhile, SYMI dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SYMI right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overhead. Compared to the state-of-the-art MoE training systems DeepSpeed and FlexMoE, SYMI achieves 30.5% and 25.9% faster time-to-convergence, respectively.
BBC: Enabling BLE to Support Bluetooth Classic
Hsun-Wei Cho and Kang G. Shin, University of Michigan
Bluetooth Classic has been the technology used by the overwhelming majority of wireless headphones. However, Bluetooth Classic is incompatible with Bluetooth Low Energy (BLE), and hence cannot directly communicate with BLE devices. With the recent shift toward BLE, this incompatibility prevents using simple, energy-efficient BLE chips with Bluetooth headphones, and requires using more complex dual-mode chips to support both Bluetooth Classic and BLE.
To overcome this incompatibility, we present BBC, which enables Bluetooth-Classic connectivity on BLE chips. BBC sends and receives raw FSK bits using BLE hardware while emulating all other Bluetooth-Classic operations in the driver. By eliminating the need for Bluetooth-Classic hardware, BBC enables future devices to use BLE-only chips while maintaining the Bluetooth-Classic compatibility via emulation. It also enables new connectivity for current BLE devices to directly stream audio to Bluetooth-Classic headphones. BBC achieves a throughput of 557kbps and a packet error rate (PER) of 4.86% at the distance of 10m, and provides the same audio quality as off-the-shelf Bluetooth-Classic chips.
Fractal: Fault-Tolerant Shell-Script Distribution
Zhicheng Huang, Ramiz Dundar, and Yizheng Xie, Brown University; Konstantinos Kallas, University of California, Los Angeles; Nikos Vasilakis, Brown University
This paper presents FRACTAL, a new system that offers fault-tolerant distributed shell script execution for unmodified scripts. FRACTAL first distinguishes recoverable regions from side-effectful ones, and augments them with additional runtime support aimed at fault recovery. It employs precise dependency and progress tracking at the subgraph level to offer sound and efficient fault recovery. It minimizes the number of upstream regions that are re-executed during recovery and ensures exactly-once semantics upon recovery for downstream regions. Evaluation on 4- and 30-node clusters indicates average fault-free speedups of (1) >9.6× over Bash, a single-node shell-interpreter baseline, (2) >5.5× over Hadoop Streaming, a MapReduce system that supports language-agnostic third-party components, and (3) 17% over DiSh, a state-of-the-art fault-intolerant shell-script distribution system, all while recovering 7.8–16.4× faster than Hadoop Streaming in cases of faults.
HyperEdge: An Edge CDN Infrastructure for Cost Efficient Video Streaming
Dehui Wei, National University of Singapore; Jiao Zhang, Beijing University of Posts and Telecommunications, and Purple Mountain Laboratories; Haozhe Li, Rui Han, Zhichen Xue, Yajie Peng, Xiaofei Pang, and Yan Ma, ByteDance; Jialin Li, National University of Singapore
Awarded Outstanding Paper!
As ByteDance’s business expands, the cost of video streaming using content delivery networks (CDN) has become prohibitively high. We have discovered a sea of under-utilized edge devices with the potential to reduce content distribution cost. The unreliable performance of an edge network, however, presents deep challenges to video streaming services. In this work, we introduce HyperEdge, an edge-assisted content delivery system for video streaming. HyperEdge seamlessly integrates the robustness of a conventional CDN with the cost-efficiency of an edge network. It offers dependable streaming quality to users while minimizing traffic expenses. HyperEdge employs a centralized tracker cluster to optimize content distribution to a pool of edge devices, based on real-time monitoring. To ensure satisfactory video playback quality, we develop a novel multi-path protocol for client-edge video transmission. Having been in stable operation for six years, HyperEdge manages over a hundred thousand edge devices, serving about a hundred million users daily, and saving hundreds of millions of dollars in content delivery cost annually.
Over-Threshold Multiparty Private Set Intersection for Collaborative Network Intrusion Detection
Onur Eren Arpaci, Raouf Boutaba, and Florian Kerschbaum, University of Waterloo
An important function of collaborative network intrusion detection is to analyze the network logs of the collaborators for joint IP addresses. However, sharing IP addresses in plaintext is sensitive and may even be subject to privacy legislation, as they are personally identifiable information. In this paper, we present the privacy-preserving collection of IP addresses. We propose a single-collector, over-threshold private set intersection protocol. In this protocol, $N$ participants identify the IP addresses that appear in at least $t$ participants' sets without revealing any information about other IP addresses. Using a novel hashing scheme, we reduce the computational complexity of the previous state-of-the-art solution from $O(M (N \log M / t)^{2t})$ to $O(t^2 M \binom{N}{t})$, where $M$ denotes the dataset size. This reduction makes it practically feasible to apply our protocol to real network logs. We test our protocol using joint network logs of multiple institutions. Additionally, we present two deployment options: a collusion-safe deployment, which provides stronger security guarantees at the cost of increased communication overhead, and a non-interactive deployment, which assumes a non-colluding collector but offers significantly lower communication costs and is applicable to many use cases of collaborative network intrusion detection similar to ours.
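To make the target functionality concrete, the sketch below computes the over-threshold intersection in the clear: the set of items that appear in at least t of the N input sets. It is only a plaintext reference for what the protocol outputs. It offers no privacy and does not reflect the paper's hashing scheme or cryptographic construction. The function name and the example addresses (drawn from the 192.0.2.0/24 documentation range) are assumptions.

```python
from collections import Counter
from typing import Iterable, Set


def over_threshold(sets: Iterable[Set[str]], t: int) -> Set[str]:
    """Return the items present in at least t of the given sets.

    Plaintext reference only: a private protocol would reveal this
    output to the collector and nothing about the other items.
    """
    counts: Counter = Counter()
    for s in sets:
        counts.update(set(s))  # each participant counts an item once
    return {item for item, c in counts.items() if c >= t}


# Hypothetical logs from N = 3 participants.
logs = [
    {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    {"192.0.2.2", "192.0.2.3"},
    {"192.0.2.3", "192.0.2.4"},
]
flagged = over_threshold(logs, t=2)
print(sorted(flagged))
```

With t = 2, only the two addresses seen by at least two participants are returned. The addresses seen by a single participant are exactly what the private protocol is meant to keep hidden.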
Syntra: Synthesizing Cross-Layer Controllers for Low-Latency Video Streaming
Jia Pan, The University of Texas at Austin; Anup Agarwal, Carnegie Mellon University; Işıl Dillig and Venkat Arun, The University of Texas at Austin
Modern applications such as low-latency video streaming demand tight coordination across multiple control dimensions, including bitrate selection, congestion control, frame skipping, and forward error correction (FEC). These dimensions interact in complex ways, making existing heuristic approaches difficult to design, tune, and generalize. This paper presents Syntra, an automated tool that synthesizes joint controllers from a symbolic model of the system and a declarative performance objective. Syntra formulates control as a partially observable game, performs bounded-horizon minimax search (similar to chess engines) to synthesize strategies, and distills them into an efficient, interpretable policy via imitation learning. Synthesized controllers incorporate novel strategies that exploit the synergy between control dimensions to consistently outperform existing designs in our evaluation.
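As a rough illustration of bounded-horizon minimax search, the sketch below plays a toy game in which a controller picks how many packets to send and an adversarial network picks the link capacity. The game, its reward, and all names (`minimax`, `sends`, `caps`, `delay_penalty`) are hypothetical; unlike Syntra, the toy is fully observable, uses no symbolic model, and omits the imitation-learning step.

```python
def minimax(q, depth, sends=(0, 1, 2), caps=(1, 2), delay_penalty=0.5):
    """Return (worst-case value, best send action) with `depth` steps left.

    q is the current queue length in packets. Each step the controller
    adds s packets, the network then drains up to c packets, and the
    reward is packets delivered minus a penalty on the remaining queue.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for s in sends:  # controller move: maximize
        worst = float("inf")
        for c in caps:  # network move: minimize
            delivered = min(q + s, c)
            next_q = q + s - delivered
            reward = delivered - delay_penalty * next_q
            future, _ = minimax(next_q, depth - 1, sends, caps, delay_penalty)
            worst = min(worst, reward + future)
        if worst > best_value:
            best_value, best_action = worst, s
    return best_value, best_action


# Plan three steps ahead from an empty queue
value, action = minimax(q=0, depth=3)
print(value, action)
```

The search returns the action whose worst-case outcome over the horizon is best, which is the sense in which a controller chosen this way is robust to adversarial network behavior within the modeled bounds.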
Wallet: Confidential Serverless Computing
Patrick Sabanic, Masanori Misono, Teofil Bodea, Julian Pritzi, Michael Hackl, Dimitrios Stavrakakis, and Pramod Bhatotia, Technical University of Munich
Although serverless computing offers compelling cost and deployment simplicity advantages, a significant challenge remains in securely managing sensitive data as it flows through the network of ephemeral function executions in serverless computing environments within untrusted clouds. While Confidential Virtual Machines (CVMs) offer a promising secure execution environment, their integration with serverless architectures currently faces fundamental limitations in key areas: security, performance, and resource efficiency.
We present WALLET, a lightweight confidential computing system for secure serverless deployments. By employing nested confidential execution and a decoupled guest OS within CVMs, WALLET runs each function in a minimal "trustlet", significantly improving security through a reduced Trusted Computing Base (TCB). Furthermore, by leveraging a data-centric I/O architecture built upon a lightweight LibOS, WALLET optimizes network communication to address performance and resource efficiency challenges.
Our evaluation shows that compared to CVM-based deployments, WALLET has a 4.3× smaller TCB, improves end-to-end latency (15–93%), achieves higher function density (up to 907×), and reduces inter-function communication (up to 27×) and function chaining latency (16.7–30.2×); thus, WALLET offers a practical system design for confidential serverless computing.
DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants
Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, and Junchen Jiang, University of Chicago; Shan Lu, Madan Musuvathi, and Esha Choukse, Microsoft
Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much work was done in the past to enable the reuse of prefix KV caches across inputs for a single model, how to enable one model to reuse the prefix KV caches of a different model remains an open question.
We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference of different LLMs, so long as the LLMs have the same architecture. We present the first study that aims at understanding the impact of sharing KV caches across different LLMs, and if/when such sharing affects quality. Inspired by the findings, we present DroidSpeak, which selectively recomputes a few layers of the KV cache produced by another LLM and reuses the remaining layers, with negligible quality loss. Moreover, carefully pipelining the layer-wise re-computation and the loading of reused KV cache further improves the inference performance. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 4× throughput improvement and about 3.1× faster prefill (time to first token), with negligible loss of quality in F1, ROUGE-L, or code similarity scores, compared to the baseline, which does not allow any sharing across models.
Controlling Arbitrary Internet Queues with Titrate
Anchengcheng Zhou and Joshua Lau, Princeton University; P. Brighten Godfrey, University of Illinois Urbana–Champaign and Broadcom; Maria Apostolaki, Princeton University
Router buffers are critical to networks for absorbing short-lived congestion and allowing full throughput. However, excessive buffering can lead to high queuing delay, poor burst absorption, and even low throughput for queues sharing the buffer memory. Existing queue management schemes designed for Internet routers (e.g., CoDel, PIE) prevent such excessive buffering only under stringent assumptions about the queue composition (flows in the queue), while more recent approaches (e.g., L4S) require end-host collaboration. In this work, we revisit queue management for Internet routers from first principles and introduce Titrate, a closed-loop controller that senses queue dynamics and adjusts thresholds for any given queue to achieve high throughput, low latency and effective burst absorption. To balance convergence speed and stability, Titrate draws inspiration from TCP’s control loop, combining a multiplicative-increase-additive-decrease approach with an ssthresh-like variable.
We evaluate Titrate’s performance via simulation and Internet experiments. Across a wide range of realistic traffic mixes, Titrate increases minimum throughput by 39% and 14% compared to CoDel and PIE, respectively, while keeping queuing latency 59% lower than static-threshold baselines with on-par throughput. It also improves end-user quality of experience over static-threshold baselines. We further show that Titrate reacts swiftly to bandwidth and traffic changes and offers device-wide benefits.
Slowpoke: End-to-end Throughput Optimization Modeling for Microservice Applications
Yizheng Xie, Di Jin, and Oğuzhan Çölkesen, Brown University; Vasiliki Kalavri and John Liagouris, Boston University; Nikos Vasilakis, Brown University
Slowpoke is a new system to accurately quantify the effects of hypothetical optimizations on end-to-end throughput for microservice applications, without relying on tracing or a priori knowledge of the call graph. Microservice operators can use Slowpoke to ask what-if performance analysis questions of the form "What throughput could my retail application sustain if I optimized the shopping cart service from 10K req/s to 20K req/s?". Given a target service and its hypothetical optimization, Slowpoke employs a performance model that determines how to selectively slow down non-target services to preserve the relative effect of the optimization. It then performs profiling experiments to predict the end-to-end throughput, as if the optimization had been implemented. Applied to four real-world microservice applications, Slowpoke accurately quantifies optimization effects with a root mean squared error of only 2.07%. It is also effective in more complex scenarios, e.g., predicting throughput after scaling optimizations or when bottlenecks arise from mutex contention. Evaluated in large-scale deployments of 45 nodes and 108 synthetic benchmarks, Slowpoke further demonstrates its scalability and coverage of a wide range of microservice characteristics.
Co-Designing Traffic Control with NVMe-oF for Disaggregated Storage: A Comparative Study of Switched and Switchless SAN Architectures
Chendong Wang, Joontaek Oh, and Ming Liu, University of Wisconsin–Madison
Disaggregated storage is a pivotal component of today’s cluster infrastructures. With the advent of high-bandwidth server interconnects and new NVMe form factors, commodity storage appliances are becoming denser, delivering tens of millions of IOPS. This calls for today’s storage area network (SAN) fabric to expand the bandwidth capacity drastically. Industry practices tackle this issue via either (i) a scale-up approach, upgrading the per-port bandwidth in a switched SAN, or (ii) a scale-out strategy, integrating more paths in a switchless SAN. However, it is unclear which network architecture is more suitable for scaling storage disaggregation.
This paper presents a comparative study of switched and switchless SAN architectures from several angles. We begin by developing an experimental methodology that integrates both small-scale real-system prototypes and large-scale simulations, providing the flexibility needed to explore architectural trade-offs. We then characterize NVMe-oF I/O flows and co-design SAN traffic control mechanisms around these characteristics to improve I/O transmission efficiency in both settings. Our evaluation yields several key findings. First, the switchless SAN achieves throughput comparable to that of the switched SAN, despite involving additional routing hops, while simultaneously reducing latency through the use of multiple load-aware I/O paths that mitigate interference. Second, the switchless SAN reduces capital costs by obviating the need for expensive high-radix switches, scales effectively under heterogeneous I/O workloads, and avoids the single point of failure associated with top-of-rack (ToR) switches. Collectively, these results demonstrate that switchless SANs provide a compelling alternative to traditional switched designs for disaggregated storage environments.
Agentix: An Efficient Serving Engine for LLM Agents as General Programs
Michael Luo, University of California, Berkeley, and Google DeepMind; Xiaoxiang Shi, Shanghai Jiao Tong University; Colin Cai, Tianjun Zhang, Justin Wong, and Yichuan Wang, University of California, Berkeley; Chi Wang, Yanping Huang, and Zhifeng Chen, Google DeepMind; Joseph E. Gonzalez and Ion Stoica, University of California, Berkeley
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request level and the program level.
To address this, we introduce Agentix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Agentix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms—for single-threaded and distributed programs—that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Agentix improves throughput of programs by 4-15× at the same latency compared to state-of-the-art systems, such as vLLM.
ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression
Zirui Wang, Tingfeng Lan, and Zhaoyuan Su, University of Virginia; Juncheng Yang, Harvard University; Yue Cheng, University of Virginia
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques—such as deduplication and compression—are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness.
Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication is better aligned with model storage workloads, achieving high data reduction with low metadata overhead. Building on these insights, we design BitX, an effective, fast, lossless delta compression algorithm that compresses the XORed difference between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 54%, over 20% higher than state-of-the-art deduplication and compression approaches.
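The idea of compressing the XORed difference between a fine-tuned tensor and its base can be sketched as follows. This is only an illustration under stated assumptions: `zlib` stands in for the actual BitX coder, the weights are synthetic float32 values, the sparse update pattern is invented, and the helper names (`xor_delta`, `compress_delta`, `restore`) are hypothetical.

```python
import random
import struct
import zlib


def xor_delta(base: bytes, other: bytes) -> bytes:
    """Bytewise XOR of two equally sized tensors (its own inverse)."""
    if len(base) != len(other):
        raise ValueError("tensors must have the same byte length")
    return bytes(a ^ b for a, b in zip(base, other))


def compress_delta(base: bytes, tuned: bytes) -> bytes:
    """Compress the XOR difference; zlib is a stand-in for BitX's coder."""
    return zlib.compress(xor_delta(base, tuned), 9)


def restore(base: bytes, blob: bytes) -> bytes:
    """Losslessly rebuild the fine-tuned tensor from the base and the blob."""
    return xor_delta(base, zlib.decompress(blob))


# Synthetic base weights and a hypothetical sparse fine-tuning update
rng = random.Random(0)
base_vals = [rng.uniform(-1, 1) for _ in range(4096)]
tuned_vals = [v + 1e-3 if i % 64 == 0 else v for i, v in enumerate(base_vals)]
base = struct.pack(f"<{len(base_vals)}f", *base_vals)
tuned = struct.pack(f"<{len(tuned_vals)}f", *tuned_vals)

blob = compress_delta(base, tuned)
assert restore(base, blob) == tuned  # round trip is exact
print(len(zlib.compress(tuned, 9)), len(blob))
```

Because unchanged weights XOR to zero bytes, the delta is mostly zeros and compresses far better than the fine-tuned tensor on its own, while the XOR round trip keeps the scheme lossless.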
Harvesting Spare CPU Resources in Container Systems
Adam Hall and Anirudh Sarma, Georgia Institute of Technology; Esha Choukse, Microsoft Azure Research; Umakishore Ramachandran, Georgia Institute of Technology; Sameh Elnikety, Microsoft Research
Platforms like Kubernetes are widely adopted for deploying latency-sensitive cloud services in containers, and CPU resources for these containers are over-provisioned to ensure low 99th percentile tail latency under peak load. At the same time, cloud services exhibit bursty traffic patterns, resulting in CPU usage variability that creates an opportunity to harvest ephemerally unused CPU cores to run latency-tolerant containers. However, existing resource controls do not allow latency-sensitive containers to share unused cores without compromising their low tail latency objectives. Prior research on performance isolation is inadequate for container systems because it requires modifying applications and system software, employs offline profiling, and does not account for interference from processing container networking interrupts. We present HarvestContainers, a system that protects latency-sensitive containers from all sources of interference while harvesting their spare CPU cores to run latency-tolerant containers. Our solution dynamically determines the safe number of CPU cores to harvest and does not require rewriting applications or the OS. We implement HarvestContainers integrated with Kubernetes and evaluate it experimentally. Our evaluation shows that latency-sensitive containers with microsecond-scale service level objectives can share up to 75% of their unused CPU cores while maintaining tail latency within 4% of standalone operation.