NSDI '24 Technical Sessions

Papers and Proceedings

The full Proceedings published by USENIX for the symposium are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

Full Proceedings PDFs
 NSDI '24 Full Proceedings (PDF, 301 MB)
 NSDI '24 Proceedings Interior (PDF, 300 MB, best for mobile devices)
 NSDI '24 Errata Slip #1 (PDF)
 NSDI '24 Errata Slip #2 (PDF)
 NSDI '24 Errata Slip #3 (PDF)

Attendee Files 
NSDI '24 Attendee List (PDF)

Tuesday, April 16

8:00 am–8:55 am

Continental Breakfast

Mezzanine

8:55 am–9:10 am

Opening Remarks and Awards

Program Co-Chairs: Laurent Vanbever, ETH Zürich; Irene Zhang, Microsoft Research

Santa Clara Ballroom

9:10 am–10:30 am

Track 1

Clouds but Faster

Session Chair: Seo Jin Park, University of Southern California

Santa Clara Ballroom

Horus: Granular In-Network Task Scheduler for Cloud Datacenters

Parham Yassini, Simon Fraser University; Khaled Diab, Hewlett Packard Labs; Saeed Zangeneh and Mohamed Hefeeda, Simon Fraser University

Available Media

Short-lived tasks are prevalent in modern interactive datacenter applications. However, designing schedulers to assign these tasks to workers distributed across the whole datacenter is challenging, because such schedulers need to make decisions at a microsecond scale, achieve high throughput, and minimize the tail response time. Current task schedulers in the literature are limited to individual racks. We present Horus, a new in-network task scheduler for short tasks that operates at the datacenter scale. Horus efficiently tracks and distributes the worker state among switches, which enables it to schedule tasks in parallel at line rate while optimizing the scheduling quality. We propose a new distributed task scheduling policy that minimizes the state and communication overheads, handles dynamic loads, and does not buffer tasks in switches. We compare Horus against the state-of-the-art in-network scheduler in a testbed with programmable switches as well as using simulations of datacenters with more than 27K hosts and thousands of switches handling diverse and dynamic workloads. Our results show that Horus efficiently scales to large datacenters, and it substantially outperforms the state-of-the-art across all performance metrics, including tail response time and throughput.

Fast Vector Query Processing for Large Datasets Beyond GPU Memory with Reordered Pipelining

Zili Zhang, Fangyue Liu, Gang Huang, Xuanzhe Liu, and Xin Jin, School of Computer Science, Peking University

Available Media

Vector query processing powers a wide range of AI applications. While GPUs are optimized for massive vector operations, today's practice relies on CPUs to process queries for large vector datasets, due to limited GPU memory.

We present RUMMY, the first GPU-accelerated vector query processing system that achieves high performance and supports large vector datasets beyond GPU memory. The core of RUMMY is a novel reordered pipelining technique that exploits the characteristics of vector query processing to efficiently pipeline data transmission from host memory to GPU memory and query processing in GPU. Specifically, it leverages three ideas: (i) cluster-based retrofitting to eliminate redundant data transmission across queries in a batch, (ii) dynamic kernel padding with cluster balancing to maximize spatial and temporal GPU utilization for GPU computation, and (iii) query-aware reordering and grouping to optimally overlap transmission and computation. We also tailor GPU memory management for vector queries to reduce GPU memory fragmentation and cache misses. We evaluate RUMMY with a variety of billion-scale benchmarking datasets. The experimental results show that RUMMY outperforms IVF-GPU with CUDA unified memory by up to 135×. Compared to the CPU-based solution (with 64 vCPUs), RUMMY (with one NVIDIA A100 GPU) achieves up to 23.1× better performance and is up to 37.7× more cost-effective.

LoLKV: The Logless, Linearizable, RDMA-based Key-Value Storage System

Ahmed Alquraan and Sreeharsha Udayashankar, University of Waterloo; Virendra Marathe, Oracle Labs; Bernard Wong and Samer Al-Kiswany, University of Waterloo

Available Media

We present LoLKV, a novel logless replicated key-value storage system. LoLKV follows a fundamentally different approach for designing a linearizable key-value storage system compared to state-of-the-art systems. LoLKV forgoes the classical log-based design and uses a lock-free approach to allow multiple threads to concurrently update objects. It presents a novel leader election and consolidation approach to handle complex failure scenarios. LoLKV’s followers are passive, reducing their overall CPU usage. Our evaluation shows that LoLKV achieves 1.7–10× higher throughput and 20–92% lower tail latency than other state-of-the-art low-latency key-value stores.

Making Kernel Bypass Practical for the Cloud with Junction

Joshua Fried and Gohar Irfan Chaudhry, MIT CSAIL; Enrique Saurez, Esha Choukse, and Íñigo Goiri, Azure Research – Systems; Sameh Elnikety, Microsoft Research; Rodrigo Fonseca, Azure Research – Systems; Adam Belay, MIT CSAIL

Available Media

Kernel bypass systems have demonstrated order-of-magnitude improvements in throughput and tail latency for network-intensive applications relative to traditional operating systems (OSes). To achieve such excellent performance, however, they rely on dedicated resources (e.g., spinning cores, pinned memory) and require application rewriting. This is unattractive to cloud operators because they aim to densely pack applications, and rewriting cloud software requires a massive investment of valuable developer time. For both reasons, kernel bypass, as it exists, is impractical for the cloud.

In this paper, we show these compromises are not necessary to unlock the full benefits of kernel bypass. We present Junction, the first kernel bypass system that can pack thousands of instances on a machine while providing compatibility with unmodified Linux applications. Junction achieves high density through several advanced NIC features that reduce pinned memory and the overhead of monitoring large numbers of queues. It maintains compatibility with minimal overhead through optimizations that exploit a shared address space with the application. Junction scales to 19–62× more instances than existing kernel bypass systems and can achieve similar or better performance without code changes. Furthermore, Junction delivers significant performance benefits to applications previously unsupported by kernel bypass, including those that depend on runtime systems like Go, Java, Node, and Python. In a comparison to native Linux, Junction increases throughput by 1.6–7.0× while using 1.2–3.8× fewer cores across seven applications.

Track 2

Scheduling the Network

Session Chair: Akshay Narayan, University of California, Berkeley

Magnolia Room

Sifter: An Inversion-Free and Large-Capacity Programmable Packet Scheduler

Peixuan Gao, Anthony Dalleggio, Jiajin Liu, and Chen Peng, New York University; Yang Xu, Fudan University; H. Jonathan Chao, New York University

Available Media

Packet schedulers play a crucial role in determining the order in which packets are served. They achieve this by assigning a rank to each packet and sorting them based on these ranks. However, when dealing with a large number of flows at high packet rates, sorting functions can become extremely complex and time-consuming. To address this issue, fast-approximating packet schedulers have been proposed, but they come with the risk of producing scheduling errors, or packet inversions, which can lead to undesirable consequences. We present Sifter, a programmable packet scheduler that offers high accuracy and large capacity while ensuring inversion-free operation. Sifter employs a unique sorting technique called “Sift Sorting” to coarsely sort packets with larger ranks into buckets, while accurately and finely sorting those with smaller ranks using a small Push-In-First-Out (PIFO) queue in parallel. The sorting process takes advantage of the “Speed-up Factor”, which is a function of the memory bandwidth to output link bandwidth ratio, to achieve Sift Sorting and ensure accurate scheduling with low resource consumption. Sifter combines the benefits of PIFO’s accuracy and FIFO-based schedulers’ large capacity, resulting in guaranteed delivery of packets in an accurate scheduling order. Our simulation results demonstrate Sifter’s efficiency in achieving inversion-free scheduling, while the FPGA-based hardware prototype validates that Sifter supports a throughput of 100Gbps without packet inversion errors.
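As a rough software illustration of the coarse-plus-fine sorting idea described above (not Sifter's hardware design), the sketch below buckets larger ranks coarsely and keeps an exactly sorted structure for the smallest ranks, refilling it one bucket at a time; the bucket width and capacity values are arbitrary.

```python
# Illustrative sketch: coarse buckets for large ranks, an exact min-heap
# (standing in for the small PIFO) for the smallest ranks.
import heapq
from collections import deque

class CoarseFineQueue:
    def __init__(self, bucket_width=16, num_buckets=64):
        self.bucket_width = bucket_width
        self.buckets = [deque() for _ in range(num_buckets)]
        self.fine = []            # exact min-heap over the lowest-rank range
        self.fine_limit = 0       # ranks below this go straight to the fine queue

    def enqueue(self, rank, pkt):
        if rank < self.fine_limit:
            heapq.heappush(self.fine, (rank, pkt))
        else:
            b = min(rank // self.bucket_width, len(self.buckets) - 1)
            self.buckets[b].append((rank, pkt))

    def dequeue(self):
        if not self.fine:                      # "sift" the next bucket into the fine queue
            for i, b in enumerate(self.buckets):
                if b:
                    while b:
                        heapq.heappush(self.fine, b.popleft())
                    self.fine_limit = (i + 1) * self.bucket_width
                    break
        return heapq.heappop(self.fine) if self.fine else None

q = CoarseFineQueue()
for r in [40, 3, 25, 7]:
    q.enqueue(r, f"pkt{r}")
print([q.dequeue()[0] for _ in range(4)])   # [3, 7, 25, 40]
```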

Flow Scheduling with Imprecise Knowledge

Wenxin Li, Xin He, Yuan Liu, and Keqiu Li, Tianjin University; Kai Chen, Hong Kong University of Science and Technology and University of Science and Technology of China; Zhao Ge and Zewei Guan, Tianjin University; Heng Qi, Dalian University of Technology; Song Zhang, Tianjin University; Guyue Liu, New York University Shanghai

Available Media

Most existing data center network (DCN) flow scheduling solutions aim to minimize flow completion times (FCT). However, these solutions either require precise flow information (e.g., per-flow size), which is challenging to implement on commodity switches (e.g., pFabric), or no prior flow information at all, which comes at the cost of performance (e.g., PIAS). In this work, we present QCLIMB, a new flow scheduling solution designed to minimize FCT by utilizing imprecise flow information. Our key observation is that although obtaining precise flow information can be challenging, it is possible to accurately estimate each flow's lower and upper bounds with machine learning techniques.

QCLIMB has two key parts: i) a novel scheduling algorithm that leverages the lower bounds of different flows to prioritize small flows over large flows from the beginning of transmission, rather than at later stages; and ii) an efficient out-of-order handling mechanism that addresses practical reordering issues resulting from the algorithm. We show that QCLIMB significantly outperforms PIAS (88% lower average FCT of small flows) and is surprisingly close to pFabric (around 9% gap) while not requiring any switch modifications.

Pudica: Toward Near-Zero Queuing Delay in Congestion Control for Cloud Gaming

Shibo Wang, Xi’an Jiaotong University and Tencent Inc.; Shusen Yang, Xi'an Jiaotong University; Xiao Kong, Chenglei Wu, and Longwei Jiang, Tencent; Chenren Xu, Peking University; Cong Zhao, Xi'an Jiaotong University; Xuesong Yang, Bonree; Jianjun Xiao and Xin Liu, Tencent; Changxi Zheng, Pixel Lab, Tencent America, and Columbia University; Jing Wang and Honghao Liu, Tencent

Available Media

Congestion control (CC) plays a pivotal role in cloud gaming services. However, existing CC methods often cause self-induced bottleneck queuing. As a result, they may largely delay game frame transmission and undermine the player's gaming experience. We present a new end-to-end CC algorithm named Pudica that strives to achieve near-zero queuing delay and high link utilization while respecting cross-flow fairness. Pudica introduces several judicious approaches to utilize the paced frame to probe the bandwidth utilization ratio (BUR) instead of bandwidth itself. By leveraging BUR estimations, Pudica designs a holistic bitrate adjustment policy to balance low queuing, efficiency, and fairness. We conducted thorough and comprehensive evaluations in real networks. In comparison to baseline methods, Pudica reduces the average and tail frame delay by 3.1× and 4.9×, respectively, and cuts down the stall rate by 10.3×. Meanwhile, it increases the frame bitrate by 12.1%. Pudica has been deployed in a large-scale cloud gaming platform, serving millions of players.

Revisiting Congestion Control for Lossless Ethernet

Yiran Zhang, Tsinghua University and Beijing University of Posts and Telecommunications; Qingkai Meng, Tsinghua University and Beihang University; Chaolei Hu and Fengyuan Ren, Tsinghua University

Available Media

Congestion control is a key enabler for lossless Ethernet at scale. In this paper, we revisit this classic topic from a new perspective, i.e., understanding and exploiting the intrinsic properties of the underlying lossless network. We experimentally and analytically find that the intrinsic properties of lossless networks, such as packet conservation, can indeed provide valuable implications in estimating pipe capacity and the precise number of excessive packets. Besides, we derive principles on how to treat congested flows and victim flows individually to handle HoL blocking efficiently. Then, we propose ACK-driven congestion control (ACC) for lossless Ethernet, which simply resorts to the knowledge of ACK time series to exert a temporary halt to exactly drain out excessive packets of congested flows and then match its rate to pipe capacity. Testbed and large-scale simulations demonstrate that ACC ameliorates fundamental issues in lossless Ethernet (e.g., congestion spreading, HoL blocking, and deadlock) and achieves excellent low latency and high throughput performance. For instance, compared with existing schemes, ACC improves the average and 99th percentile FCT performance of small flows by 1.3–3.3× and 1.4–11.5×, respectively.

10:30 am–11:00 am

Break with Refreshments

Mezzanine

11:00 am–12:40 pm

Track 1

Serverless

Session Chair: Eric Eide, University of Utah

Santa Clara Ballroom

Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices

Zibo Wang, University of Science and Technology of China and Microsoft Research; Pinghe Li, ETH Zurich; Chieh-Jan Mike Liang, Microsoft Research; Feng Wu, University of Science and Technology of China; Francis Y. Yan, Microsoft Research
Awarded Outstanding Paper!

Available Media

Achieving resource efficiency while preserving end-user experience is non-trivial for cloud application operators. As cloud applications progressively adopt microservices, resource managers are faced with two distinct levels of system behavior: end-to-end application latency and per-service resource usage. Translating between the two levels, however, is challenging because user requests traverse heterogeneous services that collectively (but unevenly) contribute to the end-to-end latency. We present Autothrottle, a bi-level resource management framework for microservices with latency SLOs (service-level objectives). It architecturally decouples application SLO feedback from service resource control, and bridges them through the notion of performance targets. Specifically, an application-wide learning-based controller is employed to periodically set performance targets—expressed as CPU throttle ratios—for per-service heuristic controllers to attain. We evaluate Autothrottle on three microservice applications, with workload traces from production scenarios. Results show superior CPU savings, up to 26.21% over the best-performing baseline and up to 93.84% over all baselines.

Jolteon: Unleashing the Promise of Serverless for Serverless Workflows

Zili Zhang, Chao Jin, and Xin Jin, School of Computer Science, Peking University

Available Media

Serverless computing promises automatic resource provisioning to relieve the burden of developers. Yet, developers still have to manually configure resources on current serverless platforms to satisfy application-level requirements. This is because cloud applications are orchestrated as serverless workflows with multiple stages, exhibiting a complex relationship between resource configuration and application requirements.

We propose Jolteon, an orchestrator to unleash the promise of automatic resource provisioning for serverless workflows. At the core of Jolteon is a stochastic performance model that combines the benefits of whitebox modeling to capture the execution characteristics of serverless computing and blackbox modeling to accommodate the inherent performance variability. We formulate a chance constrained optimization problem based on the performance model, and exploit sampling and convexity to find optimal resource configurations that satisfy user-defined cost or latency bounds. We implement a system prototype of Jolteon and evaluate it on AWS Lambda with a variety of serverless workflows. The experimental results show that Jolteon outperforms the state-of-the-art solution, Orion, by up to 2.3× on cost and 2.1× on latency.

Can't Be Late: Optimizing Spot Instance Savings under Deadlines

Zhanghao Wu, Wei-Lin Chiang, Ziming Mao, and Zongheng Yang, University of California, Berkeley; Eric Friedman and Scott Shenker, University of California, Berkeley, and ICSI; Ion Stoica, University of California, Berkeley
Awarded Outstanding Paper!

Available Media

Cloud providers offer spot instances alongside on-demand instances to optimize resource utilization. While economically appealing, spot instances’ preemptible nature makes them ill-suited for deadline-sensitive jobs. To allow jobs to meet deadlines while leveraging spot instances, we propose a simple idea: use on-demand instances judiciously as a backup resource. However, due to the unpredictable spot instance availability, determining when to switch between spot and on-demand to minimize cost requires careful policy design. In this paper, we first provide an in-depth characterization of spot instances (e.g., availability, pricing, duration), and develop a basic theoretical model to examine the worst and average-case behaviors of baseline policies (e.g., greedy). The model serves as a foundation to motivate our design of a simple and effective policy, Uniform Progress, which is parameter-free and requires no assumptions on spot availability. Our empirical study, based on three-month-long real spot availability traces on AWS, demonstrates that it can (1) outperform the greedy policy by closing the gap to the optimal policy by 2× in both average and bad cases, and (2) further reduce the gap when limited future knowledge is given. These results hold in a variety of conditions ranging from loose to tight deadlines, low to high spot availability, and on single or multiple instances. By implementing this policy on top of SkyPilot, an intercloud broker system, we achieve 27%–84% cost savings across a variety of representative real-world workloads and deadlines. The spot availability traces are open-sourced for future research.
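Below is a hedged sketch of one plausible reading of a uniform-progress-style rule (keep completed work on or above the straight line from zero at submission to the total work at the deadline), not necessarily the paper's exact policy; the unit-length slots, availability trace, and prices are all hypothetical.

```python
# Sketch of a "uniform progress"-style policy: take spot whenever available,
# and fall back to on-demand only when progress would otherwise fall behind
# the straight line toward the deadline.
def choose_instance(t, deadline, total_work, done_work, spot_available):
    """Return 'spot', 'on-demand', or 'idle' for the current time slot."""
    required = total_work * (t / deadline)        # uniform-progress target at time t
    if spot_available:
        return "spot"                             # spot is cheaper; always take it
    if done_work < required:
        return "on-demand"                        # behind the line: pay to catch up
    return "idle"                                 # ahead of the line: wait for spot

# Toy simulation over unit-length slots with a made-up availability trace:
trace = [True, False, False, True, True, False, True, False]
done, cost = 0.0, 0.0
for t, avail in enumerate(trace):
    action = choose_instance(t, deadline=len(trace), total_work=5.0,
                             done_work=done, spot_available=avail)
    if action != "idle":
        done += 1.0
        cost += 0.3 if action == "spot" else 1.0  # illustrative prices only
print(done, cost)
```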

Towards Intelligent Automobile Cockpit via A New Container Architecture

Lin Jiang and Feiyu Zhang, Xi’an Yunzhiji Technology; Jiang Ming, Tulane University

Available Media

An intelligent cockpit is now crucial in automobiles, not just to provide digital instrumentation and in-vehicle controls but also to offer a wide range of entertainment functionalities. To cater to the demands of these intelligent vehicles, the automotive industry has started employing virtualization technology to offer a unified hardware and software architecture that can simplify system management and enhance resource utilization. Particularly in the domain of intelligent cockpits, virtualization can tightly integrate systems with different criticality levels (e.g., safety and real-time) on a single hardware platform, improving inter-system communication quality and the timely response to user-initiated requests. Currently, microhypervisor virtualization is used in production to build intelligent automobile cockpits. However, in addition to performance concerns and high production costs, this solution suffers from the global shortage of chips capable of running microhypervisor systems.

Our key insight is that most functions within intelligent cockpit systems are non-safety-critical and non-real-time multimedia tasks. Based on this characteristic, in this paper we present AutoVP, a new cockpit virtualization architecture. The hardware foundation of AutoVP consists of two low-cost chips: 1) a consumer-grade System-on-Chip (SoC) multi-core processor as the main chip; 2) a typical automotive-grade Microcontroller Unit (MCU) as the auxiliary chip. The MCU auxiliary chip is responsible for hosting real-time and safety-critical tasks, while the SoC main chip primarily handles multimedia tasks, such as entertainment systems and digital instrumentation. Furthermore, we construct an Android container virtual environment on the SoC main chip. This environment integrates multiple media functions onto a single chip, resulting in efficient utilization of chip computational resources and high system scalability. Our comparative performance evaluation demonstrates that AutoVP is a cost-effective and efficient solution to build intelligent cockpits.

MuCache: A General Framework for Caching in Microservice Graphs

Haoran Zhang, Konstantinos Kallas, Spyros Pavlatos, Rajeev Alur, Sebastian Angel, and Vincent Liu, University of Pennsylvania

Available Media

This paper introduces MuCache, a framework for extending arbitrary microservice applications with inter-service caches. MuCache significantly improves the performance of microservice graphs (commonly found in large applications like Uber or Twitter) by eliminating the need for one microservice to call another when the relevant state has not changed. MuCache is enabled by a novel non-blocking cache coherence and invalidation protocol for graph topologies that minimizes critical-path overhead. For this protocol, we prove a strong correctness result: any execution observed by the cache-enabled microservice application could have been observed by the original application without caches. Our evaluation on well-known microservice benchmarks shows that MuCache reduces the median request latency by up to 2.5×, and increases throughput by up to 60%.

Track 2

Network Protocols

Session Chair: Maria Apostolaki, Princeton University

Magnolia Room

A large-scale deployment of DCTCP

Abhishek Dhamija and Balasubramanian Madhavan, Meta; Hechao Li, Netflix; Jie Meng, Shrikrishna Khare, and Madhavi Rao, Meta; Lawrence Brakmo; Neil Spring, Prashanth Kannan, and Srikanth Sundaresan, Meta; Soudeh Ghorbani, Meta and Johns Hopkins University

Available Media

This paper describes the process and operational experiences of deploying the Data Center TCP (DCTCP) protocol in a large-scale data center network. In contrast to legacy congestion control protocols that rely on loss as the primary signal of congestion, DCTCP signals in-network congestion (based on queue occupancy) to senders and adjusts the sending rate proportional to the level of congestion. At the time of our deployment, this protocol was well-studied and fairly established with proven efficiency gains in other networks. As expected, we also observed improved performance, and notably decreased packet losses, compared to legacy protocols in our data centers. Perhaps unexpectedly, however, we faced numerous hurdles in rolling out DCTCP; we chronicle these unexpected challenges, ranging from its unfairness (to other classes of traffic) to implementation bugs. We close by discussing some of the open research questions and challenges.

TECC: Towards Efficient QUIC Tunneling via Collaborative Transmission Control

Jiaxing Zhang, Alibaba Group, University of Chinese Academy of Sciences; Furong Yang, Alibaba Group; Ting Liu, Alibaba Group, University of Chinese Academy of Sciences; Qinghua Wu, University of Chinese Academy of Sciences, Purple Mountain Laboratories, China; Wu Zhao, Yuanbo Zhang, Wentao Chen, Yanmei Liu, Hongyu Guo, and Yunfei Ma, Alibaba Group; Zhenyu Li, University of Chinese Academy of Sciences, Purple Mountain Laboratories, China

Available Media

In this paper, we present TECC, a system based on collaborative transmission control that mitigates the mismatch of sending behavior between the inner and outer connections to achieve efficient QUIC tunneling. In TECC, a feedback framework is implemented to enable end hosts to collect more precise network information that is sensed on the tunnel server, which assists the inner end-to-end connection to achieve better congestion control and loss recovery. Extensive experiments in emulated networks and real-world large-scale A/B tests demonstrate the efficiency of TECC. Specifically, compared with the state-of-the-art QUIC tunneling solution, TECC significantly reduces flow completion time. In emulated networks, TECC decreases flow completion time by 30% on average and 53% at the 99th percentile. TECC also gains a reduction in RPC (Remote Procedure Call) request completion time of 3.9% on average and 13.3% at the 99th percentile in large-scale A/B tests.

iStack: A General and Stateful Name-based Protocol Stack for Named Data Networking

Tianlong Li, Tian Song, and Yating Yang, Beijing Institute of Technology

Available Media

Named Data Networking (NDN) shifts the network from host-centric to data-centric with a clean-slate design, in which packet forwarding is based on names, and the data plane maintains per-packet state. Different forwarders have been implemented to provide NDN capabilities for various scenarios; however, there is no network stack integrated with operating systems (OS) for general-purpose use. Designing a stateful and entirely name-based protocol stack in the OS kernel remains a challenge due to three factors: (i) an in-kernel name resolution architecture for packet demultiplexing is necessary, (ii) an entirely name-based stack needs to be compatible with the current address (MAC/IP/port)-based architecture in the OS kernel, and (iii) maintaining per-packet state introduces a trade-off between performance and resource consumption.

In this paper, for the first time, we take NDN into the OS kernel by proposing iStack, an Information-Centric Networking (ICN) protocol stack. The main innovations of iStack are threefold. First, we propose a name resolution architecture to support both network-layer forwarding and local packet demultiplexing. Second, a two-layer face system is proposed to provide abstraction of address-based network interfaces. Third, we design socket-compatible interfaces to keep the uniformity of the current network stack in the OS. Besides, we design compact forwarding data structures for fast packet processing with low memory footprint. We have implemented prototypes on multiple platforms. The evaluation results show that iStack achieves 6.50 Gbps throughput, outperforming the NDN-testbed forwarder by a factor of 16.25×, and reduces forwarding latency for cached packets by 46.08% with its in-kernel packet caching. iStack is not just another forwarder for NDN, but a step forward for practical development of ICN.

Cloudcast: High-Throughput, Cost-Aware Overlay Multicast in the Cloud

Sarah Wooders and Shu Liu, UC Berkeley; Paras Jain, Genmo AI; Xiangxi Mo and Joseph Gonzalez, UC Berkeley; Vincent Liu, University of Pennsylvania; Ion Stoica, UC Berkeley

Available Media

Bulk data replication across multiple cloud regions and providers is essential for large organizations to support data analytics, disaster recovery, and geo-distributed model serving. However, data multicast in the cloud can be expensive due to network egress costs and slow due to cloud network constraints. In this paper, we study the design of high-throughput, cost-optimized overlay multicast for bulk cloud data replication that exploits trends in modern provider pricing models along with techniques like ephemeral waypoints to minimize cloud networking costs.

To that end, we design an optimization algorithm that uses information about cloud network throughput and pricing to identify cost-minimizing multicast replication trees under user-given runtime budgets. Our open-source implementation, Cloudcast, performs cloud overlay multicast and supports pluggable algorithms for determining the multicast tree structure. Our evaluations show that Cloudcast achieves 61.5% cost reduction and 2.3× replication speedup compared to both academic and commercial baselines (e.g., AWS multi-region bucket) for multi-region replication.

Understanding Routable PCIe Performance for Composable Infrastructures

Wentao Hou, University of Wisconsin-Madison; Jie Zhang and Zeke Wang, Zhejiang University; Ming Liu, University of Wisconsin-Madison

Available Media

Routable PCIe has become the predominant cluster interconnect to build emerging composable infrastructures. Empowered by PCIe non-transparent bridge devices, PCIe transactions can traverse multiple switching domains, enabling a server to elastically integrate a number of remote PCIe devices as local ones. However, it is unclear how to move data or perform communication efficiently over the routable PCIe fabric without understanding its capabilities and limitations.

This paper presents the design and implementation of rPCIeBench, a software-hardware co-designed benchmarking framework to systematically characterize the routable PCIe fabric. rPCIeBench provides flexible data communication primitives, exposes end-to-end PCIe transaction observability, and enables reconfigurable experiment deployment. Using rPCIeBench, we first analyze the communication characteristics of a routable PCIe path, quantify its performance tax, and compare it with the local PCIe link. We then use it to dissect in-fabric traffic orchestration behaviors and draw three interesting findings: approximate max-min bandwidth partition, fast end-to-end bandwidth synchronization, and interference freedom among orthogonal data paths. Finally, we encode gathered characterization insights as traffic orchestration rules and develop an edge constraints relaxing algorithm to estimate PCIe flow transmission performance over a shared fabric. We validate its accuracy and demonstrate its potential to provide an optimization guide to design efficient flow schedulers.

12:40 pm–2:00 pm

Symposium Luncheon and Test of Time Award Presentation

Mezzanine

2:00 pm–3:40 pm

Track 1

Distributed Systems: Part 1

Session Chair: Jay Lorch, Microsoft Research

Santa Clara Ballroom

Alea-BFT: Practical Asynchronous Byzantine Fault Tolerance

Diogo S. Antunes, Afonso N. Oliveira, André Breda, Matheus Guilherme Franco, Henrique Moniz, and Rodrigo Rodrigues, Instituto Superior Técnico (ULisboa) and INESC-ID

Available Media

Traditional Byzantine Fault Tolerance (BFT) state machine replication protocols assume a partial synchrony model, leading to a design where a leader replica drives the protocol and is replaced after a timeout. Recently, we witnessed a surge of asynchronous BFT protocols, which use randomization to remove the need for bounds on message delivery times, making them more resilient to adverse network conditions. However, existing research proposals still fall short of gaining practical adoption, plausibly because they are not able to combine good performance with a simple design that can be readily understood and adopted. In this paper, we present Alea-BFT, a simple and highly efficient asynchronous BFT protocol, which is gaining practical adoption, namely in Ethereum distributed validators. Alea-BFT brings the key design insight from classical protocols of concentrating part of the work on a single designated replica and incorporates this principle in a simple two-stage pipelined design, with an efficient broadcast led by the designated replica, followed by an inexpensive binary agreement. The evaluation of our research prototype implementation and two real-world integrations in cryptocurrency ecosystems shows excellent performance, improving on the fastest protocol (Dumbo-NG) in terms of latency and displaying good performance under faults.

Harmony: A Congestion-free Datacenter Architecture

Saksham Agarwal, Qizhe Cai, Rachit Agarwal, and David Shmoys, Cornell University; Amin Vahdat, Google

Available Media

Datacenter networks today provide best-effort delivery—messages may observe unpredictable queueing, delays, and drops due to switch buffer overflows within the network. Such weak guarantees reduce the set of assumptions that system designers can rely upon from the network, thus introducing inefficiency and complexity in host hardware and software.

We present Harmony, a datacenter network architecture that provides powerful "congestion-free" message delivery guarantees—each message, once transmitted by the sender, observes bounded queueing at each switch in the network. Thus, network delays are bounded in failure-free scenarios, and congestion-related drops are completely eliminated. We establish, both theoretically and empirically, that Harmony provides such powerful guarantees with near-zero overheads compared to best-effort delivery networks: it incurs a tiny additive latency overhead that diminishes with message sizes, while achieving near-optimal network utilization.

SwiftPaxos: Fast Geo-Replicated State Machines

Fedor Ryabinin, IMDEA Software Institute and Universidad Politécnica de Madrid; Alexey Gotsman, IMDEA Software Institute; Pierre Sutra, Télécom SudParis and INRIA

Available Media

Cloud services improve their availability by replicating data across sites in different geographical regions. A variety of state-machine replication protocols have been proposed for this setting that reduce the latency under workloads with low contention. However, when contention increases, these protocols may deliver lower performance than Paxos. This paper introduces SwiftPaxos—a protocol that lowers the best-case latency in comparison to Paxos without hurting the worst-case one. SwiftPaxos executes a command in 2 message delays if there is no contention, and in 3 message delays otherwise. To achieve this, the protocol allows replicas to vote on the order in which they receive state-machine commands. Differently from previous protocols, SwiftPaxos permits a replica to vote twice: first for its own ordering proposal, and then to follow the leader. This mechanism avoids restarting the voting process when a disagreement occurs among replicas, saving computation time and message delays. Our evaluation shows that the throughput of SwiftPaxos is up to 2.9× better than state-of-the-art alternatives.

The Bedrock of Byzantine Fault Tolerance: A Unified Platform for BFT Protocols Analysis, Implementation, and Experimentation

Mohammad Javad Amiri, Stony Brook University; Chenyuan Wu, University of Pennsylvania; Divyakant Agrawal and Amr El Abbadi, UC Santa Barbara; Boon Thau Loo, University of Pennsylvania; Mohammad Sadoghi, UC Davis
Awarded Outstanding Paper!

Available Media

Byzantine Fault-Tolerant (BFT) protocols cover a broad spectrum of design dimensions from infrastructure settings, such as the communication topology, to more technical features, such as commitment strategy and even fundamental social choice properties like order-fairness. The proliferation of different protocols has made it difficult to navigate the BFT landscape, let alone determine the protocol that best meets application needs. This paper presents Bedrock, a unified platform for BFT protocols analysis, implementation, and experimentation. Bedrock proposes a design space consisting of a set of dimensions and explores several design choices that capture the trade-offs between different design space dimensions. Within Bedrock, a wide range of BFT protocols can be implemented and uniformly evaluated under a unified deployment environment.

DINT: Fast In-Kernel Distributed Transactions with eBPF

Yang Zhou, Harvard University; Xingyu Xiang, Peking University; Matthew Kiley, Harvard University; Sowmya Dharanipragada, Cornell University; Minlan Yu, Harvard University

Available Media

Serializable distributed in-memory transactions are important building blocks for data center applications. To achieve high throughput and low latency, existing distributed transaction systems eschew the kernel networking stack and rely heavily on kernel-bypass networking techniques such as RDMA and DPDK. However, kernel-bypass networking techniques generally suffer from security, isolation, protection, maintainability, and debuggability issues, while the kernel networking stack supports these properties well, but performs poorly.

We present DINT, a kernel networking stack-based distributed transaction system that achieves kernel-bypass-like throughput and latency. To gain the performance back under the kernel stack, DINT offloads frequent-path transaction operations directly into the kernel via eBPF techniques without kernel modifications or customized kernel modules, avoiding most of the kernel stack overheads. DINT does not lose the good properties of the kernel stack, as eBPF is a kernel-native technique on modern OSes. On typical transaction workloads, DINT even achieves up to 2.6× higher throughput than using a DPDK-based kernel-bypass stack, while only adding at most 10%/16% average/99th-tail unloaded latency.

Track 2

Programming the Network: Part 1

Session Chair: Laurent Vanbever, ETH Zürich

Magnolia Room

Brain-on-Switch: Towards Advanced Intelligent Network Data Plane via NN-Driven Traffic Analysis at Line-Speed

Jinzhu Yan and Haotian Xu, Tsinghua University; Zhuotao Liu, Qi Li, Ke Xu, Mingwei Xu, and Jianping Wu, Tsinghua University and Zhongguancun Laboratory

Available Media

The emerging programmable networks sparked significant research on Intelligent Network Data Plane (INDP), which achieves learning-based traffic analysis at line-speed. Prior art in INDP focuses on deploying tree/forest models on the data plane. We observe a fundamental limitation in tree-based INDP approaches: although it is possible to represent even larger tree/forest tables on the data plane, the flow features that are computable on the data plane are fundamentally limited by hardware constraints. In this paper, we present BoS to push the boundaries of INDP by enabling Neural Network (NN) driven traffic analysis at line-speed. Many types of NNs (such as Recurrent Neural Network (RNN), and transformers) that are designed to work with sequential data have advantages over tree-based models, because they can take raw network data as input without complex feature computations on the fly. However, the challenge is significant: the recurrent computation scheme used in RNN inference is fundamentally different from the match-action paradigm used on the network data plane. BoS addresses this challenge by (i) designing a novel data plane friendly RNN architecture that can execute unlimited RNN time steps with limited data plane stages, effectively achieving line-speed RNN inference; and (ii) complementing the on-switch RNN model with an off-switch transformer-based traffic analysis module to further boost the overall performance. We implement a prototype of BoS using a P4 programmable switch as our data plane, and extensively evaluate it over multiple traffic analysis tasks. The results show that BoS outperforms state-of-the-art in both analysis accuracy and scalability.

The Eternal Tussle: Exploring the Role of Centralization in IPFS

Yiluo Wei, Hong Kong University of Science & Technology (GZ); Dennis Trautwein and Yiannis Psaras, Protocol Labs; Ignacio Castro, Queen Mary University of London; Will Scott, Protocol Labs; Aravindh Raman, Brave Software; Gareth Tyson, Hong Kong University of Science & Technology (GZ)

Available Media

Web centralization and consolidation has created potential single points of failure, e.g., in areas such as content hosting, name resolution, and certification. The "Decentralized Web", led by open-source software implementations, attempts to build decentralized alternatives. The InterPlanetary File System (IPFS) is part of this effort and attempts to provide a decentralized layer for object storage and retrieval. This comes with challenges, though: decentralization can increase complexity and overhead, as well as compromise performance and scalability. As the core maintainers of IPFS, we have therefore begun to explore more hybrid approaches. This paper reports on our experiences building three centralized components within IPFS: (i) InterPlanetary Network Indexers, which provides an alternative centralized method for content indexing; (ii) Hydra Boosters, which are strategic DHT nodes that assist IPFS in content routing; and (iii) HTTP Gateways, which are a public access point for users to retrieve IPFS-hosted content. Through this approach, we trade off the level of decentralization within IPFS in an attempt to gain certain benefits of centralization. We evaluate the performance of these components and demonstrate their ability to successfully address the challenges that IPFS faces.

BBQ: A Fast and Scalable Integer Priority Queue for Hardware Packet Scheduling

Nirav Atre, Hugo Sadok, and Justine Sherry, Carnegie Mellon University

Available Media

The need for fairness, strong isolation, and fine-grained control over network traffic in multi-tenant cloud settings has engendered a rich literature on packet scheduling in switches and programmable hardware. Recent proposals for hardware scheduling primitives (e.g., PIFO, PIEO, BMW-Tree) have enabled run-time programmable packet schedulers, considerably expanding the suite of scheduling policies that can be applied to network traffic. However, no existing solution can be practically deployed on modern switches and NICs because they either do not scale to the number of elements required by these devices or fail to deliver good throughput, thus requiring an impractical number of replicas.

In this work, we ask: is it possible to achieve priority packet scheduling at line-rate while supporting a large number of flows? Our key insight is to leverage a scheduling primitive used previously in software – called Hierarchical Find First Set – and port this to a highly pipeline-parallel hardware design. We present the architecture and implementation of the Bitmapped Bucket Queue (BBQ), a hardware-based integer priority queue that supports a wide range of scheduling policies (via a PIFO-like abstraction). BBQ, for the first time, supports hundreds of thousands of concurrent flows while guaranteeing 100 Gbps line rate (148.8 Mpps) on FPGAs and 1 Tbps (1,488 Mpps) line rate on ASICs. We demonstrate this by implementing BBQ on a commodity FPGA where it is capable of supporting over 100K flows and 32K priorities at 300 MHz, 3× the packet rate of similar hardware priority queue designs. On ASIC, we can synthesize 100K elements at 3.1 GHz using a 7nm process.
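To make the Hierarchical Find First Set idea concrete, here is a small software sketch of a two-level bitmap integer priority queue: a summary bitmap points at non-empty groups of buckets, so the highest-priority non-empty bucket is found with two find-first-set operations. This is only an illustration of the primitive in Python, not BBQ's pipelined hardware design; the word size and bucket count are arbitrary.

```python
# Two-level bitmap priority queue: lower numeric priority is served first.
from collections import deque

WORD = 64                                   # bits per bitmap word

def ffs(x):
    """Index of the least-significant set bit (x > 0)."""
    return (x & -x).bit_length() - 1

class BitmapPriorityQueue:
    def __init__(self, num_priorities=WORD * WORD):
        self.buckets = [deque() for _ in range(num_priorities)]
        self.leaf = [0] * ((num_priorities + WORD - 1) // WORD)  # one bit per bucket
        self.root = 0                                            # one bit per leaf word

    def push(self, prio, item):
        self.buckets[prio].append(item)
        self.leaf[prio // WORD] |= 1 << (prio % WORD)
        self.root |= 1 << (prio // WORD)

    def pop(self):
        if not self.root:
            return None
        w = ffs(self.root)                   # first leaf word with a non-empty bucket
        b = w * WORD + ffs(self.leaf[w])     # first non-empty bucket in that word
        item = self.buckets[b].popleft()
        if not self.buckets[b]:              # clear bits when the bucket empties
            self.leaf[w] &= ~(1 << (b % WORD))
            if not self.leaf[w]:
                self.root &= ~(1 << w)
        return b, item

q = BitmapPriorityQueue()
q.push(70, "pkt-a"); q.push(3, "pkt-b")
print(q.pop(), q.pop())   # (3, 'pkt-b') (70, 'pkt-a')
```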

Sirius: Composing Network Function Chains into P4-Capable Edge Gateways

Jiaqi Gao, Jiamin Cao, Yifan Li, Mengqi Liu, Ming Tang, Dennis Cai, and Ennan Zhai, Alibaba Cloud

Available Media

Alibaba Cloud designs and deploys P4-capable gateways to accelerate the processing of the diverse business traffic in the edge cloud. Since the programmable ASIC in the gateway only accepts a monolithic, pipelined P4 program, the dozens of network function chains for different business traffic have to be composed into one. This is non-trivial due to the contention between the complexity of network function chains and the limited resource in the programmable ASIC. In this paper, we present Sirius, a system that automates the network function chain composition process. Sirius synthesizes tables to identify which business traffic the input packet belongs to, pipelines loops in the merged network function graph via recirculations, and partitions the graph between programmable ASIC and CPU when the required memory consumption exceeds the ASIC’s capability. So far, Sirius has automated network function arrangement in hundreds of gateways, and has effectively decreased our programmers’ workload by three orders of magnitude, from weeks to minutes.

Empower Programmable Pipeline for Advanced Stateful Packet Processing

Yong Feng and Zhikang Chen, Tsinghua University; Haoyu Song, Futurewei Technologies; Yinchao Zhang, Hanyi Zhou, Ruoyu Sun, Wenkuo Dong, Peng Lu, Shuxin Liu, and Chuwen Zhang, Tsinghua University; Yang Xu, Fudan University; Bin Liu, Tsinghua University

Available Media

Programmable pipelines offer flexible, high-throughput packet processing capability, but only to some extent. When more advanced dataplane functions beyond basic packet processing and forwarding are desired, the pipeline becomes handicapped. The fundamental reason is that most stateful operations require backward cross-stage data passing and pipeline stalling for state update and consistency, which are anomalous to a standard pipeline. To solve the problem, we augment the pipeline with a low-cost, yet fast side ring to facilitate the backward data passing. We further apply the speculative execution technique to avoid pipeline stalling. The resulting architecture, RAPID, supports native and generic stateful function programming using the enhanced P4 language. We build an FPGA-based prototype to evaluate the system, and a software emulator to assess the cost and performance of an ASIC implementation. We realize several stateful applications enabled by RAPID to show how it extends a programmable dataplane's potential to a new level.

3:40 pm–4:10 pm

Break with Refreshments

Mezzanine

4:10 pm–5:50 pm

Track 1

Video

Session Chair: Keith Winstein, Stanford University

Santa Clara Ballroom

GRACE: Loss-Resilient Real-Time Video through Neural Codecs

Yihua Cheng, Ziyi Zhang, Hanchen Li, Anton Arapin, and Yue Zhang, The University of Chicago; Qizheng Zhang, Stanford University; Yuhan Liu, Kuntai Du, and Xu Zhang, The University of Chicago; Francis Y. Yan, Microsoft; Amrita Mazumdar, NVIDIA; Nick Feamster and Junchen Jiang, The University of Chicago

Available Media

In real-time video communication, retransmitting lost packets over high-latency networks is not viable due to strict latency requirements. To counter packet losses without retransmission, two primary strategies are employed—encoder-based forward error correction (FEC) and decoder-based error concealment. The former encodes data with redundancy before transmission, yet determining the optimal redundancy level in advance proves challenging. The latter reconstructs video from partially received frames, but dividing a frame into independently coded partitions inherently compromises compression efficiency, and the lost information cannot be effectively recovered by the decoder without adapting the encoder.

We present a loss-resilient real-time video system called GRACE, which preserves the user’s quality of experience (QoE) across a wide range of packet losses through a new neural video codec. Central to GRACE’s enhanced loss resilience is its joint training of the neural encoder and decoder under a spectrum of simulated packet losses. In lossless scenarios, GRACE achieves video quality on par with conventional codecs (e.g., H.265). As the loss rate escalates, GRACE exhibits a more graceful, less pronounced decline in quality, consistently outperforming other loss-resilient schemes. Through extensive evaluation on various videos and real network traces, we demonstrate that GRACE reduces undecodable frames by 95% and stall duration by 90% compared with FEC, while markedly boosting video quality over error concealment methods. In a user study with 240 crowdsourced participants and 960 subjective ratings, GRACE registers a 38% higher mean opinion score (MOS) than other baselines.

LiFteR: Unleash Learned Codecs in Video Streaming with Loose Frame Referencing

Bo Chen, University of Illinois at Urbana-Champaign; Zhisheng Yan, George Mason University; Yinjie Zhang, Zhe Yang, and Klara Nahrstedt, University of Illinois at Urbana-Champaign

Available Media

Video codecs are essential for video streaming. While traditional codecs like AVC and HEVC are successful, learned codecs built on deep neural networks (DNNs) are gaining popularity due to their superior coding efficiency and quality of experience (QoE) in video streaming. However, using learned codecs built with sophisticated DNNs in video streaming leads to slow decoding and low frame rate, thereby degrading the QoE. The fundamental problem is the tight frame referencing design adopted by most codecs, which delays the processing of the current frame until its immediate predecessor frame is reconstructed. To overcome this limitation, we propose LiFteR, a novel video streaming system that operates a learned video codec with loose frame referencing (LFR). LFR is a unique frame referencing paradigm that redefines the reference relation between frames and allows parallelism in the learned video codec to boost the frame rate. LiFteR has three key designs: (i) the LFR video dispatcher that routes video data to the codec based on LFR, (ii) the LFR learned codec that enhances coding efficiency in LFR with minimal impact on decoding speed, and (iii) streaming support that enables adaptive bitrate streaming with learned codecs in existing infrastructures. In our evaluation, LiFteR consistently outperforms existing video streaming systems. Compared to the existing best-performing learned and traditional systems, LiFteR demonstrates up to 23.8% and 19.7% QoE gain, respectively. Furthermore, LiFteR achieves up to a 3.2× frame rate improvement through frame rate configuration.

MadEye: Boosting Live Video Analytics Accuracy with Adaptive Camera Configurations

Mike Wong and Murali Ramanujam, Princeton University; Guha Balakrishnan, Rice University; Ravi Netravali, Princeton University

Available Media

Camera orientations (i.e., rotation and zoom) govern the content that a camera captures in a given scene, which in turn heavily influences the accuracy of live video analytics pipelines. However, existing analytics approaches leave this crucial adaptation knob untouched, instead opting to only alter the way that captured images from fixed orientations are encoded, streamed, and analyzed. We present MadEye, a camera-server system that automatically and continually adapts orientations to maximize accuracy for the workload and resource constraints at hand. To realize this using commodity pan-tilt-zoom (PTZ) cameras, MadEye embeds (1) a search algorithm that rapidly explores the massive space of orientations to identify a fruitful subset at each time, and (2) a novel knowledge distillation strategy to efficiently (with only camera resources) select the ones that maximize workload accuracy. Experiments on diverse workloads show that MadEye boosts accuracy by 2.9-25.7% for the same resource usage, or achieves the same accuracy with 2-3.7× lower resource costs.

Gemino: Practical and Robust Neural Compression for Video Conferencing

Vibhaalakshmi Sivaraman, Pantea Karimi, Vedantha Venkatapathy, and Mehrdad Khani, Massachusetts Institute of Technology; Sadjad Fouladi, Microsoft Research; Mohammad Alizadeh, Frédo Durand, and Vivienne Sze, Massachusetts Institute of Technology

Available Media

Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs simply cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking head videos at very low bitrates using sparse representations of each frame such as facial landmark information. However, these approaches produce poor reconstructions in scenarios with major movement or occlusions over the course of a call, and do not scale to higher resolutions. We design Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture, hair, etc.) based on information extracted from a single high-resolution reference image. We use a multi-scale architecture that runs different components of the model at different resolutions, allowing it to scale to resolutions comparable to 720p, and we personalize the model to learn specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024x1024 videos in real-time on a Titan X GPU, and achieves 2.2–5× lower bitrate than traditional video codecs for the same perceptual quality.

ARTEMIS: Adaptive Bitrate Ladder Optimization for Live Video Streaming

Farzad Tashtarian, Christian Doppler Laboratory ATHENA, Alpen-Adria Universität Klagenfurt; Abdelhak Bentaleb, Concordia University; Hadi Amirpour, Christian Doppler Laboratory ATHENA, Alpen-Adria Universität Klagenfurt; Sergey Gorinsky, IMDEA Networks Institute; Junchen Jiang, University of Chicago; Hermann Hellwagner and Christian Timmerer, Christian Doppler Laboratory ATHENA, Alpen-Adria Universität Klagenfurt

Available Media

Live streaming of segmented videos over the Hypertext Transfer Protocol (HTTP) is increasingly popular and serves heterogeneous clients by offering each segment in multiple representations. A bitrate ladder expresses this choice as an ordered list of bitrate-resolution pairs. Whereas existing solutions for HTTP-based live streaming use a static bitrate ladder, the fixed ladders struggle to appropriately accommodate the dynamics in the video content and network-conditioned client capabilities. This paper proposes ARTEMIS as a practical scalable alternative that dynamically configures the bitrate ladder depending on the content complexity, network conditions, and clients' statistics. ARTEMIS seamlessly integrates with the end-to-end streaming pipeline and operates transparently to video encoders and clients. We develop a cloud-based implementation of ARTEMIS and conduct extensive real-world and trace-driven experiments. The experimental comparison vs. existing prominent bitrate ladders demonstrates that live streaming with ARTEMIS outperforms all baselines, reducing encoding computation by 25% and end-to-end latency by 18%, while increasing quality of experience by 11%.

Track 2

Sharing the Network

Session Chair: Dan Ports, Microsoft Research

Magnolia Room

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

Vamsi Addanki, Maciej Pacut, and Stefan Schmid, TU Berlin

Available Media

Packet buffers in datacenter switches are shared across all the switch ports in order to improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. Literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees compared to drop-tail algorithms. Unfortunately, switches are unable to benefit from these algorithms due to lack of support for push-out operations in hardware. Our key observation is that drop-tail buffers can emulate push-out buffers if the future packet arrivals are known ahead of time. This suggests that augmenting drop-tail algorithms with predictions about the future arrivals has the potential to significantly improve performance.

This paper is the first research attempt in this direction. We propose CREDENCE, a drop-tail buffer sharing algorithm augmented with machine-learned predictions. CREDENCE can unlock the performance only attainable by push-out algorithms so far. Its performance hinges on the accuracy of predictions. Specifically, CREDENCE achieves near-optimal performance of the best known push-out algorithm LQD (Longest Queue Drop) with perfect predictions, but gracefully degrades to the performance of the simplest drop-tail algorithm Complete Sharing when the prediction error gets arbitrarily worse. Our evaluations show that CREDENCE improves throughput by 1.5× compared to traditional approaches. In terms of flow completion times, we show that CREDENCE improves upon the state-of-the-art approaches by up to 95% using off-the-shelf machine learning techniques that are also practical in today’s hardware. We believe this work opens several interesting future work opportunities both in systems and theory that we discuss at the end of this paper.
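As a toy illustration of the observation above (that a drop-tail buffer with knowledge of future arrivals can mimic a push-out policy), the sketch below refuses, at arrival time, packets that a Longest-Queue-Drop buffer is predicted to evict later anyway. This is our own illustration, not CREDENCE's algorithm; the predictor interface and look-ahead rule are hypothetical.

```python
# Prediction-augmented drop-tail admission (illustrative only).
def lqd_admit_with_predictions(queues, buffer_limit, arrival_port, predictor):
    """Return True to enqueue the arriving packet on arrival_port, False to drop it."""
    occupancy = sum(queues.values())
    if occupancy >= buffer_limit:
        return False                              # classic drop-tail: buffer is full
    predicted = predictor()                       # e.g., expected near-term arrivals per port
    future_len = {p: queues[p] + predicted.get(p, 0) for p in queues}
    future_total = occupancy + sum(predicted.values())
    if future_total > buffer_limit and arrival_port == max(future_len, key=future_len.get):
        # The buffer is predicted to overflow and this port is predicted to hold the
        # longest queue, i.e., the queue LQD would push out from: drop early instead.
        return False
    return True

# Example with a hypothetical static prediction:
queues = {"p0": 5, "p1": 1}
print(lqd_admit_with_predictions(queues, buffer_limit=8, arrival_port="p0",
                                 predictor=lambda: {"p0": 3, "p1": 1}))  # False
print(lqd_admit_with_predictions(queues, buffer_limit=8, arrival_port="p1",
                                 predictor=lambda: {"p0": 3, "p1": 1}))  # True
```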

Seer: Enabling Future-Aware Online Caching in Networked Systems

Jason Lei and Vishal Shrivastav, Purdue University

Available Media

State-intensive network and distributed applications rely heavily on online caching heuristics for high performance. However, there remains a fundamental performance gap between online caching heuristics and the optimal offline caching algorithm due to the lack of visibility into future state access requests in an online setting. Driven by the observation that state access requests in network and distributed applications are often carried in incoming network packets, we present Seer, an online caching solution for networked systems, that exploits the delays experienced by a packet inside a network—most prominently, transmission and queuing delays—to notify in advance of future packet arrivals to the target network nodes (switches/routers/middleboxes/end-hosts) implementing caching. Using this as a building block, Seer presents the design of an online cache manager that leverages visibility into (partial) set of future state access requests to make smarter prefetching and cache eviction decisions. Our evaluations show that Seer achieves up to 65% lower cache miss ratio and up to 78% lower flow completion time compared to LRU for key network applications over realistic workloads.
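The sketch below shows the basic future-aware eviction rule this visibility enables: restrict the classic farthest-in-future (Belady) choice to the partial window of accesses already visible, e.g., keys carried in packets queued toward the node. The interface is hypothetical and Seer's cache manager additionally prefetches based on the same visibility.

```python
# Future-aware eviction over a partial look-ahead window (illustrative sketch).
def evict_with_lookahead(cache_keys, lookahead):
    """Pick a victim: prefer keys absent from the look-ahead window, otherwise
    the key whose next known use is farthest in the future."""
    next_use = {}
    for i, key in enumerate(lookahead):
        next_use.setdefault(key, i)               # first (soonest) known future use
    return max(cache_keys, key=lambda k: next_use.get(k, float("inf")))

# Example: 'a' reappears soonest, 'c' never appears in the window -> evict 'c'.
print(evict_with_lookahead({"a", "b", "c"}, lookahead=["a", "d", "b", "a"]))
```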

Reverie: Low Pass Filter-Based Switch Buffer Sharing for Datacenters with RDMA and TCP Traffic

Vamsi Addanki, TU Berlin; Wei Bai, Microsoft Research; Stefan Schmid, TU Berlin; Maria Apostolaki, Princeton University

Available Media

The switch buffers in datacenters today are shared by traffic classes with different loss tolerance and reaction to congestion signals. In particular, while legacy applications use loss-tolerant transport, e.g., DCTCP, newer applications require lossless datacenter transport, e.g., RDMA over Converged Ethernet. The allocation of buffers for this diverse traffic mix is managed by a buffer-sharing scheme. Unfortunately, as we analytically show in this paper, the buffer-sharing practices of today's datacenters pose a fundamental limitation to effectively isolate RDMA and TCP while also maximizing burst absorption. We identify two root causes: (i) the buffer-sharing for RDMA and TCP relies on two independent and often conflicting views of the buffer, namely ingress and egress; and (ii) the buffer-sharing scheme micromanages the buffer and overreacts to the changes in its occupancy during transient congestion.

In this paper, we present Reverie, a buffer-sharing scheme which, unlike prior works, is suitable for both lossless and loss-tolerant traffic classes, providing isolation as well as superior burst absorption. At the core of Reverie lies a unified (consolidated ingress and egress) admission control that jointly optimizes the buffers for both traffic classes. Reverie allocates buffer based on a low-pass filter that naturally absorbs bursty queue lengths during transient congestion within the buffer limits. Our evaluation shows that Reverie can improve the performance of RDMA as well as TCP in terms of flow completion times by up to 33%.
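
The low-pass-filter idea can be illustrated with a few lines of Python. This is a minimal sketch only: the filter coefficient, the buffer size, and the single-threshold admission rule below are illustrative assumptions, whereas Reverie's actual admission control consolidates ingress and egress views and respects the buffer limits for lossless traffic.

    # Minimal sketch of low-pass-filter-based admission. Instead of reacting to
    # the instantaneous queue length, the admission decision tracks a low-pass-
    # filtered (EWMA) occupancy, so short bursts are absorbed without drops.
    ALPHA = 0.1          # filter coefficient (assumed value)
    BUFFER = 1000        # total shared buffer in cells (assumed)

    class LowPassAdmission:
        def __init__(self):
            self.filtered = 0.0              # low-pass-filtered occupancy

        def on_packet(self, queue_len, pkt_cells):
            # Update the filter with the current occupancy sample.
            self.filtered = (1 - ALPHA) * self.filtered + ALPHA * queue_len
            # Admit while the *filtered* occupancy leaves headroom; a transient
            # spike in queue_len alone does not cause a drop.
            return self.filtered + pkt_cells <= BUFFER

    ctl = LowPassAdmission()
    print(ctl.on_packet(queue_len=900, pkt_cells=10))        # transient burst: admitted
    print(all(ctl.on_packet(995, 10) for _ in range(200)))   # sustained near-full: eventually refused -> False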

Precise Data Center Traffic Engineering with Constrained Hardware Resources

Shawn Shuoshuo Chen, Carnegie Mellon University; Keqiang He, AirBNB; Rui Wang, Google; Srinivasan Seshan and Peter Steenkiste, Carnegie Mellon University

Available Media

Data center traffic engineering (TE) routes flows over a set of available paths following custom weight distributions to achieve optimal load balancing or flow throughput. However, as a result of hardware constraints, it is challenging, and often impossible for larger data center networks, to precisely implement the TE weight distributions on the data plane switches. The resulting precision loss in the TE implementation causes load imbalances that can result in congestion and traffic loss.

Instead of treating all flows equally, we adapt the hardware resource allocation to a flow’s traffic volume and its contribution to the overall precision loss. We intelligently prune select ports in weight distributions and merge identical distributions to free up hardware resources. Evaluation using realistic traffic loads shows that our techniques approximate ideal TE solutions under various scenarios within 7% error, compared to a 67% error for today’s state-of-the-art approach. In addition, our design avoids traffic loss triggered by switch rule overflow. Finally, the execution time is 10× faster than the current approach.
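
The two ideas above, pruning low-impact ports and merging identical distributions, can be illustrated with a small sketch. The weights, the rounding, and the "keep the k heaviest ports" rule are illustrative stand-ins for the paper's adaptive, volume-aware policies.

    # Illustrative sketch (not the paper's exact algorithms): prune low-impact
    # ports from a flow's weight distribution, then merge flows whose pruned
    # distributions are identical so they share one group-table entry.
    def prune(weights, keep):
        """Keep the `keep` heaviest ports and renormalize."""
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:keep]
        total = sum(w for _, w in top)
        return {port: round(w / total, 3) for port, w in top}

    def merge(flows, keep):
        """Map each flow to a shared hardware entry keyed by its pruned distribution."""
        table = {}                                   # distribution -> entry id
        assignment = {}
        for flow, weights in flows.items():
            key = tuple(sorted(prune(weights, keep).items()))
            assignment[flow] = table.setdefault(key, len(table))
        return assignment, table

    flows = {
        "f1": {"p1": 0.50, "p2": 0.30, "p3": 0.15, "p4": 0.05},
        "f2": {"p1": 0.50, "p2": 0.30, "p3": 0.12, "p4": 0.08},  # same after pruning
        "f3": {"p5": 0.70, "p6": 0.30},
    }
    assignment, table = merge(flows, keep=2)
    print(assignment)                                # f1 and f2 share an entry
    print(len(table), "hardware entries instead of", len(flows))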

Multitenant In-Network Acceleration with SwitchVM

Sajy Khashab, Alon Rashelbach, and Mark Silberstein, Technion

Available Media

We propose a practical approach to implementing multitenancy on programmable network switches to make in-network acceleration accessible to cloud users. We introduce a Switch Virtual Machine (SwitchVM), which is deployed on the switches and offers an expressive instruction set and program-state abstractions. Tenant programs, called Data-Plane Filters (DPFs), are executed on top of SwitchVM in a sandbox with memory, network, and state isolation policies controlled by network operators. The packets that trigger DPF execution include either the code to execute or a reference to the DPFs deployed in the switch. DPFs are Turing-complete, may maintain state in the packet and in switch virtual memory, may form a dynamic chain, and may steer packets to desired destinations, all while enforcing the operator’s policies.

We demonstrate that this idea is practical by prototyping SwitchVM in P4 on Intel Tofino switches. We describe a variety of use cases that SwitchVM supports, and implement three complex applications from prior works: a key-value store cache, a load-aware load balancer, and a Paxos accelerator. We also show that SwitchVM provides strong performance isolation and zero-overhead runtime programmability, may hold two orders of magnitude more in-switch programs than existing techniques, and may support up to thirty thousand concurrent tenants, each with its private state.

Wednesday, April 17

8:00 am–9:00 am

Continental Breakfast

Mezzanine

9:00 am–10:20 am

Track 1

ML at Scale

Session Chair: Rachee Singh

Santa Clara Ballroom

Characterization of Large Language Model Development in the Datacenter

Qinghao Hu, Shanghai AI Laboratory and S-Lab, Nanyang Technological University; Zhisheng Ye, Shanghai AI Laboratory and Peking University; Zerui Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University; Guoteng Wang, Shanghai AI Laboratory; Meng Zhang and Qiaoling Chen, Shanghai AI Laboratory and S-Lab, Nanyang Technological University; Peng Sun, Shanghai AI Laboratory and SenseTime Research; Dahua Lin, Shanghai AI Laboratory and CUHK; Xiaolin Wang and Yingwei Luo, Peking University; Yonggang Wen and Tianwei Zhang, Nanyang Technological University

Available Media

Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs, a process often riddled with numerous challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery, and (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.

QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models

Kiran Kumar Matam, Hani Ramezani, Fan Wang, Zeliang Chen, Yue Dong, Maomao Ding, Zhiwei Zhao, Zhengyu Zhang, Ellie Wen, and Assaf Eisenman, Meta, Inc.

Available Media

Deep learning recommendation models play an important role in online companies and consume a major part of the AI infrastructure dedicated to training and inference. The accuracy of these models depends heavily on how quickly they are published on the serving side. One of the main challenges in improving the model update latency and frequency is the model size, which has reached the order of terabytes and is expected to increase further in the future. The large model size causes large latency (and write bandwidth) to update the model in geo-distributed servers. We present QuickUpdate, a system for real-time personalization of large-scale recommendation models that publishes the model at high frequency as part of online training, providing serving accuracy that is comparable to that of a fully fresh model. The system employs novel techniques to minimize the required write bandwidth, including prioritized parameter updates, intermittent full model updates, model transformations, and relaxed consistency. We evaluate QuickUpdate using real-world data, on one of the largest production models in Meta. The results show that QuickUpdate provides serving accuracy that is comparable to a fully fresh model, while reducing the average published update size and the required bandwidth by over 13x. It provides a scalable solution for serving production models in a real-time fashion, which is otherwise not feasible at scale due to limited network and storage bandwidth.
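
The prioritized-update idea can be sketched as "ship only the rows that drifted the most since the last publish." The drift metric, the budget, and the table sizes below are illustrative assumptions; QuickUpdate's production policy additionally layers intermittent full updates, model transformations, and relaxed consistency on top.

    # Toy illustration of prioritized delta publishing for a large embedding
    # table: only the rows whose parameters moved the most since the last
    # publish are shipped each interval.
    import numpy as np

    rng = np.random.default_rng(0)
    ROWS, DIM = 1000, 16
    served = rng.normal(size=(ROWS, DIM))        # copy on the serving side
    training = served.copy()                     # copy being trained online

    def train_step():
        """Pretend online training nudges a sparse subset of rows."""
        hot = rng.choice(ROWS, size=50, replace=False)
        training[hot] += 0.1 * rng.normal(size=(50, DIM))

    def publish_topk(budget_rows):
        """Ship only the rows with the largest drift (prioritized update)."""
        drift = np.linalg.norm(training - served, axis=1)
        top = np.argsort(drift)[-budget_rows:]
        served[top] = training[top]
        return top.size, float(drift.sum())

    for step in range(5):
        train_step()
        sent, drift_before = publish_topk(budget_rows=20)
        print(f"step {step}: shipped {sent} rows; drift before publish {drift_before:.1f}")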

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Ziheng Jiang and Haibin Lin, ByteDance; Yinmin Zhong, Peking University; Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, and Jianxi Ye, ByteDance; Xin Jin, Peking University; Xin Liu, ByteDance

Available Media

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, and Henri Bahini, Google

Available Media

TPUv4 (Tensor Processing Unit) is Google’s 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software infrastructure that allows TPUv4 supercomputers to operate at scale, including features for automatic fault resiliency and hardware recovery. We adopt a software-defined networking (SDN) approach to manage TPUv4’s high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to dynamically configure routes to work around machine, chip and link failures. Our infrastructure detects failures and automatically triggers reconfiguration to minimize disruption to running workloads, as well as initiating remediation and repair workflows for the affected components. Similar techniques interface with maintenance and upgrade workflows for both hardware and software. Our dynamic reconfiguration approach allows our TPUv4 supercomputers to achieve 99.98% system availability, gracefully handling hardware outages experienced by ~1% of the training jobs.

Track 2

Satellites and Things

Session Chair: Zerina Kapetanovic, Stanford University

Magnolia Room

NN-Defined Modulator: Reconfigurable and Portable Software Modulator on IoT Gateways

Jiazhao Wang and Wenchao Jiang, Singapore University of Technology and Design; Ruofeng Liu, University of Minnesota; Bin Hu, University of Southern California; Demin Gao, Nanjing Forestry University; Shuai Wang, Southeast University

Available Media

A physical-layer modulator is a vital component of an IoT gateway, mapping symbols to signals. However, due to the soldered hardware chipsets on the gateway's motherboard or the diverse toolkits on different platforms for software radio, existing solutions either have limited extensibility or are platform-specific. Such limitations are hard to ignore now that modulation schemes and hardware platforms have become extremely diverse. This paper presents a new paradigm of using neural networks as an abstraction layer for physical-layer modulators in IoT gateway devices, referred to as NN-defined modulators. Our approach addresses the challenges of extensibility and portability for multiple technologies on various hardware platforms. The proposed NN-defined modulator uses a model-driven methodology rooted in solid mathematical foundations while having native support for hardware acceleration and portability to heterogeneous platforms. We evaluate NN-defined modulators on different platforms, including the Nvidia Jetson Nano and the Raspberry Pi. Evaluations demonstrate that our NN-defined modulator effectively operates as a conventional modulator and provides significant efficiency gains (up to 4.7× on the Nvidia Jetson Nano and 1.1× on the Raspberry Pi), indicating high portability. Furthermore, we show real-world applications that use our NN-defined modulators to generate ZigBee and WiFi packets, which are compliant with commodity TI CC2650 (ZigBee) and Intel AX201 (WiFi NIC) devices, respectively.
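
One way to see why a modulator can be packaged as a neural-network-style graph is that its core math is already a fixed linear layer. The sketch below expresses an OFDM modulator's IDFT as a matrix multiply and checks it against numpy's IFFT; the subcarrier count and QPSK mapping are illustrative, and this is not the paper's model architecture.

    # Minimal sketch: an OFDM modulator as a fixed linear "layer". The IDFT that
    # maps frequency-domain symbols to time-domain samples is a matrix multiply,
    # which is what makes a NN-style implementation portable and accelerable.
    import numpy as np

    N = 64                                               # subcarriers (assumed)
    k = np.arange(N)
    IDFT = np.exp(2j * np.pi * np.outer(k, k) / N) / N   # fixed layer weights

    def qpsk(bits):
        b = bits.reshape(-1, 2)
        return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

    bits = np.random.default_rng(1).integers(0, 2, size=2 * N)
    symbols = qpsk(bits)                     # one OFDM symbol's worth of QPSK
    time_samples = IDFT @ symbols            # "forward pass" of the linear layer

    # Sanity check: the plain numpy IFFT produces the same waveform.
    assert np.allclose(time_samples, np.fft.ifft(symbols))
    print(time_samples[:4])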

Democratizing Direct-to-Cell Low Earth Orbit Satellite Networks

Lixin Liu, Tsinghua University; Yuanjie Li and Hewu Li, Tsinghua University and Zhongguancun Laboratory; Jiabo Yang, Wei Liu, Jingyi Lan, Yufeng Wang, and Jiarui Li, Tsinghua University; Jianping Wu, Qian Wu, Jun Liu, and Zeqi Lai, Tsinghua University and Zhongguancun Laboratory
Awarded Outstanding Paper!

Available Media

Multi-tenant Low Earth Orbit (LEO) satellites emerge as a cost-effective win-win solution for direct 4G/5G access to our regular phones/IoTs anywhere on Earth. However, the current hop-by-hop stateful cellular session impedes this effort due to its need for tight functional coupling and stable service relationships among satellite operators, mobile operators, and users. Our empirical study with real satellite data shows that it restricts LEO satellites' serviceable areas, limits the use of available (possibly competitive) satellites, and suffers from signaling storms and dynamic many-to-many relationships under extreme LEO mobility. We thus devise MOSAIC to strive for self-serve multi-tenant LEO satellites. MOSAIC defines policy-embedded one-time tokens for pay-as-you-go local satellite access. These tokens allow satellites to self-serve users anywhere without relying on remote mobile operators, alleviate inter-satellite coordination so that competitive satellites can be used, and simplify many-to-many service relationships for on-demand multi-tenancy. MOSAIC is attack-resilient and incrementally deployable using our SIM-based solution. Our evaluations with real satellite data and a commodity 3GPP NTN protocol stack validate MOSAIC's viability.

Known Knowns and Unknowns: Near-realtime Earth Observation Via Query Bifurcation in Serval

Bill Tao, Om Chabra, Ishani Janveja, Indranil Gupta, and Deepak Vasisht, University of Illinois Urbana-Champaign

Available Media

Earth observation satellites, in low Earth orbits, are increasingly approaching near-continuous imaging of the Earth. Today, these satellites capture an image of every part of Earth every few hours. However, the networking capabilities haven’t caught up and can introduce delays of a few hours to days in getting these images to Earth. While this delay is acceptable for delay-tolerant applications like land cover maps and crop type identification, it is unacceptable for latency-sensitive applications like forest fire detection or disaster monitoring. We design Serval to enable near-realtime insights from Earth imagery for latency-sensitive applications despite the networking bottlenecks, by leveraging the emerging computational capabilities on satellites and ground stations. The key challenge for our work stems from the limited computational capabilities and power resources available on a satellite. We solve this challenge by leveraging the predictability of satellite orbits to bifurcate computation across satellites and ground stations. We evaluate Serval using trace-driven simulations and hardware emulations on a dataset comprising ten million images captured by the Planet Dove constellation of nearly 200 satellites. Serval reduces the end-to-end latency for high-priority queries from 71.71 hours (incurred by the state of the art) to 2 minutes, and the 90th percentile from 149 hours to 47 minutes.

Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things

Vaibhav Singh, Tusher Chakraborty, and Suraj Jog, Microsoft Research; Om Chabra and Deepak Vasisht, UIUC; Ranveer Chandra, Microsoft Research

Available Media

Low Earth Orbit satellite constellations are gaining traction for providing connectivity to low-power outdoor Internet of Things (IoT) devices. This is made possible by the development of low-cost, low-complexity pico-satellites that can be easily launched, offering global connectivity without the need for Earth-based gateways. In this paper, we report the space-to-Earth communication bottlenecks derived from our experience of deploying an IoT satellite. Specifically, we characterize the challenges posed by low link budgets, satellite motion, and packet collisions. To address these challenges, we design a new class of techniques that uses the Doppler shift caused by the satellite's motion as a unique signature for packet detection and decoding, even at low signal-to-noise ratios and in the presence of collisions. We integrate these techniques into our system, called Spectrumize, and evaluate its performance through both simulations and real-world deployments. Our evaluation shows that Spectrumize performs 3x better than the classic approach in detecting packets, with over 80% average accuracy in decoding.

10:20 am–10:50 am

Break with Refreshments

Mezzanine

10:50 am–12:30 pm

Track 1

Wide-Area and Edge

Session Chair: Ying Zhang, Meta

Santa Clara Ballroom

Application-Level Service Assurance with 5G RAN Slicing

Arjun Balasingam, MIT CSAIL; Manikanta Kotaru and Paramvir Bahl, Microsoft

Available Media

This paper presents Zipper, a novel Radio Access Network (RAN) slicing system that provides assurances of application-level throughput and latency. Existing RAN slicing systems optimize for slice-level assurance, but these methods fail to provide predictable network performance to individual mobile apps. Extending the slice-level formulation to the app level introduces an intractable optimization problem with exploding state and action spaces. To simplify the search space, Zipper casts the problem as a model predictive controller and explicitly tracks the network dynamics of each user. It uses an efficient algorithm to compute slice bandwidth allocations that meet each app's requirements. To assist operators with admission control policies, Zipper exposes a primitive that estimates whether there is bandwidth available to accommodate an incoming app's requirements.

We implemented Zipper on a production-class 5G virtual RAN testbed integrated with hooks to control slice bandwidth, and we evaluated it on real workloads, including video conferencing and virtual reality apps. On a representative RAN workload, our real-time implementation supports up to 200 apps and over 70 slices on a 100 MHz channel. Relative to a slice-level service assurance system, Zipper reduces tail throughput and latency violations, measured as a ratio of violation of the app's request, by 9×.

CHISEL: An optical slice of the wide-area network

Abhishek Vijaya Kumar, Cornell University; Bill Owens, NYSERnet; Nikolaj Bjørner, Binbin Guan, Yawei Yin, and Paramvir Bahl, Microsoft; Rachee Singh, Cornell University

Available Media

Network slicing reserves a portion of the physical resources of radio access networks and makes them available to consumers. Slices guarantee traffic isolation, strict bandwidth, and quality of service. However, the abstraction of slicing has been limited to access networks. We develop CHISEL, a system that dynamically carves slices of the wide-area network (WAN), enabling an end-to-end network slicing abstraction. CHISEL creates optical slices between WAN endpoints to avoid queueing and congestion delays inherent in packet-switched paths in WANs. CHISEL incrementally allocates optical spectrum on long-haul fiber to provision slices. This task is made challenging by the co-existence of data-carrying channels on the fiber and numerous physical constraints associated with provisioning optical paths, e.g., spectrum contiguity, continuity, and optical reach constraints. CHISEL leverages the empirical finding that cloud WANs have abundant optical spectrum to spare — 75% of optical spectrum on 75% of fiber spans is unused. CHISEL can optimally allocate terabits of slice requests while consuming minimal optical spectrum within seconds without increasing spectral fragmentation on fiber. CHISEL trades off optimality of slice bandwidth allocation for faster run-time, provisioning slices within 2% of optimal in less than 30 seconds in a commercial cloud WAN. Finally, CHISEL reduces the latency of provisioning optical slices on hardware by 10X. Compared to IP tunnels of equivalent capacity, CHISEL consumes 3.3X fewer router ports.
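
To make the spectrum-allocation constraints concrete, the toy below performs a first-fit search under the contiguity and continuity constraints mentioned above. It is only a sketch: CHISEL's allocator also models optical reach and optimizes rather than first-fits, and the slot count and span names here are made up.

    # Toy first-fit spectrum allocation: a slice needs the SAME contiguous block
    # of spectrum slots to be free on EVERY fiber span along its path.
    SLOTS = 16                                   # spectrum slots per span (assumed)

    def first_fit(spans, path, width):
        """Return the first slot index where `width` contiguous slots are free
        on every span in `path`, or None."""
        for start in range(SLOTS - width + 1):
            block = range(start, start + width)
            if all(all(not spans[s][i] for i in block) for s in path):
                for s in path:                   # reserve the block on each span
                    for i in block:
                        spans[s][i] = True
                return start
        return None

    spans = {s: [False] * SLOTS for s in ["A-B", "B-C", "C-D"]}
    print(first_fit(spans, ["A-B", "B-C"], width=4))          # 0
    print(first_fit(spans, ["B-C", "C-D"], width=4))          # 4 (slots 0-3 busy on B-C)
    print(first_fit(spans, ["A-B", "B-C", "C-D"], width=12))  # None: spectrum is fragmented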

LuoShen: A Hyper-Converged Programmable Gateway for Multi-Tenant Multi-Service Edge Clouds

Tian Pan, Kun Liu, Xionglie Wei, Yisong Qiao, Jun Hu, Zhiguo Li, Jun Liang, Tiesheng Cheng, Wenqiang Su, Jie Lu, Yuke Hong, Zhengzhong Wang, Zhi Xu, Chongjing Dai, Peiqiao Wang, Xuetao Jia, Jianyuan Lu, Enge Song, and Jun Zeng, Alibaba Cloud; Biao Lyu, Zhejiang University and Alibaba Cloud; Ennan Zhai, Alibaba Cloud; Jiao Zhang and Tao Huang, Purple Mountain Laboratories; Dennis Cai, Alibaba Cloud; Shunmin Zhu, Tsinghua University and Alibaba Cloud

Available Media

Edge clouds are expected to be a key revenue growth driver for cloud vendors in the next decade; however, simply replicating the public cloud's network infrastructure at the edge runs into deployment issues. At the edge, the challenge for cloud network design is to deliver the required performance under the stringent restrictions of hardware budget and deployment footprints, while retaining functional equivalence. To this end, we propose LuoShen, a hyper-converged gateway for multi-tenant multi-service edge clouds that consolidates the entire cloud network infrastructure into a 2U server switch with a P4-centric architecture. At the data plane, LuoShen conducts pipeline folding and fits the original overlay and underlay devices into the switch pipeline via meticulous on-chip resource budgeting. At the control plane, LuoShen relies on BGP peering to ensure inter-component reachability. LuoShen achieves 1.2Tbps throughput and reduces the upfront cost, deployment size, and power usage by 75%, 87%, and 60%, respectively, compared with the original cloud network architecture. It has been deployed in Alibaba Cloud at hundreds of edge sites.

Sprinter: Speeding Up High-Fidelity Crawling of the Modern Web

Ayush Goel and Jingyuan Zhu, University of Michigan; Ravi Netravali, Princeton University; Harsha V. Madhyastha, University of Southern California

Available Media

Crawling the web at scale forms the basis of many important systems: web search engines, smart assistants, generative AI, web archives, and so on. Yet, the research community has paid little attention to this workload in the last decade. In this paper, we highlight the need to revisit the notion that web crawling is a solved problem. Specifically, to discover and fetch all page resources dependent on JavaScript and modern web APIs, crawlers today have to employ compute-intensive web browsers. This significantly inflates the scale of the infrastructure necessary to crawl pages at high throughput.

To make web crawling more efficient without any loss of fidelity, we present Sprinter, which combines browser-based and browserless crawling to get the best of both. The key to Sprinter’s design is our observation that crawling workloads typically include many pages from every site that is crawled and, unlike in traditional user-facing page loads, there is significant potential to reuse client-side computations across pages. Taking advantage of this property, Sprinter crawls a small, carefully chosen subset of pages on each site using a browser, and then efficiently identifies and exploits opportunities to reuse the browser’s computations on other pages. Sprinter was able to crawl a corpus of 50,000 pages 5x faster than browser-based crawling, while still closely matching a browser in the set of resources fetched.

Hairpin: Rethinking Packet Loss Recovery in Edge-based Interactive Video Streaming

Zili Meng, Tsinghua University, Hong Kong University of Science and Technology, and Tencent; Xiao Kong and Jing Chen, Tsinghua University and Tencent; Bo Wang and Mingwei Xu, Tsinghua University; Rui Han and Honghao Liu, Tencent; Venkat Arun, UT Austin; Hongxin Hu, University at Buffalo, SUNY; Xue Wei, Tencent

Available Media

Interactive streaming requires minimizing stuttering events (or deadline misses for video frames) to ensure seamless interaction between users and applications. However, existing packet loss recovery mechanisms uniformly optimize redundancy for the initial transmission and retransmissions, which not only fails to satisfy the delay requirements of interactive streaming but also introduces considerable bandwidth costs. Our insight is that in edge-based interactive streaming, differentiating the redundancy settings of retransmissions can often achieve a low bandwidth cost and a low deadline miss rate simultaneously. In this paper, we propose Hairpin, a new packet loss recovery mechanism for edge-based interactive streaming. Hairpin finds the optimal combination of data packets, retransmissions, and redundant packets over multiple rounds of transmissions, which significantly reduces the bandwidth cost while ensuring the end-to-end latency requirement. Experiments with production deployments demonstrate that Hairpin simultaneously reduces the bandwidth cost by 40% and the deadline miss rate by 32% on average in the wild against state-of-the-art solutions.

Track 2

Verification

Session Chair: Srinivas Narayana, Rutgers University

Magnolia Room

Finding Adversarial Inputs for Heuristics using Multi-level Optimization

Pooria Namyar, Microsoft and University of Southern California; Behnaz Arzani and Ryan Beckett, Microsoft; Santiago Segarra, Microsoft and Rice University; Himanshu Raj and Umesh Krishnaswamy, Microsoft; Ramesh Govindan, University of Southern California; Srikanth Kandula, Microsoft

Available Media

Production systems use heuristics because they are faster or scale better than their optimal counterparts. Yet, practitioners are often unaware of the performance gap between a heuristic and the optimum or between two heuristics in realistic scenarios. MetaOpt is a system that helps analyze these heuristics. Users specify the heuristic and the optimal (or another heuristic) as input, and MetaOpt encodes these efficiently for a solver to find performance gaps and their corresponding adversarial inputs. Its suite of built-in optimizations helps it scale to practical problem sizes. We used MetaOpt to analyze heuristics from three domains (traffic engineering, vector bin packing, and packet scheduling). We found that a production traffic engineering heuristic can require 30% more capacity than the optimal in realistic cases. We modified the heuristic based on the patterns in the adversarial inputs MetaOpt discovered and reduced the performance gap by 12.5×. We examined adversarial inputs to a vector bin packing heuristic and proved a new lower bound on its performance.
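
The problem statement, "find the input on which a heuristic is farthest from optimal," can be conveyed with a toy that brute-forces tiny first-fit bin-packing instances. This only illustrates the search; MetaOpt itself encodes the heuristic and the optimal as a multi-level optimization for a solver rather than enumerating inputs, and the bin capacity and input length below are arbitrary.

    # Toy adversarial-gap search: which length-4 input makes first-fit bin
    # packing use the most bins relative to the brute-force optimum?
    from itertools import product

    BIN = 6                                   # bin capacity (toy setting)

    def first_fit(items):
        """The heuristic under analysis."""
        bins = []
        for it in items:
            for b in bins:
                if sum(b) + it <= BIN:
                    b.append(it)
                    break
            else:
                bins.append([it])
        return len(bins)

    def optimal(items):
        """Brute-force optimum: try every assignment of items to bins."""
        n = len(items)
        best = n                              # one bin per item always works
        for assign in product(range(n), repeat=n):
            loads = [0] * n
            for it, b in zip(items, assign):
                loads[b] += it
            if max(loads) <= BIN:
                best = min(best, len(set(assign)))
        return best

    # Exhaustively search all length-4 inputs for the largest heuristic gap.
    worst = max(product(range(1, BIN + 1), repeat=4),
                key=lambda items: first_fit(items) - optimal(items))
    print(worst, "first-fit bins:", first_fit(worst), "optimal bins:", optimal(worst))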

Towards provably performant congestion control

Anup Agarwal, Carnegie Mellon University; Venkat Arun, University of Texas at Austin; Devdeep Ray, Ruben Martins, and Srinivasan Seshan, Carnegie Mellon University

Available Media

We seek to ease the design of congestion control algorithms (CCAs) that provably perform well under diverse network scenarios including cellular links, policers, token bucket filters, operating system jitter, etc. Guaranteeing performance under such conditions is hard as it requires considering combinatorial possibilities of CCA and network interactions. We build a framework that allows us to reason about CCAs. It describes (1) the necessary actions that any performant CCA must take, and (2) a provably sufficient amount of information for CCAs to consider when deciding their sending rate. Combining this framework with techniques from formal methods, we synthesize CCAs that provably perform well across a diverse set of network conditions. Our methodology also led us to discover and prove fundamental impossibility results.

EPVerifier: Accelerating Update Storms Verification with Edge-Predicate

Chenyang Zhao, Yuebin Guo, Jingyu Wang, Qi Qi, Zirui Zhuang, Haifeng Sun, and Lingqi Guo, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; Yuming Xie, Huawei Technologies Co., Ltd; Jianxin Liao, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Available Media

Data plane verification is designed to automatically verify network correctness by directly analyzing the data plane. Recent data plane verifiers have achieved sub-millisecond verification of per-rule updates by partitioning packets into equivalence classes (ECs). Network events such as end-to-end establishment, disruption, or recovery can generate a large number of data plane updates in a short interval, known as an update storm. For update storms, however, the verification speed of current EC-based methods is often slowed by the maintenance of their EC-based network model (EC-model).

This paper presents EPVerifier, a fast, partitioned data plane verification approach that further accelerates update storm verification. EPVerifier uses a novel edge-predicate-based (EP-based) local modeling approach to avoid the drastic oscillations of the EC-model caused by changes in the set of equivalence classes. In addition, with local EPs, EPVerifier can partition verification tasks by switch, which EC-based methods cannot, yielding better parallel performance. We implement EPVerifier as an easy-to-use tool, allowing users to quickly obtain the appropriate verification results at any moment by providing the necessary input. Both dataset trace-driven simulations and deployments in the wild show that EPVerifier achieves robustly fast update storm verification and superior parallel performance, and these advantages grow with the data plane's complexity and the storm size. The verification time of EPVerifier for an update storm of size 1M is around 10s on average, a 2-10× improvement over the state of the art.

Netcastle: Network Infrastructure Testing At Scale

Rob Sherwood, NetDebug.com; Jinghao Shi, Ying Zhang, Neil Spring, Srikanth Sundaresan, Jasmeet Bagga, Prathyusha Peddi, Vineela Kukkadapu, Rashmi Shrivastava, Manikantan KR, Pavan Patil, Srikrishna Gopu, Varun Varadan, Ethan Shi, Hany Morsy, Yuting Bu, Renjie Yang, Rasmus Jönsson, Wei Zhang, Jesus Jussepen Arredondo, and Diana Saha, Meta Platforms Inc.; Sean Choi, Santa Clara University

Available Media

Network operators have long struggled to achieve reliability. Increased complexity risks surprising interactions, increased downtime, and lost person-hours trying to debug correctness and performance problems in large systems. For these reasons, network operators have also long pushed back on deploying promising network research, fearing the unexpected consequences of increased network complexity. Despite the changes’ potential benefits, the corresponding increase in complexity may result in a net loss.

The method to build reliability despite complexity in Software Engineering is testing. In this paper, we use statistics from a large-scale network to identify unique challenges in network testing. To tackle these challenges, we develop Netcastle: a system that provides continuous integration/continuous deployment (CI/CD) network testing as a service for 11 different networking teams, across 68 different use cases and O(1K) test devices. Netcastle supports comprehensive network testing, including device-level firmware, datacenter distributed control planes, and backbone centralized controllers, and runs 500K+ network tests per day, a scale and depth of test coverage previously unpublished. We share five years of experience in building and running Netcastle at Meta.

MESSI: Behavioral Testing of BGP Implementations

Rathin Singha and Rajdeep Mondal, University of California Los Angeles; Ryan Beckett, Microsoft; Siva Kesava Reddy Kakarla, Microsoft Research; Todd Millstein and George Varghese, University of California Los Angeles

Available Media

Complex network protocols like the Border Gateway Protocol (BGP) are prone to implementation errors that cause unintended behaviors with potentially global consequences. We introduce an approach and tool called MESSI (Modular Exploration of State and Structure Inclusively) to automatically generate tests for black-box BGP implementations. Our approach is model-based, leveraging an executable model of BGP to generate behavioral tests. However, doing so effectively requires addressing new challenges such as the stateful nature of BGP and the need to generate complex structures like regular expressions in route maps. We used MESSI to generate roughly 150K tests that capture different aspects of BGP, such as route-map filtering, the decision process, route aggregation, and dynamics. These tests identified 22 correctness bugs across several widely used open-source BGP implementations (FRR, Quagga, GoBGP, BIRD, Batfish) and one closed-source implementation. Eight of these errors have already been fixed. While our models are BGP-specific, our approach is not; thus we expect it can be adapted to test other stateful protocols with complex structures.

12:30 pm–2:00 pm

Lunch (on your own)

2:00 pm–3:40 pm

Track 1

Networking at Scale

Session Chair: Junchen Jiang, University of Chicago

Santa Clara Ballroom

A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

Nils Blach and Maciej Besta, ETH Zürich; Daniele De Sensi, ETH Zürich and Sapienza University of Rome; Jens Domke, RIKEN Center for Computational Science (R-CCS); Hussein Harake, Swiss National Supercomputing Centre (CSCS); Shigang Li, ETH Zürich and BUPT, Beijing; Patrick Iff, ETH Zürich; Marek Konieczny, AGH-UST; Kartik Lakhotia, Intel Labs; Ales Kubicek and Marcel Ferrari, ETH Zürich; Fabrizio Petrini, Intel Labs; Torsten Hoefler, ETH Zürich

Available Media

Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source) routing architecture is fully portable and applicable to accelerate any low-diameter interconnect.

Crescent: Emulating Heterogeneous Production Network at Scale

Zhaoyu Gao and Anubhavnidhi Abhashkumar, ByteDance; Zhen Sun, Cornell University; Weirong Jiang and Yi Wang, ByteDance

Available Media

This paper presents the design, implementation, evaluation, and deployment of Crescent, ByteDance’s network emulation platform for preventing change-induced network incidents. Inspired by prior art such as CrystalNet, Crescent achieves high fidelity by running switch vendor images inside containers. However, we explore a different route to scaling up the emulator, which comes with unique challenges. First, we analyze our past network incidents to reveal the difficulty of identifying a safe emulation boundary. Instead of emulating the entire network, we exploit the inherent symmetry and modularity of data center network architectures to strike a balance between coverage and resource cost. Second, we study the node-to-host assignment by formulating it as a graph partitioning problem. Evaluation results show that our partitioning algorithm reduces the testbed bootup time by up to 20× compared with random partitioning. Third, we develop an incremental approach to modify the emulated network on the fly. This approach can be 30× faster than creating a new testbed of the same scale. Crescent has been actively used for three and a half years, leading to a significant reduction in change-induced network incidents. We also share Crescent’s success in many other use cases and the critical lessons learned from its deployment.

Reasoning about Network Traffic Load Property at Production Scale

Ruihan Li, Peking University and Alibaba Cloud; Fangdan Ye, Yifei Yuan, Ruizhen Yang, Bingchuan Tian, Tianchen Guo, Hao Wu, Xiaobo Zhu, Zhongyu Guan, Qing Ma, and Xianlong Zeng, Alibaba Cloud; Chenren Xu, Peking University; Dennis Cai and Ennan Zhai, Alibaba Cloud

Available Media

This paper presents Jingubang, the first reported system for checking network traffic load properties (e.g., if any link’s utilization would exceed 80% during a network change) in a production Wide Area Network (WAN). Motivated by our network operators, Jingubang should meet three important requirements: (R1) comprehensive support for complex traffic behavior under BGP, IS-IS, policy-based routes (PBR), and segment routes (SR), (R2) reasoning on traffic load of billions of flows across a period of time, (R3) real-time failure-tolerance analysis. These requirements pose challenges in modeling the complex traffic behavior and maintaining the checking efficiency. Jingubang has successfully addressed these challenges. First, we propose the traffic distribution graph (or TDG), capable of modeling equal-cost multi-path (ECMP), packet rewriting, and tunneling, introduced by BGP/IS-IS, PBR, and SR, respectively. Second, we design an algorithm based on TDG to simulate traffic distribution for billions of flows across a time period both efficiently and accurately. Third, Jingubang proposes an incremental traffic simulation approach that first computes an incremental TDG and then simulates only the differential traffic distribution, avoiding the need to simulate the entire network traffic distribution from scratch. Jingubang has been used in the daily checking of our WAN for more than one year and prevented service downtime resulting from traffic load violations.

POSEIDON: A Consolidated Virtual Network Controller that Manages Millions of Tenants via Config Tree

Biao Lyu, Zhejiang University and Alibaba Cloud; Enge Song, Tian Pan, Jianyuan Lu, Shize Zhang, Xiaoqing Sun, Lei Gao, Chenxiao Wang, Han Xiao, Yong Pan, Xiuheng Chen, Yandong Duan, Weisheng Wang, Jinpeng Long, Yanfeng Wang, Kunpeng Zhou, and Zhigang Zong, Alibaba Cloud; Xing Li, Zhejiang University and Alibaba Cloud; Guangwang Li and Pengyu Zhang, Alibaba Cloud; Peng Cheng and Jiming Chen, Zhejiang University; Shunmin Zhu, Tsinghua University and Alibaba Cloud

Available Media

As the cloud rapidly expands in scale, the virtual network controller must manage an increasing number of devices with higher update frequencies. Furthermore, the emergence of cloud-native services has substantially intensified program-triggered updates, leading to more frequent API invocations. To enhance performance and extensibility, we propose Poseidon, a novel virtual network control framework. Specifically, to reduce operational expenses (OpEx), we have consolidated the common functions of multiple service controllers into a single controller. To manage heterogeneous devices and eliminate the multi-table lookup complexity due to config dependencies, we introduce Trident, a tree-based service- and device-independent abstraction, so that config dependency calculation can be replaced by more efficient tree traversal. After deploying Poseidon on Alibaba Cloud, we observed a 21x increase in the throughput of virtual network configuration tasks, along with a 4.4x decrease in the P99 API processing latency. Poseidon completes the task of enabling hundreds of Elastic IP addresses (EIPs) 1.8 to 55 times faster than Vendors A and B, both of which are among the top 5 providers, for identical network configuration jobs.

OPPerTune: Post-Deployment Configuration Tuning of Services Made Easy

Gagan Somashekar, Stony Brook University; Karan Tandon and Anush Kini, Microsoft Research; Chieh-Chun Chang and Petr Husak, Microsoft; Ranjita Bhagwan, Google; Mayukh Das, Microsoft365 Research; Anshul Gandhi, Stony Brook University; Nagarajan Natarajan, Microsoft Research

Available Media

Real-world application deployments have hundreds of inter-dependent configuration parameters, many of which significantly influence performance and efficiency. With today's complex and dynamic services, operators need to continuously monitor and set the right configuration values (configuration tuning) well after a service is widely deployed. This is challenging since experimenting with different configurations post-deployment may reduce application performance or cause disruptions. While state-of-the-art ML approaches do help to automate configuration tuning, they do not fully address the multiple challenges in end-to-end configuration tuning of deployed applications.

This paper presents OPPerTune, a service that enables configuration tuning of applications in deployment at Microsoft. OPPerTune reduces application interruptions while maximizing the performance of deployed applications as and when the workload or the underlying infrastructure changes. It automates three essential processes that facilitate post-deployment configuration tuning: (a) determining which configurations to tune, (b) automatically managing the scope at which to tune the configurations, and (c) using a novel reinforcement learning algorithm to simultaneously and quickly tune numerical and categorical configurations, thereby keeping the overhead of configuration tuning low. We deploy OPPerTune on two enterprise applications in Microsoft Azure's clusters. Our experiments show that OPPerTune reduces the end-to-end P95 latency of microservice applications by more than 50% over expert configuration choices made ahead of deployment. The code and datasets used are made available at https://aka.ms/OPPerTune.

Track 2

ML but Faster

Session Chair: Hong Xu, The Chinese University of Hong Kong

Magnolia Room

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

Jiangfei Duan, The Chinese University of Hong Kong; Ziang Song, ByteDance; Xupeng Miao and Xiaoli Xi, Carnegie Mellon University; Dahua Lin, The Chinese University of Hong Kong; Harry Xu, University of California, Los Angeles; Minjia Zhang, Microsoft; Zhihao Jia, Carnegie Mellon University

Available Media

Deep neural networks (DNNs) are becoming progressively larger and more costly to train. This paper aims to reduce DNN training costs by leveraging preemptible instances on modern clouds, which can be allocated at a much lower price when idle but may be preempted by the cloud provider at any time. Prior work that supports DNN training on preemptible instances employs a reactive approach, handling instance preemptions and allocations only after they occur, which achieves limited performance and scalability.

We present Parcae, a system that enables cheap, fast, and scalable DNN training on preemptible instances by proactively adjusting the parallelization strategy of a DNN training job to adapt to predicted resource changes before instance preemptions and allocations actually happen, which significantly reduces the cost of handling these events. Parcae optimizes liveput, a novel metric that measures the expected training throughput of a DNN job under various possible preemption scenarios. Compared to existing reactive, throughput-optimized systems, Parcae's proactive, liveput-optimized solution considers both the throughput of a job and its robustness under preemptions. To optimize liveput, Parcae supports lightweight instance migration and uses an availability predictor to forecast future preemptions. It then uses a liveput optimizer to discover an optimal strategy to parallelize DNN training under predicted preemptions. We evaluate Parcae on a variety of DNNs and preemption traces and show that Parcae outperforms existing spot-instance DNN training systems by up to 10×. More importantly, Parcae achieves near-optimal performance for training large DNNs under frequent preemptions, in which case existing approaches cannot make any progress.

Accelerating Neural Recommendation Training with Embedding Scheduling

Chaoliang Zeng, Xudong Liao, Xiaodian Cheng, Han Tian, Xinchen Wan, Hao Wang, and Kai Chen, iSING Lab, Hong Kong University of Science and Technology

Available Media

Deep learning recommendation models (DLRM) are extensively adopted to support many online services. Typical DLRM training frameworks adopt the parameter server (PS) in CPU servers to maintain memory-intensive embedding tables, and leverage GPU workers with embedding cache to accelerate compute-intensive neural network computation and enable fast embedding lookups. However, such distributed systems suffer from significant communication overhead caused by the embedding transmissions between workers and PS. Prior work reduces the number of cache embedding transmissions by compromising model accuracy, including oversampling hot embeddings or applying staleness-tolerant updates.

This paper reveals that many such transmissions can be avoided given the predictable and infrequent nature of in-cache embedding accesses in distributed training. Based on this observation, we explore a new direction to accelerate distributed DLRM training without compromising model accuracy, i.e., embedding scheduling—with the core idea of proactively determining "where embeddings should be trained" and "which embeddings should be synchronized" to increase the cache hit rate and decrease unnecessary updates, thus achieving low communication overhead. To realize this idea, we design Herald, a real-time embedding scheduler consisting of two main components: an adaptive location-aware input allocator to determine where embeddings should be trained and an optimal communication plan generator to determine which embeddings should be synchronized. Our experiments with real-world workloads show that Herald reduces embedding transmissions by 48%-89%, leading to up to 2.11× and up to 1.61× better performance with TCP and RDMA, respectively, over 100 Gbps Ethernet for end-to-end DLRM training.
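
The "where should an embedding be trained" half of the scheduling idea can be sketched as a cache-affinity allocator: route each sample to the worker whose embedding cache overlaps it most. The allocator, the eviction rule, and the capacity below are illustrative assumptions, not Herald's algorithm.

    # Illustrative location-aware input allocation: send each training sample
    # to the GPU worker whose embedding cache already holds most of the sample's
    # embedding IDs, so fewer embeddings must be fetched from the parameter server.
    def allocate(batch, caches, capacity):
        """batch: list of sets of embedding IDs; caches: worker -> set of cached IDs."""
        plan = {w: [] for w in caches}
        for sample in batch:
            # Pick the worker with the best cache overlap; break ties by load.
            worker = max(caches, key=lambda w: (len(sample & caches[w]), -len(plan[w])))
            plan[worker].append(sample)
            # The worker will cache the missing IDs after training on the sample.
            caches[worker] |= sample
            while len(caches[worker]) > capacity:
                caches[worker].pop()         # toy eviction; real caches use LRU/LFU
        return plan

    caches = {"gpu0": {1, 2, 3}, "gpu1": {7, 8, 9}}
    batch = [{1, 2, 5}, {7, 9, 10}, {3, 4}, {8, 11}]
    print(allocate(batch, caches, capacity=6))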

DISTMM: Accelerating Distributed Multimodal Model Training

Jun Huang, The Ohio State University; Zhen Zhang, Amazon Web Services; Shuai Zheng, Boson AI; Feng Qin, The Ohio State University; Yida Wang, Amazon Web Services

Available Media

Multimodal model training takes multiple types of inputs to process with differently structured submodules, and aggregates outcomes from the submodules to learn the relationship among various types of inputs, e.g., correlating text to image for text-to-image generation. The differences of submodule architectures as well as their inputs lead to heterogeneity in terms of computation efficiency. Failing to account for such heterogeneity, existing distributed training systems treat all submodules as a monolithic entity and thus have sub-optimal performance. Moreover, the outcome aggregation phase introduces cross-sample dependencies with contrasting positive and negative sample pairs (i.e., contrastive loss). Such dependencies make the existing pipeline parallelism scheduling algorithms not applicable for multimodal training with contrastive loss.

To address the limitations of existing solutions, we propose DISTMM. For a given multimodal model, DISTMM exploits the heterogeneity among submodules, applying different distributed parallelism strategies to each submodule, e.g., using Tensor Parallelism for a computation-intensive submodule and Data Parallelism for a submodule with a small number of parameters. DISTMM balances the computation of parallelized submodules to reduce the computing-resource idle time spent waiting for the slowest submodule. DISTMM further optimizes the locality of submodules by leveraging the heterogeneous bandwidth of interconnections among accelerators. To address the limitation of existing pipeline execution schedules, we propose a new pipeline execution primitive, called batch-sync instruction, and a corresponding schedule, called DISTMM-Pipe. We build a prototype of DISTMM and evaluate it against existing solutions on models with sizes ranging from 1.1 billion to 26 billion parameters, observing a 1.32-3.27× speedup over Megatron-LM.

Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models

Shubham Agarwal and Subrata Mitra, Adobe Research; Sarthak Chakraborty, UIUC; Srikrishna Karanam, Koyel Mukherjee, and Shiv Kumar Saini, Adobe Research

Available Media

Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images that adhere to text prompts. However, diffusion models go through a large number of iterative denoising steps and are resource-intensive, requiring expensive GPUs and incurring considerable latency. In this paper, we introduce a novel approximate-caching technique that can reduce these iterative denoising steps by reusing intermediate noise states created during a prior image generation. Based on this idea, we present an end-to-end text-to-image generation system, NIRVANA, that uses approximate caching with a novel cache management policy to provide 21% GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity, and reuse of intermediate states in a large production environment.
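
A minimal sketch of the approximate-caching control flow is given below, under the assumption of a prompt-embedding similarity lookup and a single midpoint checkpoint per prompt. Both are illustrative; NIRVANA's retrieval and cache-management policies are more elaborate, and the similarity threshold and step counts are made up.

    # Sketch: embed the prompt, find the most similar previously seen prompt,
    # and if it is similar enough, resume denoising from its stored intermediate
    # noise state instead of starting from pure noise, skipping early steps.
    import numpy as np

    TOTAL_STEPS = 50        # denoising steps for a full generation (assumed)
    SIM_THRESHOLD = 0.9     # assumed similarity cut-off for reuse

    cache = []              # entries: (prompt_embedding, intermediate_state, step)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def generate(prompt_emb, denoise):
        """`denoise(state, start_step)` stands in for the diffusion model: it runs
        the remaining steps and returns (image, state_at_midpoint)."""
        best = max(cache, key=lambda e: cosine(prompt_emb, e[0]), default=None)
        if best is not None and cosine(prompt_emb, best[0]) >= SIM_THRESHOLD:
            state, start = best[1], best[2]      # resume from cached noise state
        else:
            state, start = np.random.default_rng(0).normal(size=4), 0
        image, midpoint_state = denoise(state, start)
        # Keep a midpoint checkpoint so similar future prompts can skip early steps.
        cache.append((prompt_emb, midpoint_state, max(start, TOTAL_STEPS // 2)))
        return image

    fake_denoise = lambda state, start: (f"image after {TOTAL_STEPS - start} steps", state)
    print(generate(np.array([1.0, 0.0, 0.0]), fake_denoise))    # cold miss: 50 steps
    print(generate(np.array([0.95, 0.05, 0.0]), fake_denoise))  # near duplicate: 25 steps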

THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Minghao Li, Harvard University; Ran Ben Basat, University College London; Shay Vargaftik, VMware Research; ChonLam Lao, Kevin Xu, Michael Mitzenmacher, and Minlan Yu, Harvard University

Available Media

Deep neural networks (DNNs) are the de facto standard for essential use cases, such as image classification, computer vision, and natural language processing. As DNNs and datasets get larger, they require distributed training on increasingly larger clusters. A main bottleneck is the resulting communication overhead where workers exchange model updates (i.e., gradients) on a per-round basis. To address this bottleneck and accelerate training, a widely-deployed approach is compression. However, previous deployments often apply bi-directional compression schemes by simply using a uni-directional gradient compression scheme in each direction. This results in significant computational overheads at the parameter server and increased compression error, leading to longer training and lower accuracy.

We introduce Tensor Homomorphic Compression (THC), a novel bi-directional compression framework that enables the direct aggregation of compressed values, thereby eliminating the aforementioned computational overheads. Moreover, THC is compatible with in-network aggregation (INA), which allows for further acceleration. Our evaluation shows that training representative vision and language models with THC reaches target accuracy 1.40× to 1.47× faster using INA and 1.28× to 1.33× faster using a software PS compared with state-of-the-art systems.
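
The property the framework relies on, that compressed updates can be summed directly, can be illustrated with a simple shared-scale quantizer. This is not THC's actual encoding; it only demonstrates why aggregation can happen on compressed values at a parameter server or an in-network aggregation switch.

    # If every worker quantizes its gradient with the SAME shared scale, the
    # aggregator can sum the compressed integer vectors directly (no
    # decompress-recompress round) and decode the result afterwards.
    import numpy as np

    SCALE = 0.01                       # shared quantization step (assumed)

    def compress(grad):
        return np.round(grad / SCALE).astype(np.int32)

    def decompress(q):
        return q.astype(np.float64) * SCALE

    rng = np.random.default_rng(42)
    grads = [rng.normal(size=8) for _ in range(4)]     # one gradient per worker

    aggregated_q = sum(compress(g) for g in grads)     # aggregator adds integers
    approx = decompress(aggregated_q)
    exact = sum(grads)
    print(np.max(np.abs(approx - exact)))              # bounded by workers * SCALE / 2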

3:40 pm–4:10 pm

Break with Refreshments

Mezzanine

4:10 pm–5:50 pm

Track 1

Distributed Systems: Part 2

Session Chair: Alan Zaoxing Liu, Boston University

Santa Clara Ballroom

Accelerating Skewed Workloads With Performance Multipliers in the TurboDB Distributed Database

Jennifer Lam, Jeffrey Helt, and Wyatt Lloyd, Princeton University; Haonan Lu, University at Buffalo

Available Media

Distributed databases suffer from performance degradation under skewed workloads. Such workloads cause high contention, which is exacerbated by cross-node network latencies. In contrast, single-machine databases better handle skewed workloads because their centralized nature enables performance optimizations that execute contended requests more efficiently. Based on this insight, we propose a novel hybrid architecture that employs a single-machine database inside a distributed database and present TurboDB, the first distributed database that leverages this hybrid architecture to achieve up to an order of magnitude better performance than representative solutions under skewed workloads.

TurboDB introduces two designs to tackle the core challenges unique to its hybrid architecture. First, Hybrid Concurrency Control is a specialized technique that coordinates the single-machine and distributed databases to collectively ensure process-ordered serializability. Second, Phalanx Replication provides fault tolerance for the single-machine database without significantly sacrificing its performance benefits. We implement TurboDB using CockroachDB and Cicada as the distributed and single-machine databases, respectively. Our evaluation shows that TurboDB significantly improves the performance of CockroachDB under skewed workloads.

SIEVE is Simpler than LRU: an Efficient Turn-Key Eviction Algorithm for Web Caches

Yazhuo Zhang, Emory University; Juncheng Yang, Carnegie Mellon University; Yao Yue, Pelikan Foundation; Ymir Vigfusson, Emory University and Keystrike; K.V. Rashmi, Carnegie Mellon University
Community Award Winner!

Available Media

Caching is an indispensable technique for low-cost and fast data serving. The eviction algorithm, at the heart of a cache, has been primarily designed to maximize efficiency—reducing the cache miss ratio. Many eviction algorithms have been designed in the past decades. However, they all trade off throughput, simplicity, or both for higher efficiency. Such a compromise often hinders adoption in production systems.

This work presents SIEVE, an algorithm that is simpler than LRU and provides better than state-of-the-art efficiency and scalability for web cache workloads. We implemented SIEVE in five production cache libraries, requiring fewer than 20 lines of code changes on average. Our evaluation on 1559 cache traces from 7 sources shows that SIEVE achieves up to 63.2% lower miss ratio than ARC. Moreover, SIEVE has a lower miss ratio than 9 state-of-the-art algorithms on more than 45% of the 1559 traces, while the next best algorithm only has a lower miss ratio on 15%. SIEVE's simplicity comes with superior scalability as cache hits require no locking. Our prototype achieves twice the throughput of an optimized 16-thread LRU implementation. SIEVE is more than an eviction algorithm; it can be used as a cache primitive to build advanced eviction algorithms just like FIFO and LRU.
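
A compact Python rendering of a SIEVE-style cache is shown below. It follows the FIFO-queue-plus-visited-bit description, but it should be treated as a sketch; consult the paper's pseudocode for the authoritative algorithm and the lock-free implementation details.

    # Sketch of a SIEVE-style cache: objects sit in FIFO order with a one-bit
    # "visited" flag that hits set lazily; a hand sweeps from the oldest object
    # toward the newest, sparing (and clearing) visited objects and evicting the
    # first unvisited one. Hits never move objects.
    class SieveCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.order = []            # index 0 = newest, end = oldest
            self.entries = {}          # key -> [value, visited]
            self.hand = None           # key the hand currently points at

        def _evict(self):
            idx = self.order.index(self.hand) if self.hand in self.entries else len(self.order) - 1
            while True:
                key = self.order[idx]
                if self.entries[key][1]:          # visited: spare it, clear the bit
                    self.entries[key][1] = False
                    idx = idx - 1 if idx > 0 else len(self.order) - 1
                else:
                    self.hand = self.order[idx - 1] if idx > 0 else None
                    self.order.pop(idx)
                    del self.entries[key]
                    return

        def get(self, key):
            if key in self.entries:
                self.entries[key][1] = True       # lazy promotion: just set the bit
                return self.entries[key][0]
            return None

        def put(self, key, value):
            if key in self.entries:
                self.entries[key] = [value, True]
                return
            if len(self.entries) >= self.capacity:
                self._evict()
            self.order.insert(0, key)             # insert at the head
            self.entries[key] = [value, False]

    cache = SieveCache(capacity=3)
    for k in ["a", "b", "c"]:
        cache.put(k, k.upper())
    cache.get("a")                                # mark "a" visited
    cache.put("d", "D")                           # evicts "b" (oldest unvisited); "a" is spared
    print(list(cache.entries))                    # ['a', 'c', 'd']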

Harvesting Idle Memory for Application-managed Soft State with Midas

Yifan Qiao, UCLA; Zhenyuan Ruan, MIT CSAIL; Haoran Ma, UCLA; Adam Belay, MIT CSAIL; Miryung Kim and Harry Xu, UCLA

Available Media

Many applications can benefit from data that increases performance but is not required for correctness (commonly referred to as soft state). Examples include cached data from backend web servers and memoized computations in data analytics systems. Today's systems generally statically limit the amount of memory they use for storing soft state in order to prevent unbounded growth that could exhaust the server's memory. Static provisioning, however, makes it difficult to respond to shifts in application demand for soft state and can leave significant amounts of memory idle. Existing OS kernels can only spend idle memory on caching disk blocks—which may not have the most utility—because they do not provide the right abstractions to safely allow applications to store their own soft state.

To effectively manage and dynamically scale soft state, we propose soft memory, an elastic virtual memory abstraction with unmap-and-reconstruct semantics that makes it possible for applications to use idle memory to store whatever soft state they choose while guaranteeing both safety and efficiency. We present Midas, a soft memory management system that contains (1) a runtime that is linked to each application to manage soft memory objects and (2) OS kernel support that coordinates soft memory allocation between applications to maximize their performance. Our experiments with four real-world applications show that Midas can efficiently and safely harvest idle memory to store applications' soft state, delivering near-optimal application performance and responding to extreme memory pressure without running out of memory.
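
The following is a hypothetical, much-simplified sketch of what "unmap-and-reconstruct semantics" can look like from the application's point of view; the names (SoftRef, reclaim, expensive_query) are illustrative stand-ins, not Midas' actual API.

    class SoftRef:
        def __init__(self, reconstruct):
            self._reconstruct = reconstruct   # how to rebuild the soft state if it is evicted
            self._obj = None                  # None means "reclaimed by the runtime"

        def deref(self):
            if self._obj is None:             # soft state was unmapped under memory pressure
                self._obj = self._reconstruct()
            return self._obj

        def reclaim(self):
            # called by a (hypothetical) runtime when the kernel asks for memory back
            self._obj = None

    def expensive_query(q):
        # stand-in for a backend call whose result is worth caching as soft state
        return {"rows": [1, 2, 3], "query": q}

    cache = SoftRef(lambda: expensive_query("SELECT ..."))
    result = cache.deref()    # computed once, then served from idle memory
    cache.reclaim()           # under pressure, the memory is handed back safely
    result = cache.deref()    # transparently reconstructed on next use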

Efficient Exposure of Partial Failure Bugs in Distributed Systems with Inferred Abstract States

Haoze Wu and Jia Pan, Johns Hopkins University; Peng Huang, University of Michigan

Available Media

Many distributed system failures, especially the notorious partial service failures, are caused by bugs that are only triggered by subtle faults at rare timing. Existing testing is inefficient in exposing such bugs. This paper presents Legolas, a fault injection testing framework designed to address this gap. To precisely simulate subtle faults, Legolas statically analyzes the system code and instruments hooks within a system. To efficiently explore numerous faults, Legolas introduces a novel notion of abstract states and automatically infers abstract states from code. During testing, Legolas designs an algorithm that leverages the inferred abstract states to make careful fault injection decisions. We applied Legolas on the latest releases of six popular, extensively tested distributed systems. Legolas found 20 new bugs that result in partial service failures.

Load is not what you should balance: Introducing Prequal

Bartek Wydrowski, Google Research; Robert Kleinberg, Google Research and Cornell; Stephen M. Rumble, Google (YouTube); Aaron Archer, Google Research

Available Media

We present PReQuaL (Probing to Reduce Queuing and Latency), a load balancer for distributed multi-tenant systems. PReQuaL is designed to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. To achieve this, PReQuaL actively probes server load and leverages the power-of-d-choices paradigm, extending it with asynchronous and reusable probes. Cutting against received wisdom, PReQuaL does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight (RIF). We explore the major design features of PReQuaL on a testbed system and describe our experience using it to balance load within YouTube, where it has been running for more than a year. PReQuaL has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization.
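
As a rough illustration of the selection rule described above (a simplified reading, not PReQuaL's exact algorithm), the sketch below picks among a few probed servers using their reported requests-in-flight and latency estimates; the RIF cutoff and the probe plumbing are assumptions.

    import random

    def select_server(probes, d=5, hot_quantile=0.8):
        """probes: dict server_id -> (rif, est_latency_ms); returns chosen server_id."""
        sampled = random.sample(list(probes.items()), min(d, len(probes)))  # power of d choices
        rifs = sorted(rif for _, (rif, _) in sampled)
        threshold = rifs[int(hot_quantile * (len(rifs) - 1))]               # RIF cutoff for "hot"
        cold = [(sid, lat) for sid, (rif, lat) in sampled if rif <= threshold]
        if cold:
            return min(cold, key=lambda x: x[1])[0]     # cold servers: lowest estimated latency
        # all sampled servers are hot: fall back to the fewest requests in flight
        return min(sampled, key=lambda x: x[1][0])[0]

    # Example with hypothetical probe results
    probes = {"s1": (3, 12.0), "s2": (9, 8.5), "s3": (1, 20.0), "s4": (15, 6.0)}
    print(select_server(probes, d=3))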

Track 2

Wireless Hardware

Session Chair: Deepak Vasisht, University of Illinois Urbana–Champaign

Magnolia Room

Orthcatter: High-throughput In-band OFDM Backscatter with Over-the-Air Code Division

Caihui Du and Jihong Yu, Beijing Institute of Technology; Rongrong Zhang, Capital Normal University; Ju Ren, Tsinghua University; Jianping An, Beijing Institute of Technology

Available Media

Existing ambient backscatter systems suffer from either excessive spectrum usage or low throughput. We propose Orthcatter, the first in-band OFDM backscatter system that provides higher throughput while consuming fewer spectrum resources. Our key innovation is an over-the-air code division technique that cancels co-channel interference, solving the core challenge of in-band backscatter communication. Unlike common code-division systems that generate orthogonal codewords locally, we construct quasi-orthogonal backscatter codewords by swapping the subcarriers of each excitation OFDM symbol, and we realize this design passively with a double side-band symbol construction method. Armed with these quasi-orthogonal codewords, we design a two-step interference cancellation scheme that significantly improves reliability. We prototype and test Orthcatter. The results show that Orthcatter achieves a throughput of 248 kbps and a BER of 10^-4 under an OFDM WiFi exciter, improving by over 4.6× and 300× compared with the state-of-the-art in-band backscatter system. Its throughput and BER are even 11 kbps higher and 59× better than prior side-band backscatter systems, and its exciter-to-tag communication range is 3× that of prior OFDM backscatter systems.

EdgeRIC: Empowering Real-time Intelligent Optimization and Control in NextG Cellular Networks

Woo-Hyun Ko, Texas A&M University; Ushasi Ghosh, University of California San Diego; Ujwal Dinesha, Texas A&M University; Raini Wu, University of California San Diego; Srinivas Shakkottai, Texas A&M University; Dinesh Bharadia, University of California San Diego

Available Media

Radio Access Networks (RANs) are increasingly softwarized and accessible via data-collection and control interfaces. RAN intelligent control (RIC) is an approach to manage these interfaces at different timescales. In this paper, we introduce EdgeRIC, a real-time RIC co-located with the Distributed Unit (DU). It is decoupled from the RAN stack and operates at the RAN timescale. EdgeRIC serves as the seat of real-time AI-in-the-loop decision and control. It can access RAN and application-level information to execute AI-optimized and other policies in real time (sub-millisecond). We demonstrate that EdgeRIC operates as if embedded within the RAN stack. We showcase real-time applications called μApps over EdgeRIC that significantly outperform a cloud-based near real-time RIC (> 15 ms latency) in terms of attained system throughput. Further, our over-the-air experiments with AI-based policies showcase their resilience to channel dynamics. Remarkably, these AI policies outperform model-based strategies by 5% to 25% in both system throughput and end-user application-level benchmarks across diverse mobile scenarios.

ADR-X: ANN-Assisted Wireless Link Rate Adaptation for Compute-Constrained Embedded Gaming Devices

Hao Yin, University of Washington; Murali Ramanujam, Princeton University; Joe Schaefer, Stan Adermann, Srihari Narlanka, and Perry Lea, Microsoft; Ravi Netravali, Princeton University; Krishna Chintalapudi, Microsoft Research

Available Media

The wireless channel between a gaming console and its accessories, e.g., controllers and headsets, experiences extremely rapid variations due to abrupt head and hand movements amidst an exciting game. In the absence of prior studies on wireless packet losses for console gaming, through extensive evaluations and user studies, we find that state-of-the-art rate adaptation schemes, unable to keep up with these rapid changes, experience packet loss rates of 2-10%, while loss rates that are 10× lower (0.1-0.5%) are required to ensure a high-quality gaming experience. We present ADR-X, an ANN-based contextual multi-armed bandit rate adaptation technique that continuously predicts and tracks the channel and picks appropriate data rates. A key challenge for ADR-X is that it must run on power- and compute-constrained embedded devices under real-time constraints. ADR-X addresses this challenge by meticulously crafting an ANN that leverages existing communication theory results to incorporate domain knowledge. This allows ADR-X to achieve 10× lower packet losses than existing schemes while also running 100× faster than state-of-the-art reinforcement learning schemes, making it suitable for deployment on embedded gaming devices.
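
To make the rate-adaptation setting concrete, here is a toy epsilon-greedy contextual bandit over PHY rates; it is far simpler than ADR-X's ANN, and the context feature, rate set, and reward shaping below are purely illustrative assumptions.

    import random

    RATES_MBPS = [6, 12, 24, 48]            # illustrative arm set

    class EpsilonGreedyRA:
        def __init__(self, eps=0.1):
            self.eps = eps
            self.value = {}                 # running reward estimate per (SNR bucket, rate)
            self.count = {}

        def _bucket(self, snr_db):
            return int(snr_db // 5)         # coarse context: 5 dB SNR buckets

        def choose(self, snr_db):
            ctx = self._bucket(snr_db)
            if random.random() < self.eps:
                return random.choice(RATES_MBPS)
            return max(RATES_MBPS, key=lambda r: self.value.get((ctx, r), 0.0))

        def update(self, snr_db, rate, delivered, lost):
            ctx = self._bucket(snr_db)
            reward = rate * delivered - 10.0 * rate * lost   # losses hurt the gaming experience
            k = (ctx, rate)
            self.count[k] = self.count.get(k, 0) + 1
            self.value[k] = self.value.get(k, 0.0) + (reward - self.value.get(k, 0.0)) / self.count[k]

    ra = EpsilonGreedyRA()
    rate = ra.choose(snr_db=27.0)
    ra.update(27.0, rate, delivered=0.98, lost=0.02)   # feedback from the last interval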

RFID+: Spatially Controllable Identification of UHF RFIDs via Controlled Magnetic Fields

Donghui Dai, The Hong Kong Polytechnic University; Zhenlin An, The Hong Kong Polytechnic University and Princeton University; Zheng Gong, The Hong Kong Polytechnic University; Qingrui Pan, The Hong Kong Polytechnic University and The University of Edinburgh; Lei Yang, Shenzhen Research Institute, The Hong Kong Polytechnic University

Available Media

In the fast-paced landscape of UHF RFID technology, achieving precise spatially selective identification is of critical importance in the logistics and retail domains. This work introduces RFID+, a magnetically driven UHF RFID system that leverages the matching loops of commercial off-the-shelf UHF RFID tags for efficient energy harvesting from tailored magnetic fields. RFID+ delivers a level of spatial precision comparable to that of HF NFC systems, effectively mitigating issues of miss-reading and cross-reading. Our primary contributions reside in the development of a specialized multi-turn, capacitor-segmented coil antenna and an innovative fast inventory algorithm. RFID+ seamlessly integrates traditional radiative coupling with magnetic coupling in UHF RFID systems, bolstering their overall performance and efficiency. Real-world pilot studies in warehouse and logistics settings reveal that RFID+ reduces the miss-reading rate from 22.9% down to a remarkable 1.06%, while entirely eliminating cross-reading. Moreover, our RFID+ variant demonstrates better resilience against materials traditionally challenging for UHF RFID, such as water bottles and containers. These advancements make RFID+ highly relevant for practical applications in logistical networks.

SMUFF: Towards Line Rate Wi-Fi Direct Transport with Orchestrated On-device Buffer Management

Chengke Wang, Peking University; Hao Wang, Shenzhen Kaihong Digital Industry Development Co., Ltd.; Yuhan Zhou and Yunzhe Ni, Peking University; Feng Qian, University of Southern California; Chenren Xu, Peking University, Zhongguancun Laboratory, and Key Laboratory of High Confidence Software Technologies, Ministry of Education (PKU)

Available Media

Wi-Fi direct transport provides versatile connectivity that enables convenient data sharing and improves the productivity of mobile end users. However, as today's smartphones are capable of near-Gbps wireless data rates, current solutions do not efficiently utilize the available bandwidth in this single-hop environment. We show that existing transport schemes suffer from resource-intensive reliable delivery mechanisms, inadequate congestion control, and inefficient flow control for achieving line-rate transmission in peer-to-peer Wi-Fi direct links. In this paper, we present SMUFF, a reliable file transfer service that achieves nearly the practical line rate of the underlying wireless bandwidth. We note a unique feature of direct transport—the sender can monitor each buffer along the data path and determine an optimal sending rate accordingly. Therefore, SMUFF can maximize throughput by strategically backlogging the appropriate amount of data in the bottleneck buffer. We have deployed SMUFF on four different phone models, and our evaluations with other transport schemes show that SMUFF achieves up to 94.7% of the practical line rate and 22.6% throughput improvement with a 37% reduction in CPU usage and a 15% reduction in power consumption, compared to state-of-the-art solutions.

6:30 pm–8:00 pm

NSDI '24 Poster Session and Reception

Mezzanine

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, authors, and symposium organizers. View the list of accepted posters.

Thursday, April 18

8:00 am–9:00 am

Continental Breakfast

Mezzanine

9:00 am–10:20 am

Track 1

ML Scheduling

Session Chair: Behnaz Arzani, Microsoft

Santa Clara Ballroom

Vulcan: Automatic Query Planning for Live ML Analytics

Yiwen Zhang and Xumiao Zhang, University of Michigan; Ganesh Ananthanarayanan, Microsoft; Anand Iyer, Georgia Institute of Technology; Yuanchao Shu, Zhejiang University; Victor Bahl, Microsoft Corporation; Z. Morley Mao, University of Michigan and Google; Mosharaf Chowdhury, University of Michigan

Available Media

Live ML analytics have gained increasing popularity with large-scale deployments due to recent evolution of ML technologies. To serve live ML queries, experts nowadays still need to perform manual query planning, which involves pipeline construction, query configuration, and pipeline placement across multiple edge tiers in a heterogeneous infrastructure. Finding the best query plan for a live ML query requires navigating a huge search space, calling for an efficient and systematic solution.

In this paper, we propose Vulcan, a system that automatically generates query plans for live ML queries to optimize their accuracy, latency, and resource consumption. Based on the user query and performance requirements, Vulcan determines the best pipeline, placement, and query configuration for the query with low profiling cost; it also performs fast online adaptation after query deployment. Vulcan outperforms state-of-the-art ML analytics systems by 4.1×-30.1× in terms of search cost while delivering up to 3.3× better query latency.

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

Sudarsanan Rajasekaran and Manya Ghobadi, Massachusetts Institute of Technology; Aditya Akella, UT Austin

Available Media

We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an Affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs, such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that compared to the state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN marked packets in the cluster by up to 33x.
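
The sketch below illustrates the time-shifting idea in miniature (it is not CASSINI's algorithm): two jobs share a link, each communicates for a fixed window within a common iteration period, and a brute-force search finds the shift that minimizes how much their communication windows overlap, so one job's compute phase hides the other's communication.

    def overlap(period, win_a, win_b, shift, step=1):
        """win = (start, duration); returns overlapped time units when win_b is shifted."""
        total = 0
        for t in range(0, period, step):
            in_a = (t - win_a[0]) % period < win_a[1]
            in_b = (t - win_b[0] - shift) % period < win_b[1]
            total += step if (in_a and in_b) else 0
        return total

    def best_shift(period, win_a, win_b):
        return min(range(period), key=lambda s: overlap(period, win_a, win_b, s))

    # Example: a 100 ms iteration, each job communicates for 40 ms starting at t = 0.
    print(best_shift(100, (0, 40), (0, 40)))   # any shift of 40-60 ms interleaves the phases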

Towards Domain-Specific Network Transport for Distributed DNN Training

Hao Wang and Han Tian, iSING Lab, Hong Kong University of Science and Technology; Jingrong Chen, Duke University; Xinchen Wan, Jiacheng Xia, and Gaoxiong Zeng, iSING Lab, Hong Kong University of Science and Technology; Wei Bai, Microsoft; Junchen Jiang, University of Chicago; Yong Wang and Kai Chen, iSING Lab, Hong Kong University of Science and Technology

Available Media

The nature of machine learning (ML) applications exposes rich characteristics to underlying network transport, yet little work has been done so far to systematically exploit these properties in transport layer design. This paper takes the initiative to pursue a domain-specific network transport, called MLT, for distributed DNN training that fully embraces several unique characteristics of machine learning.

At its heart, MLT employs three simple yet effective techniques to form a 3-step progressive scheme against long tail latency caused by transient packet drops and queueing. First, it leverages the independence among gradient updates to enable per-packet load balancing, minimizing network hotspots without worrying about packet re-ordering. Then, if a hotspot arises, it performs priority queueing/dropping by differentiating gradients based on their layers and magnitudes to optimize model convergence and accuracy. Lastly, if drops occur, it enables bounded-loss tolerance—a bounded amount of gradient loss that DNN training can tolerate without affecting final model performance.

MLT is readily deployable with commodity switches and imposes minimal modifications on popular DNN training libraries (e.g., TensorFlow, MXNet and PyTorch) and communication routines (e.g., PS and Ring All-reduce). We show, via both testbed experiments and simulations, that MLT can effectively optimize network tail latency and achieve up to 62.2% better end-to-end training time over prior work.

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

Daniele De Sensi, Sapienza University of Rome; Tommaso Bonato, ETH Zurich; David Saam, RWTH Aachen University; Torsten Hoefler, ETH Zurich

Available Media

The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the number of hops between communicating nodes, especially on networks such as tori, where a higher number of hops implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely used in systems optimized for machine learning workloads (e.g., Google TPUs and Amazon Trainium devices), as well as in some of the Top500 supercomputers. To improve allreduce performance on torus networks, we introduce Swing, a new algorithm that reduces the number of hops between communicating nodes by swinging between torus directions. Our analysis and experimental evaluation show that Swing outperforms existing allreduce algorithms by up to 3x for vectors ranging from 32 B to 128 MiB, on different types of torus and torus-like topologies, regardless of their shape and size.

Track 2

Cloud Scheduling

Session Chair: Adam Belay, MIT CSAIL

Magnolia Room

LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search

Chengquan Feng, University of Science and Technology of China; Li Lyna Zhang, Microsoft Research; Yuanchi Liu, University of Science and Technology of China; Jiahang Xu and Chengruidong Zhang, Microsoft Research; Zhiyuan Wang, University of Science and Technology of China; Ting Cao and Mao Yang, Microsoft Research; Haisheng Tan, University of Science and Technology of China

Available Media

Hardware-Aware Neural Architecture Search (NAS) has demonstrated success in automating the design of affordable deep neural networks (DNNs) for edge platforms by incorporating inference latency in the search process. However, accurately and efficiently predicting DNN inference latency on diverse edge platforms remains a significant challenge. Current approaches require several days to construct a new latency predictor for each platform, which is prohibitively time-consuming and impractical.

In this paper, we propose LitePred, a lightweight approach for accurately predicting DNN inference latency on new platforms with minimal adaptation data by transferring existing predictors. LitePred builds on two key techniques: (i) a Variational Autoencoder (VAE) data sampler to sample high-quality training and adaptation data that conforms to the model distributions in NAS search spaces, overcoming the out-of-distribution challenge; and (ii) a latency distribution-based similarity detection method to identify the most similar pre-existing latency predictors for the new target platform, reducing adaptation data required while achieving high prediction accuracy. Extensive experiments on 85 edge platforms and 6 NAS search spaces demonstrate the effectiveness of our approach, achieving an average latency prediction accuracy of 99.3% with less than an hour of adaptation cost. Compared with SOTA platform-specific methods, LitePred achieves up to 5.3% higher accuracy with a significant 50.6× reduction in profiling cost. Code and predictors are available at https://github.com/microsoft/Moonlit/tree/main/LitePred.

Harmonic: Hardware-assisted RDMA Performance Isolation for Public Clouds

Jiaqi Lou, University of Illinois Urbana-Champaign; Xinhao Kong, Duke University; Jinghan Huang, University of Illinois Urbana-Champaign; Wei Bai, Microsoft; Nam Sung Kim, University of Illinois Urbana-Champaign; Danyang Zhuo, Duke University

Available Media

Performance isolation is essential for sharing resources in multi-tenant public clouds. Compared with traditional kernel-based networking, RDMA presents unique challenges, especially because the RDMA NIC's complex microarchitectural resources are often hidden from users. Current RDMA isolation methods overlook these microarchitectural resources, leading to insufficient performance isolation. Consequently, a faulty or malicious tenant can exploit these resources to compromise well-behaved tenants' network performance. In this paper, we introduce Harmonic, the first microarchitecture-resource-aware RDMA performance isolation solution for public clouds. It consists of two key components designed to be conscious of the RDMA NIC's microarchitectural resources: (1) a programmable intelligent PCIe switch (prototyped with FPGA) and (2) an RDMA-friendly rate limiter. At runtime, these two components allow us to accurately monitor and modulate each tenant's RDMA NIC resource usage. We evaluate Harmonic with a state-of-the-art RDMA performance isolation test suite (Husky) and a popular in-memory database application (Redis). We demonstrate that Harmonic not only successfully passes Husky but also provides Redis with 1.4× higher throughput than the best alternative isolation solution.

LDB: An Efficient Latency Profiling Tool for Multithreaded Applications

Inho Cho, MIT CSAIL; Seo Jin Park, University of Southern California; Ahmed Saeed, Georgia Tech; Mohammad Alizadeh and Adam Belay, MIT CSAIL

Available Media

Maintaining low tail latency is critical for the efficiency and performance of large-scale datacenter systems. Software bugs that cause tail latency problems, however, are notoriously difficult to debug. We present LDB, a new latency profiling tool that aims to overcome this challenge by precisely identifying the specific functions that are responsible for tail latency anomalies. LDB observes the latency of all functions in a running program. It uses a novel, software-only technique called stack sampling, where a busy-spinning stack scanner thread polls lightweight metadata recorded in the call stack, shifting tracing costs away from program threads. In addition, LDB uses event tagging to record requests, inter-thread synchronization, and context switching. This can be used, for example, to generate per-request timelines and to find the root cause of complex tail latency problems such as lock contention in multi-threaded programs. We evaluate LDB with three datacenter applications, finding latency problems in each. Our results further show that LDB produces actionable insights, has low overhead, and can rapidly analyze recordings, making it feasible to use in production settings.
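
The following Python-level sketch only mimics the division of labor behind stack sampling (LDB's real mechanism records lightweight metadata in the native call stack and scans it from another core): program threads do cheap bookkeeping on function entry and exit, while a separate scanner thread polls the shadow stacks and flags frames that have been live longer than a threshold.

    import collections, functools, threading, time

    shadow = collections.defaultdict(list)   # thread id -> shadow stack of (function, enter time)

    def traced(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            stack = shadow[threading.get_ident()]
            stack.append((fn.__name__, time.monotonic()))   # cheap work on the program thread
            try:
                return fn(*args, **kwargs)
            finally:
                stack.pop()
        return wrapper

    def scanner(threshold_s, period_s, stop):
        # runs on its own thread, shifting measurement cost off the program threads
        while not stop.is_set():
            now = time.monotonic()
            for tid, stack in list(shadow.items()):
                for name, enter in list(stack):
                    if now - enter > threshold_s:
                        print(f"thread {tid}: {name} on-stack for {now - enter:.3f}s")
            time.sleep(period_s)

    @traced
    def slow_handler():
        time.sleep(0.2)    # stands in for a request that hits a latency anomaly

    stop = threading.Event()
    threading.Thread(target=scanner, args=(0.05, 0.01, stop), daemon=True).start()
    slow_handler()
    stop.set()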

UFO: The Ultimate QoS-Aware Core Management for Virtualized and Oversubscribed Public Clouds

Yajuan Peng, Southern University of Science and Technology and Shenzhen Institutes of Advanced Technology, Chinese Academy of Science; Shuang Chen and Yi Zhao, Shuhai Lab, Huawei Cloud; Zhibin Yu, Shuhai Lab, Huawei Cloud, and Shenzhen Institutes of Advanced Technology, Chinese Academy of Science

Available Media

Public clouds typically adopt (1) multi-tenancy to increase server utilization; (2) virtualization to provide isolation between different tenants; and (3) oversubscription of resources to further increase resource efficiency. However, prior work focuses on optimizing only one or two of these elements and fails to bring QoS-aware multi-tenancy, virtualization, and resource oversubscription together.

We find three challenges when the three elements coexist. First, the double scheduling symptoms are 10x worse with latency-critical (LC) workloads, which consist of numerous sub-millisecond tasks and differ significantly from conventional batch applications. Second, inner-VM resource contention also exists between threads of the same VM when running LC applications, calling for inner-VM core isolation. Third, no application-level performance metrics can be obtained by the host to guide resource management in realistic public clouds.

To address these challenges, we propose a QoS-aware core manager dubbed UFO to specifically support co-location of multiple LC workloads in virtualized and oversubscribed public cloud environments. UFO solves the three above-mentioned challenges, by (1) coordinating the guest and host CPU cores (vCPU-pCPU coordination), and (2) doing fine-grained inner-VM resource isolation, to push core management in realistic public clouds to the extreme. Compared with the state-of-the-art core manager, it saves up to 50% (average of 22%) of physical cores under the same co-location scenario.

10:20 am–10:50 am

Break with Refreshments

Mezzanine

10:50 am–12:30 pm

Track 1

Programming the Network: Part 2

Session Chair: Luis Pedrosa, INESC-ID and Instituto Superior Técnico, University of Lisbon

Santa Clara Ballroom

Automatic Parallelization of Software Network Functions

Francisco Pereira, Fernando M. V. Ramos, and Luis Pedrosa, INESC-ID, Instituto Superior Técnico, University of Lisbon

Available Media

Software network functions (NFs) trade off performance for flexibility and ease of deployment. The traditional way to increase NF performance is to distribute traffic across multiple CPU cores, but this poses a significant challenge: how to parallelize an NF without breaking its semantics? We propose Maestro, a tool that analyzes a sequential implementation of an NF and automatically generates an enhanced parallel version that carefully configures the NIC's Receive Side Scaling mechanism to distribute traffic across cores while preserving semantics. When possible, Maestro orchestrates a shared-nothing architecture, with each core operating independently without shared memory coordination, maximizing performance. Otherwise, Maestro choreographs a fine-grained read-write locking mechanism that optimizes operation for typical Internet traffic. We parallelized 8 software NFs and show that they generally scale up linearly until bottlenecked by PCIe when using small packets, or by the 100 Gbps line rate with typical Internet traffic. Maestro further outperforms modern hardware-based transactional memory mechanisms, even for challenging parallel-unfriendly workloads.

AutoSketch: Automatic Sketch-Oriented Compiler for Query-driven Network Telemetry

Haifeng Sun and Qun Huang, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Jinbo Sun, Institute of Computing Technology, Chinese Academy of Sciences; Wei Wang, Northeastern University, China; Jiaheng Li, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Fuliang Li, Northeastern University, China; Yungang Bao, Institute of Computing Technology, Chinese Academy of Sciences; Xin Yao and Gong Zhang, Huawei Theory Department

Available Media

Network telemetry has recently seen tremendous progress in two directions: query-driven telemetry that targets expressiveness as the primary goal, and sketch-based algorithms that address resource-accuracy trade-offs. In this paper, we propose AutoSketch, which aims to integrate the advantages of both classes. In a nutshell, AutoSketch automatically compiles high-level operators into sketch instances that can be readily deployed with low resource usage and limited accuracy loss. However, there remains a gap between the expressiveness of high-level operators and the underlying realization of sketch algorithms. AutoSketch bridges this gap in three aspects. First, AutoSketch extends its interface, derived from existing query-driven telemetry, so that users can specify the desired telemetry accuracy. The specified accuracy intent is then used to guide the compilation procedure. Second, AutoSketch leverages various techniques, such as syntax analysis and performance estimation, to construct efficient sketch instances. Finally, AutoSketch automatically searches for the most suitable parameter configurations that fulfill the accuracy intent with minimum resource usage. Our experiments demonstrate that AutoSketch achieves high expressiveness, high accuracy, and low resource usage compared to state-of-the-art telemetry solutions.

Leo: Online ML-based Traffic Classification at Multi-Terabit Line Rate

Syed Usman Jafri, Sanjay Rao, Vishal Shrivastav, and Mohit Tawarmalani, Purdue University

Available Media

Online traffic classification enables critical applications such as network intrusion detection and prevention, providing Quality-of-Service, and real-time IoT analytics. However, with increasing network speeds, it has become extremely challenging to analyze and classify traffic online. In this paper, we present Leo, a system for online traffic classification at multi-terabit line rates. At its core, Leo implements an online machine learning (ML) model for traffic classification, namely the decision tree, in the network switch's data plane. Leo's design is fast (it can classify packets at the switch's line rate), scalable (it can automatically select a resource-efficient design for the class of decision tree models a user wants to support), and runtime programmable (the model can be updated on the fly without switch downtime), while achieving high model accuracy. We implement Leo on top of Intel Tofino switches. Our evaluations show that Leo is able to classify traffic at line rate with nominal latency overhead and can scale to model sizes more than twice as large as state-of-the-art data plane ML classification systems, while achieving classification accuracy on par with an offline traffic classifier.
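
One common way to map a decision tree onto switch match tables is to flatten it into per-leaf feature ranges, so each leaf becomes one range-match entry whose action is the predicted class. The sketch below shows that general technique (it is not necessarily Leo's exact encoding, and the features and tree are hypothetical).

    def tree_to_ranges(node, bounds=None, features=("pkt_len", "iat_us")):
        bounds = bounds or {f: [0, 2**16 - 1] for f in features}
        if "label" in node:                       # leaf: one table entry
            return [({f: tuple(b) for f, b in bounds.items()}, node["label"])]
        f, t = node["feature"], node["threshold"]
        left = {k: list(v) for k, v in bounds.items()}
        right = {k: list(v) for k, v in bounds.items()}
        left[f][1] = min(left[f][1], t)           # feature <= t
        right[f][0] = max(right[f][0], t + 1)     # feature > t
        return (tree_to_ranges(node["left"], left, features) +
                tree_to_ranges(node["right"], right, features))

    # Hypothetical two-level tree over packet length and inter-arrival time.
    tree = {"feature": "pkt_len", "threshold": 128,
            "left": {"label": "iot"},
            "right": {"feature": "iat_us", "threshold": 500,
                      "left": {"label": "video"}, "right": {"label": "bulk"}}}
    for entry in tree_to_ranges(tree):
        print(entry)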

Sequence Abstractions for Flexible, Line-Rate Network Monitoring

Andrew Johnson, Princeton University; Ryan Beckett, Microsoft Research; Xiaoqi Chen, Princeton University; Ratul Mahajan, University of Washington; David Walker, Princeton University

Available Media

We develop FLM, a high-level language that enables network operators to write programs that recognize and react to specific packet sequences. To be able to examine every packet, our compilation procedure transforms FLM programs into P4 code that can run on programmable switch ASICs. It first splits FLM programs into a state management component and a classical regular expression, then generates an efficient implementation of the regular expression using SMT-based program synthesis. Our experiments find that FLM can express 15 sequence monitoring tasks drawn from prior literature. Our compiler can convert all of these programs to run on switch hardware in a way that fits within available pipeline stages and consumes less than 15% additional header fields and instruction words when run alongside existing switch programs.

OctoSketch: Enabling Real-Time, Continuous Network Monitoring over Multiple Cores

Yinda Zhang, University of Pennsylvania; Peiqing Chen and Zaoxing Liu, University of Maryland

Available Media

Sketching algorithms (sketches) have emerged as a resource-efficient and accurate solution for software-based network monitoring. However, existing sketch-based monitoring makes sacrifices in online accuracy (query time accuracy) and performance (handling line rate traffic with low latency) when dealing with distributed traffic across multiple cores. In this work, we present OctoSketch, a software monitoring framework that can scale a wide spectrum of sketches to many cores with high online accuracy and performance. In contrast to previous systems that adopt straightforward sketch merges from individual cores to obtain the aggregated result, we devise a continuous, change-based mechanism that can generally be applied to sketches to perform the aggregation. This design ensures high online accuracy of the aggregated result at any query time and reduces computation costs to achieve high throughput. We apply OctoSketch to nine representative sketches on three software platforms (CPU, DPDK, and eBPF XDP). Our results demonstrate that OctoSketch achieves about 15.6× lower errors and up to 4.5× higher throughput than the state-of-the-art.
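
The sketch below illustrates the change-based aggregation idea under simple assumptions (a plain count-min structure and a fixed per-counter change threshold; this is not OctoSketch's implementation): each core accumulates local counter deltas and pushes one to the global aggregator only when its accumulated change is large enough, so queries never need to merge whole per-core sketches.

    import hashlib

    DEPTH, WIDTH, THRESHOLD = 3, 1024, 16

    def _idx(row, key):
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=4).digest()
        return int.from_bytes(h, "little") % WIDTH

    class Aggregator:
        def __init__(self):
            self.table = [[0] * WIDTH for _ in range(DEPTH)]

        def push(self, row, col, delta):
            self.table[row][col] += delta

        def query(self, key):
            return min(self.table[row][_idx(row, key)] for row in range(DEPTH))

    class CoreSketch:
        def __init__(self, aggregator):
            self.delta = [[0] * WIDTH for _ in range(DEPTH)]
            self.aggregator = aggregator

        def update(self, key, count=1):
            for row in range(DEPTH):
                col = _idx(row, key)
                self.delta[row][col] += count
                if self.delta[row][col] >= THRESHOLD:      # push only significant changes
                    self.aggregator.push(row, col, self.delta[row][col])
                    self.delta[row][col] = 0

    agg = Aggregator()
    cores = [CoreSketch(agg) for _ in range(4)]
    for i in range(1000):
        cores[i % 4].update("10.0.0.1->10.0.0.2")
    print(agg.query("10.0.0.1->10.0.0.2"))   # 960: 1000 minus the small deltas still buffered per core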

Track 2

Wireless Sensing

Session Chair: Chenren Xu, Peking University

Magnolia Room

NR-Surface: NextG-ready µW-reconfigurable mmWave Metasurface

Minseok Kim, Namjo Ahn, and Song Min Kim, KAIST

Available Media

Metasurface has recently emerged as an economic solution to expand mmWave coverage. However, their pervasive deployment remains a challenge, mainly due to the difficulty in reaching the tight 260ns NR synchronization requirement and real-time wireless reconfiguration while maintaining multi-year battery life. This paper presents NR-Surface, the first real-time reconfigurable metasurface fully compliant with the NR standard, operating at 242.7 µW for a 2.1-year lifetime on an AA battery. NR-Surface incorporates (i) a new extremely low-power (14KHz sampling) reconfiguration interface, NarrowBand Packet Unit (NBPU), for synchronization and real-time reconfiguration, and (ii) a highly responsive and low-leakage metasurface designed for low-duty cycled operation, by carefully leveraging the structure and the periodicity of the NR beam management procedure in the NR standard. NR-Surface is prototyped and evaluated end-to-end with NR BS built on srsRAN to demonstrate diverse usage scenarios including multiple NR-Surface per BS, multiple UE per NR-Surface, and 3D beamforming. Around-the-corner UE evaluations showcase NR-Surface efficacy under different user mobility patterns (20.3dB gain) and dynamic blockage (22.2dB gain).

Cyclops: A Nanomaterial-based, Battery-Free Intraocular Pressure (IOP) Monitoring System inside Contact Lens

Liyao Li, University at Buffalo SUNY and Northwest University; Bozhao Shang and Yun Wu, Northwest University and Shaanxi International Joint Research Centre for the Battery-Free Internet of Things; Jie Xiong, University of Massachusetts Amherst and Microsoft Research Asia; Xiaojiang Chen, Northwest University and Shaanxi International Joint Research Centre for the Battery-Free Internet of Things; Yaxiong Xie, University at Buffalo SUNY

Available Media

Intraocular pressure (IOP), commonly known as eye pressure, is a critical physiological parameter related to health. Contact lens-based IOP sensing has garnered significant attention in research. Existing research has been focusing on developing the sensor itself, so the techniques used to read sensing data only support a reading range of several centimeters, becoming the main obstacle for real-world deployment. This paper presents Cyclops, the first battery-free IOP sensing system integrated into a contact lens, which overcomes the proximity constraints of traditional reading methods. Cyclops features a three-layer antenna comprising two metallic layers and a nanomaterial-based sensing layer in between. This innovative antenna serves a dual purpose, functioning as both a pressure sensor and a communication antenna simultaneously. The antenna is connected to an RFID chip, which utilizes a low-power self-tuning circuit to achieve high-precision pressure sensing, akin to a 9-bit ADC. Extensive experimental results demonstrate that Cyclops supports communication at meter-level distances, and its IOP measurement accuracy surpasses that of commercial portable IOP measurement devices.

Habitus: Boosting Mobile Immersive Content Delivery through Full-body Pose Tracking and Multipath Networking

Anlan Zhang, University of Southern California; Chendong Wang, University of Wisconsin — Madison; Yuming Hu, University of Minnesota — Twin Cities; Ahmad Hassan and Zejun Zhang, University of Southern California; Bo Han, George Mason University; Feng Qian, University of Southern California; Shichang Xu, Google

Available Media

Delivering immersive content such as volumetric videos and virtual/mixed reality requires tremendous network bandwidth. Millimeter Wave (mmWave) radios such as 802.11ad/ay and mmWave 5G can provide multi-Gbps peak bandwidth, making them good candidates. However, mmWave is vulnerable to blockage/mobility and its signal attenuates very fast, posing a major challenge to mobile immersive content delivery systems where viewers are in constant motion and the human body may easily block the line-of-sight.

To overcome this challenge, in this paper, we investigate two under-explored dimensions. First, we use the combination of a viewer's full-body pose and network information to predict mmWave performance as the viewer exercises six-degree-of-freedom (6-DoF) motion. We apply both offline and online transfer learning to enable the prediction models to react to changes unseen during initial training. Second, we jointly use the omnidirectional radio and mmWave radio available on commodity mobile devices, which have complementary network characteristics, to deliver immersive data. We integrate the above two features into a user-space software framework called Habitus, and demonstrate how it can be easily integrated into existing immersive content delivery systems to boost their network performance, leading to up to 72% improvement in quality of experience (QoE).

BFMSense: WiFi Sensing Using Beamforming Feedback Matrix

Enze Yi and Dan Wu, Peking University; Jie Xiong, University of Massachusetts Amherst; Fusang Zhang, Institute of Software, Chinese Academy of Sciences and University of Chinese Academy of Sciences; Kai Niu, Beijing Xiaomi Mobile Software Company Ltd.; Wenwei Li, Peking University; Daqing Zhang, Peking University and Institut Polytechnique de Paris

Available Media

WiFi-based contactless sensing has attracted a tremendous amount of attention due to its pervasiveness, low-cost, and non-intrusiveness to users. Existing systems mainly leverage channel state information (CSI) for sensing. However, CSI can only be extracted from very few commodity WiFi devices through driver hacking, severely limiting the adoption of WiFi sensing in real life. We observe a new opportunity that a large range of new-generation WiFi cards can report another piece of information, i.e., beamforming feedback matrix (BFM). In this paper, we propose to leverage this new BFM information for WiFi sensing. Through establishing the relationship between BFM and CSI, we lay the theoretical foundations for BFM-based WiFi sensing for the first time. We show that through careful signal processing, BFM can be utilized for fine-grained sensing. We showcase the sensing capability of BFM using two representative sensing applications, i.e., respiration sensing and human trajectory tracking. Comprehensive experiments show that BFM-based WiFi sensing can achieve highly accurate sensing performance on a large range of new-generation WiFi devices from various manufacturers, moving WiFi sensing one big step towards real-life adoption.

mmComb: High-speed mmWave Commodity WiFi Backscatter

Yoon Chae and Zhenzhe Lin, George Mason University; Kang Min Bae and Song Min Kim, Korea Advanced Institute of Science and Technology (KAIST); Parth Pathak, George Mason University

Available Media

High-speed connectivity is key to enabling a range of novel IoT applications. Millimeter-wave (mmWave) backscatter has emerged as a possible solution to create high-speed, low-power IoT networks. However, state-of-the-art mmWave backscatter systems are costly due to the need for dedicated mmWave reader devices. This paper presents mmComb, a mmWave backscatter system built to operate with commodity mmWave WiFi. mmComb is developed with the aim that mmWave backscatter tags can be directly integrated into 802.11ad/ay mmWave WiFi networks. mmComb makes two key contributions. First, we propose a technique to communicate with backscatter tags using existing beamforming protocol frames from mmWave WiFi devices, without any protocol modification. Second, we develop a self-interference suppression solution that intelligently uses receive beamforming to extract the weak mmWave backscatter signal even in indoor multipath-rich channels. We implement our solution with a tag prototype and 60 GHz commodity WiFi devices. Our results show that mmComb can achieve a maximum data rate of 55 Mbps just by leveraging 802.11ad/ay control frames, while consuming 87.3 μW with a BER lower than 10^−3 at up to 5.5 m range.

12:30 pm–2:00 pm

Lunch (on your own)

2:00 pm–3:40 pm

Track 1

Security

Session Chair: Sebastian Angel, University of Pennsylvania

Santa Clara Ballroom

Where The Wild Things Are: Brute-Force SSH Attacks In The Wild And How To Stop Them

Sachin Kumar Singh and Shreeman Gautam, University of Utah; Cameron Cartier, University of Utah and Black Hills Information Security; Sameer Patil and Robert Ricci, University of Utah

Available Media

SSH (Secure Shell) is widely used for remote access to systems and cloud services. This access comes with the persistent threat of SSH password-guessing brute-force attacks (BFAs) directed at sshd-enabled devices connected to the Internet. In this work, we present a comprehensive study of such attacks on a production facility (CloudLab), offering previously unreported insight. Our study provides a detailed analysis of SSH BFAs occurring on the Internet today through an in-depth analysis of sshd logs collected over a period of four years from over 500 servers. We report several patterns in attacker behavior, present insight on the targets of the attacks, and devise a method for tracking individual attacks over time across sources. Leveraging our insight, we develop a defense mechanism against SSH BFAs that blocks 99.5% of such attacks, significantly outperforming the 66.1% coverage of current state-of-the-art rate-based blocking while also cutting false positives by 83%. We have deployed our defense in production on CloudLab, where it catches four-fifths of SSH BFAs missed by other defense strategies.

A System to Detect Forged-Origin BGP Hijacks

Thomas Holterbach and Thomas Alfroy, University of Strasbourg; Amreesh Phokeer, Internet Society; Alberto Dainotti, Georgia Tech; Cristel Pelsser, UCLouvain

Available Media

Despite global efforts to secure Internet routing, attackers still successfully exploit the lack of strong BGP security mechanisms. This paper focuses on a frequently used attack vector: forged-origin hijacks, a type of BGP hijack in which the attacker manipulates the AS path to make the announcement immune to RPKI-ROV filters and appear as a legitimate routing update from a BGP monitoring standpoint. Our contribution is DFOH, a system that quickly and consistently detects forged-origin hijacks across the whole Internet. Detecting forged-origin hijacks boils down to inferring whether the AS path in a BGP route is legitimate or has been manipulated. We demonstrate that current state-of-the-art approaches to detecting BGP anomalies are insufficient to deal with forged-origin hijacks. We identify the key properties that make the inference of forged AS paths challenging, and design DFOH to be robust against real-world factors. Our inference pipeline includes two key ingredients: (i) a set of strategically selected features, and (ii) a training scheme adapted to topological biases. DFOH detects 90.9% of forged-origin hijacks within only ≈5 min. In addition, it reports only ≈17.5 suspicious cases per day for the whole Internet, a number small enough for operators to investigate the reported cases and take countermeasures.

NetVigil: Robust and Low-Cost Anomaly Detection for East-West Data Center Security

Kevin Hsieh, Microsoft; Mike Wong, Princeton University and Microsoft; Santiago Segarra, Microsoft and Rice University; Sathiya Kumaran Mani, Trevor Eberl, and Anatoliy Panasyuk, Microsoft; Ravi Netravali, Princeton University; Ranveer Chandra and Srikanth Kandula, Microsoft

Available Media

The growing number of breaches in data centers underscores an urgent need for more effective security. Traditional perimeter defense measures and static zero-trust approaches are unable to address the unique challenges that arise from the scale, complexity, and evolving nature of today's data center networks. To tackle these issues, we introduce NetVigil, a robust and cost-efficient anomaly detection system specifically designed for east-west traffic within data center networks. NetVigil adeptly extracts security-focused, graph-based features from network flow logs and employs domain-specific graph neural networks (GNNs) and contrastive learning techniques to strengthen its resilience against normal traffic variations and adversarial evasion strategies. Our evaluation, over various attack scenarios and traces from real-world production clusters, shows that NetVigil delivers significant improvements in accuracy, cost, and detection latency compared to state-of-the-art anomaly detection systems, providing a practical, supplementary security mechanism to protect the east-west traffic within data center networks.

TANGO: Secure Collaborative Route Control across the Public Internet

Henry Birge-Lee, Sophia Yoo, Benjamin Herber, Jennifer Rexford, and Maria Apostolaki, Princeton University

Available Media

As the demands of modern latency-critical applications grow, major service providers are seeking to meet those demands by expanding their infrastructure to the edge and offering global connectivity through private WANs or Network-as-a-Service solutions. Unfortunately, these approaches are costly for smaller edge networks and lead to Internet consolidation. Worse, since the public Internet suffers from limited visibility and control over interdomain routing, smaller edges today are left with poor alternatives outside of joining the hypergiants. As a new alternative, we introduce TANGO, which enables smaller edges to expose paths and exert route control over the public Internet without relying on third parties or cooperation from the Internet core, to dynamically meet the performance needs of their customers. We show that, using collaboration, TANGO edges can jointly (i) expose more BGP-compliant wide-area paths via coordinated BGP advertisements; (ii) collect fine-grained, trustworthy telemetry using cryptographically-protected custom headers; and (iii) dynamically reroute traffic in the data plane. TANGO innovates in both the control and data planes, and runs on a programmable switch or in eBPF. Our Internet-scale experiments uncover rich path diversity, exposing paths that outperform the default BGP path 75-100% of the time for 20 edge pairs across multiple continents, while reducing latency by up to 39% compared to the default.

Sidekick: In-Network Assistance for Secure End-to-End Transport Protocols

Gina Yuan, Matthew Sotoudeh, and David K. Zhang, Stanford University; Michael Welzl, University of Oslo; David Mazières and Keith Winstein, Stanford University
Outstanding Paper Award and Community Award Winner!

Available Media

In response to concerns about protocol ossification and privacy, post-TCP transport protocols such as QUIC and WebRTC include end-to-end encryption and authentication at the transport layer. This makes their packets opaque to middleboxes, freeing the transport protocol to evolve but preventing some in-network innovations and performance improvements. This paper describes sidekick protocols: an approach to in-network assistance for opaque transport protocols in which in-network intermediaries help endpoints by sending information adjacent to the underlying connection, which remains opaque and unmodified on the wire.

A key technical challenge is how the sidekick connection can efficiently refer to ranges of packets of the underlying connection without the ability to observe cleartext sequence numbers. We present a mathematical tool called a quACK that concisely represents a selective acknowledgment of opaque packets, without access to cleartext sequence numbers.
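
A toy version of this idea, under simplifying assumptions (pseudorandom packet identifiers, exactly one missing packet, and only a count plus a modular sum rather than the higher power sums of the full construction), looks like this:

    P = (1 << 61) - 1          # a Mersenne prime for modular arithmetic

    class QuackReceiver:
        def __init__(self):
            self.count = 0
            self.sum = 0

        def on_packet(self, pkt_id):
            self.count += 1
            self.sum = (self.sum + pkt_id) % P

    def decode_single_loss(sent_ids, quack):
        assert len(sent_ids) - quack.count == 1, "this sketch decodes exactly one loss"
        return (sum(sent_ids) - quack.sum) % P

    # Example: the proxy misses packet 0xBEEF; the sender recovers it from the quACK.
    sent = [0x1111, 0xBEEF, 0x2222]
    rx = QuackReceiver()
    for pid in sent:
        if pid != 0xBEEF:
            rx.on_packet(pid)
    print(hex(decode_single_loss(sent, rx)))   # -> 0xbeef

The real quACK keeps several power sums so that multiple missing packets can be decoded by root-finding, but the single-loss case already shows why no cleartext sequence numbers are needed.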

In real-world and emulation-based evaluations, the sidekick improved performance in several scenarios: early retransmission over lossy Wi-Fi paths, proxy acknowledgments to save energy, and a path-aware congestion-control mechanism we call PACUBIC that emulates a "split" connection.

Track 2

Mobile Things

Session Chair: Dave Oran, Network Systems Research & Design

Magnolia Room

VILAM: Infrastructure-assisted 3D Visual Localization and Mapping for Autonomous Driving

Jiahe Cui, Beihang University, The Chinese University of Hong Kong, and Tianmushan Laboratory; Shuyao Shi and Yuze He, The Chinese University of Hong Kong; Jianwei Niu, Beihang University; Guoliang Xing, The Chinese University of Hong Kong; Zhenchao Ouyang, Tianmushan Laboratory and International Innovation Institute of Beihang University

Available Media

Visual Simultaneous Localization and Mapping (SLAM) presents a promising avenue for fulfilling the essential perception and localization tasks in autonomous driving systems using cost-effective visual sensors. Nevertheless, existing visual SLAM frameworks often suffer from substantial cumulative errors and performance degradation in complicated driving scenarios. In this paper, we propose VILAM, a novel framework that leverages intelligent roadside infrastructure to realize high-precision and globally consistent localization and mapping on autonomous vehicles. The key idea of VILAM is to utilize precise scene measurements from the infrastructure as global references to correct errors in the local map constructed by the vehicle. To overcome the unique deformation of the 3D local map and align it with the infrastructure measurement, VILAM proposes a novel elastic point cloud registration method that enables independent optimization of different parts of the local map. Moreover, VILAM adopts lightweight factor graph construction and optimization to first correct the vehicle trajectory and then reconstruct a consistent global map efficiently. We implement VILAM end-to-end on a real-world smart lamppost testbed in multiple road scenarios. Extensive experimental results show that VILAM achieves decimeter-level localization and mapping accuracy with consumer-level onboard cameras and is robust under diverse road scenarios. A video demo of VILAM on our real-world testbed is available at https://youtu.be/lTlqDNipDVE.

Catch Me If You Can: Laser Tethering with Highly Mobile Targets

Charles J. Carver, Hadleigh Schwartz, and Qijia Shao, Columbia University; Nicholas Shade, Joseph Lazzaro, Xiaoxin Wang, Jifeng Liu, and Eric Fossum, Dartmouth College; Xia Zhou, Columbia University

Available Media

Conventional wisdom holds that laser-based systems cannot handle high mobility due to the strong directionality of laser light. We challenge this belief by presenting Lasertag, a generic framework that tightly integrates laser steering with optical tracking to maintain laser connectivity with high-velocity targets. Lasertag creates a constantly connected, laser-based tether between the Lasertag core unit and a remote target, irrespective of the target's movement. Key elements of Lasertag include (1) a novel optical design that superimposes the optical paths of a steerable laser beam and an image sensor, (2) a lightweight optical tracking mechanism for passive retroreflective markers, (3) an automated mapping method to translate scene points to laser steering commands, and (4) a predictive steering algorithm that overcomes limited image sensor frame rates and laser steering delays to quadruple the steering rate, up to 151 Hz. Experiments with the Lasertag prototype demonstrate that Lasertag delivers a median fraction of 0.97 of the laser energy, with a median alignment offset of only 1.03 cm, for mobile targets accelerating at up to 49 m/s^2, moving at speeds up to 6.5 m/s, and at distances up to 6 m (≈ 47°/s). Additional experiments translate the above performance to a 10^-8 median bit error rate across trials when transmitting a 1 Gbps on-off keying signal. Lasertag paves the way for various laser applications (e.g., communication, sensing, power delivery) in mobile settings. A demonstration video of Lasertag is available at: mobilex.cs.columbia.edu/lasertag

MobileConfig: Remote Configuration Management for Mobile Apps at Hyperscale

Matt Guo, Meta Platforms; Soteris Demetriou, Imperial College London; Joey Yang, Michael Leighton, Diedi Hu, Tong Bao, Amit Adhikari, Thawan Kooburat, Annie Kim, and Chunqiang Tang, Meta Platforms

Available Media

While software configuration management is a ubiquitous practice in the industry and has been extensively studied, prior research has focused solely on desktop or server applications. This paper presents MobileConfig, perhaps the world's largest configuration management system for mobile apps. It has been in production since 2015 and manages apps running on billions of devices, including Facebook, Instagram, Messenger, and AR/VR/glasses apps. Every day, Meta's developers make a staggering number of live configuration changes, often in the thousands, to remotely control mobile apps, driving them to change runtime behaviors without requiring app code updates. These configuration changes serve diverse purposes such as A/B testing, feature rollout, and app personalization. We discuss how MobileConfig addresses several challenges unique to mobile environments, including (1) the lack of data consistency models that can simultaneously ensure both fast app startup and configuration data freshness; (2) the risk of misconfiguration impacting billions of app users; and (3) the proliferation of mobile client SDKs needed to support diverse mobile platforms, programming languages, and configuration use cases.

Passengers' Safety Matters: Experiences of Deploying a Large-Scale Indoor Delivery Monitoring System

Xiubin Fan, City University of Hong Kong; Zhongming Lin, The Hong Kong University of Science and Technology; Yuming Hu, University of Minnesota - Twin Cities; Tianrui Jiang, The Hong Kong University of Science and Technology; Feng Qian, University of Southern California; Zhimeng Yin, City University of Hong Kong; S.-H. Gary Chan, The Hong Kong University of Science and Technology; Dapeng Wu, City University of Hong Kong

Available Media

Delivering goods to many indoor stores poses significant safety issues, as heavy, high-stacked packages carried on delivery trolleys may fall and hurt passersby. This paper reports our experiences developing and operating DeMo, a practical system for real-time monitoring of indoor deliveries. DeMo attaches sensors to trolleys and analyzes Inertial Measurement Unit (IMU) and Bluetooth Low Energy (BLE) readings to detect delivery violations such as speeding and using non-designated delivery paths. Unlike typical indoor localization applications, DeMo must overcome challenges unique to this setting, such as its sensor placement and the complex electromagnetic characteristics of underground environments. In particular, DeMo adapts the classical logarithmic radio signal model to support fingerprint-free localization, drastically lowering deployment and maintenance costs. DeMo has been operating since May 2020, covering more than 200 shops with 42,248 deliveries (3521.4 km) across 12 subway stations in Hong Kong. Over its three years of operation, DeMo has witnessed a significant drop in the violation rate, from 19% (May 2020) to 2.7% (March 2023).
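
For reference, the classical log-distance path-loss model that such fingerprint-free ranging builds on looks like this (the constants below are illustrative defaults, not DeMo's calibrated parameters): received signal strength falls off logarithmically with distance, so a beacon's RSSI gives a rough range estimate without site-wide fingerprinting.

    import math

    def rssi_at(distance_m, rssi_1m=-45.0, n=2.2):
        """RSSI (dBm) predicted at distance_m; n is the path-loss exponent."""
        return rssi_1m - 10.0 * n * math.log10(distance_m)

    def distance_from(rssi_dbm, rssi_1m=-45.0, n=2.2):
        """Invert the model to estimate range from a measured RSSI."""
        return 10 ** ((rssi_1m - rssi_dbm) / (10.0 * n))

    print(round(distance_from(rssi_at(5.0)), 2))   # -> 5.0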

AUGUR: Practical Mobile Multipath Transport Service for Low Tail Latency in Real-Time Streaming

Yuhan Zhou, School of Computer Science, Peking University and Tencent Inc.; Tingfeng Wang, Tencent Inc.; Liying Wang, School of Computer Science, Peking University; Nian Wen, Rui Han, Jing Wang, Chenglei Wu, Jiafeng Chen, and Longwei Jiang, Tencent Inc.; Shibo Wang, Xi'an Jiaotong University and Tencent Inc.; Honghao Liu, Tencent Inc.; Chenren Xu, School of Computer Science, Peking University and Zhongguancun Laboratory and Key Laboratory of High Confidence Software Technologies, Ministry of Education (PKU)

Available Media

Real-time streaming applications like cloud gaming require consistently low latency, even at the tail. Our large-scale measurement based on a major cloud gaming service provider reveals that in Wi-Fi networks, the delay of the wireless hop can inflate due to its fluctuating nature, making it difficult to achieve consistently low tail latency. While cellular paths can be leveraged to alleviate the impact of wireless fluctuation of Wi-Fi paths, our user study reveals that it is crucial to constrain cellular data usage while using multipath transport. In this paper, we present AUGUR, a multipath transport service designed to reduce long tail latency and video frame stall rates in mobile real-time streaming. To address the challenge of reducing long tail latency by utilizing cellular paths while minimizing cellular data usage, AUGUR captures user characteristics by deriving state probability models and formulates the equilibrium into Integer Linear Programming (ILP) problems for each user session to determine the opportunity of frame retransmission and path selection. Our trace-driven emulation and large-scale real-world deployment in a Tencent Start cloud gaming platform demonstrate that AUGUR achieves up to 66.0% reduction in tail latency and 99.5% reduction in frame stall rate with 88.1% decrease in cellular data usage compared to other multipath transport schemes.

3:40 pm–4:10 pm

Break with Refreshments

Mezzanine

4:10 pm–5:30 pm

Track 1

Cloud Systems

Session Chair: Irene Zhang, Microsoft Research

Santa Clara Ballroom

Zombie: Middleboxes that Don’t Snoop

Collin Zhang, Cornell; Zachary DeStefano, Arasu Arun, and Joseph Bonneau, NYU; Paul Grubbs, University of Michigan; Michael Walfish, NYU

Available Media

Zero-knowledge middleboxes (ZKMBs) are a recent paradigm in which clients get privacy and middleboxes enforce policy: clients prove in zero knowledge that the plaintext underlying their encrypted traffic complies with network policies, such as DNS filtering. However, prior work had impractically poor performance and was limited in functionality.

This work presents Zombie, the first system built using the ZKMB paradigm. Zombie introduces techniques that push ZKMBs to the verge of practicality: preprocessing (to move the bulk of proof generation to idle times between requests), asynchrony (to remove proving and verifying costs from the critical path), and batching (to amortize some of the verification work). Zombie's choices, together with these techniques, reduce client and middlebox overhead by ≈3.5×, lowering the critical-path overhead for a DNS filtering application on commodity hardware to less than 300 ms or, in the asynchronous configuration, to zero.

As an additional contribution that is likely of independent interest, Zombie introduces a portfolio of techniques to encode regular expressions in probabilistic (and zero-knowledge) proofs. These techniques significantly improve performance over a standard baseline, asymptotically and concretely. Zombie builds on this portfolio to support policies based on regular expressions, such as data loss prevention.
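A common way to make regular-expression matching checkable step by step, and hence amenable to constraint-based proof systems, is to compile the regex to a DFA and verify a per-character transition trace. The plain-Python sketch below illustrates only that trace check; it is not Zombie's zero-knowledge encoding, and the toy DFA and names are hypothetical.

def check_dfa_trace(transitions, start, accepting, inp, trace):
    """Verify that `trace` is a valid run of the DFA on `inp` ending in an accepting state.
    Each step is a local check, which is what makes this shape easy to express as
    per-row arithmetic constraints in a proof system."""
    if not trace or trace[0] != start or len(trace) != len(inp) + 1:
        return False
    for ch, s, s_next in zip(inp, trace, trace[1:]):
        if transitions.get((s, ch)) != s_next:
            return False
    return trace[-1] in accepting

# Toy DFA for the regex "ab*": state 0 is the start, state 1 (seen 'a') is accepting.
T = {(0, "a"): 1, (1, "b"): 1}
inp = "abb"
trace = [0, 1, 1, 1]   # the "witness": the state after each character
print(check_dfa_trace(T, 0, {1}, inp, trace))  # True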

Solving Max-Min Fair Resource Allocations Quickly on Large Graphs

Pooria Namyar, Microsoft and University of Southern California; Behnaz Arzani and Srikanth Kandula, Microsoft; Santiago Segarra, Microsoft and Rice University; Daniel Crankshaw and Umesh Krishnaswamy, Microsoft; Ramesh Govindan, University of Southern California; Himanshu Raj, Microsoft

Available Media

We consider the max-min fair resource allocation problem. The best-known solutions use either a sequence of optimizations or waterfilling, which only applies to a narrow set of cases. These solutions have become a practical bottleneck in WAN traffic engineering and cluster scheduling, especially at larger problem sizes. We improve both approaches: (1) we show how to convert the optimization sequence into a single fast optimization, and (2) we generalize waterfilling to the multi-path case. We empirically show our new algorithms Pareto-dominate prior techniques: they produce faster, fairer, and more efficient allocations. Some of our allocators also have theoretical guarantees: they trade off a bounded amount of unfairness for faster allocation. We have deployed our allocators in Azure's WAN traffic engineering pipeline, where we preserve solution quality and achieve a roughly 3× speedup.
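For context, the sketch below implements the classical single-path waterfilling (progressive filling) baseline that the paper generalizes and speeds up: all unfrozen flows grow at the same rate until some link saturates, at which point the flows crossing it are frozen. Link names, paths, and capacities are illustrative.

def waterfill(capacity, flows):
    """Progressive-filling max-min fair rates for single-path flows.
    capacity: {link: cap}; flows: {flow: [links it traverses]}."""
    rate = {f: 0.0 for f in flows}
    frozen = set()
    residual = dict(capacity)
    while len(frozen) < len(flows):
        # How many still-growing flows cross each link?
        active_on = {l: sum(1 for f, path in flows.items()
                            if f not in frozen and l in path) for l in residual}
        # Largest equal increment before some link with active flows saturates.
        inc = min(residual[l] / n for l, n in active_on.items() if n > 0)
        for f, path in flows.items():
            if f not in frozen:
                rate[f] += inc
                for l in path:
                    residual[l] -= inc
        # Freeze flows that now cross a saturated link.
        saturated = {l for l, r in residual.items() if r <= 1e-9}
        frozen |= {f for f, path in flows.items() if set(path) & saturated}
    return rate

caps = {"L1": 10.0, "L2": 6.0}
paths = {"A": ["L1"], "B": ["L1", "L2"], "C": ["L2"]}
print(waterfill(caps, paths))  # {'A': 7.0, 'B': 3.0, 'C': 3.0}

In the toy example, B and C are capped at 3 by the shared 6-unit link L2, and A then absorbs the remaining 7 units of L1, which is exactly the max-min fair allocation.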

Cloud-LoRa: Enabling Cloud Radio Access LoRa Networks Using Reinforcement Learning Based Bandwidth-Adaptive Compression

Muhammad Osama Shahid, Daniel Koch, Jayaram Raghuram, and Bhuvana Krishnaswamy, University of Wisconsin-Madison; Krishna Chintalapudi, Microsoft Research; Suman Banerjee, University of Wisconsin-Madison

Available Media

The Cloud Radio Access Network (CRAN) architecture has been proposed as a way of addressing the network throughput and scalability challenges of large-scale LoRa networks. CRANs can improve network throughput by coherently combining signals, and scale to multiple channels by implementing the receivers in the cloud. However, in remote LoRa deployments, a CRAN's demand for high backhaul bandwidth can be challenging to meet. Therefore, bandwidth-aware compression of LoRa samples is needed to reap the benefits of CRANs. We introduce Cloud-LoRa, the first practical CRAN for LoRa, which can detect sub-noise LoRa signals and perform bandwidth-adaptive compression. To the best of our knowledge, this is the first demonstration of a CRAN for LoRa operating in real time. We deploy Cloud-LoRa in an agricultural field over multiple days with a USRP as the gateway. A cellular backhaul hotspot is then used to stream the compressed samples to a Microsoft Azure server. We demonstrate SNR gains of over 6 dB using joint multi-gateway decoding and over 2× throughput improvement using state-of-the-art receivers, enabled by CRAN in real-world deployments.
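As a rough illustration of bandwidth-aware compression of IQ samples, the sketch below quantizes complex samples at a bit depth chosen from the available backhaul rate. It is a fixed-rule stand-in, not the paper's reinforcement-learning-based scheme; the sample rate and function names are assumptions.

import numpy as np

def compress_iq(samples, backhaul_kbps, sample_rate_hz=125_000):
    """Quantize complex IQ samples with a bit depth that fits the available backhaul."""
    # Bits available per I/Q component (each complex sample has two components).
    bits_per_component = max(2, int(backhaul_kbps * 1000 // (sample_rate_hz * 2)))
    levels = 2 ** bits_per_component
    scale = np.max(np.abs(np.concatenate([samples.real, samples.imag]))) or 1.0
    q = lambda x: np.round((x / scale) * (levels // 2 - 1)).astype(np.int32)
    return q(samples.real), q(samples.imag), scale, bits_per_component

iq = (np.random.randn(1024) + 1j * np.random.randn(1024)).astype(np.complex64)
i_q, q_q, scale, bits = compress_iq(iq, backhaul_kbps=2_000)
print(bits, "bits per I/Q component")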

Cloudy with a Chance of Cyberattacks: Dangling Resources Abuse on Cloud Platforms

Jens Frieß, National Research Center for Applied Cybersecurity ATHENE and Technische Universität Darmstadt; Tobias Gattermayer, National Research Center for Applied Cybersecurity ATHENE and Fraunhofer Institute for Secure Information Technology SIT; Nethanel Gelernter, IONIX; Haya Schulmann, Goethe-Universität Frankfurt and National Research Center for Applied Cybersecurity ATHENE; Michael Waidner, National Research Center for Applied Cybersecurity ATHENE and Technische Universität Darmstadt and Fraunhofer Institute for Secure Information Technology SIT

Available Media

Recent works showed that it is feasible to hijack resources on cloud platforms. In such hijacks, attackers can take over released resources that belong to legitimate organizations. It was proposed that adversaries could abuse these resources to carry out attacks against customers of the hijacked services, e.g., through malware distribution. However, to date, no research has confirmed the existence of these attacks.

We identify, for the first time, real-life hijacks of cloud resources. This yields a number of surprising and important insights. First, contrary to the previous assumption that attackers primarily target IP addresses, our findings reveal that the type of resource is not the main consideration in a hijack. Attackers focus on hijacking records that let them specify the resource as free text. The costs and overhead of hijacking such records are much lower than those of hijacking IP addresses, which are randomly selected from a large pool.

Second, identifying hijacks poses a substantial challenge. Monitoring resource changes, e.g., changes in content, is insufficient, since such changes could also be legitimate. Retrospective analysis of digital assets to identify hijacks is also arduous due to the immense volume of data involved and the absence of indicators to search for. To address this challenge, we develop a novel approach that involves analyzing data from diverse sources to effectively differentiate between malicious and legitimate modifications. Our analysis has revealed 20,904 instances of hijacked resources on popular cloud platforms. While some hijacks are short-lived (up to 15 days), 1/3 persist for more than 65 days.

We study how attackers abuse the hijacked resources and find that, in contrast to the threats considered in previous work, the majority of the abuse (75%) is blackhat search engine optimization. We also find fraudulent certificates and stolen cookies. We cluster the abused resources and abuse content to identify about 1,800 distinct attack infrastructures.

Track 2

Modeling Networks

Session Chair: Robert Ricci, University of Utah

Magnolia Room

CAPA: An Architecture For Operating Cluster Networks With High Availability

Bingzhe Liu, UIUC; Colin Scott, Mukarram Tariq, Andrew Ferguson, Phillipa Gill, Richard Alimi, Omid Alipourfard, Deepak Arulkannan, Virginia Jean Beauregard, and Patrick Conner, Google; P. Brighten Godfrey, UIUC; Xander Lin, Joon Ong, Mayur Patel, Amr Sabaa, Arjun Singh, Alex Smirnov, Manish Verma, Prerepa V Viswanadham, and Amin Vahdat, Google

Available Media

Management operations are a major source of outages for networks. A number of best practices designed to reduce and mitigate such outages are well known, but their enforcement has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps which repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the systems where strict adherence to best practices is required, so that availability of the network is not dependent on the good intentions of every engineer and operator. We enumerate the features of CAPA which we have found to be necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counter-factual analysis of past incidents, and known limitations. Management-plane-related outages have been substantially reduced in both frequency and severity, with an 82% reduction in the cumulative duration of incidents, normalized to fleet size, over five years.
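As a very loose illustration of what a thin regulation layer can look like, the sketch below gates a management-operation intent on a few generic best-practice checks (blast radius, rollback plan, canarying). The checks, fields, and thresholds are hypothetical and are not CAPA's actual policy set.

from dataclasses import dataclass

@dataclass
class OperationIntent:
    target_devices: list      # devices the operation will touch
    fleet_size: int           # devices in the affected domain
    has_rollback_plan: bool
    canaried: bool            # was it validated on a small slice first?

def regulate(intent, max_blast_radius=0.05):
    """Return the list of violated best practices; empty means the op may proceed."""
    violations = []
    if len(intent.target_devices) / intent.fleet_size > max_blast_radius:
        violations.append("blast radius exceeds limit")
    if not intent.has_rollback_plan:
        violations.append("missing rollback plan")
    if not intent.canaried:
        violations.append("operation not canaried")
    return violations

op = OperationIntent(target_devices=["sw-%d" % i for i in range(40)],
                     fleet_size=500, has_rollback_plan=True, canaried=False)
print(regulate(op))  # ['blast radius exceeds limit', 'operation not canaried']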

NetAssistant: Dialogue Based Network Diagnosis in Data Center Networks

Haopei Wang, Anubhavnidhi Abhashkumar, Changyu Lin, Tianrong Zhang, Xiaoming Gu, Ning Ma, Chang Wu, Songlin Liu, Wei Zhou, Yongbin Dong, Weirong Jiang, and Yi Wang, ByteDance Inc

Available Media

In large-scale data center networks, answering network diagnosis queries from users still relies heavily on manual oncall services. A widespread scenario is when network users query whether any network issue is causing problems with their services/applications. However, this approach requires extensive experience and considerable effort from network engineers, who must repeatedly comb through numerous monitoring dashboards and logs. It is notoriously slow, error-prone, and costly. We ask: is this the right solution, given the state of the art in network intelligence?

To answer this, we first extensively study thousands of real network diagnosis cases and provide insights into how to address these issues more efficiently. We then propose an AI-enabled diagnosis framework and instantiate it in a task-oriented, dialogue-based diagnosis system, or colloquially, a chatbot, called NetAssistant. It accepts questions in natural language and performs the appropriate diagnosis workflows in a timely manner. NetAssistant has been deployed in our company's data centers for more than three years and handles hundreds of queries every day. We show that it significantly decreases the number and duration of human-involved oncalls. We share our experience making it reliable and trustworthy and showcase how it helps solve real production issues efficiently.

Klonet: an Easy-to-Use and Scalable Platform for Computer Networks Education

Tie Ma, Long Luo, and Hongfang Yu, University of Electronic Science and Technology of China; Xi Chen, Southwest Minzu University; Jingzhao Xie, Chongxi Ma, Yunhan Xie, Gang Sun, and Tianxi Wei, University of Electronic Science and Technology of China; Li Chen, Zhongguancun Laboratory; Yanwei Xu and Nicholas Zhang, Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd.

Available Media

Currently, one of the simplest and most effective ways to gain an in-depth understanding of computer networks is through hands-on practice and experimentation on software platforms. Although such education is important to the field, existing platforms fall short in usability and scalability and fail to fully meet the teaching needs of computer networking education.

This paper describes our experiences in designing and using Klonet, an emulation platform for computer networking education. Klonet is easy to use for both students and tutors: it has been carefully designed to lower the barrier to entry and make hands-on practice more efficient. Klonet also demonstrates good scalability. It adopts a container-based distributed architecture and a virtual network embedding algorithm customized for the platform. Evaluation experiments show that Klonet scales better than existing platforms, supporting more students with fewer hardware resources (i.e., servers) and deploying virtual network topologies more quickly. Furthermore, to ensure stability during teaching, Klonet enhances the robustness of its upper orchestrator and underlying virtual networks. So far, Klonet has been adopted in 3 universities and 4 courses, serving more than 800 students. We showcase Klonet's usefulness in networking education with real use cases, including a scenario with ~10,000 emulated routers. We also share the lessons learned from 4 years of Klonet development and 2 years of operation.
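To give a flavor of what a virtual network embedding step involves, the sketch below greedily places emulated nodes onto servers by remaining capacity. It is a simple first-fit-decreasing stand-in, not Klonet's customized embedding algorithm; node and server names are illustrative.

def embed(virtual_nodes, servers):
    """Greedy embedding of emulated nodes onto servers.
    virtual_nodes: {node: cpu_demand}; servers: {server: cpu_capacity}."""
    placement, residual = {}, dict(servers)
    for node, demand in sorted(virtual_nodes.items(), key=lambda kv: -kv[1]):
        # Place each node on the server with the most remaining capacity.
        server = max(residual, key=residual.get)
        if residual[server] < demand:
            raise RuntimeError(f"cannot place {node}: demand {demand} exceeds capacity")
        placement[node] = server
        residual[server] -= demand
    return placement

print(embed({"r1": 2, "r2": 1, "h1": 1, "h2": 4}, {"srv1": 4, "srv2": 4}))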

ExChain: Exception Dependency Analysis for Root Cause Diagnosis

Ao Li, Carnegie Mellon University; Shan Lu, Microsoft Research and University of Chicago; Suman Nath, Microsoft Research; Rohan Padhye and Vyas Sekar, Carnegie Mellon University

Available Media

Many failures in large-scale online services stem from incorrect handling of exceptions. We focus on exception-handling failures characterized by three features that make them difficult to diagnose using classical techniques: (1) implicit dependencies across multiple exceptions due to state changes; (2) silent handling in code without logging; and (3) separation (in code and in time) between the root-cause exception and the failure manifestation. In this paper, we present the design and implementation of ExChain, a framework that helps developers diagnose such exception-dependent failures in test/canary deployment environments. ExChain constructs causal links between exceptions even in the presence of the aforementioned factors. Our key observation is that mishandled exceptions invariably modify critical system states, which impact downstream functions. A key challenge in tracking these states is balancing the tradeoff between performance overhead and accuracy. To this end, ExChain uses state-impact analysis to establish potential causal links between exceptions and a novel hybrid taint tracking approach to track state propagation. Using ExChain, we successfully identified the root cause of 8 out of 11 reported subtle exception-dependent failures in 10 popular applications. ExChain significantly outperforms state-of-the-art approaches while producing several orders of magnitude fewer false positives. ExChain also offers significantly better accuracy-performance tradeoffs relative to baseline static/dynamic analysis alternatives.
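As a toy illustration of linking exceptions through the state they touch, the sketch below records the fields each handled exception writes and links a later exception to earlier ones that wrote a field it reads. This mirrors the state-impact idea only at a very high level; the event structure and names are hypothetical, not ExChain's actual analysis.

from dataclasses import dataclass, field

@dataclass
class ExceptionEvent:
    name: str
    reads: set = field(default_factory=set)   # state fields read before raising
    writes: set = field(default_factory=set)  # state fields modified by the handler

def causal_links(events):
    """Return (upstream, downstream) pairs where a write taints a later read."""
    links, tainted_by = [], {}          # field -> last event that wrote it
    for ev in events:
        for f in ev.reads:
            if f in tainted_by:
                links.append((tainted_by[f].name, ev.name))
        for f in ev.writes:
            tainted_by[f] = ev
    return links

events = [
    ExceptionEvent("ConfigLoadError", writes={"config.cache"}),     # silently handled
    ExceptionEvent("NullPointerException", reads={"config.cache"}), # the visible failure
]
print(causal_links(events))  # [('ConfigLoadError', 'NullPointerException')]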