OSDI '23 Technical Sessions

All the times listed below are in Pacific Daylight Time (PDT).

Monday, July 10

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:00 am

OSDI '23 and USENIX ATC '23 Joint Keynote Address

Title TBA

Ion Stoica, University of California, Berkeley

10:00 am–10:30 am

Break with Refreshments

10:30 am–10:45 am

Opening Remarks and Awards

Roxana Geambasu, Columbia University, and Ed Nightingale, Apple

10:45 am–12:10 pm

Make Your Bits Go Faster

Ship your Critical Section Not Your Data: Enabling Transparent Delegation with TCLocks

Vishal Gupta, EPFL; Kumar Kartikeya Dwivedi, SRM University; Yugesh Kothari, Yueyang Pan, and Sanidhya Kashyap, EPFL

Today's high-performance applications heavily rely on various synchronization mechanisms, such as locks. While locks ensure mutual exclusion of shared data, their design impacts application scalability. Locks, as used in practice, move the lock-guarded shared data to the core holding it, which leads to shared data transfer among cores. This design adds unavoidable critical path latency leading to performance scalability issues. Meanwhile, some locks avoid this shared data movement by localizing the access to shared data on one core, and shipping the critical section to that specific core. However, such locks require modifying applications to explicitly package the critical section, which makes it virtually infeasible for complicated applications with large code-bases, such as the Linux kernel.

We propose transparent delegation, in which a waiter automatically encodes its critical section information on its stack and notifies the waiter. The lock holder executes the shipped critical section on the waiter's behalf using a light-weight context switch. Using transparent delegation, we design a family of locking protocols (TCLocks), which require zero modification to applications' logic. The evaluation shows that TCLocks provide up to 5.2x performance improvement compared with recent locking algorithms.

RON: One-Way Circular Shortest Routing to Achieve Efficient and Bounded-waiting Spinlocks

Shiwu Lo, Han-Ting Lin, Yaohong Xie, and Lin Zhaoting, National Chung-Cheng University; Yu-Hsueh Fang, National Cheng Kung University; Jingshen Lin, National Chung-Cheng University; Jim Huang, National Cheng Kung University; Kam Yiu Lam, City University of Hong Kong; Yuan-Hao Chang, Academia Sinica

As the number of cores increases, the efficiency of accessing shared variables through the lock-unlock method decreases. Additionally, the more processor cores there are, the longer the farthest transmission distance is. However, a non-uniform memory access (NUMA)-aware algorithm only considers the transmission delay between processors, so it may not be able to fully utilize the connection network of a multi-core processor. Thus, the large amount of low- and variable-cost data sharing between cores limits the scalability of a multi-core processor. The difficulty of this problem is that the reduction in communication cost cannot compensate for the increase in the time complexity of the spinlocks.

We propose a method called Routing on Network-on-chip (RON) to minimize communication cost between cores by using a routing table. In RON, we pre-calculate a global optimized locking-unlocking order for a thread to enter the critical section. Based on the locking-unlocking order, RON delivers locks and data in a one-way circular manner among cores to achieve (1) minimized global data movement cost and (2) bounded waiting time, because there are more threads waiting for the lock on the to-be-visited cores than those on the recently-visited cores and each core will be visited within a bounded waiting time. We use microbenchmarks for quantitative analysis and use multi-core benchmarks to understand the performance under various workloads.

In terms of user space performance (by implementing the algorithms in user space library), compared with ShflLock and C-BO-MCS, RON increases the performance of Google LevelDB by 22.1% and 24.2%, respectively. In terms of kernel space performance, compared with using ShflLock, RON in the Linux kernel can improve the performance of Google LevelDB by 1.8 times. On dealing with the problem of oversubscription (i.e., more threads than cores), RON-plock can solve this problem in constant space complexity and its performance is 3.7 times and 18.9 times better than ShflLock-B and C-BO-MCS-B, respectively.

Userspace Bypass: Accelerating Syscall-intensive Applications

Zhe Zhou, Yanxiang Bi, Junpeng Wan, and Yangfan Zhou, Fudan University; Zhou Li, University of California, Irvine

Context switching between kernel mode and user mode often causes prominent overhead, which slows down applications with frequent system calls (or syscalls), like applications with high I/O. The overhead is further amplified by security mechanisms like KPTI. To accelerate such applications, efforts have been made to remove syscalls from the I/O paths, mainly by combining drivers and apps in the same space, or batching syscalls. Nonetheless, such solutions require developers to refactor their applications, or even update hardware, which impedes their broad adoption.

In this paper, we propose another approach, userspace bypass, to accelerate syscall-intensive applications, by transparently moving userspace instructions into kernel. Userspace bypass requires no modification to userspace binaries or code, achieving full binary compatibility. Specifically, to avoid overhead caused by frequent syscalls, kernel identifies the short userspace execution path between consecutive system calls, and then converts the instructions in the path into code blocks with SFI guarantee. Every time kernel completes a syscall, it does not return to userspace but run the code block to emulate returning to userspace, which avoids the context switching overhead. According to our evaluation, small size I/O can be accelerated by up to 92%, intact Redis RPS can be improved by up to 11% and Nginx RPS increases by 8.5%. Our further analysis on the results show that the performance boost would be reduced if KPTI is turned off, but acceleration ratios are similar between the virtualized and physical environment.

Triangulating Python Performance Issues with Spyke

Emery Berger, Sam Stern, and Juan Altmayer Pizzorno, University of Massachusetts Amherst

This paper presents Spyke, a profiler specialized for Python. Spyke combines a suite of innovations to precisely and simultaneously profile CPU, memory, and GPU usage, all with low overhead. Spyke's CPU and memory profilers help Python programmers direct their optimization efforts by distinguishing between inefficient Python and efficient native execution time and memory usage. Spyke's memory profiler employs a novel sampling algorithm that lets it operate with low overhead yet high precision. It also incorporates a novel algorithm that automatically pinpoints memory leaks, whether within Python or across the Python-native boundary. Finally, Spyke introduces a novel metric called copy volume, which highlights costly copying operations that can occur when Python silently converts between C and Python data representations, or between CPU and GPU. Since its introduction, Spyke has been widely adopted in the Python community, with over 500,000 downloads to date. We present experience reports from external users who used Spyke to achieve significant performance improvements and memory savings.

Relational Debugging --- Pinpointing Root Causes of Performance Problems

Xiang Ren, Sitao Wang, Zhuqi Jin, David Lion, and Adrian Chu, University of Toronto; Tianyin Xu, University of Illinois at Urbana-Champaign; Ding Yuan, University of Toronto

Performance debugging is notoriously elusive—real-world performance problems are rarely clear-cut failures, but manifested through accumulation of fine-grained symptoms. Oftentimes, it is challenging to determine performance anomalies— absolute measures are unreliable, as system performance is inherently relative to workloads. Existing techniques focus on identifying absolute predicates that deviate between executions, which limits their application to performance problems.

This paper introduces relational debugging, a new technique that automatically pinpoints root causes of performance problems. The core idea is to capture and reason out relations between fine-grained runtime events. We show that relations provide immense utilities to explain performance anomalies and locate root causes. Relational debugging is highly effective with a minimal two executions (a good and a bad run), eliminating the pain point of producing and labeling many different executions required by traditional techniques.

We realize relational debugging by developing a practical tool named Perspect. Perspect directly operates on x86 binaries to accommodate real-world diagnosis scenarios. We evaluate Perspect on twelve challenging performance issues with various symptoms in Go runtime, MongoDB, Redis, and Coreutils. Perspect accurately located (or exclude) root causes of these issues. In particular, we used Perspect to diagnose two open bugs, where developers failed to find root causes—the root causes reported by Perspect was confirmed by developers. A controlled user study shows that Perspect can speedup the debugging times by at least 10.87 times.

12:10 pm–1:40 pm

Symposium Luncheon

1:40 pm–3:05 pm

Secure Your Bits I

Accountable authentication with privacy protection: The Larch system for universal login

Emma Dauterman, UC Berkeley; Danny Lin, Woodinville High School; Henry Corrigan-Gibbs, MIT CSAIL; David Mazieres, Stanford University

Credential compromise is currently hard to detect and hard to recover from. To address this problem, we present larch, an accountable authentication framework with strong security and privacy properties. Larch provides strong user privacy while ensuring that every authentication is correctly recorded by the larch log server. Specifically, an attacker that compromises a user’s device cannot authenticate without creating evidence in the log, and the log cannot learn which web service (relying party) the user is authenticating to. To enable fast adoption, larch is backwards-compatible with relying parties that support FIDO2, TOTP, and password-based login. Furthermore, larch does not degrade the security and privacy a user already expects: the log server cannot authenticate on behalf of a user, and larch does not allow relying parties to link a user across accounts. We implement larch for FIDO2, TOTP, and password-based login. Given a client with 4 cores and a log server with 8 cores, an authentication with larch takes 143ms for FIDO2, 76ms for TOTP, and 75ms for passwords (excluding preprocessing, which takes 1.27s for TOTP).

Privacy-Compliant Web Applications by Construction

Kinan Dak Albab, Ishan Sharma, Justus Adam, Benjamin Kilimnik, Aaron Jeyaraj, Raj Paul, Artem Agvanian, Leonhard Spiegelberg, and Malte Schwarzkopf, Brown University

Data privacy laws like the EU’s GDPR grant users new rights to their data, such as the right to request access and deletion. Manual compliance with these requests is error-prone, and imposes costly burdens, especially on smaller organizations, as non-compliance risks steep fines.

Pelton is a new, MySQL-compatible database that complies with privacy laws by construction. The key idea is to make the data ownership and sharing semantics explicit in the storage system. This requires Pelton to capture and enforce applications’ complex data ownership and sharing semantics, but in exchange simplifies privacy compliance. Using a small set of schema annotations, Pelton infers storage organization, procedures for data retrieval and deletion, and reports compliance errors if an application risks violating the GDPR.

We built a prototype of Pelton and evaluated its expressivity and performance. Pelton successfully expresses the data sharing semantics of real web applications, and guides developers to getting privacy compliance right. Pelton also matches or exceeds the performance of existing storage systems, at the cost of a modest increase in state size.

Encrypted Databases Made Secure Yet Maintainable

Mingyu Li, Xuyang Zhao, and Le Chen, Shanghai Jiao Tong University; Cheng Tan, Northeastern University; Huorong Li and Sheng Wang, Alibaba Group; Zeyu Mi and Yubin Xia, Shanghai Jiao Tong University; Feifei Li, Alibaba Group; Haibo Chen, Shanghai Jiao Tong University

State-of-the-art encrypted databases can be categorized into two types: one shields the whole DBMS engine in a trusted domain (Type-I), and the other uses a protected user-defined function (UDF) plugin (Type-II). However, the former lacks the essential ability of database administrator (DBA) maintainability, while the latter fails to guarantee the security of UDF’s interfaces. In particular, we devise smuggle attack that can infer Type-II’s secrets efficiently and stealthily.

We present HEDB, which prevents the smuggle attack yet retains maintainability. HEDB proposes a dual-mode EDB design based on our study of DBA maintenance tasks. The execution mode serves user query by isolating DBAs from UDF to resist smuggle attack, while the maintenance mode provides authenticated replay for DBMS maintenance and anonymized replay for privacy-preserving UDF troubleshooting. Evaluation shows that HEDB prevents smuggle attack effectively, and supports common maintenance tasks with 5.88% runtime cost and 9.26% storage cost.

LVMT: An Efficient Authenticated Storage for Blockchain

Chenxing Li, Shanghai Tree-Graph Blockchain Research Institute; Sidi Mohamed Beillahi, University of Toronto; Guang Yang and Ming Wu, Shanghai Tree-Graph Blockchain Research Institute; Wei Xu, Tsinghua University; Fan Long, University of Toronto

Accessing the storage is the performance bottleneck of a blockchain, because each access can be amplified to potentially O(log n) disk I/O operations in the standard Merkle Patricia Trie (MPT) storage structure. In this paper, we propose a multi-Layer Versioned Multipoint Trie (LVMT), a novel high-performance blockchain storage with significantly reduced I/O amplifications. LVMT uses the authenticated multipoint evaluation tree (AMT) vector commitment protocol to update commitment proofs in constant time. LVMT adopts a multi-layer design to support unlimited key-value pairs and stores version numbers instead of value hash to avoid costly elliptic curve multiplication operations. Our experiments show that read and write operations are 7x faster in LVMT than in an MPT. Our experiments also show that LVMT increases the execution throughput of a blockchain system by up to 3.1x.

Honeycomb: An Secure, Efficient GPU Execution Environment with Minimal TCB

Haohui Mai, unaffiliated; Jiacheng Zhao, Institute of Computing Technology, Chinese Academy of Sciences; Christos Kozyrakis, Stanford University; Mingyu Gao, Tsinghua University; Hongren Zheng, Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; Quanxi Li, Institute of Computing Technology, Chinese Academy of Sciences; Zibin Liu and Cong Wang, unaffiliated; Huimin Cui, ICTCAS; Xiaobing Feng

GPUs have been deployed as the most ubiquitous accelerators for production applications like machine learning and autonomous driving in modern cloud computing systems. The security features of GPUs in such a public and shared environment becomes vitally important when the processed data are sensitive and expose user privacy. Recent advances in GPU TEEs are promising but fail to address the practical concerns of production applications on real-world legacy GPUs. We present Honeycomb, a software-based, secure, and efficient TEE for GPU computations. Honeycomb adopts commodity CPU TEE, a security monitor, and validated executions to enforce security isolation across different applications. Instead of admitting arbitrary, untrusted applications in conventional TEEs, Honeycomb only admits validated executions into the system. A validated execution consists of the actual GPU kernel in binary code and a corresponding validation proof to show that the runtime behaviors of the kernel confined to the security policy of Honeycomb. Honeycomb enables two TEE applications to securely exchange clear-text data using shared device memory on GPU. We have implemented Honeycomb on top of an AMD 6900XT GPU. Our evaluation shows that Honeycomb incurs a performance overhead for 8% for the inference application of ResNet 18 neural network model.

3:05 pm–3:35 pm

Break with Refreshments

3:35 pm–5:00 pm

Secure Your Bits II

An Extensible Orchestration and Protection Framework for Confidential Cloud Computing

Adil Ahmad and Alex Schultz, Arizona State University; Pedro Fonseca, Purdue University; Byoungyoung Lee, Seoul National University

Confidential computing solutions are crucial to address the cloud privacy concerns. Although SGX has witnessed significant adoption in the cloud, the reliance on hardware implementation is restrictive for cloud providers in terms of orchestrating deployments and providing stronger security to their clients’ enclaves. eOPF addresses this limitation by providing a comprehensive, secure hypervisor-level instrumentation framework with the ability to monitor all enclave-OS interactions and implement protected services. eOPF overcomes several challenges including bridging the semantic gap between the hypervisor and SGX and attesting the co-location of the framework with enclaves. Using eOPF, we implement two protected services that provide platform resource orchestration and complementary enclave side-channel defense. Our evaluation shows that eOPF incurs very low performance overhead (<2%) in its default state and only a modest overhead (geometric mean of 17% on SPEC) when strong, complementary side-channel defenses are enabled, making eOPF an efficient and practical solution for the cloud.

Nimble: Rollback Protection for Confidential Cloud Services

Sebastian Angel, Microsoft Research; Aditya Basu, Penn State University; Weidong Cui, Microsoft Research; Trent Jaeger, Penn State University; Stella Lau, MIT CSAIL; Srinath Setty, Microsoft Research; Sudheesh Singanamalla, University of Washington

This paper introduces Nimble, a cloud service that helps applications running in trusted execution environments (TEEs) to detect rollback attacks (i.e., detect whether a data item retrieved from persistent storage is the latest version). To achieve this, Nimble realizes an append-only ledger service by employing a simple state machine running in a TEE in conjunction with a crash fault-tolerant storage service. Nimble then replicates this trusted state machine to ensure the system is available even if a minority of state machines crash. A salient aspect of Nimble is a new reconfiguration protocol that allows a cloud provider to replace the set of nodes running the trusted state machine whenever it wishes—without affecting safety. We have formally verified Nimble’s core protocol in Dafny, and have implemented Nimble such that its trusted state machine runs in multiple TEE platforms (Intel SGX and AMD SNP-SEV). Our results show that a deployment of Nimble on machines running in different availability zones can achieve from tens of thousands of requests/sec with an end-to-end latency of under 3.2 ms (based on an in-memory key-value store) to several thousands of requests/sec with a latency of 30ms (based on Azure Table).

Kerveros: Efficient and Scalable Cloud Admission Control

Sultan Mahmud Sajal, The Pennsylvania State University; Luke Marshall and Beibin Li, Microsoft Research; Shandan Zhou and Abhisek Pan, Microsoft; Konstantina Mellou and Deepak Narayanan, Microsoft Research; Timothy Zhu, The Pennsylvania State University; David Dion, Microsoft Azure; Thomas Moscibroda, Microsoft; Ishai Menache, Microsoft Research

The infinite capacity of cloud computing is an illusion: in reality, cloud providers cannot always have enough capacity of the right type, in the right place, at the right time to meet all demand. Consequently, cloud providers need to implement admission-control policies to ensure high availability to already accepted capacity requests. Admission control in the public cloud is hard because supply and demand change dynamically: hardware can become unavailable for many reasons, and actual VM consumption may vary due to diverse factors such as tenant scale-outs and fulfillment of VM reservations that are made by customers ahead of time. In this paper, we design and implement Kerveros -- a flexible admission-control system that has three desired properties: i) high computational scalability to handle a large inventory, ii) good packing efficiency to optimize resource usage, and iii) providing sufficient capacity for high VM availability. To achieve this, Kerveros uses novel bookkeeping techniques to quickly determine the available capacity for incoming VM requests. Our system has been deployed in PCloud, a large public cloud provider. Results from both simulations and production confirm that Kerveros achieves more than four nines of availability while sustaining request processing latencies of a few milliseconds.

Security and Performance in the Delegated User-level Virtualization

Jiahao Chen, Dingji Li, Zeyu Mi, Yuxuan Liu, Binyu Zang, Haibing Guan, and Haibo Chen, Shanghai Jiao Tong University

Today’s mainstream virtualization systems suffer from severe security threats due to the large attack surface exposed by the in-kernel hypervisor components such as KVM. To this end, this paper proposes a new design called delegated virtualization, which decouples the commodity hypervisor into two planes: the hypervisor plane for hypervisor control (which is usually small and with fixed logic) and the VM plane for handling virtual machine (VM) requests and exceptions at runtime. As our investigation shows that all known hypervisor vulnerabilities that threaten the host kernel lie in the VM plane, delegated virtualization completely offloads the in-kernel VM plane to a user-space hypervisor called DuVisor that directly interacts with its VM without exiting to the host kernel, based on a small hardware extension (481 lines of Chisel). We have implemented the hardware extension on an open-source RISC-V CPU on FireSim and built a Rust-based DuVisor atop the hardware. The evaluation results show that DuVisor significantly reduces the attack surface with negligible performance overhead (< 5%).

Core slicing: closing the gap between leaky confidential VMs and bare-metal cloud

Ziqiao Zhou, Microsoft Research; Yizhou Shan, Huawei Cloud; Weidong Cui, Microsoft Research; Xinyang Ge, Netflix; Marcus Peinado and Andrew Baumann, Microsoft Research

Virtual machines are the basis of resource isolation in today's public clouds, yet the security risks of entrusting that isolation to a cloud provider's hypervisor are substantial. Such concerns have motivated the design of hardware extensions for "confidential VMs" that seek to remove the hypervisor from the trusted computing base by adding a highly-privileged firmware layer that checks hypervisor actions, and supports memory encryption and remote attestation. However, the hypervisor retains control of resource management and observes associated actions of the guest including nested page table faults and CPU scheduling, and thus confidential VMs remain vulnerable to an ever-changing variety of hypervisor-level side channel attacks. Bare-metal cloud servers avoid such leaks, but remain a niche due to the high cost of dedicated hardware.

We observe that typical cloud VMs run with a static allocation of memory and discrete cores, and increasingly rely on I/O offload, thus negating the apparent need for a hypervisor and the fragile hypervisor/guest isolation boundary. Our design, core slicing, enables multiple untrusted guest OSes to run on shared bare-metal hardware. To ensure isolation without the complexity of virtualization, we propose simple hardware extensions that restrict guests to a static slice of a machine's cores, memory and virtual I/O devices, and delegate resource allocation to a dedicated management slice. We demonstrate practicality and evaluate performance with prototypes for RISC-V and x86.

5:30 pm–7:00 pm

OSDI '23 Poster Session and Reception

Sponsored by Amazon

Would you like to share a provocative opinion, interesting preliminary work, or a cool idea that will spark discussion at this year's OSDI? The poster session is the perfect venue to introduce such new or ongoing work. Poster presenters will have the opportunity to discuss their work, get exposure, and receive feedback from other attendees during the in-person evening reception. View the Call for Posters and submit a proposal by June 7.

Tuesday, July 11

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:10 am

Expanding, Hardening, and Deploying Your Bits

ExoFlow: A Universal Workflow System for Exactly-Once DAGs

Siyuan Zhuang, UC Berkeley; Stephanie Wang, UC Berkeley, Anyscale; Eric Liang and Yi Cheng, Anyscale; Ion Stoica, UC Berkeley

Given the fundamental tradeoff between run-time and recovery performance, current distributed systems often build application-specific recovery strategies to minimize overheads. However, it is increasingly common for different applications to be composed into heterogeneous pipelines. Implementing multiple interoperable recovery techniques in the same system is rare and difficult. Thus, today's users must choose between: (1) building on a single system, and face a fixed choice of performance vs. recovery overheads, or (2) the challenging task of stitching together multiple systems that can offer application-specific tradeoffs.

We present ExoFlow, a universal workflow system that enables a flexible choice of recovery vs. performance tradeoffs, even within the same application. The key insight behind our solution is to decouple execution from recovery and provide exactly-once semantics as a separate layer from execution. For generality, workflow tasks can return references that capture arbitrary inter-task communication. To enable the workflow system and therefore the end user to take control of recovery, we design task annotations that specify execution semantics such as nondeterminism. ExoFlow generalizes recovery for existing workflow applications ranging from ETL pipelines to stateful serverless workflows, while enabling further optimizations in task communication and recovery.

Hyrax: Fail-in-Place Operation in Cloud Platforms

Jialun Lyu, University of Toronto; Daniel S. Berger, Microsoft Azure and UW; Marisa You, Microsoft Azure; Celine Irvene, Microsoft Research; Mark Jung, Tyler Narmore, and Jacob Shapiro, Microsoft Azure; Luke Marshall, Microsoft Research; Savyasachi Samal, Microsoft Azure; Ioannis Manousakis, unaffiliated; Ashish Raniwala and Ricardo Bianchini, Microsoft Azure; Bianca Schroeder, University of Toronto

Today's cloud platforms handle server hardware failures by shutting down the affected server and only turning it back online once it has been repaired by a technician. At cloud scale, this all-or-nothing operating model is becoming increasingly unsustainable. This model is also at odds with technology trends, such as the need for new cooling technology.

This paper introduces Hyrax, a modified datacenter stack that enables compute servers with failed components to continue hosting VMs while hiding the underlying degraded capacity and performance. A key enabler of Hyrax is a novel model of changes in memory interleaving when deactivating faulty memory modules. Experiments on cloud production servers show that Hyrax overcomes common hardware failures without impacting peak VM performance. In large-scale simulations with production traces, Hyrax reduces server repair requirements by over 50% without impacting VM scheduling.

Strictly Serializable Timestamp Ordering by Avoiding the Timestamp-Inversion Pitfall

Haonan Lu, University at Buffalo; Shuai Mu, Stony Brook University; Siddhartha Sen, Microsoft Research; Wyatt Lloyd, Princeton University

Strictly serializable systems provide strong consistency guarantees, which simplify the development of applications that process critical data. However, existing strictly serializable protocols are unnecessarily expensive when transactions arrive in an order that is naturally consistent, a scenario prevalent in many real-world workloads. To exploit this natural arrival order, we leverage timestamp ordering, a lightweight technique originally designed for weaker consistency. However, when using timestamp ordering to enforce strict serializability, we identify a subtle correctness pitfall that arises, which we call timestamp-inversion. We find that timestamp-inversion affects several well-known prior works, and as a result they do not provide strict serializability as claimed.

We present Strictly Serializable Timestamp Ordering (SSTO), the first design that enables low overhead strict serializability by avoiding the timestamp-inversion pitfall. SSTO achieves low overhead—i.e., one-round latency, lock-free, and no waiting—for many transactions, by employing a set of techniques including: a client-side safeguard, asynchrony-aware timestamps, server-side timestamp refinement, smart retry, and a special protocol for read-only transactions. Our evaluation shows that SSTO outperforms state-of-the-art solutions by an order of magnitude on many workloads.

Conveyor: One-Tool-Fits-All Software Deployment at ComX

Boris Grubic, Meta; Yang Wang, Meta, The Ohio State University; Tyler Petrochko, Ran Yaniv, Brad Jones, David Callies, Matt Clarke-Lauer, and Dan Kelley, Meta; Soteris Demetriou, Meta, Imperial College London; Kenny Yu and Chunqiang Tang, Meta

This paper presents the design, implementation, usage scenarios and statistics of Conveyor, ComX’s universal software deployment tool. Conveyor has been in production since 2015 and performs over 100,000 deployments per week across more than 10,000 services and millions of machines. While continuous delivery is a common industry practice, no systematic characterization of deployment tools’ usage in production at scale exists today. We fill that gap by analyzing real-world use cases and highlighting advanced features necessary for a deployment tool to effectively support all services in the fleet.We further suggest enhancements to cluster management systems and monitoring tools in relation to the deployment tool. Finally, our analysis of tens of thousands of deployment pipelines reveals that 99% use either push-on-green (86%) or fully automated deployments on a fixed schedule (13%) without manual approval. This indicates that engineers trust deployment automation. Despite the removal of manual approval, only 0.12% of releases were reverted by humans, indicating that automated testing and health checks during deployments result in highly reliable releases.

10:10 am–10:45 am

Break with Refreshments

10:45 am–12:10 pm

Query Your Bits

Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases

Tamer Eldeeb and Xincheng Xie, Columbia University; Philip A. Bernstein, Microsoft Research; Asaf Cidon and Junfeng Yang, Columbia University

Designers of on-disk database systems that support distributed transactions typically face a difficult choice. They can either use an expensive commit protocol like two-phase commit (2PC) to guarantee atomicity, and suffer from \textit{slow} distributed transactions, or forgo 2PC, which leads to either weaker semantics, limitations to the programming model, or constrained scalability, making the system less \textit{general}. We argue this compromise is no longer necessary within modern datacenters. Low latency 2PC (∼150 μs on Azure for 2PC-over-Paxos) can be achieved by using fast datacenter RPCs and placing the relatively small transaction logs on low latency storage, and careful protocol design. With fast 2PC, the data contention bottleneck for many transactions shifts from the commit protocol to actually reading the data from the relatively slow storage while holding transaction locks.

We present Chardonnay, a scalable, on-disk, multi-versioned transactional key-value store optimized for single datacenter deployments with fast 2PC. Chardonnay has a \textit{general} interface supporting range and point reads, and writes within multi-step strictly-serializable ACID transactions. The key idea that underlies Chardonnay's design is to support externally consistent snapshot reads on commodity hardware, using a novel lock-free read protocol. Chardonnay uses this protocol to cheaply determine the read-write sets of queries, enabling Chardonnay to transparently prefetch data needed for a transaction prior to the transaction execution and lock acquisition. This enables Chardonnay to achieve \textit{fast} transactions by minimizing contention, and to avoid aborts due to deadlocks by ordering its lock requests.

ScaleDB: A Scalable, Asynchronous In-Memory Database

Syed Akbar Mehdi, The University of Texas at Austin / Google; Deukyeon Hwang and Simon Peter, University of Washington; Lorenzo Alvisi, Cornell University

ScaleDB is a serializable in-memory transactional database that achieves excellent scalability on multi-core machines by asynchronously updating range indexes. We find that asynchronous range index updates can significantly improve database scalability by applying updates in batches, reducing contention on critical sections. To avoid stale reads, ScaleDB uses small hash indexlets to hold delayed updates. We use indexlets to design ACC, an asynchronous concurrency control protocol providing serializability. With ACC, it is possible to delay range index updates without adverse performance effects on transaction execution in the common case.

ACC delivers scalable serializable isolation for transactions, with high throughput and low abort rate. Evaluation on a dual-socket server with 36 cores shows that ScaleDB achieves 9.5× better query throughput than Peloton on the YCSB benchmark and 1.8× better transaction throughput than Cicada on the TPC-C benchmark.

VBase: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity

Qianxi Zhang, Shuotao Xu, Qi Chen, and Guoxin Sui, Microsoft Research; Jiadong Xie, East China Normal University; Zhizhen Cai and Yaoqi Chen, University of Science and Technology of China; Yinxuan He, Renmin University of China; Yuqing Yang, Fan Yang, Mao Yang, and Lidong Zhou, Microsoft Research

Approximate similarity queries on high-dimensional vector indices have become the cornerstone for many critical online services. An increasing need for more sophisticated vector queries requires integrating vector search systems with relational databases. However, high-dimensional vector indices do not exhibit monotonicity, a critical property of conventional indices. The lack of monotonicity forces existing vector systems to rely on tentative monotonicity-preserving indices, set up temporarily for a target vector's TopK nearest neighbors, to facilitate queries. This leads to suboptimal performance due to the difficulty to prophesy the optimal $K$.

This paper presents VBase, a system that efficiently supports complex queries of both approximate similarity search and relational operators. VBase identifies a common foundation, \emph{relaxed monotonicity}, to unify two seemingly incompatible systems. This common foundation allows VBase to circumvent the constraints of a TopK-only interface to achieve significantly higher efficiency, while provably preserving the semantics of TopK-based solutions. Evaluation results show VBase offers up to three orders-of-magnitude higher performance than state-of-the-art vector systems on complex online vector queries. VBase further enables analytical similarity queries that previous vector systems do not, and shows 7,000$\times$ speedup with 99.9\% accuracy of exact queries.

Detecting Transactional Bugs in Database Engines via Graph-Based Oracle Construction

Zu-Ming Jiang and Si Liu, ETH Zurich; Manuel Rigger, National University of Singapore; Zhendong Su, ETH Zurich

Transactions are an important feature of database management systems (DBMSs), as they provide guarantees for a sequence of statements (i.e., the ACID guarantees). Consequently, approaches have been proposed to automatically find transactional bugs in DBMSs. However, they cannot handle complex operations and predicates common in real-world database queries, and thus miss bugs. This paper introduces a general, effective technique for finding transactional bugs in DBMSs that supports complex operations and predicates. At the conceptual level, we address the test-oracle problem by constructing semantically equivalent test cases based on fine-grained statement-level dependencies in transactions. At the technical level, we introduce (1) statement-dependency graphs to describe dependencies among statements in transactions, (2) SQL-level instrumentation to capture possible statement-level dependencies, and (3) transactional oracle construction to generate semantically equivalent test cases using statement-dependency graphs. We also establish the correctness of our approach in generating semantically equivalent test cases. We have realized our technique as a tool, TxCheck, and evaluated it on three widely-used and well-tested DBMSs, namely TiDB, MySQL, and MariaDB. In total, TxCheck found 56 unique bugs, 52 of which have been confirmed and 13 already fixed. We believe that TxCheck can help solidify DBMSs’ support for transactions thanks to its generality and effectiveness.

Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate

Audrey Cheng, David Chu, Terrance Li, and Jason Chan, UC Berkeley; Xiangyao Yu, University of Wisconsin - Madison; Joseph M. Hellerstein, Natacha Crooks, and Ion Stoica, UC Berkeley

Most caching policies focus on increasing object hit rate to improve overall system performance. However, these algorithms are insufficient for transactions. In this work, we define a new metric, transactional hit rate, to capture when caching reduces latency for transactions. We present DeToX, a caching system that leverages transactional dependencies to make eviction and prefetching decisions. DeToX is able to significantly outperform single-object alternatives on real-world workloads and popular OLTP benchmarks, providing up to a 130% increase in transaction hit rate and 3.4x improvement in cache efficiency.

12:10 pm–1:40 pm

Symposium Luncheon

1:40 pm–2:50 pm

Store Your Bits

Replicating Persistent Memory Key-Value Stores with Efficient RDMA Abstraction

Qing Wang, Youyou Lu, Jing Wang, and Jiwu Shu, Tsinghua University

Combining persistent memory (PM) with RDMA is a promising approach to performant replicated distributed key-value stores (KVSs). However, existing replication approaches do not work well when applied to PM KVSs: 1) Using RPC induces software queueing and execution at backups, increasing request latency; 2) Using one-sided RDMA WRITE causes numerous streams of small PM writes, leading to severe device-level write amplification (DLWA) on PM.

In this paper, we propose Rowan, an efficient RDMA abstraction to handle replication writes in PM KVSs; it aggregates massive concurrent remote writes from different servers, and lands these writes to PM in a sequential (thus low DLWA) and one-sided (thus low latency) manner. We realize Rowan with off-the-shelf RDMA NICs. Further, we build Rowan-KV, a log-structured PM KVS using Rowan for replication. Evaluation shows that under write-intensive workloads, compared with PM KVSs using RPC and RDMA WRITE for replication, Rowan-KV boosts throughput by 1.22× and 1.39× as well as lowers median PUT latency by 1.77× and 2.11×, respectively, while largely eliminating DLWA.

eZNS: An Elastic Zoned Namespace for Commodity ZNS SSDs

Jaehong Min and Chenxingyu Zhao, University of Washington; Ming Liu, University of Wisconsin-Madison; Arvind Krishnamurthy, University of Washington

Emerging Zoned Namespace (ZNS) SSDs, providing the coarse-grained zone abstraction, hold the potential to significantly boost the cost-efficiency of future storage infrastructure and mitigate performance unpredictability. However, existing ZNS SSDs have a static zoned interface, making them in-adaptable to workload runtime behavior, unscalable to underlying hardware capabilities, and interfering with co-located zones. Applications either under-provision the zone resources yielding unsatisfied throughput, create over-provisioned zones that squander the cost, or experience unexpected I/O latencies.

We propose eZNS, an elastic zoned namespace interface that exposes an identical zone with predictable characteristics. eZNS comprises two major components: a zone arbiter that manages zone allocation and active resources on the control-plane, a hierarchical I/O scheduler with read congestion control and write admission control on the data-plane. Together, eZNS enables the transparent use of a ZNS SSD and closes the gap between application requirements and zone interface properties. Our evaluations over RocksDB demonstrate that eZNS outperforms a static zoned interface by 84.5% and 52.1% in throughput and tail latency, respectively, at most.

SEPH: Scalable, Efficient, and Predictable Hashing on Persistent Memory

Chao Wang, Junliang HU, and Ming-Chang Yang, The Chinese University of Hong Kong; Tsun-Yu Yang, Computer Science & Engineering Department of The Chinese University of Hong Kong; Yuhong Liang, The Chinese University of Hong Kong

With the merits of high density, non-volatility, and DRAM-scale latency/bandwidth, persistent memory (PM) brings hope to high-performance storage systems, in which hashing-based index structures receive great attention owing to its efficient query performance. Though lots of efforts have been made to rethink the hashing schemes for PM in recent years, nevertheless, based on our investigation, none of them can hit performance scalability, efficiency and predictability with one stone, seriously limiting their practicability to time-sensitive or latency-critical applications. To this end, this paper presents SEPH, a Scalable, Efficient, and Predictable Hashing for PM. SEPH paves a new direction to build the hash table by introducing the novel Level Segment (LS) structure, a key to break the dilemma between efficiency and predictability standing in front of the existing hashing schemes for PM. With the LS based hash table structure, SEPH enables a low-overhead split to greatly suppress the resizing-incurred unpredictability, and develops a novel semi lock-free concurrency control that requires a nearly-minimal amount of writes to handle an item insertion for achieving ever-higher efficiency and scalability while ensuring the correctness and crash consistency. Compared to state-of-the-art hashing schemes, SEPH demonstrates higher efficiency (up to 15.4× higher throughput), better scalability (performance scales up to 48 threads), and more reliable predictability (improving the tail latency by up to 19.3×).

Fast RDMA-accelerated Remote Fork for Serverless Computing

Xingda Wei, Fangming Lu, Tianxia Wang, Jinyu Gu, Yuhan Yang, Rong Chen, and Haibo Chen, Shanghai Jiao Tong University

Serverless platforms usually run functions in containers, facing (1) the trade-off between container startup time and re- source provisioning, and (2) the performance cost of state transfer between containers. We present MITOSIS, an operating system primitive that provides fast remote fork to address the above issues. MITOSIS is the first to fork over 10,000 new containers from one instance across multiple machines within a second, while the new containers efficiently inherit the pre-materialized states of the forked one. The key enabler is a deep integration of OS kernel designs with modern hard- ware features, which imitates local fork with RDMA’s remote memory read capability. We have implemented MITOSIS on Linux and integrated it with FN, a popular serverless com- puting framework. Under sudden load spikes in real-world serverless workloads, MITOSIS reduces the function tail la- tency by 89% with orders of magnitude lower memory usage. It also improves the real-world serverless workflow requiring state transfer by 86%.

2:50 pm–3:20 pm

Break with Refreshments

3:20 pm–4:30 pm

Manage Your Bits I

Johnny Cache: the End of DRAM Cache Conflicts (in Tiered Main Memory Systems)

Baptiste Lepers, Université de Neuchâtel; Willy Zwaenepoel, University of Sydney

We demonstrate that hardware management of a tiered memory system offers better performance for many applications than current methods of software management. Hardware management treats the fast memory as a cache on slower memory. The advantages are that caching can be done at cache line granularity and that data appears in fast memory as soon as it is accessed. The potential for cache conflicts has, however, led previous works to conclude these hardware methods generally perform poorly.

In this paper, we show that the use of low-overhead conflict avoidance techniques eliminates conflicts almost entirely and thereby address the above limitation. We explore two techniques. One technique simply tries to avoid conflicts between pages at page allocation time. The other technique relies on sampling memory accesses to avoid specifically conflicts between hot pages, at page allocation time and by dynamic remapping in the case of conflicts between hot pages missed at allocation time.

We have implemented these techniques in the Linux kernel on an Intel Optane machine in a system called JC. We use HPC applications, key-value stores and databases to compare JC to the default Linux tiered memory management implementation and to a state-of-the-art software management approach, as exemplified by the HeMem system.

Our measurements show that JC outperforms Linux and HeMem by up to 5x. A surprising conclusion of this paper is that a cache can be made to perform close-to-optimally by minimizing conflicts at page allocation time, without any monitoring of hot pages or dynamic page remapping.

TailCheck: A Lightweight Heap Overflow Detection Mechanism with Page Protection and Tagged Pointers

Amogha Udupa Shankaranarayana Gopal, Raveendra Soori, Michael Ferdman, and Dongyoon Lee, Stony Brook University

A heap overflow vulnerability occurs when a program written in an unmanaged language such as C or C++ accesses a memory location beyond an object allocation boundary. Malicious users may exploit this vulnerability to corrupt an adjacent object in memory, creating an entry point for a security attack. Despite decades of research, unfortunately, it still remains challenging to detect heap overflow vulnerabilities in real-world programs at a low cost.

We present TAILCHECK, a new lightweight heap overflow detection scheme that leverages page protection and pointer tagging. When an object is created, TAILCHECK allocates an additional page-protected shadow object, called a TailObject, placing the distance from the object to its TailObject as a tag stored in the unused high-order bits of the object pointer. For every access to the original object, TAILCHECK performs an additional memory access to the TailObject, whose address is computed using the tag. Heap overflows are detected as page faults when an access occurs beyond the TailObject. We evaluated TAILCHECK with four server applications (apache, nginx, memached, redis) and the SPEC CPU2017 and SPEC CPU2006 benchmarks, successfully finding heap overflows in SPEC CPU2017 gcc. TAILCHECK experiences 4% and 2% run-time overhead for the average and tail (99%) latencies for server applications; and only 33% and 29% run-time overhead for SPEC CPU2017 and SPEC CPU2006, respectively, less than the state-of-the-art solution.

SMART: A High-Performance Apative Radix Tree for Disaggregated Memory

Xuchuan Luo, Fudan University; Pengfei Zuo, Huawei Cloud; Jiacheng Shen and Jiazhen Gu, The Chinese University of Hong Kong; Xin Wang, Fudan University; Michael Lyu, The Chinese University of Hong Kong; Yangfan Zhou, Fudan University

Disaggregated memory (DM) is an increasingly prevalent architecture in academia and industry with high resource utilization. It separates computing and memory resources into two pools and interconnects them with fast networks. Existing range indexes on DM are based on B+ trees, which suffer from large inherent read and write amplifications. The read and write amplifications rapidly saturate the network bandwidth, resulting in low request throughput and high access latency of B+ trees on DM.
In this paper, we propose to use the radix tree, which is more suitable for DM than the B+ tree due to smaller read and write amplifications. However, constructing a radix tree on DM is challenging due to the costly lock-based concurrency control, the bounded memory-side IOPS, and the complicated computing-side cache validation. To address these challenges, we design \textbf{SMART}, the first radix tree for disaggregated memory with high performance. Specifically, we leverage 1) a \textit{hybrid concurrency control} scheme including lock-free internal nodes and fine-grained lock-based leaf nodes to reduce lock overhead, 2) a computing-side \textit{read-delegation and write-combining} technique to break through the IOPS upper bound by reducing redundant I/Os, and 3) a simple yet effective \textit{reverse check} mechanism for computing-side cache validation. Experimental results show that SMART achieves $6.1\times$ higher throughput under typical write-intensive workloads and $2.8\times$ higher throughput under read-only workloads, compared with state-of-the-art B+ trees on DM.

ORC: Increasing Cloud Memory Density via Object Reuse with Capabilities

Vasily A. Sartakov, Lluís Vilanova, and Munir Geden, Imperial College London; David Eyers, University of Otago; Takahiro Shinagawa, The University of Tokyo; Peter Pietzuch, Imperial College London

Cloud environments host many tenants, and typically there is substantial overlap between the application binaries and libraries executed by different tenants. Thus, memory de-duplication will increase memory density beneficially by allocating memory for shared binaries only once. Existing de-duplication approaches, however, either rely on a shared OS to de-deduplicate binary objects, which provides unacceptably weak isolation; or exploit hypervisor-based de-duplication at the level of memory pages, blinded to the semantics of the objects to be shared.

We describe \emph{Object Reuse with Capabilities~(ORC)}, which supports the fine-grained sharing of binary objects between tenants, while isolating tenants strongly through a small trusted computing base~(TCB). ORC uses hardware support for memory capabilities to isolate tenants, which permits shared objects to be accessible to multiple tenants safely. Since ORC shares binary objects within a single address space through capabilities, it uses a new relocation type to create per-tenant state using thread-local storage when loading shared objects. ORC supports the loading of objects by an untrusted guest, outside of its TCB, and it must only verify the safety of the loaded data. Our experiments show that, compared to hypervisor-based de-deduplication, ORC achieves a higher memory density with lower performance overhead.

4:30 pm–4:45 pm

Short Break

4:45 pm–5:55 pm

Manage Your Bits II

Global Capacity Management With Flux

Marius Eriksen, Kaushik Veeraraghavan, Yusuf Abdulghani, Andrew Birchall, Po-Yen Chou, Adela Kabijlo, Ranjith Kumar S, Maroo Lieuw, Justin Meza, Scott Michelson, Thomas Rohloff, Hayley Russell, Jeff Qin, and Chunqiang Tang, Meta

Customers of both private and public cloud providers must wrestle with the problem of regionalization: how should service capacity be apportioned across a large number of geo-distributed datacenter regions? This problem is further complicated by the complex service dependency graphs that arise from microservice architectures, as well as capacity availability and hardware mix that can vary greatly by region.

Historically, regionalization has been solved through a slow-moving and manual process, whereby owners of large services directly negotiate capacity allocation and distribution with the cloud provider. However, as both service and cloud footprints continue to grow, these manual processes are becoming untenable, and tend to produce both a great amount of toil for everyone involved, as well as suboptimal results.

At OrgX we have built a system, Flux, to automate capacity regionalization, moving it from a bottoms-up, manual process, to a top-down, automated one. Flux employs RPC tracing to identify service capacity models, and uses these to compute an optimal joint capacity and traffic distribution plan that spans 1000s of services across 10s of products, and involves millions of servers. These plans are orchestrated by a system that safely and efficiently rebalances service capacity and product traffic across 10s of regions on a continuous basis.

Gratuit: Preventing Overload with Graceful Feature Degradation

Justin Meza, Thote Gowda, Ahmed Eid, Tomiwa Ijaware, Dmitry Chernyshev, Yi Yu, Nazim Uddin, Chad Nachiappan, Sari Tran, Shuyang Shi, Tina Luo, Ke Hong, Sankaralingam Panneerselvam, Hans Ragas, Svetlin Manavski, Weidong Wang, and Francois Richard, Meta Platforms, Inc.

Every day, billions of people depend on Internet services for communication, commerce, and entertainment. Like a utility, such as water or electricity, people expect Internet services to remain highly available. Yet planetary-scale data center infrastructures consisting of millions of servers experience unplanned capacity outages and unexpected demand for resources; how can such infrastructures remain reliable in the face of capacity and workload flux?

In this paper, we introduce Gratuit, a system for improving the availability of large-scale, globally- distributed Internet services using graceful feature degradation. In response to overload conditions, Gratuit enables site operators to gradually disable less-critical features in order to reduce capacity demand. Gratuit presents a common interface to product developers to define feature knobs that represent degradation capabilities. Gratuit automatically tests knobs to understand each knobs’ product- and infrastructure-level trade-offs. At CloudCo, we have used Gratuit to improve global product availability in the face of worldwide demand-surges in addition to large-scale infrastructure failures.

Cilantro: A Framework for Performance-Aware Resource Allocation for General Objectives via Online Feedback

Romil Bhardwaj, UC Berkeley; Kirthevasan Kandasamy, University of Wisconsin-Madison; Asim Biswal, Wenshuo Guo, Benjamin Hindman, Joseph Gonzalez, Michael Jordan, and Ion Stoica, UC Berkeley

Traditional systems for allocating finite cluster resources among competing jobs have either aimed at providing fairness, relied on users to specify their resource requirements, or have estimated these requirements via surrogate metrics (e.g. CPU utilization). These approaches do not account for a job’s real world performance (e.g. P95 latency). Existing performance-aware systems use offline profiled data and/or are designed for specific allocation objectives. In this work, we argue that resource allocation systems should directly account for real-world performance and the varied allocation objectives of users. In this pursuit, we build Cilantro.

At the core of Cilantro is an online learning mechanism which forms feedback loops with the jobs to estimate the resource to performance mappings and load shifts. This relieves users from the onerous task of job profiling and collects reliable real-time feedback. This is then used to achieve a variety of user-specified scheduling objectives. Cilantro handles the uncertainty in the learned models by adapting the underlying policy to work with confidence bounds. We demonstrate this in two settings. First, in a multi-tenant 1000-CPU cluster with 20 independent jobs, three of Cilantro’s policies outperform 9 other baselines on three different performance-aware scheduling objectives, improving user utilities by up to 1.2 − 3.7x. Second, in a microservices setting, where 160 CPUs must be distributed between 19 inter-dependent microservices, Cilantro outperforms 3 other baselines, reducing the end-to-end P99 latency to x0.57 the next best baseline.

Karma: Resource Allocation for Dynamic Demands

Midhul Vuppalapati, Giannis Fikioris, and Rachit Agarwal, Cornell University; Asaf Cidon, Columbia University; Anurag Khandelwal, Yale University; Eva Tardos, Cornell University

The classical max-min fairness algorithm for resource allocation provides many desirable properties, e.g., Pareto efficiency, strategy-proofness and fairness. This paper builds upon the observation that max-min fairness guarantees these properties under a strong assumption---user demands being static over time---and that, for the realistic case of dynamic user demands, max-min fairness loses one or more of these properties.

We present Karma, a generalization of max-min fairness for dynamic user demands. The key insight in Karma is to introduce "memory" into max-min fairness --- when allocating resources, Karma takes users' past allocations into account: in each quantum, users donate their unused resources and are assigned credits when other users borrow these resources; Karma carefully orchestrates exchange of credits across users (based on their instantaneous demands, donated resources and borrowed resources), and performs prioritized resource allocation based on users' credits. We prove theoretically that Karma guarantees Pareto efficiency, online strategy-proofness, and optimal fairness for dynamic user demands (without future knowledge of user demands). Empirical evaluations over production workloads show that these properties translate well into practice: Karma is able to reduce disparity in performance across users to a bare minimum while maintaining Pareto-optimal system-wide performance.

6:00 pm–7:30 pm

USENIX ATC '23 Poster Session and Reception

The USENIX ATC '23 poster session and reception will feature posters by authors presenting their work in person at the conference. USENIX ATC '23 authors can find out how to present their work on the Instructions for Presenters webpage.

Wednesday, July 12

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:10 am

Train Your Bits I

Beta: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Zhuohan Li and Lianmin Zheng, UC Berkeley; Yinmin Zhong, Peking University; Vincent Liu, University of Pennsylvania; Ying Sheng, Stanford University; Xin Jin, Peking University; Yanping Huang and Zhifeng Chen, Google; Hao Zhang, Joseph E. Gonzalez, and Ion Stoica, UC Berkeley

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, Beta, that determines an efficient strategy for placing and parallelizing collections of large machine-learning models across a distributed cluster. Evaluation results on production workloads show that Beta can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints for more than 99% of requests.

Grinder: Analysis and Optimization for Dynamic Control Flow in Deep Learning

Chen Zhang, Tsinghua University; Lingxiao Ma and Jilong Xue, Microsoft Research; Yining Shi, Peking University & Microsoft Research; Ziming Miao, Microsoft; Fan Yang, Microsoft Research Asia; Jidong Zhai, Tsinghua University; Zhi Yang, Peking University; Mao Yang, Microsoft Research

As deep neural networks (DNNs) become increasingly complex, it is often necessary to write DNN programs with complex control flow logic (e.g., loop, branch, and recursion). Efficiently executing such DNN computations on hardware accelerators is a challenge. Existing DNN frameworks typically execute the control flow on the host and accelerate the rest of the computation on hardware such as GPUs. This often introduces significant synchronization overhead between the host and the accelerator and can prevent global optimization across control flow scopes.

To address this challenge, we propose Grinder, a new DNN compiler that co-optimizes the execution of control flow and data flow on hardware accelerators. Grinder provides the uTask abstraction to unify the representation of DNN models, including both control flow and data flow. This allows Grinder to expose a holistic scheduling space for rescheduling control flow to the lower-level hardware parallelism. Grinder uses a heuristic policy to find efficient schedules and is able to automatically move control flows into device kernels, enabling optimization across control flow boundaries. Evaluation shows Grinder can accelerate DNN models with control flows by up to 8.2x over the fastest one of the state-of-the-art DNN frameworks and compilers.

Welder: Scheduling Deep Learning Memory Access via Tile-graph

Yining Shi, Peking University & Microsoft Research; Zhi Yang, Peking University; Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou, Microsoft Research

With the growing demand for processing higher fidelity data and the use of faster computing cores in newer hardware accelerators, modern deep neural networks (DNNs) are becoming increasingly memory intensive. A disparity between underutilized computing cores and saturated memory bandwidth has been observed in various popular DNN models. This inefficiency is caused by both the conventional treatment of DNNs as compute-intensive workloads and the lack of holistic memory access optimization in DNN models.

In this paper, we introduce Welder, a deep learning compiler that optimizes the execution efficiency from a holistic memory access perspective. The core of Welder is tile-graph, an abstraction that facilitates fine-grained data management at tile level. By leveraging the observation of optimization independence across memory layers, Welder is able to decompose the whole combinatorial DNN optimization space into several independent ones and effectively trade off between intra- and inter-operator data reuse using a tile traffic-based cost model. This allows Welder to unify previous ad-hoc memory optimizations into a single space, generate efficient execution plans with 74 more optimization patterns, and outperform state-of-the-art solutions significantly. Welder is also able to handle DNN models with arbitrarily large input by combining the existing accelerator memory and host memory as a whole system.

Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators

Jie Zhao, State Key Laboratory of Mathematical Engineering and Advanced Computing; Lianmin Zheng, UC Berkeley; Siyuan Feng, Shanghai Jiao Tong University; Chen Tian, ACM; Xiaoqiang Dan, Fei Liu, Chengke Wang, Sheng Yuan, Wenyuan Lv, and Qikai Xie, Stream Computing Inc.

Fully exploiting the computing power of an accelerator specialized for deep neural networks (DNNs) calls for the synergy between network and hardware architectures, but existing approaches partition a computational graph of DNN into multiple sub-graphs by abstracting away hardware architecture and assign resources to each sub-graph, not only producing redundant off-core data movements but also under-utilizing the hardware resources of a domain-specific architecture (DSA).

This paper introduces a systematic approach for effectively scheduling DNN computational graphs on DSA platforms. By fully taking into account hardware architecture when partitioning a computational graph into coarse-grained sub-graphs, our work enables the synergy between network and hardware architectures, addressing several challenges of prior work: (1) it produces larger but fewer kernels, converting a large number of off-core data movements into on-core data exchanges; (2) it exploits the imbalanced memory usage distribution across DNN network architecture, better saturating the DSA memory hierarchy; (3) it enables across-layer instruction scheduling not studied before, further exploiting the parallelism across different specialized compute units.

Results of seven DNN inference models on a DSA platform show that our work outperforms TVM and AStitch by 11.15x and 6.16x, respectively, and obtains throughput competitive to the vendor-crafted implementation. A case study on GPU also demonstrates that generating kernels for our sub-graphs can surpass CUTLASS with and without convolution fusion by 1.06x and 1.23x, respectively.

10:10 am–10:45 am

Break with Refreshments

10:45 am–12:10 pm

Train Your Bits II

EinNet: Optimizing Tensor Programs with Derivation-Based Transformations

Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, and Shuhong Huang, Tsinghua University; Xupeng Miao, Carnegie Mellon University; Shizhi Tang and Kezhao Huang, Tsinghua University; Zhihao Jia, Carnegie Mellon University

Boosting the execution performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world applications. However, existing approaches to optimizing the tensor computation of a DNN only consider transformations representable by a fixed set of predefined tensor operators, resulting a restricted optimized exploration space. To address this, we propose EINNET, the first derivation-based tensor program optimizer. EINNET optimizes tensor programs by leveraging transformations between general tensor algebra expressions and automatically creating new operators desired by transformations, enabling a significantly larger search space that includes those supported by prior works. Evaluation on seven DNNs shows that EINNET outperforms existing optimizers by up to 2.72× (1.52× on average) on A100 and up to 2.68× (1.55× on average) on V100, respectively.

Hydro: Surrogate-Based Hyperparameter Tuning Service in the Datacenter

Qinghao Hu, Tianwei Zhang, Yonggang Wen, and Meng Zhang, Nanyang Technological University; Peng Sun, SenseTime; Zhisheng Ye, Peking University; Qiaoling Chen, National University of Singapore

Hyperparameter tuning is an essential step in deep learning model development that provides better model performance at the cost of substantial resources. While existing systems can improve tuning efficiency, they still fail to handle large models with billions of parameters and efficiently leverage cluster resources. Motivated by these deficiencies, we introduce Hydro, a surrogate-based hyperparameter tuning service that optimizes tuning workloads in both the job-level and cluster-level granularities. Specifically, it consists of two key components: (1) Hydro Tuner automatically generates and optimizes surrogate models via scaling, parametrization and fusion; (2) Hydro Coordinator improves tuning efficiency and cluster-wide resource utilization by adaptively leveraging ephemeral and heterogeneous resources. Our comprehensive experiments on two tuning algorithms across six models show that Hydro Tuner can dramatically reduce tuning makespan by up to 78.5x and no reduction in tuning quality. Moreover, Hydro Coordinator accelerates the tuning workload by 3.2x, without sacrificing the training throughput of large models.

Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms

Yuke Wang, Boyuan Feng, and Zheng Wang, University of California, Santa Barbara; Tong Geng, Kevin Barker, and Ang Li, Pacific Northwest National Laboratory; Yufei Ding, University of California, Santa Barbara

The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN systems optimize the computation and communication individually based on the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule/optimize the computation and communication operations for high-performance delivery.

To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel fine-grained dynamic software pipeline to facilitate fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve the GNN execution performance. Extensive evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.55x, 5.10x, and 12.75x faster than DGL, MGG-UVM, and ROC frameworks, respectively. MGG is implemented with ~9.9K LoC and is open-sourced at https://anonymous.4open.science/r/MGG-Artifact-7ECF.

Optimizing Dynamic Neural Networks with Brainstorm

Weihao Cui, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research; Lingji Ouyang, USTC; Yichuan Wang, Shanghai Jiao Tong University; Ningxin Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, and Jilong Xue, Microsoft Research; Lili Qiu, UT Austin, MSR Asia Shanghai; Lidong Zhou, Microsoft Research; Quan Chen, Shanghai Jiao Tong University; Haisheng Tan, University of Science and Technology of China; Minyi Guo, Shanghai Jiao Tong University

Dynamic neural networks (NNs), which can adapt sparsely activated sub-networks to inputs during inference, have shown significant advantages over static ones in terms of accuracy, computational efficiency, and adaptiveness. However, existing deep learning frameworks and compilers mainly focus on optimizing static NNs with deterministic execution, missing optimization opportunities brought by non-uniform distribution of activation in dynamic NNs. The key to optimizing dynamic NNs is the traceability of how data are dynamically dispatched to different paths at inference. Such dynamism often happens at sub-tensor level (e.g., conditional dispatching tokens of a tensor), thus hard for existing tensor-centric frameworks to trace due to misaligned expression granularity.

In this paper, we present Brainstorm, a deep learning framework for optimizing dynamic NNs, which bridges the gap by unifying how dynamism should be expressed. Brainstorm proposes (1) Cell, the key data abstraction that lets model developers express the data granularity where dynamism exists, and (2) Router, a unified interface to let model developers express how Cells should be dynamically dispatched. Brainstorm handles efficient execution of routing actions. This design allows Brainstorm to collect profiles of fine-grained dataflow at the correct granularity. The traceability further opens up a new space of dynamic optimization for dynamic NNs to specialize their execution to the runtime dynamism distribution. Extensive evaluations show Brainstorm brings up to 11.1× speedup or leads to 42% less memory consumption for popular dynamic neural networks with the proposed dynamic optimizations.

AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models

Fan Lai, University of Michigan; Wei Zhang, Rui Liu, William Tsai, Xiaohan Wei, Yuxi Hu, Sabin Devkota, Jianyu Huang, Jongsoo Park, Xing Liu, Zeliang Chen, Ellie Wen, Paul Rivera, Jie You, and Jason Chen, Meta Inc.; Mosharaf Chowdhury, University of Michigan

Deep learning recommendation models (DLRMs) are using increasingly larger embedding tables to represent categori- cal sparse features. Each embedding row of the table repre- sents the trainable weight vector of a specific feature instance. While more embedding rows typically enable better model accuracy by considering more feature instances, they lead to large deployment cost and slow model execution.

Unlike existing efforts that primarily focus on optimizing DLRMs for the given embedding, we present a complementary system, AdaEmbed, to reduce the size of embeddings needed for the same DLRM accuracy via in-training embed- ding pruning. Our key insight is that the data distribution of embedding instances and model (embedding) weights vary across embedding rows and over time, implying varying em- bedding importance to model accuracy. However, identifying important embeddings and then enforcing pruning for modern DLRMs with up to billions of embeddings (terabytes) is challenging. Given the total embedding size, AdaEmbed considers embeddings with higher runtime access frequency and larger training gradients to be more important, and it adaptively determines per-feature embedding size to dynamically prune less important embeddings at scale. Our evaluations in industrial settings show that AdaEmbed saves 35-60% embedding size needed in deployment and improves model execution speed by 11-34%, while achieving noticeable accuracy gains.

12:10 pm–1:40 pm

Lunch (on your own)

1:40 pm–3:05 pm

Verify Your Bits

BWoS: Formally Verified Block-based Work Stealing for Parallel Processing

Jiawei Wang, Huawei Dresden Research Center, Huawei Central Software Institute, Technische Universität Dresden; Bohdan Trach, Ming Fu, Diogo Behrens, Jonathan Schwender, Yutao Liu, and Jitang Lei, Huawei Dresden Research Center, Huawei Central Software Institute; Viktor Vafeiadis, MPI-SWS; Hermann Härtig, Technische Universität Dresden; Haibo Chen, Huawei Central Software Institute, Shanghai Jiao Tong University

Work stealing is a widely-used scheduling technique for parallel processing on multicore. Each core owns a queue of tasks and avoids idling by stealing tasks from other queues. Prior work mostly focuses on balancing workload among cores, disregarding whether stealing may adversely impact the owner's performance or hinder synchronization optimizations. Real-world industrial runtimes for parallel processing heavily rely on work-stealing queues for scalability, and such queues can become bottlenecks to their performance.

We present Block-based Work Stealing (BWoS), a novel and pragmatic design that splits per-core queues into multiple blocks. Thieves and owners rarely operate on the same blocks, greatly removing interferences and enabling aggressive optimizations on the owner's synchronization with thieves. Furthermore, BWoS enables a novel probabilistic stealing policy that guarantees thieves steal from longer queues with higher probability. In our evaluation, using BWoS improves performance by up to 1.25x in the Renaissance macrobenchmark when applied to Java GC, provides average 1.26x speedup in JSON processing when applied to Go runtime, and improves maximum throughput of Hyper HTTP server by 1.12x when applied to Rust Tokio runtime. In microbenchmarks, it provides 8-11x better performance than state-of-the-art designs. We have formally verified and optimized BWoS on weak memory models with a model-checking-based framework.

AutoV: Scaling Machine-Checkable Verification for Large System Software

Xupeng Li, Xuheng Li, Wei Qiang, Ronghui Gu, and Jason Nieh, Columbia University

System software is often large and complex, resulting in many vulnerabilities that can potentially be exploited to compromise the security of a system. Formal verification offers a potential solution to creating bug-free software, but a key impediment to its adoption remains proof cost. We present AutoV, a highly automated verification framework to construct machine-checkable proofs in Coq for system software with much less proof cost. AutoV uses a novel program structure reconstruction technique to leverage LLVM to translate C code directly into a Coq representation, supporting full C semantics, including C macros, inline assembly, and compiler directives, so that complex systems code no longer has to be manually modified to be verified. AutoV uses a layering structure to modularize and decompose verification into smaller steps, making it possible to automatically generate intermediate layer Coq specifications and refinement proofs to simplify proving the functional correctness of systems code. We use AutoV to verify the functional correctness of a KVM hypervisor implementation. Verification using AutoV required roughly 70% less proof effort than the manually-written specifications and proofs to verify an older implementation. Furthermore, the proofs using AutoV hold for the unmodified implementation that is directly compiled and executed.

Verifying vMVCC, a high-performance database using multi-version concurrency control

Yun-Sheng Chang, MIT CSAIL; Ralf Jung, ETH Zürich; Upamanyu Sharma, Massachusetts Institute of Technology; Joseph Tassarotti, New York University; Frans Kaashoek, MIT; Nickolai Zeldovich, MIT CSAIL

Multi-version concurrency control (MVCC) is a widely used, sophisticated approach for handling concurrent transactions in a database system. vMVCC is the first MVCC-based database that comes with a machine-checked proof of correctness, providing clients with a guarantee that it will correctly handle all transactions despite a sophisticated design and implementation that might otherwise be error-prone. vMVCC is implemented in Go, as a single-node in-memory database, and uses several optimizations, such as RDTSC-based timestamps, to achieve high performance (30--71% the throughput of Silo, a state-of-the-art in-memory database, for YCSB and TPC-C workloads). Formally specifying and verifying vMVCC required adopting sophisticated proof techniques, such as prophecy variables and logical atomicity, owing to the fact that MVCC transactions can linearize at timestamps prior to transaction execution.

Automated Verification of Idempotence for Stateful Serverless Applications

Haoran Ding, Zhaoguo Wang, Zhuohao Shen, Rong Chen and Haibo Chen, Shanghai Jiao Tong University

Serverless computing has become a popular cloud computing paradigm. By default, when a serverless function fails, the serverless platform re-executes the function to tolerate the failure. However, such a retry-based approach requires functions to be idempotent, i.e., the function should expose the same behavior regardless of retry. This requirement is challenging for developers, especially when functions are stateful—failures may cause functions to repeatedly read and update shared states, potentially corrupting system consistency.

This paper presents Flux, the first toolkit that automatically verifies the idempotence of serverless applications. Flux is built on a new correctness definition, idempotence consistency, which stipulates that serverless function retry is transparent by users. To verify idempotence consistency, Flux defines a novel property—idempotence simulation, which decomposes the proof for a concurrent serverless application into the reasoning of individual functions. Furthermore, Flux extends existing verification techniques to realize automated reasoning, enabling Flux to identify idempotence-violating operations and fix them with existing log-based methods.

We demonstrate the efficacy of Flux with 27 representative serverless applications. Flux successfully identifies previously unknown issues in 12 applications, 8 of which have been confirmed by developers. Compared to state-of-the-art systems (namely Beldi and Boki) that log every operation, Flux achieves up to 6× lower latency and up to 10× higher peak throughput, as it logs only the identified idempotence-violating ones.

Sharding the State Machine: Automated Modular Reasoning for Complex Concurrent Systems

Travis Hance and Yi Zhou, Carnegie Mellon University; Andrea Lattuada, ETH Zurich; Reto Achermann, University of British Columbia; Alex Conway, VMware Research; Ryan Stutsman, University of Utah; Gerd Zellweger, VMware Research; Chris Hawblitzel, Microsoft Research; Jon Howell, VMware Research; Bryan Parno, Carnegie Mellon University

We present Seagull, an automated verification framework for concurrent code with shared memory. Seagull scales to complex systems by splitting system-wide proofs into isolated concerns such that each can be substantially automated. As a starting point, Seagull’s ownership type system allows a developer to straightforwardly prove both data safety and the logical correctness of thread-local operations. Seagull then introduces the concept of a Localized Transition System, which connects the correctness of local actions to the correctness of the entire system. We demonstrate Seagull by verifying two state-of-the-art concurrent systems comprising thousands of lines: a library for black-box replication on NUMA architectures, and a highly concurrent page cache.

3:05 pm–3:35 pm

Break with Refreshments

3:35 pm–5:00 pm

Transfer Your Bits

Flor: An Open High Performance RDMA Framework over Heterogeneous RNICs

Qiang Li, Alibaba Group; Yixiao Gao, Nanjing Univeristy; Xiaoliang Wang, Nanjing University; Haonan Qiu, Alibaba Group; Yanfang Le, Intel Corp; Qiao Xiang, Xiamen University; Derui Liu, Fei Feng, Peng Zhang, Bo Li, Jianbo Dong, Lingbo Tang, Hongqiang Harry Liu, Shaozong Liu, Weijie Li, Rui Miao, Yaohui Wu, Zhiwu Wu, Chao Han, Lei Yan, Zheng Cao, and Zhongjie Wu, Alibaba Group; Chen Tian and Guihai Chen, Nanjing University; Dennis Cai, Alibaba Group; Jinbo Wu, Alibaba Cloud; Jiaji Zhu and Jiesheng Wu, Alibaba Group; Jiwu Shu, Xiamen University

Datacenter applications have been increasingly applying RDMA for the ultra-low latency and low CPU overhead. However, RDMA-capable NICs (RNICs) of different vendors and different generations from the same vendors do not cooperate well, which causes bandwidth imbalance in the production network. Our observation of the heterogeneous RNICs is that though the data path functions of these RNICs follow the same RoCEv2 specifications, their control path functions are vendor and version specific. To this end, we propose Flor, an open framework that provides a flexible control plane in software and a unified hardware plane by adopting heterogeneous RNICs. The hardware plane requires no changes of current specifications. The software plane can run in NPU of RNICs, DPUs and host CPUs, following which we build up strengthen reliable transport over the large-scale lossy Ethernet. We implemented and evaluated Flor in both testbed and production clusters over Intel E180, Mellanox CX-4 and CX-5 and Broadcom RNICs. Experiments show that Flor achieves comparable performance to vanilla RDMA in many scenarios including 1/4096 packet loss, 6000:1 incast, and large-scale cross-pod communication. Flor mitigates the performance gap of CX-4 and CX-5 RNICs from 24.3% to 1.3% when they are deployed together without PFC dependency.

ShRing: Networking with Shared Receive Rings

Boris Pismenny, Technion and NVIDIA; Adam Morrison, Tel Aviv University; Dan Tsafrir, Technion & VMware Research

Multicore systems employ parallelism to accommodate high incoming Ethernet traffic, allocating one receive (Rx) ring with 1Ki entries per core, by default. This ring's size is sufficient to absorb packet bursts of single-core workloads. But the aggregate memory size of all Rx buffers (pointed to by all Rx rings) can exceed the LLC size. We observe that such an aggregate size might incur nonnegligible overheads when scaling to hundreds of incoming Gbps, as memory accesses to Rx buffers (by either NIC or CPU) are increasingly served by main memory. Rx buffer accesses likewise compete for cache capacity with concurrent accesses unrelated to networking.

To alleviate this problem, we propose "shRing," which shares a single Rx ring among several cores when networking memory bandwidth consumption is high. Helped by minimal firmware changes, we implement shRing on a real NIC. We find that synchronization costs associated with shRing's sharing are offset by its smaller memory footprint, which increases the throughput of NFV workloads by up to 1.27x. ShRing's smaller footprint may additionally prevent overload conditions, allowing the system to process packets faster than they arrive and thus reduce the latency by up to 38x.

ServiceRouter: a Scalable and Minimal Cost Service Mesh

Harshit Saokar, Meta; Soteris Demetriou, Meta and Imperial College London; Nick Magerko, Max Kontorovich, Josh Kirstein, and Margot Leibold, Meta; Dimitrios Skarlatos, Carnegie Mellon University; Hitesh Khandelwal and Chunqiang Tang, Meta

Datacenter applications are often structured as many interconnected microservices, and service mesh has become a popular approach to route RPC traffic among services. This paper presents perhaps one of the world's largest service meshes called ServiceRouter (SR), which has been in production since 2012. SR differs from publicly known service meshes and load-balancing solutions in several important ways. First, SR is designed from the ground up for hyperscale, currently employing O(10^6) of L7 routers to route O(10^10) of requests per second across O(10^4) services. Second, SR's novel use of an embedded routing library, as opposed to the common approach of using sidecar or remote proxies, reduces the hardware cost of our hyperscale service mesh by O(10^5) machines. Third, SR provides built-in support for sharded services, which account for 92% of the total RPC requests in our fleet, whereas existing general-purpose service meshes do not support sharded services. Finally, SR introduces the concept of locality rings to simultaneously minimize RPC latency and balance load across geo-distributed regions, which to our knowledge has not been attempted before.

Characterizing Off-path SmartNIC for Accelerating Distributed Applications

Xingda Wei, Rongxin Chen, Yuhan Yang, Rong Chen, and Haibo Chen, Shanghai Jiao Tong University

SmartNIC has recently emerged as an attractive device to accelerate distributed systems. However, there has been no comprehensive characterization of SmartNIC, and prior designs usually utilize only a single path for workload offloading. This paper presents the first comprehensive study of off-path SmartNIC. Our experimental study uncovers the key performance characteristics of the communication among the client, SmartNIC SoC, and the host. To this end, this paper presents an optimization guideline that can be used by designers to smartly exploit multiple communication paths of SmartNICs, which exposes new optimization spaces for distributed systems. Our case studies using a state-of-the-art SmartNIC-based distributed file system (i.e., LineFS) and an RDMA-based disaggregated key/value store (i.e., DrTM-KV) show that following our study and guideline can improve LineFS by up to 58% and DrTM-KV by 25%.

Ensō: A Streaming Interface for NIC-Application Communication

Hugo Sadok and Nirav Atre, Carnegie Mellon University; Zhipeng Zhao, Microsoft; Daniel S. Berger, Microsoft Research & University of Washington; James C. Hoe, Carnegie Mellon University; Aurojit Panda, New York University; Justine Sherry, Carnegie Mellon University; Ren Wang, Intel Labs

Today, most communication between NICs and software requires exchanging fixed sized packet-buffers. This packetized interface was designed for an era when NICs implemented few offloads, and software implemented the logic for translating between application data and packets. However, both NICs and networked software have evolved: modern NICs implement hardware offloads, e.g., TSO, LRO, and serialization offloads that can more efficiently translate between application data and packets. Furthemore, modern software increasingly batches network I/O to reduce overheads. These changes have led to a mismatch between the packetized interface, which assumes that the NIC and software exchange fixed sized buffers, and the features provided by modern NICs and used by modern software. This incongruence between interface and data adds software complexity and I/O overheads, which in turn limit communication performance.

This paper proposes, Ensō, a new streaming NIC to software interface designed to better support how NICs and software interact today. At its core, Ensō eschews fixed size buffers, and instead structures communication as a stream that can be used to send arbitrary data sizes. We show that this change reduces software overheads, reduces PCIe bandwidth requirements, and leads to fewer cache misses. These improvements allow an Ensō based NIC to saturate a 100 Gbps link with minimum-sized packets (forwarding at 148.8 Mpps) using a single core, and improves throughput for high-performance network applications by 1.5-5x.

5:00 pm–5:10 pm

Closing Remarks

Roxana Geambasu, Columbia University, and Ed Nightingale, Apple