All the times listed below are in Eastern Daylight Time (EDT). All sessions will be held in the Grand Ballroom Complex unless otherwise noted.
Papers and full proceedings are available for download below to registered attendees now and will be available to everyone beginning Monday, July 10. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Monday, July 10
8:00 am–9:00 am
Grand-Liberty Foyer/Constitution Foyer
9:00 am–10:00 am
OSDI '23 and USENIX ATC '23 Joint Keynote Address
Ion Stoica, University of California, Berkeley
Technology ecosystems often undergo significant transformations as they mature. For example, telephony, the Internet, and PCs all started with a single provider, but each is now served by a competitive market that uses comprehensive technology standards to provide compatibility. In this talk, I will present our view on how the cloud ecosystem, only fifteen years old, could evolve as it matures, and discuss our early results and experience in this endeavor.
Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley, where he holds the Xu Bao Chancellor's Chair and is leading Sky Computing Lab. He is currently doing research on cloud computing and AI systems. Current and past work includes Ray, Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow, Honorary Member of the Romanian Academy of Sciences, and has received numerous awards, including the Mark Weiser Award (2019), SIGOPS Hall of Fame Award (2015), and several "Test of Time" awards. He also co-founded several companies, including Anyscale (2019), Databricks (2013), and Conviva (2006).
10:00 am–10:30 am
Break with Refreshments
Grand-Liberty Foyer/Constitution Foyer
10:30 am–10:45 am
Opening Remarks and Awards
Roxana Geambasu, Columbia University, and Ed Nightingale, Apple
10:45 am–12:10 pm
Make Your Bits Go Faster
Session Chair: Malte Schwarzkopf, Brown University
Vishal Gupta, EPFL; Kumar Kartikeya Dwivedi, SRMIST; Yugesh Kothari, Yueyang Pan, Diyu Zhou, and Sanidhya Kashyap, EPFL
Today's high-performance applications heavily rely on various synchronization mechanisms, such as locks. While locks ensure mutual exclusion of shared data, their design impacts application scalability. Locks, as used in practice, move the lock-guarded shared data to the core holding it, which leads to shared data transfer among cores. This design adds unavoidable critical path latency leading to performance scalability issues. Meanwhile, some locks avoid this shared data movement by localizing the access to shared data on one core, and shipping the critical section to that specific core. However, such locks require modifying applications to explicitly package the critical section, which makes it virtually infeasible for complicated applications with large code bases, such as the Linux kernel.
We propose transparent delegation, in which a waiter automatically encodes its critical section information on its stack and notifies the combiner (lock holder). The combiner executes the shipped critical section on the waiter's behalf using a lightweight context switch. Using transparent delegation, we design a family of locking protocols, called TCLocks, that requires zero modification to applications' logic. The evaluation shows that TCLocks provide up to 5.2x performance improvement compared with recent locking algorithms.
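For readers unfamiliar with delegation-style locking, the sketch below illustrates the general combining idea the abstract builds on: a waiter publishes its critical section, and the current lock holder (combiner) executes it on the waiter's behalf. This is a minimal user-space Python illustration under simplified assumptions; it is not the paper's kernel-level, stack-based transparent delegation, and all names are hypothetical.

```python
# Minimal sketch of a delegation/combining lock (illustrative only; TCLocks do
# this transparently in the kernel with lightweight context switches).
import queue
import threading

class CombiningLock:
    def __init__(self):
        self._requests = queue.Queue()      # waiters publish critical sections
        self._combiner = threading.Lock()   # whoever holds this is the combiner

    def run_critical_section(self, critical_section):
        done = threading.Event()
        result = {}
        self._requests.put((critical_section, result, done))
        while not done.is_set():
            if self._combiner.acquire(blocking=False):
                # This thread becomes the combiner and executes queued critical
                # sections in order, so shared data stays on one core instead of
                # bouncing between waiters.
                try:
                    while True:
                        try:
                            cs, res, ev = self._requests.get_nowait()
                        except queue.Empty:
                            break
                        res["value"] = cs()
                        ev.set()
                finally:
                    self._combiner.release()
            else:
                done.wait(timeout=0.001)    # another thread is combining for us
        return result["value"]

# Usage: many threads increment one shared counter via shipped critical sections.
counter = 0
lock = CombiningLock()

def increment():
    global counter
    counter += 1
    return counter

threads = [threading.Thread(target=lock.run_critical_section, args=(increment,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 8
```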
Shiwu Lo, Han-Ting Lin, Yao-Hung Hsieh, and Chao-Ting Lin, National Chung Cheng University; Yu-Hsueh Fang, National Cheng Kung University; Ching-Shen Lin, National Chung Cheng University; Ching-Chun (Jim) Huang, National Cheng Kung University; Kam Yiu Lam, City University of Hong Kong; Yuan-Hao Chang, Academia Sinica, Taiwan
As the number of processor cores increases, the efficiency of accessing shared variables through the lock-unlock method decreases. A NUMA-aware algorithm, which only considers the transmission delay between processors, may not fully utilize the connection network of a multi-core processor. This limits the scalability of a multi-core processor due to the large amount of low- and variable-cost data sharing between cores. The problem is that the reduction in communication cost cannot compensate for the increase in the time complexity of the spinlocks, and the farthest transmission distance becomes longer with more cores.
We propose a method called Routing on Network-on-chip (RON) to minimize the communication cost between cores by using a routing table and pre-calculating an optimized locking-unlocking order. RON delivers locks and data in a one-way circular manner among cores to (1) minimize global data movement cost and (2) achieve bounded waiting time. Microbenchmarks provide quantitative analysis, while multi-core benchmarks show performance under various workloads.
Zhe Zhou, Yanxiang Bi, Junpeng Wan, and Yangfan Zhou, Fudan University; Zhou Li, University of California, Irvine
Context switching between kernel mode and user mode often causes prominent overhead, which slows down applications with frequent system calls (or syscalls), e.g., those with high I/O demand. The overhead is further amplified by security mechanisms like Linux kernel page-table isolation (KPTI). To accelerate such applications, many efforts have been put in removing syscalls from the I/O paths, mainly by combining drivers and applications in the same space or batching syscalls. Nonetheless, such solutions require developers to refactor their applications or even update hardware, which impedes their broad adoption.
In this paper, we propose another approach, userspace bypass (UB), to accelerate syscall-intensive applications by transparently moving userspace instructions into the kernel. Userspace bypass requires no modification to userspace binaries or code and achieves full binary compatibility. Specifically, to avoid the overhead caused by frequent syscalls, the kernel identifies the short userspace execution path between consecutive system calls and converts the instructions in the path into code blocks with Software-Based Fault Isolation (SFI) guarantees. According to our evaluation, I/O micro-benchmarks can be accelerated by 30.3–88.3%, and Redis GET Requests Per Second (RPS) can be improved by 4.4–10.8% for 1 B–4 KiB data sizes, when the application is executed in a virtualized setting with KPTI turned on. The performance boost is reduced when KPTI is turned off.
Emery D. Berger, Sam Stern, and Juan Altmayer Pizzorno, University of Massachusetts Amherst
Awarded Best Paper!
This paper proposes the Scalene Python profiler. Scalene precisely and simultaneously profiles CPU, memory, and GPU usage, all with low overhead. Scalene's CPU and memory profilers help Python programmers direct their optimization efforts by distinguishing between inefficient Python and efficient native execution time and memory usage. Scalene's memory profiler employs a novel sampling algorithm that lets it operate with low overhead yet high precision. It also incorporates a novel algorithm that automatically pinpoints memory leaks within Python or across the Python/native boundary. Scalene tracks a new metric called copy volume, which highlights costly copying operations that can occur when Python silently converts between native and Python data representations, or between CPU and GPU. Since its introduction, Scalene has been widely adopted, with over 675,000 downloads to date. We present experience reports from developers who used Scalene to achieve significant performance improvements and memory savings.
Xiang (Jenny) Ren, Sitao Wang, Zhuqi Jin, David Lion, and Adrian Chiu, University of Toronto; Tianyin Xu, University of Illinois at Urbana-Champaign; Ding Yuan, University of Toronto
Performance debugging is notoriously elusive—real-world performance problems are rarely clear-cut failures, but manifest through the accumulation of fine-grained symptoms. Oftentimes, it is challenging to determine performance anomalies—absolute measures are unreliable, as system performance is inherently relative to workloads. Existing techniques focus on identifying absolute predicates that deviate between executions, which limits their application to performance problems.
This paper introduces relational debugging, a new technique that automatically pinpoints the root causes of performance problems. The core idea is to capture and reason about relations between fine-grained runtime events. We show that relations provide immense utility in explaining performance anomalies and locating root causes. Relational debugging is highly effective with a minimum of two executions (a good run and a bad run), eliminating the pain point of producing and labeling the many different executions required by traditional techniques.
We realize relational debugging by developing a practical tool named Perspect. Perspect directly operates on x86 binaries to accommodate real-world diagnosis scenarios. We evaluate Perspect on twelve challenging performance issues with various symptoms in Go runtime, MongoDB, Redis, and Coreutils. Perspect accurately located (or excluded) the root causes of these issues. In particular, we used Perspect to diagnose two open bugs, where developers failed to find root causes—the root causes reported by Perspect were confirmed by developers. A controlled user study shows that Perspect can speed up debugging by at least 10.87 times.
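As a rough intuition for reasoning over relations rather than absolute measures (a toy illustration, not Perspect's actual algorithm), one can compare how the ratio between two event counts shifts from a good run to a bad run; absolute counts grow with the workload, but stable ratios should not:

```python
# Illustrative sketch only: flag event pairs whose *relative* frequency changes
# sharply between a good and a bad run, instead of comparing absolute counts.
from itertools import combinations

good_run = {"alloc": 1_000, "free": 1_000, "retry": 10}      # event -> count
bad_run  = {"alloc": 5_000, "free": 5_000, "retry": 4_000}   # heavier workload

def suspicious_relations(good, bad, threshold=10.0):
    findings = []
    for a, b in combinations(sorted(good), 2):
        r_good = good[a] / max(good[b], 1)
        r_bad = bad[a] / max(bad[b], 1)
        change = max(r_good, r_bad) / max(min(r_good, r_bad), 1e-9)
        if change >= threshold:
            findings.append((a, b, r_good, r_bad))
    return findings

for a, b, rg, rb in suspicious_relations(good_run, bad_run):
    print(f"relation {a}/{b} shifted from {rg:.2f} to {rb:.2f}")
# The alloc/retry and free/retry relations shift by ~80x even though every
# absolute count grew, pointing at "retry" as the anomalous event.
```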
12:10 pm–1:40 pm
Back Bay Ballroom
1:40 pm–3:05 pm
Secure Your Bits I
Session Chair: Bryan Parno, Carnegie Mellon University
Emma Dauterman, UC Berkeley; Danny Lin, Woodinville High School; Henry Corrigan-Gibbs, MIT; David Mazières, Stanford University
Credential compromise is hard to detect and hard to mitigate. To address this problem, we present larch, an accountable authentication framework with strong security and privacy properties. Larch protects user privacy while ensuring that the larch log server correctly records every authentication. Specifically, an attacker who compromises a user’s device cannot authenticate without creating evidence in the log, and the log cannot learn which web service (relying party) the user is authenticating to. To enable fast adoption, larch is backwards-compatible with relying parties that support FIDO2, TOTP, and password-based login. Furthermore, larch does not degrade the security and privacy a user already expects: the log server cannot authenticate on behalf of a user, and larch does not allow relying parties to link a user across accounts. We implement larch for FIDO2, TOTP, and password-based login. Given a client with four cores and a log server with eight cores, an authentication with larch takes 150ms for FIDO2, 91ms for TOTP, and 74ms for passwords (excluding preprocessing, which takes 1.23s for TOTP).
Kinan Dak Albab, Ishan Sharma, Justus Adam, Benjamin Kilimnik, Aaron Jeyaraj, Raj Paul, Artem Agvanian, Leonhard Spiegelberg, and Malte Schwarzkopf, Brown University
Data privacy laws like the EU's GDPR grant users new rights, such as the right to request access to and deletion of their data. Manual compliance with these requests is error-prone and imposes costly burdens especially on smaller organizations, as non-compliance risks steep fines.
K9db is a new, MySQL-compatible database that complies with privacy laws by construction. The key idea is to make the data ownership and sharing semantics explicit in the storage system. This requires K9db to capture and enforce applications' complex data ownership and sharing semantics, but in exchange simplifies privacy compliance. Using a small set of schema annotations, K9db infers storage organization, generates procedures for data retrieval and deletion, and reports compliance errors if an application risks violating the GDPR.
Our K9db prototype successfully expresses the data sharing semantics of real web applications, and guides developers to getting privacy compliance right. K9db also matches or exceeds the performance of existing storage systems, at the cost of a modest increase in state size.
Mingyu Li, Shanghai Jiao Tong University; Shanghai AI Laboratory; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Xuyang Zhao and Le Chen, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Cheng Tan, Northeastern University; Huorong Li and Sheng Wang, Alibaba Group; Zeyu Mi, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Yubin Xia, Shanghai Jiao Tong University; Shanghai AI Laboratory; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Feifei Li, Alibaba Group; Haibo Chen, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China
State-of-the-art encrypted databases (EDBs) can be divided into two types: one that protects the whole DBMS engine in a trusted domain, and one that protects only operators that support queries over encrypted data. Both types have limitations when dealing with malicious database administrators (DBAs). The first type either exposes the data to DBAs or makes maintenance operations difficult if the DBA role is eliminated. The second type is vulnerable to abuse of the operator interfaces; in particular, we devise a smuggle attack that enables DBAs to secretly and effectively access data.
We introduce HEDB, which prevents smuggle attacks and preserves database maintainability. HEDB uses a dual-mode EDB design based on our analysis of DBA maintenance tasks. Execution Mode handles user queries by isolating DBAs from operators to prevent smuggle attacks, while Maintenance Mode enables DBMS maintenance and operator troubleshooting through authenticated replay and anonymized replay, respectively. Our evaluation shows that HEDB blocks smuggle attacks and supports common maintenance tasks with 5.88% runtime cost and 9.26% storage cost.
Chenxing Li, Shanghai Tree-Graph Blockchain Research Institute; Sidi Mohamed Beillahi, University of Toronto; Guang Yang and Ming Wu, Shanghai Tree-Graph Blockchain Research Institute; Wei Xu, Tsinghua University; Fan Long, University of Toronto
Authenticated storage access is the performance bottleneck of a blockchain, because each access can be amplified to potentially O(log n) disk I/O operations in the standard Merkle Patricia Trie (MPT) storage structure. In this paper, we propose a multi-Layer Versioned Multipoint Trie (LVMT), a novel high-performance blockchain storage with significantly reduced I/O amplifications. LVMT uses the authenticated multipoint evaluation tree (AMT) vector commitment protocol to update commitment proofs in constant time. LVMT adopts a multi-layer design to support unlimited key-value pairs and stores version numbers instead of value hashes to avoid costly elliptic curve multiplication operations. In our experiment, LVMT outperforms the MPT in real Ethereum traces, delivering read and write operations six times faster. It also boosts blockchain system execution throughput by up to 2.7 times.
Haohui Mai, PrivacyCore Inc.; Jiacheng Zhao, SKLP, Institute of Computing Technology, CAS; Zhongguancun Laboratory; and UCAS; Hongren Zheng, IIIS, Tsinghua University; Yiyang Zhao, SKLP, Institute of Computing Technology, CAS; and UCAS; Zibin Liu, BUPT; Mingyu Gao, IIIS, Tsinghua University; Cong Wang, IDEA Shenzhen; Huimin Cui, SKLP, Institute of Computing Technology, CAS; and UCAS; Xiaobing Feng, SKLP, Institute of Computing Technology, CAS; Zhongguancun Laboratory; and UCAS; Christos Kozyrakis, PrivacyCore Inc. and Stanford
Graphics Processing Units (GPUs) unlock emerging use cases like large language models and autonomous driving. They process a large amount of sensitive data, where security is of critical importance. GPU Trusted Execution Environments (TEEs) generally provide security to GPU applications with modest overheads. Recent proposals for GPU TEEs are promising, but many of them require hardware changes that have a long lead time to deploy in production environments.
This paper presents Honeycomb, a software-based, secure and efficient TEE for GPU applications. The key idea of Honeycomb is to leverage static analysis to validate the security of GPU applications at load time. Co-designing with the CPU TEE, as well as adding OS and driver support, Honeycomb is able to remove both the OS and the driver from the trusted computing base (TCB). Validation also ensures that all applications inside the system are secure, enabling a concise and secure approach to exchange data in plaintext via shared device memory on the GPU.
We have prototyped Honeycomb targeting the AMD RX6900XT GPU. Honeycomb is evaluated on five representative benchmarks and 23 applications in total, covering workloads of high performance computing, deep learning, and image processing. The results show that Honeycomb is both practical and efficient in securing real-world GPU applications. Validating applications to run on Honeycomb requires modest developer effort. The TCB is 18× smaller than that of Linux-based systems. Secure inter-process communication is up to 529× faster. Moreover, running large language model workloads like BERT and NanoGPT incurs only ∼2% overhead.
3:05 pm–3:10 pm
Presentation of the VMware Systems Research Award
3:10 pm–3:35 pm
Break with Refreshments
3:35 pm–5:00 pm
Secure Your Bits II
Session Chair: Jay Lorch, Microsoft Research
Adil Ahmad and Alex Schultz, Arizona State University; Byoungyoung Lee, Seoul National University; Pedro Fonseca, Purdue University
Confidential computing solutions are crucial to addressing cloud privacy concerns. Although SGX has witnessed significant adoption in the cloud, the reliance on a hardware implementation is restrictive for cloud providers in terms of orchestrating deployments and providing stronger security to their clients’ enclaves. eOPF addresses this limitation by providing a comprehensive, secure hypervisor-level instrumentation framework with the ability to monitor all enclave-OS interactions and implement protected services. eOPF overcomes several challenges, including bridging the semantic gap between the hypervisor and SGX and attesting the co-location of the framework with enclaves. Using eOPF, we implement two protected services that provide platform resource orchestration and complementary enclave side-channel defense. Our evaluation shows that eOPF incurs very low performance overhead (<2%) in its default state and only a modest overhead (geometric mean of 17% on SPEC) when strong, complementary side-channel defenses are enabled, making eOPF an efficient and practical solution for the cloud.
Sebastian Angel, Microsoft Research; Aditya Basu, Penn State University; Weidong Cui, Microsoft Research; Trent Jaeger, Penn State University; Stella Lau, MIT CSAIL; Srinath Setty, Microsoft Research; Sudheesh Singanamalla, University of Washington
This paper introduces Nimble, a cloud service that helps applications running in trusted execution environments (TEEs) to detect rollback attacks (i.e., detect whether a data item retrieved from persistent storage is the latest version). To achieve this, Nimble realizes an append-only ledger service by employing a simple state machine running in a TEE in conjunction with a crash fault-tolerant storage service. Nimble then replicates this trusted state machine to ensure the system is available even if a minority of state machines crash. A salient aspect of Nimble is a new reconfiguration protocol that allows a cloud provider to replace the set of nodes running the trusted state machine whenever it wishes—without affecting safety. We have formally verified Nimble’s core protocol in Dafny, and have implemented Nimble such that its trusted state machine runs in multiple TEE platforms (Intel SGX and AMD SNP-SEV). Our results show that a deployment of Nimble on machines running in different availability zones can achieve from tens of thousands of requests/sec with an end-to-end latency of under 3.2 ms (based on an in-memory key-value store) to several thousands of requests/sec with a latency of 30ms (based on Azure Table).
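The general rollback-detection pattern the abstract describes can be illustrated with a toy sketch (hypothetical names; not Nimble's protocol, which replicates the ledger across TEEs and authenticates entries cryptographically): the application records a digest of each new state version in an append-only ledger and, on load, accepts storage only if it matches the latest ledger entry.

```python
# Hypothetical sketch of rollback detection with an append-only ledger;
# illustrative only, not Nimble's design or API.
import hashlib

class Ledger:
    def __init__(self):
        self._entries = []             # append-only list of state digests

    def append(self, digest: str) -> int:
        self._entries.append(digest)
        return len(self._entries) - 1  # position of the new entry

    def latest(self) -> str:
        return self._entries[-1]

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

ledger = Ledger()

def persist(storage: dict, data: bytes):
    storage["blob"] = data
    ledger.append(digest(data))        # record the latest version before use

def load(storage: dict) -> bytes:
    data = storage["blob"]
    if digest(data) != ledger.latest():  # stale (rolled-back) data is rejected
        raise RuntimeError("rollback detected: storage is not the latest version")
    return data

storage = {}
persist(storage, b"v1")
persist(storage, b"v2")
storage["blob"] = b"v1"                # attacker rolls storage back to the old version
try:
    load(storage)
except RuntimeError as e:
    print(e)                           # rollback detected
```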
Sultan Mahmud Sajal, Microsoft Research and Pennsylvania State University; Luke Marshall and Beibin Li, Microsoft Research; Shandan Zhou and Abhisek Pan, Microsoft Azure; Konstantina Mellou and Deepak Narayanan, Microsoft Research; Timothy Zhu, Pennsylvania State University; David Dion and Thomas Moscibroda, Microsoft Azure; Ishai Menache, Microsoft Research
The infinite capacity of cloud computing is an illusion: in reality, cloud providers cannot always have enough capacity of the right type, in the right place, at the right time to meet all demand. Consequently, cloud providers need to implement admission-control policies to ensure accepted capacity requests experience high availability. However, admission control in the public cloud is hard due to dynamic changes in both supply and demand: hardware might become unavailable, and actual VM consumption could vary for a variety of reasons including tenant scale-outs and fulfillment of VM reservations made by customers ahead of time. In this paper, we design and implement Kerveros, a flexible admission-control system that has three desired properties: i) high computational scalability to handle a large inventory, ii) accurate capacity provisioning for high VM availability, and iii) good packing efficiency to optimize resource usage. To achieve this, Kerveros uses novel bookkeeping techniques to quickly estimate the capacity available for incoming VM requests. Our system has been deployed in Microsoft Azure. Results from both simulations and production confirm that Kerveros achieves more than four nines of availability while sustaining request processing latencies of a few milliseconds.
Jiahao Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Dingji Li, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; Zeyu Mi, Yuxuan Liu, and Binyu Zang, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Haibing Guan, Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China
Today’s mainstream virtualization systems are plagued by severe security threats due to the large attack surface exposed by in-kernel hypervisor components such as KVM. To address this issue, this paper proposes a novel design called delegated virtualization, which decouples the commodity hypervisor into two planes: the hypervisor plane for hypervisor control (which is typically small and has fixed logic) and the VM plane for handling virtual machine (VM) requests and exceptions at runtime. As our investigation reveals that all known hypervisor vulnerabilities that threaten the host kernel lie in the VM plane, delegated virtualization completely offloads the in-kernel VM plane to a user-space hypervisor called DuVisor that directly interacts with its VM without exiting to the host kernel, based on a small hardware extension (481 lines of Chisel). We have implemented the hardware extension on an open-source RISC-V CPU on FireSim and built a Rust-based DuVisor atop it. The evaluation results demonstrate that DuVisor significantly reduces the attack surface with negligible performance overhead (< 5%). DuVisor’s source code is publicly available at https://github.com/IPADS-DuVisor.
Ziqiao Zhou, Microsoft Research; Yizhou Shan, University of California, San Diego; Weidong Cui, Xinyang Ge, Marcus Peinado, and Andrew Baumann, Microsoft Research
Virtual machines are the basis of resource isolation in today's public clouds, yet the security risks of entrusting that isolation to a cloud provider's hypervisor are substantial. Such concerns have motivated the design of hardware extensions for "confidential VMs" that seek to remove the hypervisor from the trusted computing base by adding a highly-privileged firmware layer that checks hypervisor actions, and supports memory encryption and remote attestation. However, the hypervisor retains control of resource management and observes associated actions of the guest including nested page table faults and CPU scheduling, and thus confidential VMs remain vulnerable to an ever-changing variety of hypervisor-level side channel attacks. Bare-metal cloud servers avoid such leaks, but remain a niche due to the high cost of dedicated hardware.
We observe that typical cloud VMs run with a static allocation of memory and discrete cores, and increasingly rely on I/O offload, thus negating the apparent need for a hypervisor and the fragile hypervisor/guest isolation boundary. Our design, core slicing, enables multiple untrusted guest OSes to run on shared bare-metal hardware. To ensure isolation without the complexity of virtualization, we propose simple hardware extensions that restrict guests to a static slice of a machine's cores, memory and virtual I/O devices, and delegate resource allocation to a dedicated management slice. We demonstrate practicality and evaluate performance with prototypes for RISC-V and x86.
5:30 pm–7:00 pm
OSDI '23 Poster Session and Reception
Sponsored by Amazon
Back Bay Ballroom
Would you like to share a provocative opinion, interesting preliminary work, or a cool idea that will spark discussion at this year's OSDI? The poster session is the perfect venue to introduce such new or ongoing work. Poster presenters will have the opportunity to discuss their work, get exposure, and receive feedback from other attendees during the in-person evening reception. View the list of accepted posters.
Tuesday, July 11
8:00 am–9:00 am
Grand-Liberty Foyer/Constitution Foyer
9:00 am–10:10 am
Expanding, Hardening, and Deploying Your Bits
Session Chair: Ryan Huang, University of Michigan
Siyuan Zhuang, UC Berkeley; Stephanie Wang, UC Berkeley and Anyscale; Eric Liang and Yi Cheng, Anyscale; Ion Stoica, UC Berkeley
Given the fundamental tradeoff between run-time and recovery performance, current distributed systems often build application-specific recovery strategies to minimize overheads. However, it is increasingly common for different applications to be composed into heterogeneous pipelines. Implementing multiple interoperable recovery techniques in the same system is rare and difficult. Thus, today's users must choose between (1) building on a single system and facing a fixed choice of performance vs. recovery overheads, or (2) the challenging task of stitching together multiple systems that can each offer application-specific tradeoffs.
We present ExoFlow, a universal workflow system that enables a flexible choice of recovery vs. performance tradeoffs, even within the same application. The key insight behind our solution is to decouple execution from recovery and provide exactly-once semantics as a separate layer from execution. For generality, workflow tasks can return references that capture arbitrary inter-task communication. To enable the workflow system and therefore the end user to take control of recovery, we design task annotations that specify execution semantics such as nondeterminism. ExoFlow generalizes recovery for existing workflow applications ranging from ETL pipelines to stateful serverless workflows, while enabling further optimizations in task communication and recovery.
Jialun Lyu, Microsoft Azure and University of Toronto; Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, and Savyasachi Samal, Microsoft Azure; Ioannis Manousakis and Lisa Hsu, Formerly of Microsoft Azure; Preetha Subbarayalu, Ashish Raniwala, Brijesh Warrier, and Ricardo Bianchini, Microsoft Azure; Bianca Schroeder, University of Toronto; Daniel S. Berger, Microsoft Azure and University of Washington
Today’s cloud platforms handle server hardware failures by shutting down the affected server and only turning it back online once it has been repaired by a technician. At cloud scale, this all-or-nothing operating model is becoming increasingly unsustainable. This model is also at odds with technology trends, such as the need for new cooling technology.
This paper introduces Hyrax, a datacenter stack that enables compute servers with failed components to continue hosting VMs while hiding the underlying degraded capacity and performance. A key enabler of Hyrax is a novel model of changes in memory interleaving when deactivating faulty memory modules. Experiments on cloud production servers show that Hyrax overcomes common hardware failures without impacting peak VM performance. In large-scale simulations with production traces, Hyrax reduces server repair requirements by 50-60% without impacting VM scheduling.
NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall
Haonan Lu, University at Buffalo; Shuai Mu, Stony Brook University; Siddhartha Sen, Microsoft Research; Wyatt Lloyd, Princeton University
Strictly serializable datastores greatly simplify application development. However, existing techniques pay unnecessary costs for naturally consistent transactions, which arrive at servers in an order that is already strictly serializable. We exploit this natural arrival order by executing transactions with minimal costs while optimistically assuming they are naturally consistent, and then leverage a timestamp-based technique to efficiently verify if the execution is indeed consistent. In the process of this design, we identify a fundamental pitfall in relying on timestamps to provide strict serializability and name it the timestamp-inversion pitfall. We show that timestamp inversion has affected several existing systems.
We present Natural Concurrency Control (NCC), a new concurrency control technique that guarantees strict serializability and ensures minimal costs---i.e., one-round latency, lock-free, and non-blocking execution---in the common case by leveraging natural consistency. NCC is enabled by three components: non-blocking execution, decoupled response management, and timestamp-based consistency checking. NCC avoids the timestamp-inversion pitfall with response timing control and proposes two optimization techniques, asynchrony-aware timestamps and smart retry, to reduce false aborts. Moreover, NCC designs a specialized protocol for read-only transactions, which is the first to achieve optimal best-case performance while guaranteeing strict serializability without relying on synchronized clocks. Our evaluation shows NCC outperforms state-of-the-art strictly serializable solutions by an order of magnitude on many workloads.
Boris Grubic, Meta; Yang Wang, Meta and the Ohio State University; Tyler Petrochko, Ran Yaniv, Brad Jones, David Callies, Matt Clarke-Lauer, and Dan Kelley, Meta; Soteris Demetriou, Meta and Imperial College London; Kenny Yu and Chunqiang Tang, Meta
We present Conveyor, Meta's software deployment tool, along with the valuable data obtained from managing over 30,000 deployment pipelines that deploy all kinds of services at Meta across millions of machines. We describe a wide range of deployment scenarios that Conveyor supports to achieve universal coverage. At Meta, out of all the deployment pipelines for services deployed via containers, 97% of them employ fully automated deployments without manual intervention: 55% utilize continuous deployment, instantly deploying every code change to production after passing automated tests, and the remaining 42% are automatically deployed on a fixed schedule (mostly daily or weekly) without manual validation. We highlight several distinguishing features of Conveyor, including safe in-place updates to reduce hardware costs, analysis of code dependencies to prevent faulty releases, and the capability to deploy complex ML models at scale.
10:10 am–10:45 am
Break with Refreshments
10:45 am–12:10 pm
Query Your Bits
Session Chair: Atul Adya, Databricks
Tamer Eldeeb and Xincheng Xie, Columbia University; Philip A. Bernstein, Microsoft Research; Asaf Cidon and Junfeng Yang, Columbia University
Distributed on-disk database systems could either use an expensive commit protocol like two-phase commit (2PC) to guarantee atomicity, and suffer from slow distributed transactions, or forgo 2PC, which leads to weaker semantics, limitations on the programming model, or constrained scalability, making the system less general. We argue this compromise is no longer necessary within modern datacenters. Low-latency 2PC (∼150 μs on Azure for 2PC over Paxos) can be achieved using low-latency storage for the relatively small transaction logs, fast RPCs, and careful protocol design. With fast 2PC, the data contention bottleneck for many transactions shifts from 2PC to reading the data itself from the relatively slow storage while holding transaction locks.
We present Chardonnay, a scalable, on-disk, multi-versioned transactional key-value store optimized for single datacenter deployments with fast 2PC. Chardonnay has a general interface supporting point reads, scans, and writes within multi-step strictly serializable ACID transactions. The key mechanism underlying Chardonnay's design is strongly consistent snapshot reads on commodity hardware, using a novel lock-free read protocol. Chardonnay uses this protocol to cheaply determine the read-write sets of queries, enabling Chardonnay to transparently prefetch data needed for a transaction prior to the execution of the transaction and the acquisition of locks. This enables Chardonnay to achieve fast transactions by minimizing contention, and avoids aborts due to deadlocks by ordering lock requests.
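A much-simplified illustration of the dry-run-then-prefetch idea sketched above (hypothetical names and structure; not Chardonnay's protocol, which uses lock-free snapshot reads over a distributed store): run the transaction once against a snapshot to learn its read/write set, prefetch those keys, then execute it for real with locks acquired in a fixed order.

```python
# Simplified sketch of "dry-run to learn the read/write set, then prefetch"
# (illustrative only; names are hypothetical, not Chardonnay's interface).

class Store:
    def __init__(self, data):
        self.data = dict(data)            # pretend this lives on slow storage
        self.cache, self.locks = {}, set()

    def prefetch(self, keys):
        for k in keys:
            self.cache[k] = self.data.get(k)

    def lock(self, key): self.locks.add(key)
    def unlock(self, key): self.locks.discard(key)
    def get(self, key): return self.cache.get(key, self.data.get(key))
    def put(self, key, value):
        self.cache[key] = value
        self.data[key] = value

class RecordingStore:
    """Wraps a snapshot so a dry run records which keys a transaction touches."""
    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.read_set, self.write_set = set(), set()

    def get(self, key):
        self.read_set.add(key)
        return self.snapshot.get(key)

    def put(self, key, value):
        self.write_set.add(key)           # dry run: nothing is actually written

def run_transaction(store, txn):
    rec = RecordingStore(dict(store.data))   # 1) lock-free dry run on a snapshot
    txn(rec)
    keys = sorted(rec.read_set | rec.write_set)
    store.prefetch(keys)                     # 2) prefetch before taking any locks
    for k in keys:                           # 3) fixed lock order avoids deadlocks
        store.lock(k)
    try:
        txn(store)                           # 4) real execution on warm data
    finally:
        for k in keys:
            store.unlock(k)

def transfer(s):
    a, b = s.get("alice"), s.get("bob")
    s.put("alice", a - 10)
    s.put("bob", b + 10)

store = Store({"alice": 100, "bob": 0})
run_transaction(store, transfer)
print(store.data)   # {'alice': 90, 'bob': 10}
```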
Syed Akbar Mehdi, The University of Texas at Austin; Deukyeon Hwang and Simon Peter, University of Washington; Lorenzo Alvisi, Cornell University
ScaleDB is a serializable in-memory transactional database that achieves excellent scalability on multi-core machines by asynchronously updating range indexes. We find that asynchronous range index updates can significantly improve database scalability by applying updates in batches, reducing contention on critical sections. To avoid stale reads, ScaleDB uses small hash indexlets to hold delayed updates. We use indexlets to design ACC, an asynchronous concurrency control protocol providing serializability. With ACC, it is possible to delay range index updates without adverse performance effects on transaction execution in the common case.
ACC delivers scalable serializable isolation for transactions, with high throughput and low abort rate. Evaluation on a dual-socket server with 36 cores shows that ScaleDB achieves 9.5× better query throughput than Peloton on the YCSB benchmark and 1.8× better transaction throughput than Cicada on the TPC-C benchmark.
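The buffering idea is roughly the following (a toy sketch under simplifying assumptions, not ScaleDB's indexlets or its ACC protocol): recent writes land in a small hash table and are merged into the sorted range index in batches, while reads consult both structures so they never observe stale results.

```python
# Simplified sketch: writes go to a small hash "indexlet" and are merged into
# the sorted range index in batches; range reads consult both structures.
import bisect

class RangeIndex:
    def __init__(self, batch_size=4):
        self.keys, self.values = [], []   # sorted range index
        self.indexlet = {}                # recent, not-yet-merged updates
        self.batch_size = batch_size

    def put(self, key, value):
        self.indexlet[key] = value
        if len(self.indexlet) >= self.batch_size:
            self._merge()                 # amortize sorted-index maintenance

    def _merge(self):
        for key, value in self.indexlet.items():
            i = bisect.bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key:
                self.values[i] = value
            else:
                self.keys.insert(i, key)
                self.values.insert(i, value)
        self.indexlet.clear()

    def range(self, lo, hi):
        i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_right(self.keys, hi)
        merged = dict(zip(self.keys[i:j], self.values[i:j]))
        merged.update({k: v for k, v in self.indexlet.items() if lo <= k <= hi})
        return sorted(merged.items())

idx = RangeIndex()
for k in [5, 1, 9, 3, 7]:
    idx.put(k, f"v{k}")
print(idx.range(1, 7))   # [(1, 'v1'), (3, 'v3'), (5, 'v5'), (7, 'v7')]
```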
Qianxi Zhang, Shuotao Xu, Qi Chen, and Guoxin Sui, Microsoft Research Asia; Jiadong Xie, Microsoft Research Asia and East China Normal University; Zhizhen Cai and Yaoqi Chen, Microsoft Research Asia and University of Science and Technology of China; Yinxuan He, Microsoft Research Asia and Renmin University of China; Yuqing Yang, Fan Yang, Mao Yang, and Lidong Zhou, Microsoft Research Asia
Approximate similarity queries on high-dimensional vector indices have become the cornerstone for many critical online services. An increasing need for more sophisticated vector queries requires integrating vector search systems with relational databases. However, high-dimensional vector indices do not exhibit monotonicity, a critical property of conventional indices. The lack of monotonicity forces existing vector systems to rely on monotonicity-preserving tentative indices, set up temporarily for a target vector's TopK nearest neighbors, to facilitate queries. This leads to suboptimal performance due to the difficulty to predict the optimal K.
This paper presents VBase, a system that efficiently supports complex queries of both approximate similarity search and relational operators. VBase identifies a common property, relaxed monotonicity, to unify two seemingly incompatible systems. This common property allows VBase to circumvent the constraints of a TopK-only interface to achieve significantly higher efficiency, while provably preserving the semantics of TopK-based solutions. Evaluation results show VBase offers up to three orders-of-magnitude higher performance than state-of-the-art vector systems on complex online vector queries. VBase further enables analytical similarity queries that previous vector systems do not, and shows 7,000× speedup with 99.9% accuracy of exact queries.
Zu-Ming Jiang and Si Liu, ETH Zurich; Manuel Rigger, National University of Singapore; Zhendong Su, ETH Zurich
Transactions are an important feature of database management systems (DBMSs), as they provide the ACID guarantees for a sequence of database operations. Consequently, approaches have been proposed to automatically find transactional bugs in DBMSs. However, they cannot handle complex operations and predicates common in real-world database queries, and thus miss bugs.
This paper introduces a general, effective technique for finding transactional bugs in DBMSs that supports complex SQL queries and predicates. At the conceptual level, we address the test-oracle problem by constructing semantically-equivalent test cases based on fine-grained statement-level dependencies in transactions. At the technical level, we introduce (1) statement-dependency graphs to describe dependencies among SQL statements in transactions, (2) SQL-level instrumentation to capture possible statement-level dependencies, and (3) transactional oracle construction to generate semantically-equivalent test cases using statement-dependency graphs. We also establish the correctness of our approach in generating semantically-equivalent test cases. We have realized our technique as a tool, TxCheck, and evaluated it on three widely-used and well-tested DBMSs, namely TiDB, MySQL, and MariaDB. In total, TxCheck found 56 unique bugs, 52 of which have been confirmed and 18 already fixed. We believe that TxCheck can help solidify DBMSs’ support for transactions thanks to its generality and effectiveness.
Audrey Cheng, David Chu, Terrance Li, Jason Chan, Natacha Crooks, Joseph M. Hellerstein, and Ion Stoica, UC Berkeley; Xiangyao Yu, University of Wisconsin—Madison
Most caching policies focus on increasing object hit rate to improve overall system performance. However, these algorithms are insufficient for transactions. In this work, we define a new metric, transactional hit rate, to capture when caching reduces latency for transactions. We present DeToX, a caching system that leverages transactional dependencies to make eviction and prefetching decisions. DeToX is able to significantly outperform single-object alternatives on real-world workloads and popular OLTP benchmarks, providing up to a 130% increase in transaction hit rate and 3.4x improvement in cache efficiency.
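To make the distinction from object hit rate concrete, the toy calculation below uses a deliberately simplified reading of the metric (a transaction counts as a hit only if every object it accesses is cached); the paper's definition is more nuanced, so treat this purely as intuition.

```python
# Toy illustration (simplified assumption, not DeToX's exact definition): count
# a transaction as a "hit" only if every object it accesses is in the cache.
transactions = [
    {"a", "b"},          # objects read by each transaction
    {"a", "c"},
    {"a", "b", "d"},
    {"b"},
]
cached = {"a", "b"}

object_accesses = [o for txn in transactions for o in txn]
object_hit_rate = sum(o in cached for o in object_accesses) / len(object_accesses)
txn_hit_rate = sum(txn <= cached for txn in transactions) / len(transactions)

print(f"object hit rate:        {object_hit_rate:.2f}")   # 0.75
print(f"transactional hit rate: {txn_hit_rate:.2f}")      # 0.50
# A cache tuned for object hit rate can still leave half the transactions
# stalled on at least one miss, which is what a transaction-aware policy avoids.
```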
12:10 pm–1:40 pm
Back Bay Ballroom
1:40 pm–2:50 pm
Store Your Bits
Session Chair: Roxana Geambasu, Columbia University
Qing Wang, Youyou Lu, Jing Wang, and Jiwu Shu, Tsinghua University
Combining persistent memory (PM) with RDMA is a promising approach to performant replicated distributed key-value stores (KVSs). However, existing replication approaches do not work well when applied to PM KVSs: 1) Using RPC induces software queueing and execution at backups, increasing request latency; 2) Using one-sided RDMA WRITE causes numerous streams of small PM writes, leading to severe device-level write amplification (DLWA) on PM.
In this paper, we propose Rowan, an efficient RDMA abstraction to handle replication writes in PM KVSs; it aggregates massive concurrent remote writes from different servers, and lands these writes to PM in a sequential (thus low DLWA) and one-sided (thus low latency) manner. We realize Rowan with off-the-shelf RDMA NICs. Further, we build Rowan-KV, a log-structured PM KVS using Rowan for replication. Evaluation shows that under write-intensive workloads, compared with PM KVSs using RPC and RDMA WRITE for replication, Rowan-KV boosts throughput by 1.22× and 1.39× as well as lowers median PUT latency by 1.77× and 2.11×, respectively, while largely eliminating DLWA.
Jaehong Min and Chenxingyu Zhao, University of Washington; Ming Liu, University of Wisconsin-Madison; Arvind Krishnamurthy, University of Washington
Emerging Zoned Namespace (ZNS) SSDs, providing the coarse-grained zone abstraction, hold the potential to significantly enhance the cost-efficiency of future storage infrastructure and mitigate performance unpredictability. However, existing ZNS SSDs have a static zoned interface, making them unable to adapt to workload runtime behavior, unscalable to underlying hardware capabilities, and prone to interference between co-located zones. Applications either under-provision zone resources and suffer unsatisfactory throughput, over-provision zones and incur extra cost, or experience unexpected I/O latencies.
We propose eZNS, an elastic-zoned namespace interface that exposes an adaptive zone with predictable characteristics. eZNS comprises two major components: a zone arbiter that manages zone allocation and active resources on the control plane, and a hierarchical I/O scheduler with read congestion control and write admission control on the data plane. Together, eZNS enables the transparent use of a ZNS SSD and closes the gap between application requirements and zone interface properties. Our evaluations over RocksDB demonstrate that eZNS outperforms a static zoned interface by up to 17.7% in throughput and 80.3% in tail latency.
Chao Wang, Junliang Hu, Tsun-Yu Yang, Yuhong Liang, and Ming-Chang Yang, The Chinese University of Hong Kong
With the merits of high density, non-volatility, and DRAM-scale latency/bandwidth, persistent memory (PM) brings hope to high-performance storage systems, in which hashing-based index structures receive great attention owing to their efficient query performance. Though many efforts have been made in recent years to rethink hashing schemes for PM, based on our investigation, none of them achieves performance scalability, efficiency, and predictability at once, seriously limiting their practicality for time-sensitive or latency-critical applications. To this end, this paper presents SEPH, a Scalable, Efficient, and Predictable Hashing for PM. SEPH paves a new direction to build the hash table by introducing the novel Level Segment (LS) structure, a key to breaking the dilemma between efficiency and predictability standing in front of the existing hashing schemes for PM. With the LS-based hash table structure, SEPH further enables a low-overhead split to greatly suppress the resizing-incurred unpredictability, and develops a semi-lock-free concurrency control that requires a nearly minimal number of writes to handle an item insertion, achieving ever-higher efficiency and scalability while ensuring correctness and crash consistency. Compared to state-of-the-art hashing schemes, SEPH demonstrates higher efficiency (up to 15.4× higher throughput), better scalability (performance scales up to 48 threads), and more reliable predictability (improving tail latency by up to 19.3×).
Xingda Wei, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University, and Shanghai AI Laboratory; Fangming Lu, Tianxia Wang, Jinyu Gu, and Yuhan Yang, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Rong Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University, and Shanghai AI Laboratory; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University
Serverless platforms essentially face a tradeoff between container startup time and provisioned concurrency (i.e., cached instances), which is further exaggerated by the frequent need for remote container initialization. This paper presents MITOSIS, an operating system primitive that provides fast remote fork, which exploits a deep codesign of the OS kernel with RDMA. By leveraging the fast remote read capability of RDMA and partial state transfer across serverless containers, MITOSIS bridges the performance gap between local and remote container initialization. MITOSIS is the first to fork over 10,000 new containers from one instance across multiple machines within a second, while allowing the new containers to efficiently transfer the pre-materialized states of the forked one. We have implemented MITOSIS on Linux and integrated it with FN, a popular serverless platform. Under load spikes in real-world serverless workloads, MITOSIS reduces the function tail latency by 89% with orders of magnitude lower memory usage. For serverless workflows that require state transfer, MITOSIS improves execution time by 86%.
2:50 pm–2:55 pm
2022 Conference Climate & Harassment Survey Results Announcement
2:55 pm–3:20 pm
Break with Refreshments
3:20 pm–4:30 pm
Manage Your Bits I
Session Chair: Emery Berger, University of Massachusetts Amherst
Baptiste Lepers, Université de Neuchâtel; Willy Zwaenepoel, University of Sydney
We demonstrate that hardware management of a tiered memory system offers better performance for many applications than current methods of software management. Hardware management treats the fast memory as a cache on slower memory. The advantages are that caching can be done at cache line granularity and that data appears in fast memory as soon as it is accessed. The potential for cache conflicts has, however, led previous works to conclude these hardware methods generally perform poorly.
In this paper, we show that the use of low-overhead conflict avoidance techniques eliminates conflicts almost entirely and thereby addresses the above limitation. We explore two techniques. One technique simply tries to avoid conflicts between pages at page allocation time. The other relies on sampling memory accesses to specifically avoid conflicts between hot pages, both at page allocation time and, for hot-page conflicts missed at allocation time, by dynamic remapping.
We have implemented these techniques in the Linux kernel on an Intel Optane machine in a system called JC. We use HPC applications, key-value stores and databases to compare JC to the default Linux tiered memory management implementation and to a state-of-the-art software management approach, as exemplified by the HeMem system.
Our measurements show that JC outperforms Linux and HeMem by up to 5x. A surprising conclusion of this paper is that a cache can be made to perform close-to-optimally by minimizing conflicts at page allocation time, without any monitoring of hot pages or dynamic page remapping.
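A toy model of the first technique (conflict avoidance at page allocation time), assuming fast memory behaves as a direct-mapped cache over slow memory; the set count, frame layout, and policy here are hypothetical and much simpler than JC's.

```python
# Toy sketch: if fast memory acts as a direct-mapped cache over slow memory,
# two pages conflict when they map to the same cache set. Allocating from the
# least-loaded set spreads pages out and avoids most conflicts up front.
# (Simplified model; not the JC implementation.)
NUM_SETS = 4   # hypothetical number of direct-mapped sets in fast memory

class ConflictAwareAllocator:
    def __init__(self, num_sets=NUM_SETS):
        self.load = [0] * num_sets         # pages currently mapped to each set
        self.next_frame_in_set = [0] * num_sets

    def allocate_page(self):
        # Pick the set with the fewest resident pages to minimize conflicts.
        s = min(range(len(self.load)), key=lambda i: self.load[i])
        frame = self.next_frame_in_set[s] * len(self.load) + s
        self.next_frame_in_set[s] += 1
        self.load[s] += 1
        return frame                       # physical frame number

alloc = ConflictAwareAllocator()
frames = [alloc.allocate_page() for _ in range(6)]
print(frames)      # [0, 1, 2, 3, 4, 5]
print(alloc.load)  # [2, 2, 1, 1] -> no cache set is overloaded
```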
Amogha Udupa Shankaranarayana Gopal, Raveendra Soori, Michael Ferdman, and Dongyoon Lee, Stony Brook University
A heap overflow vulnerability occurs when a program written in an unmanaged language such as C or C++ accesses a memory location beyond an object allocation boundary. Malicious users may exploit this vulnerability to corrupt an adjacent object in memory, creating an entry point for a security attack. Despite decades of research, unfortunately, it still remains challenging to detect heap overflow vulnerabilities in real-world programs at a low cost.
We present TAILCHECK, a new lightweight heap overflow detection scheme that leverages page protection and pointer tagging. When an object is created, TAILCHECK allocates an additional page-protected shadow object, called a TailObject, placing the distance from the object to its TailObject as a tag stored in the unused high-order bits of the object pointer. For every access to the original object, TAILCHECK performs an additional memory access to the TailObject, whose address is computed using the tag. Heap overflows are detected as page faults when an access occurs beyond the TailObject. We evaluated TAILCHECK with four server applications (apache, nginx, memcached, redis) and the SPEC CPU2017 and SPEC CPU2006 benchmarks, successfully finding heap overflows in SPEC CPU2017 gcc. TAILCHECK experiences 4% and 2% run-time overhead for the average and tail (99%) latencies for server applications; and only 33% and 29% run-time overhead for SPEC CPU2017 and SPEC CPU2006, respectively, less than the state-of-the-art solution.
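The pointer-tagging arithmetic can be illustrated with a small simulation (integers standing in for pointers; not TAILCHECK itself, which tags real heap pointers and relies on page protection to trap out-of-bounds accesses).

```python
# Toy simulation of the tagging arithmetic only: the distance from an object to
# its page-protected shadow "TailObject" is stashed in otherwise-unused high
# pointer bits. Constants and addresses here are hypothetical.
TAG_SHIFT = 48   # user-space pointers on x86-64 typically leave the top bits unused

def tag_pointer(addr, distance):
    return (distance << TAG_SHIFT) | addr

def untag(ptr):
    return ptr & ((1 << TAG_SHIFT) - 1)

def tail_address(ptr):
    # Address of the shadow object, recomputed from the tag on every access.
    return untag(ptr) + (ptr >> TAG_SHIFT)

p = tag_pointer(0x7F0000001000, distance=0x2000)
print(hex(untag(p)))         # 0x7f0000001000 -> the real object
print(hex(tail_address(p)))  # 0x7f0000003000 -> the page-protected TailObject
```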
Xuchuan Luo, School of Computer Science, Fudan University; Pengfei Zuo, Huawei Cloud; Jiacheng Shen and Jiazhen Gu, The Chinese University of Hong Kong; Xin Wang, School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China; Michael R. Lyu, The Chinese University of Hong Kong; Yangfan Zhou, School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China
Disaggregated memory (DM) is an increasingly prevalent architecture in academia and industry with high resource utilization. It separates computing and memory resources into two pools and interconnects them with fast networks. Existing range indexes on DM are based on B+ trees, which suffer from large inherent read and write amplifications. The read and write amplifications rapidly saturate the network bandwidth, resulting in low request throughput and high access latency of B+ trees on DM.
In this paper, we propose to use the radix tree, which is more suitable for DM than the B+ tree due to smaller read and write amplifications. However, constructing a radix tree on DM is challenging due to the costly lock-based concurrency control, the bounded memory-side IOPS, and the complicated computing-side cache validation. To address these challenges, we design SMART, the first radix tree for disaggregated memory with high performance. Specifically, we leverage 1) a hybrid concurrency control scheme including lock-free internal nodes and fine-grained lock-based leaf nodes to reduce lock overhead, 2) a computing-side read-delegation and write-combining technique to break through the IOPS upper bound by reducing redundant I/Os, and 3) a simple yet effective reverse check mechanism for computing-side cache validation. Experimental results show that SMART achieves 6.1x higher throughput under typical write-intensive workloads and 2.8x higher throughput under read-only workloads, compared with state-of-the-art B+ trees on DM.
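The read-delegation component resembles the familiar single-flight pattern: concurrent readers of the same key on one compute node share one remote read instead of each issuing their own. The sketch below is a hypothetical local illustration of that pattern only; SMART implements it over RDMA together with write combining and cache validation.

```python
# Hypothetical sketch of read delegation ("single-flight" reads); illustrative
# only, not SMART's RDMA implementation.
import threading
import time

store = {"k": 42}

def slow_remote_read(key):
    time.sleep(0.01)                  # stand-in for a remote round trip
    return store[key]

class ReadDelegator:
    def __init__(self, remote_read):
        self.remote_read = remote_read
        self.inflight = {}            # key -> (event, result holder)
        self.lock = threading.Lock()
        self.remote_ios = 0

    def read(self, key):
        with self.lock:
            entry = self.inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self.inflight[key] = entry
                self.remote_ios += 1
        event, holder = entry
        if leader:
            holder["value"] = self.remote_read(key)   # one I/O serves all waiters
            with self.lock:
                del self.inflight[key]
            event.set()
        else:
            event.wait()                              # piggyback on the leader
        return holder["value"]

d = ReadDelegator(slow_remote_read)
threads = [threading.Thread(target=d.read, args=("k",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(d.remote_ios)   # typically 1 remote read instead of 8
```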
Vasily A. Sartakov, Lluís Vilanova, and Munir Geden, Imperial College London; David Eyers, University of Otago; Takahiro Shinagawa, The University of Tokyo; Peter Pietzuch, Imperial College London
Cloud environments host many tenants, and typically there is substantial overlap between the application binaries and libraries executed by tenants. Thus, memory de-duplication can increase memory density by allocating memory for shared binaries only once. Existing de-duplication approaches, however, either rely on a shared OS to de-deduplicate binary objects, which provides unacceptably weak isolation; or exploit hypervisor-based de-duplication at the level of memory pages, which is blind to the semantics of the objects to be shared.
We describe Object Reuse with Capabilities (ORC), which supports the fine-grained sharing of binary objects between tenants, while isolating tenants strongly through a small trusted computing base (TCB). ORC uses hardware support for memory capabilities to isolate tenants, which permits shared objects to be accessible to multiple tenants safely. Since ORC shares binary objects within a single address space through capabilities, it uses a new relocation type to create per-tenant state when loading shared objects. ORC supports the loading of objects by an untrusted guest, outside of its TCB, only verifying the safety of the loaded data. Our experiments show that ORC achieves a higher memory density with a lower overhead than hypervisor-based de-deduplication.
4:30 pm–4:45 pm
4:45 pm–5:55 pm
Manage Your Bits II
Session Chair: Dilma Da Silva, Texas A&M University
Marius Eriksen, Kaushik Veeraraghavan, Yusuf Abdulghani, Andrew Birchall, Po-Yen Chou, Richard Cornew, Adela Kabiljo, Ranjith Kumar S, Maroo Lieuw, Justin Meza, Scott Michelson, Thomas Rohloff, Hayley Russell, Jeff Qin, and Chunqiang Tang, Meta
Customers of both private and public cloud providers must wrestle with the problem of regionalization: how should service capacity be apportioned across a large number of geo-distributed datacenter regions? This problem is further complicated by the complex service dependency graphs that arise from microservice architectures, as well as capacity availability and hardware mix that can vary greatly by region.
Historically, regionalization has been solved through a slow-moving and manual process, whereby owners of large services directly negotiate capacity allocation and distribution with the cloud provider. However, as both service and cloud footprints continue to grow, these manual processes are becoming untenable, producing both a great amount of toil for everyone involved and suboptimal results.
At Meta we have built a system, Flux, to automate capacity regionalization, moving it from a bottom-up, manual process to a top-down, automated one. Flux employs RPC tracing to identify service capacity models, and uses these to compute an optimal joint capacity and traffic distribution plan that spans 1000s of services across 10s of products, and involves millions of servers. These plans are orchestrated by a system that safely and efficiently rebalances service capacity and product traffic across 10s of regions on a continuous basis.
Justin J. Meza, Thote Gowda, Ahmed Eid, Tomiwa Ijaware, Dmitry Chernyshev, Yi Yu, Md Nazim Uddin, Rohan Das, Chad Nachiappan, Sari Tran, Shuyang Shi, Tina Luo, David Ke Hong, Sankaralingam Panneerselvam, Hans Ragas, Svetlin Manavski, Weidong Wang, and Francois Richard, Meta Platforms, Inc.
Every day, billions of people depend on Internet services for communication, commerce, and entertainment. Yet planetary-scale data center infrastructures consisting of millions of servers experience unplanned capacity outages and unexpected demand for resources; how can such infrastructures remain reliable in the face of capacity and workload flux?
In this paper, we introduce Defcon, a system for improving the availability of large-scale, globally-distributed Internet services using graceful feature degradation. In response to overload conditions, Defcon enables site operators to gradually disable less-critical features in order to reduce resource demand. Defcon presents a common interface to product developers to define feature knobs that represent degradation capabilities. Defcon automatically tests knobs to understand each knob’s product- and infrastructure-level trade-offs. At Meta, we have used Defcon to improve global product availability in the face of worldwide demand-surges in addition to large-scale infrastructure failures.
Romil Bhardwaj, UC Berkeley; Kirthevasan Kandasamy, University of Wisconsin-Madison; Asim Biswal, Wenshuo Guo, Benjamin Hindman, Joseph Gonzalez, Michael Jordan, and Ion Stoica, UC Berkeley
Traditional systems for allocating finite cluster resources among competing jobs have either aimed at providing fairness, relied on users to specify their resource requirements, or have estimated these requirements via surrogate metrics (e.g. CPU utilization). These approaches do not account for a job’s real world performance (e.g. P95 latency). Existing performance-aware systems use offline profiled data and/or are designed for specific allocation objectives. In this work, we argue that resource allocation systems should directly account for real-world performance and the varied allocation objectives of users. In this pursuit, we build Cilantro.
At the core of Cilantro is an online learning mechanism which forms feedback loops with the jobs to estimate the resource-to-performance mappings and load shifts. This relieves users from the onerous task of job profiling and collects reliable real-time feedback. This is then used to achieve a variety of user-specified scheduling objectives. Cilantro handles the uncertainty in the learned models by adapting the underlying policy to work with confidence bounds. We demonstrate this in two settings. First, in a multi-tenant 1000 CPU cluster with 20 independent jobs, three of Cilantro’s policies outperform 9 other baselines on three different performance-aware scheduling objectives, improving user utilities by up to 1.2–3.7×. Second, in a microservices setting, where 160 CPUs must be distributed between 19 inter-dependent microservices, Cilantro outperforms 3 other baselines, reducing the end-to-end P99 latency to 0.57× that of the next best baseline.
Midhul Vuppalapati, Giannis Fikioris, and Rachit Agarwal, Cornell University; Asaf Cidon, Columbia University; Anurag Khandelwal, Yale University; Éva Tardos, Cornell University
The classical max-min fairness algorithm for resource allocation provides many desirable properties, e.g., Pareto efficiency, strategy-proofness, and fairness. This paper builds upon the observation that max-min fairness guarantees these properties only under a strong assumption, namely that user demands are static over time, and that, for the realistic case of dynamic user demands, max-min fairness loses one or more of these properties.
We present Karma, a generalization of max-min fairness for dynamic user demands. The key insight in Karma is to introduce "memory" into max-min fairness: when allocating resources, Karma takes users' past allocations into account. In each quantum, users donate their unused resources and are assigned credits when other users borrow these resources; Karma carefully orchestrates the exchange of credits across users (based on their instantaneous demands, donated resources, and borrowed resources), and performs prioritized resource allocation based on users' credits. We prove theoretically that Karma guarantees Pareto efficiency, online strategy-proofness, and optimal fairness for dynamic user demands (without future knowledge of user demands). Empirical evaluations over production workloads show that these properties translate well into practice: Karma is able to reduce disparity in performance across users to a bare minimum while maintaining Pareto-optimal system-wide performance.
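A minimal single-quantum sketch of the credit mechanism is shown below, with several simplifications (equal fair shares, unit-granularity resources, borrowers served in decreasing credit order, and one credit paid to donors per unit actually borrowed); the paper's algorithm is more careful on all of these points.

```python
# Simplified single-quantum credit-based allocation in the spirit of Karma.
def karma_quantum(demands, credits, total):
    fair = total // len(demands)
    alloc, donated = {}, 0
    # Users below their fair share keep what they need and donate the rest.
    for u, d in demands.items():
        alloc[u] = min(d, fair)
        donated += max(fair - d, 0)
    # Users above their fair share borrow donated resources, spending credits;
    # higher-credit users are served first.
    borrowed_total = 0
    for u in sorted((u for u, d in demands.items() if d > fair),
                    key=lambda u: credits[u], reverse=True):
        take = min(demands[u] - fair, donated, credits[u])
        alloc[u] += take
        credits[u] -= take
        donated -= take
        borrowed_total += take
    # Donors earn credits in proportion to their donations, one credit per
    # unit actually borrowed by others (a simplification).
    donations = {u: fair - d for u, d in demands.items() if d < fair}
    for u, gave in donations.items():
        credits[u] += borrowed_total * gave // sum(donations.values())
    return alloc, credits

demands = {"A": 10, "B": 2, "C": 6}
credits = {"A": 3, "B": 0, "C": 0}
print(karma_quantum(demands, credits, total=18))
# ({'A': 9, 'B': 2, 'C': 6}, {'A': 0, 'B': 3, 'C': 0})
```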
6:00 pm–7:30 pm
USENIX ATC '23 Poster Session and Reception
Back Bay Ballroom
The USENIX ATC '23 poster session and reception will feature posters by authors presenting their work in person at the conference. View the list of accepted posters.
Wednesday, July 12
8:00 am–9:00 am
Grand-Liberty Foyer/Constitution Foyer
9:00 am–10:10 am
Train Your Bits I
Session Chair: Peter Pietzuch, Imperial College London
Zhuohan Li and Lianmin Zheng, UC Berkeley; Yinmin Zhong, Peking University; Vincent Liu, University of Pennsylvania; Ying Sheng, Stanford University; Xin Jin, Peking University; Yanping Huang and Zhifeng Chen, Google; Hao Zhang, UC San Diego; Joseph E. Gonzalez and Ion Stoica, UC Berkeley
Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10× higher rates or 6× more burstiness while staying within latency constraints for more than 99% of requests.
Chen Zhang, Tsinghua University; Lingxiao Ma and Jilong Xue, Microsoft Research; Yining Shi, Peking University & Microsoft Research; Ziming Miao and Fan Yang, Microsoft Research; Jidong Zhai, Tsinghua University; Zhi Yang, Peking University; Mao Yang, Microsoft Research
With the growing complexity of deep neural networks (DNNs), developing DNN programs with intricate control flow logic (e.g., loops, branches, and recursion) has become increasingly essential. However, executing such DNN programs efficiently on accelerators is challenging. Current DNN frameworks typically process control flow on the CPU, while offloading the remaining computations to accelerators like GPUs. This often introduces significant synchronization overhead between the CPU and the accelerator, and prevents global optimization across control flow scopes.
To address this challenge, we propose Cocktailer, a new DNN compiler that co-optimizes the execution of control flow and data flow on hardware accelerators. Cocktailer provides the uTask abstraction to unify the representation of DNN models, including both control flow and data flow. This allows Cocktailer to expose a holistic scheduling space for rescheduling control flow to the lower-level hardware parallelism of accelerators. Cocktailer uses a heuristic policy to find efficient schedules and is able to automatically move control flow into kernels of accelerators, enabling optimization across control flow boundaries. Evaluations demonstrate that Cocktailer can accelerate DNN models with control flow by up to 8.2× over the fastest of the state-of-the-art DNN frameworks and compilers.
Yining Shi, Peking University & Microsoft Research; Zhi Yang, Peking University; Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou, Microsoft Research
With the growing demand for processing higher fidelity data and the use of faster computing cores in newer hardware accelerators, modern deep neural networks (DNNs) are becoming increasingly memory intensive. A disparity between underutilized computing cores and saturated memory bandwidth has been observed in various popular DNN models. This inefficiency is caused by both the conventional treatment of DNNs as compute-intensive workloads and the lack of holistic memory access optimization in DNN models.
In this paper, we introduce Welder, a deep learning compiler that optimizes the execution efficiency from a holistic memory access perspective. The core of Welder is tile-graph, an abstraction that facilitates fine-grained data management at tile level. By leveraging the observation of optimization independence across memory layers, Welder is able to decompose the whole combinatorial DNN optimization space into several independent ones and effectively trade off between intra- and inter-operator data reuse using a tile traffic-based cost model. This allows Welder to unify previous ad-hoc memory optimizations into a single space, generate efficient execution plans with 89 more optimization patterns, and outperform state-of-the-art solutions significantly. Welder is also able to handle DNN models with arbitrarily large input by combining the existing accelerator memory and host memory as a whole system.
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
Jie Zhao, Information Engineering University; Siyuan Feng, Shanghai Jiao Tong University; Xiaoqiang Dan, Fei Liu, Chengke Wang, Sheng Yuan, Wenyuan Lv, and Qikai Xie, Stream Computing Inc.
Fully exploiting the computing power of an accelerator specialized for deep neural networks (DNNs) calls for synergy between the network and hardware architectures. Existing approaches, however, partition a DNN's computational graph into multiple sub-graphs while abstracting away the hardware architecture and then assign resources to each sub-graph, which not only produces redundant off-core data movements but also under-utilizes the hardware resources of a domain-specific architecture (DSA).
This paper introduces a systematic approach for effectively scheduling DNN computational graphs on DSA platforms. By fully taking into account hardware architecture when partitioning a computational graph into coarse-grained sub-graphs, our work enables the synergy between network and hardware architectures, addressing several challenges of prior work: (1) it produces larger but fewer kernels, converting a large number of off-core data movements into on-core data exchanges; (2) it exploits the imbalanced memory usage distribution across DNN network architecture, better saturating the DSA memory hierarchy; (3) it enables across-layer instruction scheduling not studied before, further exploiting the parallelism across different specialized compute units.
Results of seven DNN inference models on a DSA platform show that our work outperforms TVM and AStitch by 11.15× and 6.16×, respectively, and obtains throughput competitive to the vendor-crafted implementation. A case study on GPU also demonstrates that generating kernels for our sub-graphs can surpass CUTLASS with and without convolution fusion by 1.06× and 1.23×, respectively.
10:10 am–10:45 am
Break with Refreshments
10:45 am–12:10 pm
Train Your Bits II
Session Chair: Peter Pietzuch, Imperial College London
Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, and Shuhong Huang, Tsinghua University; Xupeng Miao, Carnegie Mellon University; Shizhi Tang and Kezhao Huang, Tsinghua University; Zhihao Jia, Carnegie Mellon University
Boosting the execution performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world applications. However, existing approaches to optimizing the tensor computation of DNNs only consider transformations representable by a fixed set of predefined tensor operators, resulting in a highly restricted optimization space. To address this issue, we propose EinNet, a derivation-based tensor program optimizer. EinNet optimizes tensor programs by leveraging transformations between general tensor algebra expressions and automatically creating new operators desired by transformations, enabling a significantly larger search space that includes those supported by prior works as special cases. Evaluation on seven DNNs shows that EinNet outperforms existing tensor program optimizers by up to 2.72× (1.52× on average) on NVIDIA A100 and up to 2.68× (1.55× on average) on NVIDIA V100. EinNet is publicly available at https://github.com/InfiniTensor/InfiniTensor.
Qinghao Hu, Nanyang Technological University, S-Lab, NTU, and Shanghai AI Laboratory; Zhisheng Ye, Shanghai AI Laboratory and Peking University; Meng Zhang, Nanyang Technological University, S-Lab, NTU, and Shanghai AI Laboratory; Qiaoling Chen, Shanghai AI Laboratory and National University of Singapore; Peng Sun, Shanghai AI Laboratory and SenseTime Research; Yonggang Wen and Tianwei Zhang, Nanyang Technological University
Hyperparameter tuning is an essential step in deep learning model development that provides better model performance at the cost of substantial resources. While existing systems can improve tuning efficiency, they still fail to handle large models with billions of parameters and to efficiently leverage cluster resources. Motivated by these deficiencies, we introduce Hydro, a surrogate-based hyperparameter tuning service that optimizes tuning workloads at both job-level and cluster-level granularities. Specifically, it consists of two key components: (1) Hydro Tuner automatically generates and optimizes surrogate models via scaling, parametrization, and fusion; (2) Hydro Coordinator improves tuning efficiency and cluster-wide resource utilization by adaptively leveraging ephemeral and heterogeneous resources. Our comprehensive experiments on two tuning algorithms across six models show that Hydro Tuner can dramatically reduce tuning makespan by up to 78.5x compared with Ray Tune, with no reduction in tuning quality. Hydro's source code is publicly available at https://github.com/S-Lab-System-Group/Hydro.
MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
Yuke Wang, Boyuan Feng, and Zheng Wang, University of California Santa Barbara; Tong Geng, University of Rochester; Kevin Barker and Ang Li, Pacific Northwest National Laboratory; Yufei Ding, University of California Santa Barbara
The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN systems optimize the computation and communication individually based on the conventional practice of scaling dense DNNs. For irregularly sparse and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule/optimize the computation and communication operations for high-performance delivery.
To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its novel fine-grained dynamic software pipeline to facilitate fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime design with analytical modeling and optimization heuristics to dynamically improve the GNN execution performance. Extensive evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems across various settings: on average 4.41×, 4.81×, and 10.83× faster than DGL, MGG-UVM, and ROC frameworks, respectively.
Weihao Cui, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research Asia; Lingji Ouyang, University of Science and Technology of China; Yichuan Wang, Shanghai Jiao Tong University; Ningxin Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, Jilong Xue, Lili Qiu, and Lidong Zhou, Microsoft Research Asia; Quan Chen, Shanghai Jiao Tong University; Haisheng Tan, University of Science and Technology of China; Minyi Guo, Shanghai Jiao Tong University
Dynamic neural networks (NNs), which can adapt sparsely activated sub-networks to inputs during inference, have shown significant advantages over static ones in terms of accuracy, computational efficiency, and adaptiveness. However, existing deep learning frameworks and compilers mainly focus on optimizing static NNs with deterministic execution, missing the optimization opportunities brought by the non-uniform distribution of activation in dynamic NNs. The key to optimizing dynamic NNs is the traceability of how data are dynamically dispatched to different paths at inference. Such dynamism often happens at the sub-tensor level (e.g., conditionally dispatching tokens of a tensor), making it hard for existing tensor-centric frameworks to trace due to misaligned expression granularity.
In this paper, we present Brainstorm, a deep learning framework for optimizing dynamic NNs, which bridges the gap by unifying how dynamism should be expressed. Brainstorm proposes (1) Cell, the key data abstraction that lets model developers express the data granularity where dynamism exists, and (2) Router, a unified interface to let model developers express how Cells should be dynamically dispatched. Brainstorm handles efficient execution of routing actions. This design allows Brainstorm to collect profiles of fine-grained dataflow at the correct granularity. The traceability further opens up a new space of dynamic optimization for dynamic NNs to specialize their execution to the runtime dynamism distribution. Extensive evaluations show Brainstorm brings up to 11.7× speedup (3.29× on average) or leads to 42% less memory consumption for popular dynamic neural networks with the proposed dynamic optimizations.
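As a rough illustration of the Cell/Router idea, the toy code below dispatches token-level "cells" to expert branches and records the routing profile; the class names and interfaces are hypothetical, not the framework's real API.

```python
# Toy cell-granularity routing and tracing, loosely in the spirit of Brainstorm.
from collections import Counter

class Router:
    def __init__(self, branches, route_fn):
        self.branches = branches          # e.g., expert sub-networks
        self.route_fn = route_fn          # maps a cell to a branch index
        self.profile = Counter()          # traced dispatch counts per branch

    def dispatch(self, cells):
        outputs = []
        for cell in cells:                # a "cell" is the unit of dynamism,
            idx = self.route_fn(cell)     # e.g., one token of a tensor
            self.profile[idx] += 1
            outputs.append(self.branches[idx](cell))
        return outputs

# Two toy "experts" and a routing rule on token value.
experts = [lambda x: x * 2, lambda x: x + 100]
router = Router(experts, route_fn=lambda tok: 0 if tok < 5 else 1)

print(router.dispatch([1, 7, 3, 9]))      # [2, 107, 6, 109]
# The profile captures the runtime dynamism distribution that optimizations
# could later specialize for (e.g., placing hot branches on faster paths).
print(router.profile)                     # Counter({0: 2, 1: 2})
```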
Fan Lai, University of Michigan; Wei Zhang, Rui Liu, William Tsai, Xiaohan Wei, Yuxi Hu, Sabin Devkota, Jianyu Huang, Jongsoo Park, Xing Liu, Zeliang Chen, Ellie Wen, Paul Rivera, Jie You, and Chun-cheng Jason Chen, Meta; Mosharaf Chowdhury, University of Michigan
Deep learning recommendation models (DLRMs) are using increasingly larger embedding tables to represent categorical sparse features such as video genres. Each embedding row of the table represents the trainable weight vector for a specific instance of that feature. While increasing the number of embedding rows typically improves model accuracy by considering more feature instances, it can lead to larger deployment costs and slower model execution.
Unlike existing efforts that primarily focus on optimizing DLRMs for the given embedding, we present a complementary system, AdaEmbed, to reduce the size of embeddings needed for the same DLRM accuracy via in-training embedding pruning. Our key insight is that the access patterns and weights of different embeddings are heterogeneous across embedding rows, and dynamically change over the training process, implying varying embedding importance with respect to model accuracy. However, identifying important embeddings and then enforcing pruning for modern DLRMs with up to billions of embeddings (terabytes) is challenging. Given the total embedding size, AdaEmbed considers embeddings with higher runtime access frequencies and larger training gradients to be more important, and it dynamically prunes less important embeddings at scale to automatically determine per-feature embeddings. Our evaluations in industrial settings show that AdaEmbed saves 35-60% embedding size needed in deployment and improves model execution speed by 11-34%, while achieving noticeable accuracy gains.
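The sketch below illustrates the core pruning idea on toy data: score each embedding row by access frequency times gradient magnitude and keep only the rows that fit the budget. The scoring formula and loop are simplifications of the system described above.

```python
# Simplified importance-based embedding-row pruning in the spirit of AdaEmbed.
import numpy as np

def prune_rows(access_freq, grad_norm, budget_rows):
    """Keep only the `budget_rows` most important embedding rows."""
    importance = access_freq * grad_norm          # per-row importance proxy
    keep = np.argsort(importance)[::-1][:budget_rows]
    mask = np.zeros_like(importance, dtype=bool)
    mask[keep] = True
    return mask                                   # rows with mask=False are pruned

rng = np.random.default_rng(0)
n_rows = 1_000
access_freq = rng.poisson(3, n_rows).astype(float)   # toy access counts
grad_norm = rng.random(n_rows)                       # toy gradient magnitudes
mask = prune_rows(access_freq, grad_norm, budget_rows=600)   # ~40% size saving
print(mask.sum(), "rows kept of", n_rows)
```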
12:10 pm–1:40 pm
Lunch (on your own)
1:40 pm–3:05 pm
Verify Your Bits
Session Chair: Dilma Da Silva, Texas A&M University
Jiawei Wang, Huawei Dresden Research Center, Huawei Central Software Institute, Technische Universität Dresden; Bohdan Trach, Ming Fu, Diogo Behrens, Jonathan Schwender, Yutao Liu, and Jitang Lei, Huawei Dresden Research Center, Huawei Central Software Institute; Viktor Vafeiadis, MPI-SWS; Hermann Härtig, Technische Universität Dresden; Haibo Chen, Huawei Central Software Institute, Shanghai Jiao Tong University
Work stealing is a widely used scheduling technique for parallel processing on multicore systems. Each core owns a queue of tasks and avoids idling by stealing tasks from other queues. Prior work mostly focuses on balancing workload among cores, disregarding whether stealing may adversely impact the owner's performance or hinder synchronization optimizations. Real-world industrial runtimes for parallel processing heavily rely on work-stealing queues for scalability, and such queues can become bottlenecks to their performance.
We present Block-based Work Stealing (BWoS), a novel and pragmatic design that splits per-core queues into multiple blocks. Thieves and owners rarely operate on the same blocks, largely eliminating interference and enabling aggressive optimizations of the owner's synchronization with thieves. Furthermore, BWoS enables a novel probabilistic stealing policy that guarantees thieves steal from longer queues with higher probability. In our evaluation, using BWoS improves performance by up to 1.25x in the Renaissance macrobenchmark when applied to the Java GC, provides an average 1.26x speedup in JSON processing when applied to the Go runtime, and improves the maximum throughput of the Hyper HTTP server by 1.12x when applied to the Rust Tokio runtime. In microbenchmarks, it provides 8-11x better performance than state-of-the-art designs. We have formally verified and optimized BWoS on weak memory models with a model-checking-based framework.
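The simplified sketch below conveys the block-splitting and probabilistic-victim ideas, using per-block locks purely to keep the illustration obviously correct; the actual design relies on verified lock-free synchronization for weak memory models.

```python
# Simplified, lock-based block-split work-stealing queue in the spirit of BWoS.
import random
import threading
from collections import deque

class BlockQueue:
    def __init__(self, num_blocks=4, block_cap=64):
        self.blocks = [deque() for _ in range(num_blocks)]
        self.locks = [threading.Lock() for _ in range(num_blocks)]
        self.block_cap = block_cap
        self.owner_block = 0                  # block the owner currently fills

    def push(self, task):                     # owner only
        b = self.owner_block
        with self.locks[b]:
            self.blocks[b].append(task)
            if len(self.blocks[b]) >= self.block_cap:
                self.owner_block = (b + 1) % len(self.blocks)

    def pop(self):                            # owner only (LIFO for locality)
        for off in range(len(self.blocks)):
            b = (self.owner_block - off) % len(self.blocks)
            with self.locks[b]:
                if self.blocks[b]:
                    return self.blocks[b].pop()
        return None

    def steal(self):                          # thieves avoid the owner's block
        for off in range(1, len(self.blocks)):
            b = (self.owner_block + off) % len(self.blocks)
            with self.locks[b]:
                if self.blocks[b]:
                    return self.blocks[b].popleft()   # FIFO end, away from owner
        return None

    def __len__(self):
        return sum(len(b) for b in self.blocks)

def pick_victim(queues):
    """Probabilistic stealing: longer queues are chosen with higher probability."""
    weights = [len(q) + 1 for q in queues]    # +1 avoids zero-weight queues
    return random.choices(queues, weights=weights, k=1)[0]
```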
Xupeng Li, Xuheng Li, Wei Qiang, Ronghui Gu, and Jason Nieh, Columbia University
System software is often large and complex, resulting in many vulnerabilities that can potentially be exploited to compromise the security of a system. Formal verification offers a potential solution to creating bug-free software, but a key impediment to its adoption remains proof cost. We present Spoq, a highly automated verification framework to construct machine-checkable proofs in Coq for system software with much less proof cost. Spoq introduces a novel program structure reconstruction technique to leverage LLVM to translate C code into Coq, supporting full C semantics, including C macros, inline assembly, and compiler directives, so that source code no longer has to be manually modified to be verified. Spoq leverages a layering proof strategy and introduces novel Coq tactics and transformation rules to automatically generate layer specifications and refinement proofs to simplify verification of concurrent system software. Spoq also supports easy integration of manually written layer specifications and refinement proofs. We use Spoq to verify a multiprocessor KVM hypervisor implementation. Verification using Spoq required 70% less proof effort than the manually written specifications and proofs to verify an older implementation. Furthermore, the proofs using Spoq hold for the unmodified implementation that is directly compiled and executed.
Yun-Sheng Chang, MIT CSAIL; Ralf Jung, ETH Zurich; Upamanyu Sharma, MIT CSAIL; Joseph Tassarotti, New York University; M. Frans Kaashoek and Nickolai Zeldovich, MIT CSAIL
Multi-version concurrency control (MVCC) is a widely used, sophisticated approach for handling concurrent transactions. vMVCC is the first MVCC-based transaction library that comes with a machine-checked proof of correctness, providing clients with a guarantee that it will correctly handle all transactions despite a complicated design and implementation that might otherwise be error-prone. vMVCC is implemented in Go, stores data in memory, and uses several optimizations, such as RDTSC-based timestamps, to achieve high performance (25–96% the throughput of Silo, a state-of-the-art in-memory database, for YCSB and TPC-C workloads). Formally specifying and verifying vMVCC required adopting advanced proof techniques, such as logical atomicity and prophecy variables, owing to the fact that MVCC transactions can linearize at timestamp generation prior to transaction execution.
Haoran Ding, Zhaoguo Wang, and Zhuohao Shen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Rong Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University, and Shanghai AI Laboratory; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University
Serverless computing has become a popular cloud computing paradigm. By default, when a serverless function fails, the serverless platform re-executes the function to tolerate the failure. However, such a retry-based approach requires functions to be idempotent, which means that functions should expose the same behavior regardless of retries. This requirement is challenging for developers, especially when functions are stateful. Failures may cause functions to repeatedly read and update shared states, potentially corrupting data consistency.
This paper presents Flux, the first toolkit that automatically verifies the idempotence of serverless applications. It proposes a new correctness definition, idempotence consistency, which stipulates that a serverless function’s retry is transparent to users. To verify idempotence consistency, Flux defines a novel property, idempotence simulation, which decomposes the proof for a concurrent serverless application into the reasoning of individual functions. Furthermore, Flux extends existing verification techniques to realize automated reasoning, enabling Flux to identify idempotence-violating operations and fix them with existing log-based methods.
We demonstrate the efficacy of Flux with 27 representative serverless applications. Flux has successfully identified previously unknown issues in 12 applications. Developers have confirmed 8 issues. Compared to state-of-the-art systems (namely Beldi and Boki) that log every operation, Flux achieves up to 6× lower latency and 10× higher peak throughput, as it logs only the identified idempotence-violating ones.
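For intuition, the toy example below shows the kind of idempotence violation at stake (a retried read-modify-write on shared state) and the log-based style of fix; the datastore, names, and atomicity handling are all simplified stand-ins.

```python
# Toy idempotence violation and log-based fix (illustrative; a real fix must
# update the log and the state atomically in the shared datastore).
store = {"balance": 100}
op_log = {}          # request_id -> recorded result, so retries replay it

def debit_unsafe(request_id, amount):
    # Not idempotent: if the platform retries after a crash that happens once
    # the write has taken effect, the account is debited twice.
    store["balance"] -= amount
    return store["balance"]

def debit_idempotent(request_id, amount):
    # Log-based fix: record the operation's effect under the request id and
    # replay it on retry instead of re-applying the update.
    if request_id in op_log:
        return op_log[request_id]
    store["balance"] -= amount
    op_log[request_id] = store["balance"]
    return op_log[request_id]

print(debit_idempotent("req-1", 30))   # 70
print(debit_idempotent("req-1", 30))   # retry still returns 70, not 40
```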
Travis Hance and Yi Zhou, Carnegie Mellon University; Andrea Lattuada, VMware Research; Reto Achermann, University of British Columbia; Alex Conway, VMware Research; Ryan Stutsman, VMware Research and University of Utah; Gerd Zellweger, VMware Research; Chris Hawblitzel, Microsoft Research; Jon Howell, VMware Research; Bryan Parno, Carnegie Mellon University
We present IronSync, an automated verification framework for concurrent code with shared memory. IronSync scales to complex systems by splitting system-wide proofs into isolated concerns such that each can be substantially automated. As a starting point, IronSync's ownership type system allows a developer to straightforwardly prove both data safety and the logical correctness of thread-local operations. IronSync then introduces the concept of a Localized Transition System, which connects the correctness of local actions to the correctness of the entire system. We demonstrate IronSync by verifying two state-of-the-art concurrent systems comprising thousands of lines: a library for black-box replication on NUMA architectures, and a highly concurrent page cache.
3:05 pm–3:35 pm
Break with Refreshments
3:35 pm–5:00 pm
Transfer Your Bits
Session Chair: Ed Nightingale, Apple
Qiang Li, Alibaba Group; Yixiao Gao and Xiaoliang Wang, Nanjing University; Haonan Qiu, Alibaba Group; Yanfang Le, AMD; Derui Liu, Alibaba Group; Qiao Xiang, Xiamen University; Fei Feng, Peng Zhang, Bo Li, Jianbo Dong, Lingbo Tang, Hongqiang Harry Liu, Shaozong Liu, Weijie Li, Rui Miao, Yaohui Wu, Zhiwu Wu, Chao Han, Lei Yan, Zheng Cao, and Zhongjie Wu, Alibaba Group; Chen Tian and Guihai Chen, Nanjing University; Dennis Cai, Jinbo Wu, Jiaji Zhu, and Jiesheng Wu, Alibaba Group; Jiwu Shu, Xiamen University
Datacenter applications have increasingly adopted RDMA for its ultra-low latency and low CPU overhead. However, RDMA-capable NICs (RNICs) of different vendors, and of different generations from the same vendor, do not cooperate well, which causes bandwidth imbalance in the production network. Our observation of heterogeneous RNICs is that although the data path functions of these RNICs follow the same RoCEv2 specification, their control path functions are vendor- and version-specific. To this end, we propose Flor, an open framework that provides a flexible control plane in software and a unified hardware data plane across heterogeneous RNICs. The hardware plane requires no changes to current specifications. The software plane can run on the NPUs of RNICs, on DPUs, or on host CPUs, and on top of it we build a strengthened reliable transport over large-scale lossy Ethernet. We implemented and evaluated Flor in both testbed and production clusters over Intel E810, Mellanox CX-4 and CX-5, and Broadcom RNICs. Experiments show that Flor achieves comparable performance to vanilla RDMA in many scenarios, including 1/4096 packet loss, 6000:1 incast, and large-scale cross-pod communication. Flor reduces the performance gap between CX-4 and CX-5 RNICs from 24.3% to 1.3% when they are deployed together, without depending on PFC.
Boris Pismenny, Technion and NVIDIA; Adam Morrison, Tel Aviv University; Dan Tsafrir, Technion & VMware Research
Multicore systems parallelize to accommodate incoming Ethernet traffic, allocating one receive (Rx) ring with ≥1Ki entries per core by default. This ring size is sufficient to absorb packet bursts of single-core workloads. But the combined size of all Rx buffers (pointed to by all Rx rings) can exceed the size of the last-level cache. We observe that, in this case, NIC and CPU memory accesses are increasingly served by main memory, which might incur nonnegligible overheads when scaling to hundreds of incoming gigabits per second.
To alleviate this problem, we propose "shRing," which shares each Rx ring among several cores when networking memory bandwidth consumption is high. ShRing thus adds software synchronization costs, but this overhead is offset by the smaller memory footprint. We show that, consequently, shRing increases the throughput of NFV workloads by up to 1.27x, and that it reduces their latency by up to 38x. The substantial latency reduction occurs when shRing shortens the per-packet processing time to a value smaller than the packet interarrival time, thereby preventing overload conditions.
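A back-of-the-envelope calculation, with illustrative rather than measured numbers, shows why combined per-core Rx buffers can outgrow the last-level cache and why sharing rings shrinks the footprint.

```python
# Illustrative Rx-buffer footprint arithmetic motivating ring sharing.
cores        = 64
entries      = 1024            # >= 1Ki Rx descriptors per ring (the default)
buffer_bytes = 2048            # one packet buffer per descriptor
llc_bytes    = 32 * 2**20      # e.g., a 32 MiB last-level cache

per_core = cores * entries * buffer_bytes          # one ring per core
shared   = (cores // 8) * entries * buffer_bytes   # 8 cores share each ring
                                                   # (ring size kept the same
                                                   #  for simplicity)

print(f"per-core rings: {per_core / 2**20:.0f} MiB ({per_core / llc_bytes:.1f}x LLC)")
print(f"shared rings:   {shared / 2**20:.0f} MiB ({shared / llc_bytes:.2f}x LLC)")
```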
Harshit Saokar, Meta; Soteris Demetriou, Meta and Imperial College London; Nick Magerko, Max Kontorovich, Josh Kirstein, and Margot Leibold, Meta; Dimitrios Skarlatos, Meta and Carnegie Mellon University; Hitesh Khandelwal and Chunqiang Tang, Meta
Datacenter applications are often structured as many interconnected microservices, and the service mesh has become a popular approach to route RPC traffic among services. This paper presents Meta's global service mesh, called ServiceRouter (SR), which has been in production since 2012. SR differs from publicly known service meshes in several important ways. First, SR is designed for hyperscale and currently uses millions of L7 routers to route tens of billions of requests per second across tens of thousands of services. Second, while SR adopts the common approach of using sidecar or remote proxies to route 1% of RPC requests in our fleet, it employs a routing library that is directly linked into service executables to route the remaining 99% directly from clients to servers, without the extra hop of going through a proxy. This approach significantly reduces the hardware costs of our hyperscale service mesh, saving hundreds of thousands of machines. Third, SR provides built-in support for sharded services, which account for 68% of RPC requests in our fleet, whereas existing general-purpose service meshes do not support sharded services. Finally, SR introduces the concept of locality rings to simultaneously minimize RPC latency and balance load across geo-distributed datacenter regions, which, to our knowledge, has not been attempted before.
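The toy function below sketches how locality-ring routing might pick a replica: prefer the nearest ring and spill outward only when closer replicas are saturated. The ring definitions, threshold, and load metric are invented for illustration and are not SR's actual policy.

```python
# Toy locality-ring replica selection, loosely in the spirit of ServiceRouter.
def pick_replica(client_region, replicas, rings, max_util=0.8):
    """replicas: list of (region, current_utilization); rings[client_region]:
    ordered groups of regions from nearest to farthest."""
    for ring in rings[client_region]:
        candidates = [(r, u) for (r, u) in replicas if r in ring and u < max_util]
        if candidates:
            # Least-loaded replica within the closest viable ring.
            return min(candidates, key=lambda c: c[1])
    # Everything is hot: fall back to the globally least-loaded replica.
    return min(replicas, key=lambda c: c[1])

rings = {"us-east": [{"us-east"}, {"us-central"}, {"eu-west", "ap-south"}]}
replicas = [("us-east", 0.92), ("us-central", 0.55), ("eu-west", 0.30)]
print(pick_replica("us-east", replicas, rings))   # ('us-central', 0.55)
```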
Xingda Wei and Rongxin Cheng, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University, and Shanghai AI Laboratory; Yuhan Yang, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Rong Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University, and Shanghai AI Laboratory; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University
SmartNICs have recently emerged as an appealing device for accelerating distributed systems. However, there has not been a comprehensive characterization of SmartNICs, and existing designs typically only leverage a single communication path for workload offloading. This paper presents the first holistic study of a representative off-path SmartNIC, specifically the Bluefield-2, from a communication-path perspective. Our experimental study systematically explores the key performance characteristics of communication among the client, on-board SoC, and host, and offers insightful findings and advice for designers. Moreover, we propose the concurrent use of multiple communication paths of a SmartNIC and present a pioneering guideline to expose new optimization opportunities for various distributed systems. To demonstrate the effectiveness of our approach, we conducted case studies on a SmartNIC-based distributed file system (LineFS) and an RDMA-based disaggregated key-value store (DrTM-KV). Our experimental results show improvements of up to 30% and 25% for LineFS and DrTM-KV, respectively.
Hugo Sadok and Nirav Atre, Carnegie Mellon University; Zhipeng Zhao, Microsoft; Daniel S. Berger, Microsoft Research & University of Washington; James C. Hoe, Carnegie Mellon University; Aurojit Panda, New York University; Justine Sherry, Carnegie Mellon University; Ren Wang, Intel
Awarded Best Paper!
Today, most communication between the NIC and software involves exchanging fixed-size packet buffers. This packetized interface was designed for an era when NICs implemented few offloads and software implemented the logic for translating between application data and packets. However, both NICs and networked software have evolved: modern NICs implement hardware offloads, e.g., TSO, LRO, and serialization offloads that can more efficiently translate between application data and packets. Furthermore, modern software increasingly batches network I/O to reduce overheads. These changes have led to a mismatch between the packetized interface, which assumes that the NIC and software exchange fixed-size buffers, and the features provided by modern NICs and used by modern software. This incongruence between interface and data adds software complexity and I/O overheads, which in turn limits communication performance.
This paper proposes Ensō, a new streaming NIC-to-software interface designed to better support how NICs and software interact today. At its core, Ensō eschews fixed-size buffers, and instead structures communication as a stream that can be used to send arbitrary data sizes. We show that this change reduces software overheads, reduces PCIe bandwidth requirements, and leads to fewer cache misses. These improvements allow an Ensō-based NIC to saturate a 100 Gbps link with minimum-sized packets (forwarding at 148.8 Mpps) using a single core, improve throughput for high-performance network applications by 1.5-6x, and reduce latency by up to 43%.
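The contrast below is a conceptual sketch only: a packetized receive path hands software fixed-size buffers to reassemble, while a stream-style path appends arbitrary-sized chunks to one contiguous buffer that software consumes directly. It illustrates the interface shape, not the actual NIC hardware or driver design.

```python
# Conceptual contrast: packetized buffers vs. a stream-style receive path.

# Packetized: the application receives a list of fixed-size buffers (one
# descriptor per packet) and must merge them back into application data.
def rx_packetized(packets):
    data = b""
    for pkt in packets:            # per-packet descriptor handling and copies
        data += pkt
    return data

# Stream-style: incoming bytes are appended to one contiguous buffer and the
# application consumes whatever range is available, regardless of packet sizes.
class RxStream:
    def __init__(self):
        self.buf = bytearray()
        self.head = 0              # position up to which software has consumed

    def nic_append(self, chunk):   # producer side: arbitrary-sized chunks
        self.buf += chunk

    def consume(self):             # consumer side: take all available bytes
        data = bytes(self.buf[self.head:])
        self.head = len(self.buf)
        return data

stream = RxStream()
stream.nic_append(b"hello ")
stream.nic_append(b"world")
print(stream.consume())            # b'hello world'
```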
5:00 pm–5:10 pm
Closing Remarks
Roxana Geambasu, Columbia University, and Ed Nightingale, Apple