USENIX ATC '22 Technical Sessions

All the times listed below are in Pacific Daylight Time (PDT).

Papers are available for download below to registered attendees now. The papers and the full proceedings will be available to everyone beginning Monday, July 11, 2022. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

Attendee Files
USENIX ATC '22 Attendee List (PDF)
USENIX ATC '22 Monday Paper Archive (65 MB ZIP, includes Proceedings front matter and attendee list)
USENIX ATC '22 Tuesday Paper Archive (68 MB ZIP)
USENIX ATC '22 Wednesday Paper Archive (17 MB ZIP)
View mode:

Surprise-Inspired Networking

David Tennenhouse, Independent Researcher

Available Media

The Information Theory concept of surprisal suggests that, in the future, the most valuable information will be the "new" information entering the cloud from the edge rather than the "prior" information that is stored and processed deeper within the cloud. This talk will discuss the implications of this relatively simple concept for future systems and networks. In particular, it will focus on the role that "near the edge" computing and storage can play in synthesizing the "new" and "old" information.

David Tennenhouse, Independent Researcher

David is passionate about research and innovation and has a track record of embracing high-risk initiatives, such as software-defined networking, software radio, IoT, and data-intensive computing. He has worked in academia, as a faculty member at MIT; in government, at DARPA; in industry at Intel, Amazon/A9.com, Microsoft, and VMware; and as a partner in a venture capital firm. Dr. Tennenhouse has championed research related to a wide range of technologies, including networking, distributed computing, blockchain, computer architecture, storage, machine learning, robotics, and nano/biotechnology. David holds a BASc and MASc in Electrical Engineering from the University of Toronto and obtained his Ph.D. at the University of Cambridge.

Opening Remarks and Awards

Jiri Schindler, Tranquil Data, and Noa Zilberman, University of Oxford

Track 1

Storage 1

Session Chair: Youjip Won, Korea Advanced Institute of Science and Technology (KAIST)

Shai Bergman, Technion; Niklas Cassel and Matias Bjørling, Western Digital; Mark Silberstein, Technion

Available Media

We introduce ZNSwap, a novel swap subsystem optimized for the recent Zoned Namespace (ZNS) SSDs. ZNSwap leverages ZNS's explicit control over data management on the drive and introduces a space-efficient host-side Garbage Collector (GC) for swap storage co-designed with the OS swap logic. ZNSwap enables cross-layer optimizations, such as direct access to the in-kernel swap usage statistics by the GC to enable fine-grain swap storage management, and correct accounting of the GC bandwidth usage in the OS resource isolation mechanisms to improve performance isolation in multi-tenant environments. We evaluate ZNSwap using standard Linux swap benchmarks and two production key-value stores. ZNSwap shows significant performance benefits over the Linux swap on traditional SSDs, such as stable throughput for different memory access patterns, and 10 times lower 99th percentile latency and 5 times higher throughput for memcached key-value store under realistic usage scenarios.

Building a High-performance Fine-grained Deduplication Framework for Backup Storage with High Deduplication Ratio

Xiangyu Zou and Wen Xia, Harbin Institute of Technology, Shenzhen; Philip Shilane, Dell Technologies; Haijun Zhang and Xuan Wang, Harbin Institute of Technology, Shenzhen

Available Media

Fine-grained deduplication, which first removes identical chunks and then eliminates redundancies between similar but non-identical chunks (i.e., delta compression), could exploit workloads' compressibility to achieve a very high deduplication ratio but suffers from poor backup/restore performance. This makes it not as popular as chunk-level deduplication thus far. This is because allowing workloads to share more references among similar chunks further reduces spatial/temporal locality, causes more I/O overhead, and leads to worse backup/restore performance.

In this paper, we address issues for different forms of poor locality with several techniques, and propose MeGA, which achieves backup and restore speed close to chunk-level deduplication while preserving fine-grained deduplication's significant deduplication ratio advantage. Specifically, MeGA applies (1) a backup-workflow-oriented delta selector to address poor locality when reading base chunks, and (2) a delta-friendly data layout and "Always-Forward-Reference" traversing in the restore workflow to deal with the poor spatial/temporal locality of deduplicated data.

Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. In particular, compared with the traditional greedy approach, MeGA achieves a 4.47–34.45 times higher backup performance and a 30–105 times higher restore performance while maintaining a very high deduplication ratio.

Secure and Lightweight Deduplicated Storage via Shielded Deduplication-Before-Encryption

Zuoru Yang, The Chinese University of Hong Kong; Jingwei Li, University of Electronic Science and Technology of China; Patrick P. C. Lee, The Chinese University of Hong Kong

Available Media

Outsourced storage should fulfill confidentiality and storage efficiency for large-scale data management. Conventional approaches often combine encryption and deduplication based on deduplication-after-encryption (DaE), which first performs encryption followed by deduplication on encrypted data. We argue that DaE has fundamental limitations that lead to various drawbacks in performance, storage savings, and security in secure deduplication systems. In this paper, we study an unexplored paradigm called deduplication-before-encryption (DbE), which first performs deduplication and encrypts only non-duplicate data. DbE has the benefits of mitigating the performance and storage penalties caused by the management of duplicate data, but its deduplication process is no longer protected by encryption. To this end, we design DEBE, a shielded DbE-based deduplicated storage system that protects deduplication via Intel SGX. DEBE builds on frequency-based deduplication that first removes duplicates of frequent data in a space-constrained SGX enclave and then removes all remaining duplicates outside the enclave. Experiments show that DEBE outperforms state-of-the-art DaE approaches.

Track 2

RunD: A Lightweight Secure Container Runtime for High-density Deployment and High-concurrency Startup in Serverless Computing

Zijun Li, Jiagan Cheng, and Quan Chen, Shanghai Jiao Tong University; Eryu Guan, Zizheng Bian, Yi Tao, Bin Zha, Qiang Wang, and Weidong Han, Alibaba Group; Minyi Guo, Shanghai Jiao Tong University

Available Media

The secure container that hosts a single container in a micro virtual machine (VM) is now used in serverless computing, as the containers are isolated through the microVMs. There are high demands on the high-density container deployment and high-concurrency container startup to improve both the resource utilization and user experience, as user functions are fine-grained in serverless platforms. Our investigation shows that the entire software stacks, containing the cgroups in the host operating system, the guest operating system, and the container rootfs for the function workload, together result in low deployment density and slow startup performance at high-concurrency. We therefore propose and implement a lightweight secure container runtime, named RunD, to resolve the above problems through a holistic guest-to-host solution. With RunD, over 200 secure containers can be started in a second, and over 2500 secure containers can be deployed on a node with 384GB of memory. RunD is adopted as Alibaba serverless container runtime to support high-density deployment and high-concurrency startup.

Help Rather Than Recycle: Alleviating Cold Startup in Serverless Computing Through Inter-Function Container Sharing

Zijun Li, Linsong Guo, Quan Chen, Jiagan Cheng, and Chuhao Xu, Shanghai Jiao Tong University; Deze Zeng, China University of Geosciences; Zhuo Song, Tao Ma, and Yong Yang, Alibaba Cloud; Chao Li and Minyi Guo, Shanghai Jiao Tong University

Available Media

In serverless computing, each function invocation is executed in a container (or a Virtual Machine), and container cold startup results in long response latency. We observe that some functions suffer from cold container startup, while the warm containers of other functions are idle. Based on the observation, other than booting a new container for a function from scratch, we propose to alleviate the cold startup by re-purposing a warm but idle container from another function. We implement a container management scheme, named Pagurus, to achieve the purpose. Pagurus comprises an intra-function manager for replacing an idle warm container to be a container that other functions can use without introducing additional security issues, an inter-function scheduler for scheduling containers between functions, and a sharing-aware function balancer at the cluster-level for balancing the workload across different nodes. Experiments using Azure serverless traces show that Pagurus alleviates 84.6\% of the cold startup, and the cold startup latency is reduced from hundreds of milliseconds to 16 milliseconds if alleviated.

RRC: Responsive Replicated Containers

Diyu Zhou, UCLA and EPFL; Yuval Tamir, UCLA

Available Media

Replication is the basic mechanism for providing application-transparent reliability through fault tolerance. The design and implementation of replication mechanisms is particularly challenging for general multithreaded services, where high latency overhead is not acceptable. Most of the existing replication mechanisms fail to meet this challenge.

RRC is a fully-operational fault tolerance mechanism for multiprocessor workloads, based on container replication. It minimizes the latency overhead during normal operation by addressing two key sources of this overhead: (1) it decouples the latency overhead from checkpointing frequency using a hybrid of checkpointing and replay, and (2) it minimizes the pause time for checkpointing by forking a clone of the container to be checkpointed, thus allowing execution to proceed in parallel with checkpointing. The fact that RRC is based on checkpointing makes it inherently less vulnerable to data races than active replication. In addition, RRC includes mechanisms that further reduce the vulnerability to data races, resulting in high recovery rates, as long as the rate of manifested data races is low. The evaluation includes measurement of the recovery rate and recovery latency based on thousands of fault injections. On average, RRC delays responses to clients by less than 400mu's and recovers in less than 1s. The average pause latency is less than 3.3ms. For a set of eight real-world benchmarks, if data races are eliminated, the performance overhead of RRC is under 48%.

Track 1

Distributed Systems 1

Session Chair: Jiri Schindler, Tranquil Data

uKharon: A Membership Service for Microsecond Applications

Rachid Guerraoui and Antoine Murat, EPFL; Javier Picorel, Huawei Technologies; Athanasios Xygkis, EPFL; Huabing Yan and Pengfei Zuo, Huawei Technologies

Available Media

Modern data center fabrics open the possibility of microsecond distributed applications, such as data stores and message queues. A challenging aspect of their development is to ensure that, besides being fast in the common case, these applications react fast to changes in their membership, e.g., due to reconfiguration and failures. This is especially important as they form the backbone of numerous cloud-powered services, such as analytics and trading systems, trying to meet ever-stringent tail latency requirements. As the microservices-oriented architecture is the de facto standard for building cloud services, a single user request translates to a wide fan-out of microservices interactions sitting on the critical path. The outcome is implacable: the traditionally uncommon events of reconfiguration and failures are exacerbated by the fan-out of communication, making user requests commonly experience such events and quickly impacting the tail latency of the service.

We present uKharon, a microsecond-scale membership service that detects changes in the membership of applications and lets them failover in as little as 50us. uKharon consists of (1) a multi-level failure detector, (2) a consensus engine that relies on one-sided RDMA CAS, and (3) minimal-overhead membership leases, all exploiting RDMA to operate at the microsecond scale. We showcase the power of uKharon by building uKharon-KV, a replicated Key-Value cache based on HERD. uKharon-KV processes PUT requests as fast as the state-of-the-art and improves upon it by (1) removing the need for replicating GET requests and (2) bringing the end-to-end failover down to 53us, a 10x improvement.

KRCORE: A Microsecond-scale RDMA Control Plane for Elastic Computing

Xingda Wei, Shanghai Jiao Tong University, Shanghai AI Laboratory; Fangming Lu, Shanghai Jiao Tong University; Rong Chen, Shanghai Jiao Tong University, Shanghai AI Laboratory; Haibo Chen, Shanghai Jiao Tong University

Available Media

We present KRCORE, an RDMA library with a microsecond-scale control plane on commodity RDMA hardware for elastic computing. KRCORE can establish a full-fledged RDMA connection within 10μs (hundreds or thousands of times faster than verbs), while only maintaining a (small) fixed-sized connection metadata at each node, regardless of the cluster scale. The key ideas include virtualizing pre-initialized kernel-space RDMA connections instead of creating one from scratch, and retrofitting advanced RDMA dynamic connected transport with static transport for both low connection overhead and high networking speed. Under load spikes, KRCORE can shorten the worker bootstrap time of an existing disaggregated key-value store (namely RACE Hashing) by 83%. In serverless computing (namely Fn), KRCORE can also reduce the latency for transferring data through RDMA by 99%.

Zero-Change Object Transmission for Distributed Big Data Analytics

Mingyu Wu, Shuaiwei Wang, Haibo Chen, and Binyu Zang, Shanghai Jiao Tong University

Available Media

Distributed big-data analytics heavily rely on high-level languages like Java and Scala for their reliability and versatility. However, those high-level languages also create obstacles for data exchange. To transfer data across managed runtimes like Java Virtual Machines (JVMs), objects should be transformed into byte arrays by the sender (serialization) and transformed back into objects by the receiver (deserialization). The object serialization and deserialization (OSD) phase introduces considerable performance overhead. Prior efforts mainly focus on optimizing some phases in OSD, so object transformation is still inevitable. Furthermore, they require extra programming efforts to integrate with existing applications, and their transformation also leads to duplicated object transmission. This work proposes Zero-Change Object Transmission (ZCOT), where objects are directly copied among JVMs without any transformations. ZCOT can be used in existing applications with minimal efforts, and its object-based transmission can be used for deduplication. The evaluation on state-of-the-art data analytics frameworks indicates that ZCOT can greatly boost the performance of data exchange and thus improve the application performance by up to 23.6%.

Sift: Using Refinement-guided Automation to Verify Complex Distributed Systems

Haojun Ma, Hammad Ahmad, Aman Goel, Eli Goldweber, Jean-Baptiste Jeannin, Manos Kapritsos, and Baris Kasikci, University of Michigan

Available Media

Distributed systems are hard to design and implement correctly. Recent work has tried to use formal verification techniques to provide rigorous correctness guarantees. These works present a hard choice, though. One must either opt for the power of refinement-based approaches like IronFleet and Verdi, at the cost of large amounts of manual effort; or choose the more automated approach of I4, IC3PO, SWISS and DistAI which give up the ability to prove refinement and the power and scalability that come with it.

We propose an alternative approach, Sift, that combines the power of refinement with the ability to automate proofs. Sift is a two-tier methodology that uses a new technique, refinement-guided automation, to leverage automation in a refinement proof and a divide-and-conquer technique to split a system into more refinement layers when necessary. This combination advances the frontier of what systems can be proven correct using a high degree of automation. Contrary to what was possible before, our evaluation shows that our novel approach allows us to prove the correctness of a number of systems with little manual effort, and to extend our proofs to include not just the protocols, but also an executable distributed implementation of these systems.

Track 2

Machine Learning 1

Session Chair: Saurabh Bagchi, Purdue University

Faith: An Efficient Framework for Transformer Verification on GPUs

Boyuan Feng, Tianqi Tang, Yuke Wang, Zhaodong Chen, Zheng Wang, Shu Yang, Yuan Xie, Yufei Ding, University of California, Santa Barbara

Available Media

Transformer verification draws increasing attention in machine learning research and industry. It formally verifies the robustness of transformers against adversarial attacks such as exchanging words in a sentence with synonyms. However, the performance of transformer verification is still not satisfactory due to bound-centric computation which is significantly different from standard neural networks. In this paper, we propose \textbf{Faith}, an efficient framework for transformer verification on GPUs. We first propose a semantic-aware computation graph transformation to identify semantic information such as bound computation in transformer verification. We exploit such semantic information to enable efficient kernel fusion at the computation graph level. Second, we propose a verification-specialized kernel crafter to efficiently map transformer verification to modern GPUs. This crafter exploits a set of GPU hardware supports to accelerate verification-specialized operations which are usually memory-intensive. Third, we propose an expert-guided autotuning to incorporate expert knowledge on GPU backends to facilitate large search space exploration. Extensive evaluations show that Faith achieves 2.1 times to 3.4 times (2.6 times on average) speedup over state-of-the-art frameworks.

DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs

Weihao Cui, Han Zhao, Quan Chen, Hao Wei, and Zirui Li, Shanghai Jiao Tong University; Deze Zeng, China University of Geosciences; Chao Li and Minyi Guo, Shanghai Jiao Tong University

Available Media

The DNN inferences are often batched for better utilizing the hardware in existing DNN serving systems. However, DNN serving exhibits diversity in many aspects, such as input, operator, and load. The unawareness of these diversities results in inefficient processing. Our investigation shows that the inefficiency roots in the feature of existing batching mechanism: one entry and one exit. Therefore, we propose DVABatch, a runtime batching system that enables the multi-entry multi-exit batching scheme for existing DNN serving system. We first abstract three meta operations for adjusting the ongoing batch of queries to achieve the multi-entry multi-exit scheme. The meta operations could be used to form different scheduling logics for different diversities. To deliver the meta operations to an ongoing batch, we slice the DNN models into multiple stages. Each stage corresponds to one executor, which is managed by a state transition diagram. Compared with state-of-the-art solutions, our experimental results show that DVABatch reduces 46.4% average latency and achieves up to 2.12× throughput improvement.

Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing

Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh, KAIST

Available Media

As machine learning (ML) techniques are applied to a widening range of applications, high throughput ML inference serving has become critical for online services. Such ML inference servers with multiple GPUs pose new challenges in the scheduler design. First, they must provide a bounded latency for each request to support a consistent service-level objective (SLO). Second, they must be able to serve multiple heterogeneous ML models in a system, as cloud-based consolidation improves system utilization. To address the two requirements of ML inference servers, this paper proposes a new inference scheduling framework for multi-model ML inference servers. The paper shows that with SLO constraints, GPUs with growing parallelism are not fully utilized for ML inference tasks. To maximize the resource efficiency of GPUs, a key mechanism proposed in this paper is to exploit hardware support for spatial partitioning of GPU resources. With spatio-temporal sharing, a new abstraction layer of GPU resources is created with configurable GPU resources. The scheduler assigns requests to virtual GPUs, called gpulets, with the most effective amount of resources. The scheduler explores the three-dimensional search space with different batch sizes, temporal sharing, and spatial sharing efficiently. To minimize the cost for cloud-based inference servers, the framework auto-scales the required number of GPUs for a given workload. To consider the potential interference overheads when two ML tasks are running concurrently by spatially sharing a GPU, the scheduling decision is made with an interference prediction model. Our prototype implementation proves that the proposed spatio-temporal scheduling enhances throughput by 61.7% on average compared to the prior temporal scheduler, while satisfying SLOs.

PilotFish: Harvesting Free Cycles of Cloud Gaming with Deep Learning Training

Wei Zhang and Binghao Chen, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research; Quan Chen, Shanghai Jiao Tong University; Peng Cheng, Fan Yang, Ran Shu, and Yuqing Yang, Microsoft Research; Minyi Guo, Shanghai Jiao Tong University

Available Media

Cloud gaming services have become important workloads in cloud datacenter. However, our investigation shows that a cloud gaming service cannot saturate the modern cloud GPUs. One way to improve the GPU utilization is to co-locate multiple workloads within one GPU, which is challenging for cloud gaming due to its highly fluctuated and unpredictable GPU usage pattern. In this paper, we present PilotFish, a high-performance system that harvests the free GPU cycles of cloud gaming with deep learning (DL) training, while incurring almost zero interference to cloud gaming. We co-locate DL training jobs with cloud gaming because they have stable and predictable workloads and have no strict latency requirement. In more detail, Pilotfish captures the idle periods of the game's GPU usage with its low-overhead instrumentation to graphic libraries in sub-millisecond granularity. To avoid the potential interference to cloud gaming, PilotFish schedules training computation kernels only when they can finish before the idle GPU periods, and preempts straggler kernels running longer than expected. Our evaluation on popular cloud games and DL models shows PilotFish can harvest up to 85.1% of the idle GPU time from cloud gaming with no interference.

Track 1

Operating Systems 1

Session Chair: Malte Schwarzkopf, Brown University

Privbox: Faster System Calls Through Sandboxed Privileged Execution

Dmitry Kuznetsov and Adam Morrison, Tel Aviv University

Available Media

System calls are the main method for applications to request services from the operating system, but their invocation incurs considerable overhead, which has been aggravated by mitigation mechanisms for transient execution attacks. Proposed approaches for reducing system call overhead all break the semantic equivalence between system calls and regular function calls (e.g., by making system calls asynchronous), and so their adoption requires rearchitecting applications.

This paper proposes \emph{Privbox}, a new approach for lightweight system calls that maintains the familiar synchronous, function-like system call model. Privbox allows an application to execute system call-intensive code in a \emph{semi-privileged, sandboxed} execution mode, called a 'privbox'. Semi-privileged execution is architecturally similar to the kernel's privileged execution, which enables faster invocation of system calls, but the code is sandboxed to ensure that it cannot use its elevated privileges to compromise the system. We further propose \emph{semi-privileged access prevention} (SPAP), a simple hardware architectural feature that alleviates much of Privbox's instrumentation overhead.

We implement Privbox based on Linux and LLVM. Our evaluation on x86 (Intel Skylake) hardware shows that Privbox (1) speeds up system call invocation by 2.2 times; (2) can increase throughput of I/O-threaded applications by up to 1.7 times; and (3) can increase the throughput of real-world workloads such as Redis by up to 7.6% and 11%, without and with SPAP, respectively.

BBQ: A Block-based Bounded Queue for Exchanging Data and Profiling

Jiawei Wang, Huawei Dresden Research Center, Huawei OS Kernel Lab, Technische Universität Dresden; Diogo Behrens, Ming Fu, Lilith Oberhauser, Jonas Oberhauser, and Jitang Lei, Huawei Dresden Research Center, Huawei OS Kernel Lab; Geng Chen, Huawei OS Kernel Lab; Hermann Härtig, Technische Universität Dresden; Haibo Chen, Huawei OS Kernel Lab, Shanghai Jiao Tong University

Available Media

Concurrent bounded queues have been widely used for exchanging data and profiling in operating systems, databases, and multithreaded applications. The performance of state-of-the-art queues is limited by the interference between multiple enqueues (enq-enq), multiple dequeues (deq-deq), or enqueues and dequeues (enq-deq), negatively affecting their latency and scalability. Although some existing designs employ optimizations to reduce deq-deq and enq-enq interference, they often neglect the enq-deq case. In fact, such partial optimizations may inadvertently increase interference elsewhere and result in performance degradation.

We present Block-based Bounded Queue (BBQ), a novel ringbuffer design that splits the entire buffer into multiple blocks. This eliminates enq-deq interference on concurrency control variables when producers and consumers operate on different blocks. Furthermore, the block-based design is amenable to existing optimizations, e.g., using the more scalable fetch-and-add instruction. Our evaluation shows that BBQ outperforms several industrial ringbuffers. For example, in single-producer/single-consumer micro-benchmarks, BBQ yields 11.3x to 42.4x higher throughput than the ringbuffers from Linux kernel, DPDK, Boost, and Folly libraries. In real-world scenarios, BBQ achieves up to 1.5x, 50.5x, and 11.1x performance improvements in benchmarks of DPDK, Linux io_uring, and Disruptor, respectively. We verified and optimized BBQ on weak memory models with a model-checking-based framework.

Track 2

Disaggregated Systems

Session Chair: Noa Zilberman, University of Oxford

Sibylla: To Retry or Not To Retry on Deep Learning Job Failure

Taeyoon Kim, Suyeon Jeong, Jongseop Lee, Soobee Lee, and Myeongjae Jeon, UNIST

Available Media

GPUs are highly contended resources in shared clusters for deep learning (DL) training. However, our analysis with a real-world trace reveals that a non-negligible number of jobs running on the cluster undergo failures and are blindly retried by the job scheduler. Unfortunately, these job failures often repeat and waste GPU resources, limiting effective GPU utilization across the cluster. In this paper, we introduce Sibylla which informs whether an observed failure of DL training will repeat or not upon retry on the failure. Sibylla employs a machine learning model based on RNNs that trains on stdout and stderr logs of failed jobs and can continuously update the model on new log messages without hand-constructing labels for the new training samples. With Sibylla, the job scheduler is learning-enhanced, performing a retry for a failed job only when it is highly likely to succeed with the retry. We evaluate the effectiveness of Sibylla under a variety of scenarios using trace-driven simulations. Sibylla improves cluster utilization and reduces job completion time (JCT) by up to 15%.

Speculative Recovery: Cheap, Highly Available Fault Tolerance with Disaggregated Storage

Nanqinqin Li, Anja Kalaba, Michael J. Freedman, Wyatt Lloyd, and Amit Levy, Princeton University

Available Media

The ubiquity of disaggregated storage in cloud computing has led to a nascent technique for fault tolerance: instead of utilizing application-level replication, newly-launched backup instances recover application state from disaggregated storage (REDS) after a primary's failure. Attractively, REDS provides fault tolerance at a much lower cost than traditional replication schemes, wherein at least two instances are running. Failover in REDS is slow, however, because it sequentially first detects primary failure and only then starts recovery on a backup.

We propose speculative recovery to accelerate failover and thus increase the availability of applications using REDS. Instead of proceeding with failover sequentially, speculative recovery safely and efficiently parallelizes detecting primary failure and running recovery on a backup, by employing our new super and collapse primitives for disaggregated storage. Our implementation and evaluation of speculative recovery demonstrate that it considerably reduces failover time.

Direct Access, High-Performance Memory Disaggregation with DirectCXL

Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung, KAIST

Available Media

New cache coherent interconnects such as CXL have recently attracted great attention thanks to their excellent hardware heterogeneity management and resource disaggregation capabilities. Even though there is yet no real product or platform integrating CXL into memory disaggregation, it is expected to make memory resources practically and efficiently disaggregated much better than ever before.

In this paper, we propose directly accessible memory disaggregation, DirectCXL that straight connects a host processor complex and remote memory resources over CXL’s memory protocol (CXL.mem). To this end, we explore a practical design for CXL-based memory disaggregation and make it real. As there is no operating system that supports CXL, we also offer CXL software runtime that allows users to utilize the underlying disaggregated memory resources via sheer load/store instructions. Since DirectCXL does not require any data copies between the host memory and remote memory, it can expose the true performance of remote-side disaggregated memory resources to the users.

Track 1

Networking 1

Session Chair: Rik Farrow, Sirius Computing

Not that Simple: Email Delivery in the 21st Century

Florian Holzbauer, SBA Research; Johanna Ullrich, University of Vienna; Martina Lindorfer, TU Wien; Tobias Fiebig, Max-Planck-Institut für Informatik

Available Media

Over the past two decades, the number of RFCs related to email and its security has exploded from below 100 to nearly 500. This embedded the Simple Mail Transfer Protocol (SMTP) into a tree of interdependent and delivery-relevant standards. In this paper, we investigate how far real-world deployments keep up with this increasing complexity of delivery- and security options. To gain an in-depth picture of email delivery apart from the giants in the ecosystem (Gmail, Outlook, etc.), we engage people to send emails to eleven differently configured target domains. Our measurements allow us to evaluate core aspects of email delivery, including security features, DNS configuration, and IP version support on the sending side across different types of providers.

We find that novel technologies are often insufficiently supported, even by large providers. For example, while 65.4\% of email providers can resolve hosts via IPv6, only 44.3\% can also deliver emails via IPv6. Concerning security features, we observe that less than half (41.5\%) of all providers rely on DNSSEC validating resolvers, and encryption is mostly opportunistic, with 89.7\% of providers accepting invalid certificates. TLSA, as a DNS-based certificate verification method, is only used by 31.7\% of the providers in our study. Finally, we turned our eye to the impact modern standards have on unsolicited bulk email (SPAM). We found that greylisting is effective, reducing the SPAM volume by roughly half while not impacting regular delivery. However, and interestingly, SPAM delivery currently seems to focus on plaintext IPv4 connections, making IPv6-only, TLS-enforcing inbound email servers a more effective anti-SPAM measure – even though it also means rejecting a major portion of legitimate emails.

Guanglei Song, Jiahai Yang, Lin He, Zhiliang Wang, Guo Li, Chenxin Duan, and Yaozhong Liu, Tsinghua University; Zhongxiang Sun, Beijing Jiaotong University

Available Media

Fast Internet-wide scanning is essential for network situational awareness and asset evaluation. However, the vast IPv6 address space makes brute-force scanning infeasible. Although state-of-the-art techniques have made effective attempts, these methods do not work in seedless regions, while the detection efficiency is low in regions with seeds. Moreover, the constructed hitlists with low coverage cannot truly represent the active IPv6 address landscape of the Internet.

Co-opting Linux Processes for High-Performance Network Simulation

Rob Jansen, U.S. Naval Research Laboratory; Jim Newsome, Tor Project; Ryan Wails, Georgetown University, U.S. Naval Research Laboratory

Awarded Best Paper!

Available Media

Network experimentation tools are vitally important to the process of developing, evaluating, and testing distributed systems. The state-of-the-art simulation tools are either prohibitively inefficient at large scales or are limited by nontrivial architectural challenges, inhibiting their widespread adoption. In this paper, we present the design and implementation of Phantom, a novel tool for conducting distributed system experiments. In Phantom, a discrete-event network simulator directly executes unmodified applications as Linux processes and innovatively synthesizes efficient process control, system call interposition, and data transfer methods to co-opt the processes into the simulation environment. Our evaluation demonstrates that Phantom is up to 2.2× faster than Shadow, up to 3.4× faster than NS-3, and up to 43× faster than gRaIL in large P2P benchmarks while offering performance comparable to Shadow in large Tor network simulations.

Track 2

Finding Bugs

Session Chair: Geoff Kuenning, Harvey Mudd College

KSG: Augmenting Kernel Fuzzing with System Call Specification Generation

Hao Sun, Yuheng Shen, Jianzhong Liu, Yiru Xu, and Yu Jiang, Tsinghua University

Available Media

Kernel fuzzing is a dynamic testing technique that has successfully found numerous kernel vulnerabilities. However, existing kernel fuzzers, such as Syzkaller, depend on system call specifications to generate test cases. Writing such specifications requires an immense amount of domain knowledge while being extremely laborious. Meanwhile, automated generation of the specification is still an open problem due to the complexity of the kernel, including entry function extraction and input type identification. As a result, the current amount of system call information is insufficient to test the entire kernel code base thoroughly. Syzkaller covers an average of 38% of Linux kernel code with current Syzlang specifications for a prolonged time of fuzzing.

In this paper, we propose KSG to generate system call specifications for kernel fuzzers automatically. First, it utilizes probe-based tracing to extract entry functions accurately. Then, it uses path-sensitive analysis to collect precise input types and range constraints in each execution path of entry functions. Based on the aforementioned information, KSG generates specifications in the domain language Syzlang, which is used by most kernel fuzzers. We evaluated KSG on several versions of the Linux kernel. It automatically generated 2433 unique specifications. Leveraging the newly generated specifications, Syzkaller and Moonshine achieved coverage improvements of 22% and 23% respectively. Furthermore, our approach assisted fuzzers to discover 26 previously unknown bugs, where 13 and 6 bugs were fixed and assigned with CVEs, respectively.

DLOS: Effective Static Detection of Deadlocks in OS Kernels

Jia-Ju Bai, Tuo Li, and Shi-Min Hu, Tsinghua University

Available Media

Deadlocks in OS kernels can cause critical problems like performance degradation and system hangs. However, detecting deadlocks in OS kernels is quite challenging, due to high complexity of concurrent execution and large code bases of OS kernels. In this paper, we design a practical static analysis approach named DLOS, to effectively detect deadlocks in OS kernels. DLOS consists of three key techniques: (1) a summary-based lock-usage analysis to efficiently extract the code paths containing distinct locking constraints from kernel code; (2) a reachability-based comparison method to efficiently detect locking cycles from locking constraints; (3) a two-dimensional filtering strategy to effectively drop false positives by validating code-path feasibility and concurrency. We have evaluated DLOS on Linux 5.10, and find 54 real deadlocks, with a false positive rate of 17%. We have reported these deadlocks to Linux kernel developers, and 31 of them have been confirmed.

Modulo: Finding Convergence Failure Bugs in Distributed Systems with Divergence Resync Models

Beom Heyn Kim, Samsung Research, University of Toronto; Taesoo Kim, Samsung Research, Georgia Institute of Technology; David Lie, University of Toronto

Available Media

While there exist many consistency models for distributed systems, most of those models seek to provide the basic guarantee of convergence: given enough time and no further inputs, all replicas in the system should eventually converge to the same state. However, because of Convergence Failure Bugs (CFBs), many distributed systems do not provide even this basic guarantee. The violation of the convergence property can be crucial to safety-critical applications collectively working together with a shared distributed system. Indeed, many CFBs are reported as major issues by developers. Our key insight is that CFBs are caused by divergence, or differences between the state of replicas, and that a focused exploration of divergence states can reveal bugs in the convergence logic of real distributed systems while avoiding state explosion. Based on this insight, we have designed and implemented Modulo, the first Model-Based Testing tool using Divergence Resync Models (DRMs) to systematically explore divergence and convergence in real distributed systems. Modulo uses DRMs to explore an abstract state machine of the system and derive schedules, the intermediate representation of test cases, which are then translated into test inputs and injected into systems under test (SUTs). We ran Modulo to check ZooKeeper, MongoDB, and Redis and found 11 bugs (including 6 previously unknown ones)

OSDI '22 Poster Session and Reception

Would you like to share a provocative opinion, interesting preliminary work, or a cool idea that will spark discussion at this year's OSDI? The poster session is the perfect venue to introduce such new or ongoing work. Poster presenters will have the opportunity to discuss their work, get exposure, and receive feedback from other attendees during the in-person evening reception. View the list of accepted posters.

Trustworthy Open Source: The Consequences of Success

Eric Brewer, VP Infrastructure, Google Fellow and Professor Emeritus, UC Berkeley

Available Media

Wide-spread use of open-source software is a remarkable achievement, but also creates a tremendous responsibility. How can we collectively step up to ensure open-source software is worthy of the trust the world now expects and deserves? We cover a range of structural and security challenges and how we might address them, including our hopes for a more sustainable future.

Eric Brewer, VP Infrastructure, Google Fellow and Professor Emeritus, UC Berkeley

Eric joined Google in 2011 and leads technical areas including Kubernetes, Serverless, and Anthos. A recent focus is security for open-source software, including supply-chain risks and helping start the OpenSSF.

At Berkeley, he led work on cloud computing, network infrastructure, IoT, and the CAP Theorem. He has also led work on technology for developing regions, with projects in India, the Philippines, and Kenya among others, including communications, power, and health care.

In 1996, he co-founded Inktomi Corporation and helped lead it onto the NASDAQ 100. In 2000, working with President Clinton, Professor Brewer helped to create USA.gov, the official portal of the Federal government.

Major awards include membership in the NAE, AAAS, and AAA(&)S, the ACM Prize in Computing, and the ACM SIGOPS Mark Weiser Award.

Track 1

Security

Session Chair: Pedro Fonseca, Purdue University

SoftTRR: Protect Page Tables against Rowhammer Attacks using Software-only Target Row Refresh

Zhi Zhang, CSIRO’s Data61, Australia; Yueqiang Cheng, NIO Security Research; Minghua Wang, Baidu Security; Wei He and Wenhao Wang, State Key Laboratory of Information Security, Institute of Information Engineering, CAS and University of Chinese Academy of Sciences; Surya Nepal, CSIRO’s Data61, Australia; Yansong Gao, Nanjing University of Science and Technology, China; Kang Li, Baidu Security; Zhe Wang and Chenggang Wu, State Key Laboratory of Computer Architecture, Institute of Computing Technology, CAS and University of Chinese Academy of Sciences

Available Media

Rowhammer attacks that corrupt level-1 page tables to gain kernel privilege are the most detrimental to system security and hard to mitigate. However, recently proposed software-only mitigations are not effective against such kernel privilege escalation attacks. In this paper, we propose an effective and practical software-only defense, called SoftTRR, to protect page tables from all existing rowhammer attacks on x86. The key idea of SoftTRR is to refresh the rows occupied by page tables when a suspicious rowhammer activity is detected. SoftTRR is motivated by DRAM-chip-based target row refresh (ChipTRR) but eliminates its main security limitation (i.e., ChipTRR tracks a limited number of rows and thus can be bypassed by many-sided hammer [17]). Specifically, SoftTRR protects an unlimited number of page tables by tracking memory accesses to the rows that are in close proximity to page-table rows and refreshing the page-table rows once the tracked access count exceeds a pre-defined threshold. We implement a prototype of SoftTRR as a loadable kernel module, and evaluate its security effectiveness, performance overhead, and memory consumption. The experimental results show that SoftTRR protects page tables from real-world rowhammer attacks and incurs small performance overhead as well as memory cost.

Hardening Hypervisors with Ombro

Ethan Johnson, Colin Pronovost, and John Criswell, University of Rochester

Available Media

This paper presents Ombro, a low-level virtual instruction set architecture (vISA) which enforces compiler-based security policies on real-world commodity hypervisors. We extend the Secure Virtual Architecture (which itself extends the LLVM compiler’s Intermediate Representation) to support the full set of hardware operations needed to run an x86 commodity hypervisor used in some of the world’s largest public clouds, namely, the Xen 4.12 hypervisor, running in full hardware-accelerated mode using Intel’s Virtual Machine Extensions (VMX). We have ported Xen 4.12 to the Ombro vISA and demonstrated that it can run unmodified guest VMs of real-world relevance (namely, Linux guests under Xen’s HVM and PVH modes). Furthermore, to demonstrate Ombro’s ability to harden hypervisors from attack, Ombro implements control flow integrity and the first protected shadow (split) stack for x86 hypervisors. Our performance results show that Ombro achieves this protection without imposing measurable overheads on most application benchmarks.

HyperEnclave: An Open and Cross-platform Trusted Execution Environment

Yuekai Jia, Tsinghua University; Shuang Liu, Ant Group; Wenhao Wang, Institute of Information Engineering, CAS; Yu Chen, Tsinghua University; Zhengde Zhai, Shoumeng Yan, and Zhengyu He, Ant Group

Available Media

A number of trusted execution environments (TEEs) have been proposed by both academia and industry. However, most of them require specific hardware or firmware changes and are bound to specific hardware vendors (such as Intel, AMD, ARM, and IBM). In this paper, we propose HyperEnclave, an open and cross-platform process-based TEE that relies on the widely-available virtualization extension to create the isolated execution environment. In particular, HyperEnclave is designed to support the flexible enclave operation modes to fulfill the security and performance demands under various enclave workloads. We provide the enclave SDK to run existing SGX programs on HyperEnclave with little or no source code changes. We have implemented HyperEnclave on commodity AMD servers and deployed the system in a world-leading FinTech company to support real-world privacy-preserving computations. The evaluation on both micro-benchmarks and application benchmarks shows the design of HyperEnclave introduces only a small overhead.

PRIDWEN: Universally Hardening SGX Programs via Load-Time Synthesis

Fan Sang, Georgia Institute of Technology; Ming-Wei Shih, Microsoft; Sangho Lee, Microsoft Research; Xiaokuan Zhang, Georgia Institute of Technology; Michael Steiner, Intel; Mona Vij, Intel Labs; Taesoo Kim, Georgia Institute of Technology

Available Media

A growing class of threats to Intel Software Guard Extensions (SGX) is Side-Channel Attacks (SCAs). As a response, numerous countermeasures have been proposed. However, it is hard to incorporate them to protect SGX programs against multiple SCAs simultaneously. A naive combination of distinct countermeasures does not work in practice because some of them are 1) undeployable in target environments lacking dependent hardware features, 2) redundant if there are already defenses with similar functionalities, and 3) incompatible with each other by design or implementation. Identifying all of such conditions and preparing potential workarounds before deployment are challenging, primarily when an SGX program targets multiple platforms that abstract or manipulate their configurations.

Pridwen is a framework that selectively applies essential SCA countermeasures when loading an SGX program based on the configurations of the target execution platform. Pridwen allows a developer to deploy a program in the form of WebAssembly (Wasm). Upon receiving a Wasm binary, Pridwen probes the current hardware configuration, synthesizes a program (i.e., a native binary) with an optimal set of countermeasures, and validates the final binary. Pridwen supports both software-only and hardware-assisted countermeasures, and our evaluations show Pridwen efficiently, faithfully synthesizes multiple benchmark programs and real-world applications while securing them against multiple SCAs.

Track 2

Machine Learning 2

Session Chair: Thanumalayan Sankaranarayana Pillai, Google

Tetris: Memory-efficient Serverless Inference through Tensor Sharing

Jie Li, Laiping Zhao, and Yanan Yang, Tianjin University; Kunlin Zhan, 58.com; Keqiu Li, Tianjin University

Available Media

Executing complex, memory-intensive deep learning inference services poses a major challenge for serverless computing frameworks, which would densely deploy and maintain inference models at high throughput. We observe the excessive memory consumption problem in serverless inference systems, due to the large-sized models and high data redundancy.

We present Tetris, a serverless platform catered to inference services with an order of magnitude lower memory footprint. Tetris’s design carefully considers the extensive memory sharing of runtime and tensors. It supports minimizing the runtime redundancy through a combined optimization of batching and concurrent execution and eliminates tensor redundancy across instances from either the same or different functions using a lightweight and safe tensor mapping mechanism. Our comprehensive evaluation demonstrates that Tetris saves up to 93% memory footprint for inference services, and increases the function density by 30× without impairing the latency.

PetS: A Unified Framework for Parameter-Efficient Transformers Serving

Zhe Zhou, Peking University; Xuechao Wei, Peking University, Alibaba Group; Jiejing Zhang, Alibaba Group; Guangyu Sun, Peking University

Available Media

Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training

Xin He, CSEE, Hunan University & Xidian University; Jianhua Sun and Hao Chen, CSEE, Hunan University; Dong Li, University of California, Merced

Available Media

Mixed precision training uses a mixture of full and lower precisions for neural network (NN) training. Applying mixed precision must cast tensors in NN from float32 (FP32) to float16 (FP16) or vice versa. The existing strategy greedily applies FP16 to performance-critical operations without quantifying and considering the casting cost. However, we reveal that the casting cost can take more than 21% of NN operation execution time, and in some cases surpasses the performance benefit of using low precision. In this paper, we introduce Campo, a tool that improves performance of mixed-precision NN training with the awareness of casting costs. Campo is built upon performance modeling that predicts the casting cost and operation performance with low precision, and introduces a cost-aware graph rewriting strategy. Campo is user-transparent, and enables high performance NN training using mixed precision without training accuracy loss. Evaluating Campo with six NN models, we show that compared to TensorFlow using TF_AMP (a state-of-the-art performance optimizer for mixed precision training from Nvidia), Campo improves training throughput by 20.8% on average (up to 24.5%) on RTX 2080 Ti GPU and by 20.9% on average (up to 23.4%) on V100 GPU, without training accuracy loss. Because of using the cost-aware mixed precision training, Campo also improves energy efficiency by 21.4% on average (up to 24.2%), compared to TensorFlow using TF_AMP.

Primo: Practical Learning-Augmented Systems with Interpretable Models

Qinghao Hu, Nanyang Technological University; Harsha Nori, Microsoft; Peng Sun, SenseTime; Yonggang Wen and Tianwei Zhang, Nanyang Technological University

Available Media

While machine learning has demonstrated remarkable performance in various computer systems, some substantial flaws can prohibit its deployment in practice, including opaque decision processes, poor generalization and robustness, as well as exorbitant training and inference overhead. Motivated by these deficiencies, we introduce Primo, a unified framework for developers to design practical learning-augmented systems. Specifically, (1) Primo provides two interpretable models (PrAM and PrDT), as well as a Distill Engine, to support different system scenarios and deployment requirements. (2) It adopts Bayes Optimization to automatically identify the optimal model pruning strategy and hyperparameter configuration. (3) It also implements two tools, Monotonic Constraint and Counterfactual Explanation, to achieve transparent debugging and guided model adjustment. Primo can be applied to different types of learning-augmented systems. Evaluations on three state-of-the-art systems show that Primo can provide clear model interpretations, better system performance, and lower deployment costs.

Track 1

Distributed Systems 2

Session Chair: Fred Douglis, Peraton Labs

Meces: Latency-efficient Rescaling via Prioritized State Migration for Stateful Distributed Stream Processing Systems

Rong Gu, Han Yin, Weichang Zhong, Chunfeng Yuan, and Yihua Huang, State Key Laboratory for Novel Software Technology, Nanjing University

Available Media

Stateful distributed stream processing engines (SPEs) usually call for dynamic rescaling due to varying workloads. However, existing state migration approaches suffer from latency spikes, or high resource usage, or major disruptions as they ignore the order of state migration during rescaling. This paper reveals the importance of state migration order to the latency performance in SPEs. Based on that, we propose Meces, an on-the-fly state migration mechanism which prioritizes the state migration of hot keys (those being processed or about to be processed by downstream operator tasks) to achieve smooth rescaling. Meces leverages a fetch-on-demand design which migrates operator states at record-granularity for state consistency. We further devise a hierarchical state data structure and gradual strategy for migration efficiency. Meces is implemented on Apache Flink and evaluated with diversified benchmarks and scenarios. Compared to state-of-the-art approaches, Meces improves stream processing performance in terms of latency and throughput during rescaling by orders of magnitude, with negligible overhead and no disruption to non-rescaling periods.

DepFast: Orchestrating Code of Quorum Systems

Xuhao Luo, University of Illinois at Urbana-Champaign; Weihai Shen and Shuai Mu, Stony Brook University; Tianyin Xu, University of Illinois at Urbana-Champaign

Available Media

Quorum systems (e.g., replicated state machines) are critical distributed systems. Building correct, high-performance quorum systems is known to be hard. A major reason is that the protocols in quorum systems lead to non-deterministic state changes and complex branching conditions based on different events (e.g., timeouts). Traditionally, these systems are built with an asynchronous coding style with event-driven callbacks, but often lead to “callback hell” that makes code hard to follow and maintain. Converting to synchronous coding styles (e.g., using coroutines) is challenging because of the complex branching conditions. In this paper, we present Dependably Fast (DepFast), an effective, expressive framework for developing quorum systems. DepFast provides a unique QuorumEvent abstraction to enable building quorum systems in a synchronous style. It also supports composition of multiple events, e.g., timeouts, different quorums. To evaluate DepFast, we use it to implement two quorum systems, Raft and Copilot. We show that complex quorum systems implemented by DepFast are easy to write and have high performance. Specifically, it takes 25%–35% fewer lines of code to implement Raft and Copilot using DepFast, and the DepFast-based implementations have comparable performance with the state-of-the-art systems.

High Throughput Replication with Integrated Membership Management

Pedro Fouto, Nuno Preguiça, and João Leitão, NOVA LINCS & NOVA University Lisbon

Available Media

This paper introduces ChainPaxos, a new distributed consensus algorithm for high throughput replication. ChainPaxos organizes nodes in a chain, allowing for a pipeline communication pattern that maximizes throughput, by minimizing the number of messages transmitted. While other proposals have explored such patterns, ChainPaxos is the first that can execute linearizable reads in any replica with no communication overhead, relying only on information used to process updates. These techniques build on a fully specified integrated membership management solution, allowing ChainPaxos’s fault-tolerance to be independent of an external coordination service, often used in other solutions, which can lead to possible safety violations in the presence of network partitions.

Our evaluation shows that, when compared with alternative Paxos variants, ChainPaxos exhibits significantly higher throughput and scalability with negligible latency impact. Compared to other solutions with similar communication patterns, besides avoiding the costs of an external coordination service, ChainPaxos’s high throughput tends to increase with the ratio of read-only operations.

Track 2

Operating Systems 2

Session Chair: Reto Achermann, University of British Columbia

CBMM: Financial Advice for Kernel Memory Managers

Mark Mansi, Bijan Tabatabai, and Michael M. Swift, University of Wisconsin - Madison

Available Media

First-party datacenter workloads present new challenges to kernel memory management (MM), which allocates and maps memory and must balance competing performance concerns in an increasingly complex environment. In a datacenter, performance must be both good \textit{and} consistent to satisfy service-level objectives. Unfortunately, current MM designs often exhibit inconsistent, opaque behavior that is difficult to reproduce, decipher, or fix, stemming from (1) a lack of high-quality information for policymaking, (2) the cost-unawareness of many current MM policies, and (3) opaque and distributed policy implementations that are hard to debug. For example, the Linux huge page implementation is distributed across 8 files and can lead to page fault latencies in the 100s of ms.

In search of a MM design that has consistent behavior, we designed Cost-Benefit MM (CBMM), which uses empirically based cost-benefit models and pre-aggregated profiling information to make MM policy decisions. In CBMM, policy decisions follow the guiding principle that \textit{userspace benefits must outweigh userspace costs}. This approach balances the performance benefits obtained by a kernel policy against the cost of applying it. CBMM has competitive performance with Linux and HawkEye, a recent research system, for all the workloads we ran, and in the presence of fragmentation, CBMM is 35% faster than Linux on average. Meanwhile, CBMM nearly always has better tail latency than Linux or HawkEye, particularly on fragmented systems. It reduces the cost of the most expensive soft page faults by 2-3 orders of magnitude for most of our workloads, and reduces the frequency of 10-1000 mu s-long faults by around 2 orders of magnitude for multiple workloads.

EPK: Scalable and Efficient Memory Protection Keys

Jinyu Gu, Hao Li, Wentai Li, Yubin Xia, and Haibo Chen, Shanghai Jiao Tong University

Available Media

As a hardware mechanism for facilitating intra-process memory isolation, Intel Memory Protection Keys (MPK) has been leveraged to efﬁciently improve the isolation, security, or performance of the software. However, it can only support 16 isolated memory domains, which signiﬁcantly limits its applicability in many scenarios.

In this paper, we present EPK which leverages off-the-shelf virtualization hardware features to extend the number of available protection domains in MPK. To demonstrate the effectiveness of EPK, we apply it in three scenarios, including better memory isolation for server applications as well as Non-Volatile Memory (NVM) applications, and a fast InterProcess Communication (IPC) mechanism for microkernels. The evaluation results show that EPK can scale to provide hundreds of isolated domains. It can outperform the state-of-the-art (libmpk) by up to two orders of magnitude and usually achieve 95% of the performance of the system with no memory isolation.

Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory

Sangjin Choi and Taeksoo Kim, KAIST; Jinwoo Jeong, Ajou University; Rachata Ausavarungnirun, King Mongkut's University of Technology North Bangkok; Myeongjae Jeon, UNIST; Youngjin Kwon, KAIST; Jeongseob Ahn, Ajou University

Available Media

With the ever-growing demands for GPUs, most organizations allow users to share the multi-GPU servers. However, we observe that the memory space across GPUs is not effectively utilized enough when consolidating various workloads that exhibit highly varying resource demands. This is because the current memory management techniques were designed solely for individual GPUs rather than shared multi-GPU environments.

This study introduces a novel approach to provide an illusion of virtual memory space for GPUs, called hierarchical unified virtual memory (HUVM), by incorporating the temporarily idle memory of neighbor GPUs. Since modern GPUs are connected to each other through a fast interconnect, it provides lower access latency to neighbor GPU's memory compared to the host memory via PCIe. On top of HUVM, we design a new memory manager, called memHarvester, to effectively and efficiently harvest the temporarily available neighbor GPUs’ memory. For diverse consolidation scenarios with DNN training and graph analytics workloads, our experimental result shows up to 2.71x performance improvement compared to the prior approach in multi-GPU environments.

Track 1

Deployed Systems 1

Session Chair: Noa Zilberman, University of Oxford

Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA

Zhe Wang, Shanghai Jiao Tong University; Teng Ma, Alibaba Group; Linghe Kong, Shanghai Jiao Tong University; Zhenzao Wen, Jingxuan Li, Zhuo Song, Yang Lu, Yong Yang, and Tao Ma, Alibaba Group; Guihai Chen, Shanghai Jiao Tong University; Wei Cao, Alibaba Group

Available Media

Cloud services have recently undergone a major shift from monolithic designs to microservices running on the cloud-native infrastructure, where monitoring systems are widely deployed to ensure the service level agreement (SLA). Nevertheless, the traditional monitoring system no longer fulfills the demands of cloud-native monitoring, which is observed from our practical experience in Alibaba cloud. Specifically, the monitor occupies resources (e.g., CPU) of the monitored infrastructure, disturbing services running on it. For example, enabling monitor causes jitters/declines of online services in Alibaba's ''double eleven'' shopping festival with high loads. On the other hand, the quality of service (QoS) of monitoring itself, which is vital to track and ensure SLA, is not guaranteed with the high loaded system.

In this paper, we design and implement a novel monitoring system, named Zero, for cloud-native monitoring. First, Zero achieves zero overhead to collect raw metrics from the monitored hosts using \textit{one-sided} remote direct memory access (RDMA) operations, thus avoiding any interferences to cloud services. Second, Zero adopts receiver-driven model to collect monitoring metrics with high QoS, where credit-based flow control and hybrid I/O model are proposed to mitigate network congestion/interference and CPU bottlenecks. Zero has been deployed and evaluated in Alibaba cloud. Deployment results show that Zero achieves no CPU occupation at the monitored host and supports 1 sim 10k hosts with 0.1 sim 1s sampling interval using single thread for network I/O.

CRISP: Critical Path Analysis of Large-Scale Microservice Architectures

Zhizhou Zhang, UC Santa Barbara; Murali Krishna Ramanathan, Prithvi Raj, and Abhishek Parwal, Uber Technologies Inc.; Timothy Sherwood, UC Santa Barbara; Milind Chabbi, Uber Technologies Inc.

Available Media

Microservice architectures have become the lifeblood of modern service-oriented software systems. Remote Procedure Calls (RPCs) among microservices are deeply nested, asynchronous, and large in number, thus making it very hard to identify the underlying service(s) that contribute to the overall end-to-end latency experienced by a top-level request. State-of-the-art RPC tracing tools collect a deluge of data but provide little insight. We need sophisticated tools to bubble-up signals from a myriad of RPC traces to assist developers in identifying optimization opportunities, pinpointing common bottlenecks, setting appropriate time outs, diagnosing error conditions, and planning and managing compute capacity, to name a few.

In this paper, we present \tool{} --- a tool to perform critical path analysis (CPA) over a large number of traces collected from RPCs in microservices environments. \tool{} provides three critical performance analysis capabilities: a) a \emph{top-down} CPA of any specific endpoint, which is tailored for service owners to drill down the root causes of latency issues, b) a \emph{bottom-up} CPA over all endpoints in the system --- tailored for infrastructure and performance engineers --- to bubble up those (interior) APIs that have a high impact across many endpoints, and c) an on-the-fly anomaly detection to alert potential problems.

We have applied \tool{}'s capabilities on \company{}'s entire backend system composed of sim 40K endpoints that cater to real-time requests from more than 100 million active daily users worldwide. Using the critical path as the basis of performance analysis has a) helped us identify five performance issues and optimization opportunities across two business-critical microservices, b) guided us in our future hardware choice that reduces end-to-end latencies, and c) reduced the false positives in anomaly detection by up to 50% while speeding up the training and inference by up to 28 times and up to 67 times, respectively, over the state of the art.

Whale: Efficient Giant Model Training over Heterogeneous GPUs

Xianyan Jia, Le Jiang, Ang Wang, and Wencong Xiao, Alibaba Group; Ziji Shi, National University of Singapore & Alibaba Group; Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin, Alibaba Group

Available Media

The scaling up of deep neural networks has been demonstrated to be effective in improving model quality, but also encompasses several training challenges in terms of training efficiency, programmability, and resource adaptability. We present Whale, a general and efficient distributed training framework for giant models. To support various parallel strategies and their hybrids, Whale generalizes the programming interface by defining two new primitives in the form of model annotations, allowing for incorporating user hints. The Whale runtime utilizes those annotations and performs graph optimizations to transform a local deep learning DAG graph for distributed multi-GPU execution. Whale further introduces a novel hardware-aware parallel strategy, which improves the performance of model training on heterogeneous GPUs in a balanced manner. Deployed in a production cluster with 512 GPUs, Whale successfully trains an industry-scale multimodal model with over ten trillion model parameters, named M6, demonstrating great scalability and efficiency.

Track 2

Machine Learning 3

Session Chair: Somali Chaterji, Purdue University

Cachew: Machine Learning Input Data Processing as a Service

Dan Graur, Damien Aymon, Dan Kluser, and Tanguy Albrici, ETH Zurich; Chandramohan A. Thekkath, Google; Ana Klimovic, ETH Zurich

Available Media

Processing input data plays a vital role in ML training, impacting accuracy, throughput, and cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with training examples, is a common bottleneck. Alleviating data stalls is critical yet challenging for users. While today's frameworks provide mechanisms to maximize input pipeline throughput (e.g., distributing data processing on remote CPU workers and/or reusing cached data transformations), leveraging these mechanisms to jointly optimize training time and cost is non-trivial. Users face two key challenges. First, ML schedulers focus on GPU/TPU resources, leaving users on their own to optimize multi-dimensional resource allocations for data processing. Second, input pipelines often consume excessive compute power to repeatedly transform the same data. Deciding which source or transformed data to cache is non-trivial: large datasets are expensive to store, the compute time saved by caching is not always the bottleneck for end-to-end training, and transformations may not be deterministic, hence reusing transformed data can impact accuracy.

We propose Cachew, a fully-managed service for ML data processing. Cachew dynamically scales distributed resources for data processing to avoid stalls in training jobs. The service also automatically applies caching when and where it is performance/cost-effective to reuse preprocessed data within and across jobs. Our key contributions are autoscaling and autocaching policies, which leverage domain-specific metrics collected at data workers and training clients (rather than generic resource utilization metrics) to minimize training time and cost. Compared to scaling workers with Kubernetes, Cachew's policies reduce training time by up to 4.1x and training cost by 1.1x to 3.8x.

CoVA: Exploiting Compressed-Domain Analysis to Accelerate Video Analytics

Jinwoo Hwang, Minsu Kim, Daeun Kim, Seungho Nam, Yoonsung Kim, and Dohee Kim, KAIST; Hardik Sharma, Google; Jongse Park, KAIST

Available Media

Modern retrospective analytics systems leverage cascade architecture to mitigate bottleneck for computing deep neural networks (DNNs). However, the existing cascades suffer two limitations: (1) decoding bottleneck is either neglected or circumvented, paying significant compute and storage cost for pre-processing; and (2) the systems are specialized for temporal queries and lack spatial query support. This paper presents CoVA, a novel cascade architecture that splits the cascade computation between compressed domain and pixel domain to address the decoding bottleneck, supporting both temporal and spatial queries. CoVA cascades analysis into three major stages where the first two stages are performed in compressed domain while the last one in pixel domain. First, CoVA detects occurrences of moving objects (called blobs) over a set of compressed frames (called tracks). Then, using the track results, CoVA prudently selects a minimal set of frames to obtain the label information and only decode them to compute the full DNNs, alleviating the decoding bottleneck. Lastly, CoVA associates tracks with labels to produce the final analysis results on which users can process both temporal and spatial queries. Our experiments demonstrate that CoVA offers 4.8× throughput improvement over modern cascade systems, while imposing modest accuracy loss.

SOTER: Guarding Black-box Inference for General Neural Networks at the Edge

Tianxiang Shen, Ji Qi, Jianyu Jiang, Xian Wang, Siyuan Wen, Xusheng Chen, and Shixiong Zhao, The University of Hong Kong; Sen Wang and Li Chen, Huawei Technologies; Xiapu Luo, The Hong Kong Polytechnic University; Fengwei Zhang, Southern University of Science and Technology (SUSTech); Heming Cui, The University of Hong Kong

Available Media

The prosperity of AI and edge computing has pushed more and more well-trained DNN models to be deployed on third-party edge devices to compose mission-critical applications. This necessitates protecting model confidentiality at untrusted devices, and using a co-located accelerator (e.g., GPU) to speed up model inference locally. Recently, the community has sought to improve the security with CPU trusted execution environments (TEE). However, existing solutions either run an entire model in TEE, suffering from extremely high inference latency, or take a partition-based approach to handcraft partial model via parameter obfuscation techniques to run on an untrusted GPU, achieving lower inference latency at the expense of both the integrity of partitioned computations outside TEE and accuracy of obfuscated parameters.

We propose SOTER, the first system that can achieve model confidentiality, integrity, low inference latency and high accuracy in the partition-based approach. Our key observation is that there is often an \textit{associativity} property among many inference operators in DNN models. Therefore, SOTER automatically transforms a major fraction of associative operators into \textit{parameter-morphed}, thus \textit{confidentiality-preserved} operators to execute on untrusted GPU, and fully restores the execution results to accurate results with associativity in TEE. Based on these steps, SOTER further designs an \textit{oblivious fingerprinting} technique to safely detect integrity breaches of morphed operators outside TEE to ensure correct executions of inferences. Experimental results on six prevalent models in the three most popular categories show that, even with stronger model protection, SOTER achieves comparable performance with partition-based baselines while retaining the same high accuracy as insecure inference.

Track 1

Storage 2

Session Chair: William Jannen, Williams College

IPLFS: Log-Structured File System without Garbage Collection

Juwon Kim, Minsu Jang, Muhammad Danish Tehseen, Joontaek Oh, and YouJip Won, KAIST

Available Media

In this work, we develop the log-structured filesystem that is free from garbage collection. There are two key technical ingredients: \emph{IPLFS}, a log-structured filesystem for infinite partition, and \emph{Interval Mapping}, a space-efficient LBA-to-PBA mapping for infinite filesystem partition. In IPLFS, we separate the filesystem partition size from the physical storage size and set the size of the logical partition large enough so that there is no lack of free segments in the logical partition during SSD’s lifespan. This allows the filesystem to write the updates in append-only fashion without reclaiming the invalid filesystem blocks. We revise the metadata structure of the baseline filesystem, F2FS, so that it can efficiently handle the storage partition with 2^{64} sectors. We develop Interval Mapping to minimize the memory requirement for the LBA-to-PBA translation in FTL. Interval Mapping is a three level mapping tree. It maintains mapping only for actively used filesystem region. With Interval Mapping, the FTL can maintain the mapping for the 2^{64} sector range with almost identical memory requirement with the page mapping whose LBA range is limited by the size of the storage capacity. We implement the IPLFS on Linux kernel 5.11.0 and prototype the Interval Mapping in OpenSSD. By eliminating the filesystem level garbage collection, IPLFS outperforms F2FS by up to 12.8 times (FIO) and 3.73 times (MySQL YCSB A), respectively.

Vigil-KV: Hardware-Software Co-Design to Integrate Strong Latency Determinism into Log-Structured Merge Key-Value Stores

Miryeong Kwon, Seungjun Lee, and Hyunkyu Choi, KAIST; Jooyoung Hwang, Samsung Electronics Co., Ltd.; Myoungsoo Jung, KAIST

Available Media

We propose \emph{Vigil-KV}, a hardware and software co-designed framework that eliminates long-tail latency almost perfectly by introducing strong latency determinism. To make Get latency deterministic, Vigil-KV first enables a predictable latency mode (PLM) interface on a real datacenter-scale NVMe SSD, having knowledge about the nature of the underlying flash technologies. Vigil-KV at the system-level then hides the non-deterministic time window (associated with SSD's internal tasks and/or write services) by internally scheduling the different device states of PLM across multiple physical functions. Vigil-KV further schedules compaction/flush operations and client requests being aware of PLM's restrictions thereby integrating strong latency determinism into LSM KVs. We implement Vigil-KV upon a 1.92TB NVMe SSD prototype and Linux 4.19.91, but other LSM KVs can adopt its concept. We evaluate diverse Facebook and Yahoo scenarios with Vigil-KV, and the results show that Vigil-KV can reduce the tail latency of a baseline KV system by 3.19 times while reducing the average latency by 34%, on average.

Pacman: An Efficient Compaction Approach for Log-Structured Key-Value Store on Persistent Memory

Jing Wang, Youyou Lu, Qing Wang, and Minhui Xie, Tsinghua University; Keji Huang, Huawei Technologies Co., Ltd; Jiwu Shu, Tsinghua University

Available Media

Recent persistent memory (PM) key-value (KV) stores adopt the log-structured approach to reap PM’s full potential. However, they fail to sustain high performance at high capacity utilization due to inefficient compaction. The inefficiency results from the unawareness of PM’s characteristics. This paper proposes Pacman, an efficient PM-aware compaction approach for log-structured KV stores on PM. Pacman (1) offloads reference search during compaction to service threads, so as to mitigate the onerous index traversal overhead, (2) leverages tagged pointer and DRAM-resident compaction information to avoid excessive PM accesses introduced by garbage collection, (3) redesigns the compaction pipeline based on the PM peculiarities to lower the persistence overhead, and (4) separates hot and cold objects in a lightweight manner to reduce PM data copying in compaction. We apply Pacman to state-of-the-art PM-based log-structured KV stores and evaluate Pacman using various benchmarks. Our evaluations show that Pacman curtails the performance degradation at high capacity utilization, increases the compaction bandwidth by 2-4×, and boosts the performance of the state-of-the-art systems by 1.5-1.8× under write-intensive workloads.

Track 2

Networking 2

Session Chair: Aurojit Panda, NYU

Towards Latency Awareness for Content Delivery Network Caching

Gang Yan and Jian Li, SUNY-Binghamton University

Available Media

Caches are pervasively used in content delivery networks (CDNs) to serve requests close to users and thus reduce content access latency. However, designing latency-optimal caches are challenging in the presence of delayed hits, which occur in high-throughput systems when multiple requests for the same content occur before the content is fetched from the remote server. In this paper, we propose a novel timer-based mechanism that provably optimizes the mean caching latency, providing a theoretical basis for the understanding and design of latency-aware (LA) caching that is fundamental to content delivery in latency-sensitive systems. Our timer-based model is able to derive a simple ranking function which quickly informs us the priority of a content for our goal to minimize latency. Based on that we propose a lightweight latency-aware caching algorithm named LA-Cache. We have implemented a prototype within Apache Traffic Server, a popular CDN server. The latency achieved by our implementations agrees closely with theoretical predictions of our model. Our experimental results using production traces show that LA-Cache consistently reduces latencies by 5%-15% compared to state-of-the-art methods depending on the backend RTTs.

Yunhong Xu, Texas A&M University; Keqiang He and Rui Wang, Google; Minlan Yu, Harvard University; Nick Duffield, Texas A&M University; Hassan Wassel, Shidong Zhang, Leon Poutievski, Junlan Zhou, and Amin Vahdat, Google

Available Media

Traffic load balancing across multiple paths is a critical task for modern networks to reduce network congestion and improve network efficiency. Hashing which is the foundation of traffic load balancing still faces practical challenges. The key problem is there is a growing need for more hash functions because networks are getting larger with more switches, more stages, and increased path diversity. Meanwhile, topology and routing become more agile in order to efficiently serve traffic demands with stricter throughput and latency SLAs. On the other hand, current generation switch chips only provide a limited number of uncorrelated hash functions. We first demonstrate why the limited number of hashing functions is a practical challenge in today's datacenter network (DCN) and wide-area network (WAN) designs. Then, to mitigate the problem, we propose a novel approach named color recombining which enables hash functions to reuse via leveraging topology traits of multi-stage DCN networks. We also describe a novel framework based on coprime theory to mitigate hash correlation in generic mesh topologies (i.e., spineless DCN and WAN). Our evaluation using real network trace data and topologies demonstrate that we can reduce the extent of load imbalance (measured by the coefficient of variation) by an order of magnitude.

Firebolt: Finding Bugs in Programmable Data Plane Generators

Jiamin Cao, Tsinghua University; Yu Zhou and Chen Sun, Alibaba Group; Lin He, Zhaowei Xi, and Ying Liu, Tsinghua University

Available Media

Programmable data planes (DP) enable flexible customization of packet processing logic with domain-specific languages such as P4. To relieve developers from lengthy codes and tedious hardware details, many researches propose DP program generators that take high-level intents as input and automatically convert intents into DP programs. Generators must be correct, otherwise they may produce buggy programs or DP logic that is inconsistent with intents. Nevertheless, existing verification tools are designed to verify individual DP programs, not generators. They either cannot achieve high bug coverage or cannot debug generators with high scalability.

This paper presents Firebolt, a blackbox testing tool designed to dig out faults in DP program generators, including security vulnerabilities, intent violations, and generator crash. Firebolt achieves high bug coverage by using syntax-guided intent generation to construct a comprehensive, syntactically correct, and semantically valid intent set. To avoid intent explosion, Firebolt designs an intent space pruning approach that eliminates redundant intents while preserving representative ones. For high scalability, Firebolt automatically formalizes DP programs and intents for verification. We apply Firebolt to three popular open-source DP generators. Evaluation results demonstrate that Firebolt can detect 2x bugs with 0.1% to 0.01% human efforts compared to existing tools.

USENIX ATC '22 Poster Session and Reception

The USENIX ATC '22 poster session and reception will feature posters by authors presenting their work in person at the conference. View the list of accepted posters.

The Computing and Information Science and Engineering Landscape: A Look Forward

Margaret Martonosi, National Science Foundation

Available Media

Compilers and PL

Session Chair: Mihai Budiu, VMware Research

Investigating Managed Language Runtime Performance: Why JavaScript and Python are 8x and 29x slower than C++, yet Java and Go can be Faster?

David Lion, University of Toronto and YScope Inc.; Adrian Chiu and Michael Stumm, University of Toronto; Ding Yuan, University of Toronto and YScope Inc.

Available Media

The most widely used programming languages today are managed languages. They are popular because their vast features improve many aspects of code development, including increased productivity and safety. However, as a product or service scales in usage, performance issues become a problem. Developers are then often faced with complex choices as they must decide whether the desired performance can be squeezed from existing code, or whether their language has reached its performance limits, requiring years of code to be ported to a new more-performant language. To make matters worse, runtime performance is shrouded in mystery as it involves complex interactions of different components, such as interpreter, just-in-time (JIT) compiler, thread library, and Garbage Collection (GC) system.

We present an in-depth performance analysis and comparison of Java, Go, JavaScript, and Python, using C++ as a baseline. We carefully instrumented the different language runtimes, so that developers can precisely measure the number of cycles taken to execute any bytecode instruction, or the overhead of dynamic type checking in JavaScript. This allows us to accurately identify sources of overhead. We further created 6 applications from the ground up to establish the LangBench benchmark; the applications cover a range of complexity, and they cover a variety of application scenarios differing in compute intensity, memory usage, network and disk I/O intensity, and available concurrency. We comprehensively analyze their completion times, resource usage, and scalability.

Overall, we found that V8/Node.js and CPython exhibit excessive overheads, executing applications 8.01x and 29.50x slower on average than their C++ counterparts, respectively. Making matters worse, applications on these two runtimes scale poorly in that they cannot effectively utilize more than one core. In contrast, OpenJDK and Go applications are performance competitive to C++, running only 1.43x and 1.30x slower, respectively, and they can easily scale to multiple cores. There are applications where OpenJDK and Go outperform their C++ counterparts.

Automatic Recovery of Fine-grained Compiler Artifacts at the Binary Level

Yufei Du, University of North Carolina at Chapel Hill; Ryan Court and Kevin Snow, Zeropoint Dynamics; Fabian Monrose, University of North Carolina at Chapel Hill

Available Media

Identifying a binary’s compiler configuration enables developers and analysts to locate potential security issues caused by optimization side-effects, identify binary clones, and build compatible binary patches. Existing work focuses on identifying compiler family, version and optimization level of a binary using semantic features and deep learning techniques. Unfortunately, in practice, binaries are an amalgamation of objects and functions that can be compiled at different optimization levels with a variety of individual, fine-grained, optimizations that may be applied depending on the structure of the code. Hence, rather than recovering high-level artifacts, i.e., compiler family, version, and optimization level, we explore the recovery of individual, fine-grained, optimization passes for each function in a binary. To do so, we develop an approach using specially crafted features alongside intuitive and understandable machine learning models. Our evaluation on 15 popular open-source repositories shows that our approach compares favorable with the state-of-the-art deep learning approach in compiler family, compiler version and optimization level identification. For fine-grained optimization passes, our evaluation on 149,814 functions from 552 binaries in four popular open-source repositories shows that our approach achieves an average F-1 score of 92.1% for all optimization passes and an average F-1 score of 89.8% for optimization passes that could have negative impacts on security. Moreover, our approach includes experimental support for dynamic feature extraction via binary emulation, and our results shows that such features offer promising potential in improving the accuracy of optimization pass identification.

JITServer: Disaggregated Caching JIT Compiler for the JVM in the Cloud

Alexey Khrabrov, University of Toronto; Marius Pirvu and Vijay Sundaresan, IBM; Eyal de Lara, University of Toronto

Available Media

Managed runtimes such as the Java virtual machine (JVM) rely on just-in-time (JIT) compilers to improve application performance by converting bytecodes into optimized machine code. Unfortunately, JIT compilation introduces significant CPU and memory runtime overheads. JIT compiler disaggregation is a technique that decouples the JIT from the JVM and ships compilation to a separate remote process. JIT disaggregation reduces overall memory usage; however, its communication overheads result in higher system-wide CPU usage. JITServer is a disaggregated caching JIT compiler we implemented in the Eclipse OpenJ9 JVM. It improves system-wide resource utilization by enabling the caching of compiled native code and its reuse in JVMs running on different machines. JITServer is transparent to the application developer, and supports all the dynamic features in the JVM specification. In our experiments, JITServer reduced overall CPU cost by up to 77%, overall memory usage by up to 62%, application start time by up to 58% and warm-up time by up to 87%.

Riker: Always-Correct and Fast Incremental Builds from Simple Specifications

Charlie Curtsinger, Grinnell College; Daniel W. Barowy, Williams College

Awarded Best Paper!

Available Media

Build systems are responsible for building software correctly and quickly. Unfortunately, traditional build tools like make are correct and fast only when developers precisely enumerate dependencies for every incremental build step. Forward build systems improve correctness over traditional build tools by discovering dependencies automatically, but existing forward build tools have two fundamental flaws. First, they are incorrect; existing forward build tools miss dependencies because their models of system state are incomplete. Second, they rely on users to manually specify incremental build steps, increasing the programmer burden for fast builds.

This paper introduces Riker, a forward build system that guarantees fast, correct builds. Riker builds are easy to specify; in many cases a single command such as gcc *.c suffices. From these simple specifications, Riker automatically discovers fast incremental rebuild opportunities. Riker models the entire POSIX filesystem—not just files, but directories, pipes, and so on. This model guarantees that every dependency is checked on every build so every output is correct.

We use Riker to build 14 open source packages including LLVM and memcached. Riker incurs a median overhead of 8.8% on the initial full build. On average, Riker's incremental builds realize 94% of make's incremental speedup with no manual effort and no risk of errors.

Storage 3

Session Chair: Changwoo Min, Virginia Tech

FlatFS: Flatten Hierarchical File System Namespace on Non-volatile Memories

Miao Cai, Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University; School of Computer and Information, Hohai University; State Key Laboratory for Novel Software Technology, Nanjing University; Junru Shen, School of Computer and Information, Hohai University; Bin Tang, Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University; Hao Huang, State Key Laboratory for Novel Software Technology, Nanjing University; Baoliu Ye, State Key Laboratory for Novel Software Technology, Nanjing University; Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University; School of Computer and Information, Hohai University

Available Media

The conventional file system provides a hierarchical namespace by structuring it as a directory tree. Tree-based namespace structure leads to inefficient file path walk and expensive namespace tree traversal, underutilizing ultra-low access latency and good sequential performance provided by non-volatile memory systems. This paper proposes FlatFS, a NVM file system that features a flat namespace architecture while provides a compatible hierarchical namespace view. FlatFS incorporates three novel techniques: coordinated file path walk model, range-optimized NVM-friendly B^{r} tree, and write-optimized compressed index key layout, to fully exploit flat namespace structure to improve file system namespace performance on high-performance NVMs. Evaluation results demonstrate that FlatFS achieves significant performance improvements for metadata-intensive benchmarks and real-world applications compared to other file systems.

StRAID: Stripe-threaded Architecture for Parity-based RAIDs with Ultra-fast SSDs

Shucheng Wang, Qiang Cao, and Ziyi Lu, Wuhan National Laboratory for Optoelectronics, HUST; Hong Jiang, Department of Computer Science and Engineering, UT Arlington; Jie Yao, School of Computer Science and Technology, HUST; Yuanyuan Dong, Alibaba Group

Available Media

Popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g., RAID5 and RAID6) assigns one or more centralized worker threads to efficiently process all user requests based on multi-stage asynchronous control and global data structures, successfully exploiting characteristics of slow devices, e.g., Hard Disk Drives (HDDs). However, we observe that, with high-performance NVMe-based Solid State Drives (SSDs), even the recently added multi-worker processing mode in MD achieves only limited performance gain because of the severe lock contentions under intensive write workloads.

In this paper, we propose a novel stripe-threaded RAID architecture, StRAID, assigning a dedicated worker thread for each stripe-write (one-for-one model) to sufficiently exploit high parallelism inherent among RAID stripes, multi-core processors, and SSDs. For the notoriously performance-punishing partial-stripe writes, StRAID presents a two-phase stripe write mechanism to opportunistically aggregate stripe-associated writes to minimize write I/Os; and designs a parity cache to reduce write-induced read I/Os on parity disks. We evaluate a StRAID prototype with a variety of benchmarks and real-world traces. StRAID is demonstrated to consistently outperform MD by up to 5.8 times in write throughput without affecting the read performance.

Vinter: Automatic Non-Volatile Memory Crash Consistency Testing for Full Systems

Samuel Kalbfleisch, Lukas Werling, and Frank Bellosa, Karlsruhe Institute of Technology

Available Media

Non-volatile memory (NVM) is a new byte-addressable storage technology that is part of the processor's memory hierarchy. NVM is often exposed to applications via an in-kernel file system. To prevent data loss in the case of crashes, the file system implementation needs to be crash-consistent. Achieving crash consistency is difficult however, as special primitives need to be inserted at appropriate places in the program to ensure persistency in the presence of volatile caches.

We introduce Vinter, a new approach to automated NVM crash consistency testing designed for full systems, including unmodified kernel software such as file systems. By tracing NVM accesses of a full system via dynamic binary translation, we capture interactions between user and kernel space code. With such traces, our system efficiently generates relevant crash states using a heuristic that determines NVM locations significant for crash consistency. Finally, it extracts the semantic representation of each crash state. This makes the automatic detection of operation-spanning violations of crash consistency properties such as atomicity feasible. Our approach further aids in fixing detected bugs by representing how bugs originate from simulated crashes which are annotated by trace metadata.

Our evaluation on NVM file systems uncovers several previously unknown bugs, including bugs in the state-of-the-art file systems NOVA and NOVA-Fortis that lead to atomicity violations and data loss.

NICs

Session Chair: Fred Douglis, Peraton Labs

AlNiCo: SmartNIC-accelerated Contention-aware Request Scheduling for Transaction Processing

Junru Li, Youyou Lu, Qing Wang, Jiazhen Lin, Zhe Yang, and Jiwu Shu, Tsinghua University

Available Media

High-performance transaction processing needs to schedule numerous requests from the network. However, such request scheduling comes with costs of complex information gathering and considerable computation. We observe that emerging SmartNICs pose opportunities for transaction scheduling with low overhead. In this paper, we propose AlNiCo, which leverages SmartNICs to intelligently schedule incoming transaction requests to CPU cores, minimizing inter-transaction contention with low latency. AlNiCo describes the contention according to system states in a way that SmartNICs can efficiently process, and co-designs hardware and software to enable flexible and adaptive scheduling. We implement AlNiCo using FPGA-equipped Innova-2 SmartNICs, and our evaluation shows that AlNiCo boosts the throughput by 1.30 times sim 2.68 times and reduces the latency by up to 48.8%.

FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs

Zeke Wang, Hongjing Huang, Jie Zhang, and Fei Wu, Zhejiang University; Gustavo Alonso, ETH Zurich

Available Media

Given that the increasing rate of network bandwidth is far ahead of that of the compute capacity of host CPU, which by default processes network packets, SmartNIC has been introduced to offload packet processing, even application logic. Modern GPUs have already become a key computing engine in a broad range of applications such as Artificial Intelligence (AI) and High Performance Computing (HPC), which always need massive computing power that a single GPU server cannot afford. When processing these applications, the distributed pool of GPUs generates the majority of network traffic in a cluster of such servers. The commercially available multi-core SmartNICs, such as BlueFiled-2, fail to process 100Gb network traffic at line rate with its embedded CPU, which is capable of doing control-plane management only. The commercially available FPGA-based SmartNICs are mainly optimized for network applications that run on the host CPU. We intend to develop a GPU-oriented SmartNIC to accelerate a broad range of distributed applications on distributed GPUs.

To this end, we present FpgaNIC, an FPGA-based GPU-centric versatile SmartNIC that enables direct PCIe P2P communication with local GPUs using GPU virtual address, and that provides reliable 100Gb hardware network transport to communicate with remote GPUs. The above two components consume fewer than 20% FPGA resources on a mid-size FPGA, so FpgaNIC has sufficient resources for a customized data-path accelerator that allows offloading complex compute tasks on the FPGA for line-rate in-network computing, thereby complementing the processing at the GPU. FpgaNIC has been designed to explore the design space of SmartNICs, e.g., direct, on-path, and off-path models, each of which benefits different types of applications. We believe that FpgaNIC opens up a wealth of research opportunities, e.g., accelerating a broad range of distributed applications by combining GPUs and FPGAs and exploring a larger design space of SmartNICs by making them easily accessible from local GPUs. FpgaNIC allows to prototype applications that can eventually be migrated to hardened SmartNICs.

Faster Software Packet Processing on FPGA NICs with eBPF Program Warping

Marco Bonola, CNIT/Axbryd; Giacomo Belocchi, Angelo Tulumello, and Marco Spaziani Brunella, Axbryd/University of Rome Tor Vergata; Giuseppe Siracusano, NEC Laboratories Europe; Giuseppe Bianchi, University of Rome Tor Vergata/CNIT; Roberto Bifulco, NEC Laboratories Europe

Available Media

FPGA NICs can improve packet processing performance, however, programming them is difficult, and existing solutions to enable software packet processing on FPGA either provide limited packet processing speed, or require changes to programs and to their development/deployment life cycle.

We address the issue with program warping, a new technique that improves throughput replacing several instructions of a packet processing program with an equivalent runtime programmable hardware implementation. Program warping performs static analysis of a packet processing program, described with Linux's eBPF, to identify subsets of the program that can be implemented by an optimized FPGA pipeline, the warp engine. Packets handled by the warp engine are eventually delivered to a regular eBPF program executor, along with their program context (registers, stack), to complete execution of those program parts that cannot be efficiently pipelined.

We prototype program warping on a 100Gbps FPGA NIC, extending hXDP, a state-of-the-art eBPF processor for FPGA, and measure its performance running 6 unmodified real-world eBPF programs, including deployed applications such as Katran and Suricata. Our prototype runs at 250MHz, uses less than 15% of the FPGA resources, and improves hXDP throughput by 1.2-3x in most cases, and up to 18x.

Deployed Systems 2

Session Chair: Jiri Schindler, Tranquil Data

NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow

Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, PDL; Yiming Zhang, Xiamen University; Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Minglu Li, Shanghai Jiao Tong University & Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.

Available Media

NVMe SSD has become a staple in modern datacenters thanks to its high throughput and ultra-low latency. Despite its popularity, the reliability of NVMe SSD under mass deployment remains unknown. In this paper, we collect logs from over one million NVMe SSDs deployed at Alibaba, and conduct extensive analysis. From the study, we identify a series of major reliability changes in NVMe SSD. On the good side, NVMe SSD becomes more resilient to early failures and variances of access patterns. On the bad side, NVMe SSD becomes more vulnerable to complicated correlated failures. More importantly, we discover that the ultra-low latency nature makes NVMe SSD much more likely to be impacted by fail-slow failures.

Tzu-Wei Yang, Seth Pollen, Mustafa Uysal, Arif Merchant, and Homer Wolfmeister, Google

Available Media

This paper describes the algorithm, implementation, and deployment experience of CacheSack, the admission algorithm for Google datacenter flash caches. CacheSack minimizes the dominant costs of Google’s datacenter flash caches: disk IO and flash footprint. CacheSack partitions cache traffic into disjoint categories, analyzes the observed cache benefit of each subset, and formulates a knapsack problem to assign the optimal admission policy to each subset. Prior to this work, Google datacenter flash cache admission policies were optimized manually, with most caches using the Lazy Adaptive Replacement Cache (LARC) algorithm. Production experiments showed that CacheSack significantly outperforms the prior static admission policies for a 6.5% improvement of the total operational cost, as well as significant improvements in disk reads and flash wearout.

Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam ,Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, Akshat Vig, Amazon Web Services

Available Media

Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale. Hundreds of thousands of customers rely on DynamoDB for its fundamental properties: consistent performance, availability, durability, and a fully managed serverless experience. In 2021, during the 66 hour Amazon Prime Day shopping event, Amazon systems including Alexa, the Amazon.com sites, and Amazon fulfillment centers, made trillions of API calls to DynamoDB, peaking at 89.2 million requests per second, while experiencing high availability with single-digit millisecond performance. Since its launch in 2012, DynamoDB's design and implementation have evolved in response to our experiences operating it. The system has successfully dealt with issues related to fairness, traffic imbalance across partitions, monitoring, and automated system operations without impacting availability or performance. Reliability is essential, as even the slightest disruption can significantly impact customers. This paper presents our experience operating DynamoDB at a massive scale and how the architecture continues to evolve to meet the ever-increasing demands of customer workloads.

Closing Remarks

Jiri Schindler, Tranquil Data, and Noa Zilberman, University of Oxford