USENIX ATC '22 Technical Sessions

All the times listed below are in Pacific Daylight Time (PDT).

Papers are available for download below to registered attendees now. The papers and the full proceedings will be available to everyone beginning Monday, July 11, 2022. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents

Full Proceedings PDFs
USENIX ATC '22 Full Proceedings (PDF, 85 MB)
USENIX ATC '22 Proceedings Interior (PDF, 85 MB, best for mobile devices)
USENIX ATC '22 Errata Slip #1 (PDF)

Attendee Files

USENIX ATC '22 Attendee List (PDF)

USENIX ATC '22 Monday Paper Archive (65 MB ZIP, includes Proceedings front matter and attendee list)

USENIX ATC '22 Tuesday Paper Archive (68 MB ZIP)

USENIX ATC '22 Wednesday Paper Archive (17 MB ZIP)

Monday, July 11

8:00 am–9:00 am

Continental Breakfast

9:00 am–10:00 am

USENIX ATC '22 and OSDI '22 Joint Keynote Address

Surprise-Inspired Networking

David Tennenhouse, Independent Researcher

Available Media

The Information Theory concept of surprisal suggests that, in the future, the most valuable information will be the "new" information entering the cloud from the edge rather than the "prior" information that is stored and processed deeper within the cloud. This talk will discuss the implications of this relatively simple concept for future systems and networks. In particular, it will focus on the role that "near the edge" computing and storage can play in synthesizing the "new" and "old" information.

David is passionate about research and innovation and has a track record of embracing high-risk initiatives, such as software-defined networking, software radio, IoT, and data-intensive computing. He has worked in academia, as a faculty member at MIT; in government, at DARPA; in industry at Intel, Amazon/A9.com, Microsoft, and VMware; and as a partner in a venture capital firm. Dr. Tennenhouse has championed research related to a wide range of technologies, including networking, distributed computing, blockchain, computer architecture, storage, machine learning, robotics, and nano/biotechnology. David holds a BASc and MASc in Electrical Engineering from the University of Toronto and obtained his Ph.D. at the University of Cambridge.

10:00 am–10:30 am

Break with Refreshments

10:30 am–10:45 am

Opening Remarks and Awards

Jiri Schindler, Tranquil Data, and Noa Zilberman, University of Oxford

10:45 am–12:00 pm

Track 1

Storage 1

Session Chair: Youjip Won, Korea Advanced Institute of Science and Technology (KAIST)

ZNSwap: un-Block your Swap

Shai Bergman, Technion; Niklas Cassel and Matias Bjørling, Western Digital; Mark Silberstein, Technion

Available Media

We introduce ZNSwap, a novel swap subsystem optimized for the recent Zoned Namespace (ZNS) SSDs. ZNSwap leverages ZNS's explicit control over data management on the drive and introduces a space-efficient host-side Garbage Collector (GC) for swap storage co-designed with the OS swap logic. ZNSwap enables cross-layer optimizations, such as direct access to the in-kernel swap usage statistics by the GC to enable fine-grain swap storage management, and correct accounting of the GC bandwidth usage in the OS resource isolation mechanisms to improve performance isolation in multi-tenant environments. We evaluate ZNSwap using standard Linux swap benchmarks and two production key-value stores. ZNSwap shows significant performance benefits over the Linux swap on traditional SSDs, such as stable throughput for different memory access patterns, and 10 times lower 99th percentile latency and 5 times higher throughput for memcached key-value store under realistic usage scenarios.

Building a High-performance Fine-grained Deduplication Framework for Backup Storage with High Deduplication Ratio

Xiangyu Zou and Wen Xia, Harbin Institute of Technology, Shenzhen; Philip Shilane, Dell Technologies; Haijun Zhang and Xuan Wang, Harbin Institute of Technology, Shenzhen

Available Media

Fine-grained deduplication, which first removes identical chunks and then eliminates redundancies between similar but non-identical chunks (i.e., delta compression), could exploit workloads' compressibility to achieve a very high deduplication ratio but suffers from poor backup/restore performance. This makes it not as popular as chunk-level deduplication thus far. This is because allowing workloads to share more references among similar chunks further reduces spatial/temporal locality, causes more I/O overhead, and leads to worse backup/restore performance.

In this paper, we address issues for different forms of poor locality with several techniques, and propose MeGA, which achieves backup and restore speed close to chunk-level deduplication while preserving fine-grained deduplication's significant deduplication ratio advantage. Specifically, MeGA applies (1) a backup-workflow-oriented delta selector to address poor locality when reading base chunks, and (2) a delta-friendly data layout and "Always-Forward-Reference" traversing in the restore workflow to deal with the poor spatial/temporal locality of deduplicated data.

Evaluations on four datasets show that MeGA achieves a better performance than other fine-grained deduplication approaches. In particular, compared with the traditional greedy approach, MeGA achieves a 4.47–34.45 times higher backup performance and a 30–105 times higher restore performance while maintaining a very high deduplication ratio.

Secure and Lightweight Deduplicated Storage via Shielded Deduplication-Before-Encryption

Zuoru Yang, The Chinese University of Hong Kong; Jingwei Li, University of Electronic Science and Technology of China; Patrick P. C. Lee, The Chinese University of Hong Kong

Available Media

Outsourced storage should fulfill confidentiality and storage efficiency for large-scale data management. Conventional approaches often combine encryption and deduplication based on deduplication-after-encryption (DaE), which first performs encryption followed by deduplication on encrypted data. We argue that DaE has fundamental limitations that lead to various drawbacks in performance, storage savings, and security in secure deduplication systems. In this paper, we study an unexplored paradigm called deduplication-before-encryption (DbE), which first performs deduplication and encrypts only non-duplicate data. DbE has the benefits of mitigating the performance and storage penalties caused by the management of duplicate data, but its deduplication process is no longer protected by encryption. To this end, we design DEBE, a shielded DbE-based deduplicated storage system that protects deduplication via Intel SGX. DEBE builds on frequency-based deduplication that first removes duplicates of frequent data in a space-constrained SGX enclave and then removes all remaining duplicates outside the enclave. Experiments show that DEBE outperforms state-of-the-art DaE approaches.

Track 2

Containers

Session Chair: Ric Wheeler, Facebook

RunD: A Lightweight Secure Container Runtime for High-density Deployment and High-concurrency Startup in Serverless Computing

Zijun Li, Jiagan Cheng, and Quan Chen, Shanghai Jiao Tong University; Eryu Guan, Zizheng Bian, Yi Tao, Bin Zha, Qiang Wang, and Weidong Han, Alibaba Group; Minyi Guo, Shanghai Jiao Tong University

Available Media

The secure container that hosts a single container in a micro virtual machine (VM) is now used in serverless computing, as the containers are isolated through the microVMs. There are high demands on the high-density container deployment and high-concurrency container startup to improve both the resource utilization and user experience, as user functions are fine-grained in serverless platforms. Our investigation shows that the entire software stacks, containing the cgroups in the host operating system, the guest operating system, and the container rootfs for the function workload, together result in low deployment density and slow startup performance at high-concurrency. We therefore propose and implement a lightweight secure container runtime, named RunD, to resolve the above problems through a holistic guest-to-host solution. With RunD, over 200 secure containers can be started in a second, and over 2500 secure containers can be deployed on a node with 384GB of memory. RunD is adopted as Alibaba serverless container runtime to support high-density deployment and high-concurrency startup.

Help Rather Than Recycle: Alleviating Cold Startup in Serverless Computing Through Inter-Function Container Sharing

Zijun Li, Linsong Guo, Quan Chen, Jiagan Cheng, and Chuhao Xu, Shanghai Jiao Tong University; Deze Zeng, China University of Geosciences; Zhuo Song, Tao Ma, and Yong Yang, Alibaba Cloud; Chao Li and Minyi Guo, Shanghai Jiao Tong University

Available Media

In serverless computing, each function invocation is executed in a container (or a Virtual Machine), and container cold startup results in long response latency. We observe that some functions suffer from cold container startup, while the warm containers of other functions are idle. Based on the observation, other than booting a new container for a function from scratch, we propose to alleviate the cold startup by re-purposing a warm but idle container from another function. We implement a container management scheme, named Pagurus, to achieve the purpose. Pagurus comprises an intra-function manager for replacing an idle warm container to be a container that other functions can use without introducing additional security issues, an inter-function scheduler for scheduling containers between functions, and a sharing-aware function balancer at the cluster-level for balancing the workload across different nodes. Experiments using Azure serverless traces show that Pagurus alleviates 84.6\% of the cold startup, and the cold startup latency is reduced from hundreds of milliseconds to 16 milliseconds if alleviated.

RRC: Responsive Replicated Containers

Diyu Zhou, UCLA and EPFL; Yuval Tamir, UCLA

Available Media

Replication is the basic mechanism for providing application-transparent reliability through fault tolerance. The design and implementation of replication mechanisms is particularly challenging for general multithreaded services, where high latency overhead is not acceptable. Most of the existing replication mechanisms fail to meet this challenge.

RRC is a fully-operational fault tolerance mechanism for multiprocessor workloads, based on container replication. It minimizes the latency overhead during normal operation by addressing two key sources of this overhead: (1) it decouples the latency overhead from checkpointing frequency using a hybrid of checkpointing and replay, and (2) it minimizes the pause time for checkpointing by forking a clone of the container to be checkpointed, thus allowing execution to proceed in parallel with checkpointing. The fact that RRC is based on checkpointing makes it inherently less vulnerable to data races than active replication. In addition, RRC includes mechanisms that further reduce the vulnerability to data races, resulting in high recovery rates, as long as the rate of manifested data races is low. The evaluation includes measurement of the recovery rate and recovery latency based on thousands of fault injections. On average, RRC delays responses to clients by less than 400mu's and recovers in less than 1s. The average pause latency is less than 3.3ms. For a set of eight real-world benchmarks, if data races are eliminated, the performance overhead of RRC is under 48%.

12:00 pm–1:20 pm

Lunch

1:20 pm–3:00 pm

Track 1

Distributed Systems 1

Session Chair: Jiri Schindler, Tranquil Data

uKharon: A Membership Service for Microsecond Applications

Rachid Guerraoui and Antoine Murat, EPFL; Javier Picorel, Huawei Technologies; Athanasios Xygkis, EPFL; Huabing Yan and Pengfei Zuo, Huawei Technologies

Available Media

Modern data center fabrics open the possibility of microsecond distributed applications, such as data stores and message queues. A challenging aspect of their development is to ensure that, besides being fast in the common case, these applications react fast to changes in their membership, e.g., due to reconfiguration and failures. This is especially important as they form the backbone of numerous cloud-powered services, such as analytics and trading systems, trying to meet ever-stringent tail latency requirements. As the microservices-oriented architecture is the de facto standard for building cloud services, a single user request translates to a wide fan-out of microservices interactions sitting on the critical path. The outcome is implacable: the traditionally uncommon events of reconfiguration and failures are exacerbated by the fan-out of communication, making user requests commonly experience such events and quickly impacting the tail latency of the service.

We present uKharon, a microsecond-scale membership service that detects changes in the membership of applications and lets them failover in as little as 50us. uKharon consists of (1) a multi-level failure detector, (2) a consensus engine that relies on one-sided RDMA CAS, and (3) minimal-overhead membership leases, all exploiting RDMA to operate at the microsecond scale. We showcase the power of uKharon by building uKharon-KV, a replicated Key-Value cache based on HERD. uKharon-KV processes PUT requests as fast as the state-of-the-art and improves upon it by (1) removing the need for replicating GET requests and (2) bringing the end-to-end failover down to 53us, a 10x improvement.

KRCORE: A Microsecond-scale RDMA Control Plane for Elastic Computing

Xingda Wei, Shanghai Jiao Tong University, Shanghai AI Laboratory; Fangming Lu, Shanghai Jiao Tong University; Rong Chen, Shanghai Jiao Tong University, Shanghai AI Laboratory; Haibo Chen, Shanghai Jiao Tong University

Available Media

We present KRCORE, an RDMA library with a microsecond-scale control plane on commodity RDMA hardware for elastic computing. KRCORE can establish a full-fledged RDMA connection within 10μs (hundreds or thousands of times faster than verbs), while only maintaining a (small) fixed-sized connection metadata at each node, regardless of the cluster scale. The key ideas include virtualizing pre-initialized kernel-space RDMA connections instead of creating one from scratch, and retrofitting advanced RDMA dynamic connected transport with static transport for both low connection overhead and high networking speed. Under load spikes, KRCORE can shorten the worker bootstrap time of an existing disaggregated key-value store (namely RACE Hashing) by 83%. In serverless computing (namely Fn), KRCORE can also reduce the latency for transferring data through RDMA by 99%.

Zero-Change Object Transmission for Distributed Big Data Analytics

Mingyu Wu, Shuaiwei Wang, Haibo Chen, and Binyu Zang, Shanghai Jiao Tong University

Available Media

Distributed big-data analytics heavily rely on high-level languages like Java and Scala for their reliability and versatility. However, those high-level languages also create obstacles for data exchange. To transfer data across managed runtimes like Java Virtual Machines (JVMs), objects should be transformed into byte arrays by the sender (serialization) and transformed back into objects by the receiver (deserialization). The object serialization and deserialization (OSD) phase introduces considerable performance overhead. Prior efforts mainly focus on optimizing some phases in OSD, so object transformation is still inevitable. Furthermore, they require extra programming efforts to integrate with existing applications, and their transformation also leads to duplicated object transmission. This work proposes Zero-Change Object Transmission (ZCOT), where objects are directly copied among JVMs without any transformations. ZCOT can be used in existing applications with minimal efforts, and its object-based transmission can be used for deduplication. The evaluation on state-of-the-art data analytics frameworks indicates that ZCOT can greatly boost the performance of data exchange and thus improve the application performance by up to 23.6%.

Sift: Using Refinement-guided Automation to Verify Complex Distributed Systems

Haojun Ma, Hammad Ahmad, Aman Goel, Eli Goldweber, Jean-Baptiste Jeannin, Manos Kapritsos, and Baris Kasikci, University of Michigan

Available Media

Distributed systems are hard to design and implement correctly. Recent work has tried to use formal verification techniques to provide rigorous correctness guarantees. These works present a hard choice, though. One must either opt for the power of refinement-based approaches like IronFleet and Verdi, at the cost of large amounts of manual effort; or choose the more automated approach of I4, IC3PO, SWISS and DistAI which give up the ability to prove refinement and the power and scalability that come with it.

We propose an alternative approach, Sift, that combines the power of refinement with the ability to automate proofs. Sift is a two-tier methodology that uses a new technique, refinement-guided automation, to leverage automation in a refinement proof and a divide-and-conquer technique to split a system into more refinement layers when necessary. This combination advances the frontier of what systems can be proven correct using a high degree of automation. Contrary to what was possible before, our evaluation shows that our novel approach allows us to prove the correctness of a number of systems with little manual effort, and to extend our proofs to include not just the protocols, but also an executable distributed implementation of these systems.

Track 2

Machine Learning 1

Session Chair: Saurabh Bagchi, Purdue University

Faith: An Efficient Framework for Transformer Verification on GPUs

Boyuan Feng, Tianqi Tang, Yuke Wang, Zhaodong Chen, Zheng Wang, Shu Yang, Yuan Xie, Yufei Ding, University of California, Santa Barbara

Available Media

Transformer verification draws increasing attention in machine learning research and industry. It formally verifies the robustness of transformers against adversarial attacks such as exchanging words in a sentence with synonyms. However, the performance of transformer verification is still not satisfactory due to bound-centric computation which is significantly different from standard neural networks. In this paper, we propose \textbf{Faith}, an efficient framework for transformer verification on GPUs. We first propose a semantic-aware computation graph transformation to identify semantic information such as bound computation in transformer verification. We exploit such semantic information to enable efficient kernel fusion at the computation graph level. Second, we propose a verification-specialized kernel crafter to efficiently map transformer verification to modern GPUs. This crafter exploits a set of GPU hardware supports to accelerate verification-specialized operations which are usually memory-intensive. Third, we propose an expert-guided autotuning to incorporate expert knowledge on GPU backends to facilitate large search space exploration. Extensive evaluations show that Faith achieves 2.1 times to 3.4 times (2.6 times on average) speedup over state-of-the-art frameworks.

DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs

Weihao Cui, Han Zhao, Quan Chen, Hao Wei, and Zirui Li, Shanghai Jiao Tong University; Deze Zeng, China University of Geosciences; Chao Li and Minyi Guo, Shanghai Jiao Tong University

Available Media

The DNN inferences are often batched for better utilizing the hardware in existing DNN serving systems. However, DNN serving exhibits diversity in many aspects, such as input, operator, and load. The unawareness of these diversities results in inefficient processing. Our investigation shows that the inefficiency roots in the feature of existing batching mechanism: one entry and one exit. Therefore, we propose DVABatch, a runtime batching system that enables the multi-entry multi-exit batching scheme for existing DNN serving system. We first abstract three meta operations for adjusting the ongoing batch of queries to achieve the multi-entry multi-exit scheme. The meta operations could be used to form different scheduling logics for different diversities. To deliver the meta operations to an ongoing batch, we slice the DNN models into multiple stages. Each stage corresponds to one executor, which is managed by a state transition diagram. Compared with state-of-the-art solutions, our experimental results show that DVABatch reduces 46.4% average latency and achieves up to 2.12× throughput improvement.

Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing

Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh, KAIST

Available Media

As machine learning (ML) techniques are applied to a widening range of applications, high throughput ML inference serving has become critical for online services. Such ML inference servers with multiple GPUs pose new challenges in the scheduler design. First, they must provide a bounded latency for each request to support a consistent service-level objective (SLO). Second, they must be able to serve multiple heterogeneous ML models in a system, as cloud-based consolidation improves system utilization. To address the two requirements of ML inference servers, this paper proposes a new inference scheduling framework for multi-model ML inference servers. The paper shows that with SLO constraints, GPUs with growing parallelism are not fully utilized for ML inference tasks. To maximize the resource efficiency of GPUs, a key mechanism proposed in this paper is to exploit hardware support for spatial partitioning of GPU resources. With spatio-temporal sharing, a new abstraction layer of GPU resources is created with configurable GPU resources. The scheduler assigns requests to virtual GPUs, called gpulets, with the most effective amount of resources. The scheduler explores the three-dimensional search space with different batch sizes, temporal sharing, and spatial sharing efficiently. To minimize the cost for cloud-based inference servers, the framework auto-scales the required number of GPUs for a given workload. To consider the potential interference overheads when two ML tasks are running concurrently by spatially sharing a GPU, the scheduling decision is made with an interference prediction model. Our prototype implementation proves that the proposed spatio-temporal scheduling enhances throughput by 61.7% on average compared to the prior temporal scheduler, while satisfying SLOs.

PilotFish: Harvesting Free Cycles of Cloud Gaming with Deep Learning Training

Wei Zhang and Binghao Chen, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research; Quan Chen, Shanghai Jiao Tong University; Peng Cheng, Fan Yang, Ran Shu, and Yuqing Yang, Microsoft Research; Minyi Guo, Shanghai Jiao Tong University

Available Media

Cloud gaming services have become important workloads in cloud datacenter. However, our investigation shows that a cloud gaming service cannot saturate the modern cloud GPUs. One way to improve the GPU utilization is to co-locate multiple workloads within one GPU, which is challenging for cloud gaming due to its highly fluctuated and unpredictable GPU usage pattern. In this paper, we present PilotFish, a high-performance system that harvests the free GPU cycles of cloud gaming with deep learning (DL) training, while incurring almost zero interference to cloud gaming. We co-locate DL training jobs with cloud gaming because they have stable and predictable workloads and have no strict latency requirement. In more detail, Pilotfish captures the idle periods of the game's GPU usage with its low-overhead instrumentation to graphic libraries in sub-millisecond granularity. To avoid the potential interference to cloud gaming, PilotFish schedules training computation kernels only when they can finish before the idle GPU periods, and preempts straggler kernels running longer than expected. Our evaluation on popular cloud games and DL models shows PilotFish can harvest up to 85.1% of the idle GPU time from cloud gaming with no interference.

3:00 pm–3:30 pm

Break with Refreshments

3:30 pm–4:25 pm

Track 1

Operating Systems 1

Session Chair: Malte Schwarzkopf, Brown University

Privbox: Faster System Calls Through Sandboxed Privileged Execution

Dmitry Kuznetsov and Adam Morrison, Tel Aviv University

Available Media

System calls are the main method for applications to request services from the operating system, but their invocation incurs considerable overhead, which has been aggravated by mitigation mechanisms for transient execution attacks. Proposed approaches for reducing system call overhead all break the semantic equivalence between system calls and regular function calls (e.g., by making system calls asynchronous), and so their adoption requires rearchitecting applications.

This paper proposes \emph{Privbox}, a new approach for lightweight system calls that maintains the familiar synchronous, function-like system call model. Privbox allows an application to execute system call-intensive code in a \emph{semi-privileged, sandboxed} execution mode, called a 'privbox'. Semi-privileged execution is architecturally similar to the kernel's privileged execution, which enables faster invocation of system calls, but the code is sandboxed to ensure that it cannot use its elevated privileges to compromise the system. We further propose \emph{semi-privileged access prevention} (SPAP), a simple hardware architectural feature that alleviates much of Privbox's instrumentation overhead.

We implement Privbox based on Linux and LLVM. Our evaluation on x86 (Intel Skylake) hardware shows that Privbox (1) speeds up system call invocation by 2.2 times; (2) can increase throughput of I/O-threaded applications by up to 1.7 times; and (3) can increase the throughput of real-world workloads such as Redis by up to 7.6% and 11%, without and with SPAP, respectively.

BBQ: A Block-based Bounded Queue for Exchanging Data and Profiling

Jiawei Wang, Huawei Dresden Research Center, Huawei OS Kernel Lab, Technische Universität Dresden; Diogo Behrens, Ming Fu, Lilith Oberhauser, Jonas Oberhauser, and Jitang Lei, Huawei Dresden Research Center, Huawei OS Kernel Lab; Geng Chen, Huawei OS Kernel Lab; Hermann Härtig, Technische Universität Dresden; Haibo Chen, Huawei OS Kernel Lab, Shanghai Jiao Tong University

Available Media

Concurrent bounded queues have been widely used for exchanging data and profiling in operating systems, databases, and multithreaded applications. The performance of state-of-the-art queues is limited by the interference between multiple enqueues (enq-enq), multiple dequeues (deq-deq), or enqueues and dequeues (enq-deq), negatively affecting their latency and scalability. Although some existing designs employ optimizations to reduce deq-deq and enq-enq interference, they often neglect the enq-deq case. In fact, such partial optimizations may inadvertently increase interference elsewhere and result in performance degradation.

We present Block-based Bounded Queue (BBQ), a novel ringbuffer design that splits the entire buffer into multiple blocks. This eliminates enq-deq interference on concurrency control variables when producers and consumers operate on different blocks. Furthermore, the block-based design is amenable to existing optimizations, e.g., using the more scalable fetch-and-add instruction. Our evaluation shows that BBQ outperforms several industrial ringbuffers. For example, in single-producer/single-consumer micro-benchmarks, BBQ yields 11.3x to 42.4x higher throughput than the ringbuffers from Linux kernel, DPDK, Boost, and Folly libraries. In real-world scenarios, BBQ achieves up to 1.5x, 50.5x, and 11.1x performance improvements in benchmarks of DPDK, Linux io_uring, and Disruptor, respectively. We verified and optimized BBQ on weak memory models with a model-checking-based framework.

Track 2

Disaggregated Systems

Session Chair: Noa Zilberman, University of Oxford

Sibylla: To Retry or Not To Retry on Deep Learning Job Failure

Taeyoon Kim, Suyeon Jeong, Jongseop Lee, Soobee Lee, and Myeongjae Jeon, UNIST

Available Media

GPUs are highly contended resources in shared clusters for deep learning (DL) training. However, our analysis with a real-world trace reveals that a non-negligible number of jobs running on the cluster undergo failures and are blindly retried by the job scheduler. Unfortunately, these job failures often repeat and waste GPU resources, limiting effective GPU utilization across the cluster. In this paper, we introduce Sibylla which informs whether an observed failure of DL training will repeat or not upon retry on the failure. Sibylla employs a machine learning model based on RNNs that trains on stdout and stderr logs of failed jobs and can continuously update the model on new log messages without hand-constructing labels for the new training samples. With Sibylla, the job scheduler is learning-enhanced, performing a retry for a failed job only when it is highly likely to succeed with the retry. We evaluate the effectiveness of Sibylla under a variety of scenarios using trace-driven simulations. Sibylla improves cluster utilization and reduces job completion time (JCT) by up to 15%.

Speculative Recovery: Cheap, Highly Available Fault Tolerance with Disaggregated Storage

Nanqinqin Li, Anja Kalaba, Michael J. Freedman, Wyatt Lloyd, and Amit Levy, Princeton University

Available Media

The ubiquity of disaggregated storage in cloud computing has led to a nascent technique for fault tolerance: instead of utilizing application-level replication, newly-launched backup instances recover application state from disaggregated storage (REDS) after a primary's failure. Attractively, REDS provides fault tolerance at a much lower cost than traditional replication schemes, wherein at least two instances are running. Failover in REDS is slow, however, because it sequentially first detects primary failure and only then starts recovery on a backup.

We propose speculative recovery to accelerate failover and thus increase the availability of applications using REDS. Instead of proceeding with failover sequentially, speculative recovery safely and efficiently parallelizes detecting primary failure and running recovery on a backup, by employing our new super and collapse primitives for disaggregated storage. Our implementation and evaluation of speculative recovery demonstrate that it considerably reduces failover time.

Direct Access, High-Performance Memory Disaggregation with DirectCXL

Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung, KAIST

Available Media

New cache coherent interconnects such as CXL have recently attracted great attention thanks to their excellent hardware heterogeneity management and resource disaggregation capabilities. Even though there is yet no real product or platform integrating CXL into memory disaggregation, it is expected to make memory resources practically and efficiently disaggregated much better than ever before.

In this paper, we propose directly accessible memory disaggregation, DirectCXL that straight connects a host processor complex and remote memory resources over CXL’s memory protocol (CXL.mem). To this end, we explore a practical design for CXL-based memory disaggregation and make it real. As there is no operating system that supports CXL, we also offer CXL software runtime that allows users to utilize the underlying disaggregated memory resources via sheer load/store instructions. Since DirectCXL does not require any data copies between the host memory and remote memory, it can expose the true performance of remote-side disaggregated memory resources to the users.

4:25 pm–4:45 pm

Short Break

4:45 pm–6:00 pm

Track 1

Networking 1

Session Chair: Rik Farrow, Sirius Computing

Not that Simple: Email Delivery in the 21st Century

Florian Holzbauer, SBA Research; Johanna Ullrich, University of Vienna; Martina Lindorfer, TU Wien; Tobias Fiebig, Max-Planck-Institut für Informatik

Available Media

Over the past two decades, the number of RFCs related to email and its security has exploded from below 100 to nearly 500. This embedded the Simple Mail Transfer Protocol (SMTP) into a tree of interdependent and delivery-relevant standards. In this paper, we investigate how far real-world deployments keep up with this increasing complexity of delivery- and security options. To gain an in-depth picture of email delivery apart from the giants in the ecosystem (Gmail, Outlook, etc.), we engage people to send emails to eleven differently configured target domains. Our measurements allow us to evaluate core aspects of email delivery, including security features, DNS configuration, and IP version support on the sending side across different types of providers.

We find that novel technologies are often insufficiently supported, even by large providers. For example, while 65.4\% of email providers can resolve hosts via IPv6, only 44.3\% can also deliver emails via IPv6. Concerning security features, we observe that less than half (41.5\%) of all providers rely on DNSSEC validating resolvers, and encryption is mostly opportunistic, with 89.7\% of providers accepting invalid certificates. TLSA, as a DNS-based certificate verification method, is only used by 31.7\% of the providers in our study. Finally, we turned our eye to the impact modern standards have on unsolicited bulk email (SPAM). We found that greylisting is effective, reducing the SPAM volume by roughly half while not impacting regular delivery. However, and interestingly, SPAM delivery currently seems to focus on plaintext IPv4 connections, making IPv6-only, TLS-enforcing inbound email servers a more effective anti-SPAM measure – even though it also means rejecting a major portion of legitimate emails.

AddrMiner: A Comprehensive Global Active IPv6 Address Discovery System

Guanglei Song, Jiahai Yang, Lin He, Zhiliang Wang, Guo Li, Chenxin Duan, and Yaozhong Liu, Tsinghua University; Zhongxiang Sun, Beijing Jiaotong University

Available Media

Fast Internet-wide scanning is essential for network situational awareness and asset evaluation. However, the vast IPv6 address space makes brute-force scanning infeasible. Although state-of-the-art techniques have made effective attempts, these methods do not work in seedless regions, while the detection efficiency is low in regions with seeds. Moreover, the constructed hitlists with low coverage cannot truly represent the active IPv6 address landscape of the Internet.

This paper introduces AddrMiner, a global active IPv6 address probing system, making IPv6 active address probing systematic, comprehensive, and economical. We divide the IPv6 address space regions into three kinds according to the number of seed addresses and propose a probing algorithm for each of them. For the regions with no seeds, we propose AddrMiner-N, leveraging an organization association strategy to mine active addresses. It finds active addresses covering 86.4K BGP prefixes, accounting for 81.6\% of the probed BGP prefixes. For the regions with few seeds, we propose AddrMiner-F, utilizing a similarity matching strategy to probe active addresses further. The hit rate of active address probing is improved by 70\%-150\% compared to existing algorithms. For the regions with sufficient seeds, we propose AddrMiner-S to generate target addresses based on reinforcement learning dynamically. It nearly doubles the hit rate compared to the state-of-the-art algorithms. Finally, we deploy AddrMiner and discover 2.1 billion active IPv6 addresses, including 1.7 billion de-aliased active addresses and 0.4 billion aliased addresses, through continuous probing for 13 months. We would like to further open the door of IPv6 measurement studies by publicly releasing AddrMiner and sharing our data.

Co-opting Linux Processes for High-Performance Network Simulation

Rob Jansen, U.S. Naval Research Laboratory; Jim Newsome, Tor Project; Ryan Wails, Georgetown University, U.S. Naval Research Laboratory

Awarded Best Paper!

Available Media

Network experimentation tools are vitally important to the process of developing, evaluating, and testing distributed systems. The state-of-the-art simulation tools are either prohibitively inefficient at large scales or are limited by nontrivial architectural challenges, inhibiting their widespread adoption. In this paper, we present the design and implementation of Phantom, a novel tool for conducting distributed system experiments. In Phantom, a discrete-event network simulator directly executes unmodified applications as Linux processes and innovatively synthesizes efficient process control, system call interposition, and data transfer methods to co-opt the processes into the simulation environment. Our evaluation demonstrates that Phantom is up to 2.2× faster than Shadow, up to 3.4× faster than NS-3, and up to 43× faster than gRaIL in large P2P benchmarks while offering performance comparable to Shadow in large Tor network simulations.

Track 2

Finding Bugs

Session Chair: Geoff Kuenning, Harvey Mudd College

KSG: Augmenting Kernel Fuzzing with System Call Specification Generation

Hao Sun, Yuheng Shen, Jianzhong Liu, Yiru Xu, and Yu Jiang, Tsinghua University

Available Media

Kernel fuzzing is a dynamic testing technique that has successfully found numerous kernel vulnerabilities. However, existing kernel fuzzers, such as Syzkaller, depend on system call specifications to generate test cases. Writing such specifications requires an immense amount of domain knowledge while being extremely laborious. Meanwhile, automated generation of the specification is still an open problem due to the complexity of the kernel, including entry function extraction and input type identification. As a result, the current amount of system call information is insufficient to test the entire kernel code base thoroughly. Syzkaller covers an average of 38% of Linux kernel code with current Syzlang specifications for a prolonged time of fuzzing.

In this paper, we propose KSG to generate system call specifications for kernel fuzzers automatically. First, it utilizes probe-based tracing to extract entry functions accurately. Then, it uses path-sensitive analysis to collect precise input types and range constraints in each execution path of entry functions. Based on the aforementioned information, KSG generates specifications in the domain language Syzlang, which is used by most kernel fuzzers. We evaluated KSG on several versions of the Linux kernel. It automatically generated 2433 unique specifications. Leveraging the newly generated specifications, Syzkaller and Moonshine achieved coverage improvements of 22% and 23% respectively. Furthermore, our approach assisted fuzzers to discover 26 previously unknown bugs, where 13 and 6 bugs were fixed and assigned with CVEs, respectively.

DLOS: Effective Static Detection of Deadlocks in OS Kernels

Jia-Ju Bai, Tuo Li, and Shi-Min Hu, Tsinghua University

Available Media

Deadlocks in OS kernels can cause critical problems like performance degradation and system hangs. However, detecting deadlocks in OS kernels is quite challenging, due to high complexity of concurrent execution and large code bases of OS kernels. In this paper, we design a practical static analysis approach named DLOS, to effectively detect deadlocks in OS kernels. DLOS consists of three key techniques: (1) a summary-based lock-usage analysis to efficiently extract the code paths containing distinct locking constraints from kernel code; (2) a reachability-based comparison method to efficiently detect locking cycles from locking constraints; (3) a two-dimensional filtering strategy to effectively drop false positives by validating code-path feasibility and concurrency. We have evaluated DLOS on Linux 5.10, and find 54 real deadlocks, with a false positive rate of 17%. We have reported these deadlocks to Linux kernel developers, and 31 of them have been confirmed.

Modulo: Finding Convergence Failure Bugs in Distributed Systems with Divergence Resync Models

Beom Heyn Kim, Samsung Research, University of Toronto; Taesoo Kim, Samsung Research, Georgia Institute of Technology; David Lie, University of Toronto

Available Media

While there exist many consistency models for distributed systems, most of those models seek to provide the basic guarantee of convergence: given enough time and no further inputs, all replicas in the system should eventually converge to the same state. However, because of Convergence Failure Bugs (CFBs), many distributed systems do not provide even this basic guarantee. The violation of the convergence property can be crucial to safety-critical applications collectively working together with a shared distributed system. Indeed, many CFBs are reported as major issues by developers. Our key insight is that CFBs are caused by divergence, or differences between the state of replicas, and that a focused exploration of divergence states can reveal bugs in the convergence logic of real distributed systems while avoiding state explosion. Based on this insight, we have designed and implemented Modulo, the first Model-Based Testing tool using Divergence Resync Models (DRMs) to systematically explore divergence and convergence in real distributed systems. Modulo uses DRMs to explore an abstract state machine of the system and derive schedules, the intermediate representation of test cases, which are then translated into test inputs and injected into systems under test (SUTs). We ran Modulo to check ZooKeeper, MongoDB, and Redis and found 11 bugs (including 6 previously unknown ones)

6:30 pm–8:00 pm

OSDI '22 Poster Session and Reception

Sponsored by Amazon

Would you like to share a provocative opinion, interesting preliminary work, or a cool idea that will spark discussion at this year's OSDI? The poster session is the perfect venue to introduce such new or ongoing work. Poster presenters will have the opportunity to discuss their work, get exposure, and receive feedback from other attendees during the in-person evening reception. View the list of accepted posters.