HotStorage '20 Workshop Program

All the times listed below are in Pacific Daylight Time (PDT).

Papers are available for download below to registered attendees now and to everyone beginning July 13, 2020. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].

Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)

Attendee Files 
HotStorage '20 Paper Archive (ZIP)
HotStorage '20 Attendee List (PDF)

Monday, July 13, 2020

7:00 am–7:15 am

Opening Remarks

Program Co-Chairs: Anirudh Badam, Microsoft, and Vijay Chidambaram, The University of Texas at Austin and VMware Research

7:15 am–8:30 am

Solid State Drives and Caching

Session Chairs: Jian Huang, University of Illinois at Urbana–Champaign, and Ramnatthan Alagappan, University of Wisconsin–Madison

A New LSM-style Garbage Collection Scheme for ZNS SSDs

Gunhee Choi, Kwanghee Lee, Myunghoon Oh, and Jongmoo Choi, Dankook University; Jhuyeong Jhin and Yongseok Oh, SK Hynix

Available Media

This paper explores how to design a garbage collection scheme for ZNS (Zoned Namespace) SSDs (Solid State Drives). We first show that naive garbage collection at zone granularity incurs long latency due to the huge size of a zone. To overcome this problem, we devise a new scheme, LSM_ZGC (Log-Structured Merge style Zone Garbage Collection), that makes use of three features: segment-based fine-grained garbage collection, reading both valid and invalid data in a group manner, and merging different data into separate zones. Our proposal can exploit the internal parallelism of a zone and reduce the utilization of a candidate zone by segregating hot and cold data. Experimental results from a real implementation show that our scheme enhances performance by 1.9 times on average.
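
The abstract does not spell the mechanism out, but the three features suggest a shape like the following minimal Python sketch; the names (Segment, Zone, lsm_zgc) and the per-segment hotness flag are our own illustrative assumptions, not the paper's API.

    from dataclasses import dataclass, field

    @dataclass
    class Segment:
        valid: bool
        hot: bool    # assumed to be tracked elsewhere (e.g., by access counters)

    @dataclass
    class Zone:
        segments: list = field(default_factory=list)

    def lsm_zgc(victim, hot_zone, cold_zone):
        # Read the whole victim zone sequentially: fetching valid and invalid
        # segments together in one group read is cheaper than many small
        # valid-only reads and exploits the zone's internal parallelism.
        survivors = [s for s in victim.segments if s.valid]
        # Merge hot and cold survivors into *separate* destination zones, so a
        # future victim (the cold zone) will have low utilization.
        for s in survivors:
            (hot_zone if s.hot else cold_zone).segments.append(s)
        victim.segments.clear()   # the zone can now be reset as a unit

    victim = Zone([Segment(True, True), Segment(False, False), Segment(True, False)])
    hot, cold = Zone(), Zone()
    lsm_zgc(victim, hot, cold)
    assert len(hot.segments) == 1 and len(cold.segments) == 1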

Ultra-Low Latency SSDs' Impact on Overall Energy Efficiency

Bryan Harris and Nihat Altiparmak, University of Louisville

Available Media

Recent technological advancements have enabled a generation of Ultra-Low Latency (ULL) SSDs that blur the performance gap between primary and secondary storage devices. However, their power consumption characteristics are largely unknown. In addition, ULL performance in a block device is expected to put extra pressure on operating system components, significantly affecting the energy efficiency of the entire system. In this work, we empirically study overall energy efficiency using a real ULL storage device (an Optane SSD), a power meter, and a wide range of I/O workload behaviors. We present a comparative analysis, laying out several critical observations on idle vs. active behavior, read vs. write behavior, energy proportionality, impact on system software, and impact on overall energy efficiency. To the best of our knowledge, this is the first published study of a ULL SSD's impact on a system's overall power consumption, which we hope can inform future energy-efficient designs.

Need for a Deeper Cross-Layer Optimization for Dense NAND SSD to Improve Read Performance of Big Data Applications: A Case for Melded Pages

Arpith K and K. Gopinath, Indian Institute of Science, Bangalore

Available Media

In dense NAND flash such as TLC, the LSB, CSB, and MSB pages in a wordline can be combined to form a larger logical page called a melded-page. In this paper, we propose melding TLC/QLC pages to improve SSD performance by mitigating the overheads involved in the read operation. Melded-pages are read in their entirety. The controller schedules write requests such that, during later reads, the requests for data in the LSB, CSB, and MSB pages are guaranteed to be present in the request queue. This method works reliably when the workload performs its read operations in large chunks or has a predictable I/O pattern. We obtain performance improvements of up to 45% on workloads that use large block sizes, such as the Hadoop Distributed File System (HDFS). Big data solutions that exhibit such read patterns can benefit greatly from melding pages.
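
For intuition, here is a hedged sketch of the scheduling idea: reads queued for the same wordline are served by one melded read when all three page types are requested, otherwise by individual page reads. The request representation and function name are invented for illustration.

    from collections import defaultdict

    def schedule_melded_reads(request_queue):
        """Group queued reads by wordline; a complete LSB/CSB/MSB group becomes
        one melded read, anything else falls back to individual page reads."""
        by_wordline = defaultdict(dict)
        for req in request_queue:                      # req = (wordline, page_type)
            wordline, page_type = req
            by_wordline[wordline][page_type] = req
        melded, individual = [], []
        for wordline, group in by_wordline.items():
            if {"LSB", "CSB", "MSB"} <= group.keys():  # whole wordline requested
                melded.append(wordline)                # one read returns all 3 pages
            else:
                individual.extend(group.values())
        return melded, individual

    # Wordline 0 is fully requested -> one melded read; wordline 1 is not.
    queue = [(0, "LSB"), (0, "CSB"), (0, "MSB"), (1, "LSB")]
    assert schedule_melded_reads(queue) == ([0], [(1, "LSB")])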

Say Goodbye to Off-heap Caches! On-heap Caches Using Memory-Mapped I/O

Iacovos G. Kolokasis, Anastasios Papagiannis, Polyvios Pratikakis, and Angelos Bilas, Institute of Computer Science (ICS), Foundation for Research and Technology – Hellas (FORTH), Greece; Foivos Zakkak, Red Hat, Inc.
Awarded Best Presentation!

Available Media

Many analytics computations are dominated by iterative processing stages, executed until a convergence condition is met. To accelerate such workloads while keeping up with the exponential growth of data and the slow scaling of DRAM capacity, Spark employs off-heap caching of intermediate results. However, off-heap caching requires the serialization and deserialization (serdes) of data, which adds significant overhead, especially with growing datasets.

This paper proposes TeraCache, an extension of the Spark data cache that avoids the need for serdes by keeping all cached data on-heap but off-memory, using memory-mapped I/O (mmio). To achieve this, TeraCache extends the original JVM heap with a managed heap that resides on a memory-mapped fast storage device and is used exclusively for cached data. Preliminary results show that the TeraCache prototype can speed up Machine Learning (ML) workloads that cache intermediate results by up to 37% compared to the state-of-the-art serdes approach.
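
As a rough illustration of the mmio idea (not TeraCache's JVM implementation, which keeps real heap objects in the mapped region), the following Python snippet maps a cache region backed by a file; in a real deployment the file would live on a fast NVMe device, and the kernel would page cached bytes in and out with no serdes layer.

    import mmap, os, tempfile

    # Hypothetical region file; TeraCache would place this on a fast device.
    path = os.path.join(tempfile.gettempdir(), "teracache.region")
    SIZE = 1 << 20                       # 1 MiB demo region

    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    os.ftruncate(fd, SIZE)
    heap = mmap.mmap(fd, SIZE)           # mapped region acts as a second heap
    heap[0:5] = b"hello"                 # "cache" data: no serialization step;
                                         # the kernel lazily pages bytes to/from
                                         # the backing device
    assert heap[0:5] == b"hello"
    heap.close(); os.close(fd)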

Differentiating Cache Files for Fine-grain Management to Improve Mobile Performance and Lifetime

Yu Liang and Jinheng Li, City University of Hong Kong; Xianzhang Chen, Chongqing University; Rachata Ausavarungnirun, King Mongkut's University of Technology North Bangkok; Riwei Pan, City University of Hong Kong; Tei-Wei Kuo, City University of Hong Kong and National Taiwan University; Chun Jason Xue, City University of Hong Kong

Available Media

Most mobile applications need to download data from the network. The Android system temporarily stores these data as cache files in local flash storage to improve re-access performance. For example, using Facebook for two hours generated 1.2GB of cache files in one case. Writing all cache files to flash storage hurts overall I/O performance and deteriorates the lifetime of mobile flash storage. In this paper, we analyze the access characteristics of the cache files of typical mobile applications. Our observations reveal that the access patterns of cache files differ at both the application level and the file level. While existing solutions treat all cache files equally, this paper differentiates cache files into three categories: burn-after-reading, transient, and long-living. We propose a Fine-grain Cache File Management (FCFM) framework that manages different cache files differently to improve the performance and lifetime of the mobile system. Evaluations using YouTube show that FCFM can significantly improve the performance and lifetime of mobile devices.
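
A minimal sketch of what such fine-grained management could look like; the thresholds and the per-category policies below are our illustrative assumptions, not FCFM's actual parameters.

    def classify(file_stats):
        """Assumes per-file statistics (read count, observed lifetime) are
        tracked; thresholds are made-up for illustration."""
        reads, lifetime_s = file_stats["reads"], file_stats["lifetime_s"]
        if reads <= 1:
            return "burn-after-reading"   # read once, then useless
        if lifetime_s < 60:
            return "transient"            # short-lived
        return "long-living"              # repeatedly re-accessed

    POLICY = {
        "burn-after-reading": "serve from DRAM, drop after first read, never flush",
        "transient": "buffer in DRAM, write to flash only if it survives",
        "long-living": "write to flash immediately",
    }

    print(POLICY[classify({"reads": 1, "lifetime_s": 5})])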

Desperately Seeking ... Optimal Multi-Tier Cache Configurations

Tyler Estro, Stony Brook University; Pranav Bhandari and Avani Wildani, Emory University; Erez Zadok, Stony Brook University

Available Media

Modern cache hierarchies are tangled webs of complexity. Multiple tiers of heterogeneous physical and virtual devices, with many configurable parameters, all contend to optimally serve swarms of requests between local and remote applications. The challenge of effectively designing these systems is exacerbated by continuous advances in hardware, firmware, innovation in cache eviction algorithms, and evolving workloads and access patterns. This rapidly expanding configuration space has made it costly and time-consuming to physically experiment with numerous cache configurations for even a single stable workload. Current cache evaluation techniques (e.g., Miss Ratio Curves) are short-sighted: they analyze only a single tier of cache, focus primarily on performance, and fail to examine the critical relationships between metrics like throughput and monetary cost. Publicly available I/O cache simulators are also lacking: they can only simulate a fixed or limited number of cache tiers, are missing key features, or offer limited analyses.

It is our position that best practices in cache analysis should include the evaluation of multi-tier configurations, coupled with more comprehensive metrics that reveal critical design trade-offs, especially monetary costs. We are developing an n-level I/O cache simulator that is general enough to model any cache hierarchy, captures many metrics, provides a robust set of analysis features, and is easily extendable to facilitate experimental research or production-level provisioning. To demonstrate the value of our proposed metrics and simulator, we extended an existing cache simulator (PyMimircache). We present several interesting and counter-intuitive results in this paper.
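
To make the multi-tier idea concrete, here is a toy two-level (e.g., DRAM over SSD) inclusive LRU simulation that reports per-tier hits alongside a simple monetary-cost figure; the capacities and per-slot prices are made-up numbers, and the authors' simulator is far more general.

    from collections import OrderedDict

    def simulate(trace, capacities, cost_per_slot):
        tiers = [OrderedDict() for _ in capacities]
        hits = [0] * len(capacities)
        for key in trace:
            level = len(tiers)                       # assume miss everywhere
            for l, tier in enumerate(tiers):
                if key in tier:
                    hits[l] += 1
                    tier.move_to_end(key)            # LRU touch
                    level = l
                    break
            for l in range(level):                   # fill/promote upper tiers
                tiers[l][key] = True
                tiers[l].move_to_end(key)
                if len(tiers[l]) > capacities[l]:
                    tiers[l].popitem(last=False)     # evict LRU entry
        cost = sum(c * p for c, p in zip(capacities, cost_per_slot))
        return hits, cost

    trace = [1, 2, 3, 1, 2, 4, 1, 5, 2, 1]
    print(simulate(trace, capacities=[2, 4], cost_per_slot=[5.0, 1.0]))

Sweeping capacities and prices over such a loop is what exposes the throughput-vs-cost trade-offs that single-tier Miss Ratio Curves cannot show.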

Reinforcement Learning-Based SLC Cache Technique for Enhancing SSD Write Performance

Sangjin Yoo, Sungkyunkwan University and Samsung Electronics; Dongkun Shin, Sungkyunkwan University

Available Media

Although quad-level-cell (QLC) NAND flash memory provides high density, its lower write performance and endurance compared to triple-level-cell (TLC) flash memory are critical obstacles to its proliferation. To resolve these problems, hybrid architectures, which program some QLC blocks in single-level-cell (SLC) mode and use them as a cache for the remaining QLC blocks, are widely adopted in commercial solid-state drives (SSDs). However, it is challenging to optimize the various parameters of hybrid SSDs, such as the SLC cache size and the hot/cold separation threshold. In particular, the parameters must be adjusted dynamically by monitoring changes in the I/O workload, yet current techniques use heuristically determined, fixed parameters. This paper proposes a reinforcement learning (RL)-based SLC cache management technique, which observes the workload pattern and the internal status of the hybrid SSD and selects the SLC cache parameters that maximize its efficiency. Experimental results show that the proposed technique improves write throughput and write amplification factor by 77.6% and 20.3% on average, respectively, over previous techniques.
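
A tabular Q-learning loop over discretized workload states gives the flavor of such a controller; the states, actions (candidate SLC cache sizes), and reward signal below are illustrative assumptions, not the paper's design.

    import random
    from collections import defaultdict

    ACTIONS = [4, 8, 16, 32]          # candidate SLC cache sizes (GB), assumed
    ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1 # learning rate, discount, exploration

    Q = defaultdict(float)            # Q[(state, action)] -> expected value

    def choose(state):
        if random.random() < EPS:                         # explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit

    def update(state, action, reward, next_state):
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                       - Q[(state, action)])

    # One control step: observe workload, resize the SLC cache, learn from
    # the observed outcome (reward could be, e.g., measured write MB/s).
    state = "write-heavy"
    action = choose(state)
    update(state, action, reward=123.4, next_state="write-heavy")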

8:30 am–9:00 am

Break

9:00 am–10:15 am

Coding and Persistent Memory

Session Chairs: Vasiliki Kalavri, Boston University, and Sudarsun Kannan, Rutgers University

SelectiveEC: Selective Reconstruction in Erasure-coded Storage Systems

Liangliang Xu, Min Lyu, Qiliang Li, Lingjiang Xie, and Yinlong Xu, University of Science and Technology of China

Available Media

Erasure coding is a commonly used approach to provide high reliability at low storage cost. But skewed load within a recovery batch severely slows down the failure recovery process in storage systems. To this end, we propose a balanced scheduling module, SelectiveEC, which schedules reconstruction tasks out of order by dynamically selecting the stripes to be reconstructed in a batch and selecting source nodes and replacement nodes for each reconstruction task. It thereby balances network recovery traffic, computing resources, and disk I/Os when recovering from a single node failure in erasure-coded storage systems. Compared with conventional random reconstruction, SelectiveEC increases the parallelism of the recovery process by up to 106%, and by more than 97% on average, in our simulations. SelectiveEC therefore not only speeds up the recovery process but also reduces the interference of failure recovery with front-end applications.
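
The abstract implies balanced, out-of-order task selection; a simple greedy stand-in (almost certainly cruder than SelectiveEC's actual scheduling) might look like the following sketch, where k source reads are needed per stripe.

    from collections import Counter

    def schedule_batch(stripes, nodes, batch_size, k):
        """stripes: [(stripe_id, surviving_nodes)]. Pick batch_size stripes
        out of order and, per stripe, k least-loaded sources plus one
        least-loaded replacement node, balancing per-node load."""
        read_load, write_load = Counter(), Counter()
        tasks, pending = [], list(stripes)
        while pending and len(tasks) < batch_size:
            # Greedy: next, take the stripe whose cheapest k sources are the
            # least loaded right now (not the stripe at the head of the queue).
            sid, survivors = min(
                pending,
                key=lambda s: sum(sorted(read_load[n] for n in s[1])[:k]))
            pending.remove((sid, survivors))
            sources = sorted(survivors, key=lambda n: read_load[n])[:k]
            repl = min((n for n in nodes if n not in survivors),
                       key=lambda n: write_load[n])
            for n in sources:
                read_load[n] += 1
            write_load[repl] += 1
            tasks.append((sid, sources, repl))
        return tasks

    print(schedule_batch(stripes=[(0, ["n1", "n2", "n3"]),
                                  (1, ["n2", "n3", "n4"])],
                         nodes=["n1", "n2", "n3", "n4", "n5"],
                         batch_size=2, k=2))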

Rethinking WOM Codes to Enhance the Lifetime in New SSD Generations

Shehbaz Jaffer, Kaveh Mahdaviani, and Bianca Schroeder, University of Toronto

Available Media

New generations of Solid State Drives (SSDs) offer increased storage density with higher bits per cell, but an order of magnitude fewer Program and Erase (P/E) cycles. This decreases the number of times one can write to the SSD and, hence, the overall lifetime of the drive. One way of improving drive lifetime is to apply Write-Once Memory (WOM) codes, which can write on top of pre-existing data without erasing it. This increases the total logical data that can be written to the physical medium before an erase operation is required. Traditional WOM codes are not scalable and offer only up to a 50% increase in total writable logical data between any two erase operations. In this paper we present a novel, simple, and highly efficient family of WOM codes. Although our design is generic and could be applied to any N-level cell drive, we focus on QLC drives to demonstrate and evaluate the proposed scheme, and show that it can increase the total logical writable data before an erase to between 50% and 375% of the physical medium capacity, with varying storage overheads. Next, we argue that it is possible to further increase the total logical writable data between two erase operations, by up to 500%, with the help of a carefully chosen internal error-correcting code (ECC) already present in SSDs.
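
For readers unfamiliar with WOM codes, the classic Rivest-Shamir construction shows the mechanics: 2 bits are written twice into 3 binary cells using only 0-to-1 transitions, i.e., 4 logical bits per 3 cells between erases (a 33% gain). The paper's new family is different and reaches much further; this sketch is only the textbook baseline.

    GEN1 = {0b00: 0b000, 0b01: 0b100, 0b10: 0b010, 0b11: 0b001}  # first write
    GEN2 = {v: c ^ 0b111 for v, c in GEN1.items()}               # second write
    GEN1_INV = {c: v for v, c in GEN1.items()}
    GEN2_INV = {c: v for v, c in GEN2.items()}

    def second_write(cells, new_value):
        """Overwrite without erasing; assumes cells hold a first-write codeword."""
        if GEN1_INV[cells] == new_value:
            return cells                    # value unchanged: keep the codeword
        target = GEN2[new_value]
        assert cells & ~target == 0         # only 0 -> 1 flips are ever needed
        return target

    def decode(cells):                      # weight <= 1 means first generation
        return GEN1_INV[cells] if bin(cells).count("1") <= 1 else GEN2_INV[cells]

    c = GEN1[0b01]                          # write 01 -> cells 100
    c = second_write(c, 0b10)               # overwrite with 10 -> cells 101
    assert decode(c) == 0b10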

StripeFinder: Erasure Coding of Small Objects Over Key-Value Storage Devices (An Uphill Battle)

Umesh Maheshwari, Chiku Research

Available Media

Emerging key-value storage devices are promising because they rid the storage stack of the intervening block namespace and reduce I/O amplification. However, pushing the key-value interface down to the device level creates a challenge: erasure coding must be performed over key-value namespaces.

We expose a fundamental problem in employing parity-based erasure coding over key-value namespaces. Namely, the system must store a lot of per-stripe metadata, which includes the keys of all objects in the stripe. Furthermore, this metadata must be findable using the key of each object in the stripe.

A state-of-the-art design, KVMD, does not quantify this metadata overhead. We clarify that, when storing D data and P parity objects, KVMD stores D×P metadata objects, each of which stores D+P object keys. This nullifies the benefit of parity coding over replication in object count. For small objects, it might also nullify the benefit in byte count; e.g., to protect 256-byte objects with 16-byte keys against two failures (P=2), KVMD would cause byte amplification of 2.8x (D=4) and 3.3x (D=8), vs. 3x with plain replication.

We present an optimized version, StripeFinder, that reduces the metadata byte count by a factor of D and the metadata object count by a configurable factor; e.g., to protect 256-byte objects against two failures (P=2), StripeFinder reduces byte amplification to 2.2x (D=4) and 1.9x (D=8). However, StripeFinder does not provide enough savings for 128-byte objects to justify its complexity over replication. Ultimately, parity coding of small objects over key-value devices is an uphill battle, and tiny objects are best replicated.

Prefetching in Hybrid Main Memory Systems

Subisha V, Varun Gohil, and Nisarg Ujjainkar, Indian Institute of Technology Gandhinagar; Manu Awasthi, Ashoka University
Best Presentation Award Finalist

Available Media

The architecture of main memory has seen a paradigm shift in recent years, with non-volatile memory (NVM) technologies like Phase Change Memory (PCM) being incorporated into the hierarchy at the same level as DRAM. This is done either by splitting the memory address space across two or more memory technologies, or by using the faster technology with the higher lifetime, typically DRAM, as a cache for the higher-capacity, albeit slower, main memory made up of an NVM.

The design of such hybrid architectures remains an active area of research, particularly the design of the DRAM cache, which can quickly become the bottleneck since cache lookups require multiple DRAM accesses to read tag and data. In this paper, we argue for a hybrid memory hierarchy where DRAM serves as a cache for an NVM, and we present a novel DRAM cache prefetcher that builds on state-of-the-art Alloy Cache architectures, allows caching data at both cacheline and page granularities, and, as a result, provides up to a 2× performance improvement over a state-of-the-art baseline.

It's Time to Revisit LRU vs. FIFO

Ohad Eytan, Danny Harnik, and Effi Ofer, IBM Research; Roy Friedman, Technion; Ronen Kat, IBM Research

Available Media

We revisit the question of the effectiveness of the popular LRU cache eviction policy versus the FIFO heuristic, which attempts to approximate LRU's behavior. Several past works have considered this question and commonly stipulated that while FIFO is much easier to implement, the improved hit ratio of LRU outweighs this. We claim that two main trends call for a reevaluation: new caches, such as front-ends to cloud storage, operate at very large scales, making it no longer feasible to manage cache metadata in RAM; and new workloads have emerged that possess different characteristics.

We model the overall cost of running LRU and FIFO in a very large scale cache and evaluate this cost using a number of publicly available traces. Our main evaluation workload is a new set of traces that we collected from a large public cloud object storage service, and on these new traces FIFO exhibits better overall cost than LRU. We hope that these observations reignite the evaluation of cache eviction policies under new circumstances, and that the new traces, which we intend to make public, serve as a testing ground for such work.
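
The hit-ratio side of such a comparison is easy to reproduce on any trace; a toy version is below. Note where the two policies differ operationally: LRU mutates metadata on every hit, FIFO only on misses, which is exactly what makes FIFO attractive once cache metadata no longer fits in RAM.

    from collections import OrderedDict, deque

    def lru_hits(trace, capacity):
        cache, hits = OrderedDict(), 0
        for k in trace:
            if k in cache:
                hits += 1
                cache.move_to_end(k)           # metadata update on *every* hit
            else:
                cache[k] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
        return hits

    def fifo_hits(trace, capacity):
        cache, order, hits = set(), deque(), 0
        for k in trace:
            if k in cache:
                hits += 1                      # no metadata update on hits
            else:
                cache.add(k); order.append(k)
                if len(cache) > capacity:
                    cache.discard(order.popleft())  # evict oldest insertion
        return hits

    trace = [1, 2, 3, 1, 2, 4, 1, 2, 5, 3] * 3
    print(lru_hits(trace, 3), fifo_hits(trace, 3))

The paper's cost model additionally charges each policy for the RAM and I/O its bookkeeping requires at cloud scale, which is where FIFO pulls ahead.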

Processing in Storage Class Memory

Joel Nider, Craig Mustard, Andrada Zoltan, and Alexandra Fedorova, University of British Columbia
Outstanding New Research Direction Award Finalist

Available Media

Storage and memory technologies are experiencing unprecedented transformation. Storage-class memory (SCM) delivers near-DRAM performance in non-volatile storage media and became commercially available in 2019. Unfortunately, software is not yet able to fully benefit from such high-performance storage. Processing-in-memory (PIM) aims to overcome the notorious memory wall; at the time of writing, hardware is close to being commercially available. This paper takes the position that PIM will become an integral part of future storage-class memories, so that data processing can be performed in-storage, saving memory bandwidth and CPU cycles. Under that assumption, we identify typical data-processing tasks poised for in-storage processing, such as compression, encryption, and format conversion. We present evidence supporting our assumption and report feasibility experiments on new PIM hardware that show the potential.

10:15 am–11:00 am

Break

11:00 am–12:00 pm

Joint Keynote Address with HotCloud '20

Systems and ML at RISELab

Ion Stoica, University of California at Berkeley

Available Media

In this talk, I will present several of the projects we are developing at RISELab, a three-year-old lab at UC Berkeley that focuses on building platforms and algorithms for real-time intelligent decisions, decisions that are secure and explainable. These projects include both systems that better support machine learning (ML) workloads and systems that leverage ML to build better systems. In the first category, I will present Ray, a general-purpose distributed system which provides both task-parallel and actor abstractions. Ray already supports several popular libraries, including a reinforcement learning library (RLlib) and a hyperparameter search library (Tune), and it is deployed in production at tens of organizations. In the second category, I will present Autopandas, a system that synthesizes snippets of API calls from input-output examples for Pandas, the most popular data science library today, and NeuroCuts, a tool to generate decision trees that implement network packet classifiers.

Ion Stoica, University of California, Berkeley

Ion Stoica is a Professor in the EECS Department at the University of California, Berkeley, and the Director of RISELab. He is currently doing research on cloud computing and AI systems. Past work includes Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow and has received numerous awards, including the ACM SIGOPS Mark Weiser Award (2019), the SIGOPS Hall of Fame Award (2015), the SIGCOMM Test of Time Award (2011), and the ACM Doctoral Dissertation Award (2001). He is Executive Chairman at Databricks, a company he co-founded in 2013 to commercialize Apache Spark. In 2006 he also co-founded Conviva, a startup commercializing technologies for large-scale video distribution.

12:15 pm–1:30 pm

Poster Session

Tuesday, July 14, 2020

7:00 am–8:30 am

Key-Value Stores, Databases, and Misc

Session Chairs: Neeraja J. Yadwadkar, Stanford University, and Miguel Matos, INESC-ID and Universidade de Lisboa

CompoundFS: Compounding I/O Operations in Firmware File Systems

Yujie Ren, Rutgers University; Jian Zhang, ShanghaiTech University; Sudarsun Kannan, Rutgers University

Available Media

We introduce CompoundFS, a firmware-level file system that combines multiple file-system I/O operations into a single compound operation to reduce software overheads: frequent user-kernel interactions (e.g., system calls), data copies, and VFS overheads between user-level applications and the storage stack. Further, to exploit the compute capability of modern storage, CompoundFS also provides the ability to offload simple I/O data-processing operations to device-level CPUs, which provides a further opportunity to reduce interaction with the file system, avoid data movement, and free up the host CPU for other operations. Preliminary evaluation of CompoundFS against state-of-the-art user-level, kernel-level, and firmware-level file systems using microbenchmarks and a real-world application shows performance gains of up to 178% and 75%, respectively.
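
A host-side sketch of the compounding idea, with an invented request type and a local file read standing in for the firmware path (the abstract does not show CompoundFS's real interface): what would be several syscalls, open, seek, read, checksum, close, collapses into one submitted operation.

    from dataclasses import dataclass
    from typing import Optional
    import os, tempfile, zlib

    @dataclass
    class CompoundOp:                   # one request instead of several syscalls
        path: str
        offset: int
        length: int
        post_op: Optional[str] = None   # e.g. "crc32", offloaded to device CPUs

    def submit(op):
        """Stub for a single crossing into the firmware: the device would read
        the extent and run post_op near the data, returning only the result."""
        with open(op.path, "rb") as f:  # stand-in for the firmware's own I/O
            f.seek(op.offset)
            data = f.read(op.length)
        return zlib.crc32(data) if op.post_op == "crc32" else data

    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello world"); os.close(fd)
    print(submit(CompoundOp(path, offset=0, length=5, post_op="crc32")))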

Can We Store the Whole World's Data in DNA Storage?

Bingzhe Li, Nae Young Song, Li Ou, and David H.C. Du, University of Minnesota, Twin Cities

Available Media

The total amount of data in the world has been increasing rapidly. However, storage capacity is growing much more slowly than data generation. How to store and archive such a huge amount of data is becoming critical and challenging. Synthetic Deoxyribonucleic Acid (DNA) storage is one of the promising candidates, offering high density and long-term preservation for archival storage systems. Existing work has focused on the feasibility of storing small amounts of data in DNA. In this paper, we investigate the scalability and potential of DNA storage when a huge amount of data, like all the available data in the world, is to be stored. First, we investigate the storage capacity that can be achieved in a single DNA pool/tube based on current and future technologies. Then, we explore the indexing of DNA storage. Finally, we also investigate the metadata overhead based on future technology trends.

Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs

Konstantinos Kanellis, Ramnatthan Alagappan, and Shivaram Venkataraman, University of Wisconsin–Madison
Best Presentation Award Finalist

Available Media

To achieve high performance, recent research has shown that it is important to automatically tune the configuration knobs present in database systems. However, as database systems usually have hundreds of knobs, auto-tuning frameworks spend a significant amount of time exploring the large configuration space and need to repeat this as workloads change. Given this challenge, we ask a more fundamental question: how many knobs do we need to tune in order to achieve good performance? Surprisingly, we find that with YCSB workload-A on Cassandra, tuning just five knobs can achieve 99% of the performance achieved by the best configuration obtained by tuning many knobs. We also show that our results hold across workloads and apply to other systems like PostgreSQL, motivating the need for tools that can automatically filter out the knobs that need to be tuned. Based on our results, we propose an initial design for accelerating auto-tuners and detail some future research directions.

Neural Trees: Using Neural Networks as an Alternative to Binary Comparison in Classical Search Trees

Douglas Santry, NetApp

Available Media

Binary comparison, the basis of the venerable B Tree, is perhaps the most successful operator for indexing data on secondary storage. We introduce a different technique, called Neural Trees, that is based on neural networks. Neural Trees increase the fan-out per byte of a search tree by up to 40% compared to B Trees. Increasing fan-out reduces memory demands and leads to increased cacheability while decreasing height and media accesses. A Neural Tree also permits search path layout policies that divorce a key's value from its physical location in the data structure. This is an advantage over the total ordering required by binary comparison, which totally determines the physical location of keys in a tree. Previous attempts to apply machine learning to indices are based on learning the data directly, which renders insertion too expensive to support. The Neural Tree is a hybrid scheme that uses a hierarchy (tree) of small neural networks to learn search paths instead of the data directly. Neural Trees can efficiently handle a general read/write workload. We evaluate Neural Trees with weeks of production storage traces and SPC-1 workloads to demonstrate their viability.
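
A drastically simplified sketch of the routing idea: inner nodes hold a tiny learned model that maps a key to a child index instead of comparing against pivot keys. The single linear unit per node below is our stand-in for the paper's small neural networks, and the "trained" weights are hand-picked for illustration.

    class Node:
        def __init__(self, w, b, children):
            self.w, self.b, self.children = w, b, children  # learned router

        def route(self, key):
            score = self.w * key + self.b                   # tiny "network"
            idx = max(0, min(len(self.children) - 1, int(score)))
            return self.children[idx]

    class Leaf:
        def __init__(self, entries):                        # (key, value) pairs
            self.entries = dict(entries)

    def lookup(node, key):
        while isinstance(node, Node):                       # descend by routing,
            node = node.route(key)                          # not by comparison
        return node.entries.get(key)

    # A router that (hypothetically) learned to send keys 0-99 left, 100+ right.
    root = Node(w=0.01, b=0.0, children=[Leaf([(5, "a")]), Leaf([(150, "b")])])
    assert lookup(root, 150) == "b"

Because the router is learned rather than a total order, keys can be laid out wherever the layout policy prefers, which is the property the abstract highlights.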

SplitKV: Splitting IO Paths for Different Sized Key-Value Items with Advanced Storage Devices

Shukai Han, Dejun Jiang, and Jin Xiong, SKL Computer Architecture, ICT, CAS, University of Chinese Academy of Sciences

Available Media

Modern advanced storage devices, such as NVMe SSDs and non-volatile memory-based persistent memory (PM), provide different access characteristics as data size varies. Existing key-value stores adopt a unified I/O path for all key-value items, which cannot fully exploit the advantages of these different devices. In this paper, we propose to split the I/O paths for different-sized key-value items: small key-value items are written directly to PM and later migrated to SSD, while large key-value items are written directly to SSD. We present and discuss design choices for the challenges of splitting I/O paths. We build SplitKV, a key-value store prototype, to show the benefits of this approach. Preliminary results show that SplitKV outperforms RocksDB, KVell, and NoveLSM under workloads mixing small and large KV items.
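
A minimal sketch of the split-path routing, with an illustrative size threshold and plain dictionaries standing in for the PM log and the SSD store; the real design's migration policy is certainly more elaborate.

    SMALL = 512        # bytes; assumed PM-friendly cutoff, not the paper's value

    class SplitKV:
        def __init__(self, pm_log, ssd_store, migrate_at=1024):
            self.pm, self.ssd, self.migrate_at = pm_log, ssd_store, migrate_at

        def put(self, key, value):
            if len(value) <= SMALL:
                self.pm[key] = value          # byte-addressable, low latency
                if len(self.pm) >= self.migrate_at:
                    self._migrate()
            else:
                self.ssd[key] = value         # block-friendly large write

        def get(self, key):
            v = self.pm.get(key)              # recent small items first
            return v if v is not None else self.ssd.get(key)

        def _migrate(self):
            self.ssd.update(self.pm)          # batch small items down to SSD
            self.pm.clear()

    kv = SplitKV(pm_log={}, ssd_store={})
    kv.put("a", b"x" * 100); kv.put("b", b"x" * 4096)
    assert kv.get("a") == b"x" * 100 and kv.get("b") == b"x" * 4096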

In support of workload-aware streaming state management

Vasiliki Kalavri, Boston University; John Liagouris, Boston University & Hariri Institute for Computing
Outstanding New Research Direction Award Winner!
Best Presentation Award Finalist

Available Media

Modern distributed stream processors predominantly rely on LSM-based key-value stores to manage the state of long-running computations. We question the suitability of such general-purpose stores for streaming workloads and argue that they incur unnecessary overheads in exchange for state management capabilities. Since streaming operators are instantiated once and are long-running, state types, sizes, and access patterns can either be inferred at compile time or learned during execution. This paper surfaces the limitations of established practices for streaming state management and advocates for configurable streaming backends tailored to the state requirements of each operator. Using workload-aware state management, we achieve an order of magnitude improvement in p99 latency and 2x higher throughput.

Re-think Data Management Software Design Upon the Arrival of Storage Hardware with Built-in Transparent Compression

Ning Zheng, ScaleFlux Inc.; Xubin Chen, RPI; Jiangpeng Li, Qi Wu, Yang Liu, Yong Peng, Fei Sun, and Hao Zhong, ScaleFlux Inc.; Tong Zhang, ScaleFlux Inc. and RPI
Outstanding New Research Direction Award Finalist

Available Media

This position paper advocates that storage hardware with built-in transparent compression brings promising but largely unexplored opportunities to innovate in data storage management software (e.g., databases and file systems). Modern storage appliances (e.g., all-flash arrays) and some of the latest SSDs (solid-state drives) can perform data compression transparently to the OS and user applications. Such storage hardware decouples logical storage space utilization from true physical storage space utilization. This allows data storage management software to intentionally waste logical storage space in return for employing simpler data structures, leading to lower implementation complexity and higher performance. Following this theme, we carried out three preliminary case studies in the context of relational databases and key-value (KV) stores. Initial experimental results demonstrate the promising potential, and it is our hope that this preliminary study will attract more interest and activity in exploring this new research area.

8:30 am–9:00 am

Break

9:00 am–10:15 am

File Systems and Cloud Scale

Session Chairs: Amy Tai, VMware Research, and Avani Wildani, Emory University

Designing a Storage Software Stack for Accelerators

Shinichi Awamoto, NEC Labs Europe; Erich Focht, NEC Deutschland; Michio Honda, University of Edinburgh

Available Media

Although modern accelerator devices, such as vector engines and SmartNICs, are equipped with general-purpose CPUs, their storage accesses must be mediated by the host kernel and host CPUs, resulting in latency and throughput penalties. In this paper, we explore the case for direct storage access from accelerator applications, and discuss the problems, design options, and benefits of this architecture. We demonstrate that our architecture can improve LevelDB throughput by 12–89% and reduce execution time by 33–46% in a bioinformatics application, compared to a baseline where the host system mediates storage accesses.

A file system for safely interacting with untrusted USB flash drives

Ke Zhong, University of Pennsylvania; Zhihao Jiang and Ke Ma, Shanghai Jiao Tong University; Sebastian Angel, University of Pennsylvania
Outstanding New Research Direction Award Finalist

Available Media

This paper introduces RBFuse, a system for interacting with USB flash drives safely in commodity OSes while bypassing the complex and bug-prone USB stack on the host. RBFuse prevents attacks in which malicious USB flash drives exploit low-level USB driver vulnerabilities to compromise the host machine. The simple idea behind RBFuse is to remap the USB stack to a virtual machine and export the flash drive’s file system as a mountable virtual file system. The result of this decomposition is that the host can access all the files in the flash drive without speaking USB. This is beneficial from a security standpoint, since the VFS interface is small, has well-defined semantics, and can be formally verified. RBFuse requires no hardware modifications and introduces low overhead.

Understanding and Finding Crash-Consistency Bugs in Parallel File Systems

Jinghan Sun, Chen Wang, Jian Huang, and Marc Snir, University of Illinois at Urbana-Champaign

Available Media

Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash-consistency bugs have not been extensively studied, and the corresponding bug-finding or testing tools are lacking. In this paper, we first conduct a thorough bug study on popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and interactions with local file systems. The study results drive our design of a scalable testing framework, named PFSCheck. PFSCheck is easy to use and has low performance overhead, as it can automatically generate test cases for triggering potential crash-consistency bugs and trace essential file operations cheaply. PFSCheck scales to large HPC clusters, as it can exploit parallelism to facilitate the verification of persistent storage states.

The Case for Benchmarking Control Operations in Cloud Native Storage

Alex Merenstein, Stony Brook University; Vasily Tarasov, Ali Anwar, Deepavali Bhagwat, Lukas Rupprecht, and Dimitris Skourtis, IBM Research–Almaden; Erez Zadok, Stony Brook University

Available Media

Storage benchmarking tools and methodologies today suffer from a glaring gap: they are incomplete because they omit storage control operations, such as volume creation and deletion, snapshotting, and volume reattachment and resizing. Control operations are becoming a critical part of cloud storage systems, especially in containerized environments like Kubernetes, where such operations can be executed by regular non-privileged users. While plenty of tools exist that simulate realistic data and metadata workloads, control operations are largely overlooked by the community, and existing storage benchmarks do not generate them. Therefore, for cloud native environments, modern storage benchmarks fall short of serving their main purpose: holistic and realistic performance evaluation. Different storage provisioning solutions implement control operations in different ways, which means we need a unified storage benchmark to contrast and comprehend their performance and expected behaviors. In this position paper, we motivate the need for a cloud native storage benchmark by demonstrating the effect of control operations on storage provisioning solutions and workloads. We identify the challenges and requirements in implementing such a benchmark and present our initial ideas for its design.

Could cloud storage be disrupted in the next decade?

Andromachi Chatzieleftheriou, Ioan Stefanovici, Dushyanth Narayanan, Benn Thomsen, and Antony Rowstron, Microsoft Research
Best Presentation Award Finalist

Available Media

If you were asked today, "What are the three dominant persistent storage technologies used in the cloud in 2020?", you would probably answer HDD, flash, and tape. If you had been asked this question in 2010, you would probably have answered HDD, flash, and tape. Will your answer change when you are asked this question in 2030?

MicroMon: A Monitoring Framework for Tackling Distributed Heterogeneity

Baber Khalid, Rutgers University; Nolan Rudolph and Ramakrishnan Durairajan, University of Oregon; Sudarsun Kannan, Rutgers University

Available Media

We present MicroMon, a multi-dimensional monitoring framework for geo-distributed applications running on heterogeneous hardware. In MicroMon, we introduce micrometrics, a set of fine-grained hardware and software metrics required to study the combined impact of heterogeneous resources on application performance. Besides collecting micrometrics, MicroMon uses anomaly reports and a concerted effort between programmable switches and host OSes to reduce the overhead of collecting and disseminating thousands of micrometrics over the WAN. We evaluate the MicroMon prototype on Cassandra deployed across multiple data centers and show 10–50% throughput gains in a geo-distributed setting with storage and network heterogeneity.

10:15 am–11:00 am

Break

11:00 am–12:15 pm

Joint Keynote Address with HotCloud '20

From Hyper Converged Infrastructure to Hybrid Cloud Infrastructure

Karan Gupta, Nutanix

Available Media

This talk will cover Nutanix's journey as it transitioned from a pioneer in Hyper Converged Infrastructure to a strong contender in Hybrid Cloud Infrastructure. I will draw on examples from my seven years of experience building distributed storage systems and LSM-based key-value stores at Nutanix. I will describe challenges faced by customers and engineers on the ground, and briefly touch on the challenges I see on the horizon for hybrid cloud infrastructure.

Karan Gupta, Nutanix

Karan Gupta is the Principal Architect at Nutanix and has over 20 years of experience in distributed file systems. He has designed and led the evolution of hyper converged storage infrastructure for all tiers (performance and scale) of workloads. He has published multiple papers on LSM-based key-value stores and won the Best Paper award at ATC '19. At Nutanix, he is leading the charter to build geo-distributed federated object stores for the hybrid cloud world. He started his journey in distributed systems at IBM Research at Almaden.