# HotCloud '19 Workshop Program

All sessions will be held in Grand Ballroom VII–IX unless otherwise noted.

Papers are available for download below to registered attendees now and to everyone beginning July 8, 2019. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].

Attendee Files
HotCloud '19 Attendee List (PDF)
HotCloud '19 Paper Archive (ZIP)
View mode:

## Continental Breakfast

Grand Ballroom Prefunction

## Opening Remarks

Program Co-Chairs: Christina Delimitrou, Cornell University, and Dan R. K. Ports, Microsoft Research

## The Seven Sins of Personal-Data Processing Systems under GDPR

Supreeth Shastri, Melissa Wasserman, and Vijay Chidambaram, University of Texas at Austin

Available Media

In recent years, our society is being plagued by unprecedented levels of privacy and security breaches. To rein in this trend, the European Union, in 2018, introduced a comprehensive legislation called the General Data Protection Regulation (GDPR). In this paper, we review GDPR from a system design perspective, and identify how its regulations conflict with the design, architecture, and operation of modern systems. We illustrate these conflicts via the seven GDPR sins: storing data forever; reusing data indiscriminately; walled gardens and black markets; risk-agnostic data processing; hiding data breaches; making unexplainable decisions; treating security as a secondary goal. Our findings reveal a deep-rooted tussle between GDPR requirements and how modern systems have evolved. We believe that achieving compliance requires comprehensive, grounds up solutions, and anything short would amount to fixing a leaky faucet in a sinking ship.

## NetWarden: Mitigating Network Covert Channels without Performance Loss

Jiarong Xing, Adam Morrison, and Ang Chen, Rice University

Available Media

Network covert channels are an advanced threat to the security and privacy of cloud systems. One common limitation of existing defenses is that they all come at the cost of performance. This presents significant barriers to their practical deployment in high-speed networks. We sketch the design of NetWarden, a novel defense whose key design goal is to preserve TCP performance while mitigating covert channels. The use of programmable data planes makes it possible for NetWarden to adapt defenses that were only demonstrated before as proof of concept, and apply them at linespeed. Moreover, NetWarden uses a set of performance boosting techniques to temporarily increase the performance of connections that have been affected by channel mitigation, with the ultimate goal of neutralizing its impact on performance. Our simulation provides initial evidence that NetWarden can mitigate several covert channels with little performance disturbance. As ongoing work, we are working on a full system design and implementation of NetWarden.

## A Double-Edged Sword: Security Threats and Opportunities in One-Sided Network Communication

Shin-Yeh Tsai and Yiying Zhang, Purdue University

Available Media

One-sided network communication technologies such as RDMA and NVMe-over-Fabrics are quickly gaining adoption in production software and in datacenters. Although appealing for their low CPU utilization and good performance, they raise new security concerns that could seriously undermine datacenter software systems building on top of them. At the same time, they offer unique opportunities to help enhance security. Indeed, one-sided network communication is a double-edged sword in security. This paper presents our insights into security implications and opportunities of one-sided communication.

## DynaShield: Reducing the Cost of DDoS Defense using Cloud Services

Shengbao Zheng and Xiaowei Yang, Duke University

Available Media

Fueled by IoT botnets and DDoS-as-a-Service tools, distributed denial of service (DDoS) attacks have reached record high volumes. Although there exist DDoS protection services, they can be costly for small organizations as well as individual users. In this paper, we present a low-cost DDoS solution, DynaShield, which a user can deploy at common cloud service providers. DynaShield employs three techniques to reduce cost. First, it uses an on-demand model. A server dynamically updates its DNS record to redirect clients’ traffic to DynaShield when it is under attack, avoiding paying for cloud services during peacetime. Second, DynaShield combines serverless functions and elastic servers provided by cloud providers to auto-scale to large attacks without overprovisioning. Third, DynaShield uses cryptocurrency puzzles as proof of work. The coin mining profit can further offset a protected server’s cloud service charges. Our preliminary evaluation suggests that DynaShield can cost as little as a few dollars per month to prevent an organization from common DDoS attacks.

## Deep Enforcement: Policy-based Data Transformations for Data in the Cloud

Ety Khaitzin, Julian James Stephen, Maya Anderson, Hani Jamjoom, Ronen Kat, and Arjun Natarajan, IBM Research; Roger Raphael, IBM Cloud; Roee Shlomo and Tomer Solomon, IBM Research

Available Media

Despite the growing collection and use of private data in the cloud, there remains a fundamental disconnect between unified data governance and the storage system enforcement techniques. On one side, high-level governance policies derived from regulations like General Data Protection Regulation (GDPR) have emerged with stricter rules dictating who, when and how data can be processed. On the other side, storage-level controls, both role- or attribute-based, continue to focus on access/deny enforcement. In this paper, we propose how to bridge this gap. We introduce Deep Enforcement, a system that provides unified governance and transformation policies coupled with data transformations embedded into the storage fabric to achieve policy compliance. Data transformations can vary in complexity, from simple redactions to complex differential privacy-based techniques to provide the required amount of anonymization. We show how this architecture can be implemented into two broad classes of data storage systems in the cloud: object storages and SQL databases. Depending on the complexity of the transformation, we also demonstrate how to implement them either in-line (on data access) or off-line (creating an alternate cached dataset).

## Break with Refreshments

Grand Ballroom Prefunction

## Jumpgate: In-Network Processing as a Service for Data Analytics

Craig Mustard, Fabian Ruffy, Anny Gakhokidze, Ivan Beschastnikh, and Alexandra Fedorova, University of British Columbia

Available Media

In-network processing, where data is processed by special-purpose devices as it passes over the network, is showing great promise at improving application performance, in particular for data analytics tasks. However, analytics and in-network processing are not yet integrated and widely deployed. This paper presents a vision for providing in-network processing as a service to data analytics frameworks, and outlines benefits, remaining challenges, and our current research directions towards realizing this vision.

## Just In Time Delivery: Leveraging Operating Systems Knowledge for Better Datacenter Congestion Control

Amy Ousterhout and Adam Belay, MIT CSAIL; Irene Zhang, Microsoft Research

Available Media

Network links and server CPUs are heavily contended resources in modern datacenters. To keep tail latencies low, datacenter operators drastically overprovision both types of resources today, and there has been significant research into effectively managing network traffic and CPU load. However, this work typically looks at the two resources in isolation.

In this paper, we make the observation that, in the datacenter, the allocation of network and CPU resources should be co-designed for the most efficiency and the best response times. For example, while congestion control protocols can prioritize traffic from certain flows, this provides no benefit if the traffic arrives at an overloaded server that will only queue the request.

This paper explores the potential benefits of such a co-designed resource allocator and considers the recent work in both CPU scheduling and congestion control that is best suited to such a system. We propose a Chimera, a new datacenter OS that integrates a receiver-based congestion control protocol with OS insight into application queues, using the recent Shenango operating system.

## Designing an Efficient Replicated Log Store with Consensus Protocol

Jung-Sang Ahn, Woon-Hak Kang, Kun Ren, Guogen Zhang, and Sami Ben-Romdhane, eBay Inc.

Available Media

Highly available and high-performance message logging system is critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues such as consistency between replicas after failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but double-write problem and different logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of Raft consensus protocol, aiming at providing the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.

## Tackling Parallelization Challenges of Kernel Network Stack for Container Overlay Networks

Jiaxin Lei, SUNY at Binghamton; Kun Suo, The University of Texas at Arlington; Hui Lu, SUNY at Binghamton; Jia Rao, The University of Texas at Arlington

Available Media

Overlay networks are the de facto networking technique for providing flexible, customized connectivity among distributed containers in the cloud. However, overlay networks also incur non-trivial overhead due to its complexity, resulting in significant network performance degradation of containers. In this paper, we perform a comprehensive empirical performance study of container overlay networks which identifies unrevealed, important parallelization bottlenecks of the kernel network stack that prevent container overlay networks from scaling. Our observations and root cause analysis cast light on optimizing the network stack of modern operating systems on multi-core systems to more efficiently support container overlay networks in light of high-speed network devices.

## A Black-Box Approach for Estimating Utilization of Polled IO Network Functions

Harshit Gupta, Georgia Institute of Technology; Abhigyan Sharma, Alex Zelezniak, and Minsung Jang, AT&T Labs Research; Umakishore Ramachandran, Georgia Institute of Technology

Available Media

Cloud management tasks such as performance diagnosis, workload placement, and power management depend critically on estimating the utilization of an application. But, it is challenging to measure actual utilization for polled IO network functions (NFs) without code instrumentation. We ask if CPU events (e.g., data cache misses) measured using hardware performance counters are good at estimating utilization for polled-IO NFs. We find a strong correlation between several CPU events and NF utilization for three representative types of network functions. Inspired by this finding, we explore the possibility of computing a universal estimation function that maps selected CPU events to NF utilization estimates for a wide-range of NFs, traffic profiles and traffic loads. Our NF-specific estimators and universal estimators achieve absolute estimation errors below 6% and 10% respectively.

Olympic Pavilion

## DLion: Decentralized Distributed Deep Learning in Micro-Clouds

Rankyung Hong and Abhishek Chandra, University of Minnesota

Available Media

Deep learning is a popular technique for building inference models and classifiers from large quantities of input data for applications in many domains. With the proliferation of edge devices such as sensor and mobile devices, large volumes of data are generated at rapid pace all over the world. Migrating large amounts of data into centralized data center(s) over WAN environments is often infeasible due to cost, performance or privacy reasons. Moreover, there is an increasing need for incremental or online deep learning over newly generated data in real-time. These trends require rethinking of the traditional training approach to deep learning. To handle the computation on distributed input data, micro-clouds, small-scale clouds deployed near edge devices in many different locations, provide an attractive alternative for data locality reasons. However, existing distributed deep learning systems do not support training in micro-clouds, due to the unique characteristics and challenges in this environment. In this paper, we examine the key challenges of deep learning in micro-clouds: computation and network resource heterogeneity at inter- and intra micro-cloud levels and their scale. We present DLion, a decentralized distributed deep learning system for such environments. It employs techniques specifically designed to address the above challenges to reduce training time, enhance model accuracy, and provide system scalability. We have implemented a prototype of DLion in TensorFlow and our preliminary experiments show promising results towards achieving accurate and efficient distributed deep learning in micro-clouds.

## The Case for Unifying Data Loading in Machine Learning Clusters

Aarati Kakaraparthy, University of Wisconsin, Madison & Microsoft Gray Systems Lab, Madison; Abhay Venkatesh, University of Wisconsin, Madison; Amar Phanishayee, Microsoft Research, Redmond; Shivaram Venkataraman, University of Wisconsin, Madison

Available Media

Training machine learning models involves iteratively fetching and pre-processing batches of data. Conventionally, popular ML frameworks implement data loading within a job and focus on improving the performance of a single job. However, such an approach is inefficient in shared clusters where multiple training jobs are likely to be accessing the same data and duplicating operations. To illustrate this, we present a case study which reveals that for hyper-parameter tuning experiments we can reduce up to 89% I/O and 97% pre-processing redundancy.

Based on this observation, we make the case for unifying data loading in machine learning clusters by bringing the isolated data loading systems together into a single system. Such a system architecture can remove the aforementioned redundancies that arise due to the isolation of data loading in each job. We introduce OneAccess, a unified data access layer and present a prototype implementation that shows a 47.3% improvement in I/O cost when sharing data across jobs. Finally we discuss open research challenges in designing and developing a unified data loading layer that can run across frameworks on shared multi-tenant clusters, including how to handle distributed data access, support diverse sampling schemes, and exploit new storage media.

## Accelerating Deep Learning Inference via Freezing

Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, and Aditya Akella, University of Wisconsin, Madison

Available Media

Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models leading to higher prediction latencies. Prior efforts to reduce this latency such as quantization, model distillation, and any-time prediction models typically trade-off accuracy for performance. In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests. We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18. We present Freeze Inference, a system that introduces approximate caching at each intermediate layer and we discuss techniques to reduce the cache size and improve the cache hit rate. Finally, we discuss some of the open research challenges in realizing such a design.

## Characterization and Prediction of Performance Interference on Mediated Passthrough GPUs for Interference-aware Scheduler

Xin Xu, Na Zhang, and Michael Cui, VMware Inc; Michael He, The University of Texas at Austin; Ridhi Surana, VMware Inc

Available Media

Sharing GPUs in the cloud is cost effective and can facilitate the adoption of hardware accelerator enabled cloud. Butsharing causes interference between co-located VMs andleads to performance degradation. In this paper, we proposedan interference-aware VM scheduler at the cluster level withthe goal of minimizing interference. NVIDIA vGPU pro-vides sharing capability and high performance, but it has unique performance characteristics, which have not been studied thoroughly before. Our study reveals several key ob-servations. We leverage our observations to construct modelsbased on machine learning techniques to predict interferencebetween co-located VMs on the same GPU. We proposed a system architecture leveraging our models to schedule VMs to minimize the interference. The experiments show that our observations improves the model accuracy (by 15% ̃ 40%) and the scheduler reduces application run-time overhead by 24.2% in simulated scenarios.

## Happiness index: Right-sizing the cloud’s tenant-provider interface

Vojislav Dukic and Ankit Singla, ETH Zurich

Available Media

Cloud providers and their tenants have a mutual interest in identifying optimal configurations in which to run tenant jobs, i.e., ones that achieve tenants' performance goals at minimum cost; or ones that maximize performance within a specified budget. However, different tenants may have different performance goals that are opaque to the provider. A consequence of this opacity is that providers today typically offer fixed bundles of cloud resources, which tenants must themselves explore and choose from. This is burdensome for tenants and can lead to choices that are sub-optimal for both parties.

We thus explore a simple, minimal interface, which lets tenants communicate their happiness with cloud infrastructure to the provider, and enables the provider to explore resource configurations that maximize this happiness. Our early results indicate that this interface could strike a good balance between enabling efficient discovery of application resource needs and the complexity of communicating a full description of tenant utility from different configurations to the provider.

## The True Cost of Containing: A gVisor Case Study

Ethan G. Young, Pengfei Zhu, Tyler Caraza-Harter, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison

Available Media

We analyze many facets of the performance of gVisor, a new security-oriented container engine that integrates with Docker and backs Google’s serverless platform. We explore the effect gVisor’s in-Sentry network stack has on network throughput as well as the overheads of performing all file opens via gVisor’s Gofer service. We further analyze gVisor startup performance, memory efficiency, and system-call overheads. Our findings have implications for the future design of similar hypervisor- based container engines.

## Carving Perfect Layers out of Docker Images

Dimitris Skourtis, Lukas Rupprecht, Vasily Tarasov, and Nimrod Megiddo, IBM Research

Available Media

Container frameworks such as Docker have changed the way developers package, share, and deploy applications. Container images form a cornerstone of this new model. Images contain applications and their required runtime dependencies and can be easily versioned, stored, and shared via centralized registry services. The ease of creating and sharing images has led to a rapid adoption of container technology but registries are now faced with the task of efficiently storing and serving millions of images. While in theory identical parts of Docker images can be shared across images and stored only once as layers, in practice this provides limited benefits as layers are rarely fully identical.

In this paper, we argue that Docker image layers should be reorganized in order to maximize their overlap and, thereby, reduce storage and network consumption. This argument is based on the observation that many layers only differ in a small number of files but would otherwise be identical. We present a set of design challenges that need to be solved when realizing such a reorganization approach, e.g., how to optimally reorganize layers, how to deal with the scale of current registries, and how to integrate the approach into the container image lifecycle. We then present a mathematical formulation of the problem and evaluate it on a set of real Docker images. Our preliminary results provide storage savings of 2:3x, and indicate that promising network savings are possible.

## Break with Refreshments

Grand Ballroom Prefunction

## Bridging the Edge-Cloud Barrier for Real-time Advanced Vision Analytics

Yiding Wang, Weiyan Wang, and Junxue Zhang, HKUST; Junchen Jiang, University of Chicago; Kai Chen, HKUST

Available Media

Advanced vision analytics plays a key role in a plethora of real-world applications. Unfortunately, many of these applications fail to leverage the abundant compute resource in cloud services, because they require high computing resources {\em and} high-quality video input, but the (wireless) network connections between visual sensors (cameras) and the cloud/edge servers do not always provide sufficient and stable bandwidth to stream high-fidelity video data in real time.

This paper presents CloudSeg, an edge-to-cloud framework for advanced vision analytics that co-designs the cloud-side inference with real-time video streaming, to achieve both low latency and high inference accuracy. The core idea is to send the video stream in low resolution, but recover the high-resolution frames from the low-resolution stream via a {\em super-resolution} procedure tailored for the actual analytics tasks. In essence, CloudSeg trades additional cloud-side computation (super-resolution) for significantly reduced network bandwidth. Our initial evaluation shows that compared to previous work, CloudSeg can reduce bandwidth consumption by $\sim$6.8$\times$ with negligible drop in accuracy.

## NLUBroker: A Flexible and Responsive Broker for Cloud-based Natural Language Understanding Services

Lanyu Xu, Wayne State University; Arun Iyengar, IBM T.J. Watson Research Center; Weisong Shi, Wayne State University

Available Media

Cloud-based Natural Language Understanding (NLU) services are getting more and more popular with the development of artificial intelligence. More applications are integrated with cloud-based NLU services to enhance the way people communicate with machines. However, with NLU services provided by different companies powered by unrevealed AI technology, how to choose the best one is a problem for users. To our knowledge, there is currently no platform that can provide guidance to users and make recommendations based on their needs. To fill this gap, in this paper, we propose NLUBroker, a platform to comprehensively measure the performance indicators of candidate NLU services, and further provide a broker to select the most suitable service according to the different needs of users. Our evaluation shows that different NLU services leading in different aspects, and NLUBroker is able to improve the quality of experience by automatically choosing the best service. In addition, reinforcement learning is used to support NLUBroker by an intelligent agent in a dynamic environment, and the results are promising.

## Static Call Graph Construction in AWS Lambda Serverless Applications

Matthew Obetz, Stacy Patterson, and Ana Milanova, Rensselaer Polytechnic Institute

Available Media

We present new means for performing static program analysis on serverless programs. We propose a new type of call graph that captures the stateless, event-driven nature of such programs and describe a method for constructing these new extended service call graphs. Next, we survey applications of program analysis that can leverage our extended service call graphs to answer questions about code that executes on a serverless platform. We present findings on the applicability of our techniques to real open source serverless programs. Finally, we close with several open questions about how to best incorporate static analysis in problem solving for developing serverless applications.

## Agile Cold Starts for Scalable Serverless

Anup Mohan, Harshad Sane, Kshitij Doshi, Saikrishna Edupuganti, and Naren Nayak, Intel Corporation; Vadim Sukhomlinov, Google Inc.

Available Media

The Serverless or Function-as-a-Service (FaaS) model capitalizes on lightweight execution by packaging code and dependencies together for just-in-time dispatch. Often a container environment has to be set up afresh– a condition called “cold start", and in such cases, performance suffers and overheads mount, both deteriorating rapidly under high concurrency. Caching and reusing previously employed containers ties up memory and risks information leakage. Latency for cold starts is frequently due to work and wait-times in setting up various dependencies – such as in initializing networking elements. This paper proposes a solution that pre-crafts such resources and then dynamically re-associates them with baseline containers. Applied to networking, this approach demonstrates an order of magnitude gain in cold starts, negligible memory consumption, and flat startup time under rising concurrency.

## Co-evolving Tracing and Fault Injection with Box of Pain

Daniel Bittman, Ethan L. Miller, and Peter Alvaro, UC Santa Cruz

Available Media

Distributed systems are hard to reason about largely because of uncertainty about what may go wrong in a particular execution, and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help better understand their system-wide effects. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain’s lightweight approach to tracing and focus on simulating the effects of partial failures on communication rather than the failures themselves sets it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.