Cocktail: A Multidimensional Optimization for Model Serving in Cloud

Authors: 

Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R. Das, The Pennsylvania State University

Abstract: 

With a growing demand for adopting ML models for a variety of application services, it is vital that the frameworks serving these models are capable of delivering highly accurate predictions with minimal latency along with reduced deployment costs in a public cloud environment. Despite high latency, prior works in this domain are crucially limited by the accuracy offered by individual models. Intuitively, model ensembling can address the accuracy gap by intelligently combining different models in parallel. However, selecting the appropriate models dynamically at runtime to meet the desired accuracy with low latency at minimal deployment cost is a nontrivial problem. Towards this, we propose Cocktail, a cost effective ensembling-based model serving framework. Cocktail comprises of two key components: (i) a dynamic model selection framework, which reduces the number of models in the ensemble, while satisfying the accuracy and latency requirements; (ii) an adaptive resource management (RM) framework that employs a distributed proactive autoscaling policy combined with importance sampling, to efficiently allocate resources for the models. The RM framework leverages transient virtual machine (VM) instances to reduce the deployment cost in a public cloud. A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that {Cocktail} can reduce deployment cost by 1.45x, while providing 2x reduction in latency and satisfying the target accuracy for up to 96% of the requests, when compared to state-of-the-art model-serving frameworks.

NSDI '22 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {276950,
author = {Jashwant Raj Gunasekaran and Cyan Subhra Mishra and Prashanth Thinakaran and Bikash Sharma and Mahmut Taylan Kandemir and Chita R. Das},
title = {Cocktail: A Multidimensional Optimization for Model Serving in Cloud},
booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
isbn = {978-1-939133-27-4},
address = {Renton, WA},
pages = {1041--1057},
url = {https://www.usenix.org/conference/nsdi22/presentation/gunasekaran},
publisher = {USENIX Association},
month = apr
}

Presentation Video