Challenges and Experiences with MLOps for Performance Diagnostics in Hybrid-Cloud Enterprise Software Deployments

Authors: 

Amitabha Banerjee, Chien-Chia Chen, Chien-Chun Hung, Xiaobo Huang, Yifan Wang, and Razvan Chevesaran, VMware Inc

Abstract: 

This paper presents how VMware addressed the following challenges in operationalizing our ML-based performance diagnostics solution in enterprise hybrid-cloud environments: data governance, model serving and deployment, system performance drift, model feature selection, a centralized model training pipeline, alarm threshold setting, and explainability. We also share the lessons we learned and the experience we gained over the past four years of deploying ML operations at scale for enterprise customers.

OpML '20 Open Access Sponsored by NetApp

BibTeX
@inproceedings {256640,
author = {Amitabha Banerjee and Chien-Chia Chen and Chien-Chun Hung and Xiaobo Huang and Yifan Wang and Razvan Chevesaran},
title = {Challenges and Experiences with {MLOps} for Performance Diagnostics in {Hybrid-Cloud} Enterprise Software Deployments},
booktitle = {2020 USENIX Conference on Operational Machine Learning (OpML 20)},
year = {2020},
url = {https://www.usenix.org/conference/opml20/presentation/banerjee},
publisher = {USENIX Association},
month = jul
}
