SPOK - Managing ML/Big Data Spark Workloads at scale on Kubernetes

Nagaraj Janardhana and Mike Arov, Intuit

Abstract: 

At Intuit, customer data sets are growing exponentially as the business and the capabilities it offers grow. We built SpoK (Spark on Kubernetes), an elastic platform that runs Jupyter notebooks, data processing, feature engineering, distributed training jobs, batch model inference, and model evaluation workflows on Spark, using Kubernetes as the resource manager.

With the whole organization moving to Kubernetes to run service workloads, we saw an opportunity to run ML workloads on Kubernetes as well. Doing so simplifies cluster operations management, brings the benefits of containers to data processing, provides scalable infrastructure, improves cost and efficiency, and lets us reuse the CI/CD and security certification tooling already built. The migration from EMR/YARN to Kubernetes has improved developer productivity by reducing the time to deploy from more than 7 days to less than a day. It has also provided cost improvements in the range of 25 to 30% and eased cluster operations management, since all types of workloads share the same cluster.

Nagaraj Janardhana, Intuit

Nagaraj is a Principal Engineer at Intuit in Mountain View, responsible for designing and developing ML and featurization platforms. He has previously worked on data ingestion and processing platforms and on identity and subscription platforms at Intuit. He has contributed to the Spinnaker open source project.

Mikhail Arov, Intuit

Mike is a Staff ML Engineer at Intuit in Mountain View. He was responsible for developing and deploying many ML models, including models for Cash Flow Forecasting, Mileage and Expense classification, and Marketing propensity. A strong advocate for K8s and Argo in ML, he pioneered the use of Spark on Kubernetes for the Intuit Data Platform, especially for ML.

OpML '20 Open Access Sponsored by NetApp

BibTeX
@conference {256666,
author = {Nagaraj Janardhana and Mikhail Arov},
title = {{SPOK} - Managing {ML/Big} Data Spark Workloads at scale on Kubernetes},
year = {2020},
publisher = {USENIX Association},
month = jul
}
