Time Travel and Provenance for Machine Learning Pipelines

Authors: 

Alexandru A. Ormenisan, KTH - Royal Institute of Technology; Moritz Meister, Fabio Buso, and Robin Andersson, Logical Clocks AB; Seif Haridi and Jim Dowling, KTH - Royal Institute of Technology

Abstract: 

Machine learning pipelines have become the de facto paradigm for productionizing machine learning applications, as they clearly abstract the processing steps involved in transforming raw data into engineered features that are then used to train models. In this paper, we use a bottom-up method for capturing provenance information regarding the processing steps and artifacts produced in ML pipelines. Our approach is based on replacing traditional intrusive hooks in application code (to capture ML pipeline events) with standardized change-data-capture support in the systems involved in ML pipelines: the distributed file system, feature store, resource manager, and applications themselves. In particular, we leverage data versioning and time-travel capabilities in our feature store to show how provenance can enable model reproducibility and debugging.
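
To make the link between time travel and reproducibility concrete, the following is a minimal in-memory sketch, not the authors' implementation or any real feature store API. The names `VersionedFeatureGroup`, `ProvenanceRecord`, and `train` are hypothetical, and the "model" is just a mean over feature values. It assumes that every commit to a feature group is retained, so a training run can be replayed by reading the features as of the timestamp stored in its provenance record.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class VersionedFeatureGroup:
    """Toy feature table (hypothetical): every commit is kept, so old states stay readable."""
    name: str
    commit_times: list = field(default_factory=list)  # strictly increasing timestamps
    snapshots: list = field(default_factory=list)     # full table state after each commit

    def commit(self, timestamp: int, rows: dict) -> None:
        """Store a new snapshot: the previous state with `rows` upserted."""
        if self.commit_times and timestamp <= self.commit_times[-1]:
            raise ValueError("commit timestamps must be strictly increasing")
        previous = self.snapshots[-1] if self.snapshots else {}
        self.commit_times.append(timestamp)
        self.snapshots.append({**previous, **rows})

    def as_of(self, timestamp: int) -> dict:
        """Time travel: return the state after the latest commit at or before `timestamp`."""
        idx = bisect_right(self.commit_times, timestamp)
        return dict(self.snapshots[idx - 1]) if idx else {}


@dataclass(frozen=True)
class ProvenanceRecord:
    """Links a trained model to the feature state it read (hypothetical schema)."""
    model: str
    feature_group: str
    read_as_of: int


def train(model_name: str, fg: VersionedFeatureGroup, read_as_of: int, log: list) -> float:
    """Stand-in for training: average the feature values and log where they came from."""
    features = fg.as_of(read_as_of)
    if not features:
        raise ValueError("no feature data at or before the requested time")
    log.append(ProvenanceRecord(model_name, fg.name, read_as_of))
    return sum(features.values()) / len(features)
```

Under these assumptions, a model trained at time 150 can be reproduced later, even after newer commits have changed the feature group, by calling `fg.as_of(record.read_as_of)` with the timestamp from its provenance record.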

OpML '20 Open Access Sponsored by NetApp

BibTeX
@inproceedings {256644,
author = {Alexandru A. Ormenisan and Moritz Meister and Fabio Buso and Robin Andersson and Seif Haridi and Jim Dowling},
title = {Time Travel and Provenance for Machine Learning Pipelines},
booktitle = {2020 USENIX Conference on Operational Machine Learning (OpML 20)},
year = {2020},
url = {https://www.usenix.org/conference/opml20/presentation/ormenisan},
publisher = {USENIX Association},
month = jul
}
