The Case for Unifying Data Loading in Machine Learning Clusters


Aarati Kakaraparthy, University of Wisconsin, Madison & Microsoft Gray Systems Lab, Madison; Abhay Venkatesh, University of Wisconsin, Madison; Amar Phanishayee, Microsoft Research, Redmond; Shivaram Venkataraman, University of Wisconsin, Madison


Training machine learning models involves iteratively fetching and pre-processing batches of data. Conventionally, popular ML frameworks implement data loading within a job and focus on improving the performance of a single job. However, such an approach is inefficient in shared clusters, where multiple training jobs are likely to access the same data and duplicate operations. To illustrate this, we present a case study which reveals that, for hyper-parameter tuning experiments, we can remove up to 89% of the I/O and 97% of the pre-processing redundancy.

Based on this observation, we make the case for unifying data loading in machine learning clusters by bringing the isolated data loading systems together into a single system. Such a system architecture can remove the aforementioned redundancies that arise due to the isolation of data loading in each job. We introduce OneAccess, a unified data access layer, and present a prototype implementation that shows a 47.3% improvement in I/O cost when sharing data across jobs. Finally, we discuss open research challenges in designing and developing a unified data loading layer that can run across frameworks on shared multi-tenant clusters, including how to handle distributed data access, support diverse sampling schemes, and exploit new storage media.
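
The redundancy argument can be illustrated with a toy sketch. The code below is not the OneAccess implementation or its API; `SharedLoader`, `fake_fetch`, `fake_preprocess`, and `count_fetches` are hypothetical names introduced only for illustration. It assumes deterministic pre-processing (no per-job random augmentation) and an in-memory cache, and it simply counts how many fetches occur when several jobs each read the same samples once, with and without a shared loading layer. The counts it produces are properties of this toy setup, not the figures reported in the paper.

```python
from typing import Callable, Dict, List


class SharedLoader:
    """Hypothetical data-loading layer that caches pre-processed samples.

    Illustration only: this is not the OneAccess API.
    """

    def __init__(self,
                 fetch: Callable[[int], bytes],
                 preprocess: Callable[[bytes], List[float]]) -> None:
        self._fetch = fetch
        self._preprocess = preprocess
        self._cache: Dict[int, List[float]] = {}
        self.fetch_calls = 0        # simulated storage reads
        self.preprocess_calls = 0   # simulated pre-processing operations

    def get(self, sample_id: int) -> List[float]:
        # Fetch and pre-process only on the first request for a sample.
        if sample_id not in self._cache:
            self.fetch_calls += 1
            raw = self._fetch(sample_id)
            self.preprocess_calls += 1
            self._cache[sample_id] = self._preprocess(raw)
        return self._cache[sample_id]


def fake_fetch(sample_id: int) -> bytes:
    # Stand-in for a storage read (assumption: 4 bytes per sample).
    return bytes([sample_id % 256] * 4)


def fake_preprocess(raw: bytes) -> List[float]:
    # Stand-in for deterministic pre-processing such as normalization.
    return [b / 255.0 for b in raw]


def count_fetches(num_jobs: int, num_samples: int, shared: bool) -> int:
    """Total fetches when each job reads every sample once."""
    num_loaders = 1 if shared else num_jobs
    loaders = [SharedLoader(fake_fetch, fake_preprocess)
               for _ in range(num_loaders)]
    for job in range(num_jobs):
        loader = loaders[0] if shared else loaders[job]
        for sample_id in range(num_samples):
            loader.get(sample_id)
    return sum(loader.fetch_calls for loader in loaders)


if __name__ == "__main__":
    print("isolated:", count_fetches(num_jobs=4, num_samples=100, shared=False))
    print("shared:  ", count_fetches(num_jobs=4, num_samples=100, shared=True))
```

In this sketch, isolated loaders repeat every fetch once per job, while the shared loader performs each fetch once regardless of the number of jobs. A real system would also have to address the challenges the abstract lists, such as distributed access and per-job sampling schemes, which this sketch ignores.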


@inproceedings {234827,
author = {Aarati Kakaraparthy and Abhay Venkatesh and Amar Phanishayee and Shivaram Venkataraman},
title = {The Case for Unifying Data Loading in Machine Learning Clusters},
booktitle = {11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19)},
year = {2019},
address = {Renton, WA},
url = {},
publisher = {USENIX Association},
month = jul