Dan Graur, Damien Aymon, Dan Kluser, and Tanguy Albrici, ETH Zurich; Chandramohan A. Thekkath, Google; Ana Klimovic, ETH Zurich
Processing input data plays a vital role in ML training, impacting accuracy, throughput, and cost. The input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with training examples, is a common bottleneck. Alleviating data stalls is critical yet challenging for users. While today's frameworks provide mechanisms to maximize input pipeline throughput (e.g., distributing data processing on remote CPU workers and/or reusing cached data transformations), leveraging these mechanisms to jointly optimize training time and cost is non-trivial. Users face two key challenges. First, ML schedulers focus on GPU/TPU resources, leaving users on their own to optimize multi-dimensional resource allocations for data processing. Second, input pipelines often consume excessive compute power to repeatedly transform the same data. Deciding which source or transformed data to cache is non-trivial: large datasets are expensive to store, the compute time saved by caching is not always the bottleneck for end-to-end training, and transformations may not be deterministic, hence reusing transformed data can impact accuracy.
We propose Cachew, a fully-managed service for ML data processing. Cachew dynamically scales distributed resources for data processing to avoid stalls in training jobs. The service also automatically applies caching when and where it is performance/cost-effective to reuse preprocessed data within and across jobs. Our key contributions are autoscaling and autocaching policies, which leverage domain-specific metrics collected at data workers and training clients (rather than generic resource utilization metrics) to minimize training time and cost. Compared to scaling workers with Kubernetes, Cachew's policies reduce training time by up to 4.1x and training cost by 1.1x to 3.8x.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Dan Graur and Damien Aymon and Dan Kluser and Tanguy Albrici and Chandramohan A. Thekkath and Ana Klimovic},
title = {Cachew: Machine Learning Input Data Processing as a Service},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
isbn = {978-1-939133-29-43},
address = {Carlsbad, CA},
pages = {689--706},
url = {https://www.usenix.org/conference/atc22/presentation/graur},
publisher = {USENIX Association},
month = jul
}