Caching in the Multiverse


Mania Abdi, Northeastern University; Amin Mosayyebzadeh, Boston University; Mohammad Hossein Hajkazemi, Northeastern University; Ata Turk, State Street; Orran Krieger, Boston University; Peter Desnoyers, Northeastern University


To get good performance for data stored in object storage services like S3, data analysis clusters need to cache data locally. Recently, these caches have started taking into account higher-level information from the analysis framework, allowing prefetching based on predictions of future data accesses. There is, however, a broader opportunity: rather than using this information to predict one future, we can use it to select a future that is best for caching. This paper provides preliminary evidence that we can exploit the directed acyclic graph (DAG) of inter-task dependencies used by data-parallel frameworks such as Spark, Pig, and Hive to improve application performance by optimizing caching for the critical path through the application's DAG. We present experimental results for Pig running TPC-H queries, showing completion time improvements of up to 23% vs. our implementation of MRD, a state-of-the-art DAG-based prefetching system, and improvements of up to 2.5x vs. LRU caching. We then discuss the broader opportunity for building a system based on this approach.
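
To make the critical-path idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each task has a known estimated run time and a known list of input objects. It finds the longest-cost path through a task DAG and then greedily picks the inputs of tasks on that path until a cache budget is used up. The function names (critical_path, select_for_cache), the task and dataset names, the costs, and the greedy selection policy are all hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(deps, cost):
    """Return (total_cost, path) for the longest-cost path through a task DAG.

    deps maps each task to the set of tasks it depends on.
    cost maps each task to its (assumed) estimated run time.
    """
    order = list(TopologicalSorter(deps).static_order())
    finish = {}  # earliest finish time of each task
    prev = {}    # predecessor on the longest path into each task
    for task in order:
        parents = deps.get(task, ())
        best = max(parents, key=lambda p: finish[p], default=None)
        finish[task] = cost[task] + (finish[best] if best is not None else 0.0)
        prev[task] = best
    # The critical path ends at the task with the latest finish time.
    node = max(order, key=lambda t: finish[t])
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return total, path


def select_for_cache(path, inputs, sizes, capacity):
    """Greedily pick inputs of critical-path tasks that fit in the cache.

    This policy is an illustrative assumption, not the paper's algorithm.
    """
    chosen, used = [], 0
    for task in path:
        for obj in inputs.get(task, ()):
            if obj not in chosen and used + sizes[obj] <= capacity:
                chosen.append(obj)
                used += sizes[obj]
    return chosen


# Hypothetical four-task query plan: two scans feed a join, then an aggregate.
deps = {"scan_a": set(), "scan_b": set(),
        "join": {"scan_a", "scan_b"}, "agg": {"join"}}
cost = {"scan_a": 10.0, "scan_b": 4.0, "join": 3.0, "agg": 1.0}
inputs = {"scan_a": ["lineitem"], "scan_b": ["orders"]}
sizes = {"lineitem": 8, "orders": 2}

total, path = critical_path(deps, cost)
print(total, path)                                  # 14.0 ['scan_a', 'join', 'agg']
print(select_for_cache(path, inputs, sizes, 8))     # ['lineitem']
```

In this toy plan, scan_b finishes well before scan_a, so caching its input would not shorten the job; spending the cache budget on the input of scan_a targets the path that determines completion time.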

@inproceedings {234723,
author = {Mania Abdi and Amin Mosayyebzadeh and Mohammad Hossein Hajkazemi and Ata Turk and Orran Krieger and Peter Desnoyers},
title = {Caching in the Multiverse},
booktitle = {11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19)},
year = {2019},
address = {Renton, WA},
url = {},
publisher = {USENIX Association},
month = jul,