Provenance in DISC Systems: Reducing Space Overhead at Runtime


Ralf Diestelkämper, Melanie Herschel, and Priyanka Jadhav, IPVS - University of Stuttgart


Data-intensive scalable computing (DISC) systems, such as Apache Hadoop or Spark, make it possible to process large amounts of heterogeneous data. To support varying provenance applications, emerging provenance solutions for DISC systems track all source data items through each processing step, imposing a high space and time overhead during program execution.

We introduce a provenance collection approach that reduces the space overhead at runtime by sampling the input data based on the definition of equivalence classes. A preliminary empirical evaluation shows that this approach satisfies many use cases of provenance applications in debugging and data exploration, indicating that collecting provenance for a fraction of the input data items often suffices for selected provenance applications. For cases where additional provenance is required, we further outline a method to collect provenance at query time, reusing, when possible, partial provenance already collected during program execution.
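
To make the idea of sampling by equivalence class concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how equivalence classes are defined, so the `class_key` function, the `per_class` parameter, and the example records are all hypothetical. The sketch also keeps the first items seen per class instead of drawing a random sample, purely to keep the result reproducible.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def sample_by_equivalence_class(
    items: Iterable[T],
    class_key: Callable[[T], Hashable],  # hypothetical: maps an item to its class
    per_class: int = 1,                  # hypothetical: representatives kept per class
) -> Dict[Hashable, List[T]]:
    """Keep at most `per_class` representatives of each equivalence class.

    Only the returned items would be tracked for provenance; all other
    input items would be processed without provenance annotations.
    """
    kept: Dict[Hashable, List[T]] = defaultdict(list)
    for item in items:
        bucket = kept[class_key(item)]
        if len(bucket) < per_class:
            bucket.append(item)
    return dict(kept)


# Hypothetical input: items with the same "country" are treated as equivalent.
records = [
    {"id": 1, "country": "DE", "amount": 10},
    {"id": 2, "country": "DE", "amount": 25},
    {"id": 3, "country": "US", "amount": 7},
    {"id": 4, "country": "FR", "amount": 12},
    {"id": 5, "country": "US", "amount": 30},
]

tracked = sample_by_equivalence_class(records, class_key=lambda r: r["country"])
tracked_ids = sorted(r["id"] for reps in tracked.values() for r in reps)
print(tracked_ids)
```

In this sketch, three of the five input items carry provenance, one per class. Whether such a sample suffices depends on the provenance application, which is why the approach is paired with query-time collection for the remaining items.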


@inproceedings {204239,
author = {Ralf Diestelk{\"a}mper and Melanie Herschel and Priyanka Jadhav},
title = {Provenance in {DISC} Systems: Reducing Space Overhead at Runtime},
booktitle = {9th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2017)},
year = {2017},
address = {Seattle, WA},
url = {},
publisher = {USENIX Association},
month = jun