Provenance in DISC Systems: Reducing Space Overhead at Runtime


Ralf Diestelkämper, Melanie Herschel, and Priyanka Jadhav, IPVS - University of Stuttgart


Data-intensive scalable computing (DISC) systems, such as Apache Hadoop or Spark, make it possible to process large amounts of heterogeneous data. To support a variety of provenance applications, emerging provenance solutions for DISC systems track all source data items through each processing step, imposing a high space and time overhead during program execution.

We introduce a provenance collection approach that reduces the space overhead at runtime by sampling the input data based on the definition of equivalence classes. A preliminary empirical evaluation shows that this approach satisfies many use cases of provenance applications in debugging and data exploration, indicating that collecting provenance for only a fraction of the input data items often suffices for selected provenance applications. For cases where additional provenance is required, we further outline a method to collect provenance at query time, reusing, where possible, partial provenance already collected during program execution.
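
To illustrate the general idea of equivalence-class-based sampling, the following minimal Python sketch tracks provenance only for a few representatives of each class. It is an illustration under stated assumptions, not the implementation evaluated here: the function names (`sample_by_equivalence_class`, `run_with_provenance`), the use of input positions as provenance identifiers, and the rule of keeping the first `per_class` items of each class are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Optional, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def sample_by_equivalence_class(
    items: Sequence[T],
    class_of: Callable[[T], Hashable],
    per_class: int = 1,
) -> Set[int]:
    """Return the positions of the input items selected for provenance tracking.

    Assumption: the first `per_class` items seen in each equivalence class
    serve as that class's representatives.
    """
    seen = defaultdict(int)  # equivalence class -> number of items selected so far
    tracked: Set[int] = set()
    for idx, item in enumerate(items):
        key = class_of(item)
        if seen[key] < per_class:
            seen[key] += 1
            tracked.add(idx)
    return tracked


def run_with_provenance(
    items: Sequence[T],
    tracked: Set[int],
    keep: Callable[[T], bool],
    transform: Callable[[T], R],
) -> List[Tuple[R, Optional[int]]]:
    """Apply a filter followed by a map, annotating only tracked items.

    Each result is paired with the position of its source item if that item
    was sampled, and with None otherwise (no provenance stored for it).
    """
    results: List[Tuple[R, Optional[int]]] = []
    for idx, item in enumerate(items):
        if not keep(item):
            continue
        results.append((transform(item), idx if idx in tracked else None))
    return results


# Hypothetical input: equivalence classes are defined by the "country" field.
records = [
    {"country": "DE", "amount": 10},
    {"country": "DE", "amount": 25},
    {"country": "FR", "amount": 7},
    {"country": "FR", "amount": 40},
    {"country": "US", "amount": 3},
]

tracked = sample_by_equivalence_class(records, class_of=lambda r: r["country"])
output = run_with_provenance(
    records,
    tracked,
    keep=lambda r: r["amount"] >= 5,
    transform=lambda r: r["amount"] * 2,
)
```

In this sketch, three of the five input records carry provenance annotations through the pipeline, so the annotation storage grows with the number of equivalence classes rather than with the input size. Results whose source item was not sampled carry `None`; recovering their provenance would require a query-time step such as the one outlined above.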

