HadoopProv: Towards Provenance as a First Class Citizen in MapReduce
Sherif Akoush, Ripduman Sohan, and Andy Hopper, University of Cambridge
We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in the Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. At query time, the provenance graph is built by joining the Map and Reduce provenance files on matching intermediate keys. In our prototype implementation, HadoopProv has an overhead below 10% on typical job runtime (<7% and <30% average temporal increase on Map and Reduce tasks respectively). Additionally, we demonstrate that provenance queries are serviceable in O(k log n), where n is the number of records per Map task and k is the number of Map tasks in which the key appears.
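The query-time join can be illustrated with a minimal Python sketch. It is not HadoopProv's implementation: the in-memory layout, the names `map_prov`, `reduce_prov`, and `trace_inputs`, and the sample records are all hypothetical. It assumes that each Map task's provenance is kept sorted by intermediate key, and that the Reduce provenance records which Map tasks emitted each key. Under those assumptions, one binary search per contributing Map task gives the O(k log n) lookup cost, plus the cost of reading the matching records.

```python
from bisect import bisect_left

# Hypothetical Map provenance: for each Map task, a list of
# (intermediate_key, input_record_offset) pairs sorted by key.
map_prov = {
    "map_0": [("apple", 0), ("banana", 17), ("cherry", 40)],
    "map_1": [("apple", 5), ("apple", 28), ("date", 33)],
    "map_2": [("banana", 9), ("elder", 21)],
}

# Hypothetical Reduce provenance: output key -> (intermediate key,
# Map tasks that emitted that key).
reduce_prov = {
    "apple": ("apple", ["map_0", "map_1"]),
    "banana": ("banana", ["map_0", "map_2"]),
}


def trace_inputs(output_key):
    """Return (map_task, input_offset) pairs that contributed to output_key."""
    intermediate_key, tasks = reduce_prov[output_key]
    inputs = []
    for task in tasks:  # k contributing Map tasks
        records = map_prov[task]
        # Binary search for the first record with this key: O(log n).
        # The 1-tuple (key,) sorts before every (key, offset) pair.
        i = bisect_left(records, (intermediate_key,))
        # Collect the run of records that share the key.
        while i < len(records) and records[i][0] == intermediate_key:
            inputs.append((task, records[i][1]))
            i += 1
    return inputs


if __name__ == "__main__":
    print(trace_inputs("apple"))
```

Because the join is performed only when a query arrives, nothing in this sketch would need to run during the job itself, which mirrors the deferred graph construction described above.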