Modeling Provenance and Understanding Reproducibility for OpenRefine Data Cleaning Workflows

Authors: 

Timothy McPhillips, Lan Li, Nikolaus Parulian, and Bertram Ludäscher, University of Illinois at Urbana-Champaign

Abstract: 

Preparation of data sets for analysis is a critical component of research in many disciplines. Recording the steps taken to clean data sets is equally crucial if such research is to be transparent and results reproducible. OpenRefine is a tool for interactively cleaning data sets via a spreadsheet-like interface and for recording the sequence of operations carried out by the user. OpenRefine uses its operation history to provide an undo/redo capability that enables a user to revisit the state of the data set at any point in the data cleaning process. OpenRefine additionally allows the user to export sequences of recorded operations as recipes that can be applied later to different data sets. Although OpenRefine internally records details about every change made to a data set following data import, exported recipes do not include the initial data import step. Details related to parsing the original data files are not included. Moreover, exported recipes do not include any edits made manually to individual cells. Consequently, neither a single recipe, nor a set of recipes exported by OpenRefine, can in general represent an entire, end-to-end data preparation workflow.

Here we report early results from an investigation into how the operation history recorded by OpenRefine can be used to (1) facilitate reproduction of complete, real-world data cleaning workflows; and (2) support queries and visualizations of the provenance of cleaned data sets for easy review.
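The gap between what a recipe captures and what a full workflow contains can be illustrated with a small sketch. The model below is hypothetical and is not OpenRefine's internal file format: the `Step` class, the `export_recipe` and `missing_from_recipe` helpers, and the sample steps are all assumptions made for illustration. The operation dictionaries are abbreviated imitations of the JSON that OpenRefine exports for operations such as text transforms and column renames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One step of a data preparation workflow (hypothetical model).

    `operation` holds a recipe-style dictionary when the step can be
    exported for reuse, and is None when it cannot (for example the
    initial import or a manual edit of a single cell).
    """
    description: str
    operation: Optional[dict] = None


def export_recipe(steps):
    """Return only the operations that a recipe would carry."""
    return [s.operation for s in steps if s.operation is not None]


def missing_from_recipe(steps):
    """Return descriptions of steps that the recipe alone cannot reproduce."""
    return [s.description for s in steps if s.operation is None]


# Illustrative workflow; the column names and expression are made up.
workflow = [
    Step("Import and parse the original CSV file"),
    Step("Trim whitespace in column 'name'",
         {"op": "core/text-transform",
          "columnName": "name",
          "expression": "value.trim()"}),
    Step("Manually edit a single cell in column 'name'"),
    Step("Rename column 'name' to 'full_name'",
         {"op": "core/column-rename",
          "oldColumnName": "name",
          "newColumnName": "full_name"}),
]

recipe = export_recipe(workflow)
gaps = missing_from_recipe(workflow)

for description in gaps:
    print("Not reproducible from the recipe:", description)
```

In this sketch the recipe keeps two of the four steps, and the import step and the manual cell edit are reported as gaps. Reproducing the complete workflow would require recovering those steps from some other record.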

BibTeX
@inproceedings {236189,
author = {Timothy McPhillips and Lan Li and Nikolaus Parulian and Bertram Ludaescher},
title = {Modeling Provenance and Understanding Reproducibility for OpenRefine Data Cleaning Workflows},
booktitle = {11th International Workshop on Theory and Practice of Provenance (TaPP 2019)},
year = {2019},
address = {Philadelphia, PA},
url = {https://www.usenix.org/conference/tapp2019/presentation/mcphillips},
publisher = {{USENIX} Association},
}