Monday, June 3, 2019
9:00 am–10:30 am
Session 1: Keynote Address
Val Tannen, University of Pennsylvania
First-Order Logic (FOL) model checking is the computational problem of deciding, given an FO finite model (structure) A and an FO sentence s, whether s holds true in A or not. Its provenance analysis determines how that answer (holds or not) depends on the information that defines the model A. Provenance questions like this one have emerged in databases, scientific workflows, networks, and other areas.
We apply the semiring provenance framework, developed in databases, to the FOL model checking problem. This provides a non-standard semantics for FOL that refines logical truth to values in commutative semirings: the semiring of provenance polynomials, the Viterbi semiring of confidence scores, access control semirings, etc. The semantics can be used to synthesize models based on criteria like maximum confidence or public access. Our uniform treatment of logical negation also provides an approach to negative (a.k.a. why-not or non-answer) provenance.
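The refinement of truth values to semiring elements can be illustrated with a minimal sketch (not the authors' implementation): disjunction maps to semiring addition and conjunction to semiring multiplication, so evaluating a formula over the Viterbi semiring yields a confidence score instead of a boolean. The formula encoding and names below are hypothetical.

```python
from functools import reduce

# A semiring is (plus, times, zero, one); plus interprets "or",
# times interprets "and".
class Semiring:
    def __init__(self, plus, times, zero, one):
        self.plus, self.times, self.zero, self.one = plus, times, zero, one

# The Viterbi semiring ([0,1], max, *, 0, 1) models confidence scores.
viterbi = Semiring(max, lambda a, b: a * b, 0.0, 1.0)

def eval_formula(phi, valuation, S):
    """Evaluate a quantifier-free formula AST over semiring S.
    valuation maps atomic facts to semiring values (their provenance)."""
    op, *args = phi
    if op == "atom":
        return valuation[args[0]]
    if op == "or":   # disjunction -> semiring addition
        return reduce(S.plus, (eval_formula(a, valuation, S) for a in args), S.zero)
    if op == "and":  # conjunction -> semiring multiplication
        return reduce(S.times, (eval_formula(a, valuation, S) for a in args), S.one)
    raise ValueError(f"unknown connective: {op}")

# Confidence that (p or q) and r holds, given per-fact confidences:
phi = ("and", ("or", ("atom", "p"), ("atom", "q")), ("atom", "r"))
conf = eval_formula(phi, {"p": 0.9, "q": 0.5, "r": 0.8}, viterbi)
# max(0.9, 0.5) * 0.8 = 0.72
```

Swapping in a polynomial semiring instead of `viterbi` would yield a provenance polynomial over the atomic facts rather than a score; the evaluation code is unchanged.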
Joint work with Erich Grädel, RWTH Aachen University.
Val Tannen is a professor in the Department of Computer and Information Science of the University of Pennsylvania. He joined Penn after receiving his PhD from the Massachusetts Institute of Technology in 1987. He worked for a time in Programming Languages, and his current research interests are in Databases. He has always been interested in applications of Logic to Computer Science and since 1994 he has also worked in Bioinformatics, leading a number of interdisciplinary projects. In Databases, he and his students and collaborators have worked on query language design and on models and systems for query optimization, parallel query processing, and data integration. More recently their work has focused on models and systems for data sharing, data provenance, the management of uncertain information and algorithmic provisioning for what-if analysis. Tannen is an ACM Fellow.
10:30 am–11:00 am
11:00 am–12:30 pm
Session 2: Provenance Analytics
Anthony Anjorin, Paderborn University; James Cheney, University of Edinburgh and The Alan Turing Institute
Bidirectional transformations (bx) manage consistency between different independently-changing data structures, such as software engineering models. Many bx tools construct, exploit, and maintain various auxiliary structures required for correct and efficient consistency management. These data structures seem analogous to provenance in other settings, but their design is often ad hoc and implementation-dependent. However, it is increasingly urgent to rationalize their design and use as first-class explanations, to help users understand complex system behavior. In this paper we explore whether and how these auxiliary structures can already be viewed as forms of provenance, and outline open questions and possible future directions for provenance in bidirectional transformations, and vice versa.
Houssem Ben Lahmar and Melanie Herschel, IPVS–University of Stuttgart
Many systems today exist to collect provenance that describes how some data was derived. Such provenance is useful in many use cases, e.g., assessing reproducibility or the quality of a derivation process. Depending on the use case, collected provenance traces need to be explored and analyzed. Therefore, various approaches, including visual analysis approaches, have been proposed. However, these typically focus on analyzing individual provenance traces.
We propose to create structure-based summaries of provenance by aggregating many provenance traces provided in W3C PROV representation. We further describe the analysis tasks that apply to these summaries. We showcase the usefulness of structural summaries in several use cases, using appropriate visualization and interaction techniques.
Ralf Diestelkämper, IPVS–University of Stuttgart; Boris Glavic, Illinois Institute of Technology; Melanie Herschel, IPVS–University of Stuttgart; Seokki Lee, Illinois Institute of Technology
We present the first query-based approach for explaining missing answers to queries over nested relational data, a common data format used by big data systems such as Apache Spark. Our main contributions are a novel definition of query-based why-not provenance based on repairs to queries, and an implementation and preliminary experiments for answering such queries in Spark.
12:30 pm–2:00 pm
2:00 pm–3:30 pm
Session 3: Provenance Use Cases
Ghita Berrada, University of Edinburgh; James Cheney, University of Edinburgh and The Alan Turing Institute
System-level provenance offers great promise for improving security by facilitating the detection of attacks. Unsupervised anomaly detection techniques are necessary to defend against subtle or unpredictable attacks, such as advanced persistent threats (APTs). However, it is difficult to know in advance which views of the provenance graph will be most valuable as a basis for unsupervised anomaly detection on a given system. We present baseline results on the effectiveness of two existing anomaly detection algorithms on APT attack scenarios from four different operating systems, and identify simple score or rank aggregation techniques that are effective at combining anomaly scores and improving detection performance.
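Rank aggregation of this kind can be sketched in a few lines (a hypothetical illustration, not the paper's method): each detector scores the same entities, scores are converted to ranks, and mean rank produces the combined ordering, so an entity flagged strongly by either detector rises toward the top.

```python
def to_ranks(scores):
    """Map each entity to its rank (1 = most anomalous)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {entity: rank for rank, entity in enumerate(ordered, start=1)}

def average_rank(score_dicts):
    """Aggregate several detectors' anomaly-score dicts by mean rank."""
    rank_maps = [to_ranks(s) for s in score_dicts]
    entities = rank_maps[0].keys()
    return {e: sum(r[e] for r in rank_maps) / len(rank_maps) for e in entities}

# Two detectors scoring the same (hypothetical) processes:
det_a = {"proc1": 0.9, "proc2": 0.1, "proc3": 0.4}
det_b = {"proc1": 0.2, "proc2": 0.8, "proc3": 0.9}
combined = average_rank([det_a, det_b])
# proc3 gets ranks 2 and 1 -> mean rank 1.5, the most anomalous overall
```

Rank-based aggregation has the advantage that detectors with incomparable score scales can still be combined without calibration.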
Heather S. Packer, Adriane Chapman, and Leslie Carr, University of Southampton
Software project management is a complex task that requires accurate information and experience to inform the decision-making process. In the real world, software project managers rarely have access to perfect information. In order to support them, we propose leveraging information from Version Control Systems and their repositories to support decision-making. In this paper, we propose a PROV model, GitHub2PROV, which extends Git2PROV with details about commits and issues from GitHub repositories. We discuss how this model supports project management decisions in agile development, specifically in terms of Control Schedule Reviews and workload.
Mathieu Barre, INRIA; Ashish Gehani and Vinod Yegneswaran, SRI International
An advanced persistent threat (APT) is a stealthy malware instance that gains unauthorized access to a system and remains undetected for an extended time period. The aim of this work is to evaluate the feasibility of applying advanced machine learning and provenance analysis techniques to automatically detect the presence of APT infections within hosts in the network. We evaluate our techniques using a corpus of recent APT malware. Our results indicate that while detecting new APT instances is a fundamentally difficult problem, provenance-based learning techniques can detect over 50% of them with low false positive rates (< 4%).
3:30 pm–4:00 pm
4:00 pm–5:30 pm
Session 4: Posters and Discussions
This session will include posters of all accepted research papers plus the paper presentation listed below.
Timothy McPhillips, Lan Li, Nikolaus Parulian, and Bertram Ludäscher, University of Illinois at Urbana-Champaign
Preparation of data sets for analysis is a critical component of research in many disciplines. Recording the steps taken to clean data sets is equally crucial if such research is to be transparent and results reproducible. OpenRefine is a tool for interactively cleaning data sets via a spreadsheet-like interface and for recording the sequence of operations carried out by the user. OpenRefine uses its operation history to provide an undo/redo capability that enables a user to revisit the state of the data set at any point in the data cleaning process. OpenRefine additionally allows the user to export sequences of recorded operations as recipes that can be applied later to different data sets. Although OpenRefine internally records details about every change made to a data set following data import, exported recipes do not include the initial data import step. Details related to parsing the original data files are not included. Moreover, exported recipes do not include any edits made manually to individual cells. Consequently, neither a single recipe, nor a set of recipes exported by OpenRefine, can in general represent an entire, end-to-end data preparation workflow.
Here we report early results from an investigation into how the operation history recorded by OpenRefine can be used to (1) facilitate reproduction of complete, real-world data cleaning workflows; and (2) support queries and visualizations of the provenance of cleaned data sets for easy review.