Joud Khoury, Timothy Upthegrove, Armando Caro, Brett Benyo, and Derrick Kong, Raytheon BBN Technologies
We present a common data model for representing causal events across a wide range of platforms and granularities. The model was developed for attack provenance analysis under the DARPA Transparent Computing program. The unified model successfully expresses data provenance across a range of granularities (e.g., object or byte level) and platforms (e.g., Linux, Android, BSD, and Windows). This paper describes our experience developing the common data model, the lessons learned, and performance results in controlled lab experiments.
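As a rough illustration of what a unified causal-event record might look like, the sketch below defines a single event type shared across platforms. The field names and values are illustrative assumptions, not the actual Transparent Computing Common Data Model schema.

```python
# A minimal sketch of a platform-independent causal-event record.
# Field names are illustrative, not the actual TC CDM schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_type: str    # e.g. "read", "write", "execute"
    subject: str       # acting entity, e.g. a process UUID
    obj: str           # acted-on entity, e.g. a file or socket UUID
    platform: str      # e.g. "linux", "android", "bsd", "windows"
    timestamp_ns: int  # event time in nanoseconds

# The same record shape can carry events from different OS monitors.
e1 = Event("read", "proc-42", "file-7", "linux", 1_600_000_000_000)
e2 = Event("write", "proc-9", "socket-3", "windows", 1_600_000_000_500)
```

A shared record shape like this is what lets downstream provenance analysis treat events from heterogeneous monitors uniformly.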
Sheung Chi Chan, University of Edinburgh; Ashish Gehani and Hassaan Irshad, SRI International; James Cheney, University of Edinburgh and The Alan Turing Institute
Data provenance is a kind of metadata recording the inputs, entities, and processes involved in producing data. It provides historical records and origin information of the data. Because of the rich information it provides, provenance is increasingly being used as a foundation for security analysis and forensic auditing. For example, system-level provenance can help us trace activities at the level of libraries or system calls, which offers great potential for detecting subtle malicious activities that would otherwise go undetected. However, most of these security-related applications of provenance data require completeness and correctness of the provenance collection process. This cannot always be guaranteed, because some provenance recording modules collect information from unreliable sources. We present work in progress on provenance graph integrity checking and abnormal component detection using ProvMark, the provenance expressiveness benchmarking tool. We also discuss possible applications of the ProvMark tool to aid quality checking of provenance data.
A First-Principles Algebraic Approach to Data Transformations in Data Cleaning: Understanding Provenance from the Ground Up
Santiago Núñez-Corrales, iSchool and NCSA, UIUC; Lan Li, iSchool, UIUC; Bertram Ludäscher, iSchool and NCSA, UIUC
We provide a model, constructed from first principles, describing data transformation workflows on tables: we define datasets as structures with functions and sets for which certain morphisms correspond to data transformations. We define rigid and deep data transformations depending on whether the geometry of the dataset is preserved or not. Finally, we add a model of concurrency using meet and join operations. Our work suggests that algebraic structures and homotopy type theory provide a more general context than other formalisms for reasoning about data cleaning, data transformations, and their provenance.
Dennis Dosso, University of Padua; Susan B. Davidson, University of Pennsylvania; Gianmaria Silvello, University of Padua
In this paper we define a new kind of data provenance for database management systems, called attribute lineage for SPJRU queries, building on previous work on data provenance for tuples.
We take inspiration from classical lineage, the metadata that enables users to discover which tuples in the input are used to produce a tuple in the output. Attribute lineage is instead defined as the set of all cells in the input database that are used by the query to produce one cell in the output.
We show that attribute lineage is more informative than simple lineage, and we discuss potential new applications for this metadata.
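A minimal sketch of the idea, using toy tables rather than the paper's formal SPJRU semantics: a selection-projection query whose output cells each record the set of input cells consulted to produce them. Note that the cell read by the selection predicate appears in the lineage even though it is not projected, which is what makes attribute lineage finer-grained than tuple lineage. Table layout and cell identifiers are illustrative assumptions.

```python
# Toy tables as lists of dicts; a cell is identified by
# (table_name, row_index, attribute).
R = [{"id": 1, "city": "Padua"},
     {"id": 2, "city": "Philadelphia"}]

def select_project(table, name, pred_attr, pred, out_attr):
    """sigma/pi with attribute lineage: each output cell carries the set of
    input cells the query used, including the cell read by the predicate."""
    out = []
    for i, row in enumerate(table):
        if pred(row[pred_attr]):
            value = row[out_attr]
            lineage = {(name, i, pred_attr), (name, i, out_attr)}
            out.append((value, lineage))
    return out

# pi_city(sigma_{id > 1}(R)): the output cell's lineage includes R's id cell.
res = select_project(R, "R", "id", lambda v: v > 1, "city")
```

Tuple-level lineage would only say the second tuple of R was used; attribute lineage additionally pinpoints which cells of that tuple mattered.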
Raza Ahmad, SRI; Eunjin Jung, University of San Francisco; Carolina de Senne Garcia, Ecole Polytechnique; Hassaan Irshad and Ashish Gehani, SRI
Data provenance describes the origins of a digital object. Such information is particularly useful when analyzing distributed workflows because extant tools, such as debuggers and application profilers, do not support tracing through heterogeneous runtimes that span multiple hosts. In decentralized systems, each host maintains the authoritative record of its own activity, represented as a dependency graph. Reconstructing the provenance of an object may involve the assembly of subgraphs from multiple, independently administered hosts. We term the collection of host-specific dependencies coupled with cross-host flows whole-network provenance. Such information can grow to terabytes for a small network. Aspects of distributed querying, caching, and response discrepancy detection that are specific to provenance are described and analyzed.
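The decentralized assembly described above can be sketched as a backward traversal that consults each host's own dependency subgraph on demand. Host names, node identifiers, and the in-memory graph representation below are illustrative assumptions, not the authors' system.

```python
# Each host holds only its local dependency edges; cross-host flows refer to
# remote vertices as (host, node_id) pairs. All names are illustrative.
host_graphs = {
    "hostA": {("hostA", "file1"): [("hostA", "proc1")]},   # file1 <- proc1
    "hostB": {("hostB", "proc2"): [("hostA", "file1")],    # cross-host flow
              ("hostB", "file2"): [("hostB", "proc2")]},   # file2 <- proc2
}

def provenance(node, graphs):
    """Assemble the ancestry of `node` by querying the owning host's
    subgraph for each vertex reached (a decentralized backward traversal)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        host, _ = n
        for parent in graphs.get(host, {}).get(n, []):
            stack.append(parent)
    return seen

anc = provenance(("hostB", "file2"), host_graphs)
```

In a real deployment each lookup of `graphs[host]` would be a remote query to an independently administered host, which is where the caching and response discrepancy detection discussed above come into play.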
Yuta Nakamura and Tanu Malik, DePaul University; Ashish Gehani, SRI International
Reproducing experiments entails repeating experiments with changes. Changes, such as a change in input arguments, a change in the invoking environment, or a change due to nondeterminism in the runtime, may alter results. If results differ significantly, perusing them is not sufficient: users must analyze the impact of a change and determine whether the experiment computed the same steps. Making fine-grained, stepwise comparisons can be both challenging and time-consuming. In this paper, we compare a reproduced execution with recorded system provenance of the original execution, and determine provenance alignment. The alignment is based on comparing the specific location in the program, the control flow of the execution, and data inputs. Experiments show that the alignment method has a low overhead to compute a match and realigns with a small look-ahead buffer.
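The realignment idea can be sketched as a greedy stepwise matcher with a bounded look-ahead window. The event representation below, (program location, data input) pairs, is a deliberate simplification of the provenance records compared in the paper, and the window size is an illustrative assumption.

```python
def align(original, reproduced, lookahead=4):
    """Greedy stepwise alignment of two execution traces: match events in
    order; on a mismatch, scan a small look-ahead window in the reproduced
    trace to realign past extra or reordered events."""
    pairs, j = [], 0
    for i, ev in enumerate(original):
        window = reproduced[j:j + lookahead]
        if ev in window:
            k = j + window.index(ev)
            pairs.append((i, k))  # step i of original aligns to step k
            j = k + 1
    return pairs

orig  = [("L1", "x=1"), ("L2", "y=2"), ("L3", "z=3")]
repro = [("L1", "x=1"), ("Lx", "tmp"), ("L2", "y=2"), ("L3", "z=3")]
pairs = align(orig, repro)  # skips the extra event and realigns
```

A small window keeps the comparison cheap: only a constant number of candidate events is examined per step, rather than the whole remaining trace.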
Sylvain Hallé and Hugo Tremblay, Université du Québec à Chicoutimi
This paper presents the theoretical foundations of a data lineage framework for a model of computation called a function circuit, which is a directed graph of elementary calculation units called "functions". The proposed framework allows functions to manipulate arbitrary data structures, and allows lineage relationships to be expressed at a fine level of granularity over these structures by introducing the abstract concept of a designator. Moreover, the lineage graphs produced by this framework allow multiple alternate explanations of a function's outputs from its inputs. The theoretical groundwork presented in this paper opens the way to the implementation of a generic lineage library for function circuits.
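A minimal sketch of a function circuit that records lineage edges as it evaluates. The (node name, argument index) pairs standing in for designators are a deliberate simplification of the paper's abstract concept, and the node representation is an illustrative assumption.

```python
# A function circuit: nodes are elementary functions wired into a DAG.
# Evaluation also emits lineage edges (consumer designator <- producer
# designator), where a designator here is just a (node_name, index) pair.

class Node:
    def __init__(self, name, fn, inputs):
        self.name, self.fn, self.inputs = name, fn, inputs

def evaluate(node, lineage):
    args = []
    for inp in node.inputs:
        if isinstance(inp, Node):
            args.append(evaluate(inp, lineage))
            # record that this argument slot came from the upstream output
            lineage.append(((node.name, len(args) - 1), (inp.name, 0)))
        else:
            args.append(inp)  # constant input: no lineage edge
    return node.fn(*args)

# Circuit computing (2 + 3) * 4
add = Node("add", lambda a, b: a + b, [2, 3])
mul = Node("mul", lambda a, b: a * b, [add, 4])
edges = []
value = evaluate(mul, edges)
```

Walking the recorded edges backward from an output designator yields an explanation of that output in terms of upstream function outputs, which is the kind of traversal a generic lineage library would support.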
Scott Friedman, Jeff Rye, David LaVergne, and Dan Thomsen, SIFT, LLC; Matthew Allen and Kyle Tunis, Raytheon BBN Technologies
Analytic software tools and workflows are increasing in capability, complexity, number, and scale, and the integrity of our workflows is as important as ever. Specifically, we must be able to inspect the process of analytic workflows to assess (1) confidence in the conclusions, (2) risks and biases of the operations involved, (3) sensitivity of the conclusions to sources and agents, (4) impact and pertinence of various sources and agents, and (5) diversity of the sources that support the conclusions. We present an approach that tracks agents' provenance with PROV-O in conjunction with agents' appraisals and evidence links (expressed in our novel DIVE ontology). Together, PROV-O and DIVE enable dynamic propagation of confidence and counter-factual refutation to improve human-machine trust and analytic integrity. We demonstrate all of these assessments in a multi-agent analysis scenario, using an interactive web-based information validation UI, and discuss key needs for organizations adopting such approaches.