Sheung Chi Chan, Heriot-Watt University; James Cheney, University of Edinburgh and The Alan Turing Institute; Pramod Bhatotia, Technische Universität München
Data provenance is a form of metadata recording the inputs and processes from which data are derived, providing historical records and origin information for the data. Because of this rich information, provenance is increasingly being used as a foundation for security analysis and forensic auditing. These applications require high-quality provenance. Earlier work proposed a provenance expressiveness benchmarking approach to automatically identify and compare the provenance generated by different provenance systems. However, that work was limited to benchmarking deterministic activities, whereas all real-world systems involve non-determinism, for example through concurrency and multiprocessing. Benchmarking non-deterministic events is challenging because the process owner has no control over the interleaving between processes or the execution order of system calls issued by different processes, leading to rapid growth in the number of possible schedules that need to be observed. To cover these cases and provide all-round automated expressiveness benchmarking for real-world examples, we propose an extension to the automated provenance benchmarking tool ProvMark that handles non-determinism.
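The schedule-explosion problem described above can be illustrated with a minimal sketch (not ProvMark's actual mechanism): given the syscall traces of two independent processes, the number of order-preserving interleavings the benchmark would have to observe is the binomial coefficient C(n+m, n). The syscall names below are illustrative.

```python
from itertools import combinations
from math import comb

def interleavings(a, b):
    """Enumerate every order-preserving interleaving of two traces."""
    n, m = len(a), len(b)
    results = []
    for chosen in combinations(range(n + m), n):
        chosen = set(chosen)          # positions taken from trace `a`
        ai, bi = iter(a), iter(b)
        merged = tuple(next(ai) if i in chosen else next(bi)
                       for i in range(n + m))
        results.append(merged)
    return results

p1 = ["open", "write", "close"]   # syscalls of process 1 (illustrative)
p2 = ["fork", "exec"]             # syscalls of process 2 (illustrative)
schedules = interleavings(p1, p2)
# The number of schedules is C(n+m, n) and grows combinatorially.
assert len(schedules) == comb(5, 3) == 10
```

Even for two short traces there are ten distinct schedules; with more processes or longer traces, exhaustively observing them quickly becomes infeasible, which is why automated handling of non-determinism matters.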
Argyro Avgoustaki, Giorgos Flouris, and Dimitris Plexousakis, ICS - FORTH
Provenance has been widely studied in several different contexts and with respect to different aspects and applications. Although the problem of determining how provenance should be recorded and represented has been thoroughly discussed, the issue of querying data provenance has not yet been adequately considered. In this paper, we introduce a novel high-level structured query language, named ProvQL, which is suitable for seeking information related to data provenance. ProvQL treats provenance information as a first-class citizen and allows formulating queries about the sources that contributed to data generation and the operations involved, about data records with a specific provenance or origin (or with common provenance), and others. This makes ProvQL a useful tool for tracking data provenance information and supporting applications that need to assess data reliability, trustworthiness, or quality, or to enforce access control.
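The kinds of queries ProvQL expresses declaratively can be sketched procedurally in a few lines of Python. The record layout and field names below are illustrative assumptions, not ProvQL's actual data model or syntax.

```python
# Toy provenance-annotated records; field names are illustrative only.
records = [
    {"id": "r1", "sources": {"sensorA"}, "ops": ["clean", "aggregate"]},
    {"id": "r2", "sources": {"sensorA", "sensorB"}, "ops": ["join"]},
    {"id": "r3", "sources": {"sensorB"}, "ops": ["clean"]},
]

def derived_from(records, source):
    """Records whose provenance includes the given source (an origin query)."""
    return [r["id"] for r in records if source in r["sources"]]

def common_provenance(records):
    """Pairs of records sharing at least one contributing source."""
    return [(a["id"], b["id"])
            for i, a in enumerate(records)
            for b in records[i + 1:]
            if a["sources"] & b["sources"]]

assert derived_from(records, "sensorA") == ["r1", "r2"]
assert ("r2", "r3") in common_provenance(records)
```

A dedicated query language makes such questions first-class: instead of writing ad-hoc filtering code per application, one declarative query ranges over sources, operations, and shared origins.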
Tom Blount and Adriane Chapman, University of Southampton; Michael Johnson, Max Planck Institute for Radio Astronomy; Bertram Ludascher, University of Illinois Urbana-Champaign
Provenance has been of interest to the computer science community for nearly two decades, with proposed uses ranging from data authentication, to security auditing, to ensuring trust in decision-making processes. However, despite its enthusiastic uptake in the academic community, its adoption elsewhere is often hindered by the cost of implementation. In this paper we seek to alleviate some of these costs, and propose the idea of possible provenance, in which we relax the constraint that provenance must be directly observed. We categorise some existing approaches to gathering provenance and compare the costs and benefits of each, and illustrate one method for generating possible provenance in more detail with a simple example: inferring the possible provenance of a game of Connect Four. We then discuss some of the benefits and ramifications of this approach to gathering provenance, and suggest key next steps for advancing this research.
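The Connect Four example lends itself to a small sketch of inference rather than observation (this is an illustration of the idea, not the authors' method): without having watched the game, one can enumerate which moves *could* have been the last one, since the last disc placed must sit on top of its column and belong to the player who moved last.

```python
def possible_last_moves(board):
    """Infer which moves could have produced this Connect Four position.

    board: list of columns, each a bottom-to-top list of 'X'/'O' discs.
    Assumes 'X' moved first, so 'X' leads by one immediately after its move.
    """
    flat = [p for col in board for p in col]
    last = "X" if flat.count("X") > flat.count("O") else "O"
    return [(c, len(col) - 1)        # (column, row) of each candidate move
            for c, col in enumerate(board)
            if col and col[-1] == last]

# Four columns; 'X' has 3 discs, 'O' has 2, so 'X' moved last.
board = [["X", "O"], ["X"], [], ["O", "X"]]
assert possible_last_moves(board) == [(1, 0), (3, 1)]
```

Iterating this reasoning backwards yields a set of possible game histories, i.e., possible provenance, without any disc placement ever having been directly observed.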
Andreas Schreiber, Claas de Boer, and Lynn von Kurnatowski, German Aerospace Center (DLR)
Assertions about the quality, reliability, or trustworthiness of software systems are important for many software applications. In addition to typical quality assurance measures, we extract the provenance of software artifacts from source code repositories, especially git-based repositories. Software repositories contain information about source code changes, the software development process, and team interactions. We focus on the web-based DevOps life-cycle tool GitLab, which provides a git repository manager and other development tools. We propose a provenance model defined using the W3C PROV data model, and an implementation: GitLab2PROV.
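The core mapping from repository history to PROV can be sketched as follows. This is a simplified illustration under assumed field names, not GitLab2PROV's actual model: each commit becomes a PROV activity, its author an agent, and each touched file version an entity generated by that commit.

```python
def commits_to_prov(commits):
    """Map commit records to PROV-DM style statements (simplified sketch)."""
    stmts = []
    for c in commits:
        activity = f"commit:{c['sha']}"
        agent = f"user:{c['author']}"
        stmts.append(("activity", activity))
        stmts.append(("agent", agent))
        stmts.append(("wasAssociatedWith", activity, agent))
        for path in c["files"]:
            entity = f"file:{path}@{c['sha']}"   # one entity per file version
            stmts.append(("entity", entity))
            stmts.append(("wasGeneratedBy", entity, activity))
    return stmts

commits = [{"sha": "a1b2c3", "author": "alice", "files": ["main.py"]}]
stmts = commits_to_prov(commits)
assert ("wasGeneratedBy", "file:main.py@a1b2c3", "commit:a1b2c3") in stmts
```

Treating each file-at-commit as a distinct entity is what lets later queries trace which commits, and which team members, shaped a given artifact.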
George Alter, University of Michigan; Jack Gager, Pascal Heus, and Carson Hunter, Metadata Technology North America; Sanda Ionescu, University of Michigan; Jeremy Iverson, Colectica; H V Jagadish, University of Michigan; Bertram Ludaescher, University of Illinois Urbana-Champaign; Jared Lyle, University of Michigan; Timothy McPhillips, University of Illinois Urbana-Champaign; Alexander Mueller, University of Michigan; Sigve Nordgaard and Ørnulf Risnes, Norwegian Centre for Research Data; Dan Smith, Colectica; Jie Song, University of Michigan; Thomas Thelen, University of California Santa Barbara
We have created a set of tools for automating the extraction of fine-grained provenance from statistical analysis software used for data management. Our tools create metadata about steps within programs and variables (columns) within data frames in a way consistent with the ProvONE extension of the PROV model. Scripts from the most widely used statistical analysis programs are translated into Structured Data Transformation Language (SDTL), an intermediate language for describing data transformation commands. SDTL can be queried to create histories of each variable in a dataset. For example, we can ask, “Which commands modified variable X?” or “Which variables were affected by variable Y?” SDTL was created to solve several problems. First, researchers are divided among a number of mutually unintelligible statistical languages. SDTL serves as a lingua franca providing a common language for downstream applications. Second, SDTL is a structured language that can be serialized in JSON, XML, RDF, and other formats. Applications can read SDTL without specialized parsing, and relationships among elements in SDTL are not defined by an external grammar. Third, SDTL provides general descriptions for operations that are handled differently in each language. For example, the SDTL MergeDatasets command describes both earlier languages (SPSS, SAS, Stata), in which merging is based on sequentially sorted files, and recent languages (R, Python) modelled on SQL. In addition, we have developed a flexible tool that translates SDTL into natural language. Our tools also embed variable histories, including both SDTL and natural language translations, into standard metadata files, such as Data Documentation Initiative (DDI) and Ecological Metadata Language (EML), which are used by data repositories to inform data catalogs, data discovery services, and codebooks. Thus, users can receive detailed information about the effects of data transformation programs without understanding the language in which they were written.
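The variable-history queries mentioned above ("Which commands modified variable X?") can be sketched over a JSON-serialized command list. The command and field names below are illustrative stand-ins, not the actual SDTL schema.

```python
# Hypothetical SDTL-like commands as JSON-style dicts (illustrative schema).
sdtl = [
    {"command": "LoadDataset", "produces": ["age", "income"]},
    {"command": "Compute", "variable": "log_income", "uses": ["income"]},
    {"command": "Recode", "variable": "age", "uses": ["age"]},
]

def history(commands, var):
    """Commands that created or modified the given variable."""
    return [c["command"] for c in commands
            if var in c.get("produces", []) or c.get("variable") == var]

def affected_by(commands, var):
    """Variables whose values depend on `var` through some command."""
    return sorted({c["variable"] for c in commands
                   if var in c.get("uses", [])})

assert history(sdtl, "age") == ["LoadDataset", "Recode"]
assert affected_by(sdtl, "income") == ["log_income"]
```

Because the commands are plain structured data, any downstream tool can answer such questions without parsing SPSS, SAS, Stata, R, or Python source, which is exactly the lingua-franca role SDTL is designed to play.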
Julia Kühnert, IANS/IPVS - University of Stuttgart; Dominik Göddeke, IANS - University of Stuttgart; Melanie Herschel, IPVS - University of Stuttgart
Simulations based on partial differential equations (PDEs) are used in a wide variety of scenarios, each with different requirements, e.g., in terms of runtime or accuracy. Different numerical approaches to approximating exact solutions exist, typically containing a multitude of parameters that can be tailored to the problem at hand. We explore how high-level provenance, i.e., provenance that is expensive to capture in a single simulation, can be used to optimize such parameters in future simulations of sufficiently similar problems. Our experiments on one of the key building blocks of PDE simulations underline the potential of this approach.
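One simple way to reuse captured provenance in this spirit (a toy sketch under assumed features and parameter names, not the authors' method) is a nearest-neighbour lookup: store problem features and chosen solver parameters per run, then seed a new, similar run with the parameters of the closest recorded one.

```python
# Provenance records of past runs: problem features -> solver parameters.
# Features here are (grid size, viscosity); names/values are illustrative.
past_runs = [
    {"features": (100, 0.1), "params": {"tol": 1e-6, "preconditioner": "jacobi"}},
    {"features": (10000, 0.01), "params": {"tol": 1e-8, "preconditioner": "ilu"}},
]

def suggest_params(features, runs):
    """Reuse parameters from the run whose problem features are closest."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(runs, key=lambda r: dist(r["features"], features))["params"]

# A new problem close to the second recorded run inherits its parameters.
assert suggest_params((9000, 0.02), past_runs) == \
    {"tol": 1e-8, "preconditioner": "ilu"}
```

The value of the approach rests on the premise in the abstract: capturing this provenance is expensive once, but the cost is amortised across all sufficiently similar future simulations.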
Michael A. C. Johnson, Institute of Data Science (DLR) and Max Planck Institute for Radio Astronomy; Marcus Paradies and Marta Dembska, Institute of Data Science (DLR); Kristen Lackeos, Hans-Rainer Klöckner, and David J. Champion, Max Planck Institute for Radio Astronomy; Sirko Schindler, Institute of Data Science (DLR)
In this decade astronomy is undergoing a paradigm shift to handle data from next-generation observatories such as the Square Kilometre Array (SKA) or the Vera C. Rubin Observatory (LSST). Producing real-time data streams of up to 10 TB/s and data products on the order of 600 PB/year, the SKA will be the world's largest civil data-producing machine, demanding novel solutions for how these data volumes can be stored and analysed. Because the data are processed by complex, automated pipelines, the provenance of this real-time processing is key to establishing confidence in the system, its final data products, and ultimately its scientific results.
The intention of this paper is to lay the foundation for an automated provenance-generation tool for astronomical data-processing pipelines. We therefore present a use case analysis, specific to astronomical needs, which addresses the issues of trust and reproducibility as well as further use cases of interest to astronomers. This analysis is subsequently used as the basis to discuss the requirements, challenges, and opportunities involved in designing both the tool and the associated provenance model.
Taeho Jung, University of Notre Dame; Seokki Lee, University of Cincinnati; Wenyi Tang, University of Notre Dame
Data sharing is becoming increasingly important, but the risk and benefit incurred by sharing are not yet well understood. Certain existing models can be leveraged to partially determine this risk and benefit. However, such naïve quantification is inaccurate because it fails to capture the context and history of the datasets being shared. This paper suggests utilizing data provenance to accurately and quantitatively model the risk and benefit of data sharing between two parties, and describes preliminary approaches as well as further issues to consider.
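The intuition that risk depends on a dataset's history, not just the dataset itself, can be made concrete with a toy model (an assumption for illustration, not the paper's actual model): score a dataset by aggregating the sensitivity of every ancestor in its provenance graph.

```python
# dataset -> datasets it was derived from (provenance edges), plus
# per-dataset sensitivity scores; both are illustrative.
provenance = {"d3": ["d1", "d2"], "d2": ["d1"], "d1": []}
sensitivity = {"d1": 0.9, "d2": 0.3, "d3": 0.1}

def sharing_risk(dataset, prov, sens):
    """Aggregate sensitivity over the dataset and all provenance ancestors."""
    seen, stack = set(), [dataset]
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(prov[d])
    return sum(sens[d] for d in seen)

# d3 looks innocuous in isolation (0.1), but its history inflates the risk.
assert abs(sharing_risk("d3", provenance, sensitivity) - 1.3) < 1e-9
```

A context-free model would score d3 at 0.1 and deem it safe to share; the history-aware score of 1.3 reflects that d3 is derived from the highly sensitive d1, which is precisely the gap the abstract identifies in naïve quantification.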
Ashish Gehani, Raza Ahmad, and Hassaan Irshad, SRI
Data provenance records may be rife with sensitive information. If such metadata is to be shared with others, it must be transformed to protect the privacy of parties whose activity is being reported. In general, this is a challenging task. It is further complicated by properties of provenance that facilitate drawing inferences about disparate portions of data sets. We consider aspects of the problem, describe strategies to address the identified issues, and share our implementation of configurable primitives for practical application of provenance privacy protection.
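One such configurable primitive can be sketched as follows (an illustrative example, not the authors' actual implementation): pseudonymize sensitive attribute values with a salted hash, so identities are hidden while graph structure, i.e., which records belong to the same party, is preserved.

```python
import hashlib

def pseudonymize(records, fields, salt="s3cr3t"):
    """Replace sensitive field values with salted-hash pseudonyms.

    The same input value always maps to the same pseudonym, so links
    between records survive while the original identity is hidden.
    """
    out = []
    for r in records:
        r = dict(r)  # leave the caller's records untouched
        for f in fields:
            if f in r:
                digest = hashlib.sha256((salt + r[f]).encode()).hexdigest()
                r[f] = digest[:8]
        out.append(r)
    return out

records = [{"process": "/usr/bin/ssh", "user": "alice"},
           {"process": "/bin/bash", "user": "alice"}]
anon = pseudonymize(records, ["user"])
# Same user, same pseudonym: provenance structure is preserved.
assert anon[0]["user"] == anon[1]["user"] != "alice"
```

Note the inference risk the abstract warns about: even pseudonymized, the preserved linkage across records can let an adversary correlate activity, which is why structure-preserving transformations alone are not sufficient for provenance privacy.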