Wednesday, July 11, 2018
Principles of Provenance: We (Still) Need Some
James Cheney, University of Edinburgh
Session I: Provenance Use Cases and Applications
Using Provenance for Generating Automatic Citations
Dai Hai Ton That, Tanu Malik, Alexander Rasin, and Andrew Youngdahl, DePaul University
When computational experiments include only datasets, they can be shared through the Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs) that point to these resources. However, experiments seldom include only datasets; most often they also include software, execution results, provenance, and other associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While an entire Research Object may be citable using a URI or a DOI, it is often desirable to cite specific sub-components of a Research Object to help identify, authorize, date, and retrieve the published sub-components of these objects. In this paper, we present an approach to automatically generate citations for sub-components of Research Objects by using the object's recorded provenance traces. The generated citations can be used "as is" or taken as suggestions that can be grouped and combined to produce higher-level citations.
Pointer Provenance in a Capability Architecture
Alfredo Mazzinghi, Ripduman Sohan, and Robert N. M. Watson, University of Cambridge
We design and implement a framework for tracking pointer provenance, using our CHERI fat-pointer capability architecture to facilitate analysis of security implications of program pointer flows in both user and privileged code, with minimal instrumentation. CHERI enforces pointer provenance validity at the architectural level, in the presence of complex pointer arithmetic and type casting. CHERI presents new opportunities for provenance research: we discuss use cases and highlight lessons and open questions from our work.
Provenance-based Intrusion Detection: Opportunities and Challenges
Xueyuan Han, Harvard University; Thomas Pasquier, University of Cambridge; Margo Seltzer, Harvard University
Intrusion detection is an arms race; attackers evade intrusion detection systems by developing new attack vectors to sidestep known defense mechanisms. Provenance provides a detailed, structured history of the interactions of digital objects within a system. It is ideal for intrusion detection, because it offers a holistic, attack-vector-agnostic view of system execution. As such, provenance graph analysis fundamentally strengthens detection robustness. We discuss the opportunities and challenges associated with provenance-based intrusion detection and provide insights based on our experience building such systems.
Provenance and Probabilities in Relational Databases: From Theory to Practice
Pierre Senellart, Ecole Normale Supérieure, Paris, France
We review the basics of data provenance in relational databases. We describe different provenance formalisms, from Boolean provenance to provenance semirings and beyond, that can be used for a wide variety of purposes, to obtain additional information on the output of a query. We discuss representation systems for data provenance, circuits in particular, with a focus on practical implementation. Finally, we explain how provenance is practically used for probabilistic query evaluation in probabilistic databases.
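As a rough illustration of the provenance-semiring idea mentioned in this abstract, the sketch below annotates each input tuple with a semiring value, multiplies annotations when tuples are joined, and adds them when several derivations yield the same output tuple. The relation contents, the `Semiring` class, and the `join_project` helper are hypothetical names invented for this example; they are not taken from the talk or from any particular system.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (hypothetical helper for this sketch)."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]   # combines alternative derivations
    times: Callable[[Any, Any], Any]  # combines jointly used tuples


# Boolean provenance: is the output tuple derivable at all?
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
# Counting semiring: how many derivations does the output tuple have?
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

Relation = Dict[Tuple[str, str], Any]  # tuple -> semiring annotation


def join_project(r: Relation, s: Relation, k: Semiring) -> Relation:
    """Evaluate the query  pi_{a,c}( R(a,b) JOIN S(b,c) )  over semiring k."""
    out: Relation = {}
    for (a, b), x in r.items():
        for (b2, c), y in s.items():
            if b == b2:
                key = (a, c)
                # Join multiplies annotations; projection adds them up.
                out[key] = k.plus(out.get(key, k.zero), k.times(x, y))
    return out


# Made-up example data: two paths from "a" to "c".
R = [("a", "b1"), ("a", "b2")]
S = [("b1", "c"), ("b2", "c")]

counts = join_project({t: 1 for t in R}, {t: 1 for t in S}, COUNTING)

# Boolean annotations: pretend the tuple ("a", "b2") is absent from R.
r_bool = {("a", "b1"): True, ("a", "b2"): False}
s_bool = {t: True for t in S}
exists = join_project(r_bool, s_bool, BOOLEAN)

print(counts)  # {('a', 'c'): 2}
print(exists)  # {('a', 'c'): True}
```

The same query code runs unchanged over either semiring; only the annotations and the two operations differ, which is what makes the semiring framework attractive as a common interface for different kinds of provenance.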
Session II: Provenance Enabled Systems
Curator: Provenance Management for Modern Distributed Systems
Warren Smith, The Weather Company; Thomas Moyer, UNC Charlotte; Charles Munson, MIT Lincoln Laboratory
Data provenance is a valuable tool for protecting and troubleshooting distributed systems. Careful design of the provenance components reduces the impact on the design, implementation, and operation of the distributed system. In this paper, we present Curator, a provenance management toolkit that can be easily integrated with microservice-based systems and other modern distributed systems. This paper describes the design of Curator and discusses how we have used Curator to add provenance to distributed systems. We find that our approach results in no changes to the design of these distributed systems and minimal additional code and dependencies to manage. In addition, Curator uses the same scalable infrastructure as the distributed system and can therefore scale with the distributed system.
Wrattler: Reproducible, live and polyglot notebooks
Tomas Petricek, University of Kent and The Alan Turing Institute; James Geddes, The Alan Turing Institute; Charles Sutton, The University of Edinburgh, The Alan Turing Institute, and Google
Notebooks such as Jupyter have become a popular environment for data science because they support interactive data exploration and provide a convenient way of interleaving code, comments, and visualizations. However, most notebook systems use an architecture that leads to a limited model of interaction and makes reproducibility and versioning difficult.
In this paper, we present Wrattler, a new notebook system built around provenance that addresses the above issues. Wrattler separates state management from script evaluation and controls the evaluation using a dependency graph maintained in the web browser. This allows richer forms of interactivity and efficient evaluation through caching, guarantees reproducibility, and makes it possible to support versioning.
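The idea of driving evaluation from a dependency graph with caching can be sketched as follows. This is a minimal illustration and not Wrattler's actual implementation: the `Notebook` class, the cell representation, and the choice of keying the cache on a hash of a cell's source text plus the hashes of its dependencies are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A cell is (function, names of cells it depends on, source text).
# This representation is hypothetical and chosen for brevity.
Cell = Tuple[Callable[..., object], List[str], str]


class Notebook:
    def __init__(self) -> None:
        self.cache: Dict[str, object] = {}  # node hash -> cached result
        self.evaluations = 0                # how many cells actually ran

    def node_hash(self, cells: Dict[str, Cell], name: str) -> str:
        """Hash a cell's source together with the hashes of its dependencies."""
        _, deps, source = cells[name]
        h = hashlib.sha256(source.encode("utf-8"))
        for dep in deps:
            h.update(self.node_hash(cells, dep).encode("utf-8"))
        return h.hexdigest()

    def evaluate(self, cells: Dict[str, Cell], name: str) -> object:
        """Evaluate a cell, reusing cached results for unchanged subgraphs."""
        key = self.node_hash(cells, name)
        if key not in self.cache:
            fn, deps, _ = cells[name]
            args = [self.evaluate(cells, dep) for dep in deps]
            self.cache[key] = fn(*args)
            self.evaluations += 1
        return self.cache[key]


nb = Notebook()
cells: Dict[str, Cell] = {
    "data": (lambda: [1, 2, 3], [], "[1, 2, 3]"),
    "total": (lambda d: sum(d), ["data"], "sum(data)"),
    "mean": (lambda d, t: t / len(d), ["data", "total"], "total / len(data)"),
}

first = nb.evaluate(cells, "mean")   # runs all three cells
runs_after_first = nb.evaluations

nb.evaluate(cells, "mean")           # fully cached, nothing re-runs
runs_after_repeat = nb.evaluations

# Editing only the last cell re-runs that cell alone.
cells["mean"] = (lambda d, t: 100 * t / len(d), ["data", "total"],
                 "100 * total / len(data)")
second = nb.evaluate(cells, "mean")
runs_after_edit = nb.evaluations
```

Because a node's hash changes whenever its own source or any upstream source changes, a stale result can never be looked up by mistake, and old results remain in the cache, which is one simple way to keep earlier versions available.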
Thursday, July 12, 2018
Session III: Extension and Implementation of How-Provenance
Semiring Provenance over Graph Databases
Yann Ramusat, ENS, PSL University; Silviu Maniu, LRI, Université Paris-Sud, Université Paris-Saclay; Pierre Senellart, DI ENS, ENS, CNRS, PSL University & Inria Paris & LTCI, Télécom ParisTech
We generalize three existing graph algorithms to compute the provenance of regular path queries over graph databases, in the framework of provenance semirings – algebraic structures that can capture different forms of provenance. Each algorithm yields a different trade-off between time complexity and generality, as each requires different properties over the semiring. Together, these algorithms cover a large class of semirings used for provenance (top-k, security, etc.). Experimental results suggest these approaches are complementary and practical for various kinds of provenance indications, even on a relatively large transport network.
How "How" Explains What "What" Computes — How-Provenance for SQL and Query Compilers
Daniel O’Grady, Tobias Müller, and Torsten Grust, Universität Tübingen
SQL emphasizes the What, the declarative specification of complex computations over a database. However, how exactly the individual parts of an intricate query interact and contribute to the result often remains in the dark. How-provenance helps to understand queries and build trust in their results. We propose a new approach that derives how-provenance for SQL at a fine granularity: (1) every single piece of the result provides information on how exactly it got there, and (2) the contribution of any query construct to the overall output can be assessed, from entire subqueries down to the subexpression leaf level. The method applies to real-world dialects of SQL and, more generally, to the modern breed of database systems that pursue a compilation-based approach to query processing.