Workshop Program

The workshop papers are available for download below. Copyright to the individual works is retained by the author(s).

 

Thursday, June 9, 2016

08:45–09:00 Thursday

Welcome

09:00–10:00 Thursday

Keynote Address

On Data Citation and Provenance

Susan B. Davidson, University of Pennsylvania

Most information is now published in complex, structured, evolving datasets or databases. As such, there is increasing demand that this digital information should be treated in the same way as conventional publications and cited appropriately. While principles and standards have been developed for data citation, they are unlikely to be used unless we can couple the process of extracting information with that of providing a citation for it. I will discuss the problem of automatically generating citations for data in a database given how the data was obtained (the query) as well as the content (the data), and show how the problem of generating a citation is related to a well-understood problem in databases. I will also discuss the connection between data citation and provenance: are they different versions of the same problem or different problems entirely?

10:00–10:30 Thursday

Coffee Break

10:30–11:30 Thursday

Research Session 1: Provenance in Relational Databases

Session Chair: James Cheney, University of Edinburgh

Fine-grained Provenance for Linear Algebra Operators

Zhepeng Yan, Val Tannen and Zachary G. Ives, University of Pennsylvania

Provenance is well-understood for relational query operators. Increasingly, however, data analytics is incorporating operations expressed through linear algebra: machine learning operations, network centrality measures, and so on. In this paper, we study provenance information for matrix data and linear algebra operations. Our core technique builds upon provenance for aggregate queries and constructs a K semialgebra. This approach tracks provenance by annotating matrix data and propagating these annotations through linear algebra operations. We investigate applications in matrix inversion and graph analysis.
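
To make the annotation idea concrete, below is a minimal Python sketch in the spirit of semiring provenance: each matrix entry carries a set of provenance monomials, and matrix multiplication combines these annotations alongside the numeric values. It illustrates the general technique only; the names and encoding are invented here, and it is not the authors' K-semialgebra construction.

```python
# Hypothetical sketch: provenance annotations propagated through matrix
# multiplication, in the style of semiring provenance. Not the paper's model.
from itertools import product

def annotated_matmul(A, B):
    """Multiply matrices whose entries are (value, provenance) pairs.

    Provenance is a set of frozensets: each frozenset is one monomial
    (a joint use of base annotations); the outer set is their sum.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[(0.0, set()) for _ in range(m)] for _ in range(n)]
    for i in range(n):
        for j in range(m):
            val, prov = 0.0, set()
            for t in range(k):
                a_val, a_prov = A[i][t]
                b_val, b_prov = B[t][j]
                val += a_val * b_val
                # "Product" of annotations: union of base annotations per monomial.
                prov |= {pa | pb for pa, pb in product(a_prov, b_prov)}
            C[i][j] = (val, prov)
    return C

# 2x2 example: every input entry carries a unique base annotation.
A = [[(1.0, {frozenset({"a11"})}), (2.0, {frozenset({"a12"})})],
     [(0.0, {frozenset({"a21"})}), (1.0, {frozenset({"a22"})})]]
B = [[(3.0, {frozenset({"b11"})}), (0.0, {frozenset({"b12"})})],
     [(1.0, {frozenset({"b21"})}), (4.0, {frozenset({"b22"})})]]
C = annotated_matmul(A, B)
print(C[0][0])  # value 5.0; provenance {a11*b11, a12*b21}
```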


Quantifying Causal Effects on Query Answering in Databases

Babak Salimi, University of Washington; Leopoldo Bertossi, Carleton University; Dan Suciu, University of Washington; Guy Van den Broeck, University of California, Los Angeles

The notion of actual causation, as formalized by Halpern and Pearl, has been recently applied to relational databases, to characterize and compute actual causes for possibly unexpected answers to monotone queries. Causes take the form of database tuples, and can be ranked according to their causal responsibility, a numerical measure of their relevance as a cause for the query answer. In this work we revisit this notion, introducing and making a case for an alternative measure of causal contribution, that of causal effect. In doing so, we generalize the notion of actual cause, in particular, going beyond monotone queries. We show that causal effect provides intuitive and intended results.
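
The intervention idea behind causal effect can be pictured as the expected change in a query answer when a tuple is forced in versus forced out of a random subinstance. The Monte Carlo toy below (a counting join query, with every other tuple kept with probability p) is only an illustration of that intuition and does not reproduce the paper's definitions or algorithms.

```python
# Hypothetical Monte Carlo sketch of an intervention-style "causal effect"
# score for a database tuple; all names and numbers are illustrative.
import random

def causal_effect(tuples, query_count, t, p=0.5, trials=2000, seed=0):
    rng = random.Random(seed)
    def expected(force_in):
        total = 0.0
        for _ in range(trials):
            world = {s for s in tuples if s != t and rng.random() < p}
            if force_in:
                world.add(t)           # intervention: force the tuple in
            total += query_count(world)
        return total / trials
    return expected(True) - expected(False)

# Toy database: Q counts S-tuples (x, y) whose x also appears in R.
R = {("a",), ("b",)}
S = {("a", 1), ("a", 2), ("b", 1)}
db = {("R", r) for r in R} | {("S", s) for s in S}

def q(world):
    r_vals = {rec[1][0] for rec in world if rec[0] == "R"}
    s_vals = {rec[1] for rec in world if rec[0] == "S"}
    return sum(1 for (x, y) in s_vals if x in r_vals)

print(causal_effect(db, q, ("R", ("a",))))  # about 1.0 = p * (S-tuples joining with "a")
```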


Refining SQL Queries based on Why-Not Polynomials

Nicole Bidoit, Université Paris Sud; Melanie Herschel, University of Stuttgart; and Katerina Tzompanaki, Télécom ParisTech

Explaining why some data are not part of a query result has recently gained significant interest. One use of why-not explanations is adapting queries to meet user expectations. We propose an algorithm to automatically generate changes to a query by using Why-Not polynomials, one form of why-not explanation based on query operators. We improve on the state of the art in three aspects: (i) we refine both selection and join predicates, (ii) we guarantee a maximum similarity to the original query, and (iii) we cover all possible cases of why the desired data was missing. A prototype implementation shows the applicability of our approach in practice.
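
As a rough illustration of refinement driven by why-not information, the sketch below relaxes only those selection predicates that reject a desired missing tuple, and only as far as needed, so the refined query stays close to the original. The predicate encoding and example are hypothetical; the actual approach tracks responsible operators symbolically via Why-Not polynomials and also refines join predicates.

```python
# Hypothetical sketch: relax only the selection predicates responsible for
# pruning a missing ("why-not") tuple. Not the paper's algorithm.
def refine_selections(preds, missing):
    """preds: {attr: (op, threshold)} with op in {'>=', '<='}."""
    refined = {}
    for attr, (op, thr) in preds.items():
        v = missing[attr]
        rejects = (op == ">=" and v < thr) or (op == "<=" and v > thr)
        # Relax a predicate only if it rejects the tuple, and only as far
        # as needed, to keep maximum similarity to the original query.
        refined[attr] = (op, v if rejects else thr)
    return refined

original = {"salary": (">=", 50000), "age": ("<=", 40)}
missing_tuple = {"salary": 45000, "age": 35}
print(refine_selections(original, missing_tuple))
# {'salary': ('>=', 45000), 'age': ('<=', 40)}
```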

11:30–12:30 Thursday

Research Session 2: Provenance and Security

Session Chair: Paolo Missier, Newcastle University

Provenance Segmentation

Rui Abreu, Palo Alto Research Center; Dave Archer and Erin Chapman, Galois, Inc.; James Cheney, University of Edinburgh; Hoda Eldardiry, Palo Alto Research Center; Adrià Gascón, University of Edinburgh

Using pervasive provenance to secure mainstream systems has recently attracted interest from industry and government. Recording, storing and managing all of the provenance associated with a system is a considerable challenge. Analyzing the resulting noisy, heterogeneous, continuously-growing provenance graph adds to this challenge, and apparently necessitates segmentation, that is, approximating, compressing or summarizing part or all of the graph in order to identify patterns or features. In this paper, we describe this new problem space for provenance data management, contrast it with related problem spaces addressed by prior work on provenance abstraction and sanitization, and highlight challenges and future directions toward solutions to the provenance segmentation problem.


Towards Secure User-space Provenance Capture

Nikilesh Balakrishnan, Thomas Bytheway, Lucian Carata, Ripduman Sohan, and Andy Hopper, University of Cambridge

System and library call interception performed entirely in user space is a viable technique for provenance capture. The primary advantages of such an approach are that it is lightweight, has a low barrier to adoption, and does not require root privileges to install and configure. However, since both the user’s application and the provenance capture mechanism execute at the same privilege level and as part of the same address space, there is ample opportunity for an untrustworthy user or application to either circumvent or falsify provenance during capture.

We describe a security threat model for such provenance capture mechanisms, discuss various attack vectors to circumvent or falsify provenance collection and finally argue that hardening against such attacks is possible if the application is sandboxed using contemporary techniques in the area of user-space software-based fault isolation.
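
As a minimal stand-in for system and library call interposition, the Python sketch below wraps the built-in open so that every use is logged as a provenance record, and then shows why such capture is untrustworthy without isolation: code running in the same address space can simply restore the original function and bypass the log. The file handling and record fields are illustrative only.

```python
# Hypothetical illustration of user-space capture by call interception and
# of its circumvention from within the same address space.
import builtins, os, tempfile, time

provenance_log = []
_real_open = builtins.open

def logging_open(path, mode="r", *args, **kwargs):
    # Record a provenance event, then delegate to the real call.
    provenance_log.append({"pid": os.getpid(), "time": time.time(),
                           "op": "open", "path": str(path), "mode": mode})
    return _real_open(path, mode, *args, **kwargs)

fd, target = tempfile.mkstemp()      # a scratch file to read
os.close(fd)

builtins.open = logging_open         # capture installed, no root privileges needed
with open(target) as f:              # this access is recorded
    f.read()

builtins.open = _real_open           # an untrustworthy application undoes the hook
with open(target) as f:              # this access leaves no provenance behind
    f.read()

print(provenance_log)                # only the first open was captured
os.remove(target)
```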


Scaling SPADE to "Big Provenance"

Ashish Gehani, Hasanat Kazmi, and Hassaan Irshad, SRI International

Provenance middleware (such as SPADE) lets individuals and applications use a common framework for reporting, storing, and querying records that characterize the history of computational processes and resulting data artifacts. Previous efforts have addressed a range of issues, from instrumentation techniques to applications in the domains of scientific reproducibility and data security. Here we report on our experience adapting SPADE to handle large provenance data sets. In particular, we describe two motivating case studies, several challenges that arose from managing provenance at scale, and our approach to address each concern.

12:30–13:40 Thursday

Lunch

13:40–15:20 Thursday

Research Session 3: Provenance in Workflows and Data Analysis Pipelines

Session Chair: Bertram Ludaescher, University of Illinois

Automatic Versus Manual Provenance Abstractions: Mind the Gap

Pinar Alper, University of Manchester; Khalid Belhajjame, Université Paris Dauphine; Carole A. Goble, University of Manchester

In recent years, the need to simplify provenance or to hide sensitive information in it has given rise to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi-automatically create abstractions of a given workflow description, which are in turn used as filters over the workflow’s provenance traces. An alternative approach commonly adopted by scientists is to build workflows with abstractions embedded into the workflow’s design, for example through subworkflows. This paper reports on a comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically, we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by the ZOOM*UserViews and Workflow Summaries systems. Our comparison shows that the semi-automatic and manual approaches largely overlap from a process perspective, whereas there is a dramatic mismatch in the data artefacts retained in an abstracted account of derivation. We discuss the reasons and suggest future research directions.


Composition and Substitution in Provenance and Workflows

Peter Buneman, Adrià Gascón, and Dave Murray-Rust, University of Edinburgh

It is generally accepted that any comprehensive provenance model must allow one to describe provenance at various levels of granularity. For example, if we have a provenance graph of a process which has nodes describing subprocesses, we need a method of expanding these nodes to obtain a more detailed provenance graph. To date, most of the work that has attempted to formalize this notion has adopted a descriptive approach to the issue: for example, given two provenance graphs, under what conditions is one “finer grained” than another?

In this paper we take an operational approach. For example, given two provenance graphs of interacting processes, what does it mean to compose those graphs? Also, given a provenance graph of a process and a provenance graph of one of its subprocesses, what is the operation of substitution that allows us to expand the graph into a finer-grained graph? As well as provenance graphs, these questions also apply to workflow graphs and other process models that occur in computer science. We propose a model and operations that address these problems. While it is only one of a number of possible solutions, it does indicate that a basic adjustment to provenance models is needed if they are to properly accommodate such an operational approach to composition and substitution.
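
One way to picture the substitution operation is as node expansion over adjacency sets: edges into the coarse node are re-attached to the subgraph's declared inputs, and edges out of it to its declared outputs. The sketch below is an informal illustration under those assumptions, not the formal model proposed in the paper.

```python
# Hypothetical sketch of expanding a coarse provenance node with the
# finer-grained graph of its subprocess.
def substitute(graph, node, subgraph, inputs, outputs):
    """graph, subgraph: {u: set of successors}. Returns the expanded graph."""
    g = {u: set(vs) for u, vs in graph.items() if u != node}
    for u, vs in subgraph.items():               # copy the subgraph in
        g.setdefault(u, set()).update(vs)
    for u, vs in graph.items():                  # re-attach incoming edges
        if u != node and node in vs:
            g[u].discard(node)
            g[u].update(inputs)
    for out in outputs:                          # re-attach outgoing edges
        g.setdefault(out, set()).update(graph.get(node, set()))
    return g

coarse = {"raw": {"align"}, "align": {"report"}, "report": set()}
fine_align = {"trim": {"map"}, "map": set()}
print(substitute(coarse, "align", fine_align, inputs={"trim"}, outputs={"map"}))
# {'raw': {'trim'}, 'report': set(), 'trim': {'map'}, 'map': {'report'}}
```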


From Scientific Workflow Patterns to 5-star Linked Open Data

Alban Gaignard, Nantes Academic Hospital; Hala Skaf-Molli, Université de Nantes; and Audrey Bihouée, L'Institut du Thorax and Université de Nantes

Scientific workflow management systems have been largely adopted by data-intensive science communities. Many efforts have been dedicated to the representation and exploitation of provenance to improve reproducibility in data-intensive sciences. However, few works address the mining of provenance graphs to annotate the produced data with domain-specific context for better interpretation and sharing of results. In this paper, we propose PoeM, a lightweight framework for mining provenance in scientific workflows. PoeM produces linked in silico experiment reports based on workflow runs. PoeM leverages semantic web technologies and reference vocabularies (PROV-O, P-Plan) to generate provenance mining rules and finally assemble linked scientific experiment reports (Micropublications, Experimental Factor Ontology). Preliminary experiments demonstrate that PoeM enables the querying and sharing of Galaxy-processed genomic data as 5-star linked datasets.
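
For a flavour of what a linked experiment report can look like, the sketch below emits a few PROV-O statements about a single workflow run using rdflib (recent versions bundle a PROV namespace). The example namespace, URIs, and run details are invented for illustration; this is not PoeM's vocabulary or its mining rules.

```python
# Hypothetical sketch: describing one workflow run with PROV-O via rdflib.
from rdflib import Graph, Namespace, RDF
from rdflib.namespace import PROV

EX = Namespace("http://example.org/run/")   # illustrative namespace

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

activity = EX["galaxy-run-42"]
inp, out = EX["reads.fastq"], EX["variants.vcf"]

g.add((activity, RDF.type, PROV.Activity))
g.add((inp, RDF.type, PROV.Entity))
g.add((out, RDF.type, PROV.Entity))
g.add((activity, PROV.used, inp))
g.add((out, PROV.wasGeneratedBy, activity))
g.add((out, PROV.wasDerivedFrom, inp))
g.add((activity, PROV.wasAssociatedWith, EX["alice"]))

print(g.serialize(format="turtle"))
```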


Provenance-aware Versioned Dataworkspaces

Xing Niu and Bahareh Sadat Arab, Illinois Institute of Technology; Dieter Gawlick, Zhen Hua Liu, and Vasudha Krishnaswamy, Oracle Corporation; Oliver Kennedy, University at Buffalo; Boris Glavic, Illinois Institute of Technology

Data preparation, curation, and analysis tasks are often exploratory in nature, with analysts incrementally designing workflows that transform, validate, and visualize their input sources. This requires frequent adjustments to data and workflows. Unfortunately, in current data management systems, even small changes can require time- and resource-heavy operations like materialization, manual version management, and re-execution. This added overhead discourages exploration. We present Provenance-aware Versioned Dataworkspaces (PVDs), our vision of a sandboxed environment in which users can apply—and more importantly, easily undo—changes to their data and workflows. A PVD keeps a log of the user’s operations in a light-weight version graph structure. We describe a model for PVDs that admits efficient automatic refresh, merging of histories, reenactment, and automated conflict resolution. We also highlight the conceptual and technical challenges that need to be overcome to create a practical PVD.
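
The version-graph logging that makes changes cheap to undo can be sketched as follows: each operation becomes a node pointing at its parent version, undo simply moves the head back, and abandoned branches remain in the graph. This is only an illustration of that mechanism; it omits reenactment, merging of histories, and conflict resolution.

```python
# Hypothetical sketch of a light-weight version graph with cheap undo.
class VersionGraph:
    def __init__(self):
        self.nodes = {0: {"op": "init", "parent": None}}
        self.head = 0
        self._next = 1

    def apply(self, op):
        """Record an operation as a new version derived from the current head."""
        vid = self._next
        self.nodes[vid] = {"op": op, "parent": self.head}
        self.head, self._next = vid, vid + 1
        return vid

    def undo(self):
        """Move the head back to the parent version; the branch is kept."""
        parent = self.nodes[self.head]["parent"]
        if parent is not None:
            self.head = parent
        return self.head

    def history(self):
        """Operations on the path from the initial version to the head."""
        ops, v = [], self.head
        while v is not None:
            ops.append(self.nodes[v]["op"])
            v = self.nodes[v]["parent"]
        return list(reversed(ops))

pvd = VersionGraph()
pvd.apply("LOAD csv:sales")
pvd.apply("FILTER year >= 2015")
pvd.undo()                       # cheap undo: no re-materialization
pvd.apply("FILTER year >= 2014")
print(pvd.history())             # ['init', 'LOAD csv:sales', 'FILTER year >= 2014']
```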


The Data, They Are A-Changin'

Paolo Missier, Jacek Cala, and Eldarina Wijaya, Newcastle University

The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low-cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. One observation that is often overlooked, however, is that none of these elements is immutable; rather, they all evolve over time. As those datasets change over time, the value of their derivative knowledge may decay unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e., through complete or partial re-computation of some of the underlying processes. In this paper we present an initial model for reasoning about change and re-computations, and show how analysis of detailed provenance of derived knowledge informs re-computation decisions. We illustrate the main ideas through a real-world case study in genomics, namely the interpretation of human variants in support of genetic diagnosis.
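
The core decision loop can be sketched from provenance alone: follow derivation edges from a changed input to find the affected artifacts, then recompute only where an estimated benefit outweighs the cost. The provenance graph, costs, and benefits below are invented for illustration, and the policy is deliberately simplistic.

```python
# Hypothetical sketch of provenance-driven selective re-computation.
from collections import deque

derived_from = {              # provenance: artifact -> inputs it was derived from
    "annotated_variants": {"raw_variants", "clinvar_db"},
    "diagnosis_report": {"annotated_variants", "phenotype"},
}

def affected(changed_input):
    """All artifacts transitively derived from the changed input."""
    uses = {}
    for out, ins in derived_from.items():
        for i in ins:
            uses.setdefault(i, set()).add(out)
    seen, queue = set(), deque([changed_input])
    while queue:
        x = queue.popleft()
        for out in uses.get(x, ()):
            if out not in seen:
                seen.add(out)
                queue.append(out)
    return seen

recompute_cost = {"annotated_variants": 5.0, "diagnosis_report": 1.0}
refresh_benefit = {"annotated_variants": 2.0, "diagnosis_report": 10.0}

for artifact in sorted(affected("clinvar_db")):
    decision = "recompute" if refresh_benefit[artifact] >= recompute_cost[artifact] else "keep"
    print(artifact, decision)
# annotated_variants keep
# diagnosis_report recompute
```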

15:20–15:50 Thursday

Coffee Break

15:50–17:00 Thursday

Discussions
