TaPP 2017 Workshop Program

Friday, June 23, 2017

9:00–9:10

Opening

9:10–10:00

Keynote Address

Interactive Debugging for Data-Intensive Scalable Computing using Data Provenance

Tyson Condie, UCLA

Data-Intensive Scalable Computing (DISC) systems are being leveraged for analyzing large datasets. DISC system programs are authored in a domain-specific language and submitted for execution on a cluster of machines in the form of jobs. Today, DISC users have limited visibility into the logical operations of their jobs during execution. As such, DISC programmers must resort to rudimentary methods—such as trial-and-error debugging—to debug their program logic. BigDebug is our effort to fill this program execution visibility gap by providing an interactive debugging toolkit for Apache Spark. Interestingly, many of the features in BigDebug stem from the use of data provenance. In this talk, I will present BigDebug and Titian, which augments Apache Spark with interactive data provenance query capabilities. More information on the BigDebug project can be found here: https://sites.google.com/site/sparkbigdebug/

10:00–10:30

Break

10:30–12:10

Session 1: Algorithms and Methods

Integrating Approximate Summarization with Provenance Capture

Seokki Lee and Xing Niu, Illinois Institute of Technology; Bertram Ludäscher, University of Illinois at Urbana-Champaign; Boris Glavic, Illinois Institute of Technology

How to use provenance to explain why a query returns a result or why a result is missing has been studied extensively. Recently, we have demonstrated how to uniformly answer these types of provenance questions for first-order queries with negation and have presented an implementation of this approach in our PUG (Provenance Unification through Graphs) system. However, for realistically-sized databases, the provenance of answers and missing answers can be very large, overwhelming the user with too much information and wasting computational resources. In this paper, we introduce an (approximate) summarization technique that generates compact representations of why and why-not provenance. Our technique uses patterns as a summarized representation of sets of elements from the provenance, i.e., successful or failed derivations. We rank these patterns based on their descriptiveness (we use precision and recall as quality measures for patterns) and return only the top-k summaries. We demonstrate how this summarization technique can be integrated with provenance capture to compute summaries on demand and how sampling techniques can be employed to speed up both the summarization and capture steps. Our preliminary experiments demonstrate that this summarization technique scales to large instances of a real-world dataset.
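
To make the ranking step concrete, here is a minimal sketch, in Python, of scoring candidate patterns by precision and recall and keeping only the top-k. The data structures and the F1-based ranking are illustrative assumptions, not code or scoring details from the PUG system.

```python
# Illustrative sketch only: ranking candidate provenance patterns by
# precision and recall and keeping the top-k summaries. Names and data
# structures are hypothetical, not taken from the PUG system.

def rank_patterns(patterns, matches, relevant, k=3):
    """patterns: list of pattern ids
    matches:  dict pattern -> set of provenance elements the pattern covers
    relevant: set of provenance elements the summary should describe
    Returns the k patterns with the highest F1 score (precision/recall trade-off)."""
    scored = []
    for p in patterns:
        covered = matches[p]
        if not covered:
            continue
        precision = len(covered & relevant) / len(covered)
        recall = len(covered & relevant) / len(relevant)
        if precision + recall == 0:
            continue
        f1 = 2 * precision * recall / (precision + recall)
        scored.append((f1, p))
    return [p for _, p in sorted(scored, reverse=True)[:k]]

# Tiny example: three patterns summarising five successful derivations.
relevant = {"d1", "d2", "d3", "d4", "d5"}
matches = {
    "pattern_A": {"d1", "d2", "d3"},                     # precise but partial
    "pattern_B": {"d1", "d2", "d3", "d4", "d5", "x1"},   # complete, one false match
    "pattern_C": {"x1", "x2"},                           # irrelevant
}
print(rank_patterns(list(matches), matches, relevant, k=2))
```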

Expressiveness Benchmarking for System-Level Provenance

Sheung Chi Chan, University of Edinburgh; Ashish Gehani, SRI International; James Cheney, University of Edinburgh; Ripduman Sohan, University of Cambridge; Hassaan Irshad, SRI International

Provenance is increasingly being used as a foundation for security analysis and forensics. System-level provenance can help us trace activities at the level of libraries or system calls, which offers great potential for detecting subtle malicious activities that can otherwise go undetected. However, analysing the raw provenance trace is challenging, due to scale and to differences in data representation among system-level provenance recorders: for example, common queries to identify malicious patterns need to be formulated in different ways on different systems. As a first step toward understanding the similarities and differences among approaches, this paper proposes an expressiveness benchmark consisting of tests intended to capture the provenance of individual system calls. We present work in progress on the benchmark examples for Linux and discuss how they are handled by two different provenance collection tools, SPADE and OPUS.

Corroboration via Provenance Patterns

Lina Barakat, King’s College London, UK; Phillip Taylor and Nathan Griffiths, University of Warwick, UK; Simon Miles, King’s College London, UK

In today’s distributed and heterogeneous systems, provenance data is becoming increasingly important for understanding process flow, tracing how outputs came about, and enabling users to make more informed decisions based on such outputs. However, within such systems, the sources (computational or human) that generate provenance may belong to different stakeholders operating under different policies. Thus, being autonomous and self-interested, these stakeholders may claim untrue data to protect their interests (e.g. to justify bad performance). In response, this paper proposes a corroboration methodology for verifying a claim made by a source, via confirming it against the claims of other sources. In particular, given a claim in PROV, this claim is generalised to varying levels of abstraction, deriving two types of provenance templates, namely confirmation patterns, capturing the information to be confirmed, and witness patterns, capturing the relevant witnesses. These patterns are utilised, respectively, to find relevant evidence among the reports of others that supports the claim and to estimate the reliability degree of the claim. The proposed corroboration methodology is illustrated via a case study in the service provision domain.
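
As a rough illustration of the corroboration idea (not the PROV-template machinery from the paper), the following sketch generalises a claimed record into a pattern with wildcards for the fields to be confirmed and counts how many reports from other sources match it; all field names and values are hypothetical.

```python
# Hypothetical sketch of the corroboration idea: generalise a claimed record
# into a pattern (wildcards for everything except the fields to confirm), then
# count how many reports from other sources match the pattern.

WILDCARD = object()

def generalise(claim, keep_fields):
    """Keep only the fields to be confirmed; wildcard everything else."""
    return {k: (v if k in keep_fields else WILDCARD) for k, v in claim.items()}

def matches(pattern, report):
    return all(v is WILDCARD or report.get(k) == v for k, v in pattern.items())

claim = {"service": "S1", "provider": "A", "outcome": "delivered", "delay": "none"}
reports = [
    {"service": "S1", "provider": "A", "outcome": "delivered", "delay": "none"},
    {"service": "S1", "provider": "A", "outcome": "failed",    "delay": "long"},
    {"service": "S2", "provider": "B", "outcome": "delivered", "delay": "none"},
]

confirmation = generalise(claim, keep_fields={"service", "provider", "outcome"})
support = sum(matches(confirmation, r) for r in reports)
print(f"reliability estimate: {support}/{len(reports)} reports confirm the claim")
```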

Answering Historical What-if Queries with Provenance, Reenactment, and Symbolic Execution

Bahareh Sadat Arab and Boris Glavic, Illinois Institute of Technology

What-if queries predict how the results of an analysis would change based on hypothetical changes to a database. While a what-if query determines the effect of a hypothetical change on a query’s result, it is often unclear how such a change could have been achieved, limiting the practical applicability of such queries. We propose an alternative model for what-if queries where the user proposes a hypothetical change to past update operations. Answering such a query amounts to determining the effect of a hypothetical change to past operations on the current database state (or a query’s result). We argue that such historical what-if queries are often easier to formulate for a user and lead to more actionable insights. In this paper, we take a first stab at answering historical what-if queries. We use reenactment, a declarative replay technique for transactional histories, to evaluate the effect of a modified history on the current database state. Furthermore, we statically analyze the provenance dependencies of a history to limit reenactment to transactions and data affected by a hypothetical change.
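
A minimal sketch of the underlying idea, assuming a toy in-memory table and hand-written update functions: replay the recorded history and a hypothetically modified history, then compare the resulting states. This is not the reenactment implementation described in the paper.

```python
# Minimal sketch of a historical what-if query: replay an update history, once
# as recorded and once with a hypothetical modification, and compare the
# resulting database states. All names are hypothetical.

def apply_history(initial_rows, history):
    """initial_rows: list of dicts; history: list of (predicate, update) pairs."""
    rows = [dict(r) for r in initial_rows]
    for predicate, update in history:
        for row in rows:
            if predicate(row):
                update(row)
    return rows

initial = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]

recorded = [
    (lambda r: r["price"] < 15, lambda r: r.update(price=r["price"] * 2)),
]
# Hypothetical change to the past: the discount threshold was 25 instead of 15.
modified = [
    (lambda r: r["price"] < 25, lambda r: r.update(price=r["price"] * 2)),
]

actual = apply_history(initial, recorded)
hypothetical = apply_history(initial, modified)
# Rows whose current state would differ under the modified history.
print([h for a, h in zip(actual, hypothetical) if a != h])
```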

12:10–13:30

Poster Session & Lunch

Accepted Posters

Integrating Approximate Summarization with Provenance Capture

Seokki Lee and Xing Niu, Illinois Institute of Technology; Bertram Ludäscher, University of Illinois at Urbana-Champaign; Boris Glavic, Illinois Institute of Technology

How to use provenance to explain why a query returns a result or why a result is missing has been studied extensively. Recently, we have demonstrated how to uniformly answer these types of provenance questions for first-order queries with negation and have presented an implementation of this approach in our PUG (Provenance Unification through Graphs) system. However, for realistically-sized databases, the provenance of answers and missing answers can be very large, overwhelming the user with too much information and wasting computational resources. In this paper, we introduce an (approximate) summarization technique that generates compact representations of why and why-not provenance. Our technique uses patterns as a summarized representation of sets of elements from the provenance, i.e., successful or failed derivations. We rank these patterns based on their descriptiveness (we use precision and recall as quality measures for patterns) and return only the top-k summaries. We demonstrate how this summarization technique can be integrated with provenance capture to compute summaries on demand and how sampling techniques can be employed to speed up both the summarization and capture steps. Our preliminary experiments demonstrate that this summarization technique scales to large instances of a real-world dataset.

ProvDIVE: PROV Derivation Inspection and Visual Exploration

Sven Lieber, Ghent University - imec - IDLab, Belgium; Io Taxidou and Peter M. Fischer, University of Freiburg, Germany; Tom De Nies, and Erik Mannens, Ghent University - imec - IDLab, Belgium

In previous work, we presented a method to reconstruct PROV derivations from short social media messages. This method can capture a wide range of information spreading (and thus influence) among users, from explicit attribution like quoting to implicit means like content similarity. When applying this method to real-life datasets containing several million messages (e.g., a popular event), we create derivations in the same order of magnitude. To assess the provenance, it is useful to manually inspect the overall structure, the individual derivations, and the users involved. Such tasks can be supported well by visualization techniques, yet thousands to millions of nodes are notoriously difficult to visualize.

Answering Historical What-if Queries with Provenance, Reenactment, and Symbolic Execution

Bahareh Sadat Arab and Boris Glavic, Illinois Institute of Technology

What-if queries predict how the results of an analysis would change based on hypothetical changes to a database. While a what-if query determines the effect of a hypothetical change on a query’s result, it is often unclear how such a change could have been achieved, limiting the practical applicability of such queries. We propose an alternative model for what-if queries where the user proposes a hypothetical change to past update operations. Answering such a query amounts to determining the effect of a hypothetical change to past operations on the current database state (or a query’s result). We argue that such historical what-if queries are often easier to formulate for a user and lead to more actionable insights. In this paper, we take a first stab at answering historical what-if queries. We use reenactment, a declarative replay technique for transactional histories, to evaluate the effect of a modified history on the current database state. Furthermore, we statically analyze the provenance dependencies of a history to limit reenactment to transactions and data affected by a hypothetical change.

Navigating Versioned Version Trees

David Koop, University of Massachusetts Dartmouth

While version trees usually serve as a data structure to organize different versions of a code repository or workflows, we are not restricted from applying operations to the tree itself. When we create a new version, we create a new node in the version tree and attach it to the parent version with an edge. The actions of adding a new node and edge are mutations to the version tree. Usually, these are the only actions; the version tree is monotonic and no versions are deleted or moved. In version control systems, even when we merge two versions, the effect is the creation of a new node. However, there are times, especially when we want a change to affect multiple versions, when editing the version tree may be more efficient than creating new versions for each of the affected entities. For example, if we introduced a bug in a commit that is included in multiple branches, it would be nice to fix that bug in the original commit and propagate the fix to all of the other branches. Or, if we wish to upgrade a specific module used in multiple workflows to use a new algorithm, it would be nice to replace the action where that module was added rather than create a new workflow (and version) for each workflow.
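
A small sketch of what such a version-tree edit might look like, using a hypothetical node structure (not VisTrails or Git internals): changing the action stored in one node implicitly updates every version in the subtree below it, without creating a new version per affected branch.

```python
# Illustrative sketch (hypothetical data structures): a version tree where each
# node stores the action that produced it. Editing the action in one node
# implicitly changes every version in the subtree below it.

class Version:
    def __init__(self, name, action, parent=None):
        self.name = name
        self.action = action          # e.g. the module added or the commit content
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

    def materialize(self):
        """A version's content is the sequence of actions on its root path."""
        actions = []
        node = self
        while node:
            actions.append(node.action)
            node = node.parent
        return list(reversed(actions))

root = Version("v0", "init")
buggy = Version("v1", "add module M (old algorithm)", root)
branch_a = Version("v2", "tweak parameters", buggy)
branch_b = Version("v3", "add plotting", buggy)

# Edit the version tree itself: fix the action once, all descendants pick it up.
buggy.action = "add module M (new algorithm)"
print(branch_a.materialize())
print(branch_b.materialize())
```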

Provenance Tracing in the Internet of Things

Qi Wang, Wajih Ul Hassan, Adam Bates, and Carl Gunter, University of Illinois at Urbana-Champaign

As the Internet of Things (IoT) continues to proliferate, diagnosing incorrect behavior within increasingly automated homes becomes considerably more difficult. Devices and apps may be chained together in long sequences of trigger-action rules, to the point that from an observable symptom (e.g., an unlocked door) it may be impossible to identify the distantly removed root cause (e.g., a malicious app). We present a platform-centric approach to centralized auditing in the Internet of Things. Our system performs efficient automated instrumentation of IoT apps in order to generate data provenance that provides a holistic explanation of system activities, including malicious behaviors. Here, we prototype our system for the Samsung SmartThings platform, and consider how it can be leveraged to meet the needs of a variety of stakeholders in the IoT ecosystem.
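
The backward tracing step can be illustrated with a generic sketch over a toy provenance graph; the events and edges below are invented for illustration and are not output of the SmartThings prototype.

```python
# Generic sketch of backward tracing over a provenance graph: starting from an
# observed symptom, walk "depends on" edges to find candidate root causes.
from collections import deque

# edge: effect -> list of events it depends on (hypothetical example graph)
depends_on = {
    "door_unlocked":   ["unlock_command"],
    "unlock_command":  ["trigger_rule_17"],
    "trigger_rule_17": ["smoke_event"],
    "smoke_event":     ["malicious_app_spoof"],
}

def trace_back(symptom):
    """Breadth-first walk to all ancestors of the observed symptom."""
    seen, queue, ancestry = set(), deque([symptom]), []
    while queue:
        event = queue.popleft()
        for cause in depends_on.get(event, []):
            if cause not in seen:
                seen.add(cause)
                ancestry.append(cause)
                queue.append(cause)
    return ancestry

print(trace_back("door_unlocked"))   # ends at the distantly removed root cause
```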

Deduplicating Container Provenance with Graph Grammars

Wajih Ul Hassan, University of Illinois at Urbana-Champaign; Mark Lemay, Boston University; Adam Bates, University of Illinois at Urbana-Champaign; Thomas Moyer, University of North Carolina at Charlotte

Container-based virtualization is enabling unprecedented portability and deployment of code, facilitated by online registries like Docker Store and cluster management tools like Docker Swarm. However, present day audit mechanisms were not designed for this emerging paradigm, especially in large-scale clusters of container deployments where the sheer scale of storing and processing audit logs makes system monitoring prohibitively costly. In this poster, we consider a unique adaptation of Regular Grammar principles that enables us to define a provenance model for a container's expected behavior, and subsequently prune all but the unique/anomalous behaviors of a particular container instance. We consider the performance of such an approach, as well as real-world attack scenarios in which this approach enables cluster-wide monitoring of containers.
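
A much-simplified sketch of the deduplication intuition, assuming provenance is a tree and sharing is done by hashing subtrees bottom-up; the poster's graph-grammar model is more general than this, and all structures below are hypothetical.

```python
# Simplified sketch: hash each container's provenance tree bottom-up and store a
# shared subtree only once, so repeated "expected" behaviour across container
# instances collapses while unique behaviour stays visible.

store = {}   # hash -> subtree, shared across all container instances

def dedup(node):
    """node: (label, [children]); returns a hash identifying the shared subtree."""
    label, children = node
    key = (label, tuple(dedup(c) for c in children))
    h = hash(key)
    store.setdefault(h, key)
    return h

container_a = ("run", [("apt-get", []), ("start nginx", [])])
container_b = ("run", [("apt-get", []), ("start nginx", []), ("spawn /bin/sh", [])])

dedup(container_a)
before = len(store)
dedup(container_b)
# Only the subtrees unique to container_b add new entries to the store.
print(len(store) - before, "new provenance nodes for container_b")
```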

13:30–14:15

Invited Talk

Provenance of Computation Meets Persistent Threat Detection: A Progress Report

David Archer, Galois, Inc.

Subtle, carefully crafted cyber threats are designed to avoid detection. By minimizing profiles of network communication, avoiding common malware signatures, and moving slowly and cautiously, such threats penetrate, persist, explore, and exfiltrate without detection. In contrast to that cautious demeanor, their potential for harm is substantial, as we’ve seen repeatedly in high-profile data breaches discovered long after the fact. In this work, we describe our approach to detecting such subtle threats using a combination of statistical anomaly detection and computational provenance exploration. We explain the architecture of our detection system, describe the sensor data model we use as input to our analysis, explore our anomaly detection and provenance computation methods, and present preliminary results from a recent adversarial engagement on real systems under realistic attack.

14:15–15:30

Session 2: Systems and Performance

Automated Provenance Analytics: A Regular Grammar Based Approach with Applications in Security

Mark Lemay, Boston University; Wajih Ul Hassan, University of Illinois at Urbana-Champaign; Thomas Moyer, Nabil Schear, and Warren Smith, MIT Lincoln Laboratory

Provenance collection techniques have been carefully studied in the literature, and there are now several systems to automatically capture provenance data. However, the analysis of provenance data is often left “as an exercise for the reader”. The provenance community needs tools that allow users to quickly sort through large volumes of provenance data and identify records that require further investigation. By detecting anomalies in provenance data that deviate from established patterns, we hope to actively thwart security threats. In this paper, we discuss issues with current graph analysis techniques as applied to data provenance, particularly Frequent Subgraph Mining (FSM). Then we introduce Directed Acyclic Graph regular grammars (DAGr) as a model for provenance data and show how they can detect anomalies. These DAGr provide an expressive characterization of DAGs, and by using regular grammars as a formalism, we can apply results from formal language theory to learn the difference between “good” and “bad” provenance. We propose a restricted subclass of DAGr called deterministic Directed Acyclic Graph automata (dDAGa) that guarantees parsing in linear time. Finally, we propose a learning algorithm for dDAGa, inspired by Minimum Description Length for Grammar Induction.
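
As a rough intuition for the anomaly-detection use, the sketch below learns the transitions seen in known-good traces and flags traces containing unseen transitions. It operates on plain event sequences rather than DAGs, so it is far weaker than the dDAGa formalism proposed in the paper; the traces are invented.

```python
# Much-simplified sketch of the anomaly-detection idea: learn the transitions
# observed in "good" provenance traces and flag any trace containing an unseen
# transition. This uses linear event sequences, not DAG automata.

def learn_transitions(traces):
    allowed = set()
    for trace in traces:
        allowed.update(zip(trace, trace[1:]))
    return allowed

def is_anomalous(trace, allowed):
    return any(step not in allowed for step in zip(trace, trace[1:]))

good_traces = [
    ["fork", "exec", "open", "read", "close", "exit"],
    ["fork", "exec", "open", "write", "close", "exit"],
]
allowed = learn_transitions(good_traces)

suspicious = ["fork", "exec", "open", "read", "connect", "write"]
print(is_anomalous(suspicious, allowed))  # True: "read" -> "connect" was never seen
```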

Provenance in DISC Systems: Reducing Space Overhead at Runtime

Ralf Diestelkämper, Melanie Herschel, and Priyanka Jadhav, IPVS - University of Stuttgart

Data-intensive scalable computing (DISC) systems, such as Apache Hadoop or Spark, make it possible to process large amounts of heterogeneous data. For varying provenance applications, emerging provenance solutions for DISC systems track all source data items through each processing step, imposing a high space and time overhead during program execution.

We introduce a provenance collection approach that reduces the space overhead at runtime by sampling the input data based on the definition of equivalence classes. A preliminary empirical evaluation shows that this approach satisfies many use cases of provenance applications in debugging and data exploration, indicating that provenance collection for a fraction of the input data items often suffices for selected provenance applications. When additional provenance is required, we further outline a method to collect provenance at query time, reusing, when possible, partial provenance already collected during program execution.
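
A minimal sketch of sampling by equivalence class, assuming a simple key function defines the classes; the paper's actual class definitions and capture mechanism are not shown here, and all names are hypothetical.

```python
# Illustrative sketch of sampling provenance capture by equivalence class:
# partition input records by a class key and trace only a bounded sample per
# class instead of every record.
from collections import defaultdict

def sample_for_provenance(records, class_key, per_class=2):
    buckets = defaultdict(list)
    for record in records:
        buckets[class_key(record)].append(record)
    # Keep at most `per_class` representatives of each equivalence class.
    return [r for group in buckets.values() for r in group[:per_class]]

records = [
    {"id": 1, "country": "DE", "value": 10},
    {"id": 2, "country": "DE", "value": 12},
    {"id": 3, "country": "DE", "value": 90},
    {"id": 4, "country": "US", "value": 7},
]
traced = sample_for_provenance(records, class_key=lambda r: r["country"])
print([r["id"] for r in traced])   # provenance is captured only for these inputs
```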

ACCESSPROV: Tracking the Provenance of Access Control Decisions

Frank Capobianco, The Pennsylvania State University; Christian Skalka, The University of Vermont; Trent Jaeger, The Pennsylvania State University

Access control protects security-sensitive operations from access by unauthorized subjects. Unfortunately, access control mechanisms are implemented manually in practice, which can lead to exploitable errors. Prior work aims to find such errors through static analysis, but the correctness of access control enforcement depends on runtime factors, such as the access control policies enforced and adversary control of the program inputs. As a result, we propose to apply provenance tracking to find flaws in access control enforcement. To do so, we track the inputs used in access control decisions to enable detection of flaws. We have developed ACCESSPROV, a Java bytecode analysis tool capable of retrofitting legacy Java applications with provenance hooks. We utilize ACCESSPROV to add provenance hooks at all locations that either may require access control enforcement or may impact access control policy decisions. We evaluate ACCESSPROV on OpenMRS, an open-source medical record system, detecting access control errors while incurring only 2.1% overhead when running the OpenMRS test suite on the instrumented OpenMRS program.
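
As a rough analogue in Python (ACCESSPROV itself instruments Java bytecode), the sketch below wraps an access-control check so that the inputs and outcome of every decision are recorded for later analysis; all names and the policy format are hypothetical.

```python
# Hypothetical sketch of provenance hooks around access control checks: record
# which subject, object, and policy inputs fed each decision so flawed decisions
# can be traced later. Not the ACCESSPROV bytecode instrumentation.
import functools

provenance_log = []

def access_check_hook(check):
    @functools.wraps(check)
    def wrapper(subject, operation, obj, policy):
        decision = check(subject, operation, obj, policy)
        provenance_log.append({
            "subject": subject, "operation": operation, "object": obj,
            "policy_version": policy["version"], "decision": decision,
        })
        return decision
    return wrapper

@access_check_hook
def check_access(subject, operation, obj, policy):
    return operation in policy["allowed"].get(subject, set())

policy = {"version": 3, "allowed": {"nurse": {"read"}}}
check_access("nurse", "write", "patient-42", policy)   # denied, but recorded
print(provenance_log[-1])
```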

15:30–16:00

Break

16:00–17:40

Session 3: New Applications

Provenance-based Recommendations for Visual Data Exploration

Houssem Ben Lahmar and Melanie Herschel, IPVS - University of Stuttgart

Visual data exploration allows users to analyze datasets based on visualizations of interesting data characteristics, to possibly discover interesting information about the data that users are a priori unaware of. In this context, both recommendations of queries selecting the data to be visualized and recommendations of visualizations that highlight interesting data characteristics support users in visual data exploration. So far, these two types of recommendations have been mostly considered in isolation from one another.

We present a recommendation approach for visual data exploration that unifies query recommendation and visualization recommendation. The recommendations rely on two types of provenance, i.e., data provenance (aka lineage) and evolution provenance that tracks users’ interactions with a data exploration system. This paper presents the provenance data model as well as the overall system architecture. We then provide details on our provenance-based recommendation algorithms. A preliminary experimental evaluation showcases the applicability of our solution in practice.

Applying Provenance in APT Monitoring and Analysis: Practical Challenges for Scalable, Efficient and Trustworthy Distributed Provenance

Graeme Jenkinson, Lucian Carata, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, and Robert N. M. Watson, University of Cambridge; Jonathan Anderson and Brian Kidney, Memorial University; Amanda Strnad and Arun Thomas, BAE Systems Inc; George Neville-Neil, Neville-Neil Consulting

Advanced Persistent Threats (APT) are a class of security threats in which a well-resourced attacker targets a specific individual or organisation with a predefined goal. This typically involves exfiltration of confidential material, although increasingly attacks target the encryption or destruction of mission critical data. With traditional prevention and detection mechanisms failing to stem the tide of such attacks, there is a pressing need for new monitoring and analysis tools that reduce both false-positive rates and the cognitive burden on human analysts.

We propose that local and distributed provenance metadata can simplify and improve monitoring and analysis of APTs by providing a single, authoritative sequence of events that captures the context (and side effects) of potentially malicious activities. Provenance metadata allows a human analyst to backtrack from detection of malicious activity to the point of intrusion and, similarly, to work forward to fully understand the consequences. Applying provenance to APT monitoring and analysis introduces some significantly different challenges and requirements in comparison to more traditional applications. Drawing from our experiences working with and adapting the OPUS (Observed Provenance in User Space) system to an APT monitoring and analysis use case, we introduce and discuss some of the key challenges in this space. These preliminary observations are intended to prime a discussion within the community about the design space for scalable, efficient and trustworthy distributed provenance for scenarios that impose different constraints from traditional provenance applications such as workflow and data processing frameworks.

Dataflow Notebooks: Encoding and Tracking Dependencies of Cells

David Koop and Jay Patel, University of Massachusetts Dartmouth

Computational notebooks have seen widespread adoption among scientists in many fields, and allow users to view interactive graphical results inline, to embed text and code together, to organize code into cells, and to selectively edit and re-execute cells. Because they allow quick and recordable analyses, they play an important role in documenting experiments. However, the reproducibility of notebooks can vary significantly due to the ordering of cells or changes in global state that affect the re-execution of those cells. In addition, in many popular notebook environments, cells are tagged with transient identifiers that change when a cell is re-executed so it is impossible to robustly reference other cells. We introduce dataflow notebooks as a method to allow users to explicitly encode dependencies between cells by adding a unique, persistent identifier to each cell and expanding in-code references to results in other cells. In these notebooks, we can both pose and answer provenance queries about dependencies between cells. This permits new notebook operations like downstream updates which, given a change to one cell, allow users to update all cells that may be impacted by the change while leaving all other cells alone. At the same time, dataflow notebooks increase reproducibility and enable greater reuse by making dependencies clear.
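
A minimal sketch of the downstream-update behaviour, assuming cells with persistent identifiers and explicitly listed dependencies; this is illustrative only and not the notebook implementation described in the paper.

```python
# Minimal sketch of a dataflow-notebook downstream update: cells have persistent
# identifiers, declare which cells they read from, and a change to one cell
# re-executes only the cells reachable from it.

cells = {
    "c_load":  {"deps": [],          "code": "data = [1, 2, 3]"},
    "c_stats": {"deps": ["c_load"],  "code": "total = sum(data)"},
    "c_plot":  {"deps": ["c_stats"], "code": "print('total =', total)"},
    "c_other": {"deps": [],          "code": "note = 'unrelated'"},
}

def downstream(changed):
    """All cells that (transitively) depend on the changed cell, including it."""
    affected, frontier = {changed}, [changed]
    while frontier:
        current = frontier.pop()
        for cid, cell in cells.items():
            if current in cell["deps"] and cid not in affected:
                affected.add(cid)
                frontier.append(cid)
    return affected

def run_downstream(changed, env=None):
    env = env if env is not None else {}
    affected = downstream(changed)
    # Execute affected cells in a pre-computed dependency order; others are left alone.
    for cid in ["c_load", "c_stats", "c_plot", "c_other"]:
        if cid in affected:
            exec(cells[cid]["code"], env)
    return env

run_downstream("c_load")   # re-runs c_load, c_stats, c_plot but not c_other
```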

Visualizing Provenance using Comics

Andreas Schreiber, German Aerospace Center (DLR); Regina Struminski, University of Applied Sciences Düsseldorf

Understanding how a piece of data was produced, where it was stored, and by whom it was accessed, is crucial information in many processes. To understand the trace of data, the provenance of that data can be recorded and analyzed. But it is sometimes hard to understand this provenance information, especially for people who are not familiar with software or computer science. To close this gap, we present a visualization technique for data provenance using comic strips. Each strip of the comic represents an activity of the provenance graph, for example, using an app, storing or retrieving data on a cloud service, or generating a diagram. The comic strips are generated automatically using recorded provenance graphs. These provenance comics are intended to enable people to understand the provenance of their data and realize crucial points more easily.

17:40–18:00

Town Hall and Conclusions

18:00–20:00

Dinner Reception