Technical Sessions

To access a presentation's content, please click on its title below.

All sessions will take place in Grand Ballroom J unless otherwise noted.

 

Tuesday, April 2, 2013

8:30 a.m.–9:00 a.m. Tuesday

Continental Breakfast

Grand Ballroom Foyers

9:00 a.m.–9:15 a.m. Tuesday

Opening Remarks

Program Co-Chairs: Alexandra Meliou, University of Massachusetts, Amherst, and Val Tannen, University of Pennsylvania

9:15 a.m.–10:30 a.m. Tuesday

Keynote Address

Managing the When-provenance of Data: Opportunities and Challenges

Wang-Chiew Tan, University of California, Santa Cruz

Available Media
10:30 a.m.–11:00 a.m. Tuesday

Break

Grand Ballroom Foyers

11:00 a.m.–12:30 p.m. Tuesday

Reproducibility and Audits

Session Chair: Paolo Missier, Newcastle University

ReproZip: Using Provenance to Support Computational Reproducibility

Fernando Chirigati, Polytechnic Institute of NYU; Dennis Shasha, New York University; Juliana Freire, Polytechnic Institute of NYU

We describe ReproZip, a tool that makes it easier for authors to publish reproducible results and for reviewers to validate these results. By tracking operating system calls, ReproZip systematically captures detailed provenance of existing experiments, including data dependencies, libraries used, and configuration parameters. This information is combined into a package that can be installed and run in a different environment. An important goal we have for ReproZip is usability. Besides simplifying the creation of reproducible results, the system also helps reviewers: because the package is self-contained, reviewers need not install any additional software to run the experiments. In addition, ReproZip generates a workflow specification for the experiment. This not only enables reviewers to execute the specification within a workflow system to explore the experiment and try different configurations, but also allows the provenance kept by the workflow system to facilitate communication between reviewers and authors.

Available Media
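
A minimal sketch of the capture idea the abstract describes: derive a file-dependency manifest by tracing a program's system calls. This assumes a Linux host with strace installed; ReproZip's actual capture and packaging are far more thorough.

```python
# Sketch only: approximate ReproZip-style dependency capture by running a
# command under strace and collecting every file it successfully opened.
import re
import subprocess
import sys

def capture_dependencies(command):
    """Run `command` under strace and return the set of files it opened."""
    trace_file = "trace.log"
    subprocess.run(
        ["strace", "-f", "-e", "trace=open,openat", "-o", trace_file] + command,
        check=True,
    )
    opened = set()
    pattern = re.compile(r'open(?:at)?\(.*?"([^"]+)"')  # path in first string arg
    with open(trace_file) as f:
        for line in f:
            match = pattern.search(line)
            # Skip failed opens, which strace reports as "= -1 ENOENT ...".
            if match and "= -1" not in line:
                opened.add(match.group(1))
    return opened

if __name__ == "__main__":
    for path in sorted(capture_dependencies(sys.argv[1:])):
        print(path)  # candidate contents for a self-contained package
```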

Using Provenance for Repeatability

Quan Pham, University of Chicago; Tanu Malik and Ian Foster, University of Chicago and Argonne National Laboratory

We present Provenance-To-Use (PTU), a tool that minimizes computation time during repeatability testing. Authors can use PTU to build a package that includes their software program and a provenance trace of an initial reference execution. Testers can select a subset of the package’s processes for a partial deterministic replay—based, for example, on their compute, memory and I/O utilization as measured during the reference execution. Using the provenance trace, PTU guarantees that events are processed in the same order using the same data from one execution to the next. We show the efficiency of PTU for conducting repeatability testing of workflow-based scientific programs. 

Available Media
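
A toy model of the partial deterministic replay the abstract describes (not PTU's implementation): each process in the reference run is recorded with its inputs, output, and the output data it produced; at test time only the processes a tester selects are re-executed, and everything else is served from the recorded trace, so downstream steps see identical data in the same order.

```python
# Sketch of provenance-guided partial replay. Steps are (name, func,
# input_names, output_name); the "store" maps artifact names to values.
def reference_run(steps, store):
    """Execute every step, recording each output value as provenance."""
    trace = []
    for name, func, inputs, output in steps:
        store[output] = func(*(store[i] for i in inputs))
        trace.append((name, inputs, output, store[output]))
    return trace

def partial_replay(steps, trace, store, selected):
    """Re-execute only `selected` steps; replay the rest from the trace."""
    funcs = {name: func for name, func, _, _ in steps}
    for name, inputs, output, recorded in trace:
        if name in selected:
            store[output] = funcs[name](*(store[i] for i in inputs))
        else:
            store[output] = recorded  # deterministic: reuse recorded data

if __name__ == "__main__":
    steps = [
        ("clean", lambda xs: [x for x in xs if x is not None], ["raw"], "clean"),
        ("scale", lambda xs: [10 * x for x in xs], ["clean"], "scaled"),
    ]
    store = {"raw": [1, None, 3]}
    trace = reference_run(steps, dict(store))
    replay_store = dict(store)
    partial_replay(steps, trace, replay_store, selected={"scale"})
    print(replay_store["scaled"])  # [10, 30], without re-running "clean"
```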

Supporting Undo and Redo in Scientific Data Analysis

Xiang Zhao, University of Massachusetts, Amherst; Emery R. Boose, Harvard University; Yuriy Brun, University of Massachusetts, Amherst; Barbara Staudt Lerner, Mount Holyoke College; Leon J. Osterweil, University of Massachusetts, Amherst

This paper presents a provenance-based technique to support undoing and redoing data analysis tasks. Our technique targets scientists who experiment with combinations of approaches to processing raw data into presentable datasets. Raw data may be noisy and in need of cleaning, it may suffer from sensor drift that requires retrospective calibration and data correction, or it may need gap-filling due to sensor malfunction or environmental conditions. Different raw datasets may have different issues requiring different kinds of adjustments, and each issue may potentially be handled by different approaches. Thus, scientists must often experiment with different sequences of approaches. In our work, we show how provenance information can be used to facilitate this kind of experimentation with scientific datasets. We describe an approach that supports the ability to (1) undo a set of tasks while setting aside the artifacts and consequences of performing those tasks, (2) replace, remove, or add a data-processing technique, and (3) automatically redo those set-aside tasks that are consistent with the changed technique. We have implemented our technique and demonstrate its utility with a case study of a common sensor-network data-processing scenario, showing how our approach can reduce the cost of changing intermediate data-processing techniques in a complex, data-intensive process.

Available Media
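
A minimal sketch of the replace-and-redo idea for a linear pipeline (the paper handles richer process structure): because every step's output artifact is retained, swapping one technique recomputes only from that point on, and the set-aside downstream steps are redone automatically.

```python
# Sketch: a pipeline whose intermediate artifacts double as provenance,
# so replacing one step redoes only the affected suffix.
class Pipeline:
    def __init__(self, steps, raw):
        self.steps = list(steps)        # list of (name, func)
        self.raw = raw
        self.artifacts = []             # artifacts[i] = output of steps[i]
        self._run_from(0)

    def _run_from(self, k):
        data = self.raw if k == 0 else self.artifacts[k - 1]
        del self.artifacts[k:]          # set aside stale downstream artifacts
        for name, func in self.steps[k:]:
            data = func(data)
            self.artifacts.append(data)

    def replace(self, name, func):
        """Swap in a new technique and redo only the steps it affects."""
        k = next(i for i, (n, _) in enumerate(self.steps) if n == name)
        self.steps[k] = (name, func)
        self._run_from(k)

if __name__ == "__main__":
    p = Pipeline(
        [("fill_gaps", lambda xs: [x if x is not None else 0 for x in xs]),
         ("calibrate", lambda xs: [x + 1 for x in xs])],
        raw=[4, None, 6],
    )
    p.replace("fill_gaps", lambda xs: [x if x is not None else -1 for x in xs])
    print(p.artifacts[-1])  # calibrate is redone automatically: [5, 0, 7]
```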

Android Provenance: Diagnosing Device Disorders

Nathaniel Husted, Indiana University; Sharjeel Qureshi, Dawood Tariq, and Ashish Gehani, SRI International

Mobile devices are a ubiquitous part of our daily lives. Smartphones are being used in many areas where data privacy and integrity are a concern. One threat to integrity and privacy is the existence of bugs in operating system code. Little has been done to provide tools for system-wide runtime profiling and accountability. We propose operating system auditing and data provenance tracking as mechanisms for generating useful traces of system activity and information flow on mobile devices. The goal of these traces is to enable debugging and profiling of complicated system issues such as increased power drain. We contribute a prototype system for Android-based mobile devices and provide realistic examples of how our system can be used for debugging.

Available Media
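
To illustrate the kind of analysis such traces enable, here is a small sketch of one debugging use the abstract mentions: ranking processes by system activity to find a likely culprit for increased power drain. The log format and event names below are invented for illustration; the paper's prototype works from real Android audit data.

```python
# Sketch: aggregate per-process system-activity events from an audit trace
# and rank processes by activity volume (a rough power-drain heuristic).
from collections import Counter

def rank_by_activity(events):
    """events: iterable of (timestamp, process, event_type) tuples."""
    return Counter(proc for _, proc, _ in events).most_common()

if __name__ == "__main__":
    log = [
        (1.0, "com.example.sync", "wakelock_acquire"),
        (1.2, "com.example.sync", "network_send"),
        (1.3, "com.example.game", "wakelock_acquire"),
        (1.4, "com.example.sync", "wakelock_acquire"),
    ]
    for proc, n in rank_by_activity(log):
        print(proc, n)  # com.example.sync dominates this trace
```
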
12:30 p.m.–2:00 p.m. Tuesday

Workshop Luncheon

Grand Ballroom HI

2:00 p.m.–3:30 p.m. Tuesday

Provenance Capture and Analysis

Session Chair: Juliana Freire, New York University

Provenance for Data Mining

Boris Glavic, Illinois Institute of Technology; Javed Siddique, Periklis Andritsos, and Renee J. Miller, University of Toronto

Data mining aims at extracting useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output summarizing the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful, if the user is not able to understand on which input data they are based and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use cases for provenance in data mining. To illustrate our ideas, we present a more detailed discussion of these concepts for two typical data mining algorithms: frequent item set mining and multi-dimensional scaling.

Available Media
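
A sketch of one use case the paper identifies, for one of its two example algorithms: alongside each frequent itemset, retain its provenance, namely the IDs of the input transactions that support it, so a mining result can be traced back to the data it summarizes. The enumeration below is deliberately naive for clarity; real miners use Apriori or FP-growth.

```python
# Sketch: frequent itemset mining that returns, for each frequent itemset,
# the supporting transaction IDs as its provenance.
from itertools import combinations

def frequent_itemsets_with_provenance(transactions, min_support):
    """transactions: dict of tid -> set of items. Returns itemset -> tids."""
    items = sorted(set().union(*transactions.values()))
    results = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            tids = {t for t, s in transactions.items() if set(candidate) <= s}
            if len(tids) >= min_support:
                results[candidate] = tids  # provenance: supporting transactions
                found = True
        if not found:
            break  # none of this size, so none larger either (Apriori property)
    return results

if __name__ == "__main__":
    db = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b"}}
    for itemset, tids in frequent_itemsets_with_provenance(db, 2).items():
        print(itemset, "supported by", sorted(tids))
```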

Provenance Analyzer: Exploring Provenance Semantics with Logic Rules

Saumen Dey, Sean Riddle, and Bertram Ludäscher, University of California, Davis

Abstract not available.

Available Media

Declaratively Processing Provenance Metadata

Scott Moore, Harvard University; Ashish Gehani, SRI International

Systems that gather fine-grained provenance metadata must process and store large amounts of information. Filtering this metadata as it is collected has a number of benefits, including reducing the amount of persistent storage required and simplifying subsequent provenance queries. However, writing these filters in a procedural language is verbose and error prone. We propose a simple declarative language for processing provenance metadata and evaluate it by translating filters implemented in SPADE, an open-source provenance collection platform.

Available Media
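
The paper's language targets filters in SPADE; the sketch below only conveys the general shape of the declarative approach: rules are data (field, operator, value, action) rather than procedural code, and a tiny evaluator applies them to each provenance record as it streams in. All rule syntax here is invented.

```python
# Sketch: declarative first-match-wins filtering of provenance records.
OPS = {"==": lambda a, b: a == b, "!=": lambda a, b: a != b}

RULES = [
    ("type", "==", "FileRead", "drop"),   # discard noisy read events
    ("user", "!=", "root", "keep"),
]

def filter_stream(records, rules=RULES, default="keep"):
    for record in records:
        action = default
        for field, op, value, rule_action in rules:
            if field in record and OPS[op](record[field], value):
                action = rule_action
                break  # first matching rule wins
        if action == "keep":
            yield record

if __name__ == "__main__":
    stream = [
        {"type": "FileRead", "user": "alice"},
        {"type": "FileWrite", "user": "alice"},
    ]
    print(list(filter_stream(stream)))  # only the FileWrite survives
```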

OPUS: A Lightweight System for Observational Provenance in User Space

Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, and Andy Hopper, University of Cambridge

A variety of current provenance systems address the challenges of provenance capture, storage, and query. However, they require special setup and configuration, do not capture all I/O operations, and limit themselves to specific, specialised platforms. In this paper we propose the design of a data provenance capture and query tool called OPUS. OPUS works entirely in user space, is lightweight, and requires minimal user intervention. OPUS is based on a formal model for versioning provenance objects that enables a succinct, complete representation of I/O operations in a manner that abstracts them from the details of the underlying operating system.

Available Media
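
A toy rendering of the versioning idea (the paper defines a formal model; this only conveys the flavor): each write to a provenance object creates a new version linked to its writer, so the object's full I/O history is preserved without mutating past state.

```python
# Sketch: a provenance object whose writes append immutable version records.
class VersionedObject:
    def __init__(self, name):
        self.name = name
        self.versions = []              # list of (version, writer) records

    def write(self, writer):
        version = len(self.versions) + 1
        self.versions.append((version, writer))
        return version

    def history(self):
        return [(self.name, v, w) for v, w in self.versions]

if __name__ == "__main__":
    log = VersionedObject("/tmp/results.csv")
    log.write("analysis.py")
    log.write("plot.py")
    print(log.history())
    # [('/tmp/results.csv', 1, 'analysis.py'), ('/tmp/results.csv', 2, 'plot.py')]
```
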
3:30 p.m.–5:00 p.m. Tuesday

Poster Session with Refreshments

Grand Ballroom I

 

Wednesday, April 3, 2013

8:45 a.m.–9:15 a.m. Wednesday

Continental Breakfast

Grand Ballroom Foyers

9:15 a.m.–10:30 a.m. Wednesday

Keynote Address

World Domination Through Provenance

Margo Seltzer, Harvard School of Engineering and Applied Sciences and Oracle

Available Media
10:30 a.m.–11:00 a.m. Wednesday

Break

Grand Ballroom Foyers

11:00 a.m.–12:30 p.m. Wednesday

Provenance Models and Applications

Session Chair: Boris Glavic, Illinois Institute of Technology

D-PROV: Extending the PROV Provenance Model with Workflow Structure

Paolo Missier, Newcastle University; Saumen Dey, University of California, Davis; Khalid Belhajjame, University of Manchester; Victor Cuevas-Vicenttín and Bertram Ludäscher, University of California, Davis

This paper presents an extension to the W3C PROV provenance model, aimed at representing process structure. Although the modelling of process structure is out of the scope of the PROV specification, it is beneficial when capturing and analyzing the provenance of data that is produced by programs or other formally encoded processes. In the paper, we motivate the need for such an extended model in the context of an ongoing large data federation and preservation project, DataONE, where provenance traces of scientific workflow runs are captured and stored alongside the data products. We introduce new provenance relations for modelling process structure along with their usage patterns, and present sample queries that demonstrate their benefit.

Available Media
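
A hedged illustration of the paper's theme: keep prospective structure (how workflow steps are wired) alongside retrospective PROV-style trace triples, and query across both layers. Of the relation names below, only wasGeneratedBy is standard PROV vocabulary; connectsTo and instanceOf are invented for this sketch and are not the paper's relations.

```python
# Sketch: a two-layer triple store mixing workflow structure and a run trace.
PROSPECTIVE = [
    ("clean", "connectsTo", "analyze"),          # workflow wiring
]
RETROSPECTIVE = [
    ("run1:clean", "instanceOf", "clean"),
    ("run1:analyze", "instanceOf", "analyze"),
    ("out.csv", "wasGeneratedBy", "run1:analyze"),
]

def generating_step(artifact):
    """Which workflow-level step (not just which run) produced an artifact?"""
    triples = PROSPECTIVE + RETROSPECTIVE
    activity = next(o for s, p, o in triples
                    if s == artifact and p == "wasGeneratedBy")
    return next(o for s, p, o in triples
                if s == activity and p == "instanceOf")

if __name__ == "__main__":
    print(generating_step("out.csv"))  # "analyze": a query spanning both layers
```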

IPAPI: Designing an Improved Provenance API

Lucian Carata, Ripduman Sohan, Andrew Rice, and Andy Hopper, University of Cambridge

We investigate the main limitations imposed by existing provenance systems in the development of provenance aware applications. In the case of disclosed provenance APIs, most of those limitations can be traced back to the inability to integrate provenance from different sources, layers and of different granularities into a coherent view of data production. We consider possible solutions in the design of an Improved Provenance API (IPAPI), based on a general model of how different system entities interact to generate, accumulate or propagate provenance. The resulting architecture enables a whole new range of provenance capture scenarios, for which available APIs do not provide adequate support.

Available Media
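
A sketch of the integration problem the abstract describes, with an interface invented purely for illustration (IPAPI's actual design differs): provenance disclosed by different layers, such as the application, a library, and the OS, flows through one API so that records about the same data item can be stitched into a single coherent view.

```python
# Sketch: one API surface accepting provenance disclosures from many layers.
from collections import defaultdict

class ProvenanceAPI:
    def __init__(self):
        self.records = defaultdict(list)   # data item -> cross-layer records

    def disclose(self, item, layer, detail):
        """Any layer can disclose provenance about any data item."""
        self.records[item].append((layer, detail))

    def view(self, item):
        """A coherent, cross-layer account of how `item` was produced."""
        return sorted(self.records[item])

if __name__ == "__main__":
    api = ProvenanceAPI()
    api.disclose("report.pdf", "os", "written by pid 4242")
    api.disclose("report.pdf", "app", "rendered from template v3")
    for layer, detail in api.view("report.pdf"):
        print(f"[{layer}] {detail}")
```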

HadoopProv: Towards Provenance as a First Class Citizen in MapReduce

Sherif Akoush, Ripduman Sohan, and Andy Hopper, University of Cambridge

We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in the Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. Provenance graphs are later joined on matching intermediate keys of the Map and Reduce provenance files. In our prototype implementation, HadoopProv has an overhead below 10% on typical job runtime (<7% and <30% average temporal increase on Map and Reduce tasks, respectively). Additionally, we demonstrate that provenance queries are serviceable in O(k log n) time, where n is the number of records per Map task and k is the number of Map tasks in which the key appears.

Available Media
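
A sketch of the deferred-join idea behind the query bound, with a data layout invented for illustration: if each Map task writes its (intermediate key, input record ID) pairs sorted by key, a query for one key binary-searches each relevant task's pairs, O(log n) per task and O(k log n) over the k tasks containing the key, and the join with Reduce-side provenance happens only at query time.

```python
# Sketch: query-time join over per-task, key-sorted map provenance files.
import bisect

def map_provenance_lookup(sorted_pairs, key):
    """sorted_pairs: list of (key, record_id) tuples sorted by key."""
    lo = bisect.bisect_left(sorted_pairs, (key,))
    hi = bisect.bisect_right(sorted_pairs, (key, float("inf")))
    return [rid for _, rid in sorted_pairs[lo:hi]]

def query(key, map_task_files):
    """Collect the input records behind `key` across all Map tasks."""
    inputs = []
    for pairs in map_task_files:        # the k tasks in which the key appears
        inputs.extend(map_provenance_lookup(pairs, key))
    return inputs

if __name__ == "__main__":
    task0 = [("apple", 1), ("banana", 2)]
    task1 = [("apple", 7), ("cherry", 9)]
    print(query("apple", [task0, task1]))  # input records [1, 7]
```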

A Provenance Model for Key-value Systems

Devdatta Kulkarni

In this paper we present the key-value provenance model (KVPM). In KVPM, provenance information can be collected for both the data and the data's schema in a key-value system. Collection of the information is application-driven, and it can be collected at different levels of the data-model hierarchy. We present the capabilities of the KVPM system along with its design, its implementation for Cassandra, and an initial evaluation.

Available Media
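
A toy key-value store with application-driven provenance capture in the spirit the abstract describes. The paper's system targets Cassandra; this standalone sketch only shows the shape: the application decides, per write, whether to record provenance, and each record carries the level of the data-model hierarchy it describes.

```python
# Sketch: opt-in, per-write provenance capture for a key-value store.
import time

class ProvKVStore:
    def __init__(self):
        self.data = {}
        self.provenance = []            # (when, level, key, who) records

    def put(self, key, value, who, level="row", record=True):
        self.data[key] = value
        if record:                      # application-driven: capture is opt-in
            self.provenance.append((time.time(), level, key, who))

    def provenance_of(self, key):
        return [p for p in self.provenance if p[2] == key]

if __name__ == "__main__":
    store = ProvKVStore()
    store.put("user:42:name", "Ada", who="import_job", level="column")
    store.put("tmp:cache", "x", who="cache", record=False)  # no provenance
    print(store.provenance_of("user:42:name"))
```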
