A Formal Model of Provenance in Distributed Systems

Issam Souilah
University of Southampton, UK

Adrian Francalanza
University of Malta, Malta

Vladimiro Sassone
University of Southampton, UK

Abstract

We present a formalism for provenance in distributed systems based on the $\pi$ -calculus. Its main feature is that all data products are annotated with metadata representing their provenance. The calculus is given a provenance tracking semantics, which ensures that data provenance is updated as the computation proceeds. The calculus also enjoys a pattern-restricted input primitive which allows processes to decide what data to receive and what branch of computation to proceed with based on the provenance information of data. We give examples to illustrate the use of the calculus and discuss some of the semantic properties of our provenance notion. We conclude by reviewing related work and discussing directions for future research.

1 Introduction

As a concept, provenance indicates the source and derivation of an object, and is also known as origin, lineage and pedigree. More concretely, provenance refers to a record of such source and derivation [19]. This record may include information about the ownership, influences, contributions and any other historical or contextual information which may be deemed relevant and useful. Provenance has many applications such as auditing, detecting errors and ensuring reproducibility of results. It is also central to the trust one places in data, since it can be used as an indicator of quality, especially in a setting where independent verification of other attributes may not be possible.

Provenance tracking is the problem of recording provenance information, and has been studied in a wide range of settings, including databases [4,8,11,5,7], scientific computing [13,21,12,22] and file systems [20]. Initially, most of the work on provenance relied on intuitive and informal concepts such as influences, contributes to, and depends on in defining what provenance means. These notions were then used to provide mostly ad hoc implementations, with no formal guarantees as to their correctness or adequacy. However, lately there have been a surge of interest in underpinning the more theoretical principles of provenance [4,11,7]. Such body of work aims to establish a mathematical and semantic basis for provenance, which is important if we are to compare different notions of provenance and assess the correctness of their implementations. The present work falls in this latter line of research and aims to provide a formal study of provenance-based trust in concurrent and distributed systems.

We choose the asynchronous $\pi$ -calculus [15,3], a version of the $\pi$ -calculus [18] where message output is non-blocking, as the formalism for modelling distributed systems. The choice of asynchrony is motivated mainly by the envisaged applications of our calculus and its impact on possible implementations, however, it has minimal effect on our results. In order to be able to refer to individual agents, we extend this basic formalism with explicit identities. These identities can be thought of as representing units of trust, and are merely labels for identifying processes with no effect on communication (cf. localities in the DPI calculus [14]). With this extension we can express the essence of computation in a distributed setting, whereby multiple agents interact by exchanging messages. For instance the code:

$\displaystyle a[n \langle v_1\rangle ]\quad \vert \quad b[n \langle v_2\rangle ] \vert \quad c[n (x).P]$

denotes a system composed of three principals:

and

that are sending values

and

on the same channel

and

that is attempting to read a value from the channel

and use it in the continuation code

. Interestingly, from the point of view of

, there exists a level of non-determinism, that is a market of values on channel

from which

is free to choose. However, assuming

has no way of assessing the quality of the different values available on channel

, it cannot decide which of the two values

to consume. In such situations, provenance may be used as a measure of the quality of data. For example, the three principals may agree on a convention whereby the senders (producers of data) attach a provenance tag. Thus the code may be modified to:

$\displaystyle a[n \langle a,v_1\rangle ] \vert b[n \langle b,v_2\rangle ] \vert c[n (x,y).\textsf{if} x=a \textsf{then} P \textsf{else} Q ]$

Now, the consumer of the data,

, can determine where the data originated from and branch accordingly. Even though this encoding works, it carries two major disadvantages:
$\begin{inparaenum}[\itshape\upshape (1\upshape )] \item it is cumbersome and mud... ...ere to these provenance conventions, which cannot be enforced. \end{inparaenum}$
Crucially, the second point leads to circular reasoning with respect to trust. For instance, in the above code nothing stops principal

from forging

's identity using $b[n \langle a,v_2\rangle ]$ . This simple problem may be solved using a digital signature scheme for example, however, in general the picture can be far more complicated as messages may carry, not only data authored by the agent itself, but also data obtained from other agents. Moreover, agents may consult other agents about where to get data from and what to do with it. These may also be of interest to consumers of data in order to be able to judge the trustworthiness of data.

A distinguishing feature of our distributed $\pi$ -calculus extension is the use of provenance annotated data. Every value, , is annotated with its provenance, $\kappa$ , and denoted as $v\colon\kappa$ . The provenance $\kappa$ is a sequence of `'-delimited events. The events in a provenance sequence $e_{1} ;\ldots ;e_{n}$ are chronologically ordered with $e_{1}$ being the most recent event. An event $e_{i}$ may be an output event $a!\kappa$ , which says that the value has been sent by an agent on a channel whose provenance is $\kappa$ , or an input event $a?\kappa$ , which says that the value has been received by an agent on a channel whose provenance is $\kappa$ ; the rationale here is that in the $\pi$ -calculus channels are data too.

We propose a two-tiered framework which automates the process of provenance tracking and separates it from the actual computation.¹ This two-tiered framework is manifested by the provenance tracking reduction semantics, which, in addition to describing the interaction between agents, keeps the provenance of values up-to-date. For instance, the rule for sending data on a particular channel has two facets:

$\displaystyle a[m\mathord{:}\kappa_{m} \langle v\colon\kappa_{v}\rangle ] \mat... ...ightarrow} m\langle\!\langle v\colon a!\kappa_{m} ;\kappa_{v}\rangle\!\rangle$

From the computational perspective, the rule represents the first of a two-step communication process: a principal

wanting to output a value $v\colon\kappa_{v}$ on a channel $m\colon\kappa_{m}$ generates a packaged message $m\langle\!\langle v\colon a!\kappa_{m} ;\kappa_{v}\rangle\!\rangle$ , where

represents the address and $v\colon a!\kappa_{m} ; \kappa_{v}$ the annotated data content. From the provenance tracking perspective, the rule also describes how the provenance of the value

is updated to $a!\kappa_{m} ;\kappa_{v}$ after the output operation. This tells us that the value has been most recently sent by agent

on a channel whose provenance is $\kappa_{m}$ .

Dually, the rule for receiving values, the second step in communication, has two facets as well:

$\displaystyle \inference{\kappa_v \models \pi}{ b[m\mathord{:}\kappa_{m} ( \pi... ... \rangle\!\rangle \rightarrow b[P\{^{v\colon b?\kappa_m ;\kappa_v }/_{x}\}] }$

On the one hand, the rule states that principal

can input a value from channel

and proceed as

if there exists a packaged value with destination

in the system; otherwise it blocks until such a packaged value becomes available. More importantly though, this rule also describes how the provenance information attached to the packaged data, $\kappa_v$ , is first used for vetting purposes before consumption, and afterwards updated. More specifically, the data input on channel

only progresses if the provenance of the packaged data, $\kappa_v$ , passes the input test $\pi$ ; this is denoted as $\kappa_v \models \pi$ . Should this test succeed, the input is allowed to occur and, in addition, the provenance attached to value

is updated to $v\colon b?\kappa_m ;\kappa_v$ in the continuation

, thereby recording the fact that the value was received by principal

on some channel with provenance $\kappa_m$ .

In addition to proposing a provenance-based calculus, we also formalise a notion of correctness for our provenance annotations. We interpret the provenance $\kappa$ of an annotated value $v\colon\kappa$ as a partial order of assertions about events that occurred in the system, relating specifically to value . For example, the provenance of annotated value $v\colon a!\kappa_{m} ; \kappa_{v}$ tells us that has been sent by principal on some channel with provenance $\kappa_{m}$ , and that before this, the events described by $\kappa_{m}$ and $\kappa_{v}$ took place, without restricting the relative ordering between the events in $\kappa_{m}$ and $\kappa_{v}$ ; the information in the recorded event $a!\kappa_{m}$ is also partial because it does not specify the channel on which was sent. To prove correctness, we devise a technique whereby a decorated version of our reduction semantics constructs a global log which records a total ordering of past events relating to all values. The provenance of an annotated value is then correct if its partial order denotation is, in some sense, consistent with the total ordering of events specified by the global log.

The rest of the paper is structured as follows. In §2, we give an overview of the calculus, review its syntax and semantics, and provide examples to illustrate some of its features. We discuss what past information is exactly recorded by the provenance tracking reduction relation and study its correctness properties in §3. In §4, we review related work. In §5, we discuss possibilities for future work and conclude the paper.

2 The Provenance Calculus

The formalism we consider for studying provenance is a variant of the asynchronous $\pi$ -calculus, which we extend with the following four main features:

Explicit identities: every process is located at a named principal, which is used for provenance and does not otherwise affect communication between processes.
Annotated data: every value is annotated with its provenance, which is updated as the computation proceeds to reflect what happened to the value.
Provenance tracking: we specify the meaning of the calculus by a provenance tracking reduction semantics which, in addition to describing the possible interactions between principals, also tracks the provenance of values as they are exchanged between principals.
Pattern restricted input: In order to allow principals to make use of the provenance information of values, we adopt a version of guarded choice that is restricted to inputs on the same channel with possibly differing patterns (similar in spirit to the one used in [6]). This allows principals to restrict the set of values they are willing to receive on a particular channel to those that satisfy the patterns specified. It also allows them to branch to different continuations based on the provenance information.

2.1 Syntax

The formal syntax of the calculus is given in Table 1. We assume a set $\mathcal{X}$ of variables, ranged over by $x, y, z, \ldots$ , a set $\mathcal{C}$ of channel names, ranged over by $l, m, n, \ldots$ , and a set $\mathcal{A}$ of principal names, ranged over by $a, b, c, \ldots$ . We assume that all three sets are pair-wise disjoint and define the set $\mathcal{V}$ of plain values to be $\mathcal{C} \cup \mathcal{A}$ , and use the letters $u, v, \ldots$ to range over this set.

We represent provenance as a sequence of events. The events in a provenance sequence are assumed to be temporally ordered from left to right, where the left-most event (the head of the sequence) is the most recent event. We use $\mathcal{K}$ for the set of provenance sequences and $\mathcal{E}$ for the set of events and let $\kappa, \kappa', \ldots$ and $e, e', \ldots$ range over elements of each set respectively.

The set $\mathcal{D}$ of annotated values is then defined to be the set of terms of the form $v\colon\kappa$ where is a plain value and $\kappa$ its provenance. An output event $a!\kappa$ in the provenance of a value denotes that the value has been sent by principal on a channel whose provenance is $\kappa$ , while an input event $a?\kappa$ denotes that the value has been received by principal on a channel whose provenance is $\kappa$ . We also define the set $\mathcal{I}$ of identifiers to be $\mathcal{D} \cup \mathcal{X}$ and let $w, w', \ldots$ range over identifiers.

To allow principals to query the provenance of values we use patterns and pattern matching. Instead of defining a particular pattern matching language, we opt for a more general approach and make the calculus parametric on the choice of the pattern matching language. We do give a concrete language to use with the examples however.

Definition 1 A pattern matching language is a pair $(\Pi, \mathord{\models})$ where $\Pi$ is a set of patterns, ranged over by $\pi, \pi', \ldots$ , and $\mathord{\models}\subseteq \mathcal{K} \times \Pi$ is the pattern satisfaction (or matching) relation, a relation between provenance sequences and patterns.

Processes in our calculus are based on those of the asynchronous $\pi$ -calculus and are ranged over by $P, Q, \ldots$ . The main differences with respect to the asynchronous $\pi$ -calculus is our use of variants of the input and summation constructs. For input, we use a variant that takes two parameters in the form $\pi \mathsf{as} x$ , where $\pi$ is a pattern that is used to restrict the set of values to be received to those whose provenance matches the pattern $\pi$ and is the standard variable binder, a placeholder for the value to be received. We use the notation $w (\pi \mathsf{as} x).P$ for the process that is ready to receive, on channel , a value whose provenance matches the pattern $\pi$ and continue as with the appropriate substitution. For summation, we use a restricted version that only allows input guarded choice on the same channel, denoted by $\Sigma_{i \in I}w (\pi_{i} \mathsf{as} x_{i}).P_{i}$ for some finite index set . We use the notation 0 as syntactic sugar for the empty sum. The output form $w \langle w'\rangle$ denotes the process that is ready to send on channel , while $\textsf{if} w = w' \textsf{then} P \textsf{else} Q$ denotes the process that proceeds as if is equal to and as otherwise. Scope restriction of channel to process is denoted by $(\nu n)P$ . Parallel composition of processes and is denoted by $P \vert Q$ while $\mathopen{*} P$ denotes replication of process . It should be noted that all values in processes are annotated values, except channel names in restrictions. The reason for that is that we use restriction as usual to delimit scope, yet within one scope a channel name may have occurrences with possibly different provenance sequences.

A system is the composition of zero or more located processes and messages. We use $S, T, \ldots$ to range over systems. A located process, , is a process that is running under the authority of a principal . We overload the symbol 0 to denote the located process . As the structure of our systems is flat, it should be obvious from the context whether we mean by 0 the empty summation or the located process . A message is a value that has been sent but not yet received, and is denoted in the calculus by $n\langle\!\langle w'\rangle\!\rangle$ . Restriction is denoted by $(\nu n)S$ while parallel composition of two systems is denoted by $S \parallel T$ .

$\includegraphics{Figures/syntax.eps}$

2.2 Provenance Tracking Semantics

The semantics of the calculus is defined by two relations, the structural congruence relation, $\equiv$ , and the provenance tracking reduction relation, $\mathrel{\rightarrow}$ . Structural congruence allows us to make structural manipulations of systems which makes the definition of reduction simpler. The structural congruence relation is standard and is omitted due to space limitations.

Interaction between located processes is described by the reduction relation $\mathrel{\rightarrow}$ . As we mentioned previously, the reduction relation also tracks and updates the provenance of values as the system evolves. The reduction relation is defined on closed systems (i.e., those that contain no free variables) and is given in Table 2. The two main reduction rules are RED SND and RED RCV, which describe sending and receiving values respectively. The reason we split communication into the two steps of sending and receiving is to make the semantics simpler by only adding a single event to the provenance sequence at a time. In the rule RED SND, a located process $a[m\mathord{:}\kappa_{m} \langle v\colon\kappa_{v}\rangle ]$ may output the value $v\colon\kappa_{v}$ on the channel $m\colon\kappa_{m}$ which results in the message $m\langle\!\langle v\colon a!\kappa_{m} ;\kappa_{v}\rangle\!\rangle$ . What should be noted here is that the provenance of the value changes from $\kappa_{v}$ to $a!\kappa_{m} ;\kappa_{v}$ after the output action to reflect the fact that the value has been most recently sent by principal on a channel whose provenance is $\kappa_{m}$ . In the rule RED RCV, a message $m\langle\!\langle v\colon\kappa_{v}\rangle\!\rangle$ may be received by the located process $a[\Sigma_{i \in I}m\colon\kappa_{m} ( \pi_{i} \mathsf{as} x_{i}).P_{i}]$ if the provenance of the value satisfies one of the patterns $\pi_{i}$ . The process $P_{j}$ whose pattern $\pi_{j}$ is satisfied is chosen for the continuation. If more than one such pattern exists, one of them is chosen non-deterministically. Note that here too, the provenance of the value is updated from $\kappa_v$ to $a?\kappa_m ;\kappa_v$ to reflect the fact that it has most recently been received by principal on a channel with provenance $\kappa_{m}$ . The value with the updated provenance is then substituted for the variable in the continuation chosen. The capture avoiding substitution of $w_1, \ldots, w_n$ for the free occurrences of variables $x_1, \ldots, x_n$ in is denoted by $P\{^{w_1, \ldots, w_n}/_{x_1, \ldots, x_n}\}$ and its definition is standard. We let $\sigma, \sigma', \ldots$ range over substitutions. The rules $\textsc{Red IF$_{t}$}$ and $\textsc{Red IF$_{f}$}$ give the semantics of matching. It should be noted here that only the plain values are tested for equality while their provenance is ignored.² This means that if the two plain values are equal, irrespective of their provenances, the process in the then branch is chosen for the continuation as indicated by the rule $\textsc{Red IF$_{t}$}$ . If the two plain values are not equal, then the process in the else branch is chosen for the continuation as indicated by the rule $\textsc{Red IF$_{f}$}$ . The other three rules are standard and state that reduction is preserved under restriction, system composition as well as under structural congruence.

$\includegraphics{Figures/reduction.eps}$

2.3 Examples

In this section, we will give a few examples to illustrate the use of the provenance calculus. First, however, we have to give a concrete pattern matching language.

2.3.1 A Sample Pattern Matching Language

As our provenance sequences have a structure that is similar to that of XML documents, we choose to base our sample pattern language on regular expression pattern matching [16]. The formal syntax and semantics of the language is given in Table 3.

$\includegraphics{Figures/patterns.eps}$

The pattern $\mathsf{Any}$ matches any provenance sequence while the pattern $\epsilon$ is used to match the empty provenance sequence (denoted by $\epsilon$ as well). The two patterns $G!\pi$ and $G?\pi$ match send and receive events respectively. The use of group expressions in these patterns allows us to perform more general tests against the principal that performed the event. The group expression denotes the singleton set containing principal only, while $\mathord{\sim}$ denotes the set of all principals. and denote union and difference of groups respectively. The denotation of group expressions is given by the function $\llbracket - \rrbracket$ . The pattern $\pi ;\pi'$ matches a provenance sequence that is composed of two parts that match $\pi$ and $\pi'$ respectively. The alternation of patterns $\pi$ and $\pi'$ , denoted by $\pi \lor \pi'$ , matches a sequence that matches either patterns, while the repetition of pattern $\pi$ , denoted by $\pi^{*}$ , matches any provenance sequence that is the composition of zero or more sub-sequences, each of which matches the pattern $\pi$ .

2.3.2 Example Systems

2.3.2.1 Authentication.

Provenance can be used to establish the authenticity of messages, for example by checking their immediate sender, their original sender or any principal in between. The following system illustrates this.

$\displaystyle a[m (c!\mathsf{Any};\mathsf{Any} \mathsf{as} x).P] \parallel b[m (\mathsf{Any}; d!\mathsf{Any} \mathsf{as} y).Q] \parallel S$

In the above example, principal

wants to receive only data coming from

directly, no matter where it has been before, whereas principal

wants to receive data that originated at

, no matter what the intermediary links were.

2.3.2.2 Auditing.

Provenance can also be used as an auditing and troubleshooting tool to establish who might have been responsible for an error. For example, in the following system:

$\displaystyle S \triangleq a[m \langle v\rangle ] \parallel s[m (x).n' \langle x\rangle ] \parallel c[n' (x).P] \parallel b[n'' (x).Q]$

principal

is trying to send a value

to principal

, and this has to be done through an intermediary

(because

does not have a direct link to

for example). Because of faulty code at

, the value gets forwarded to

instead, as indicated by the following reduction:

$\displaystyle S \mathrel{\rightarrow^{*}} c[P\{^{v\colon c?\epsilon ;s!\epsilon ; s?\epsilon ;a!\epsilon }/_{x}\}] \parallel b[n'' (x).Q]$

Now when

detects the error, perhaps due to the unexpected value,

can use the provenance ${c?\epsilon ;s!\epsilon ; s?\epsilon ;a!\epsilon }$ to tell what principals were involved in making this error. In this case,

, and

itself. The three principals may be further investigated to determine who and what exactly caused the error.

2.3.2.3 Photography Competition.

The following example describes a photography competition. Contestants submit their entries for the competition to the organiser of the competition, who forwards those entries to the appropriate judges. Each judge rates the entries allocated to them and returns the results to the organiser. The organiser then publishes the results and announces the winners. We consider a version of the competition with three contestants: $c_{1}$ , $c_{2}$ and $c_{3}$ , one organiser:

, and two judges: $j_{1}$ and $j_{2}$ . The contestants submit their entries to the organiser on channel sub and receive the published results on channel pub. The organiser forwards entries submitted by $c_{1}$ and $c_{3}$ to judge $j_{1}$ and the entry by $c_{2}$ to $j_{2}$ . The judges return the entries together with their ratings to the organiser. Note that below we are using polyadic versions of the send and receive constructs, such an extension to the calculus being straightforward.

$\displaystyle \textbf{C}(c, \mathit{entry}, P)$	$\displaystyle \triangleq c[\mathit{sub} \langle\mathit{entry}\rangle \vert \... ...hsf{Any};c!\mathsf{Any} \mathsf{as} x \;,\; \mathsf{Any} \mathsf{as} y).P]$
$\displaystyle \textbf{O}$	$\displaystyle \triangleq o[\mathopen{} (\Sigma_{i \in \{1,2\}}\mathit{sub} (... ...x\rangle \vert \mathit{res} (y, z). \mathopen{} pub \langle y, z\rangle )]$
$\displaystyle \textbf{J}(j, inc)$	$\displaystyle \triangleq j[\mathit{inc} (x).res \langle x, \mathit{rate}(x)\rangle ]$
$\displaystyle \textbf{Comp}$	$\displaystyle \triangleq \textbf{C}(c_{1}, \mathit{e}_{1}, P_{1}) \parallel \te... ...\textbf{J}(j_{1}, \mathit{in}_{1}) \parallel \textbf{J}(j_{2}, \mathit{in}_{2})$

where $\pi_{1} \triangleq (c_{1} + c_{3})!\textsf{Any} ; \textsf{Any}$ and $\pi_{2} \triangleq c_{2}!\textsf{Any} ; \textsf{Any}$ . The system above evolves as follows, where $\mathord{\Pi_{i \in \{ 1, \ldots, n \}}}S_i$ is used to denote $S_1 \parallel \ldots \parallel S_n$ .

$\displaystyle \textbf{Comp} \mathrel{\rightarrow^{*}}\;$

$\displaystyle \mathord{\Pi_{i \in \{ 1,2,3 \}}} o[\mathopen{*} \mathit{pub} \... ...\colon\kappa'_{ei}, \mathit{rate}(\mathit{e}_{i})\colon\kappa'_{ri}}/_{x, y}\}]$

3 Properties of Provenance

In this section, we look at the properties of the provenance notion and the provenance tracking reduction relation we proposed.

3.1 Logs

We start by introducing logs. Logs are meant as a representation of the past behaviour of systems. They may be thought of as edge-labelled trees whose directed edges (from parent to child) are labelled with actions that occurred in a system at some point in the past. An edge leading out of a parent node represents an action that occurred more recently in time than those leading out of its children. Sibling subtrees are assumed to be temporally independent, in the sense that the order between their actions is unknown. The formal syntax of logs is as follows.

$\displaystyle \phi$	$\displaystyle ::= \varnothing \; \mid \; \alpha;\phi \; \mid \; \phi \vert \psi$
$\displaystyle \alpha$	$\displaystyle ::= a.\mathsf{snd}(V,V) \mid a.\mathsf{rcv}(V,V) \mid a.\mathsf{ift}(V, V) \mid a.\mathsf{iff}(V, V)$

We use $\Phi$ for the set of logs, $\phi, \psi, \ldots$ to range over logs and $\alpha, \beta, \ldots$ to range over actions. We also define the set $\mathcal{D}_x$ to be $\mathcal{V} \cup \mathcal{X} \cup \{ \star \}$ and use the metavariables $U, V, \ldots$ to range over its elements. Variables in logs are used to denote unknown values, so for example, action $a.\mathsf{snd}(x,v)$ says that principal

sent value

on some channel

, the identity of which is unknown. The special symbol $\star$ denotes an unknown private channel name, and its use will be demonstrated later. We use $\varnothing$ for the empty log, $\alpha;\phi$ for the log with edge labelled $\alpha$ leading to subtree $\phi$ , and $\phi \vert \psi$ for the composition of logs $\phi$ and $\psi$ . Note that the composition of logs $\phi \vert \psi$ joins their roots and hence results in a well-formed tree. Operator `

' has higher precedence than ` $\vert$ ' and parentheses may be used in the usual way. We also omit trailing $\varnothing$ to ease readability. We have four types of actions;
$\begin{inparaenum}[\itshape\upshape (1\upshape )] \item the \emph{output} action... ...$ tested $V$ and $V'$ for equality and this returned false. \end{inparaenum}$
Note that, in the log $a.\mathsf{snd}(x,V);\phi$ , the occurrence of variable

in action $a.\mathsf{snd}(x,V)$ binds occurrences of

in $\phi$ . The same holds for

in $a.\mathsf{rcv}(x,V);\phi$ . All other occurrences of variables in logs are considered free. We use $\mathit{fv}(\phi)$ for the set of free variables in $\phi$ , $\mathit{bv}(\phi)$ for the set of bound variables in $\phi$ and define closed logs to be those containing no free variables. Furthermore, we consider indistinguishable those logs which only differ by alpha conversion (change of bound variables) or by application of the commutative monoid laws for the log composition operator $\vert$ (with identity element $\varnothing$ ).

As we have already mentioned, logs are meant as records of the past behaviour of systems. It is natural then to ask: what is the relation between two logs $\phi$ and $\psi$ that both record the past of the same system? To answer this question, we start by defining $\alpha \mathrel{\preceq} \alpha'$ to mean that there exists a (possibly empty) substitution $\sigma$ of values for variables such that $\alpha' = \alpha\sigma$ . Note that since we are requiring $\sigma$ to strictly substitute variables with values, then $\alpha'$ must be just like $\alpha$ except that it has less variables. Now, since we are using variables to stand for unknown values, this simply means that $\alpha'$ contains as much or more information about the past as that contained in $\alpha$ . With this, we proceed to overload the symbol $\mathrel{\preceq}$ and define the relation $\mathrel{\preceq}$ on the set of closed logs. Intuitively, $\phi \mathrel{\preceq}\psi$ can be taken to mean that log $\psi$ tells us at least as much about the past as log $\phi$ does. Relation $\mathrel{\preceq}$ is defined to be the smallest relation closed under the following inference rules, where $\sigma$ and $\sigma'$ denote closing substitutions.

$\includegraphics{Figures/order.eps}$

The empty log $\varnothing$ tells us nothing about the past, and hence we have that $\varnothing \mathrel{\preceq}\phi$ for every log $\phi$ as expressed by Rule $\textsc{LNil}$ . Rule $\textsc{LPre$_1$}$ can be understood as follows: if we have two logs $\alpha;\phi$ and $\alpha';\psi$ , then in order to prove that $\alpha;\phi \mathrel{\preceq} \alpha';\psi$ holds, we need to prove that both $\alpha \mathrel{\preceq} \alpha'$ and $\phi\sigma \mathrel{\preceq}\psi\sigma'$ hold. Note that here we are using $\phi\sigma$ and $\psi\sigma'$ to denote the application of the (possibly empty) closing substitutions $\sigma$ and $\sigma'$ to logs $\phi$ and $\psi$ respectively and that this is needed as $\mathrel{\preceq}$ only applies to closed logs. Rule $\textsc{LPre$_2$}$ simply says that if $\phi \mathrel{\preceq}\psi$ , then adding an action at the beginning of $\psi$ will only add more information to it, and hence, $\phi \mathrel{\preceq}\alpha;\psi$ holds too. Rule $\textsc{LCmp$_1$}$ says that $\phi \vert \phi' \mathrel{\preceq}\psi$ holds if it is the case that both $\phi \mathrel{\preceq}\psi$ and $\phi' \mathrel{\preceq}\psi$ hold. The definition of $\textsc{LCmp$_1$}$ implies that we are taking a nonlinear interpretation of logs, that is we allow $\phi$ and $\phi'$ in $\phi \vert \phi'$ to reference the same actions, in effect duplicating the information conveyed about the past. This interpretation is required as the calculus allows the copying of values and their provenance. Rule $\textsc{LCmp$_2$}$ says that in order to show that $\phi \mathrel{\preceq}\psi \vert \psi'$ , we need to show that $\phi \mathrel{\preceq}\psi$ holds.

To illustrate the application of the above rules, let us consider the two logs $\phi \triangleq a.\mathsf{snd}(x,v);a.\mathsf{rcv}(n,x)$ and $\psi \triangleq a.\mathsf{snd}(m,v);a.\mathsf{rcv}(n,m)$ . Log $\phi$ tells us that sent on some unknown channel which it received on channel while $\psi$ tells us that sent on channel which it received on channel . Clearly $\psi$ tells us more information about the past than $\phi$ and the unknown channel in $\phi$ is actually channel (inferred from $\psi$ ). To show this using the inference rules above, we first apply Rule $\textsc{LPre$_1$}$ as follows:

$\displaystyle {\small \inference {a.\mathsf{snd}(x,v) \mathrel{\preceq}a.\maths... ...\mathsf{rcv}(n,x) \mathrel{\preceq} a.\mathsf{snd}(m,v);a.\mathsf{rcv}(n,m)} }$

Then, we show that $a.\mathsf{snd}(x,v) \mathrel{\preceq}a.\mathsf{snd}(m,v)$ and that $a.\mathsf{rcv}(n,m) \mathrel{\preceq}a.\mathsf{rcv}(n,m)$ . The former holds since $a.\mathsf{snd}(x,v)\{^{m}/_{x}\} = a.\mathsf{snd}(m,v)$ whereas the latter may be shown to hold by further application of $\textsc{LPre$_1$}$ and $\textsc{LNil}$ .

Proposition 1 The relation $\mathrel{\preceq}$ is a partial order.

3.2 Denotation of Provenance

We interpret the provenance $\kappa$ in an annotated value $v\colon\kappa$ to be a set of assertions about the past of the value

. These assertions tell us about events that took place in the system and that are relevant to the value

. For example, consider the value $v\colon a?\kappa ;b!\kappa';{\kappa''}$ , its provenance tells us that
$\begin{inparaenum}[\itshape\upshape (a\upshape )] \item $v$ was most recently r... ...''$ tells us about the past of $v$ before it was sent by $b$ \end{inparaenum}$
. It is important here to note that the provenance of

does not reveal the identities of the channels used for communication, nor does it tell us about the ordering between events in $\kappa$ and those in $b!\kappa',\kappa''$ or between events in $\kappa'$ and those in $\kappa''$ .

Assertions such as those above may be encoded in a ``condensed form'' as logs. We define the function $\llbracket - \rrbracket$ which maps $V\colon\kappa$ to a log $\phi$ that represents what the provenance sequence $\kappa$ tells us about the past of . The definition of $\llbracket - \rrbracket$ in Definition 2 below uses a different grammar for provenance sequences than that given in Table 1 in order to make the definition of $\llbracket - \rrbracket$ simpler. It is easy to check, however, that for our purposes the two grammars are equivalent.

Definition 2 (Denotation of provenance) The function $\llbracket - \rrbracket : \mathcal{D}_x \rightarrow \Phi$ is defined inductively on the structure of provenance sequences as follow:

$\displaystyle \llbracket V\colon\epsilon \rrbracket$	$\displaystyle = \varnothing$
$\displaystyle \llbracket V\colon a!\kappa' ;\kappa \rrbracket$	$\displaystyle = a.\mathsf{snd}(x,V);(\llbracket V\colon\kappa \rrbracket \vert \llbracket x\colon\kappa' \rrbracket )$
$\displaystyle \llbracket V\colon a?\kappa' ;\kappa \rrbracket$	$\displaystyle = a.\mathsf{rcv}(x,V);(\llbracket V\colon\kappa \rrbracket \vert \llbracket x\colon\kappa' \rrbracket )$

The empty provenance sequence $\epsilon$ in an annotated value $V\colon\epsilon$ tells us that originated here, and hence $\llbracket V\colon\epsilon \rrbracket$ is taken to be the empty log $\varnothing$ . The annotated values ${V\colon a!\kappa' ;\kappa}$ and ${V\colon a?\kappa' ;\kappa}$ denote that was sent (respectively received) on some unknown channel , and that before that the logs denoted by $\llbracket V\colon\kappa \rrbracket$ and $\llbracket x\colon\kappa' \rrbracket$ occurred in some unknown order. It is clear that a log $\llbracket V\colon\kappa \rrbracket$ is a partial record of what took place in a system. Indeed, it does not contain information about the identities of the channels used for communication, and it lacks information about the order in which some actions took place. This is to be expected of course as the provenance of a value is meant to only record information about actions that are relevant to the value itself.

3.3 Monitored Systems

In order to be able to assess the correctness and completeness of provenance, we introduce the notion of monitored systems. A monitored system is one where every action that takes place is recorded in a global log. The global log provides a repository where every action is logged and whose content is not accessible by principals. It is only meant as a proof tool against which the properties of provenance can be formulated and judged.

We use $M, N, \ldots$ to range over monitored systems and give their formal syntax below.

$\displaystyle M, N ::= \phi \mathrel{\triangleright}S \mid (\nu n) M \mid M \parallel S$

The notation $\phi \mathrel{\triangleright}S$ denotes the monitored system composed of global log $\phi$ and system

. Restriction $(\nu n) M$ and parallel composition $M \parallel S$ are needed to allow the global log to behave like other parts of the system with respect to scope extrusion and intrusion. More specifically, the form $(\nu n) M$ allows channel scopes to be extruded to include the log while the form $M \parallel S$ is needed to allow channels whose scope includes the log but not some part of the system

. Note that indeed the above syntax allows exactly one global log per monitored system. We also define versions of structural congruence and reduction for monitored systems and denote them by $\equiv_m$ and $\mathrel{\rightarrow_m}$ respectively. These are given in Table 4. It should clear from the definition of $\mathrel{\rightarrow_m}$ that the original provenance tracking semantics of systems is preserved, and that all $\mathrel{\rightarrow_m}$ adds is the recording of actions in the global log. We formalise this in Proposition 2. This latter makes use of the log erasure function, $\mathopen{\vert} - \mathclose{\vert}$ , which given a monitored system, removes the log and returns just the system part of it. The function $\mathopen{\vert} - \mathclose{\vert}$ is defined inductively on the structure of monitored systems as follows.

$\displaystyle %\begin{align*} \mathopen{\vert} \phi \mathrel{\triangleright}S \... ...arallel S \mathclose{\vert} = \mathopen{\vert} M \mathclose{\vert} \parallel S$

Proposition 2 $M \mathrel{\rightarrow_m}M'$ implies that $\mathopen{\vert} M \mathclose{\vert} \mathrel{\rightarrow}\mathopen{\vert} M' \mathclose{\vert}$ . Vice versa, $\mathopen{\vert} M \mathclose{\vert} \mathrel{\rightarrow}S$ implies that $M \mathrel{\rightarrow_m}M'$ for some such that $\mathopen{\vert} M' \mathclose{\vert} = S$ .

$\includegraphics{Figures/monitored.eps}$

3.4 Correctness

The first provenance property we look at is correctness. A provenance sequence $\kappa$ in an annotated value $v\colon\kappa$ is considered correct if what it tells us about the past of

agrees with what actually took place. This is defined relative to the global log which is assumed to be a correct and complete record of the past of a system. If every value in a system has correct provenance, then we say that the system as a whole has correct provenance. Theorem 1 states that provenance correctness is preserved by the reduction relation. That is, starting from a system with correct provenance, we are guaranteed to get a system with correct provenance after reduction. Definition 3 makes use of two auxiliary functions: $\mathit{log}(-)$ , which returns the global log of a monitored system, and $\mathit{values}(-)$ , which returns the set of ``annotated values'' of a monitored system. The definition of $\mathit{log}(-)$ , by induction on the structure of monitored systems, is straightforward and hence omitted. This is mostly the case for $\mathit{values}(-)$ too and hence, we only discuss the most interesting cases below. The set of values in a monitored system is defined to be that in its system part (i.e., we ignore the global log and top level restrictions). This is expressed by the following three rules:

$\displaystyle \mathit{values}(\phi \mathrel{\triangleright}S) \triangleq \mathit{values}(S) \quad \mathit{values}((\nu n)M) \triangleq \mathit{values}(M)$
$\displaystyle \quad \mathit{values}(M \parallel S) \triangleq \mathit{values}(M) \cup \mathit{values}(S)$

For systems, we proceed simply by gathering annotated values, that is ``subterms'' of the form $v\colon\kappa$ , and substituting $\star$ for any restricted channel names. So for example, we have that $\mathit{values}(a[P]) \triangleq \mathit{values}(P)$ , $\mathit{values}(m\langle\!\langle v\colon\kappa \rangle\!\rangle ) \triangleq \{ v\colon\kappa \}$ , $\mathit{values}(S \parallel S') \triangleq \mathit{values}(S) \cup \mathit{values}(S')$ , and $\mathit{values}((\nu n) S) \triangleq \mathit{values}(S)\{^{\star}/_{n}\}$ . Note that restriction here is treated differently from that at the top level of monitored systems. The rationale behind this discrepancy is that restricted names at the top level are known to the global log whereas those occurring here are not. The substitution is done to avoid any clashes with names appearing in the log and is inspired by [17]. Definition of $\mathit{values}(P)$ for processes

is similar.

Definition 3 A monitored system

has correct provenance if for all $V\colon\kappa$ in $\mathit{values}(M)$ , we have that $\llbracket V\colon\kappa \rrbracket \mathrel{\preceq}\mathit{log}(M)$ .

Lemma 1 has correct provenance and $M \equiv_m M'$ implies that has correct provenance.

Theorem 1 (Provenance correctness) . has correct provenance and $M \mathrel{\rightarrow_m}M'$ implies that has correct provenance.

Proof. We assume that

has correct provenance and that $M \mathrel{\rightarrow_m}M'$ . We show that

has correct provenance by induction on the derivation of $M \mathrel{\rightarrow_m}M'$ . The details of the proof are straightforward and therefore omitted.

$\qedsymbol$

3.5 Towards a Notion of Completeness

We have already shown informally that the provenance of a value is not a complete record of what took place in the whole system. We formalise this below.

Definition 4 A monitored system

has complete provenance if for all $V\colon\kappa$ in $\mathit{values}(M)$ , we have that $\mathit{log}(M) \mathrel{\preceq}\llbracket V\colon\kappa \rrbracket$ .

Proposition 3 (Provenance incompleteness) has complete provenance and $M \mathrel{\rightarrow_m}M'$ does not imply that has complete provenance.

Proof. Let us consider the monitored system

defined as follows:

$\displaystyle M \triangleq \varnothing \mathrel{\triangleright} a[m\colon\epsilon \langle v\colon\epsilon \rangle ] \parallel b[m\colon\epsilon (x).P]$

Assuming

has complete provenance, then it is clear that

has too. Now, we have that $M \mathrel{\rightarrow_m}M'$ where

is defined as follows:

$\displaystyle M' \triangleq a.\mathsf{snd}(m,v);\varnothing \mathrel{\triangler... ...angle v\colon a!\kappa_m \rangle\!\rangle \parallel b[m\colon\epsilon (x).P]$

does not have complete provenance, since $m\colon\epsilon$ does not. $\qedsymbol$

Completeness is too strong since we are requiring every value to have complete information about the past of the whole system. In fact, in general it is not even possible for a single value, as it can be easily demonstrated by considering a system where a value gets forgotten at some point in the evolution of the system, such as the following:

$\displaystyle \phi \mathrel{\triangleright}a[m\colon\kappa_m (x).0] \parallel m\langle\!\langle v\colon\kappa_v \rangle\!\rangle \parallel S$

where, assuming that

does not contain copies of $v\colon\kappa_v$ , the system will evolve to a state where $v\colon\kappa_v$ is completely forgotten.

4 Related Work

Provenance has been studied in a variety of settings, most prominently in databases and scientific computing. A major theme of provenance research in the database community has been that of identifying parts of the input of a query relevant to parts of its output. The definition of the provenance notion (i.e., what ``relevant'' means) as well as the granularity at which data is considered (i.e., what ``parts'' of the input or output constitute) varies depending on the intended application.

One of the earliest works that studied provenance for database management systems was by Wang and Madnick in their Polygen model [24], which aimed to resolve provenance issues for data composed from multiple sources. Woodruff and Stonebraker [25] studied a notion of provenance called lineage. Their aim was to provide fine-grained lineage information, which they argued was prohibitively expensive to achieve using metadata. Their approach instead relied on a small amount of information about the database operations performed and the analysis of base data. Buneman et al. [5] studied provenance in the setting of a data model that generalises both relational as well hierarchical databases. They proposed two different notions of provenance, which they termed ``why'' and ``where'' provenance, and gave algorithms to compute the two types of provenance information.

Motivated by the need for formal foundations of provenance, Cheney et al. [7] proposed a semantic characterisation of provenance based on dependency analysis. They showed that minimal dependency provenance is not computable and provided dynamic and static techniques to approximate it. Green et al. [11] introduced a model based on semiring valued annotations and showed that this model generalises several models of annotated relations including why provenance.

There is a big incentive to support provenance in scientific computing settings. This stems from the need to allow scientists to understand, analyse and replicate results of experiments. Some of the projects aimed at developing middleware to support provenance in this setting include Chimera [9] (physics and astronomy), $^{\textrm{my}}$ Grid [23] (biology), CMCS [21] (chemical sciences), and ESSW [10] (earth sciences). Bose and Frew [2] and Simmhan et al. [22] give surveys of the most important research efforts in this area.

5 Conclusion and Future Work

We presented a calculus for provenance in distributed systems. The semantics of the calculus tracks the provenance of values dynamically and allows principals to perform tests against it. These tests let principals decide what data to consume and what to do with it based on its provenance. We also discussed the semantics of provenance and studied some of its properties.

We proposed a characterisation of provenance correctness and completeness based on the notions of monitored systems and global logs. We showed that based on this characterisation, our provenance is correct but incomplete. However, as we have mentioned, completeness is too strong since we are requiring the provenance of every value to be a complete record of everything that has taken place within the system. Instead of asking for the provenance to be complete, what is needed is for provenance to be adequate. What we mean by this is that the provenance information recorded has to be enough for its intended application. In developing such an idea, we are interested in using information about the role each principal played in getting a piece of data to its current form (and the implied trust relations between principals), as a measure of how trustworthy a piece of data is likely to be. Hence, we only need to know about actions that are relevant to the value at hand, and we only need to know about the principals involved in the action. We are working on formalising adequacy at the moment. We also plan to investigate other ways of characterising provenance properties, for example, based on logic.

Provenance tracking is performed dynamically at the moment as part of the reduction relation. Although this yields accurate provenance information, it results in runtime overhead as provenance is computed, updated and tests are performed against it. We are working on a static analysis that would alleviate the need for dynamic provenance tracking. The idea behind it is to analyse the flow of data between principals and make sure that principals would only receive data with provenance that matches their expectations.

To keep the calculus simple, we only allowed statically defined patterns with no binding variables. This means that principals cannot use dynamic information for their provenance tests nor can they extract part (or all) of the provenance sequence and use it as data. This is one of the first extensions we aim to make to the calculus. In addition to this, provenance is currently tracked by the runtime environment, and principals are only given ``read-only'' access to it. Moreover, every principal is able to see the entire provenance sequence, regardless of the privacy and security policies of other principals. However, in many applications, principals may wish to control the disclosure of provenance information about them. We are working on extending the calculus with such features at the moment. We are also investigating how to extend the calculus to allow the communication of data structures. With this, we aim to provide support for scenarios where data is not only copied and communicated, but also broken down into pieces, combined with other data, and generally transformed to produce new data. In these scenarios, the ways in which we may manipulate data usually depend on its provenance and vice versa. This means that a transformation of values and needs to take into account their provenances $\kappa_1$ and $\kappa_2$ as well. That is, the result of the transformation should be of the form $f(v_1, \kappa_1, v_2, \kappa_2)$ as opposed to . The same holds for transformations $\bar{f}$ that manipulate the provenance. To be able to express this, we aim to extend the calculus, along similar lines as the applied $\pi$ -calculus [1], with data structures and operations on them.

Bibliography

1: ABADI, M., AND FOURNET, C.
Mobile values, new names, and secure communication.
ACM SIGPLAN Notices 36, 3 (Mar. 2001), 104-115.
2: BOSE, R., AND FREW, J.
Lineage retrieval for scientific data processing: a survey.
ACM Computing Surveys 37, 1 (Mar. 2005), 1-28.
3: BOUDOL, G.
Asynchrony and the $\pi$ -calculus.
Tech. Rep. 1702, INRIA, Sophia-Antipolis, 1992.
4: BUNEMAN, P., CHENEY, J., AND VANSUMMEREN, S.
On the expressiveness of implicit provenance in query and update languages.
In ICDT (2007), vol. 4353 of LNCS, Springer, pp. 209-223.
5: BUNEMAN, P., KHANNA, S., AND TAN, W. C.
Why and where: A characterization of data provenance.
In ICDT (2001), vol. 1973 of LNCS, Springer, pp. 316-330.
6: CASTAGNA, G., NICOLA, R. D., AND VARACCA, D.
Semantic subtyping for the pi-calculus.
Theoretical Computer Science 398, 1-3 (2008), 217-242.
7: CHENEY, J., AHMED, A., AND ACAR, U. A.
Provenance as dependency analysis.
In Database Programming Languages (2007), vol. 4797 of LNCS, Springer, pp. 138-152.
8: CUI, Y., AND WIDOM, J.
Practical lineage tracing in data warehouses.
In ICDE (Mar. 2000), IEEE Computer Society, pp. 367-378.
9: FOSTER, I. T., VÖCKLER, J.-S., WILDE, M., AND ZHAO, Y.
Chimera: A virtual data system for representing, querying, and automating data derivation.
In Statistical and Scientific Database Management (2002), IEEE Computer Society, pp. 37-46.
10: FREW, J., AND BOSE, R.
Earth system science workbench: A data management infrastructure for earth science products.
In Statistical and Scientific Database Management (2001), IEEE Computer Society, pp. 180-189.
11: GREEN, T. J., KARVOUNARAKIS, G., AND TANNEN, V.
Provenance semirings.
In PODS (2007), ACM, pp. 31-40.
12: GREENWOOD, M., GOBLE, C., STEVENS, R., ZHAO, J., ADDIS, M., MARVIN, D., MOREAU, L., AND OINN, T.
Provenance of e-science experiments - experience from bioinformatics.
Proceedings of the UK OST e-Science Second All Hands Meeting, Jan. 2003.
13: GROTH, P., MUNROE, S., MILES, S., AND MOREAU, L.
In Lucio Grandinetti (ed.), HPC and Grids in Action.
IOS Press, Jan. 2008, ch. Applying the Provenance Data Model to a Bioinformatics Case.
14: HENNESSY, M., AND RIELY, J.
Resource access control in systems of mobile agents.
Information and Computation 173, 1 (Feb 2002), 82-120.
15: HONDA, K., AND TOKORO, M.
An object calculus for asynchronous communication.
In ECOOP (1991), Springer-Verlag, pp. 133-147.
16: HOSOYA, H., AND PIERCE, B. C.
Regular expression pattern matching.
In POPL (2001).
Full version in Journal of Functional Programming, 13(6), Nov. 2003, pp. 961-1004.
17: LHOUSSAINE, C., AND SASSONE, V.
A dependently typed ambient calculus.
In ESOP (2004), vol. 2986 of LNCS, Springer, pp. 171-187.
18: MILNER, R.
Communicating and Mobile Systems: The $\pi$ -calculus.
Cambridge University Press, 1999.
19: MOREAU, L., GROTH, P., MILES, S., VAZQUEZ, J., IBBOTSON, J., JIANG, S., MUNROE, S., RANA, O., SCHREIBER, A., TAN, V., AND VARGA, L.
The Provenance of Electronic Data.
Communications of the ACM (Apr. 2008).
20: MUNISWAMY-REDDY, K.-K., HOLLAND, D. A., BRAUN, U., AND SELTZER, M. I.
Provenance-aware storage systems.
In USENIX Annual Technical Conference, General Track (2006), USENIX, pp. 43-56.
21: PANCERELLA, C., MYERS, J., AND RAHN, L.
Data provenance in the collaboratory for multiscale chemical science (CMCS).
Workshop on Data Derivation and Provenance (Jan 2002).
22: SIMMHAN, Y. L., PLALE, B., AND GANNON, D.
A survey of data provenance in e-science.
SIGMOD Record 34, 3 (2005), 31-36.
23: STEVENS, R. D., ROBINSON, A. J., AND GOBLE, C. A.
myGrid: personalised bioinformatics on the information grid.
In ISMB (Supplement of Bioinformatics) (2003), pp. 302-304.
24: WANG, Y. R., AND MADNICK, S. E.
A polygen model for heterogeneous database systems: The source tagging perspective.
In VLDB (1990), Morgan Kaufmann Publishers Inc., pp. 519-538.
25: WOODRUFF, A., AND STONEBRAKER, M.
Supporting fine-grained data lineage in a database visualization environment.
In Proceedings of the Thirteenth International Conference on Data Engineering (1997), IEEE Computer Society, pp. 91-102.

About this document ...

A Formal Model of Provenance in Distributed Systems

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The command line arguments were:
latex2html -split 0 -show_section_numbers -local_icons -no_navigation sfs_html

The translation was initiated by Issam on 2009-02-12

Footnotes

... computation.¹: In a typical implementation of our language, we would assign the provenance tracking tier to a trusted underlying middleware.
... ignored.²: Depending on the intended application, the provenance of the two values tested could be useful and hence should be tracked in the continuation. This is not considered here however as it is not important for the aims of this paper.

Issam 2009-02-12