"The declassified FBI document is a biography of Osama
match subsets of the keywords extracted in step 1, within
Bin Laden". Since the identity of the person to whom the
a reference corpus (such as the public Web) that encap-
document pertains has been redacted, it is impossible to
sulates as much of relevant public knowledge as possi-
learn the statement S from C alone, and so S ∈ K (C).
ble. Our tools then parse the documents returned by the
The statement S is clearly not in K (R) either since it is
search queries for keywords not present in the original
impossible to compute from R alone a statement about
private data. These additional keywords allow us to au-
a document that is in C but not in R. It follows that S
tomatically estimate the likelihood of certain inferences.
does not belong to K (C) ∪ K (R). But, as shown ear-
Potentially dangerous inferences are flagged for manual
lier, the statement S belongs to K (C ∪ R). Indeed, we
review.
learn from C that the document pertains to an individ-
ual characterized by the keywords "Saudi", "magnate",
3.2
Inference Detection Algorithm
"half-brothers", "Yemen", etc. We learn from R that
In this section, we give a generic description of our infer-
these keywords are closely associated with "Osama Bin
ence detection algorithm. This description emphasizes
Laden". If we combine these two sources of information,
we learn that the statement S is true with high probabil-
conceptual understanding. Specific instantiations of the
inference detection algorithms, tailored to two particular
ity.
applications, are given in section 5. These instantiations
It is critical to understand Diff(C, R) prior to pub-
do not realize the full complexity of this general algo-
lishing the collection C of private documents, to en-
rithm partly for efficiency reasons and partly because of
sure that the publication of C does not allow for un-
the attributes of the application. We start with a descrip-
wanted inferences. The owner of C may choose to with-
tion of the inputs, outputs and parameters of our generic
hold from publication parts or all of the documents in
algorithm.
the collection based on an assessment of the difference
Diff(C, R). Sometimes, the set of sensitive knowledge
INPUTS: A private collection of documents C =
K ∗ that should not be leaked is explicitly specified. In
{C1, . . . , Cn}, a collection of reference documents R
this case, the inference control problem consists more
and a list of sensitive keywords K ∗ that represent sen-
precisely of ensuring that the intersection Diff(C, R) ∩
sitive knowledge.
K ∗ is empty.
OUTPUT: A list L of inferences that can be drawn from
the union of C and R. Each inference is of the form:
3.1
Basic Approach
∗
(W1, . . . , Wk ) ⇒ K0 ,
In this work, we consider the case in which C can be any
arbitrary collection of documents. In particular, contrary
where W1, . . . , Wk are keywords extracted from docu-
ments in C, and K0 ⊆ K ∗ is a subset of sensitive key-
∗
to prior work on inference control in databases, we do
∗
not restrict ourselves to private documents formatted ac-
words. The inference (W1, . . . , Wk ) ⇒ K0 , indicates
cording to a well-defined structure. We assume that the
that the keywords (W1, . . . , Wk ), found in the collection
collection R of public documents consists of all publicly
C, together with the knowledge present in R allow for
∗
available documents, and that the public Web serves as a
inference of the sensitive keywords K0 . The algorithm
good proxy for this collection. Our generic approach to
returns an empty list if it fails to detect any sensitive in-
inference detection is based on the following two steps:
ference.
1. UNDERSTANDING
THE CONTENT OF THE DOCU-
PARAMETERS: The algorithm is parameterized by a
MENTS IN THE PRIVATE COLLECTION C . We employ
value α that controls the depth of the NLP analysis of
the documents in C, by two values β and γ that control
automated content analysis in order to efficiently extract
the search depth for documents in R that are related to
keywords that capture the content of the document in the
collection C. A wide array of NLP tools are possible for
C, and finally by a value δ that controls the depth of the
this process, ranging from simple text extraction to deep
NLP analysis of the documents retrieved by the search
linguistic analysis. For the proof-of-concept demonstra-
algorithm. The values α, β, γ and δ are all positive in-
tions described in section 5, we employ keyword selec-
tegers. They can be tuned to achieve different trade-offs
tion via a "term frequency - inverse document frequency"
between the running time of the algorithm and the com-
(TF.IDF) calculation, but we note that a deeper linguistic
pleteness and quality of inference detection.
analysis may produce better results.
UNDERSTANDING THE DOCUMENTS IN C . Our basic
2.
EFFICIENTLY
algorithm uses TF.IDF (term frequency - inverse docu-
DETERMINING THE INFERENCES
ment frequency, see [28] and section 5.1) to extract from
THAT CAN BE DRAWN FROM THE COMBINATION OF
C AND R. We issue search queries for documents that
each document Ci in the collection C the top α keywords