'Unknown' by Unknown - Page 4 of 16

"The declassified FBI document is a biography of Osama

match subsets of the keywords extracted in step 1, within

Bin Laden". Since the identity of the person to whom the

a reference corpus (such as the public Web) that encap-

document pertains has been redacted, it is impossible to

sulates as much of relevant public knowledge as possi-

learn the statement S from C alone, and so S ∈ K (C).

ble. Our tools then parse the documents returned by the

The statement S is clearly not in K (R) either since it is

search queries for keywords not present in the original

impossible to compute from R alone a statement about

private data. These additional keywords allow us to au-

a document that is in C but not in R. It follows that S

tomatically estimate the likelihood of certain inferences.

does not belong to K (C) ∪ K (R). But, as shown ear-

Potentially dangerous inferences are flagged for manual

lier, the statement S belongs to K (C ∪ R). Indeed, we

review.

learn from C that the document pertains to an individ-

ual characterized by the keywords "Saudi", "magnate",

3.2

Inference Detection Algorithm

"half-brothers", "Yemen", etc. We learn from R that

In this section, we give a generic description of our infer-

these keywords are closely associated with "Osama Bin

ence detection algorithm. This description emphasizes

Laden". If we combine these two sources of information,

we learn that the statement S is true with high probabil-

conceptual understanding. Specific instantiations of the

inference detection algorithms, tailored to two particular

ity.

applications, are given in section 5. These instantiations

It is critical to understand Diff(C, R) prior to pub-

do not realize the full complexity of this general algo-

lishing the collection C of private documents, to en-

rithm partly for efficiency reasons and partly because of

sure that the publication of C does not allow for un-

the attributes of the application. We start with a descrip-

wanted inferences. The owner of C may choose to with-

tion of the inputs, outputs and parameters of our generic

hold from publication parts or all of the documents in

algorithm.

the collection based on an assessment of the difference

Diff(C, R). Sometimes, the set of sensitive knowledge

INPUTS: A private collection of documents C =

K ∗ that should not be leaked is explicitly specified. In

{C1, . . . , Cn}, a collection of reference documents R

this case, the inference control problem consists more

and a list of sensitive keywords K ∗ that represent sen-

precisely of ensuring that the intersection Diff(C, R) ∩

sitive knowledge.

K ∗ is empty.

OUTPUT: A list L of inferences that can be drawn from

the union of C and R. Each inference is of the form:

3.1

Basic Approach

∗

(W1, . . . , Wk ) ⇒ K0 ,

In this work, we consider the case in which C can be any

arbitrary collection of documents. In particular, contrary

where W1, . . . , Wk are keywords extracted from docu-

ments in C, and K0 ⊆ K ∗ is a subset of sensitive key-

∗

to prior work on inference control in databases, we do

∗

not restrict ourselves to private documents formatted ac-

words. The inference (W1, . . . , Wk ) ⇒ K0 , indicates

cording to a well-defined structure. We assume that the

that the keywords (W1, . . . , Wk ), found in the collection

collection R of public documents consists of all publicly

C, together with the knowledge present in R allow for

∗

available documents, and that the public Web serves as a

inference of the sensitive keywords K0 . The algorithm

good proxy for this collection. Our generic approach to

returns an empty list if it fails to detect any sensitive in-

inference detection is based on the following two steps:

ference.

1. UNDERSTANDING

THE CONTENT OF THE DOCU-

PARAMETERS: The algorithm is parameterized by a

MENTS IN THE PRIVATE COLLECTION C . We employ

value α that controls the depth of the NLP analysis of

the documents in C, by two values β and γ that control

automated content analysis in order to efficiently extract

the search depth for documents in R that are related to

keywords that capture the content of the document in the

collection C. A wide array of NLP tools are possible for

C, and finally by a value δ that controls the depth of the

this process, ranging from simple text extraction to deep

NLP analysis of the documents retrieved by the search

linguistic analysis. For the proof-of-concept demonstra-

algorithm. The values α, β, γ and δ are all positive in-

tions described in section 5, we employ keyword selec-

tegers. They can be tuned to achieve different trade-offs

tion via a "term frequency - inverse document frequency"

between the running time of the algorithm and the com-

(TF.IDF) calculation, but we note that a deeper linguistic

pleteness and quality of inference detection.

analysis may produce better results.

UNDERSTANDING THE DOCUMENTS IN C . Our basic

EFFICIENTLY

algorithm uses TF.IDF (term frequency - inverse docu-

DETERMINING THE INFERENCES

ment frequency, see [28] and section 5.1) to extract from

THAT CAN BE DRAWN FROM THE COMBINATION OF

C AND R. We issue search queries for documents that

each document Ci in the collection C the top α keywords