sometimes, relations amongst the attributes of the database that are meant to model the outside knowledge a human may wield in order to infer sensitive information. To the best of our understanding, no systematic method has been demonstrated for integrating this outside knowledge into an inference detection system. Our work seeks to remedy this by demonstrating the use of the Web for this purpose. When coupled with simple keyword extraction, this general technique allows us to detect inferences in a variety of unstructured documents.

A particular type of inference allows the identification of an individual. Sweeney looks for such inferences using the Web in [35], where inferences are enabled by numerical values and other attributes characterizable by regular expressions, such as SSNs, account numbers, and addresses. Sweeney does not consider inferences based on English-language words. We use the indexing power of search engines to detect when words, taken together, are closely associated with an individual.

The closely related problem of author identification has also been extensively studied by the machine learning community (see, for example, [25, 11, 24, 34, 20]). The techniques developed generally rely on a training corpus of documents and use specific attributes like self-citations [20] or writing style [25] to identify authors. Our work can be viewed as exploiting a previously unstudied method of author identification: using information authors reveal about themselves to identify them.

Atallah et al. [2] describe how natural language processing can potentially be used to sanitize sensitive information when the sanitization rules are already known. Our work is focused on using the Web to identify the sanitization rules.

WEB-ASSISTED QUERY INTERPRETATION. There is a large body of work on using the Web to improve query results (see, for example, [16, 32, 10]). One of the fundamental ideas that has come out of this area is to use overlap in query results to establish a connection between distinct queries. In contrast, we analyze the content of the query results in order to detect connections between the query terms and an individual or topic.

WEB-BASED SOCIAL NETWORK ANALYSIS. Recently, the Web has been used to detect social networks (e.g., [1, 23]). A key idea in this work is using the Web to look for co-occurrences of names and using this to infer a link in a social network. Our techniques can support this type of analysis when, for example, names in a network, entered as a Web query, yield a name that is not already in the network. However, our techniques are aimed at a broader goal, that is, understanding all inferences that can be drawn from a document.

WEB-ASSISTED CONTENT ANALYSIS AND ANNOTATION. There is a large body of work on using the Web to understand and analyze content. Nakov and Hearst [30] have shown the power of using the Web as training data for natural language analysis. Web assistance for extracting keywords for the purposes of content indexing and annotation is studied in [12, 37, 26]. This body of work is focused on automated, Web-based tools for understanding the meaning of the text as written, as opposed to the inferences that can be drawn based on the text. That said, in our work we use very simple content analysis tools, and improvements to our approach could involve more sophisticated content analysis tools, including Web-based tools such as those developed in these works.

WEB-BASED DATA AGGREGATION. Finally, we note that the commercial world is beginning to offer Web-based data aggregation tools (see, for example, [14, 13, 31]) for the purposes of tracking competitor behavior, doing market analysis, and intelligence gathering. We are not aware of support in these offerings for pre-production inference control, which is the focus of this paper.

Model and Generic Algorithm

Let C denote a private collection of documents that is being considered for public release, and let R denote a collection of reference documents. For example, the collection C may consist of the blog entries of a writer, and the collection R may consist of all documents publicly available on the Web.

Let K(C) denote all the knowledge that can be computed from the private collection C. The set K(C) informally represents all the statements and facts that can be logically derived from the information contained in the collection C. The set K(C) could in theory be computed with a complete and sound theorem prover given all the axioms in C. In practice, such a computation is impossible and we will instead rely on approximate representations of the set K(C). Similarly, let K(R) denote all the knowledge that can be computed from the reference collection R.

Informally stated, the problem of inference control comes from the fact that the knowledge K(C ∪ R) that can be extracted from the union of the private and reference collections is typically greater than the union K(C) ∪ K(R) of what can be extracted separately from C and R. The inference control problem is to understand and control the difference:

Diff(C, R) = K(C ∪ R) - (K(C) ∪ K(R))

Returning to the Osama Bin Laden example discussed in the introduction, consider the case where the collection C consists of the single declassified FBI document [8], and where R consists of all information publicly available on the Web. Let S denote the statement:
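To make the difference Diff(C, R) concrete, the following is a minimal sketch under strong simplifying assumptions that are ours, not the paper's: each collection is reduced to a set of atomic facts plus simple if-then rules, and K(.) is approximated by forward chaining over those rules. The function names (closure, diff) and the example facts are hypothetical and chosen only for illustration. As noted in the text, K cannot be computed exactly in practice, so this toy model only shows how knowledge can appear in K(C ∪ R) while being absent from both K(C) and K(R).

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A rule is (premises, conclusion): once every premise is known,
# the conclusion becomes derivable. This Horn-style model is an assumption.
Rule = Tuple[FrozenSet[str], str]


def closure(facts: Iterable[str], rules: Iterable[Rule]) -> Set[str]:
    """Toy stand-in for K(.): apply rules repeatedly until nothing new is derived."""
    known = set(facts)
    rule_list = list(rules)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rule_list:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


def diff(c_facts: Set[str], c_rules: List[Rule],
         r_facts: Set[str], r_rules: List[Rule]) -> Set[str]:
    """Knowledge derivable from C and R together but from neither alone."""
    k_c = closure(c_facts, c_rules)
    k_r = closure(r_facts, r_rules)
    k_both = closure(c_facts | r_facts, c_rules + r_rules)
    return k_both - (k_c | k_r)


# Hypothetical data: an anonymous blog (C) and public Web knowledge (R).
C_FACTS = {"author_lives_in_smallville", "author_teaches_chemistry"}
C_RULES: List[Rule] = []
R_FACTS = {"smallville_has_one_chemistry_teacher"}
R_RULES: List[Rule] = [
    (frozenset({"author_lives_in_smallville",
                "author_teaches_chemistry",
                "smallville_has_one_chemistry_teacher"}),
     "author_is_identifiable"),
]

inferred = diff(C_FACTS, C_RULES, R_FACTS, R_RULES)
print(inferred)
```

In this toy setting, neither collection alone supports the identifying conclusion: the blog lacks the public fact, and the reference collection lacks the blog's details. Only the combination triggers the rule, which is the kind of inference the model is meant to capture.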