sometimes, relations amongst the attributes of the database that are meant to model the outside knowledge a human may wield in order to infer sensitive information. To the best of our knowledge, no systematic method has been demonstrated for integrating this outside knowledge into an inference detection system. Our work seeks to remedy this by demonstrating the use of the Web for this purpose. When coupled with simple keyword extraction, this general technique allows us to detect inferences in a variety of unstructured documents.

A particular type of inference allows the identification of an individual. Sweeney looks for such inferences using the Web in [35], where inferences are enabled by numerical values and other attributes characterizable by regular expressions, such as SSNs, account numbers and addresses. Sweeney does not consider inferences based on English language words. We use the indexing power of search engines to detect when words, taken together, are closely associated with an individual.

The closely related problem of author identification has also been extensively studied by the machine learning community (see, for example, [25, 11, 24, 34, 20]). The techniques developed generally rely on a training corpus of documents and use specific attributes like self-citations [20] or writing style [25] to identify authors. Our work can be viewed as exploiting a previously unstudied method of author identification: using information authors reveal about themselves to identify them.

Atallah et al. [2] describe how natural language processing can potentially be used to sanitize sensitive information when the sanitization rules are already known. Our work is focused on using the Web to identify the sanitization rules.

WEB-ASSISTED QUERY INTERPRETATION. There is a large body of work on using the Web to improve query results (see, for example, [16, 32, 10]). One of the fundamental ideas that has come out of this area is to use overlap in query results to establish a connection between distinct queries. In contrast, we analyze the content of the query results in order to detect connections between the query terms and an individual or topic.

WEB-BASED SOCIAL NETWORK ANALYSIS. Recently, the Web has been used to detect social networks (e.g., [1, 23]). A key idea in this work is using the Web to look for co-occurrences of names and using this to infer a link in a social network. Our techniques can support this type of analysis when, for example, names in a network, entered as a Web query, yield a name that is not already in the network. However, our techniques are aimed at a broader goal, that is, understanding all inferences that can be drawn from a document.

WEB-ASSISTED CONTENT ANALYSIS AND ANNOTATION. There is a large body of work on using the Web to understand and analyze content. Nakov and Hearst [30] have shown the power of using the Web as training data for natural language analysis. Web-assistance for extracting keywords for the purposes of content indexing and annotation is studied in [12, 37, 26]. This work is focused on automated, Web-based tools for understanding the meaning of the text as written, as opposed to the inferences that can be drawn based on the text. That said, in our work we use very simple content analysis tools, and improvements to our approach could involve more sophisticated content analysis tools, including Web-based tools such as those developed in these works.

WEB-BASED DATA AGGREGATION. Finally, we note that the commercial world is beginning to offer Web-based data aggregation tools (see, for example, [14, 13, 31]) for the purposes of tracking competitor behavior, doing market analysis and intelligence gathering. We are not aware of support in these offerings for pre-production inference control, which is the focus of this paper.

3 Model and Generic Algorithm

Let C denote a private collection of documents that is being considered for public release, and let R denote a collection of reference documents. For example, the collection C may consist of the blog entries of a writer, and the collection R may consist of all documents publicly available on the Web.

Let K(C) denote all the knowledge that can be computed from the private collection C. The set K(C) informally represents all the statements and facts that can be logically derived from the information contained in the collection C. The set K(C) could in theory be computed with a complete and sound theorem prover given all the axioms in C. In practice, such a computation is impossible, and we will instead rely on approximate representations of the set K(C). Similarly, let K(R) denote all the knowledge that can be computed from the reference collection R.

Informally stated, the problem of inference control comes from the fact that the knowledge that can be extracted from the union of the private and reference collections, K(C ∪ R), is typically greater than the union K(C) ∪ K(R) of what can be extracted separately from C and R. The inference control problem is to understand and control the difference:

\[
\mathrm{Diff}(C, R) = K(C \cup R) - \bigl( K(C) \cup K(R) \bigr).
\]

Returning to the Osama Bin Laden example discussed in the introduction, consider the case where the collection C consists of the single declassified FBI document [8], and where R consists of all information publicly available on the Web. Let S denote the statement: