(such as HIV status, drug or alcohol abuse and mental health conditions) from medical records prior to releasing them. Among individuals, anonymous bloggers are a good example of people who seek to ensure that their posts do not disclose their secret (their identity). This is made challenging by the fact that in some cases very little personal information may suffice to infer the blogger's identity. For example, if the second author of this paper were to reveal his first name (Philippe) and mention the first name of his wife (Sanae), then his last name (or at least, a strong candidate for his last name) can be inferred from the first hit returned by the Google query, "Philippe Sanae wedding".

In all these instances, the problem is not access control, but inference control. Assuming the existence of mechanisms to control access to a subset of information, the problem is to determine what information can be released publicly without compromising certain secrets, and what subset of the information cannot be released. What makes this problem difficult is the quantity and complexity of inferences that arise when published data is combined with, and interpreted against, the backdrop of public knowledge and outside data.

This paper breaks new ground in considering the problem of inference detection not in a restricted setting (such as, e.g., database tables), but in all its generality. We propose the first all-purpose approach to detecting unwanted inferences. Our approach is based on the observation that the combination of search engines and the Web, which is so well suited to detect inferences, works equally well defensively as offensively. The Web is an excellent proxy for public knowledge, since it encapsulates a large fraction of that knowledge (though certainly not all). Furthermore, the dynamic nature of the Web reflects the dynamic nature of human knowledge and means that the inferences detected today may be different from those drawn yesterday. The likelihood of certain inferences can thus be estimated automatically, at any point in time, by issuing search queries to the Web. Returning to the example of the biography redacted by the FBI, a simple search query could have flagged the risk of re-identification coming from the keywords "Saudi", "magnate" and "half-brother".

The Web is an ideal resource for identifying inferences because keyword search allows for efficient detection of the information that is associated with an individual. Such associations can be just as important in identifying someone as their personal attributes. As an example, consider the fact that the top 2 hits returned by the Google query, "pop singer vogueing"¹ have nothing to do with the singer Madonna, whereas the top 3 hits returned by the Google query, "gay pop singer vogueing"² all pertain to Madonna. The attribute "gay" helps to focus the results not because it is an attribute of Madonna (at least not as it is used in the top 3 hits) but rather it is an attribute associated with a large subset of her fan base. Similarly, the entire first page of hits returned by the query "naltrexone acamprosate" all pertain to alcoholism, not because they are alcoholism symptoms or in some other way part of the definition of alcoholism, but rather they are associated with alcoholism because they are drugs commonly used in its treatment.

We propose generic tools for detecting unwanted inferences automatically using the Web. These tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control.

We demonstrate the success of our inference detection tools with two experiments. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.³

OVERVIEW.  We discuss related work in section 2. We define our models and tools, as well as our basic algorithm for Web-assisted inference detection in section 3. We list a number of potential applications of Web-assisted inference control in section 4. Section 5 describes two experiments that demonstrate the success of our inference control tools. Section 6 provides an example using Web-based inference detection to improve the redaction process. We conclude in section 7.

2 Related Work

Our work can be viewed both as a new technique for inference detection and as a new way of leveraging Web search to understand content. There is substantial existing work in both areas, but ours is the first Web-based approach to inference detection. We discuss the most closely related work in these areas below.

INFERENCE DETECTION. Most of the previous work on inference detection has focused on database content (see, for example, [33, 21, 43, 19]). Work in this area takes as input the database schema, the data themselves and,