(such as HIV status, drug or alcohol abuse and mental health conditions) from medical records prior to releasing them. Among individuals, anonymous bloggers are a good example of people who seek to ensure that their posts do not disclose their secret (their identity). This is made challenging by the fact that in some cases very little personal information may suffice to infer the blogger's identity. For example, if the second author of this paper were to reveal his first name (Philippe) and mention the first name of his wife (Sanae), then his last name (or at least, a strong candidate for his last name) can be inferred from the first hit returned by the Google query, "Philippe Sanae wedding".

In all these instances, the problem is not access control, but inference control. Assuming the existence of mechanisms to control access to a subset of information, the problem is to determine what information can be released publicly without compromising certain secrets, and what subset of the information cannot be released. What makes this problem difficult is the quantity and complexity of inferences that arise when published data is combined with, and interpreted against, the backdrop of public knowledge and outside data.

This paper breaks new ground in considering the problem of inference detection not in a restricted setting (such as, e.g., database tables), but in all its generality. We propose the first all-purpose approach to detecting unwanted inferences. Our approach is based on the observation that the combination of search engines and the Web, which is so well suited to detect inferences, works equally well defensively as offensively. The Web is an excellent proxy for public knowledge, since it encapsulates a large fraction of that knowledge (though certainly not all). Furthermore, the dynamic nature of the Web reflects the dynamic nature of human knowledge and means that the inferences detected today may be different from those drawn yesterday. The likelihood of certain inferences can thus be estimated automatically, at any point in time, by issuing search queries to the Web. Returning to the example of the biography redacted by the FBI, a simple search query could have flagged the risk of re-identification coming from the keywords "Saudi", "magnate" and "half-brother".

The Web is an ideal resource for identifying inferences because keyword search allows for efficient detection of the information that is associated with an individual. Such associations can be just as important in identifying someone as their personal attributes. As an example, consider the fact that the top 2 hits returned by the Google query, "pop singer vogueing"1 have nothing to do with the singer Madonna, whereas the top 3 hits returned by the Google query, "gay pop singer vogueing"2 all pertain to Madonna. The attribute "gay" helps to focus the results not because it is an attribute of Madonna (at least not as it is used in the top 3 hits) but rather it is an attribute associated with a large subset of her fanbase. Similarly, the entire first page of hits returned by the query "naltrexone acamprosate" all pertain to alcoholism, not because they are alcoholism symptoms or in some other way part of the definition of alcoholism, but rather they are associated with alcoholism because they are drugs commonly used in its treatment.

We propose generic tools for detecting unwanted inferences automatically using the Web. These tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control.

We demonstrate the success of our inference detection tools with two experiments. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.3

OVERVIEW. We discuss related work in section 2. We define our models and tools, as well as our basic algorithm for Web-assisted inference detection in section 3. We list a number of potential applications of Web-assisted inference control in section 4. Section 5 describes two experiments that demonstrate the success of our inference control tools. Section 6 provides an example using Web-based inference detection to improve the redaction process. We conclude in section 7.

Related Work

Our work can be viewed both as a new technique for inference detection and as a new way of leveraging Web search to understand content. There is substantial existing work in both areas, but ours is the first Web-based approach to inference detection. We discuss the most closely related work in these areas below.

INFERENCE DETECTION. Most of the previous work on inference detection has focused on database content (see, for example, [33, 21, 43, 19]). Work in this area takes as input the database schema, the data themselves and,