Web-Based Inference Detection

Jessica Staddon¹, Philippe Golle¹, Bryce Zimny²*
¹ Palo Alto Research Center
² University of Waterloo
{staddon,pgolle}@parc.com
bzzimny@student.cs.uwaterloo.ca

* This work was done while a coop student at the Palo Alto Research Center.

Abstract

Newly published data, when combined with existing public knowledge, allows for complex and sometimes unintended inferences. We propose semi-automated tools for detecting these inferences prior to releasing data. Our tools give data owners a fuller understanding of the implications of releasing data and help them adjust the amount of data they release to avoid unwanted inferences.

Our tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review.

We call this new technology Web-based inference control. The paper reports on two experiments which demonstrate early successes of this technology. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.

1 Introduction

Information has never been easier to find. Search engines allow easy access to the vast amounts of information available on the Web. Online data repositories, newspapers, public records, personal webpages, blogs, etc., make it easy and convenient to look up facts, keep up with events and catch up with people.

On the flip side, information has never been harder to hide. With the help of a search engine or web information integration tool [45], one can easily infer facts, reconstruct events and piece together identities from fragments of information collected from disparate sources. Protecting information requires hiding not only the information itself, but also the myriad of clues that might indirectly lead to it. Doing so is notoriously difficult, as seemingly innocuous information may give away one's secret.

To illustrate the problem, consider a redacted biography [8] (shown in the left-hand side of figure 6) that was released by the FBI. Prior to publication, the biography was redacted to protect the identity of the person whom it describes. All directly identifying information, such as first and last names, was expunged from the biography. The redacted biography contains only keywords that apply to many individuals, such as "half-brother", "Saudi", "magnate" and "Yemen". None of these keywords is particularly identifying on its own, but in aggregate they allow for near-certain identification of Osama Bin Laden. Indeed, a Google search for the query "Saudi magnate half-brother" returns in the top 10 results, pages that are all related to the Bin Laden family. This inference, as well as potentially many others, should be anticipated and countered in a thorough redaction process.

The need to protect secret information from unwanted inferences extends far beyond the FBI. In addition to intelligence agencies and the military, numerous government agencies, businesses and individuals face the problem of insulating their secrets from the information they disclose publicly. In the litigation industry for example, information protected by client-attorney privilege must be redacted from documents prior to disclosure. In the healthcare industry, it is common practice and mandated by some US state laws, to redact sensitive information