Check out the new USENIX Web site.
that are most representative of Ci. Let Si denote the set
sensitive keywords. One way of doing so is to issue a
of the top α keywords extracted from document Ci, and
search query for documents in the reference collection
let S = n 1Si.
R that contain the sensitive topic, then use TF.IDF
i=
to extract from these documents an expanded set of
INFERENCE DETECTION. The list L of inferences is ini-
sensitive keywords.
tially empty. We consider in turn every subset S  ⊆
S of size |S | β.  For every such subset S  =
(W1, . . . , Wk ), with k β, we do the following:
4
Example Applications
1. We use a search engine to retrieve from the collec-
tion R of reference documents the top γ documents
This section describes a wide array of potential applica-
that contain all the keywords W1, . . . , Wk .
tions for Web-based inference detection. All these appli-
cations are based on the fundamental algorithm of sec-
2. With TF.IDF, we extract the top δ keywords from
tion 3. The first two applications are the subjects of the
this collection of γ documents. Note that these key-
experiments described in detail in section 5. Experiment-
words are extracted from the aggregate collection of
ing with other applications will be the subject of future
γ documents (as if all these documents were con-
work.
catenated into a single large document), not from
REDACTION OF MEDICAL RECORDS. Medical records
each individual document.
are often released to third parties such as insurance com-
3. Let K0 denote the intersection of the δ keywords
panies, research institutions or legal counsel in the case
from step 2 with the set K  ∗ of sensitive keywords.
of malpractice lawsuits. State and federal legislation
If K0 is non-empty, we add to L the inference C
mandates the redaction of sensitive information from
K0 .
medical records prior to release. For example, all ref-
erences to drugs and alcohol, mental health and HIV sta-
The algorithm outputs the list L and terminates.
tus must typically be redacted. This redaction task is far
more complex than it may initially appear. Extensive and
3.3
Variants of the Algorithm
up-to-date knowledge of diseases and drugs is required to
detect all clues and combinations of clues that may allow
The algorithm of section 3.2 can be tailored to a variety
for inference of sensitive information. Since this medical
of applications. Two such applications are discussed in
information is readily available on public websites, the
exhaustive detail in section 5. Here, we discuss briefly
process of redacting sensitive information from medical
other possible variants of the basic algorithm.
records can be partially automated with Web-based infer-
ence control. Section 5.3 reports on our experiments with
DETECTING ALL INFERENCES. In some applications,
the set of sensitive knowledge K  ∗ may not be known or
Web-based inference detection for medical redaction.
may not be specified. Instead, the goal is to identify all
PRESERVING INDIVIDUAL ANONYMITY. Intelligence
possible inferences that arise from knowledge of the col-
and other governmental agencies are often forced by law
lection of documents C and the reference collection R.
(such as the Freedom of Information Act) to release pub-
A simple variation of the algorithm given in 3.2 handles
licly documents that pertain to a particular individual or
this case. In step 3 of the inference detection phase, we
group of individuals. To protect the privacy of those con-
record all inferences instead of only inferences that in-
cerned, the documents must be released in a form that
volve keywords in K  ∗. Note that this is equivalent to
does not allow for unique identification. This problem is
assuming that the set K  ∗ of sensitive knowledge consists
notoriously difficult, because seemingly innocuous infor-
of all knowledge. The algorithm may also track the num-
mation may allow for unique identification, as illustrated
ber of occurrences of each inference, so that the list L can
by the poorly redacted Osama Bin Laden biography [8]
be sorted from most to least frequent inference.
discussed in the introduction. Web-based inference con-
ALTERNATIVE  REPRESENTATION  OF  SENSITIVE
trol is perfectly suited to the detection of indirect infer-
KNOWLEDGE. The algorithm of section 3.2 assumes
ences based on publicly available data. Our tools can
that the sensitive knowledge K  ∗ is given as a set of
be used to determine how much information can be re-
leased about a person, entity or event while preserving k-
keywords. Other representations of sensitive knowledge
are possible. In some applications for example, sensitive
anonymity, i.e. ensuring that it remains hidden in a group
of like-entities of size at least k, and cannot be identified
knowledge may consist of a topic (e.g.  alcoholism,
or sexually transmitted diseases) instead of a list of
any more precisely within the group. Section 5.2 reports
keywords. To handle this case, we need a pre-processing
on our experiments with Web-based inference detection
step which converts a sensitive topic into a list of
for preserving individual anonymity.