let W_B = {W_1, W_2, ..., W_30} be a subset of 30 terms from that list, where the selection method varies with each run of the experiment (see the results discussion below for the specifics).

(c) For each pair of words {W_i, W_j} ∈ S_B, let Q_{i,j} be the query consisting of just those two words with no additional punctuation and the restriction that no pages from the domain of source page B be returned, and that all returned pages be text or html (to avoid parsing difficulties). Let H_{i,j} denote the first hit returned after issuing query Q_{i,j} to Google, after known medical terms Web sites were removed from the Google results (footnote 8).

(d) For all i, j ∈ {1, ..., 30}, i ≠ j, and for ℓ ∈ {1, ..., b}, search for the string v_ℓ ∈ K* in the first 5000 lines of H_{i,j}. If v_ℓ is found, record v_ℓ, W_i, W_j and H_{i,j} and discontinue the search.

2. Output: All triples (v_ℓ, Q_{i,j}, H_{i,j}) found in step 1, where v_ℓ is in the first 5000 lines of H_{i,j}.

RESULTS FOR STD EXPERIMENTS. We ran the above test on the Wikipedia page about STDs [41], B, and a selected set, B′, of 30 keywords from the medical term index [29]. The set B′ was selected by starting at the 49th entry in the medical term index and selecting every 400th word in order to approximate a random selection of medical terms. As expected, keyword pairs from input B generated far more hits for STDs (306/435 > 70%) than keyword pairs from B′ (108/435 < 25%). The results are summarized in figure 3.

RESULTS FOR ALCOHOLISM EXPERIMENTS. We ran the above test on the Wikipedia page about alcoholism [40], B, and a selected set, B′, of 30 keywords from the medical term index [29]. For the run analyzed in Figure 4, the set B′ was selected by starting at the 52nd entry in the medical term index and selecting every 100th word until 30 were accumulated in order to approximate a random selection of medical terms. As expected, keyword pairs from input B generated far more hits for alcoholism (47.82%) than B′ (9.43%). In addition, we manually reviewed the URLs that yielded a hit in v_ℓ ∈ K_Alc for a seemingly innocuous pair of keywords. These results are summarized in figure 4.

APPLYING THE RESULTS. When redacting medical records, a redaction practitioner might use the results in figures 3 and 4 to choose content to redact. For example, figure 4 indicates the medications naltrexone and acamprosate should be removed due to their popularity as alcoholism treatments. The words identified as STD-inference enabling are far more ambiguous (e.g. "transmit", "infected"). However, for some individuals the very fact that even general terms are frequently associated with sensitive diseases may be enough to justify redaction (e.g. a politician may desire the removal of any "red flag" words). In general though, we think a redaction practitioner could defensibly not make it a practice to redact such general terms given their association with other, less sensitive, diseases. This emphasizes that our techniques support semi-automation, but not full automation, of the redaction process.

PERFORMANCE. Amortizing the cost of text extraction from the Wikipedia source page over all the queries, determining if each keyword pair yielded a top hit containing a sensitive word took approximately 150 seconds. Hence, each of the experiments in figures 3 and 4 took around 6 hours, since 435 pairs from the Wikipedia page were tested along with 435 pairs from the "control" set of keywords.

As in the de-anonymization experiments, our main time cost was due to the process of text extraction from html. For these experiments caching is likely to significantly improve performance as many of the medical resource sites were visited multiple times.

6 Use Scenario: Iterative Redaction

As mentioned in sections 1 and 4, the process of sanitizing documents by removing obviously identifying information like names and social security numbers can be improved by using Web-based inference detection to identify pieces of seemingly innocuous information that can be used to make sensitive inference. To illustrate this idea, we return to the poorly redacted FBI document in the left-hand side of figure 6. Algorithms like those presented in sections 3.2 and 5 can be used to identify sets of keywords that allow for undesired inferences. Some or all of those keywords can then be redacted to improve the sanitization process.

We emphasize that the strategy for redacting based upon the inferences detected by our algorithms is a research problem that is not addressed by this paper. Indeed many strategies are possible. For example, one might redact the minimum set of words (in which case, the redactor seeks to find a minimum set cover for the collection of sets output by the inference detection algorithm). Alternatively, the redactor might be biased in favor of redacting certain parts of speech (e.g. nouns rather than verbs) to enhance readability of the redacted document.

The type of redaction strategy that is employed may influence the Web-based inference detection algorithm. For example, if the goal is to redact the minimum set of words, then it is necessary to consider all possible sets