Check out the new USENIX Web site.
Keywords
URL of Top Hit
Name of Person
campaigned soviets
Ronald Reagan
https://www.utexas.edu/features/archive/2004/election policy.html
defense contra reagan
Caspar Weinberger
https://www.pbs.org/wgbh/amex/reagan/peopleevents/pande08.html
reagan attorney
edit pornography
Edwin Meese
https://www.sourcewatch.org/index.php?title=Edwin Meese III
nfl nicole goldman
francisco pro
O. J. Simpson
https://www.brainyhistory.com/years/1997.html
https://www.amazon.com/Kung-Fu-Complete-Second-Season/
kung fu actors
David Carradine
dp/B0006BAWYM
medals medal raid
https://www.voicenet.com/~padilla/pearl.html
honor aviation
Jimmy Doolittle
l
fables chicago indiana
George Ade
https://www.indianahistory.org/pop hist/people/ade.html
wisconsin illinois chicago
architect designed
Frank Lloyd Wright
https://www.greatbuildings.com/architects/Frank Lloyd Wright.html
Figure 2:
Excerpts from our de-anonymization experiments. Each row lists keywords extracted from the Wikipedia biography of an individual
(categorized under "California" or "Illinois"), a hit returned by a Google query on those keywords that is one of the top three hits returned and
contains the individual's name, and the name of the individual.
5.3
Web-based Sensitive Topic Detection
use this output to decide what words to redact from med-
ical records before they are released in order to preserve
Another application of Web-based inference detection is
the privacy of the patient.
the redaction of medical records. As discussed earlier, it
To gain some confidence in our approach we also used
is common practice to redact all information about dis-
a collection of general medical terms as a "control" and
eases such as HIV/Aids, mental illness, and drug and
followed the same algorithm. That is, we made Google
alcohol abuse, prior to releasing medical records to a
queries using these medical terms and looked for refer-
third party (such as, e.g., a judge in malpractice liti-
ences to a sensitive disease (STDs and alcoholism) in the
gation). Implementing such protections today relies on
returned links. The purpose of this process was to see
the thoroughness of the redaction practitioner to keep
if the results would differ from those obtained with key-
abreast of all the medications, physician names, diag-
words from the Wikipedia pages about STDs and alco-
noses and symptoms that might be associated with such
holism. We expected a distinct difference because the
conditions and practices. Web-based inference detection
Wikipedia pages should yield keywords more relevant to
can be used to improve the thoroughness of this task by
STDs and alcoholism, and indeed the results indicate that
automating the process of identifying the keywords al-
is the case.
lowing such conditions to be inferred.
The following describes our experiment in more de-
tail.
To demonstrate how our algorithm can be used in this
application, our experiments take as input a page that is
1. Input: An ordered set of sensitive words, K  ∗ =
viewed as authoritative about a certain disease. In our
{v1, . . . , vb}, for some positive integer b, and a
experiments, we used Wikipedia to supply pages for al-
page, B. B is either the Wikipedia page for alco-
coholism and sexually transmitted diseases (STDs). The
holism [40], the Wikipedia page for sexually trans-
text is then extracted from the html, and keywords are
mitted diseases (STDs) [41] or a "control" page of
identified. To identify keywords that might allow the
general medical terms.
associated disease to be inferred we then issued Google
(a) If B is a Wikipedia page, extract the top
queries on subsets of keywords and examined the top hit
30 keywords from B, forming the set SB ,
for references to the associated disease. In general, we
through the following steps:
counted as a reference any mention of the associated dis-
ease. The one exception to this rule is that we filtered out
i. Extract text from html.
some medical term sites since such sites list unrelated
ii. Calculate the enhanced TF.IDF ranking
medical terms together (for indexing purposes) and we
of each word in the extracted text (sec-
didn't want such lists to trigger inference results.
tion 3). Select the top 30 words as the
ordered set, SB = {W1, W2, . . . , W30}.
In the event that such a reference was found we
(b) If B is a medical terms page, extract the terms
recorded those keywords as being potentially inference-
enabling. In practice, a redaction practitioner might then
using code customized for that Web site and