'Unknown' by Unknown - Page 9 of 16

Keywords

URL of Top Hit

Name of Person

campaigned soviets

Ronald Reagan

https://www.utexas.edu/features/archive/2004/election policy.html

defense contra reagan

Caspar Weinberger

https://www.pbs.org/wgbh/amex/reagan/peopleevents/pande08.html

reagan attorney

edit pornography

Edwin Meese

https://www.sourcewatch.org/index.php?title=Edwin Meese III

nfl nicole goldman

francisco pro

O. J. Simpson

https://www.brainyhistory.com/years/1997.html

https://www.amazon.com/Kung-Fu-Complete-Second-Season/

kung fu actors

David Carradine

dp/B0006BAWYM

medals medal raid

https://www.voicenet.com/~padilla/pearl.html

honor aviation

Jimmy Doolittle

fables chicago indiana

George Ade

https://www.indianahistory.org/pop hist/people/ade.html

wisconsin illinois chicago

architect designed

Frank Lloyd Wright

https://www.greatbuildings.com/architects/Frank Lloyd Wright.html

Figure 2:

Excerpts from our de-anonymization experiments. Each row lists keywords extracted from the Wikipedia biography of an individual

(categorized under "California" or "Illinois"), a hit returned by a Google query on those keywords that is one of the top three hits returned and

contains the individual's name, and the name of the individual.

5.3

Web-based Sensitive Topic Detection

use this output to decide what words to redact from med-

ical records before they are released in order to preserve

Another application of Web-based inference detection is

the privacy of the patient.

the redaction of medical records. As discussed earlier, it

To gain some confidence in our approach we also used

is common practice to redact all information about dis-

a collection of general medical terms as a "control" and

eases such as HIV/Aids, mental illness, and drug and

followed the same algorithm. That is, we made Google

alcohol abuse, prior to releasing medical records to a

queries using these medical terms and looked for refer-

third party (such as, e.g., a judge in malpractice liti-

ences to a sensitive disease (STDs and alcoholism) in the

gation). Implementing such protections today relies on

returned links. The purpose of this process was to see

the thoroughness of the redaction practitioner to keep

if the results would differ from those obtained with key-

abreast of all the medications, physician names, diag-

words from the Wikipedia pages about STDs and alco-

noses and symptoms that might be associated with such

holism. We expected a distinct difference because the

conditions and practices. Web-based inference detection

Wikipedia pages should yield keywords more relevant to

can be used to improve the thoroughness of this task by

STDs and alcoholism, and indeed the results indicate that

automating the process of identifying the keywords al-

is the case.

lowing such conditions to be inferred.

The following describes our experiment in more de-

tail.

To demonstrate how our algorithm can be used in this

application, our experiments take as input a page that is

1. Input: An ordered set of sensitive words, K ∗ =

viewed as authoritative about a certain disease. In our

{v1, . . . , vb}, for some positive integer b, and a

experiments, we used Wikipedia to supply pages for al-

page, B. B is either the Wikipedia page for alco-

coholism and sexually transmitted diseases (STDs). The

holism [40], the Wikipedia page for sexually trans-

text is then extracted from the html, and keywords are

mitted diseases (STDs) [41] or a "control" page of

identified. To identify keywords that might allow the

general medical terms.

associated disease to be inferred we then issued Google

(a) If B is a Wikipedia page, extract the top

queries on subsets of keywords and examined the top hit

30 keywords from B, forming the set SB ,

for references to the associated disease. In general, we

through the following steps:

counted as a reference any mention of the associated dis-

ease. The one exception to this rule is that we filtered out

i. Extract text from html.

some medical term sites since such sites list unrelated

ii. Calculate the enhanced TF.IDF ranking

medical terms together (for indexing purposes) and we

of each word in the extracted text (sec-

didn't want such lists to trigger inference results.

tion 3). Select the top 30 words as the

ordered set, SB = {W1, W2, . . . , W30}.

In the event that such a reference was found we

(b) If B is a medical terms page, extract the terms

recorded those keywords as being potentially inference-

enabling. In practice, a redaction practitioner might then

using code customized for that Web site and