let W_B = {W_1, W_2, ..., W_30} be a subset of 30 terms from that list, where the selection method varies with each run of the experiment (see the results discussion below for the specifics).

Q_{i,j} be the query consisting of just those two words with no additional punctuation and the restriction that no pages from the domain of source page B be returned, and that all returned pages be text or html (to avoid parsing difficulties). Let H_{i,j} denote the first hit returned after issuing query Q_{i,j} to Google, after known medical terms Web sites were removed from the Google results⁸.

(d) For all i, j ∈ {1, ..., 30}, i ≠ j, and for ℓ ∈ {1, ..., b}, search for the string v_ℓ ∈ K* in the first 5000 lines of H_{i,j}. If v_ℓ is found, record v_ℓ, w_i, w_j and H_{i,j} and discontinue the search.

2. Output: All triples (v_ℓ, Q_{i,j}, H_{i,j}) found in step 1, where v_ℓ is in the first 5000 lines of H_{i,j}.

RESULTS FOR STD EXPERIMENTS. We ran the above test on the Wikipedia page about STDs [41], B, and a selected set, B′, of 30 keywords from the medical term index [29]. The set B′ was selected by starting at the 49th entry in the medical term index and selecting every 400th word in order to approximate a random selection of medical terms. As expected, keyword pairs from input B generated far more hits for STDs (306/435 > 70%) than keyword pairs from B′ (108/435 < 25%). The results are summarized in figure 3.

RESULTS FOR ALCOHOLISM EXPERIMENTS. We ran the above test on the Wikipedia page about alcoholism [40], B, and a selected set, B′, of 30 keywords from the medical term index [29]. For the run analyzed in Figure 4, the set B′ was selected by starting at the 52nd entry in the medical term index and selecting every 100th word until 30 were accumulated in order to approximate a random selection of medical terms. As expected, keyword pairs from input B generated far more hits for alcoholism (47.82%) than B′ (9.43%). In addition, we manually reviewed the URLs that yielded a hit on v ∈ K*_Alc for a seemingly innocuous pair of keywords. These results are summarized in figure 4.

APPLYING THE RESULTS. When redacting medical records, a redaction practitioner might use the results in figures 3 and 4 to choose content to redact. For example, figure 4 indicates the medications naltrexone and acamprosate should be removed due to their popularity as alcoholism treatments. The words identified as STD-inference enabling are far more ambiguous (e.g. "transmit", "infected"). However, for some individuals the very fact that even general terms are frequently associated with sensitive diseases may be enough to justify redaction (e.g. a politician may desire the removal of any "red flag" words). In general though, we think a redaction practitioner could defensibly not make it a practice to redact such general terms given their association with other, less sensitive, diseases. This emphasizes that our techniques support semi-automation, but not full automation, of the redaction process.

PERFORMANCE. Amortizing the cost of text extraction from the Wikipedia source page over all the queries, determining if each keyword pair yielded a top hit containing a sensitive word took approximately 150 seconds. Hence, each of the experiments in figures 3 and 4 took around 6 hours, since 435 pairs from the Wikipedia page were tested along with 435 pairs from the "control" set of keywords.

As in the de-anonymization experiments, our main time cost was due to the process of text extraction from html. For these experiments caching is likely to significantly improve performance as many of the medical resource sites were visited multiple times.

6 Use Scenario: Iterative Redaction

As mentioned in sections 1 and 4, the process of sanitizing documents by removing obviously identifying information like names and social security numbers can be improved by using Web-based inference detection to identify pieces of seemingly innocuous information that can be used to make sensitive inference. To illustrate this idea, we return to the poorly redacted FBI document in the left-hand side of figure 6. Algorithms like those presented in sections 3.2 and 5 can be used to identify sets of keywords that allow for undesired inferences. Some or all of those keywords can then be redacted to improve the sanitization process.

We emphasize that the strategy for redacting based upon the inferences detected by our algorithms is a research problem that is not addressed by this paper. Indeed many strategies are possible. For example, one might redact the minimum set of words (in which case, the redactor seeks to find a minimum set cover for the collection of sets output by the inference detection algorithm). Alternatively, the redactor might be biased in favor of redacting certain parts of speech (e.g. nouns rather than verbs) to enhance readability of the redacted document.

The type of redaction strategy that is employed may influence the Web-based inference detection algorithm. For example, if the goal is to redact the minimum set of words, then it is necessary to consider all possible sets