of keywords when looking for inferences. In contrast, if readability is an important concern, then the considered sets might be those favoring certain word types.

What we discuss here is one example of using Web-based inference detection to improve the redaction process. The approach we take is influenced by readability and performance (i.e. speed of the redaction process) but is by no means an optimal approach with respect to either concern. We began by applying some simple redaction rules to the document [8]. Specifically, we removed all location references, since our example in section 1 indicated those were important to identifying the biography subject; any dates near September 11, 2001, which is clearly a memorable date; and finally, all citation titles, since when paired with the associated publication these enable the cited articles to be easily retrieved. The resulting redacted document is depicted in figure 6, where grey rectangles indicate the redactions resulting from the rules just described.

Our subsequent redaction proceeded iteratively. At each stage, we extracted the text from the current document, calculated the keywords ordered by the TF.IDF metric and searched for inferences drawn from subsets of a specified number of the top keywords. We then evaluated the output of the algorithm by checking that the produced links did indeed reflect identifying inferences. If a link did not use all the queried keywords in a discussion about Osama Bin Laden then it was deemed invalid. A common source of invalid links was news article titles printed in the side-bar of the linked page that did not make use of the keywords found in the main body. For example, the query "condone citing prestigious" yields the top hit [6] (a humor site) because a sidebar links to an article with "Osama" in the title; however, none of the keywords are used in the description of that article. [footnote 14]

We incorporated manual review of the links because the current form of our algorithms involves too little content analysis to provide confidence that a returned link reflects a strong connection between the associated keywords and Osama Bin Laden. In addition, given the high-security nature of most redaction settings it is unlikely that a purely automated process will ever be accepted. For those inferences that were found valid, we made redactions to prevent such inferences and repeated this process for the newly redacted document. The following makes the steps we followed precise.

1. Dates near September 11, 2001, titles of all citations and location names were removed from the biography [8].

2. For i = 2, . . . , 5:

   (a) We executed Google queries for each i-tuple in the top n_i keywords in the biography. The n_i values were chosen based on performance constraints as described in section 5. [footnote 15] The (i, n_i) values were: (2, 50), (3, 20), (4, 15) and (5, 13). We concluded with 5-tuples because no valid inferences were found for that run of the algorithm, and only 7% of the links returned by the algorithm run for (i, n_i) = (4, 15) were valid. For each (i, n_i) execution of the algorithm we received a list of sets of keywords that were potentially inference-enabling, and the associated top link leading the algorithm to make this conclusion.

   (b) We reviewed the returned links to see if all the corresponding keywords were used in a discussion of Osama Bin Laden. If so, we made a judgement as to which keyword or keywords to remove in order to eliminate the inference while preserving readability of the document.

   [...] the current form of the redacted document.

Figure 5 lists the words that were redacted as a result of our Web-based inference detection algorithm. The table also gives an example link output by the algorithm that motivates the redaction and a brief explanation of why the word is sensitive (gained from the manual review of the link(s)). Note that while our algorithm found some document features to be identifying that are unlikely to have been covered by a generic redaction rule (e.g. Osama Bin Laden's father's attribute of being a building magnate), it left other, seemingly unusual, attributes (such as Osama Bin Laden potentially being one of 20 children). Since the Web is at best a proxy for human knowledge, and our algorithm used the Web in a limited way (i.e. our analysis was limited to a few hits with little NLP use), it seems likely that inferences were missed. Hence, we emphasize that our tool is best used to semi-automate the redaction process.

Finally, we note that the act of redacting information may introduce, as well as remove, privacy problems. For example, as noted by Vern Paxson [39], redacting "Boston" without redacting "Globe" may allow the sensitive term "Boston" to be inferred. Our tool suggests "Boston" for redaction, as opposed to "Boston Globe", because a number of Osama Bin Laden's relatives reside there; however, acting on this recommendation is problematic precisely because of the difference between the nature of the inference and the document usage of the term. An improved algorithm would understand the use of the term within the document and use this to guide the redaction process.

Our final redacted document is shown in the right-hand side of figure 6.
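To make step 2(a) of the procedure concrete, the following is a minimal Python sketch of how the keyword tuples for each (i, n_i) pair could be enumerated and counted. It is an illustration under stated assumptions, not the implementation used in the study: the function names `candidate_queries` and `query_counts` are hypothetical, the keyword list is assumed to be already ranked by TF.IDF, and issuing the actual search queries and reviewing the returned links are left out.

```python
from itertools import combinations
from math import comb

# (i, n_i) pairs reported in the text: tuples of size i are drawn
# from the top n_i keywords of the current document.
SCHEDULE = [(2, 50), (3, 20), (4, 15), (5, 13)]


def candidate_queries(ranked_keywords, i, n_i):
    """Return an iterator over every i-keyword subset of the top n_i keywords.

    `ranked_keywords` is assumed to be sorted by descending TF.IDF score
    (hypothetical input; the ranking itself is not shown here).
    """
    top = ranked_keywords[:n_i]
    return combinations(top, i)


def query_counts(schedule=SCHEDULE):
    """Map each (i, n_i) pair to the number of i-tuples it generates,
    i.e. the binomial coefficient C(n_i, i)."""
    return {(i, n_i): comb(n_i, i) for i, n_i in schedule}


if __name__ == "__main__":
    for (i, n_i), count in query_counts().items():
        print(f"i={i}, n_i={n_i}: {count} candidate queries")
```

Under these assumptions the four runs generate 1225, 1140, 1365 and 1287 candidate tuples respectively. The counts are of similar size, which would be consistent with the n_i values having been chosen to keep the number of queries per run roughly constant, although the text only states that they were chosen based on performance constraints.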