of keywords when looking for inferences. In contrast, if readability is an important concern, then the considered sets might be those favoring certain word types.

What we discuss here is one example of using Web-based inference detection to improve the redaction process. The approach we take is influenced by readability and performance (i.e. speed of the redaction process) but is by no means an optimal approach with respect to either concern. We began by applying some simple redaction rules to the document [8]. Specifically, we removed all location references, since our example in section 1 indicated those were important to identifying the biography subject; any dates near September 11, 2001, which is clearly a memorable date; and finally, all citation titles, since when paired with the associated publication these enable the cited articles to be easily retrieved. The resulting redacted document is depicted in figure 6, where grey rectangles indicate the redaction resulting from the rules just described.

Our subsequent redaction proceeded iteratively. At each stage, we extracted the text from the current document, calculated the keywords ordered by the TF.IDF metric and searched for inferences drawn from subsets of a specified number of the top keywords. We then evaluated the output of the algorithm by checking that the produced links did indeed reflect identifying inferences. If a link did not use all the queried keywords in a discussion about Osama Bin Laden then it was deemed invalid. A common source of invalid links was news article titles printed in the side-bar of the linked page that did not make use of the keywords found in the main body. For example, the query "condone citing prestigious" yields the top hit [6] (a humor site) because a sidebar links to an article with "Osama" in the title; however, none of the keywords are used in the description of that article (see footnote 14).

We incorporated manual review of the links because the current form of our algorithms involves too little content analysis to provide confidence that a returned link reflects a strong connection between the associated keywords and Osama Bin Laden. In addition, given the high-security nature of most redaction settings, it is unlikely that a purely automated process will ever be accepted. For those inferences that were found valid, we made redactions to prevent such inferences and repeated this process for the newly redacted document. The following makes the steps we followed precise.

1. Dates near September 11, 2001, titles of all citations and location names were removed from the biography [8].
2. For i = 2, ..., 5:
   (a) We executed Google queries for each i-tuple in the top n_i keywords in the biography. The n_i values were chosen based on performance constraints as described in section 5.1 (see footnote 15). The (i, n_i) values were: (2, 50), (3, 20), (4, 15) and (5, 13). We concluded with 5-tuples because no valid inferences were found for that run of the algorithm, and only 7% of the links returned by the algorithm run for (i, n_i) = (4, 15) were valid. For each (i, n_i) execution of the algorithm we received a list of sets of keywords that were potentially inference-enabling, and the associated top link leading the algorithm to make this conclusion.
   (b) We reviewed the returned links to see if all the corresponding keywords were used in a discussion of Osama Bin Laden. If so, we made a judgement as to which keyword or keywords to remove in order to eliminate the inference while preserving readability of the document.
   (c) We incremented i and returned to step (a) with the current form of the redacted document.

Figure 5 lists the words that were redacted as a result of our Web-based inference detection algorithm. The table also gives an example link output by the algorithm that motivates the redaction and a brief explanation of why the word is sensitive (gained from the manual review of the link(s)). Note that while our algorithm found some document features to be identifying that are unlikely to have been covered by a generic redaction rule (e.g. Osama Bin Laden's father's attribute of being a building magnate), it left other, seemingly unusual, attributes (such as Osama Bin Laden potentially being one of 20 children). Since the Web is at best a proxy for human knowledge, and our algorithm used the Web in a limited way (i.e. our analysis was limited to a few hits with little NLP use), it seems likely that inferences were missed. Hence, we emphasize that our tool is best used to semi-automate the redaction process.

Finally, we note that the act of redacting information may introduce, as well as remove, privacy problems. For example, as noted by Vern Paxson [39], redacting "Boston" without redacting "Globe" may allow the sensitive term "Boston" to be inferred. Our tool suggests "Boston" for redaction, as opposed to "Boston Globe", because a number of Osama Bin Laden's relatives reside there; however, acting on this recommendation is problematic precisely because of the difference between the nature of the inference and the document usage of the term. An improved algorithm would understand the use of the term within the document and use this to guide the redaction process.

Our final redacted document is shown in the right-hand side of figure 6.
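
As a rough illustration of step 2(a) of the procedure above, the following Python sketch enumerates the i-tuple queries and counts how many queries each reported (i, n_i) pair implies. The function names and the `search` and `is_valid` callables are hypothetical stand-ins introduced here. The source does not give an implementation, `search` is assumed to wrap whatever search engine is used, and `is_valid` stands in for the manual review of step 2(b).

```python
from itertools import combinations
from math import comb

# (i, n_i) pairs as reported in step 2(a).
SCHEDULE = [(2, 50), (3, 20), (4, 15), (5, 13)]


def tuple_queries(ranked_keywords, i, n_i):
    """Yield one space-joined query per i-subset of the top n_i keywords.

    `ranked_keywords` is assumed to be already sorted by TF.IDF score,
    highest first.
    """
    for subset in combinations(ranked_keywords[:n_i], i):
        yield " ".join(subset)


def query_counts(schedule=SCHEDULE):
    """Return the number of queries each (i, n_i) pair implies: C(n_i, i)."""
    return {(i, n_i): comb(n_i, i) for i, n_i in schedule}


def redaction_pass(ranked_keywords, i, n_i, search, is_valid):
    """Run one (i, n_i) pass and return the (query, link) pairs judged valid.

    Hypothetical callables (assumptions, not part of the source):
      search(query)         -> top link suggesting the sensitive subject, or None
      is_valid(query, link) -> True if a reviewer confirms that the link uses
                               all queried keywords in a discussion of the subject
    """
    flagged = []
    for query in tuple_queries(ranked_keywords, i, n_i):
        link = search(query)
        if link is not None and is_valid(query, link):
            flagged.append((query, link))
    return flagged


if __name__ == "__main__":
    for (i, n_i), count in query_counts().items():
        print(f"i={i}, n_i={n_i}: {count} queries")
```

Under these assumptions, the four reported pairs imply 1225, 1140, 1365 and 1287 queries respectively. The query budget per value of i is therefore roughly constant, which seems consistent with the n_i values having been chosen under performance constraints.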