FORMULATION OF REDACTION RULES. Our Web-based inference detection tools can also be used to pre-compute a set of redaction rules that is later applied to a collection of private documents. For a large collection of private documents, pre-computing redaction rules may be more efficient than using Web-based inference detection to analyze each and every document. In 1995, for example, Executive Order 12958 mandated the declassification of large amounts of government data [9] (hundreds of millions of pages). Sensitive portions of documents were to be redacted prior to declassification. The redaction rules were exceedingly complex and formulating them was reportedly nearly as time-consuming as applying them. Web-based inference detection is an appealing approach to automatically expand a small set of seed redaction rules. For example, assuming that the keyword "missile" is sensitive, Web-based inference detection could automatically retrieve other keywords related to missiles (e.g. "guidance system", "ballistics", "solid fuel") and add them to the redaction rule.

PUBLIC IMAGE CONTROL. This application considers the problem of verifying that a document conforms to the intentions of its author, and does not accidentally reveal private information or information that could easily be misinterpreted or understood in the wrong context. This application, unlike others, does not assume that the set of unwanted inferences is known or explicitly defined. Instead, the goal of this application is to design a broad, general-purpose tool that helps contextualize information and may draw an author's attention to a broad array of potentially unwanted inferences. For example, Web-based inference detection could alert the author of a blog to the fact that a particular posting contains a combination of keywords that will make the blog appear prominently in the results of some search query. This problem is related to other approaches to public image management, such as [13, 31]. Few technical details have been published about these other approaches, but they do not appear focused on inference detection and control.

LEAK DETECTION. This application helps a data owner avoid accidental releases of information that was not previously public. In this application of Web-based inference control, the set of sensitive knowledge K* consists of all information that was not previously public. In other words, the release of private data should not add anything to public knowledge. This application may have helped prevent, for example, a recent incident in which Google accidentally released confidential financial information in the notes of a PowerPoint presentation distributed to financial analysts [22].

5 Experiments

Our experiments focus on exploring the first two privacy monitor applications of section 4: redaction of medical records and preserving individual anonymity. In testing these ideas, we faced two main challenges that constrained our experimental design. First, and most challenging, was designing relevant experiments that we could execute given available data. The second, more pragmatic, challenge was getting the right tools in place and executing the experiments in a time-efficient manner. We describe each of these challenges, and our approach to meeting them, in more detail below.

5.1 Experimental Design Challenges and Tools

Ideally, our idea of Web-based inference detection would be tested on authentic documents for which privacy is a chief concern. For example, a corpus of medical records being prepared for release in response to a subpoena would be ideal for evaluating the ability of our techniques to identify sensitive topics. However, such a corpus is hard to come by for obvious reasons. Similarly, a collection of anonymous blogs would be ideal for testing the ability of our techniques to identify individuals, but such blogs are hard to locate efficiently. Indeed, the excitement over the recently released AOL search data, as illustrated by the quick appearance of tools for mining the data (see, for example, [44, 4]), demonstrates the widespread difficulty in finding data appropriate for vetting data mining technologies, of which our inference detection technology is an example.⁴

Given the difficulties of finding unequivocally sensitive data on which to test our algorithms, we used instead publicly available information about an individual, which we anonymized by removing the individual's first and last names. In most cases, the public information about the individual, thus anonymized, appeared to be a decent substitute for text that the individual might have authored on their blog or Web page.

All of our experiments rely on Java code we wrote for extracting text from HTML, on calculation of an extended form of TF.IDF (see definition below) for identifying keywords in documents and on the Google SOAP search API [18] for making Web queries based on those keywords.

Our code for extracting text from HTML uses standard techniques for removing HTML tags. Because our experiments involved repeated extractions from similarly formatted HTML pages (e.g. Wikipedia biographies) it was most expedient to write our own code, customized for those pages, rather than retrofitting existing text extraction code such as is available in [3].
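
As an illustration of what tag removal of this kind can look like, the following is a minimal sketch and not the code used in the experiments. It assumes a regex-based approach; the class name HtmlTextExtractor and the method extractText are hypothetical names chosen for this example, and only a handful of character entities are decoded. Irregular or deeply nested markup would likely call for a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the authors' code: strips markup with regular expressions.
public class HtmlTextExtractor {
    // Script and style elements carry no readable text, so drop them whole.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Replace each remaining tag with a space so adjacent words stay separated.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
            + "<body><h1>Biography</h1><p>Born in <b>1950</b> &amp; raised in Ohio.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Replacing tags with a space rather than the empty string is a deliberate choice in this sketch: it keeps words from neighboring elements from being fused together, which matters when the extracted text is later tokenized for keyword scoring.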