FORMULATION OF REDACTION RULES. Our Web-based inference detection tools can also be used to pre-compute a set of redaction rules that is later applied to a collection of private documents. For a large collection of private documents, pre-computing redaction rules may be more efficient than using Web-based inference detection to analyze each and every document. In 1995 for example, executive order 12958 mandated the declassification of large amounts of government data [9] (hundreds of millions of pages). Sensitive portions of documents were to be redacted prior to declassification. The redaction rules were exceedingly complex and formulating them was reportedly nearly as time-consuming as applying them. Web-based inference detection is an appealing approach to automatically expand a small set of seed redaction rules. For example, assuming that the keyword "missile" is sensitive, web-based inference detection could automatically retrieve other keywords related to missiles (e.g. "guidance system", "ballistics", "solid fuel") and add them to the redaction rule.

PUBLIC IMAGE CONTROL. This application considers the problem of verifying that a document conforms to the intentions of its author, and does not accidentally reveal private information or information that could easily be misinterpreted or understood in the wrong context. This application, unlike others, does not assume that the set of unwanted inferences is known or explicitly defined. Instead, the goal of this application is to design a broad, general-purpose tool that helps contextualize information and may draw an author's attention to a broad array of potentially unwanted inferences. For example, Web-based inference detection could alert the author of a blog to the fact that a particular posting contains a combination of keywords that will make the blog appear prominently in the results of some search query. This problem is related to other approaches to public image management, such as [13, 31]. Few technical details have been published about these other approaches, but they do not appear focused on inference detection and control.

LEAK DETECTION. This application helps a data owner avoid accidental releases of information that was not previously public. In this application of Web-based inference control, the set of sensitive knowledge K* consists of all information that was not previously public. In other words, the release of private data should not add anything to public knowledge. This application may have helped prevent, for example, a recent incident in which Google accidentally released confidential financial information in the notes of a PowerPoint presentation distributed to financial analysts [22].

Experiments

Our experiments focus on exploring the first two privacy monitor applications of section 4: redaction of medical records and preserving individual anonymity. In testing these ideas, we faced two main challenges that constrained our experimental design. First, and most challenging, was designing relevant experiments that we could execute given available data. The second, more pragmatic, challenge was getting the right tools in place and executing the experiments in a time-efficient manner. We describe each of these challenges, and our approach to meeting them, in more detail below.

5.1 Experimental Design Challenges and Tools

Ideally, our idea of Web-based inference detection would be tested on authentic documents for which privacy is a chief concern. For example, a corpus of medical records being prepared for release in response to a subpoena would be ideal for evaluating the ability of our techniques to identify sensitive topics. However, such a corpus is hard to come by for obvious reasons. Similarly, a collection of anonymous blogs would be ideal for testing the ability of our techniques to identify individuals, but such blogs are hard to locate efficiently. Indeed, the excitement over the recently released AOL search data, as illustrated by the quick appearance of tools for mining the data (see, for example, [44, 4]), demonstrates the widespread difficulty in finding data appropriate for vetting data mining technologies, of which our inference detection technology is an example.⁴

Given the difficulties of finding unequivocally sensitive data on which to test our algorithms, we used instead publicly available information about an individual, which we anonymized by removing the individual's first and last names. In most cases, the public information about the individual, thus anonymized, appeared to be a decent substitute for text that the individual might have authored on their blog or Web page.

All of our experiments rely on Java code we wrote for extracting text from html, on calculation of an extended form of TF.IDF (see definition below) for identifying keywords in documents and on the Google SOAP search API [18] for making Web queries based on those keywords.

Our code for extracting text from html uses standard techniques for removing html tags. Because our experiments involved repeated extractions from similarly formatted html pages (e.g. Wikipedia biographies) it was most expedient to write our own code, customized for those pages, rather than retrofitting existing text extraction code such as is available in [3].
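
To make the tag-removal step concrete, the following is a minimal sketch of one common way to strip html tags in Java. It is not the code used in the experiments: the class name HtmlTextExtractor, the method extractText, and the regular expressions are all hypothetical choices made for illustration, and the sketch applies no page-specific customization of the kind described above. It also handles only a handful of character entities and assumes reasonably well-formed markup.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not the authors' extraction code. */
final class HtmlTextExtractor {
    // Script and style elements contain no readable text, so drop them with their bodies.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments, which may themselves contain tags.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining opening or closing tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    // Runs of whitespace left behind by the removals.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the visible text of an html string with tags removed. */
    static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Replace each tag with a space so that adjacent words do not fuse together.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical input resembling a short biography page.
        String page = "<html><head><style>p{color:red}</style></head>"
                + "<body><h1>Biography</h1><p>Born in <b>1950</b></p></body></html>";
        System.out.println(extractText(page));
    }
}
```

Replacing each tag with a space rather than the empty string keeps words in neighbouring elements separate, which matters when the output feeds keyword extraction. The cost is an occasional stray space before punctuation that follows a closing tag. A regular-expression approach of this kind is fragile on malformed markup, so a full html parser would be the safer choice for arbitrary pages.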