'Unknown' by Unknown - Page 5 of 16

that are most representative of Ci. Let Si denote the set

sensitive keywords. One way of doing so is to issue a

of the top α keywords extracted from document Ci, and

search query for documents in the reference collection

let S = ∪n 1Si.

R that contain the sensitive topic, then use TF.IDF

to extract from these documents an expanded set of

INFERENCE DETECTION. The list L of inferences is ini-

sensitive keywords.

tially empty. We consider in turn every subset S ⊆

S of size |S | ≤ β. For every such subset S =

(W1, . . . , Wk ), with k ≤ β, we do the following:

Example Applications

1. We use a search engine to retrieve from the collec-

tion R of reference documents the top γ documents

This section describes a wide array of potential applica-

that contain all the keywords W1, . . . , Wk .

tions for Web-based inference detection. All these appli-

cations are based on the fundamental algorithm of sec-

2. With TF.IDF, we extract the top δ keywords from

tion 3. The first two applications are the subjects of the

this collection of γ documents. Note that these key-

experiments described in detail in section 5. Experiment-

words are extracted from the aggregate collection of

ing with other applications will be the subject of future

γ documents (as if all these documents were con-

work.

catenated into a single large document), not from

REDACTION OF MEDICAL RECORDS. Medical records

each individual document.

are often released to third parties such as insurance com-

∗

3. Let K0 denote the intersection of the δ keywords

panies, research institutions or legal counsel in the case

from step 2 with the set K ∗ of sensitive keywords.

of malpractice lawsuits. State and federal legislation

If K0 is non-empty, we add to L the inference C ⇒

∗

mandates the redaction of sensitive information from

∗

K0 .

medical records prior to release. For example, all ref-

erences to drugs and alcohol, mental health and HIV sta-

The algorithm outputs the list L and terminates.

tus must typically be redacted. This redaction task is far

more complex than it may initially appear. Extensive and

3.3

Variants of the Algorithm

up-to-date knowledge of diseases and drugs is required to

detect all clues and combinations of clues that may allow

The algorithm of section 3.2 can be tailored to a variety

for inference of sensitive information. Since this medical

of applications. Two such applications are discussed in

information is readily available on public websites, the

exhaustive detail in section 5. Here, we discuss briefly

process of redacting sensitive information from medical

other possible variants of the basic algorithm.

records can be partially automated with Web-based infer-

ence control. Section 5.3 reports on our experiments with

DETECTING ALL INFERENCES. In some applications,

the set of sensitive knowledge K ∗ may not be known or

Web-based inference detection for medical redaction.

may not be specified. Instead, the goal is to identify all

PRESERVING INDIVIDUAL ANONYMITY. Intelligence

possible inferences that arise from knowledge of the col-

and other governmental agencies are often forced by law

lection of documents C and the reference collection R.

(such as the Freedom of Information Act) to release pub-

A simple variation of the algorithm given in 3.2 handles

licly documents that pertain to a particular individual or

this case. In step 3 of the inference detection phase, we

group of individuals. To protect the privacy of those con-

record all inferences instead of only inferences that in-

cerned, the documents must be released in a form that

volve keywords in K ∗. Note that this is equivalent to

does not allow for unique identification. This problem is

assuming that the set K ∗ of sensitive knowledge consists

notoriously difficult, because seemingly innocuous infor-

of all knowledge. The algorithm may also track the num-

mation may allow for unique identification, as illustrated

ber of occurrences of each inference, so that the list L can

by the poorly redacted Osama Bin Laden biography [8]

be sorted from most to least frequent inference.

discussed in the introduction. Web-based inference con-

ALTERNATIVE REPRESENTATION OF SENSITIVE

trol is perfectly suited to the detection of indirect infer-

KNOWLEDGE. The algorithm of section 3.2 assumes

ences based on publicly available data. Our tools can

that the sensitive knowledge K ∗ is given as a set of

be used to determine how much information can be re-

leased about a person, entity or event while preserving k-

keywords. Other representations of sensitive knowledge

are possible. In some applications for example, sensitive

anonymity, i.e. ensuring that it remains hidden in a group

of like-entities of size at least k, and cannot be identified

knowledge may consist of a topic (e.g. alcoholism,

or sexually transmitted diseases) instead of a list of

any more precisely within the group. Section 5.2 reports

keywords. To handle this case, we need a pre-processing

on our experiments with Web-based inference detection

step which converts a sensitive topic into a list of

for preserving individual anonymity.