'Unknown' by Unknown - Page 7 of 16

As mentioned above, in order to determine if a word

ters of our basic experiments to either do more filtering

is a keyword we use the well known TF.IDF metric (see,

of the query results or analyze more of the query results

for example, [28]). The TF.IDF "rank" of a word in a

and require a majority contain the sensitive word(s).

document is defined with respect to a corpus, C . We

We describe each experiment in detail below.

state the definition next.

Definition 1 Let D be a document that contains the

5.2

Web-based De-anonymization

word W and is part of a corpus of documents, C . The

As discussed in section 4 one of our goals is to demon-

term frequency (TF) of W with respect to D is the num-

strate how keyword extraction can be used to warn the

ber of times W occurs in D. The document frequency

end-user of impending identification. Our inference

(DF) of W with respect to the corpus, C , is the total num-

detection technology accomplishes this by constantly

ber of documents in C that contain the keyword W . The

amassing keywords from online content proposed for

TF.IDF value associated with W is the ratio: T F /DF .

posting by the user (e.g. blog entries) and issuing Web

Our code implements a variant of TF.IDF in which we

queries based on those keywords. The user is alerted

first use the British National Corpus (BNC) [27] to stem

when the hits returned by those queries return their name,

lexical tokens (e.g. the tokens "accuse", "accused", "ac-

and thus is warned about the risk of posting the content.

cuses" and "accusing" would all be mapped to the stem

We simulated this setting with Wikipedia biographies

"accuse"). We then use the BNC again to associate with

standing in for user-authored content. We removed

each token the DF of the corresponding stem (i.e. "ac-

the biography subject's name from the biography and

cuse" in the earlier example).

viewed the personal content in the biography as being

As with text extraction from html, there are open

a condensed version of the information an individual

source (and commercial) offerings for calculating

might reveal over many posts to their blog, for example.

TF.IDF based on a reference corpus. We did not, how-

From these "anonymized" biographies we extracted key-

ever, have a reference corpus on which to base our cal-

words. Subsets of keywords formed queries to Google.

culations, and thus opted to write our own code to com-

A portion of the returned hits were then searched for the

pute TF.IDF based on the DF values reported in the BNC

biography subject's name and a flag was raised when a

(which is an excellent model for the English language as

hit that was not a Wikipedia page contained a mention

a whole, and thus presumably also for text found on the

of the biography's subject. For efficiency reasons, we

Web).

limited the portion and number of Web pages that were

Our final challenge was experimental run-time. Al-

examined. In more detail, our experiment consists of the

though we did not invest time optimizing our text ex-

following steps:

traction code for speed it nevertheless proved remark-

ably efficient in comparison with the time needed to ex-

Input: a Wikipedia biography, B:

ecute Google queries and download Web pages. In addi-

tion, Google states that they place a constraint of 1, 000

1. Extract the subject, N , of the biography, B, and

queries per day for each registered developer on the

parse N into a first name, N1, optional middle name

or middle initial, N1, and a last name, N2 (where

Google SOAP Search API service [18]. This constraint

required us to amass enough Google registrations in or-

Nj is empty if a name in that position is not given

in the biography).5

der to ensure our experiments could run uninterrupted; in

our case, given the varying running times of our experi-

2. Extract the top 20 keywords from the Wikipedia bi-

ments, 17 registrations proved enough. The delay caused

ography, B, forming the set, SB , through the fol-

by query execution and Web page download caused us to

lowing steps:

modify our algorithms to do a less thorough search for

(a) Extract the text from the html.

inferences than we had originally intended. These modi-

(b) Calculate the enhanced TF.IDF ranking of

fications almost certainly cause our algorithms to gener-

each word in the extracted text (section 5.1).

ate an incomplete set of inferences. However, it is also

If present, remove N1, N1 and N2 from this

important to note that despite our efforts, our results con-

list, and select the top 20 words from the re-

tain some links that should have been discarded because

maining text as the ordered set, SB .

they either don't represent new information (e.g. scrapes

3. For x = 20, 19, . . . , 1, issue a Google query on the

of the site from which we extracted keywords) or don't

top x keywords in SB . Denote this query by Qx.

connect the keywords in our query to the sensitive words

For example, if W1, W2, W3 are the top 3 keywords,

in a meaningful way (e.g. an online dictionary covering a

the Google query Q3 is: W1 W2 W3, with no

broad swath of the English language). Hence, it is possi-

additional punctuation. Let Hx be the set of hits

ble to improve upon our results by changing the parame-