As mentioned above, in order to determine if a word
ters of our basic experiments to either do more filtering
is a keyword we use the well known TF.IDF metric (see,
of the query results or analyze more of the query results
for example, [28]). The TF.IDF "rank" of a word in a
and require a majority contain the sensitive word(s).
document is defined with respect to a corpus, C . We
We describe each experiment in detail below.
state the definition next.
Definition 1 Let D be a document that contains the
5.2
Web-based De-anonymization
word W and is part of a corpus of documents, C . The
As discussed in section 4 one of our goals is to demon-
term frequency (TF) of W with respect to D is the num-
strate how keyword extraction can be used to warn the
ber of times W occurs in D. The document frequency
end-user of impending identification. Our inference
(DF) of W with respect to the corpus, C , is the total num-
detection technology accomplishes this by constantly
ber of documents in C that contain the keyword W . The
amassing keywords from online content proposed for
TF.IDF value associated with W is the ratio: T F /DF .
posting by the user (e.g. blog entries) and issuing Web
Our code implements a variant of TF.IDF in which we
queries based on those keywords. The user is alerted
first use the British National Corpus (BNC) [27] to stem
when the hits returned by those queries return their name,
lexical tokens (e.g. the tokens "accuse", "accused", "ac-
and thus is warned about the risk of posting the content.
cuses" and "accusing" would all be mapped to the stem
We simulated this setting with Wikipedia biographies
"accuse"). We then use the BNC again to associate with
standing in for user-authored content. We removed
each token the DF of the corresponding stem (i.e. "ac-
the biography subject's name from the biography and
cuse" in the earlier example).
viewed the personal content in the biography as being
As with text extraction from html, there are open
a condensed version of the information an individual
source (and commercial) offerings for calculating
might reveal over many posts to their blog, for example.
TF.IDF based on a reference corpus. We did not, how-
From these "anonymized" biographies we extracted key-
ever, have a reference corpus on which to base our cal-
words. Subsets of keywords formed queries to Google.
culations, and thus opted to write our own code to com-
A portion of the returned hits were then searched for the
pute TF.IDF based on the DF values reported in the BNC
biography subject's name and a flag was raised when a
(which is an excellent model for the English language as
hit that was not a Wikipedia page contained a mention
a whole, and thus presumably also for text found on the
of the biography's subject. For efficiency reasons, we
Web).
limited the portion and number of Web pages that were
Our final challenge was experimental run-time. Al-
examined. In more detail, our experiment consists of the
though we did not invest time optimizing our text ex-
following steps:
traction code for speed it nevertheless proved remark-
ably efficient in comparison with the time needed to ex-
Input: a Wikipedia biography, B:
ecute Google queries and download Web pages. In addi-
tion, Google states that they place a constraint of 1, 000
1. Extract the subject, N , of the biography, B, and
queries per day for each registered developer on the
parse N into a first name, N1, optional middle name
or middle initial, N1, and a last name, N2 (where
Google SOAP Search API service [18]. This constraint
required us to amass enough Google registrations in or-
Nj is empty if a name in that position is not given
in the biography).5
der to ensure our experiments could run uninterrupted; in
our case, given the varying running times of our experi-
2. Extract the top 20 keywords from the Wikipedia bi-
ments, 17 registrations proved enough. The delay caused
ography, B, forming the set, SB , through the fol-
by query execution and Web page download caused us to
lowing steps:
modify our algorithms to do a less thorough search for
(a) Extract the text from the html.
inferences than we had originally intended. These modi-
(b) Calculate the enhanced TF.IDF ranking of
fications almost certainly cause our algorithms to gener-
each word in the extracted text (section 5.1).
ate an incomplete set of inferences. However, it is also
If present, remove N1, N1 and N2 from this
important to note that despite our efforts, our results con-
list, and select the top 20 words from the re-
tain some links that should have been discarded because
maining text as the ordered set, SB .
they either don't represent new information (e.g. scrapes
3. For x = 20, 19, . . . , 1, issue a Google query on the
of the site from which we extracted keywords) or don't
top x keywords in SB . Denote this query by Qx.
connect the keywords in our query to the sensitive words
For example, if W1, W2, W3 are the top 3 keywords,
in a meaningful way (e.g. an online dictionary covering a
the Google query Q3 is: W1 W2 W3, with no
broad swath of the English language). Hence, it is possi-
additional punctuation. Let Hx be the set of hits
ble to improve upon our results by changing the parame-