[30] P. Nakov and M. Hearst. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution. HLT-NAACL, 2005.

[31] Nstein Technologies. https://www.nstein.com/pim.asp

[32] G. Pant, S. Bradshaw and F. Menczer. Search engine-crawler symbiosis: adapting to community interests. 7th European Conference on Digital Libraries, 2003.

[33] X. Qian, M. Stickel, P. Karp, T. Lunt and T. Garvey. Detection and elimination of inference channels in multilevel relational database systems. IEEE Symposium on Security and Privacy, 1993.

[34] M. Steyvers, P. Smyth, M. Rosen-Zvi and T. Griffiths. Probabilistic author-topic models for information discovery. KDD '04.

[35] L. Sweeney. AI Technologies to Defeat Identity Theft Vulnerabilities. AAAI Spring Symposium on AI Technologies for Homeland Security, 2005.

[36] L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population. LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.

[37] P. Turney. Coherent Keyphrase Extraction via Web Mining. IJCAI, 2002.

[38] Unified Medical Language System. https://www.umm.edu/glossary/a/index.html

[39] Personal communication.

[40] Wikipedia. Alcoholism. https://en.wikipedia.org/wiki/Alcoholism

[41] Wikipedia. Sexually transmitted disease. https://en.wikipedia.org/wiki/Sexually_transmitted_disease

[42] https://wordweb.info/free/

[43] R. Yip and K. Levitt. Data level inference detection in database systems. IEEE Eleventh Computer Security Foundations Workshop, 1998.

[44] D. Zhao and T. Sapp. AOL Search Database. https://www.aolsearchdatabase.com/

[45] https://www.zoominfo.com/

Notes

1 https://www.popandpolitics.com/2005/09/06/and-lite-jazz-singers-shall-lead-the-way/, www.popandpolitics.com/2006/10/06/our-paris/

2 https://en.wikipedia.org/wiki/Madonna_and_the_gay_community, https://gaybookreviews.info/review/2807/615, https://www.youtube.com/results?search_type=related&search_query=madonna%20oh%20father

3 Example results from our experiments appear in section 5. Because of the dynamic nature of the Web, issuing the same queries today may yield somewhat different results.

4 The AOL data can potentially be used to demonstrate the Web's ability to de-anonymize ([5] may be one such example), which is one of the goals of our algorithms. However, because our target application is the protection of English-language content, we opted not to vet our algorithms with that data.

5 The vast majority of the biographies we used identified their subject by both a first and last name with no middle name or initial. Also, name suffixes (e.g., Jr., or annotations made by Wikipedia authors regarding profession) were ignored.

6 This was done to avoid difficulties parsing non-ASCII pages.

7 These are the first three links that appear on the results page, whether or not one URL is a substring of another.

8 Here "known site" means any site with "medterm" or "medword" in the URL. As this is certainly not sufficient to remove all medical terms sites, we manually reviewed the results before generating the example keyword pairs in Figure 3.

9 Note that this extracted non-word indicates a flaw in our text-from-HTML extraction algorithm.

10 In a manual review of the word pairs from WB yielding a top hit containing word(s) in KST D, we did not find any hits using the word pair in a meaningful way in relation to a sensitive word. Rather, the hits generally turned out to be medical term lists.

11 Since all of our sensitive words pertain to the same topic, alcoholism, we did not record which particular sensitive word was contained in the top hit (if any).

12 Note that this is the 4th returned hit, indicating that a change in our search strategy would improve recall.

13 The biography only mentions "Boston" in a citation, so this is a conservative redaction choice.

14 Alternative metrics for validity are of course possible. For example, a more thorough algorithm might look for a shared topic (e.g. the events of September 11, 2001) amongst links, and retain any links pertaining to the most popular topic as valid.

15 We tended to experience problems communicating with Google when executing algorithm runs that exceeded 1500 queries, hence we chose values of {n_i}_i that yielded query counts in the range of 1000-1500.