Scalable Document Fingerprinting

Next: Lunch with Invited Speaker Up: Session VI: Experience Previous: Financial EDI Over the

Scalable Document Fingerprinting

Nevin Heintze, Bell Labs

In part as a response to being plagiarized, Nevin became interested in finding a scalable means to catch copied (or partially copied) papers. The fingerprinting should be sufficiently robust to catch moderate variations of the same paper. A sliding window of chunks of a certain length (e.g. 30 characters) is used to go through the document, creating a fingerprint against which other documents' fingerprints can be matched. Each document's full fingerprint can't be stored if there are many documents, so the method used was to keep a fixed number (say 100) of chunks from each document in the database. In order to reduce false positives infrequently occurring chunks were used in the fingerprints. This was combined with hashing to select certain chunks and increase matches.

The method worked well when tested on a corpus of CMU technical reports and on a larger corpus of reports available electronically. Even a one percent match indicated a likely similarity. In practice, techniques such as ignoring headers, introductions, references, and other highly repetitious information helped to reduce false positives and provide more precise detection. In order to defeat an iterative attack, in which a plagiarized document is repeatedly modified until the matches are eliminated, occasional random resamplings should be performed, updating the chunks associated with each document in the database.

Next: Lunch with Invited Speaker Up: Session VI: Experience Previous: Financial EDI Over the

Alma Whitten
1998-07-21