In Section 3, we calculated the probability of a hash collision under the assumption that our inputs were random and uniformly distributed. While this assumption simplifies the math, it is also wrong.
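The exact formula from Section 3 is not reproduced here, but the flavor of that calculation can be sketched with the standard birthday-bound approximation (an assumption about what that section computed, not a quotation of it):

```python
import math

# Birthday-bound sketch: the probability of at least one collision among
# n uniformly random b-bit hash values is approximately
#   1 - exp(-n*(n-1) / 2^(b+1)).
# expm1 keeps precision when the probability is tiny.

def collision_probability(n: int, bits: int) -> float:
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

# Under the uniform-randomness assumption, even ~2^40 (a trillion)
# blocks hashed with 160-bit SHA-1 almost certainly do not collide:
p = collision_probability(2 ** 40, 160)
print(p < 1e-20)   # True
```

The point of the sections that follow is that this reassuring number rests entirely on the uniform-randomness assumption.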
Real data is not random; if it were, every application would produce nothing but random output. This may seem like a trivial and facile observation, but it is actually the key insight into the weakness of compare-by-hash. If real data were random, every possible input block would be equally likely to occur; in real data, input blocks that contain only ASCII characters or begin with an ELF header are far more common than they would be in random data. Knowing that real data isn't random, can we think of some cases where it is non-random in an interesting way?
Consider an application, let's call it SHA1@home, that attempts to find a collision in the SHA-1 hash function. SHA1@home is a distributed application: it runs many instances in parallel on many machines, using a distributed file system to share data when necessary. When two inputs are found that hash to the same value, one program reads and compares both input blocks to find out whether they actually differ. If the file system itself uses compare-by-hash with SHA-1 and the same block size as SHA1@home's inputs, this application can never detect a collision: the file system hashes both blocks, sees the same SHA-1 value, and silently keeps a single copy, so the comparison always reads back two identical blocks. For example, if SHA1@home used a 2KB block size, it would run incorrectly with LBFS as the underlying file system [4].
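The failure mode can be seen in a toy sketch (not the paper's code, and not LBFS): a content-addressed block store keyed by a deliberately weak 8-bit "hash" so that a collision is easy to produce. With real SHA-1 the mechanism is identical; only finding the colliding inputs is harder.

```python
# Hypothetical compare-by-hash block store: one stored copy per hash value.

def weak_hash(block: bytes) -> int:
    # Stand-in for SHA-1: sum of bytes mod 256, trivially collidable.
    return sum(block) % 256

class DedupStore:
    def __init__(self):
        self.blocks = {}  # hash value -> the single stored block

    def write(self, block: bytes) -> int:
        h = weak_hash(block)
        # Compare-by-hash: if a block with this hash exists,
        # assume it is the same data and keep only the first copy.
        self.blocks.setdefault(h, block)
        return h

    def read(self, h: int) -> bytes:
        return self.blocks[h]

store = DedupStore()
a = b"\x01\x02"   # weak_hash(a) == 3
b_ = b"\x02\x01"  # a different block with the same weak hash
assert a != b_ and weak_hash(a) == weak_hash(b_)

ha, hb = store.write(a), store.write(b_)
# The collision hunter reads both blocks back to compare them...
print(store.read(ha) == store.read(hb))  # True: the store returned the
                                         # same block for both writes, so
                                         # the collision is invisible
```

The store discarded the second block at write time, so no amount of careful comparison at read time can ever observe the collision.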
This is only one crude, simple example of an entire class of applications that are genuinely useful, especially to cryptanalysts. In their 1998 paper, Chabaud and Joux implemented several programs designed specifically to find collisions in various hash algorithms, including SHA-0 and several of its relatives. They end by hinting at avenues of research for attacking SHA-1. Somewhat ironically, their paper is referenced by one of the papers using compare-by-hash.