The alternatives to compare-by-hash can be summarized as ``Keep some state!'' Compare-by-hash attempts to establish similarities between two unknown sets of blocks. If we keep track of which blocks we are sure are identical (because we directly compared them), we don't have to guess. Unfortunately, keeping state is hard. Part of the popularity of compare-by-hash is undoubtably due to its ease of implementation compared to a stateful solution. However, simplicity of implementation should not come at the cost of correctness.
One of the applications of compare-by-hash is reducing network bandwidth used by distributed file systems. To accomplish nearly the same effect, we can resolve to only send any particular block over the link once, keeping sent and received data in a cache in both sender and receiver. Before sending a block, the sender checks to see if it has already sent the block and if so, sends the block id rather than the block itself. This idea is proposed by Spring and Wetherall in . We might also agree in advance on certain universal block ids, for example, block id 0 is always the zero block of length 4096 bytes. The initial start-up cost is higher, depending on the degree of actually shared blocks between the two machines, but after cache warm-up, performance should be quite similar to compare-by-hash.
In combination with an intelligent blocking technique, such as Rabin fingerprints, which divide up blocks at ``anchor'' points (patterns in the input) rather than at fixed intervals, we can experiment with byte and block level differencing techniques that require similar amounts of computation time as computing cryptographic hashes. Using fingerprints to determine block boundaries allows us to more easily detect insertions and deletions within blocks.
Compression may still have more mileage left in it, since we are willing to trade off large amounts of computation for reduced bandwidth. We might try compressing with several different algorithms optimized for different inputs.
Note that content-based addressing, where the address of a block is generated using the contents of the block itself, does not necessarily imply compare-by-hash. Content-based addressing schemes can easily check for collisions in addresses and compensate for them.