A Study in Practical Deduplication

Winner of the FAST'23 Test-of-Time Award

April 18, 2023

Research

Article shepherded by:

Rik Farrow

When I heard that a paper by Dutch Meyer and Bill Bolosky had won the FAST 2023 Test of Time Award, I actually remembered that paper. Dutch did a great job of presenting the results of research into the storage usage of 875 Microsoft employees [1]. I also thought that I remembered Dutch saying that one of the usage patterns they uncovered was WORN: Write Once, Read Never. I was wrong about that, but Bill Bolosky thought that he had said that about a similar research project published at FAST several years earlier.

I wondered what results in a paper getting a Test of Time Award. Logically, it should be a paper that other researchers have tended to cite in later papers. I decided to ask someone who should know.

Rik Farrow: Can you tell me why this paper won a Test of Time Award?

Bill Bolosky: The test of time award [2] goes to the paper that the test of time committee selects. Being on the committee (but obviously being recused from my own paper) I know how the process works. The committee is made up of former FAST PC chairs. Any paper published in FAST at least 10 years ago that hasn’t already won the award is eligible. Committee members can nominate any eligible paper. These papers get put in a special HotCRP site where committee members can “review” them, which in this case means rating them on a 1 to 5 scale (“never should get the award” to “really should get it right now”) and then the committee chairs read the reviews and select the winner.

In practice, having the most citations helps a lot for winning in my experience. It’s an imperfect measure, but it’s more or less the only objective one. It certainly informs (though does not determine) my votes.

RF: That's pretty much what I had thought. Although in this particular case, your research provided a lot of data about how a population of mostly programmers and other desktop users actually used the local storage in their desktops.

In the paper, you analyze storage patterns from MS workstations, so you had to have some way to collect that information. I assumed that was because of some company policy, perhaps a network backup, that made data about file sizes, types, and access patterns available. Is that true?

Dutch Meyer: Not so much! We knew what we wanted for the paper, so I wrote a tool purpose-built that walked every file of every directory of every hard drive, and I built an installer that would schedule that tool to run in the background at times when we figured it would be less disruptive. Once we had that tool, we pulled the entire directory of Microsoft's Redmond employees and randomly sampled ten thousand of them. Then I emailed them a very innocent message: "Please install this tool that will scan your hard drive in the most invasive way imaginable". The installer also had mechanisms for data retrieval, basically just copying the resulting files back to a network file server. Then of course you need odds-and-ends features like jittering the upload so it doesn't take our server out, and checking a remote file we controlled so that we could remotely trigger an uninstall.

Of course it would have been much better for us if this had already existed, but since it didn't, we had to build it to operate in a time-limited way. We essentially had only one shot at getting the tool to work and to run.
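The upload jitter Dutch mentions can be illustrated with a minimal sketch. It assumes each client simply waits a uniformly random delay inside a fixed window before copying its results to the server; the interview does not say how the real tool chose its delay, and the function name and the window length are hypothetical.

```python
import random

def jittered_delay(window_seconds, rng=None):
    """Pick a random wait, in seconds, within [0, window_seconds].

    Spreading uploads across a window keeps thousands of clients that
    finish scanning at about the same time from hitting the file
    server at once. (Hypothetical helper, not the authors' code.)
    """
    if rng is None:
        rng = random.Random()
    return rng.uniform(0, window_seconds)

# Example: spread uploads over one hour (assumed window).
delay = jittered_delay(3600)
```

A uniform delay is the simplest choice; a real deployment might also retry with backoff if the server is busy.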

RF: Wow, that's not what I was expecting. But it does make sense, in that to collect the type of data you presented in that paper, you really needed to be able to do more than just analyze backups. How did you determine file sizes and types?

DM: We knew, because it wasn't particularly hard to know: we looked at the metadata available in the file system APIs and took almost everything. For extensions we read the file names and took the string following the period.
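As a rough illustration of the metadata pass described above, here is a minimal sketch that walks a directory tree and records each file's size, modification time, and extension. It assumes "the string following the period" means the text after the last period, lowercased; the function names and the choice of fields are hypothetical, not taken from the authors' tool.

```python
import os

def extension_of(file_name):
    """Return the text after the last period, lowercased, or '' if none."""
    _, dot, ext = file_name.rpartition(".")
    return ext.lower() if dot else ""

def scan_metadata(root):
    """Yield one metadata record per file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            yield {
                "size": st.st_size,
                "mtime": st.st_mtime,
                "extension": extension_of(name),
            }
```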

On the deduplication side we had to think more carefully about space and time constraints. The scanner itself, running on the user's computer, read the file data and turned it into content hashes. If you break the files into different sized chunks you get different hashes, but we had to do all that on the user's computer because we couldn't copy all their data back to our server. So we had to pick several different ways to break up those files and hash the data, such that we would get decent coverage of what different systems would do, but also so that the scan would finish for most users in a timely way. We also hashed the user's file names (but not extensions) for privacy.
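To make the chunk-and-hash step concrete, here is a minimal sketch assuming fixed-size chunks and SHA-256. The interview does not name the hash function or the chunk sizes used, so CHUNK_SIZES, the salt parameter, and every function name below are hypothetical.

```python
import hashlib

# Assumed chunk sizes in bytes; the same file is hashed once per size.
CHUNK_SIZES = (8 * 1024, 64 * 1024)

def chunk_hashes(data, chunk_size):
    """Hash each fixed-size chunk of data; the last chunk may be shorter."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def whole_file_hash(data):
    """Hash the entire file as a single unit (whole-file deduplication)."""
    return hashlib.sha256(data).hexdigest()

def hashed_name(file_name, salt=b""):
    """Replace the name with a hash but keep the extension readable."""
    stem, dot, ext = file_name.rpartition(".")
    if not dot:
        stem = file_name  # no period, so the whole name is hashed
    digest = hashlib.sha256(salt + stem.encode("utf-8")).hexdigest()
    return f"{digest}.{ext}" if dot else digest
```

Only the digests would leave the machine. Running chunk_hashes once per entry in CHUNK_SIZES gives one set of hashes per chunking configuration, which is what lets the later analysis compare approaches without rescanning.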

RF: I am guessing that there was no need for institutional review boards as this happened within a company. Did participants get offered assurances of privacy?

DM: We told them clearly what we were doing and how we were hashing the data that left their computer. The scanner had access to all their data, but we only transferred hashes of sensitive data back to our server.

RF: My recollection of deduplication is that the more blocks of data you have to work with, the better the deduplication, and that would work better when workstations/desktops are using file servers.

DM: That's right, dedup works better with big file servers. What we did was take all the hashes from every file on every PC we scanned and pull them back to our server, then treat all of them as a single large dataset. From there we could look at what a single PC's deduplication rate would be, and also sample so that we could consider deduplication across 100 of those PCs, or 200, or all of them. And since we did all of that nine times over (for nine different dedup parameters), we could compare how different approaches to dedup changed as the amount of data you have to deduplicate changes.
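The pooled analysis can be sketched as follows. This is a minimal illustration, assuming each machine's scan is a list of (hash, size) pairs and that the deduplication rate is the fraction of bytes saved by storing each distinct chunk once; the data layout and function names are hypothetical.

```python
import random

def space_savings(chunks):
    """Fraction of bytes saved if each distinct hash is stored only once."""
    total = 0
    unique = {}
    for digest, size in chunks:
        total += size
        unique.setdefault(digest, size)  # count the first copy only
    if total == 0:
        return 0.0
    return 1 - sum(unique.values()) / total

def savings_for_sample(per_machine, k, seed=0):
    """Pool the chunks of k randomly chosen machines and measure savings."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(per_machine), k)
    pooled = (chunk for m in chosen for chunk in per_machine[m])
    return space_savings(pooled)

# Toy data: two PCs that share one 4-byte chunk ("a").
scans = {
    "pc1": [("a", 4), ("b", 4)],
    "pc2": [("a", 4), ("c", 4)],
}
print(space_savings(scans["pc1"]))    # 0.0, nothing repeats on one PC
print(savings_for_sample(scans, 2))   # 0.25, one of four chunks is shared
```

Repeating savings_for_sample for growing k, once per chunking configuration, gives the kind of comparison described above: how each approach behaves as the pool of data grows.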

RF: Wasn't there already a tool or product for deduplication in NTFS?

BB: There was an NTFS whole-file deduplication system called single instance storage (this was before the word "deduplication" was coined). I led its development, and it shipped in Windows 2000, but it only ran on a particular version that was intended to serve images for remote booting.

So, it existed and had for more than 10 years, but it was neither widely known nor widely deployed. Most of the dedup community doesn't know about it to this day, probably because of the lack of the word "deduplication" in the paper title [3].

And SIS was in later versions of Windows than just Win2K, but it was never widely used. They eventually built a chunking deduplication system but I wasn’t involved in that.

RF: Dutch does a good job of explaining chunking for deduplication in his presentation [1]. He also covers some of the other things you discovered, for example, that median file size had stayed the same (4K) for decades, but that a second peak had appeared in the graph of file sizes, likely because of virtual disk images for virtual machines.

Dutch also mentions in his presentation that you planned on sharing the data you collected. Were you able to do that?

BB: I thought that we’d released the data through SNIA, but I looked on their website and it’s not there. Maybe they took it down because it was so big. My memory is that it took a while to get them to process and release it and in the interim I sent out a few copies for people who mailed me disk drives. I talked with someone at FAST who had used the data, so it for sure got out there at least some.

DM: I do have the dataset still—if someone wants it they can ship me a hard drive or some high-capacity flash. It's a lot of data.

Appendix

References:

[1] Dutch Meyer and William Bolosky, A Study in Practical Deduplication, 9th USENIX Conference on File and Storage Technologies (FAST 11): https://www.usenix.org/conference/fast11/study-practical-deduplication

[2] Test of Time Awards: https://www.usenix.org/conferences/test-of-time-awards

[3] William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur, Single Instance Storage in Windows 2000, 4th USENIX Windows Systems Symposium: https://www.usenix.org/legacy/publications/library/proceedings/usenix-wi...

Article Categories:

Filesystem/storage

Last updated April 21, 2023

Authors:

Dutch Meyer received his Ph.D. in 2015 from the University of British Columbia, and has since developed products at Coho Data, Microsoft, and Amazon. His work spans block and object storage in both enterprise and cloud environments, at scales well into the exabyte range, and latencies measured in microseconds.


Bill Bolosky spent most of his career working on various systems topics, from portable operating systems and machine independent virtual memory at CMU in the 80s to NUMA at Rochester in the 90s to video servers, deduplication, distributed file systems, and characterization of storage systems (plus bioinformatics and cancer biology) at Microsoft from the 90s to the 2020s.


Rik Farrow has been a consultant for 42 years. He has written two books, as well as worked as the technical editor for a UNIX magazine and for two editions of a popular operating system book. He also taught UNIX system administration and Internet security during the 90s internationally, and worked as a volunteer for USENIX program and steering committees. Rik has been the editor of ;login: since 2005.
