Asaf Levi, Technion - Israel Institute of Technology; Philip Shilane, Dell Technologies; Sarai Sheinvald, Braude College of Engineering; Gala Yadgar, Technion - Israel Institute of Technology
In the realm of information retrieval, the need to maintain reliable term-indexing has grown more acute in recent years, with vast amounts of ever-growing online data searched by a large number of search-engine users and used for data mining and natural language processing. At the same time, an increasing portion of primary storage systems employ data deduplication, where duplicate logical data chunks are replaced with references to a unique physical copy.
We show that indexing deduplicated data with deduplication-oblivious mechanisms might result in extreme inefficiencies: the index size would increase in proportion to the logical data size, regardless of its duplication ratio, consuming excessive storage and memory and slowing down lookups. In addition, the logically sequential accesses during index creation would be transformed into random and redundant accesses to the physical chunks. Indeed, to the best of our knowledge, term indexing is not supported by any deduplicating storage system.
In this paper, we propose the design of a deduplication-aware term-index that addresses these challenges. IDEA maps terms to the unique chunks that contain them, and maps each chunk to the files in which it is contained. This basic design concept improves the index performance and can support advanced functionalities such as inline indexing, result ranking, and proximity search. Our prototype implementation based on Lucene (the search engine at the core of Elasticsearch) shows that IDEA can reduce the index size and indexing time by up to 73% and 94%, respectively, and reduce term-lookup latency by up to 82% and 59% for single and multi-term queries, respectively.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Asaf Levi and Philip Shilane and Sarai Sheinvald and Gala Yadgar},
title = {Physical vs. Logical Indexing with {IDEA}: Inverted {Deduplication-Aware} Index},
booktitle = {22nd USENIX Conference on File and Storage Technologies (FAST 24)},
year = {2024},
isbn = {978-1-939133-38-0},
address = {Santa Clara, CA},
pages = {243--258},
url = {https://www.usenix.org/conference/fast24/presentation/levi},
publisher = {USENIX Association},
month = feb
}