Suitability of a new Bloom filter for numerical vectors with high dimensions

PLoS One. 2018 Dec 21;13(12):e0209159. doi: 10.1371/journal.pone.0209159. eCollection 2018.

Abstract

The notable increase in the size and dimensions of data have presented challenges for data storage and retrieval. The Bloom filter and its generations, due to efficient space overheads and constant query delays, have been broadly applied to querying memberships of a big data set. However, the Bloom filter and most of the variants regard each element as a 1-dimensional string and adopt multiple different string hashes to project the data. The interesting problem is when the inputs are numerical vectors with high dimensions, it remains unknown whether they can be projected into the Bloom filter in their original format. Furthermore, we investigate whether the projection is random and uniform. To address these problems, this paper presents a new uniform Prime-HD-BKDERhash family and a new Bloom filter (P-HDBF) to retrieve the membership of a big data set with the numerical high dimensions. Since the randomness and uniformity of data mapping determines the performance of the Bloom filter, to verify these properties, we first introduce information entropy. Our theoretical and experimental results show that the P-HDBF can randomly and uniformly map the data in their native formats. Moreover, the P-HDBF provides an efficient solution alternative to implement membership search with space-time overheads. This advantage may be suitable for engineering applications that are resource-constrained or identification of the nuances of the graphics and images.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Information Storage and Retrieval / methods
  • Models, Theoretical

Grants and funding

This work is supported by the National Key R&D Plan of China (No. 2017YFB0306405) to XO, the National Natural Science Foundation of China (No.61562056, No.61364008) to CS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.