Scalable Distributed Hashing for Approximate Nearest Neighbor Search

IEEE Trans Image Process. 2022;31:472-484. doi: 10.1109/TIP.2021.3130528. Epub 2021 Dec 15.

Abstract

Hashing has been widely applied to large-scale approximate nearest neighbor search owing to its high efficiency and low storage requirements. Most investigations concentrate on learning hash functions in a centralized setting. However, in existing big data systems, data is often stored across different nodes, and in some situations it is even collected in a distributed manner. A straightforward solution is to aggregate all the data at a fusion center and obtain the search result there (the aggregating method), but this strategy is infeasible because of its prohibitive communication cost. Although a few distributed hashing methods have been proposed to reduce this cost, they focus on designing a distributed algorithm for a specific global optimization objective without considering scalability. Moreover, existing distributed hashing methods aim at finding a distributed solution to hashing while avoiding accuracy loss, rather than improving accuracy. To address these challenges, we propose a Scalable Distributed Hashing (SDisH) model in which most existing hashing methods can be extended to process distributed data without modification. Furthermore, to improve accuracy, we use the search radius as a global variable shared across nodes so that each iteration reaches a globally optimal search result. In addition, a voting algorithm based on the results of multiple iterations is presented to further reduce search errors. Theoretical analyses of communication, computation, and accuracy demonstrate the superiority of the proposed model. Numerical simulations on three large-scale and two relatively small benchmark datasets also show that the SDisH model achieves up to 44.75% and 10.23% accuracy gains over the aggregating method and state-of-the-art distributed hashing methods, respectively.
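The abstract describes two key ideas: sharing the search radius as a global variable across nodes, and voting over the candidates returned by multiple iterations. The following is a minimal sketch of how such a scheme might look, not the paper's actual algorithm: each node holds its own binary codes, every node searches with the same (global) Hamming radius in each iteration, and items returned in more iterations accumulate more votes. All names (`hamming_search`, `distributed_search`, the radius schedule) are illustrative assumptions.

```python
import numpy as np

def hamming_search(node_codes, query_code, radius):
    """Return the local indices of codes within the given Hamming radius
    of the query (per-node search; radius is the shared global variable)."""
    dists = np.count_nonzero(node_codes != query_code, axis=1)
    return set(np.flatnonzero(dists <= radius))

def distributed_search(nodes, query_code, radii, top_votes=5):
    """Sketch of radius-synchronized distributed search with voting.

    Each iteration uses one global radius value on every node; a candidate
    gains one vote per iteration in which it is returned.  Candidates are
    identified as (node_id, local_index) pairs and ranked by vote count.
    This is an illustrative assumption, not the SDisH algorithm itself.
    """
    votes = {}
    for radius in radii:                      # one iteration per global radius
        for node_id, codes in enumerate(nodes):
            for idx in hamming_search(codes, query_code, radius):
                key = (node_id, idx)
                votes[key] = votes.get(key, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked[:top_votes]
```

In this sketch, items close to the query are returned even at small radii and therefore collect votes in every iteration, so voting naturally promotes the nearest candidates and suppresses one-off matches that only appear at the largest radius.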