A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality

Proc Int Jt Conf Neural Netw. 2012 Feb 8;43(6):2351-2358. doi: 10.1016/j.patcog.2010.01.003.

Abstract

The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-tree, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10(6) records and 10(4) dimensions, kMkNN shows a 2-to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significant better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.