
I have around 1M binary numpy arrays, and I need the Hamming distance between them to find the k nearest neighbours. The fastest method I have found is cdist, which returns a float matrix of distances.

I don't have enough memory for a 1M x 1M float matrix, so I'm computing it one element at a time, like this:

from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1, all_array, 'hamming')

The problem is that each Hamming_Distance call takes about 2-3 s, so for 1M documents it takes an eternity (and I need to repeat it for different values of k).

Is there a faster way to do it?

I'm thinking of using multiprocessing or writing it in C, but I have trouble understanding how multiprocessing works in Python, and I don't know how to mix C code with Python code.

  • You're trying to brute-force a problem you don't have anywhere near the resources to brute-force. There are much better ways to find nearest neighbors than by computing all pairwise distances and taking the low ones. Commented Nov 22, 2016 at 3:58

1 Answer


If you want to compute the k nearest neighbors, it may not be necessary to compute all n^2 pairwise distances. Instead, you can use a k-d tree or a ball tree (both are data structures for efficiently querying relations between a set of points).

SciPy provides a k-d tree in scipy.spatial.KDTree. However, it does not currently support Hamming distance as a metric between points. The wonderful folks at scikit-learn (aka sklearn) do have an implementation of a ball tree that supports Hamming distance. Here's a small example using sklearn's BallTree.

from sklearn.neighbors import BallTree
import numpy as np

# Generate random binary data (randint's upper bound is exclusive).
data = np.random.randint(0, 2, size=(10, 10))

# Build the ball tree and query the 3 nearest neighbors of every row.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)

print(neighbors)  # Row n has the nth vector's k closest neighbors.
print(distances)  # Same idea, but the Hamming distance to those neighbors.

Now for the big caveat. For high dimensional vectors, KDTree and BallTree become comparable to the brute force algorithm. I'm a bit unclear on the nature of your vectors, but hopefully the above snippet gives you some ideas/direction.
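If the tree does end up no faster than brute force, the brute-force scan itself can usually be made cheaper. The following is only a minimal sketch of a different technique from the BallTree above: it packs the bits with np.packbits, XORs the packed bytes, and counts the differing bits with a lookup table. The function name knn_hamming_packed and the POPCOUNT table are made up for illustration, and the input is assumed to be a 2-D array of 0/1 values. It returns raw counts of differing bits; dividing by the number of columns gives the fraction that cdist reports for 'hamming'.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value (0..255).
# (Hypothetical helper name, not part of numpy.)
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def knn_hamming_packed(queries, database, k):
    """Return (distances, indices) of the k nearest database rows per query.

    Both inputs are 2-D arrays of 0/1 values with the same number of columns.
    Distances are raw counts of differing bits, not fractions.
    """
    q = np.packbits(queries.astype(np.uint8), axis=1)    # (n_queries, n_bytes)
    db = np.packbits(database.astype(np.uint8), axis=1)  # (n_rows, n_bytes)
    dists = np.empty((q.shape[0], k), dtype=np.int64)
    idx = np.empty((q.shape[0], k), dtype=np.int64)
    for i, row in enumerate(q):
        # XOR marks the differing bits; the table lookup counts them per byte.
        d = POPCOUNT[np.bitwise_xor(db, row)].sum(axis=1, dtype=np.int64)
        nearest = np.argpartition(d, k - 1)[:k]  # k smallest, in no order
        nearest = nearest[np.argsort(d[nearest], kind="stable")]  # sort those k
        idx[i] = nearest
        dists[i] = d[nearest]
    return dists, idx


# Small usage example on random binary data.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)
dists, idx = knn_hamming_packed(data[:5], data, k=3)
```

Only one query row is compared at a time, so memory stays proportional to the size of the packed database rather than to n^2. Whether this beats cdist on your data is something you would have to measure.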


2 Comments

BallTree can query both the k nearest neighbours and everything within a radius r, that's great. I'll check how much time it saves, but it's already a much better solution than mine, thanks xD
It turned out to take a little more time than exhaustive search -.-
