Optimize Hamming Distance Python

Question

I have around 1M binary NumPy arrays, and I need the Hamming distance between them to find the k nearest neighbours. The fastest method I have found is cdist, which returns a float matrix of distances.

Since I don't have enough memory for a 1M x 1M float matrix, I'm computing it one element at a time, like this:

from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1,all_array,'hamming')

The problem is that each Hamming_Distance call takes about 2-3 s, so for 1M documents it takes an eternity (and I need to repeat it for different values of k).

Is there a faster way to do it?

I'm thinking about multiprocessing or writing it in C, but I have trouble understanding how multiprocessing works in Python, and I don't know how to mix C code with Python code.

You're trying to brute-force a problem you don't have anywhere near the resources to brute-force. There are much better ways to find nearest neighbors than by computing all pairwise distances and taking the low ones.
– user2357112, Commented Nov 22, 2016 at 3:58

Mark Hannel · Accepted Answer · 2016-11-22 04:58:08Z

6

If you want to compute the k nearest neighbors, it may not be necessary to compute all n^2 pairwise distances. Instead, you can use a k-d tree or a ball tree (both are data structures for efficiently querying spatial relations between a set of points).

SciPy has a k-d tree implementation in scipy.spatial (KDTree), but it does not currently support Hamming distance as a metric between points. However, the wonderful folks at scikit-learn (aka sklearn) do have an implementation of a ball tree with Hamming distance supported. Here's a small example using sklearn's BallTree.

from sklearn.neighbors import BallTree
import numpy as np

# Generate random binary data (10 vectors of 10 bits each).
# np.random.random_integers is deprecated; randint excludes the upper bound.
data = np.random.randint(0, 2, size=(10, 10))

# Build the BallTree and query the 3 nearest neighbors of every vector.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)

print(neighbors)  # Row n has the nth vector's k closest neighbors.
print(distances)  # Same idea but the hamming distance to neighbors.
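
One thing worth checking before relying on the tree: both sklearn's 'hamming' metric and scipy's cdist return the fraction of differing positions, not the raw count of differing bits. The following is a minimal sketch (the seed and variable names are my own, not from the question) that compares the BallTree result against a brute-force cdist reference on small data and converts the fractions back to bit counts.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.neighbors import BallTree

# Small random binary data set; the seed is arbitrary, just for repeatability.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(10, 10))

# k nearest neighbors from the ball tree (results are sorted by distance).
tree = BallTree(data, leaf_size=30, metric='hamming')
tree_dist, tree_idx = tree.query(data, k=3)

# Brute-force reference: full pairwise matrix, k smallest values per row.
full = distance.cdist(data, data, 'hamming')
brute_dist = np.sort(full, axis=1)[:, :3]

# Both report the fraction of differing positions, so they should agree.
agree = np.allclose(tree_dist, brute_dist)

# Convert fractions back to a count of differing bits.
bit_counts = np.rint(tree_dist * data.shape[1]).astype(int)
print(agree)
print(bit_counts)
```

Each vector shows up as its own nearest neighbor at distance 0, so if you want k other points you would query with k + 1 and drop the first column.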

Now for the big caveat. For high-dimensional vectors, the query performance of KDTree and BallTree degrades until it is comparable to the brute-force algorithm. I'm a bit unclear on the nature of your vectors, but hopefully the above snippet gives you some ideas/direction.
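
If the tree does not help for your data, brute force can at least be done in blocks instead of one row at a time, keeping only the k smallest distances per row so the full n x n matrix is never stored. This is a rough sketch under my own assumptions, not a tuned implementation: the function name knn_hamming_chunked and the chunk_size parameter are made up for illustration.

```python
import numpy as np
from scipy.spatial import distance

def knn_hamming_chunked(data, k, chunk_size=1000):
    """Brute-force k-NN under Hamming distance, processed in row blocks.

    Hypothetical helper: only a (chunk_size x n) distance block is held
    in memory at any time.
    """
    n = data.shape[0]
    neighbors = np.empty((n, k), dtype=np.int64)
    distances = np.empty((n, k), dtype=np.float64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Distances from this block of rows to every row in the data set.
        block = distance.cdist(data[start:stop], data, 'hamming')
        # argpartition finds the k smallest per row without a full sort.
        idx = np.argpartition(block, k - 1, axis=1)[:, :k]
        d = np.take_along_axis(block, idx, axis=1)
        # Sort only those k candidates so results are ordered by distance.
        order = np.argsort(d, axis=1)
        neighbors[start:stop] = np.take_along_axis(idx, order, axis=1)
        distances[start:stop] = np.take_along_axis(d, order, axis=1)
    return distances, neighbors

# Small demonstration on random binary data.
rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(50, 16))
demo_dist, demo_idx = knn_hamming_chunked(demo, k=3, chunk_size=7)
print(demo_idx[:5])
print(demo_dist[:5])
```

Each block costs roughly chunk_size * n * 8 bytes as float64, so pick chunk_size to fit your memory. The total work is still O(n^2), so this only reduces the per-call overhead and the memory footprint, not the amount of computation.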

2 Comments

jevanio Over a year ago

BallTree can query both the k nearest neighbours and all neighbours within a radius r, that's great. I'll check how much time it saves, but it's already a much better solution than mine, thanks xD

jevanio Over a year ago

It turned out to take a little more time than exhaustive search -.-
