python - Improve linear search for KNN efficiency with NumPy


I am trying to calculate the distance from each point in the testing set to each point in the training set.

This is what my loop looks like right now:

    import numpy

    # Visit every (test, train) pair and print the Euclidean distance
    for x in testingset:
        for y in trainingset:
            print(numpy.linalg.norm(x - y))

where testingset and trainingset are NumPy arrays in which each row holds the feature data for one example.

However, it's running extremely slowly, taking more than 10 minutes, since my data set is fairly large (a testing set of 3,000 and a training set of ~10,000). Is the problem with my method, or am I using NumPy incorrectly?

This is because you naively iterate over the data, and loops are slow in Python. Instead, use sklearn's pairwise distance functions, or better, use sklearn's efficient nearest neighbour search (such as BallTree or KDTree). If you do not want to use sklearn, there is also a module for this in SciPy. Finally, you can use "matrix tricks" to compute this, since

|| x - y ||^2 = <x-y, x-y> = <x,x> + <y,y> - 2<x,y> 

so you can compute all the distances at once (assuming your data is in matrix form, given as x and y):

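    import numpy as np

    x2 = (x**2).sum(axis=1).reshape((-1, 1))  # <x,x> for each row of x, as a column
    y2 = (y**2).sum(axis=1).reshape((1, -1))  # <y,y> for each row of y, as a row
    distances = np.sqrt(x2 + y2 - 2*x.dot(y.T))  # entry (i, j) is ||x_i - y_j||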
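
One caveat with the expansion above: floating-point rounding can make x2 + y2 - 2<x,y> come out slightly negative for (nearly) identical points, and np.sqrt then returns NaN. Below is a minimal sketch that clips at zero before taking the square root and then picks the k nearest training rows. The function names pairwise_euclidean and k_nearest and the random toy data are illustrative assumptions, not part of the question.

```python
import numpy as np

def pairwise_euclidean(x, y):
    """Return the (len(x), len(y)) matrix of Euclidean distances between rows."""
    x2 = (x ** 2).sum(axis=1).reshape((-1, 1))
    y2 = (y ** 2).sum(axis=1).reshape((1, -1))
    sq = x2 + y2 - 2 * x.dot(y.T)
    # Rounding can leave tiny negative values; clip them so sqrt stays real.
    np.maximum(sq, 0, out=sq)
    return np.sqrt(sq)

def k_nearest(x, y, k):
    """For each row of x, indices of the k closest rows of y, nearest first."""
    d = pairwise_euclidean(x, y)
    return np.argsort(d, axis=1)[:, :k]

# Toy data standing in for the real testing and training sets (hypothetical).
rng = np.random.RandomState(0)
testing = rng.rand(5, 3)
training = rng.rand(20, 3)

distances = pairwise_euclidean(testing, training)
neighbours = k_nearest(testing, training, k=3)
```

For the sizes in the question, the full 3,000 x 10,000 float64 matrix takes about 240 MB, so if memory is tight you could process the testing set in slices of rows.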
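
For the library route mentioned above, here is a rough sketch using scipy.spatial.distance.cdist for the full distance matrix and sklearn.neighbors.KDTree for a nearest neighbour query. The array sizes, the variable names and k=3 are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KDTree

# Toy data standing in for the real sets (hypothetical sizes).
rng = np.random.RandomState(0)
training = rng.rand(200, 4)
testing = rng.rand(50, 4)

# All pairwise Euclidean distances, computed in compiled code.
all_distances = cdist(testing, training, metric="euclidean")

# Build the tree once on the training set, then query every test point.
tree = KDTree(training)
nn_dist, nn_idx = tree.query(testing, k=3)  # distances and indices, nearest first
```

The tree query avoids materialising the full distance matrix, which tends to matter more as the training set grows. Tree-based search usually helps most when the number of features is small.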
