python - Improve linear search for KNN efficiency w/ NumPy
I am trying to calculate the distance from each point in the testing set to each point in the training set.
This is what my loop looks like right now:
    import numpy

    for x in testingset:
        for y in trainingset:
            print(numpy.linalg.norm(x - y))
where testingset and trainingset are NumPy arrays, and each row of the two sets holds the feature data for one example.
However, it's running extremely slowly, taking more than 10 minutes, since my data set is bigger (testing set of 3000, training set of ~10,000). Does this have to do with my method, or am I utilizing NumPy incorrectly?
This is because you naively iterate over the data, and loops are slow in Python. Instead, use sklearn's pairwise distance functions, or even better, use sklearn's efficient nearest neighbour search (like BallTree or KDTree). If you do not want to use sklearn, there is also a module in scipy. Finally, you can use "matrix tricks" to compute this, since
|| x - y ||^2 = <x-y, x-y> = <x,x> + <y,y> - 2<x,y>
so you can do (assuming your data is in matrix form, given as x and y):
    import numpy as np

    # Squared norm of every row, shaped so they broadcast to an (n_x, n_y) grid
    x2 = (x**2).sum(axis=1).reshape((-1, 1))
    y2 = (y**2).sum(axis=1).reshape((1, -1))
    distances = np.sqrt(x2 + y2 - 2*x.dot(y.T))
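For comparison, here is a minimal sketch of the scipy route mentioned above, assuming scipy is installed. The random arrays are hypothetical stand-ins for the real testingset and trainingset, and k = 3 is an arbitrary choice. It uses scipy.spatial.distance.cdist for the full distance matrix and scipy.spatial.cKDTree for a k-nearest-neighbour query, and recomputes the matrix trick so the two can be checked against each other. The np.maximum call is an addition of mine: rounding can leave a tiny negative value under the square root, which would otherwise give NaN.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

# Hypothetical stand-in data: rows are examples, columns are features.
rng = np.random.default_rng(0)
testingset = rng.normal(size=(30, 5))
trainingset = rng.normal(size=(100, 5))

# All pairwise Euclidean distances at once, shape (30, 100).
distances = cdist(testingset, trainingset)

# The same matrix via the "matrix trick". Clip at 0 because rounding can
# push a tiny squared distance slightly negative before the sqrt.
x2 = (testingset ** 2).sum(axis=1).reshape((-1, 1))
y2 = (trainingset ** 2).sum(axis=1).reshape((1, -1))
trick = np.sqrt(np.maximum(x2 + y2 - 2 * testingset.dot(trainingset.T), 0))

# For KNN only the k closest training points matter, so a tree query
# avoids building the full distance matrix.
k = 3
tree = cKDTree(trainingset)
knn_dist, knn_idx = tree.query(testingset, k=k)
```

Here knn_idx[i] holds the row indices in trainingset of the k nearest neighbours of testingset[i], and knn_dist[i] holds the matching distances in ascending order. Whether the tree beats the full matrix depends on the number of features; in high dimensions the brute-force matrix is often just as fast.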