Efficiently computing a Euclidean distance matrix in NumPy?

Question (votes: 1, answers: 2)

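I have a large array of 2D data (~20k entries) and I want to compute the pairwise Euclidean distance between all entries. I need the output in standard square form. Several solutions to this problem have been proposed, but none of them seems to work efficiently for large arrays.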

The method based on complex transposing is not viable for large arrays.
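For readers unfamiliar with it, the complex-number approach can be sketched roughly as follows. This is an illustration under assumptions, not the exact code that was timed: the function name `complex_distance_matrix` and the sample points are made up here. Each 2D point is encoded as `x + 1j*y`, and the modulus of every pairwise difference is the distance.

```python
import numpy as np

def complex_distance_matrix(points):
    """Pairwise Euclidean distances for 2D points via complex numbers."""
    # Encode each (x, y) row as the complex number x + 1j*y.
    z = points[:, 0] + 1j * points[:, 1]
    # Broadcasting builds the full n-by-n matrix of complex differences;
    # the modulus of each difference is the Euclidean distance.
    return np.abs(z[:, np.newaxis] - z[np.newaxis, :])

# Hypothetical sample data: three collinear points.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = complex_distance_matrix(points)
print(dist)
```

The intermediate difference matrix holds complex128 values, so it takes twice the memory of the float64 result. That is one plausible reason this approach scales poorly to ~20k points.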

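SciPy's pdist seems to be the most efficient method that works with NumPy. However, calling squareform on its result to obtain a square matrix makes it very inefficient.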
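As a small illustration of the two forms involved (the array `X` and its size are made up for the example): `pdist` returns a condensed vector with one entry per unordered pair, and `squareform` expands it into the symmetric square matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical sample data: 5 random 2D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Condensed form: one distance per unordered pair, n*(n-1)/2 values.
condensed = pdist(X, metric="euclidean")

# Square form: symmetric n-by-n matrix with a zero diagonal.
square = squareform(condensed)

print(condensed.shape, square.shape)
```

For n = 5 the condensed vector has 10 entries, while the square matrix has 25. The expansion step is where the extra time reported in the question is spent.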

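So the best I could come up with is SciPy's cdist, which is somewhat awkward because it computes every pairwise distance twice. The timings below show pdist's advantage for the raw distance calculation.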
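A minimal sketch of the cdist variant, again on made-up sample data, showing that it yields the square matrix directly and that the result matches the pdist plus squareform route:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Hypothetical sample data: 6 random 2D points.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))

# cdist compares every row of the first argument with every row of the
# second, so passing X twice fills both triangles of the matrix.
full = cdist(X, X, metric="euclidean")

# Same values as expanding the condensed pdist result.
same = np.allclose(full, squareform(pdist(X)))
print(full.shape, same)
```

Because the matrix is symmetric, entries `[i, j]` and `[j, i]` are computed separately, which is the redundancy mentioned above.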

Complex: 49.605 s

cdist: 4.820 s

pdist: 1.785 s

pdist with squareform: 10.212 s

python numpy numpy-ndarray euclidean-distance
2 Answers
0
votes
Note that the first run incurs JIT compilation overhead.

    # The %time lines below are IPython/Jupyter magics.
    from scipy.spatial import distance
    from numba import njit, prange
    import numpy as np

    @njit(parallel=True)
    def euclidean_distance(coords1, coords2):
        # allocate output array
        c1_length, c2_length = len(coords1), len(coords2)
        out = np.empty(shape=(c1_length, c2_length), dtype=np.float64)

        # fill the lower triangle with the Euclidean distance formula,
        # assuming coordinates are (lat, lon) based on the example at
        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
        for lat_ix in prange(c1_length):
            for lon_ix in prange(c2_length):
                if lat_ix >= lon_ix:  # do the reverse for the upper triangle
                    out[lat_ix, lon_ix] = (
                        (coords1[lat_ix, 0] - coords2[lon_ix, 0]) ** 2
                        + (coords1[lat_ix, 1] - coords2[lon_ix, 1]) ** 2
                    ) ** 0.5
                else:
                    out[lat_ix, lon_ix] = 0
        return out

    for n in [10, 100, 5000, 20000]:
        arr = np.random.normal(0, 100, (n, 2))
        print(n, arr.shape)
        %time out = euclidean_distance(arr, arr)
        %time out_cdist = distance.cdist(arr, arr, 'euclidean')
        if n < 1000:
            np.testing.assert_array_almost_equal(out, np.tril(out_cdist))
        print()

Output:

10 (10, 2)
CPU times: user 987 ms, sys: 19.3 ms, total: 1.01 s
Wall time: 1.01 s
CPU times: user 79 µs, sys: 12 µs, total: 91 µs
Wall time: 95.1 µs

100 (100, 2)
CPU times: user 1.05 ms, sys: 404 µs, total: 1.45 ms
Wall time: 1.16 ms
CPU times: user 926 µs, sys: 254 µs, total: 1.18 ms
Wall time: 946 µs

5000 (5000, 2)
CPU times: user 125 ms, sys: 128 ms, total: 253 ms
Wall time: 75 ms
CPU times: user 184 ms, sys: 92.6 ms, total: 277 ms
Wall time: 287 ms

20000 (20000, 2)
CPU times: user 2.21 s, sys: 2.15 s, total: 4.36 s
Wall time: 2.55 s
CPU times: user 3.1 s, sys: 2.71 s, total: 5.81 s
Wall time: 31.9 s

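With the 20,000-element array, the UDF is considerably faster because it skips half of the computation. cdist appears to be particularly and unexpectedly slow for this specific data distribution on a MacBook Air, but the point stands either way.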
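The function above fills only the lower triangle, whereas the question asks for the full square form. One possible way to mirror it, sketched here with NumPy on made-up data (the helper name `mirror_lower_triangle` is hypothetical), relies on the diagonal and upper triangle being zero:

```python
import numpy as np

def mirror_lower_triangle(lower):
    """Expand a lower-triangular distance matrix into the symmetric one."""
    # The diagonal and the upper triangle are zero, so adding the
    # transpose fills the upper triangle without changing anything else.
    return lower + lower.T

# Hypothetical example: lower triangle for the points (0,0), (3,4), (6,8).
lower = np.array([
    [0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0],
    [10.0, 5.0, 0.0],
])
full = mirror_lower_triangle(lower)
print(full)
```

This extra pass allocates a second n-by-n array and reads the transpose non-contiguously, so it is not free for large n and should be included in any timing comparison.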


0
votes
First, try a few simple memory operations to get some reference timings.

    import numba as nb
    import numpy as np
    from scipy.spatial import distance

    # Should be at least 0.47 (SVML bug)
    print(nb.__version__)

    # Assumption: the answer never shows how A is built. A (20000, 2)
    # array is used here only so that the snippet runs.
    A = np.random.rand(20_000, 2)

    @nb.njit(fastmath=True, parallel=True)
    def dist_simply_write(A, res):
        for i in nb.prange(A.shape[0]):
            for j in range(A.shape[0]):
                res[i, j] = 1.
        return res

    res_1 = np.empty((A.shape[0], A.shape[0]))
    res_2 = np.empty((A.shape[0], A.shape[0]))

    # Copying the array to a new array, which has to be allocated
    %timeit res_2 = np.copy(res_1)
    # 1.32 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # Copying the array to a new array, which is already allocated
    %timeit np.copyto(res_1, res_2)
    # 328 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # Fill an array with 1., without calculating anything
    %timeit out = dist_simply_write(A, res_1)
    # 246 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Does computing the Euclidean distance take much longer than just writing 1.?

    @nb.njit(fastmath=True, parallel=True)
    def dist_arr_1(A):
        res = np.empty((A.shape[0], A.shape[0]))
        for i in nb.prange(A.shape[0]):
            for j in range(A.shape[0]):
                acc = 0
                for k in range(A.shape[1]):
                    acc += (A[i, k] - A[j, k]) ** 2
                res[i, j] = np.sqrt(acc)
        return res

    @nb.njit(fastmath=True, parallel=True)
    def dist_arr_2(A, res):
        for i in nb.prange(A.shape[0]):
            for j in range(A.shape[0]):
                acc = 0
                for k in range(A.shape[1]):
                    acc += (A[i, k] - A[j, k]) ** 2
                res[i, j] = np.sqrt(acc)
        return res

    %timeit out = dist_arr_1(A)
    # 559 ms ± 85.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    res = np.empty((A.shape[0], A.shape[0]))

    # If we can reuse the output memory
    %timeit out = dist_arr_2(A, res)
    # 238 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

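As you can see, it makes no difference at all whether we do a simple calculation (the Euclidean distance) or just write a number to the array. In fact, computing only half of the values and copying them afterwards would be slower, because the iteration is no longer contiguous in memory and the data has to be reloaded.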
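A back-of-the-envelope calculation helps explain why the arithmetic hardly matters. The sketch below assumes n = 20,000 (the size mentioned in the question; the answer does not state the shape of A) and uses the 238 ms timing quoted above for dist_arr_2:

```python
import numpy as np

# Assumption: n matches the ~20k entries from the question.
n = 20_000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per value

# Size of the full n-by-n float64 output matrix.
matrix_bytes = n * n * itemsize
print(f"output matrix: {matrix_bytes / 1e9:.1f} GB")

# Timing quoted in the answer for dist_arr_2 (preallocated output).
write_time_s = 0.238
bandwidth_gb_s = matrix_bytes / write_time_s / 1e9
print(f"implied write rate: {bandwidth_gb_s:.1f} GB/s")
```

Under that assumption the output alone is 3.2 GB, and filling it in 238 ms implies a write rate of roughly 13 GB/s. That suggests the run time is dominated by memory traffic rather than by the distance computation.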