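I have a large 2D data array (~20k entries) and want to compute the pairwise Euclidean distance between all entries. I need the output in standard square form. Several solutions to this problem have been proposed, but none of them seems to work efficiently for large arrays.
The approach using complex transposing does not work for large arrays.
Scipy pdist seems to be the most efficient method using numpy. However, calling squareform on its result to obtain a square matrix makes it very inefficient.
So the best I could come up with is Scipy cdist, which is somewhat awkward because it computes every pairwise distance twice. The timings below show the advantage of pdist for the raw distance calculation.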
Complex: 49.605 s
Cdist: 4.820 s
Pdist: 1.785 s
Pdist with squareform: 10.212 s
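For reference, the three approaches compared above can be sketched as follows on a small sample. This is an illustration under assumptions, not the code that was timed: the question does not show its "complex" implementation, so the version here (treating each row as one complex number and broadcasting) is a guess at what was meant, and the sample size of 200 points is arbitrary.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
points = rng.normal(0, 100, (200, 2))  # small hypothetical sample, not the 20k array

# "Complex" approach (assumed): treat each (x, y) row as the complex number
# x + iy, broadcast a column against a row, and take the modulus.
z = points[:, 0] + 1j * points[:, 1]
d_complex = np.abs(z[:, np.newaxis] - z[np.newaxis, :])

# pdist returns the condensed form: n * (n - 1) / 2 distances, each pair once.
condensed = distance.pdist(points, "euclidean")
# squareform expands the condensed vector into the full n x n matrix.
d_pdist = distance.squareform(condensed)

# cdist builds the full n x n matrix directly, computing each pair twice.
d_cdist = distance.cdist(points, points, "euclidean")

print(condensed.shape, d_pdist.shape, d_cdist.shape)
```

All three produce the same square matrix up to floating-point error; they differ only in how much work and memory traffic it takes to get there.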
from scipy.spatial import distance
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def euclidean_distance(coords1, coords2):
    # allocate output array
    c1_length, c2_length = len(coords1), len(coords2)
    out = np.empty(shape=(c1_length, c2_length), dtype=np.float64)

    # fill the lower triangle with euclidean distance formula
    # assuming coordinates are (lat, lon) based on the example https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
    for lat_ix in prange(c1_length):
        for lon_ix in prange(c2_length):
            if lat_ix >= lon_ix:  # do the reverse for the upper triangle
                out[lat_ix, lon_ix] = (
                    (coords1[lat_ix, 0] - coords2[lon_ix, 0]) ** 2
                    + (coords1[lat_ix, 1] - coords2[lon_ix, 1]) ** 2
                ) ** 0.5
            else:
                out[lat_ix, lon_ix] = 0
    return out

for n in [10, 100, 5000, 20000]:
    arr = np.random.normal(0, 100, (n, 2))
    print(n, arr.shape)
    %time out = euclidean_distance(arr, arr)
    %time out_cdist = distance.cdist(arr, arr, 'euclidean')
    if n < 1000:
        np.testing.assert_array_almost_equal(out, np.tril(out_cdist))
    print()
Output:
10 (10, 2)
CPU times: user 987 ms, sys: 19.3 ms, total: 1.01 s
Wall time: 1.01 s
CPU times: user 79 µs, sys: 12 µs, total: 91 µs
Wall time: 95.1 µs
100 (100, 2)
CPU times: user 1.05 ms, sys: 404 µs, total: 1.45 ms
Wall time: 1.16 ms
CPU times: user 926 µs, sys: 254 µs, total: 1.18 ms
Wall time: 946 µs
5000 (5000, 2)
CPU times: user 125 ms, sys: 128 ms, total: 253 ms
Wall time: 75 ms
CPU times: user 184 ms, sys: 92.6 ms, total: 277 ms
Wall time: 287 ms
20000 (20000, 2)
CPU times: user 2.21 s, sys: 2.15 s, total: 4.36 s
Wall time: 2.55 s
CPU times: user 3.1 s, sys: 2.71 s, total: 5.81 s
Wall time: 31.9 s
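With an array of 20,000 elements the UDF is considerably faster, since it saves half of the computation. cdist seems to be particularly/unexpectedly slow for this specific data distribution on a Macbook Air, but the point holds regardless.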
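The UDF above fills only the lower triangle, while the question asks for the full square form. A minimal sketch of mirroring the lower triangle into the upper one with plain NumPy is shown below. The helper name `mirror_lower_triangle` and the three sample points are made up for illustration, and the extra pass adds memory traffic of its own on very large matrices.

```python
import numpy as np

def mirror_lower_triangle(lower):
    """Return the symmetric matrix whose lower triangle (incl. diagonal) is `lower`."""
    # np.tril(lower) keeps the diagonal and everything below it;
    # np.tril(lower, -1).T is the strictly lower part reflected above the diagonal.
    return np.tril(lower) + np.tril(lower, -1).T

# Hypothetical sample: three collinear points, pairwise distances 5, 10 and 5.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))  # full matrix, used as the reference

lower = np.tril(dist)  # same layout as the UDF output: zeros above the diagonal
full = mirror_lower_triangle(lower)
print(full)
```

Because the diagonal of a distance matrix is zero, `lower + lower.T` would give the same result here; the `np.tril` form is used so the helper also works when the diagonal is non-zero.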
import numba as nb
import numpy as np
from scipy.spatial import distance

# Should be at least 0.47 (SVML-Bug)
print(nb.__version__)

# Assumed test data: the original does not show how A was created
A = np.random.rand(20000, 2)

@nb.njit(fastmath=True, parallel=True)
def dist_simply_write(A, res):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = 1.
    return res

res_1 = np.empty((A.shape[0], A.shape[0]))
res_2 = np.empty((A.shape[0], A.shape[0]))

# Copying the array to a new array, which has to be allocated
%timeit res_2=np.copy(res_1)
# 1.32 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Copying the array to a new array, which is already allocated
%timeit np.copyto(res_1,res_2)
# 328 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Fill an array with 1., without calculating anything
%timeit out=dist_simply_write(A,res_1)
# 246 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
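Does computing the Euclidean distance take longer than just writing 1.? As the timings below show, it makes no difference at all whether we do a simple calculation (Euclidean distance) or only write a number to the array. In fact, computing only half of the values and copying them afterwards would be slower (non-contiguous iteration in memory and reloading of the data).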
@nb.njit(fastmath=True, parallel=True)
def dist_arr_1(A):
    res = np.empty((A.shape[0], A.shape[0]))
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[0]):
            acc = 0.0
            for k in range(A.shape[1]):
                acc += (A[i, k] - A[j, k]) ** 2
            res[i, j] = np.sqrt(acc)
    return res

@nb.njit(fastmath=True, parallel=True)
def dist_arr_2(A, res):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[0]):
            acc = 0.0
            for k in range(A.shape[1]):
                acc += (A[i, k] - A[j, k]) ** 2
            res[i, j] = np.sqrt(acc)
    return res

%timeit out=dist_arr_1(A)
# 559 ms ± 85.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

res = np.empty((A.shape[0], A.shape[0]))

# If we can reuse the output memory
%timeit out=dist_arr_2(A,res)
# 238 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
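One way to see why writing the output dominates is to work out how large the result is. The arithmetic below is only a back-of-the-envelope sketch and assumes 20,000 points (the size given in the question) stored as float64, which is the default dtype of `np.empty`.

```python
n = 20_000             # number of points, as stated in the question
BYTES_PER_FLOAT64 = 8  # one float64 distance occupies 8 bytes

# Full square matrix: n * n entries.
square_bytes = n * n * BYTES_PER_FLOAT64

# Condensed form (what pdist returns): n * (n - 1) / 2 entries.
condensed_bytes = n * (n - 1) // 2 * BYTES_PER_FLOAT64

print(f"square:    {square_bytes / 1e9:.1f} GB")     # square:    3.2 GB
print(f"condensed: {condensed_bytes / 1e9:.1f} GB")  # condensed: 1.6 GB
```

With a few gigabytes to write and only a handful of floating-point operations per entry, it is plausible that memory bandwidth rather than arithmetic sets the runtime, which is consistent with the timings above.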