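In Python, I want to take a short sequence of numbers and find the region of a very large sequence that has the highest correlation with it.
Is there an efficient way to do this other than brute force, i.e. taking a slice of the large sequence with the same length as the small one, computing the coefficient, incrementing the start index, and repeating this over and over?
Do numpy or other math packages have functionality to do this efficiently?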
I suggest using Numba (together with numpy) for this task, for example:
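import numba
import numpy as np


# https://stackoverflow.com/a/73663164/10035985
@numba.njit
def corr_nb(data1, data2):
    # Pearson correlation: (E[xy] - E[x]E[y]) / (std_x * std_y)
    # Caution: with narrow integer dtypes (e.g. int8) the product
    # data1 * data2 may overflow for larger values.
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()
    corr = ((data1 * data2).mean() - mean1 * mean2) / (std1 * std2)
    return corr


@numba.njit
def find_corr(small, big):
    n = len(small)
    curr_max_corr = -np.inf
    curr_max_index = -1
    # "+ 1" so that the last full-length window is checked as well
    for i in range(0, len(big) - n + 1):
        c = corr_nb(small, big[i : i + n])
        if c > curr_max_corr:
            curr_max_corr = c
            curr_max_index = i
    return curr_max_index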
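As a quick sanity check of the formula used in corr_nb, here is a minimal sketch in plain NumPy that compares it against np.corrcoef. The helper name pearson and the two sample arrays are only illustrative, and the cast to float64 is an added precaution, not part of the code above.

```python
import numpy as np


def pearson(x, y):
    # Same formula as corr_nb: (E[xy] - E[x]E[y]) / (std_x * std_y).
    # The float64 cast is an assumption added here to avoid integer overflow.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return ((x * y).mean() - x.mean() * y.mean()) / (x.std() * y.std())


a = np.array([1, -1, 2, 3, 4, -5, 6])
b = np.array([1, -3, 2, 3, 3, -7, 7])

r = pearson(a, b)
r_ref = np.corrcoef(a, b)[0, 1]  # NumPy's built-in Pearson coefficient
print(r, r_ref)
```

Both values should agree up to floating-point rounding, since np.corrcoef computes the same Pearson coefficient.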
Benchmark with a small array of 7 elements and a big array of 500_000 elements:
from timeit import timeit

import numba
import numpy as np

np.random.seed(42)
big_arr = np.random.randint(low=-10, high=10, size=500_000, dtype="int8")

# some small_arr that we try to correlate:
small_arr = np.array([1, -1, 2, 3, 4, -5, 6], dtype="int8")


# https://stackoverflow.com/a/73663164/10035985
@numba.njit
def corr_nb(data1, data2):
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()
    corr = ((data1 * data2).mean() - mean1 * mean2) / (std1 * std2)
    return corr


@numba.njit
def find_corr(small, big):
    n = len(small)
    curr_max_corr = -np.inf
    curr_max_index = -1
    # "+ 1" so that the last full-length window is checked as well
    for i in range(0, len(big) - n + 1):
        c = corr_nb(small, big[i : i + n])
        if c > curr_max_corr:
            curr_max_corr = c
            curr_max_index = i
    return curr_max_index


# the first call also triggers JIT compilation, so it is kept out of the timing
i = find_corr(small_arr, big_arr)
print("Index:", i)
print("Small:", small_arr)
print("Big:  ", big_arr[i : i + len(small_arr)])

t = timeit("find_corr(small_arr, big_arr)", number=1, globals=globals())
print(t)
Prints on my machine (AMD 5700x):
Index: 74716
Small: [ 1 -1 2 3 4 -5 6]
Big: [ 1 -3 2 3 3 -7 7]
0.041199692990630865
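Since the question also asks whether numpy itself can do this, here is a minimal sketch of a loop-free variant, assuming NumPy 1.20 or newer (for sliding_window_view). The function name find_corr_np and the planted test data are hypothetical, and I have not benchmarked it against the Numba version, so treat it as an alternative to measure rather than a faster solution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def find_corr_np(small, big):
    # Hypothetical helper: Pearson correlation of `small` with every window of `big`.
    small = np.asarray(small, dtype=np.float64)
    big = np.asarray(big, dtype=np.float64)
    n = len(small)

    # View of shape (len(big) - n + 1, n); no data is copied at this point.
    windows = sliding_window_view(big, n)

    # Centering `small` is enough: because s sums to zero,
    # windows @ s equals sum((w - mean(w)) * s) for every window w.
    s = small - small.mean()
    cov = windows @ s / n

    denom = windows.std(axis=1) * small.std()
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = cov / denom
    # Constant windows have undefined correlation; rank them last.
    corr = np.where(denom > 0, corr, -np.inf)
    return int(np.argmax(corr)), corr


# Illustrative data: plant a perfectly correlated window at a known position.
rng = np.random.default_rng(0)
big = rng.integers(-10, 10, size=10_000).astype(np.float64)
small = np.array([1, -1, 2, 3, 4, -5, 6], dtype=np.float64)
big[1234 : 1234 + len(small)] = 2 * small + 1

idx, corr = find_corr_np(small, big)
print("Index:", idx)
```

Note that windows.std(axis=1) allocates temporaries of roughly len(big) * len(small) floats, so for long patterns the Numba loop above may be the more memory-friendly choice.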