考虑下面的非向量化形式操作的代码
import numpy as np
N = 110
observables = np.random.rand(N)
I = np.random.randint(np.array([0,0])[:,None], np.array([4,5])[:,None],
size=(2,N))
# you can think of of this as two groupings of the data with numbers
# indicating separate groups (one has 4 groups the other 5). I want
# to compute then all the averages for each i,j where i is the index
# from the first grouping and j the second
averages = np.zeros((4,5))
#unvectorized
for i in range(4):
for j in range(5):
J = np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten()
averages[i,j] = np.mean(observables[J])
我提出了以下向量化,但对于大 N 来说效率非常低,因为即使像 N=1000 这样的东西,最小公倍数也会增长到难以处理的大小
import math
#vectorized inefficiently
J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
for j in range(5)]
lengths = [len(j) for j in J]
L = math.lcm(*lengths)
J = np.array([np.tile(j,int(L/len(j))) for j in J])
averages_vectorized = np.reshape(np.mean(observables[J], axis=1), (4,5))
还有其他方法可以对其进行矢量化吗?例如,像 [0,1,2] 这样的索引列表可以用 [0, 1, 2, sth, sth] 之类的东西扩展,这样当我尝试使用这个索引列表访问 numpy 向量的元素时,没有考虑到什么?
ps:
还有以下方法,只是在两个列表推导式中隐藏 for 循环
J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
for j in range(5)]
averages_list_comrehension = np.reshape(np.array([np.mean(observables[j]) for j in J]), (4,5))
PS2: 一种方法是在可观测值的末尾添加一个 nan 值,然后用索引 N 扩展 J 的任何元素,直到所有元素的大小相同并使用 np.nanmean。然而我的最终目标是在 PyTensor 的张量上下文中应用它,所以不确定如何在那里做同样的事情(在这种情况下可观察量将是标量张量,我不知道 PyTensor 中是否有 nan 张量和 nanmean)
对组组合进行排序并拆分:
def mean_by_groups_sort_and_split(values, groups):
# ensure it can work with structures other then numpy.ndarray
values = np.asarray(values)
groups = np.atleast_2d(np.asarray(groups))
# ensure we have group mappings along the last value axis
assert values.shape[-1] == groups.shape[-1], 'Inappropriate shape of group mappings'
assert (groups >= 0).all(), 'group numbering should start from 0'
length = values.shape[-1]
groups_count = 1 + groups.max(-1)
sorted_by_group = np.lexsort(groups)
# determine where combinations of group mappings are changing
# here prepend=-1 to include the first combination (group number != -1)
diff_from_prev = np.diff(groups[:, sorted_by_group], prepend=-1).any(0)
bins = np.arange(length)[diff_from_prev]
coordinates = np.transpose(groups[:, sorted_by_group][:, bins])
# here we use bins[1:] to avoid splitting in front of the first value
value_groups = np.split(values[sorted_by_group], bins[1:])
ans = np.full(shape=groups_count, fill_value=np.nan)
for group_combination, arr in zip(coordinates, value_groups):
# here we use tuple to address a single cell
ans[tuple(group_combination)] = arr.mean()
return ans
answer = mean_by_groups(observables, I)
按照这种方式,我们可以有两个以上的组映射。还要考虑
numpy.full
而不是 numpy.zeros
。我们可以拥有映射中不存在的组组合,并且应该保留 nan
而不是 0
。