Computing statistics in vectorized form for every data subset selected by multiple grouping features


Consider the following code, which operates in non-vectorized form:

import numpy as np

N = 110
observables = np.random.rand(N)

I = np.random.randint(np.array([0, 0])[:, None], np.array([4, 5])[:, None],
                      size=(2, N))

# you can think of this as two groupings of the data, with the numbers
# indicating separate groups (one grouping has 4 groups, the other 5). I want
# to compute all the averages for each (i, j), where i is the index
# from the first grouping and j from the second

averages = np.zeros((4,5))

# unvectorized
for i in range(4):
    for j in range(5):
        J = np.argwhere((I[0, :] == i) & (I[1, :] == j)).flatten()
        averages[i, j] = np.mean(observables[J])

I came up with the following vectorization, but it is very inefficient for large N, because even for something like N = 1000 the least common multiple of the group sizes grows to an unmanageable size:

import math

# vectorized, but inefficiently
J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]
lengths = [len(j) for j in J]

L = math.lcm(*lengths)

J = np.array([np.tile(j, L // len(j)) for j in J])

averages_vectorized = np.reshape(np.mean(observables[J], axis=1), (4,5))

Is there some other way to vectorize this? For example, can an index list like [0, 1, 2] be extended with something like [0, 1, 2, sth, sth], such that the extra entries are simply ignored when I use the list to index into a numpy vector?
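For completeness, one common alternative not mentioned in the post is to fuse the two group labels into a single flat index and use np.bincount for the per-group sums and counts. This is only a sketch under the same 4×5 setup, with the data regenerated locally:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 110
observables = rng.random(N)
I = rng.integers(0, [[4], [5]], size=(2, N))

# fuse the two group labels into one flat index in [0, 4*5)
flat = I[0] * 5 + I[1]

# per-group sum of values and per-group member count
sums = np.bincount(flat, weights=observables, minlength=4 * 5)
counts = np.bincount(flat, minlength=4 * 5)

# empty groups yield nan (0/0), matching np.mean of an empty slice
with np.errstate(invalid="ignore", divide="ignore"):
    averages = (sums / counts).reshape(4, 5)
```

This avoids both the nested loops and the LCM padding entirely; the cost is one pass per bincount over the N values.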

PS:

There is also the following approach, which merely hides the for loops inside two list comprehensions:

J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]

averages_list_comprehension = np.reshape(
    np.array([np.mean(observables[j]) for j in J]), (4, 5))

PS2: One approach would be to append a nan value at the end of observables and then pad every element of J with the index N until all elements have the same size, and use np.nanmean. However, my end goal is to apply this in a PyTensor context, on tensors, so I am not sure how to do the same thing there (in that case the observables would be scalar tensors, and I don't know whether PyTensor has nan tensors and a nanmean).
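A minimal NumPy sketch of the padding idea described above (the sentinel index N, the max-length padding, and the regenerated data are illustrative choices, not code from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 110
observables = rng.random(N)
I = rng.integers(0, [[4], [5]], size=(2, N))

# append one nan sentinel; index N now points at it
padded = np.append(observables, np.nan)

J = [np.argwhere((I[0, :] == i) & (I[1, :] == j)).flatten()
     for i in range(4) for j in range(5)]

# pad every index list with the sentinel index N up to a common length
# (the maximum group size, instead of the LCM of all sizes)
L = max(len(j) for j in J)
Jpad = np.array([np.pad(j, (0, L - len(j)), constant_values=N) for j in J])

# nanmean ignores the sentinel entries
averages_padded = np.nanmean(padded[Jpad], axis=1).reshape(4, 5)
```

Padding to the maximum length keeps the index array at size 20 × max_group_size, whereas the LCM-based tiling can blow up combinatorially.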

python numpy vectorization
1 Answer

Update: representing the data as a 3-dimensional array

If the data and the number of groups are not large, we can represent them as a 3D array, where the first axis is the index of the value and the last two axes are the group numbers. For example, if the first value x0 belongs to group 1 and group 2, then the array value arr[0, 1, 2] is equal to x0, while all the other values of arr[0, ...] are nan. If the next value x1 belongs to groups 2 and 0, then arr[1, 2, 0] is equal to x1 while the other arr[1, ...] == nan, and so on.

import numpy as np
from numpy.random import default_rng
rng = default_rng(0)

# parameters
N = 110
group_counts = (4, 5)

# init
values = rng.random(N)
groups = rng.integers(0, group_counts, size=(N, len(group_counts)))

# transform values into n-dimensional array, where n = 1 + number_of_groups
arr = np.full((N, *group_counts), np.nan, dtype=float)
arr[range(N), *groups.T] = values

# calculate mean values along axis 0 ignoring nan values
answer = np.nanmean(arr, axis=0)
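As a quick self-contained sanity check (not part of the original answer), the 3D-array result can be compared against the naive double loop. The tuple form of the fancy-index assignment is used here because the starred-subscript syntax `arr[range(N), *groups.T]` requires Python 3.11+:

```python
import numpy as np

rng = np.random.default_rng(0)
N, group_counts = 110, (4, 5)
values = rng.random(N)
groups = rng.integers(0, group_counts, size=(N, len(group_counts)))

# build the (N, 4, 5) array of nans and place each value at its group cell
arr = np.full((N, *group_counts), np.nan)
arr[(range(N), *groups.T)] = values
answer = np.nanmean(arr, axis=0)

# reference: the plain double loop over all group combinations
expected = np.full(group_counts, np.nan)
for i in range(group_counts[0]):
    for j in range(group_counts[1]):
        mask = (groups[:, 0] == i) & (groups[:, 1] == j)
        if mask.any():
            expected[i, j] = values[mask].mean()

assert np.allclose(answer, expected, equal_nan=True)
```

Note the memory cost: the dense array has N × 4 × 5 cells, of which only N are non-nan, which is exactly why the sparse variant below is suggested for big data.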

In the case of big data, we can try an n-dimensional sparse array:

import sparse

arr = sparse.COO([range(N), *groups.T], values, shape=(N, *group_counts), fill_value=np.nan)
sparse.nanmean(arr, axis=0).todense()

Previous answer: splitting sorted group combinations

Sort the group combinations and split on them:

def mean_by_groups(values, groups):
    # ensure it can work with structures other than numpy.ndarray
    values = np.asarray(values)
    groups = np.atleast_2d(np.asarray(groups))
    
    # ensure we have group mappings along the last value axis 
    assert values.shape[-1] == groups.shape[-1], 'Inappropriate shape of group mappings'
    assert (groups >= 0).all(), 'group numbering should start from 0'
    
    length = values.shape[-1]
    groups_count = 1 + groups.max(-1)
    # by default, lexical sorting is performed on the last axis
    sorted_by_group = np.lexsort(groups)
    # determine where combinations of group mappings are changing
    # here prepend=-1 to include the first combination (group number != -1)
    diff_from_prev = np.diff(groups[:, sorted_by_group], prepend=-1).any(0)
    bins = np.arange(length)[diff_from_prev]
    coordinates = np.transpose(groups[:, sorted_by_group][:, bins])
    # here we use bins[1:] to avoid splitting in front of the first value
    value_groups = np.split(values[sorted_by_group], bins[1:])
    ans = np.full(shape=groups_count, fill_value=np.nan)
    for group_combination, arr in zip(coordinates, value_groups):
        # here we use tuple to address a single cell
        ans[tuple(group_combination)] = arr.mean()
        
    return ans


answer = mean_by_groups(observables, I)

Following this approach, we can have more than two group mappings. Note also the use of numpy.full instead of numpy.zeros: there can be group combinations that are not present in the mapping, and those should remain nan rather than 0.