Background
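I am trying to write Cython code that operates on ndarrays and behaves like pandas groupby. Given two 1D or 2D ndarrays, "factor" and "group", I want either summary statistics of "factor", such as the mean and standard deviation for each row (if 2D) and each group, or an element-wise transformation of "factor", such as demeaning or standardizing within each group.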
>>> factor
array([[3, 8, 6],
       [6, 1, 3],
       [3, 7, 6],
       [1, 3, 4]])
>>> group
array([[0, 1, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0]])
>>> c_group.group_mean(factor, group)
array([[3, 7],
       [2, 6],
       [4, 7],
       [2, 0]])
>>> c_group.group_demean(factor, group)
array([[ 0,  1, -1],
       [ 0, -1,  1],
       [-1,  0,  2],
       [-1,  1,  2]])
>>> c_group.group_demean(factor[0], group[0])
array([ 0,  1, -1])
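To pin down the semantics shown above, here is a minimal pure-NumPy sketch that reproduces the sample output and could serve as a reference when testing a Cython version. It is not the c_group implementation. The helper group_mean_1d and the choices of floor division for integer input and 0 for empty groups are assumptions inferred from the sample output only.

```python
import numpy as np

def group_mean_1d(factor, group, n_groups):
    """Per-group mean of a 1D array; empty groups are reported as 0 (assumption)."""
    sums = np.zeros(n_groups, dtype=factor.dtype)
    counts = np.zeros(n_groups, dtype=np.intp)
    np.add.at(sums, group, factor)   # unbuffered accumulation per group label
    np.add.at(counts, group, 1)
    means = np.zeros(n_groups, dtype=factor.dtype)
    nonempty = counts > 0
    if np.issubdtype(factor.dtype, np.integer):
        # Floor division matches the sample for non-negative data; C integer
        # division truncates toward zero, so negatives may differ.
        means[nonempty] = sums[nonempty] // counts[nonempty]
    else:
        means[nonempty] = sums[nonempty] / counts[nonempty]
    return means

def group_mean(factor, group):
    factor, group = np.asarray(factor), np.asarray(group)
    n_groups = int(group.max()) + 1  # labels assumed to be 0..n_groups-1
    if factor.ndim == 1:
        return group_mean_1d(factor, group, n_groups)
    return np.stack([group_mean_1d(f, g, n_groups) for f, g in zip(factor, group)])

def group_demean(factor, group):
    factor, group = np.asarray(factor), np.asarray(group)
    means = group_mean(factor, group)
    if factor.ndim == 1:
        return factor - means[group]
    # Pick, for every element, the mean of its own group within the same row.
    return factor - np.take_along_axis(means, group, axis=1)

factor = np.array([[3, 8, 6], [6, 1, 3], [3, 7, 6], [1, 3, 4]])
group = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]])
print(group_mean(factor, group))
print(group_demean(factor, group))
```

Because the 2D case simply loops over rows in Python, this sketch is only meant for checking results, not for speed.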
Problem
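Apart from implementing the algorithm itself, there are two problems: handling multiple data types and handling multiple input shapes. Here I follow the Cython documentation and use a fused type as a template over the data types, and I write two functions, one for each input shape. (Please ignore the fact that the mean of integers may be a float.)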
# import ...
ctypedef fused factor_type:
    int
    long long
    double

def myfunc(factor, group):
    # Dispatch on the number of dimensions of the inputs.
    if np.ndim(factor) == 1 and np.ndim(group) == 1:
        return myfunc_1d(factor, group)
    if np.ndim(factor) == 2 and np.ndim(group) == 2:
        return myfunc_2d(factor, group)

def myfunc_1d(factor_type[:] factor, int[:] group):
    ...
    # Map the fused C type to the matching NumPy dtype.
    if factor_type is int:
        dtype = np.intc
    elif factor_type is double:
        dtype = np.double
    elif factor_type is cython.longlong:
        dtype = np.longlong
    # np.empty allocates an uninitialized array of shape `size`.
    result = np.empty(size, dtype=dtype)
    cdef factor_type[:] result_view = result
    ...

def myfunc_2d(factor_type[:, :] factor, int[:, :] group):
    ...
    # Same mapping as in myfunc_1d, repeated here.
    if factor_type is int:
        dtype = np.intc
    elif factor_type is double:
        dtype = np.double
    elif factor_type is cython.longlong:
        dtype = np.longlong
    result = np.empty(size, dtype=dtype)
    cdef factor_type[:, ::1] result_view = result
    ...
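As a possible way to reduce the two-shape duplication, the Python-level wrapper could promote 1D input to a single-row 2D array, call one 2D kernel, and strip the extra axis afterwards. The sketch below is plain Python under stated assumptions: dispatch and toy_kernel are hypothetical names, and toy_kernel is only a stand-in for a compiled 2D function. It also raises on unsupported shapes, whereas a dispatcher that falls through both if branches returns None silently. The cast of group to np.intc reflects that an int[:] memoryview expects a C int buffer, while NumPy's default integer is often 64-bit.

```python
import numpy as np

def toy_kernel(factor, group):
    """Hypothetical stand-in for a compiled 2D kernel: keep only group-0 elements."""
    return factor * (group == 0)

def dispatch(kernel_2d, factor, group):
    """Route 1D and 2D input through a single 2D kernel."""
    factor = np.asarray(factor)
    # An `int[:]` memoryview needs a C int buffer, so cast the labels up front.
    group = np.asarray(group, dtype=np.intc)
    if factor.shape != group.shape:
        raise ValueError("factor and group must have the same shape")
    if factor.ndim == 1:
        # Add a leading axis, run the 2D kernel, then drop the axis again.
        return kernel_2d(factor[np.newaxis, :], group[np.newaxis, :])[0]
    if factor.ndim == 2:
        return kernel_2d(factor, group)
    raise ValueError("only 1D and 2D input is supported")

print(dispatch(toy_kernel, [3, 8, 6], [0, 1, 1]))
print(dispatch(toy_kernel, [[3, 8, 6], [6, 1, 3]], [[0, 1, 1], [1, 0, 0]]))
```

The trade-off is one small reshape per call in exchange for maintaining a single kernel; whether that is acceptable depends on how often the 1D path is used.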
Question
def type_conversion(type_):
    if type_ is int:
        return np.intc
    elif type_ is double:
        return np.double
    elif type_ is cython.longlong:
        return np.longlong
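One option that may avoid the mapping altogether is to take the dtype from the input buffer itself: np.asarray accepts any object that exposes the buffer protocol and reports the matching dtype. The sketch below is plain Python, not tested Cython. Python's built-in memoryview stands in for a Cython typed memoryview (an assumption), and empty_result_like is a hypothetical helper name.

```python
import numpy as np

def empty_result_like(view, shape):
    """Allocate an output array whose dtype matches the buffer behind `view`."""
    return np.empty(shape, dtype=np.asarray(view).dtype)

# The three element types of the fused type in the question.
for dt in (np.intc, np.longlong, np.double):
    source = np.arange(6, dtype=dt)
    view = memoryview(source)  # stand-in for a typed memoryview argument
    result = empty_result_like(view, (2, 3))
    print(np.dtype(dt).name, "->", result.dtype.name)
```

If this carries over to the Cython functions, each if/elif chain could shrink to a single allocation line, at the cost of one np.asarray call per invocation.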
Thank you very much.