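I am facing the following problem: I want to compute the standard deviation of a set of values. The difficulty is that I want to apply the usual formula, but with the sample mean replaced by the mean of the whole dataset.
To clarify where the problem comes from, here is an example dataset: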
import pandas as pd
import numpy as np
data = {
'X': ['asdf'] * 15,
'Y': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
'A': [58781, 60775, 61424, 61620, 60882, 58788, 57939, 60212, 59086, 59119, 59119, 59119, 59119, 59119, 59119],
'B': [1.2, 1.6, 1.7, 2.1, 2.3, 2.8, 2.2, 1.9, 2.3, 2.2, 2.2, 2, 2.3, 2.4, 2.5],
'C': [4.4, 4.2, 5.8, 4, 4.3, 4.5, 4.3, 5.2, 5, 3.8, 4.1, 4.5, 4.4, 4.5, 4.3]
}
df = pd.DataFrame(data)
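The standard deviation I want is for the values grouped by the features "X" and "Y" (there are more "X" values that I did not include in the example), but the mean used in the formula should be the one for the whole "X" group.
This led me to write the following code: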
def custom_std(df, means, features, axis=0, ddof=1):
x = df['X'].iloc[0]
x_means = pd.concat([means[means['X'] == x][f]] * len(df[features]), ignore_index=True)
sum_diff_sqr = np.sum(np.square(df[features] - x_means), axis=axis)
variance = sum_diff_sqr/(len(df[features]) - ddof)
std_dev = np.sqrt(variance)
return std_dev
df_means = df.groupby(['X'])['A', 'B', 'C'].mean(numeric_only=True).reset_index()
df_custom_std = df.groupby(['X', 'Y']).apply(custom_std, df_means, ['A', 'B', 'C']).reset_index()
After many tests, the only result I get is an empty standard deviation. I checked that the subtraction
df[features] - x_means
produces invalid values (NaN). What I want is a standard deviation, and I don't think that is possible unless I implement the method myself.
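One possible explanation for the NaN values (an assumption, since it depends on the indices involved): pandas aligns operands on index labels rather than on position, so rows whose labels appear in only one operand come out as NaN. A minimal sketch with two made-up frames, `left` and `right`, which are not part of the data above:

```python
import pandas as pd

# Hypothetical frames whose index labels do not overlap.
left = pd.DataFrame({"A": [1.0, 2.0]}, index=[5, 6])
right = pd.DataFrame({"A": [0.5, 0.5]}, index=[0, 1])

# Label-based alignment: the result has labels 0, 1, 5, 6 and every value is NaN.
misaligned = left - right

# Dropping the labels of one operand makes the subtraction positional.
by_position = left - right.to_numpy()
```

If this is what happens, resetting the index inside the function, or subtracting a NumPy array, should avoid the NaN values.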
The expected output should look like this (the values do not correspond, but this is the idea):
X    Y            A         B         C
asdf 1   372.856882  0.408455  0.038759
     2   369.726386  0.307087  0.005963
     3   221.698686  0.038759  0.172923
qwer 1  1811.662275  0.028749  0.009583
     2  1811.662275  0.373743  0.456797
     3  1019.000000  0.167857  0.910714
zxcv 1   430.577250  0.268328  0.199755
     2   714.110669  0.044721  0.426344
     3   931.635362  0.313050  0.336901
uiop 1   916.638800  0.026833  0.921260
     2   559.762350  0.107331  0.831817
     3   640.558940  0.062610  0.509823
lkjh 1   782.975174  0.041527  0.779429
     2   104.104936  0.003194  0.651654
     3   378.107143  0.046429  0.328571
mnbv 1   223.964569  0.512805  0.432306
     2    88.011636  0.602248  0.655913
     3   398.556756  0.333919  0.432306
fghj 1   465.549353  0.140553  0.527073
     2   898.500000  0.192857  0.289286
     3   338.093478  0.185274  0.214024
I am also unsure whether the right approach is apply or transform.
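Can anyone help me, please? I have looked into existing work, but I could not find anything that fits these criteria.
Since you did not provide a worked calculation, this is hard to verify, but the answer below takes advantage of index alignment between data frames: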
features = ["A", "B", "C"]
ddof = 1
# For each group, calculate the difference between each value and the custom mean.
# The two data frames are aligned on X since it's common in both indices.
diff = df.set_index(["X", "Y"])[features] - df.groupby("X")[features].mean()
# Variance
group = diff.pow(2).groupby(["X", "Y"])
variance = group.sum() / (group.count() - ddof)
# Standard Deviation
std = variance ** 0.5
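On the apply-versus-transform question, one possible variation (a sketch, not part of the code above) is to let `transform("mean")` broadcast the "X"-level mean back onto every row, so the subtraction happens on identical indices. The toy data and the names `x_means`, `sq_dev`, and `custom_std` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one "X" group split into two "Y" subgroups.
df = pd.DataFrame({
    "X": ["asdf"] * 6,
    "Y": [1, 1, 1, 2, 2, 2],
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 9.0],
})
features = ["A"]
ddof = 1

# Mean of each feature over the whole "X" group, repeated on every row,
# so x_means has the same index as df.
x_means = df.groupby("X")[features].transform("mean")

# Squared deviations from the "X"-level mean.
sq_dev = (df[features] - x_means).pow(2)

# Sum the squared deviations per ("X", "Y") subgroup and divide by n - ddof.
grouped = sq_dev.groupby([df["X"], df["Y"]])
custom_std = np.sqrt(grouped.sum() / (grouped.count() - ddof))
```

Here the mean of "A" over the whole "X" group is 4, so the Y=1 subgroup gives sqrt((9 + 4 + 1) / 2) = sqrt(7) and the Y=2 subgroup gives sqrt((0 + 1 + 25) / 2) = sqrt(13).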