How to calculate a shifted expanding mean per group


I want to compute an expanding mean of col2 within each groupby('col1') group, but the mean should not include the row itself (only the rows above it).

import pandas as pd

# col2 holds integers so that expanding().mean() can be computed
dummy = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "b", "c", "c"], "col2": [1, 2, 3, 4, 5, 6, 7, 8]}, index=list(range(8)))
print(dummy)
dummy['one_liner'] = dummy.groupby('col1').col2.shift().expanding().mean().reset_index(level=0, drop=True)
dummy['two_liner'] = dummy.groupby('col1').col2.shift()
dummy['two_liner'] = dummy.groupby('col1').two_liner.expanding().mean().reset_index(level=0, drop=True)
print(dummy)
---------------------------
Here is the result of the first print statement:
  col1 col2
0    a    1
1    a    2
2    a    3
3    b    4
4    b    5
5    b    6
6    c    7
7    c    8
Here is the result of the second print statement:
  col1 col2  one_liner  two_liner
0    a    1        NaN        NaN
1    a    2   1.000000        1.0
2    a    3   1.500000        1.5
3    b    4   1.500000        NaN
4    b    5   2.333333        4.0
5    b    6   3.000000        4.5
6    c    7   3.000000        NaN
7    c    8   3.800000        7.0

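I expected the two results to be identical. two_liner is the expected result; one_liner mixes numbers across groups.

It took me a long time to arrive at this solution. Can anyone explain the logic? Why doesn't one_liner give the expected result?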

python pandas pandas-groupby
2 answers
2 votes

You are looking for expanding().mean() and shift() within the groupby():

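# df is the question's dummy DataFrame
# group_keys=False keeps the original index on the result (needed in pandas >= 2.0)
groups = df.groupby('col1', group_keys=False)
df['one_liner'] = groups.col2.apply(lambda x: x.expanding().mean().shift())

# the same operation applied again, this time to the new one_liner column
df['two_liner'] = df.groupby('col1', group_keys=False).one_liner.apply(lambda x: x.expanding().mean().shift())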

Output:

  col1  col2  one_liner  two_liner
0    a     1        NaN        NaN
1    a     2        1.0        NaN
2    a     3        1.5        1.0
3    b     4        NaN        NaN
4    b     5        4.0        NaN
5    b     6        4.5        4.0
6    c     7        NaN        NaN
7    c     8        7.0        NaN

Explanation:

(dummy.groupby('col1').col2.shift()   # shifts col2 within each group
     .expanding().mean()              # ignores the grouping and expands over the whole series
     .reset_index(level=0, drop=True) # not really important here
)

So the chained command above is equivalent to

s1 = dummy.groupby('col1').col2.shift()
s2 = s1.expanding().mean()
s3 = s2.reset_index(level=0, drop=True)

As you can see, only s1 takes the grouping by col1 into account.
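To make the difference concrete, here is a minimal sketch that keeps both steps inside each group by using transform instead of apply. This is a swapped-in variant, not the code from the answer above; the column names grouped and mixed are made up for illustration, and col2 is assumed to be numeric.

```python
import pandas as pd

dummy = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

# shift() and expanding().mean() both run inside each group,
# so the first row of every group has no previous rows and becomes NaN
dummy["grouped"] = dummy.groupby("col1")["col2"].transform(
    lambda s: s.shift().expanding().mean()
)

# shift() runs per group, but expanding().mean() then runs over the
# whole column, so values from earlier groups leak into later ones
dummy["mixed"] = dummy.groupby("col1")["col2"].shift().expanding().mean()

print(dummy)
```

The grouped column should match the two_liner column from the question, while mixed reproduces the question's one_liner column.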


0 votes

Another solution that avoids apply and its impact on execution time for large datasets:

temp = pd.concat(
    [
        df,
        df.rename(columns={"col2": "one_liner"})
        .set_index("col1", append=True)
        .groupby("col1")["one_liner"].shift()
        .groupby("col1").expanding().mean()
        .droplevel(-1)
        .droplevel(0),
    ],
    axis=1,
)

pd.concat(
    [
        temp,
        temp.rename(columns={"one_liner": "two_liner"})
        .set_index("col1", append=True)
        .groupby("col1")["two_liner"].shift()
        .groupby("col1").expanding().mean()
        .droplevel(-1)
        .droplevel(0),
    ],
    axis=1,
)

Output:

  col1 col2  one_liner  two_liner
0    a    1        NaN        NaN
1    a    2        1.0        NaN
2    a    3        1.5        1.0
3    b    4        NaN        NaN
4    b    5        4.0        NaN
5    b    6        4.5        4.0
6    c    7        NaN        NaN
7    c    8        7.0        NaN

It's not very pretty, but it gets the job done.
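If avoiding apply is the goal, a shorter route may be to build the shifted mean from per-group cumulative sums and counts. The sketch below is an alternative technique, not the method from this answer. It assumes col2 is numeric and contains no missing values, and the column name shifted_mean is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

g = df.groupby("col1")["col2"]

# sum of the rows above the current one, within the group
prev_sum = g.cumsum() - df["col2"]

# number of rows above the current one, within the group (0 for the first row)
prev_count = g.cumcount()

# 0 / 0 on the first row of each group evaluates to NaN
df["shifted_mean"] = prev_sum / prev_count

print(df)
```

The result is the mean of all previous rows in the same group, with NaN on each group's first row. If col2 can contain NaN, the sum and the count would no longer line up, so this shortcut would need adjusting.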
