I want an expanding mean of col2 based on groupby('col1'), but the mean should not include the row itself (only the rows above it).
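import pandas as pd

# col2 holds integers; string values such as '1' cannot be averaged
dummy = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "b", "c", "c"], "col2": [1, 2, 3, 4, 5, 6, 7, 8]}, index=list(range(8)))
print(dummy)
# Attempt 1: shift within groups, then take the expanding mean in the same chain
dummy['one_liner'] = dummy.groupby('col1').col2.shift().expanding().mean().reset_index(level=0, drop=True)
# Attempt 2: store the shifted column, then group again before the expanding mean
dummy['two_liner'] = dummy.groupby('col1').col2.shift()
dummy['two_liner'] = dummy.groupby('col1').two_liner.expanding().mean().reset_index(level=0, drop=True)
print(dummy)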
---------------------------
Here is the result of the first print statement:
  col1  col2
0    a     1
1    a     2
2    a     3
3    b     4
4    b     5
5    b     6
6    c     7
7    c     8
Here is the result of the second print statement:
  col1  col2  one_liner  two_liner
0    a     1        NaN        NaN
1    a     2   1.000000        1.0
2    a     3   1.500000        1.5
3    b     4   1.500000        NaN
4    b     5   2.333333        4.0
5    b     6   3.000000        4.5
6    c     7   3.000000        NaN
7    c     8   3.800000        7.0
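I expected the two results to be identical. two_liner is the expected result; one_liner mixes numbers across groups.
It took me a long time to arrive at this solution. Can anyone explain the logic? Why doesn't one_liner give the expected result?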
You are looking for expanding().mean() and shift() within the groupby():
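# df is the question's frame (called dummy there) with columns col1 and col2
groups = df.groupby('col1')
# both steps run inside each group, so nothing leaks across group boundaries
df['one_liner'] = groups.col2.apply(lambda x: x.expanding().mean().shift())
df['two_liner'] = groups.one_liner.apply(lambda x: x.expanding().mean().shift())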
Output:
  col1  col2  one_liner  two_liner
0    a     1        NaN        NaN
1    a     2        1.0        NaN
2    a     3        1.5        1.0
3    b     4        NaN        NaN
4    b     5        4.0        NaN
5    b     6        4.5        4.0
6    c     7        NaN        NaN
7    c     8        7.0        NaN
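As a hedged variant of the same idea, transform can stand in for apply. This is a sketch rather than the answerer's code: the column name prev_mean is made up for illustration, and the motivation is an assumption that some recent pandas versions may prepend the group key to the index returned by apply, which would get in the way of assigning the result back as a column. transform returns a result indexed like the original frame.

```python
import pandas as pd

dummy = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Shift and expanding mean both run inside each group; transform keeps
# the original row index, so the result can be assigned directly.
# "prev_mean" is a hypothetical column name.
dummy["prev_mean"] = dummy.groupby("col1")["col2"].transform(
    lambda x: x.shift().expanding().mean()
)
print(dummy)
```

The first row of every group comes out as NaN because there are no earlier rows to average.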
Explanation:
(dummy.groupby('col1').col2.shift()    # shifts col2 within each group
      .expanding().mean()              # ignores the grouping and expands over the whole series
      .reset_index(level=0, drop=True) # not really important here
)
So the chained command above is equivalent to
s1 = dummy.groupby('col1').col2.shift()
s2 = s1.expanding().mean()
s3 = s2.reset_index(level=0, drop=True)
As you can see, only s1 takes the grouping by col1 into account.
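To make that concrete, here is a small sketch (the variable names global_mean and grouped_mean are made up for illustration). It shows that the grouped shift returns an ordinary Series, so a following expanding() runs over all rows, and that grouping the shifted Series again restores per-group windows.

```python
import pandas as pd

dummy = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

# shift() on a grouped column returns a plain Series; the only trace of
# the groups is the NaN at the start of each one.
s1 = dummy.groupby("col1")["col2"].shift()
print(type(s1).__name__)
print(s1.tolist())

# expanding() on that plain Series spans every row and skips the NaNs,
# so values from earlier groups leak into later ones.
global_mean = s1.expanding().mean()

# Grouping again before expanding() restores per-group windows; the extra
# col1 index level added by the grouped expanding is then dropped.
grouped_mean = (
    s1.groupby(dummy["col1"]).expanding().mean().reset_index(level=0, drop=True)
)
print(pd.DataFrame({"global": global_mean, "grouped": grouped_mean}))
```

At row 4, for example, the global version averages 1, 2 and 4, while the grouped version only sees 4.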
Another solution that avoids apply and its impact on execution time for large datasets:
temp = pd.concat(
    [
        df,
        df.rename(columns={"col2": "one_liner"})
        .set_index("col1", append=True)
        .groupby("col1")["one_liner"].shift()
        .groupby("col1").expanding().mean()
        .droplevel(-1)
        .droplevel(0),
    ],
    axis=1,
)
pd.concat(
    [
        temp,
        temp.rename(columns={"one_liner": "two_liner"})
        .set_index("col1", append=True)
        .groupby("col1")["two_liner"].shift()
        .groupby("col1").expanding().mean()
        .droplevel(-1)
        .droplevel(0),
    ],
    axis=1,
)
Output:
  col1  col2  one_liner  two_liner
0    a     1        NaN        NaN
1    a     2        1.0        NaN
2    a     3        1.5        1.0
3    b     4        NaN        NaN
4    b     5        4.0        NaN
5    b     6        4.5        4.0
6    c     7        NaN        NaN
7    c     8        7.0        NaN
It's not very sexy, but it gets the job done.
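For comparison, one more apply-free sketch. It is not taken from either answer and swaps in a different technique: the mean of the earlier rows in a group equals the running sum minus the current value, divided by the number of earlier rows. It assumes col2 is numeric with no missing values, and prior_sum, prior_count and prev_mean are made-up names.

```python
import pandas as pd

dummy = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

g = dummy.groupby("col1")["col2"]

# Sum of the rows above the current one within its group.
prior_sum = g.cumsum() - dummy["col2"]

# Number of rows above the current one within its group (0 for the first).
prior_count = g.cumcount()

# Mask the zero counts so the first row of each group becomes NaN
# instead of a division by zero.
dummy["prev_mean"] = prior_sum / prior_count.where(prior_count > 0)
print(dummy)
```

Whether this is actually faster than the other approaches on a large frame would need to be measured.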