I have a set of sports match data in the following form:
import pandas as pd
winner = ['A', 'A', 'B', 'C', 'A', 'C', 'C', 'B']
loser = ['B', 'C', 'A', 'A', 'B', 'A', 'B', 'C']
P1 = ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B']
P2 = ['B', 'C', 'B', 'C', 'B', 'C', 'C', 'C']
P1_win = [ 1, 1, 0, 0, 1, 0, 0, 0]
df = pd.DataFrame({'winner': winner, 'loser': loser, 'P1':P1, 'P2':P2, 'P1_win':P1_win})
df
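I want to compute the running win streaks for P1 and P2. However, when I do this, the streak does not reset when P1_win == 0.
The code I am using to compute the streaks is: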
condition = df.P1_win.eq(0)
df['Reset'] = condition.groupby(df.P1_win).cumsum()  # intended as a reset marker: when P1_win == 0, the streak should restart
df['P1_win_Streak'] = df.P1_win.mask(condition, 0).groupby([df.winner, df.Reset]).cumsum()
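What happens is that whenever a streak ends, a 0 is correctly written to the streak column, but the next streak then continues from the previous value instead of starting over.
Any help with fixing this would be much appreciated!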
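Vectorizing this seems difficult, but the following may work for smaller DataFrames (fewer than about 1 million rows) before you iterate on making it faster (you could also convert to NumPy with df.to_numpy):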
def calc_streaks(df: pd.DataFrame) -> pd.DataFrame:
    player_streaks = {}  # current win streak for each player seen so far
    p1_streak = []
    p2_streak = []
    for _, row in df.iterrows():
        # The loser's streak resets to 0; the winner's streak grows by 1.
        player_streaks[row['loser']] = 0
        player_streaks[row['winner']] = player_streaks.get(row['winner'], 0) + 1
        # Record the streaks of both players after this match.
        p1_streak.append(player_streaks[row['P1']])
        p2_streak.append(player_streaks[row['P2']])
    df['P1_streak'] = p1_streak
    df['P2_streak'] = p2_streak
    return df
Output:
  winner loser P1 P2  P1_win  P1_streak  P2_streak
0      A     B  A  B       1          1          0
1      A     C  A  C       1          2          0
2      B     A  A  B       0          0          1
3      C     A  A  C       0          0          1
4      A     B  A  B       1          1          0
5      C     A  A  C       0          0          2
6      C     B  B  C       0          0          3
7      B     C  B  C       0          1          0
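As a rough sketch of the df.to_numpy suggestion above, the same loop can run over a plain NumPy array instead of df.iterrows(), which avoids building a Series for every row. The function name calc_streaks_arrays is hypothetical (it is not part of the original answer), and this version returns a new DataFrame via assign rather than modifying df in place; whether it is actually faster on your data is something to measure rather than assume.

```python
import pandas as pd


def calc_streaks_arrays(df: pd.DataFrame) -> pd.DataFrame:
    """Same streak logic as calc_streaks, iterating over a NumPy array.

    Hypothetical helper name; assumes the columns 'winner', 'loser',
    'P1' and 'P2' exist, as in the question's DataFrame.
    """
    # One 2-D object array; each row unpacks into the four player labels.
    rows = df[['winner', 'loser', 'P1', 'P2']].to_numpy()
    player_streaks = {}  # current win streak per player
    p1_streak, p2_streak = [], []
    for winner, loser, p1, p2 in rows:
        # The loser's streak resets; the winner's streak grows by 1.
        player_streaks[loser] = 0
        player_streaks[winner] = player_streaks.get(winner, 0) + 1
        p1_streak.append(player_streaks[p1])
        p2_streak.append(player_streaks[p2])
    # Return a copy with the two streak columns added.
    return df.assign(P1_streak=p1_streak, P2_streak=p2_streak)


df = pd.DataFrame({
    'winner': ['A', 'A', 'B', 'C', 'A', 'C', 'C', 'B'],
    'loser':  ['B', 'C', 'A', 'A', 'B', 'A', 'B', 'C'],
    'P1':     ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'P2':     ['B', 'C', 'B', 'C', 'B', 'C', 'C', 'C'],
    'P1_win': [1, 1, 0, 0, 1, 0, 0, 0],
})
result = calc_streaks_arrays(df)
```

On the sample data, result should contain the same P1_streak and P2_streak columns as the output shown above.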