How to reassign column values from groupby.aggregate back to the original DataFrame in dask?


I have a dataset like this, where each row holds one player's data:

>>> df.head()
game_size match_id party_size player_assists player_kills player_name team_id team_placement
0 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 1 鼻烟壶 4 18
1 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 1 臭氧3r 4 18
2 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 0 牛化 5 33
3 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 0 sbahn87 5 33
4 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 2 双子座ZZZ 14 11

Source: full dataset - 126MB compressed, 1.18GB uncompressed

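I need to create a new column called `weights`, where each row is a number between 0 and 1. It should be computed as each player's total kills (`player_kills`) divided by the total kills of that player's team.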
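To make the target definition concrete, here is a minimal pure-Python sketch of the calculation on a few made-up rows. The row values and the names `rows`, `team_totals`, and `weights` are illustrative assumptions, not taken from the dataset. The toy data deliberately avoids teams with zero total kills, where the ratio is undefined.

```python
from collections import defaultdict

# Hypothetical toy rows: (match_id, team_id, player_name, player_kills)
rows = [
    ("m1", 4, "a", 1),
    ("m1", 4, "b", 1),
    ("m1", 14, "c", 2),
    ("m1", 14, "d", 0),
]

# Total kills per (match_id, team_id)
team_totals = defaultdict(int)
for match_id, team_id, _, kills in rows:
    team_totals[(match_id, team_id)] += kills

# Each player's share of their team's kills
weights = [
    kills / team_totals[(match_id, team_id)]
    for match_id, team_id, _, kills in rows
]
print(weights)  # [0.5, 0.5, 1.0, 0.0]
```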

My attempt

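My initial idea was to run a groupby aggregation and store the result in a new column called `total_kills`. Creating the `weights` column is then easy: each row is just `player_kills` divided by `total_kills`. Here is the code so far, which computes the groupby sum.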

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.read_csv("pubg.csv")
print(df.compute().head().to_markdown())
total_kills = df.groupby(
    ['match_id', 'team_id']
).aggregate({"player_kills": 'sum'}).reset_index()
print(total_kills.compute().head().to_markdown())
match_id team_id player_kills
0 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 4 2
1 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 5 0
2 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 14 2
3 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 15 0
4 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 17 1

So far, so good. But trying to assign the new `player_kills` column back with this line of code does not work:

df['total_kills'] = total_kills['player_kills']

It produces this error:

Traceback (most recent call last):
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py", line 11, in <module>
    df['total_kills'] = total_kills['player_kills']
    ~~^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 4952, in __setitem__
    df = self.assign(**{key: value})
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 5401, in assign
    data = elemwise(methods.assign, data, *pairs, meta=df2)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 6505, in elemwise
    args = _maybe_align_partitions(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 176, in _maybe_align_partitions
    dfs2 = iter(align_partitions(*dfs)[0])
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 130, in align_partitions
    raise ValueError(
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

How can I fix this?

python pandas dataframe dask dask-dataframe
1 Answer

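I think the two DataFrames have different shapes (df has one row per player, while total_kills has one row per match_id and team_id pair), so the direct assignment runs into an index problem. You can try this: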

total_kills = df.groupby(['match_id', 'team_id']).agg(player_total_kills=("player_kills", 'sum')).reset_index()
# df is a dask DataFrame, so use its own merge (pd is never imported in the question's code)
df_final = df.merge(total_kills, on=["match_id", "team_id"])

I hope it works.
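The index problem can be illustrated in plain pandas, used here only for brevity. This is a sketch on made-up data, and the names `players`, `totals`, `misaligned`, and `aligned` are hypothetical. In pandas the direct assignment aligns on the index and silently yields wrong or missing values. In dask, as the traceback in the question shows, the same assignment raises instead, because the divisions of the two frames are unknown.

```python
import pandas as pd

# Hypothetical toy data: four player rows, two teams.
players = pd.DataFrame(
    {"team_id": [4, 4, 14, 14], "player_kills": [1, 1, 3, 0]}
)

# Two rows, one per team, with a fresh 0..1 index after reset_index().
totals = players.groupby("team_id").agg({"player_kills": "sum"}).reset_index()

# Assignment aligns on the index, not on team_id: rows 0 and 1 receive the
# totals of teams 4 and 14 by position, and rows 2 and 3 receive NaN.
misaligned = players.assign(total_kills=totals["player_kills"])
print(misaligned)

# Merging on the key column gives every player row its own team's total.
aligned = players.merge(
    totals.rename(columns={"player_kills": "total_kills"}),
    on="team_id",
    how="left",
)
print(aligned)
```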
