I have a dataset like this, where each row holds the data for one player:
>>> df.head()
|   | game_size | match_id | party_size | player_assists | player_kills | player_name | team_id | team_placement |
|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | 鼻烟壶 | 4 | 18 |
| 1 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | 臭氧3r | 4 | 18 |
| 2 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | 牛化 | 5 | 33 |
| 3 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | sbahn87 | 5 | 33 |
| 4 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 2 | 双子座ZZZ | 14 | 11 |
Source: full dataset (126 MB compressed, 1.18 GB uncompressed)
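I need to create a new column called `weights`, where each row is a number between 0 and 1. It should be computed as each player's kills (`player_kills`) divided by the total kills of that player's team.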
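My initial idea was to use a groupby aggregation to create a new column called `total_kills`. Creating the `weights` column is then easy: each row is just `player_kills` divided by `total_kills`. Here is the code I have so far to compute the groupby sum.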
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
df = dd.read_csv("pubg.csv")
print(df.compute().head().to_markdown())
total_kills = df.groupby(
['match_id', 'team_id']
).aggregate({"player_kills": 'sum'}).reset_index()
print(total_kills.compute().head().to_markdown())
|   | match_id | team_id | player_kills |
|---|---|---|---|
| 0 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 4 | 2 |
| 1 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 5 | 0 |
| 2 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 14 | 2 |
| 3 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 15 | 0 |
| 4 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 17 | 1 |
So far, so good. But trying to assign the summed `player_kills` column back to the original dataframe with this line of code does not work:
df['total_kills'] = total_kills['player_kills']
It produces this error:
Traceback (most recent call last):
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py", line 11, in <module>
df['total_kills'] = total_kills['player_kills']
~~^^^^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 4952, in __setitem__
df = self.assign(**{key: value})
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 5401, in assign
data = elemwise(methods.assign, data, *pairs, meta=df2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 6505, in elemwise
args = _maybe_align_partitions(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 176, in _maybe_align_partitions
dfs2 = iter(align_partitions(*dfs)[0])
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 130, in align_partitions
raise ValueError(
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
How can I fix this?
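I think the two dataframes do not have the same shape (one row per player versus one row per team), so their indexes cannot be lined up. You can try merging on the keys instead: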
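# Sum kills per team within each match, using the dict form of agg that already works in the question.
total_kills = df.groupby(['match_id', 'team_id']).agg({"player_kills": 'sum'}).rename(columns={"player_kills": "player_total_kills"}).reset_index()
# dd.merge is Dask's own merge, so pandas does not need to be imported; rows are matched on the keys, not on the index.
df_final = dd.merge(left=df, right=total_kills, on=["match_id", "team_id"])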
I hope that works.