I want to build the Cartesian product of two DataFrames within each group.
import pandas as pd
df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})
target = pd.DataFrame({
    "A_x": [3] * 4 + [4] * 4 + [5, 6],
    "B_x": [6] * 8 + [6, 7],
    "A_y": [3, 4, 3, 4] * 2 + [3, 3],
    "B_y": [5, 5, 6, 6] * 2 + [7, 7],
    "group": [1] * 8 + [2, 2],
})
>>> df1
   group  A  B
0      1  3  5
1      1  4  5
2      1  3  6
3      1  4  6
4      2  3  7
>>> df2
   group  A  B
0      1  3  6
1      1  4  6
2      2  5  6
3      2  6  7
>>> target
   A_x  B_x  A_y  B_y  group
0    3    6    3    5      1
1    3    6    4    5      1
2    3    6    3    6      1
3    3    6    4    6      1
4    4    6    3    5      1
5    4    6    4    5      1
6    4    6    3    6      1
7    4    6    4    6      1
8    5    6    3    7      2
9    6    7    3    7      2
At the moment I do this in a loop, but I am sure there must be a more pandas-like solution that performs better on larger datasets.
df = pd.DataFrame()
for group in df1.group.unique():
    df_merged = df2.loc[df2.group == group].merge(df1.loc[df1.group == group], "cross")
    df = pd.concat([df, df_merged], axis=0)
df["group"] = df.group_x
df.drop(["group_x", "group_y"], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
>>> df.equals(target)
True
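You can use the groupby operation in pandas. It lets you operate on each group of rows without writing an explicit loop, although apply still calls the function once per group.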
import pandas as pd
df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})
# Group by 'group' and then compute cartesian product within each group
result = (
    df1.groupby("group")
    .apply(lambda group_df1: df2.query("group == @group_df1.name").merge(group_df1, "cross"))
    .reset_index(drop=True)
)
# Cleanup column names
result["group"] = result["group_x"]
result.drop(["group_x", "group_y"], axis=1, inplace=True)
print(result)
Output:
   A_x  B_x  A_y  B_y  group
0    3    6    3    5      1
1    3    6    4    5      1
2    3    6    3    6      1
3    3    6    4    6      1
4    4    6    3    5      1
5    4    6    4    5      1
6    4    6    3    6      1
7    4    6    4    6      1
8    5    6    3    7      2
9    6    7    3    7      2
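As a possible alternative, here is a minimal sketch that avoids both the loop and groupby.apply. It relies on the fact that an inner merge on a key with repeated values pairs every matching row on the left with every matching row on the right, which amounts to a Cartesian product inside each group. The names merged, expected_sizes and actual_sizes are only illustrative and do not come from the question. The sketch assumes that groups present in only one of the two frames should be dropped, and it has not been benchmarked here, so whether it is faster on a large dataset is something to measure rather than take for granted.

```python
import pandas as pd

df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})

# An inner join on "group" pairs every df2 row with every df1 row that
# shares its group value, i.e. a Cartesian product within each group.
# df2 is the left frame so its columns receive the "_x" suffix, as in the question.
merged = df2.merge(df1, on="group", how="inner")

# Put "group" last to match the column layout used in the question.
merged = merged[["A_x", "B_x", "A_y", "B_y", "group"]]

# Sanity check: each group should hold (rows in df1 group) * (rows in df2 group) rows.
expected_sizes = df1.groupby("group").size() * df2.groupby("group").size()
actual_sizes = merged.groupby("group").size()

print(merged)
print((expected_sizes == actual_sizes).all())
```

Because the key column is shared, merge keeps a single group column, so the group_x / group_y cleanup step is not needed. If the exact row order matters, it may be safer to sort the result explicitly than to depend on the order the merge happens to return.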