Performing a grouped Cartesian (cross) join of two DataFrames [duplicate]


I want to build the Cartesian product of two DataFrames within each group.

import pandas as pd

df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})

target = pd.DataFrame({"A_x": [3] * 4 + [4] * 4 + [5, 6],
                       "B_x": [6] * 8 + [6, 7],
                       "A_y": [3, 4, 3, 4] * 2 + [3, 3],
                       "B_y": [5, 5, 6, 6] * 2 + [7, 7],
                       "group": [1] * 8 + [2, 2],
                       })

>>> df1
   group  A  B
0      1  3  5
1      1  4  5
2      1  3  6
3      1  4  6
4      2  3  7


>>> df2
   group  A  B
0      1  3  6
1      1  4  6
2      2  5  6
3      2  6  7

>>> target
   A_x  B_x  A_y  B_y  group
0    3    6    3    5      1
1    3    6    4    5      1
2    3    6    3    6      1
3    3    6    4    6      1
4    4    6    3    5      1
5    4    6    4    5      1
6    4    6    3    6      1
7    4    6    4    6      1
8    5    6    3    7      2
9    6    7    3    7      2

Currently I do this in a loop, but I am sure there must be a more pandas-like solution that performs better on larger datasets.

df = pd.DataFrame()
for group in df1.group.unique():
    df_merged = df2.loc[df2.group == group].merge(df1.loc[df1.group == group], "cross")
    df = pd.concat([df, df_merged], axis=0)

df["group"] = df.group_x
df.drop(["group_x", "group_y"], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)

>>> df.equals(target)
True
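
For reference, the loop above rebuilds the accumulated frame with pd.concat on every iteration. A minimal sketch of the same loop that collects the per-group pieces in a list and concatenates once is shown below. It assumes the same df1 and df2; the names pieces and looped are illustrative only, and dropping group before the cross merge is one possible way to avoid the group_x / group_y cleanup.

```python
import pandas as pd

df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})

pieces = []  # one cross-merged frame per group key
for key in df1["group"].unique():
    # Select this group's rows and drop the key so the cross merge does not
    # produce group_x / group_y columns
    left = df2.loc[df2["group"] == key].drop(columns="group")
    right = df1.loc[df1["group"] == key].drop(columns="group")
    # Re-attach the key as a single "group" column after the cross merge
    pieces.append(left.merge(right, how="cross").assign(group=key))

# Concatenate once at the end instead of inside the loop
looped = pd.concat(pieces, ignore_index=True)
print(looped)
```

This is still one merge per group, so it only removes the repeated copying, not the Python-level loop itself.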
python pandas merge cartesian-product
1 Answer

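You can use the groupby operation in pandas. It lets you process each group of rows without writing an explicit loop, although apply still runs the merge once per group.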

import pandas as pd


df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})

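# Group df1 by "group"; for each group, cross-merge the matching rows of df2
# (group_df1.name holds the current group key) with that group's rows.
# Assumes apply passes the "group" column through to each group frame,
# so the cross merge yields both group_x and group_y.
result = (
    df1.groupby("group")
    .apply(lambda group_df1: df2.query("group == @group_df1.name").merge(group_df1, how="cross"))
    .reset_index(drop=True)
)

# The cross merge suffixes the shared "group" column; collapse it back into one
result["group"] = result["group_x"]
result = result.drop(columns=["group_x", "group_y"])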

print(result)

Output:

    A_x  B_x  A_y  B_y  group
0     3    6    3    5      1
1     3    6    4    5      1
2     3    6    3    6      1
3     3    6    4    6      1
4     4    6    3    5      1
5     4    6    4    5      1
6     4    6    3    6      1
7     4    6    4    6      1
8     5    6    3    7      2
9     6    7    3    7      2
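
As an alternative worth considering: an ordinary merge on a key that repeats on both sides already pairs every matching row of one frame with every matching row of the other, which is a Cartesian product within each group. A minimal sketch, assuming the same df1 and df2 as in the question; the name merged and the explicit column reordering are illustrative choices, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"group": [1, 1, 1, 1, 2], "A": [3, 4, 3, 4, 3], "B": [5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"group": [1, 1, 2, 2], "A": [3, 4, 5, 6], "B": [6, 6, 6, 7]})

# Inner join on "group": each df2 row is paired with every df1 row
# that shares its group value (a many-to-many join)
merged = df2.merge(df1, on="group", suffixes=("_x", "_y"))

# Reorder the columns to match the layout of the question's target frame
merged = merged[["A_x", "B_x", "A_y", "B_y", "group"]]
print(merged)
```

This avoids both the loop and groupby.apply. Note that groups present in only one of the frames are dropped by the inner join, and the row order may differ from the loop version when group values are interleaved, so sort before comparing if order matters.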