How do I count the unique customer_id values that overlap across combined categories per month in a Python DataFrame?

Question

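I am working with a Python DataFrame that contains financial transaction information. The DataFrame has the following relevant columns: customer_id, year, month, product_type. Each row represents one transaction made by a customer in a specific month, and product_type indicates the transaction category.

My goal is to count the number of unique customer_id values that overlap across combined categories for each month. In other words, I want to know how many customers made transactions in more than one category within a given month.

For example, if in June 2022 there are 10 unique customer_id values in "Category A" and 6 unique customer_id values in "Category B", how can I calculate how many customer_id values overlap between these two categories in that month?

Here is part of the code I tried for this: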

import pandas as pd
from itertools import combinations

# DataFrame df with transaction information

categories = ["Category A", "Category B", "Category C", "Category D", "Category E"]

combined_tables = pd.DataFrame()

for i in range(1, len(categories) + 1):
    for combo in combinations(categories, i):
        combo_name = ' & '.join(combo)
        filtered_df = df[df['product_type'].isin(combo)]
        
        # Count how many customer_ids overlap in the same category in each month
        grouped = filtered_df.groupby(['year', 'month', 'product_type'])['customer_id'].nunique().reset_index()
        grouped.rename(columns={'customer_id': combo_name}, inplace=True)
        
        if combined_tables.empty:
            combined_tables = grouped
        else:
            combined_tables = combined_tables.merge(grouped, on=['year', 'month', 'product_type'], how='left')

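I want to obtain a DataFrame showing the number of overlapping unique customer_id values across combined categories for each month. The result should contain columns for year, month, and each category combination, and the values should be counts of unique customer_id values.

I expect a table like this: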

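Year	Month	Category A	Category B	Category C	Category A & Category B	Category A & Category C	Category B & Category C	Category A & Category B & Category C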
2022	6	10	6	8	2	5	3	1
2022	7	7	9	4	3	2	1	0
2022	8	6	5	7	2	4	3	1
2022	9	9	4	8	1	3	2	0
2022	10	8	7	6	2	2	1	0
2022	11	7	8	5	1	2	1	0
2022	12	8	9	4	2	1	1	0
2023	1	9	6	7	1	2	1	0
2023	2	7	5	8	1	3	2	0
2023	3	8	4	9	2	4	3	1
2023	4	6	7	6	2	2	1	0
2023	5	5	6	7	1	3	2	0
2023	6	6	5	8	2	4	3	1
2023	7	7	4	9	1	2	1	0

The table should cover all possible combinations of the categories.

Answer 1

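There isn't much information about the input DataFrame, but I put together an idea for you using nested dictionaries together with the len() and set() functions. See below.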

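# Build a nested dict: data[category][year][month] -> list of customer_ids
data = {}
for category in set(df.product_type):
    data[category] = {}
    for y in set(df.year):
        data[category][y] = {}
        for m in set(df.month):
            data[category][y][m] = []

# Look up each column position once instead of on every row
col = {name: df.columns.get_loc(name)
       for name in ('product_type', 'year', 'month', 'customer_id')}

# Append every transaction's customer_id to its (category, year, month) bucket
for row in range(len(df)):
    category = df.iat[row, col['product_type']]
    y = df.iat[row, col['year']]
    m = df.iat[row, col['month']]
    cid = df.iat[row, col['customer_id']]
    data[category][y][m] += [cid]

# One output row per (year, month); one column per category holding
# the number of unique customer_ids seen in that bucket
header = ['year', 'month'] + list(set(df.product_type))
rows = []
for y in set(df.year):
    for m in set(df.month):
        l = [y, m]
        for c in header[2:]:
            l += [len(set(data[c][y][m]))]
        rows += [l]
pd.DataFrame(data=rows, columns=header)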

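It can definitely be improved and optimized, but it works well. I ran it on a sample input with 100,000 rows, and it took about 8 seconds on my machine. Hope this helps :)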
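
Note that the code above counts unique customers per single category only. To get the overlaps the question asks about, one option is to keep a set of customer_ids per (year, month, product_type) and intersect those sets for every combination of categories. The following is only a minimal sketch under that assumption: the function name overlap_counts and the sample data are made up for illustration, and the column names are taken from the question.

```python
from itertools import combinations

import pandas as pd


def overlap_counts(df, categories=None):
    """Per (year, month), count customers present in every category of each combination."""
    if categories is None:
        categories = sorted(df["product_type"].unique())

    # One set of customer_ids per (year, month, product_type).
    id_sets = df.groupby(["year", "month", "product_type"])["customer_id"].agg(set)

    rows = []
    for (year, month), per_category in id_sets.groupby(level=["year", "month"]):
        # Map category -> set of customers for this month.
        by_cat = per_category.droplevel(["year", "month"]).to_dict()
        row = {"year": year, "month": month}
        for size in range(1, len(categories) + 1):
            for combo in combinations(categories, size):
                # Customers who transacted in *every* category of the combo;
                # a category with no transactions that month counts as empty.
                common = set.intersection(*(by_cat.get(c, set()) for c in combo))
                row[" & ".join(combo)] = len(common)
        rows.append(row)

    return pd.DataFrame(rows)


# Hypothetical sample data, only to show the call.
sample = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3, 1, 1],
    "year": [2022] * 7,
    "month": [6, 6, 6, 6, 6, 7, 7],
    "product_type": ["Category A", "Category A", "Category B", "Category B",
                     "Category C", "Category A", "Category A"],
})
result = overlap_counts(sample)
```

With n categories this produces 2^n - 1 columns, so for many categories it may be worth passing an explicit, shorter categories list or limiting the combination size.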
