Pandas: how to improve performance when iteration is too slow


I have a dataset like this:

id   open_month  end_month   quantity
001  2023-01-31  2023-02-28  1
002  2023-01-31  2023-03-31  5
003  2023-01-31  2023-04-30  4
004  2023-02-28  2023-02-28  2
005  2023-02-28  2023-03-31  3
006  2023-02-28  2023-04-30  6
007  2023-03-31  2023-03-31  7
008  2023-03-31  2023-04-30  9

import pandas as pd

x = pd.DataFrame({
  'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
  'open_month': ['2023-01-31', '2023-01-31', '2023-01-31', '2023-02-28', '2023-02-28', '2023-02-28', '2023-03-31', '2023-03-31'],
  'end_month': ['2023-02-28', '2023-03-31', '2023-04-30', '2023-02-28', '2023-03-31', '2023-04-30', '2023-03-31', '2023-04-30'],
  'quantity': [1, 5, 4, 2, 3, 6, 7, 9]
})
# Parse the date strings so that date offsets and comparisons work
x['open_month'] = pd.to_datetime(x['open_month'])
x['end_month'] = pd.to_datetime(x['end_month'])

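I need a table in which the rows are the number of months since opening (starting at 0), the columns are the opening months, and the values are the total quantity of items that ended within that many months of opening.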

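Desired result:

   2023-01-31  2023-02-28  2023-03-31
0           0           2           7
1           1           5          16
2           6          11          16

My current code:

# Requires open_month and end_month to be datetime columns
table = pd.DataFrame()
for i in range(len(x['open_month'].unique())):
    for month in x['open_month'].unique():
        # Last day of the i-th month after the opening month
        date = month + pd.offsets.MonthEnd(i)
        # Total quantity opened in `month` that ended on or before `date`
        table.at[i, month] = x.query('open_month == @month and end_month <= @date')['quantity'].sum()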

My code works as expected, but it is too slow on the real data (more than 2 million IDs).
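
For reference, the cutoff date in the loop comes from `pd.offsets.MonthEnd(i)`, which moves a date forward by `i` month ends. A minimal sketch of that step, assuming the opening date is already a month-end timestamp as in the sample data (the names `open_month` and `cutoffs` are only for illustration):

```python
import pandas as pd

# Assumption: the opening date is already a month-end Timestamp.
open_month = pd.Timestamp('2023-01-31')

# MonthEnd(0) leaves a month-end date unchanged; MonthEnd(i) moves it
# forward to the i-th following month end.
cutoffs = [open_month + pd.offsets.MonthEnd(i) for i in range(3)]

print([d.strftime('%Y-%m-%d') for d in cutoffs])
# ['2023-01-31', '2023-02-28', '2023-03-31']
```

So row `i` of the wanted table counts everything that ended on or before the `i`-th month end after opening.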

pandas iteration
1 Answer

Try building the DataFrame with a list comprehension:

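import numpy as np

# Assumes open_month and end_month are datetime columns (e.g. via pd.to_datetime)
open_months = x['open_month'].unique()
# One inner list per opening month, one entry per month offset i;
# transpose so that rows are offsets and columns are opening months
df = pd.DataFrame(np.array([[x[x['open_month'].eq(m) 
                               & x['end_month'].le(m + pd.offsets.MonthEnd(i))]['quantity'].sum() 
                             for i in range(len(open_months))]
                             for m in open_months]).T, columns=open_months)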

   2023-01-31  2023-02-28  2023-03-31
0           0           2           7
1           1           5          16
2           6          11          16
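
Note that the list comprehension still filters the whole frame once per (opening month, offset) pair. If that is still too slow, one possible alternative is to compute each row's month offset once, sum the quantities per offset, and take a running total down each column. This is only a sketch and has not been timed on the real 2-million-row data. It assumes both date columns hold month-end dates, as in the sample; `offset`, `per_offset` and `fast_table` are names made up here.

```python
import pandas as pd

x = pd.DataFrame({
    'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
    'open_month': pd.to_datetime(['2023-01-31', '2023-01-31', '2023-01-31',
                                  '2023-02-28', '2023-02-28', '2023-02-28',
                                  '2023-03-31', '2023-03-31']),
    'end_month': pd.to_datetime(['2023-02-28', '2023-03-31', '2023-04-30',
                                 '2023-02-28', '2023-03-31', '2023-04-30',
                                 '2023-03-31', '2023-04-30']),
    'quantity': [1, 5, 4, 2, 3, 6, 7, 9],
})

# Whole months between opening and ending, computed for all rows at once.
# Assumption: both columns are month-end dates, so a month difference <= i
# is the same condition as end_month <= open_month + MonthEnd(i).
offset = ((x['end_month'].dt.year - x['open_month'].dt.year) * 12
          + (x['end_month'].dt.month - x['open_month'].dt.month))
# Rows that ended before they opened are counted in row 0, as in the loop.
offset = offset.clip(lower=0)

n_months = x['open_month'].nunique()

# Total quantity per (offset, opening month); keep offsets 0..n_months-1 only,
# filling combinations that never occur with 0.
per_offset = (
    x.assign(offset=offset)
     .pivot_table(index='offset', columns='open_month', values='quantity',
                  aggfunc='sum', fill_value=0)
     .reindex(range(n_months), fill_value=0)
)

# Running total down each column: quantity ended within i months of opening.
fast_table = per_offset.cumsum()
print(fast_table)
```

On the sample data this gives the same values as the desired result (0, 1, 6 for 2023-01-31; 2, 5, 11 for 2023-02-28; 7, 16, 16 for 2023-03-31). The frame is scanned a fixed number of times instead of once per cell of the result.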