我有一个像这样的数据集
id | 开放月 | 月末 | 数量 |
---|---|---|---|
001 | 2023-01-31 | 2023-02-28 | 1 |
002 | 2023-01-31 | 2023-03-31 | 5 |
003 | 2023-01-31 | 2023-04-30 | 4 |
004 | 2023-02-28 | 2023-02-28 | 2 |
005 | 2023-02-28 | 2023-03-31 | 3 |
006 | 2023-02-28 | 2023-04-30 | 6 |
007 | 2023-03-31 | 2023-03-31 | 7 |
008 | 2023-03-31 | 2023-04-30 | 9 |
x = pd.DataFrame({
'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
'open_month': ['2023-01-31', '2023-01-31', '2023-01-31', '2023-02-28', '2023-02-28', '2023-02-28', '2023-03-31', '2023-03-31'],
'end_month': ['2023-02-28', '2023-03-31', '2023-04-30', '2023-02-28', '2023-03-31', '2023-04-30', '2023-03-31', '2023-04-30'],
'quantity': [1, 5, 4, 2, 3, 6, 7, 9]
}
)
我需要得到一个表格,其中行是从0开始的月数,列是打开的月份,值是结束项目的数量。
想要的结果
2023-01-31 | 2023-02-28 | 2023-03-31 | |
---|---|---|---|
0 | 0 | 2 | 7 |
1 | 1 | 5 | 16 |
2 | 6 | 11 | 16 |
table = pd.DataFrame()
for i in range(len(x['open_month'].unique())):
for month in x['open_month'].unique():
date = month + pd.offsets.MonthEnd(i)
table.at[i, month] = x.query('open_month == @month and end_month <= @date')['quantity'].sum()
我的代码按我的预期工作,但在实际数据(> 2 百万个 ID)上速度太慢
尝试构建一个基于列表理解的DataFrame:
open_months = x['open_month'].unique()
df = pd.DataFrame(np.array([[x[x['open_month'].eq(m)
& x['end_month'].le(m + pd.offsets.MonthEnd(i))]['quantity'].sum()
for i in range(len(open_months))]
for m in open_months]).T, columns=open_months)
2023-01-31 2023-02-28 2023-03-31
0 0 2 7
1 1 5 16
2 6 11 16