I have a dataframe with a date column and a discount value column, as shown below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2023-09-01', '2023-09-02', '2023-09-03', '2023-09-04', '2023-09-05', '2023-09-06','2023-09-07', '2023-09-08', '2023-09-09', '2023-09-10'],
'discount': [30, 25, 0, 10, 15, 15,0,25,30,0]})
df
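I need to add extra columns containing the values that precede each zero, where the number of extra columns is determined by the number of 0 separators, so that the resulting df looks like this: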
df2 = pd.DataFrame({'date': ['2023-09-01', '2023-09-02', '2023-09-03', '2023-09-04', '2023-09-05', '2023-09-06','2023-09-07', '2023-09-08', '2023-09-09', '2023-09-10'],
'discount': [30, 25, 0, 10, 15, 15,0,25,30,0],
'split1': [30,25,0,0,0,0,0,0,0,0],
'split2': [0,0,0,10,15,15,0,0,0,0],
'split3': [0,0,0,0,0,0,0,25,30,0]})
df2
This is what my attempt looks like so far:
for date, group in df.groupby('date'):
    num_splits = len(group)
    splits = group['discount'].tolist()
    for i in range(num_splits):
        df.loc[group.index[i], f'split{i+1}'] = splits[i]
df
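Note: a non-zero value in the 'discount' column may be followed by several consecutive zeros, so the first non-zero value should be used to define a group.
Any guidance is appreciated.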
You can use pivot:
m = df['discount'].eq(0)
out = df.join(
    df[~m]
    .assign(col=m.cumsum().add(1))
    .pivot(columns='col', values='discount')
    .add_prefix('split')
)
Output:
date discount split1 split2 split3
0 2023-09-01 30 30.0 NaN NaN
1 2023-09-02 25 25.0 NaN NaN
2 2023-09-03 0 NaN NaN NaN
3 2023-09-04 10 NaN 10.0 NaN
4 2023-09-05 15 NaN 15.0 NaN
5 2023-09-06 15 NaN 15.0 NaN
6 2023-09-07 0 NaN NaN NaN
7 2023-09-08 25 NaN NaN 25.0
8 2023-09-09 30 NaN NaN 30.0
9 2023-09-10 0 NaN NaN NaN
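If the zero-filled integer columns from the question are wanted instead of NaN, one possible extension of the pivot approach is sketched below. The reindex, fillna and astype steps are my additions rather than part of the answer above, and the sample frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    'date': [f'2023-09-{d:02d}' for d in range(1, 11)],
    'discount': [30, 25, 0, 10, 15, 15, 0, 25, 30, 0],
})

m = df['discount'].eq(0)
splits = (
    df[~m]
    .assign(col=m.cumsum().add(1))
    .pivot(columns='col', values='discount')
    .add_prefix('split')
    .reindex(df.index)  # bring back the zero rows dropped by df[~m]
    .fillna(0)          # assumption: the desired filler is 0, as in df2
    .astype(int)        # pivot introduced NaN, so the columns were float
)
splits.columns.name = None  # drop the leftover 'col' label on the columns
out = df.join(splits)
print(out)
```

The cast back to int is only safe after fillna, because an integer column cannot hold NaN.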
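Intermediates:
m identifies the zeros, ~m marks the non-zero values used for boolean indexing, and cumsum forms the groups that will be used as columns.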
date discount m ~m col
0 2023-09-01 30 False True 1
1 2023-09-02 25 False True 1
2 2023-09-03 0 True False 2
3 2023-09-04 10 False True 2
4 2023-09-05 15 False True 2
5 2023-09-06 15 False True 2
6 2023-09-07 0 True False 3
7 2023-09-08 25 False True 3
8 2023-09-09 30 False True 3
9 2023-09-10 0 True False 4
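The question mentions that several zeros can follow one another. With m.cumsum() every zero bumps the counter, so consecutive zeros would skip a column number. A minimal sketch of one way around this, assuming a group should begin at each non-zero value that follows a zero (or at the first row); the data below is hypothetical, not the asker's:

```python
import pandas as pd

# Hypothetical data with two consecutive zeros after the first run
df = pd.DataFrame({'discount': [30, 25, 0, 0, 10, 15, 0, 25]})

m = df['discount'].eq(0)

# Column labels as in the pivot answer: each zero increments the counter
col_original = m.cumsum().add(1)[~m]

# Alternative labels: count only the starts of non-zero runs
starts = ~m & m.shift(fill_value=True)
col_runs = starts.cumsum()[~m]

out = df.join(
    df[~m]
    .assign(col=starts.cumsum())
    .pivot(columns='col', values='discount')
    .add_prefix('split')
)
print(col_original.tolist())  # labels skip 2 because of the double zero
print(col_runs.tolist())      # labels are consecutive
print(out)
```

Numbering by run start keeps the split columns contiguous no matter how many zeros separate two runs.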
IIUC, you can use:
TARGET = "discount"
block = df[TARGET].eq(0).shift(fill_value=True).cumsum()
out = (
    df.assign(**{f"split{idx}": g
                 for idx, g in df.groupby(block)[TARGET]})
    .fillna(0)  # optionally cast the split columns back to int with astype()
)
Output:
print(out)
date discount split1 split2 split3
0 2023-09-01 30 30.00 0.00 0.00
1 2023-09-02 25 25.00 0.00 0.00
2 2023-09-03 0 0.00 0.00 0.00
3 2023-09-04 10 0.00 10.00 0.00
4 2023-09-05 15 0.00 15.00 0.00
5 2023-09-06 15 0.00 15.00 0.00
6 2023-09-07 0 0.00 0.00 0.00
7 2023-09-08 25 0.00 0.00 25.00
8 2023-09-09 30 0.00 0.00 30.00
9 2023-09-10 0 0.00 0.00 0.00
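To make the shift step easier to follow, here is a small sketch that prints the intermediate block labels and then casts the split columns back to integers. The split_cols helper list and the astype mapping are my own additions, shown as one option for the optional cast:

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    'date': [f'2023-09-{d:02d}' for d in range(1, 11)],
    'discount': [30, 25, 0, 10, 15, 15, 0, 25, 30, 0],
})

TARGET = "discount"
is_zero = df[TARGET].eq(0)
# Shifting by one row makes the row AFTER each zero start a new block,
# so every zero stays in the block it closes; fill_value=True opens block 1
block = is_zero.shift(fill_value=True).cumsum()
print(block.tolist())

out = df.assign(
    **{f"split{idx}": g for idx, g in df.groupby(block)[TARGET]}
).fillna(0)

# Hypothetical follow-up: restore integer dtype on the split columns only
split_cols = [c for c in out.columns if c.startswith("split")]
out = out.astype({c: int for c in split_cols})
print(out)
```

Note that with consecutive zeros the second zero would form a block of its own, giving an all-zero split column.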
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2023-09-01', '2023-09-02', '2023-09-03', '2023-09-04',
'2023-09-05', '2023-09-06', '2023-09-07', '2023-09-08', '2023-09-09', '2023-09-10'],
'discount': [30, 25, 0, 10, 15, 15, 0, 25, 30, 0]})
def create_split_columns_efficiently(df):
    target_col = "discount"
    groups = df[target_col].eq(0).shift(fill_value=True).cumsum()
    # Create one split column per group; a zero closes the current group
    max_split = groups.max()
    # print(max_split)  # 3
    for i in range(1, max_split + 1):
        # Initial assignment: group values in their own column, NaN elsewhere
        df[f'split{i}'] = np.where(groups == i, df[target_col], np.nan)
        # print(df[f'split{i}'])
        # Set zeros (and the existing NaNs) to NaN
        df.loc[df[f'split{i}'].eq(0) | df[f'split{i}'].isna(), f'split{i}'] = np.nan
        # Forward-fill NaNs. Note: this acts on a temporary all-NaN selection,
        # so df itself is not changed and the NaNs remain in the output
        df.loc[df[f'split{i}'].isna(), f'split{i}'].ffill(inplace=True)
    return df
df_with_split_cols = create_split_columns_efficiently(df)
print(df_with_split_cols)
""" date discount split1 split2 split3
0 2023-09-01 30 30.0 NaN NaN
1 2023-09-02 25 25.0 NaN NaN
2 2023-09-03 0 NaN NaN NaN
3 2023-09-04 10 NaN 10.0 NaN
4 2023-09-05 15 NaN 15.0 NaN
5 2023-09-06 15 NaN 15.0 NaN
6 2023-09-07 0 NaN NaN NaN
7 2023-09-08 25 NaN NaN 25.0
8 2023-09-09 30 NaN NaN 30.0
9 2023-09-10 0 NaN NaN NaN"""
Intermediates (for analysis):
print(df[f'split{i}'])
0 30.0
1 25.0
2 0.0
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
Name: split1, dtype: float64
0 NaN
1 NaN
2 NaN
3 10.0
4 15.0
5 15.0
6 0.0
7 NaN
8 NaN
9 NaN
Name: split2, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 25.0
8 30.0
9 0.0
Name: split3, dtype: float64
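Explanation:
df[f'split{i}'] = np.where(groups == i, df['discount'], np.nan):
1. Assigns the 'discount' values to the corresponding split column based on the group labels in groups. This includes the zero that closes each group, as the intermediates above show.
2. Fills the remaining rows with NaN.
df.loc[df[f'split{i}'].eq(0) | df[f'split{i}'].isna(), f'split{i}'] = np.nan:
1. Replaces zeros (and the NaNs already present) in each split column with NaN.
df.loc[df[f'split{i}'].isna(), f'split{i}'].ffill(inplace=True):
1. Intended to forward-fill NaNs in each split column to propagate the non-zero values. In practice it leaves df unchanged: the .loc selection contains only NaN values, and ffill(inplace=True) acts on that selection rather than on df, which is why the printed result still shows NaN.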
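The three steps above can be condensed. The following is a sketch of an equivalent loop, assuming the goal is exactly the NaN-filled output printed earlier; it uses Series.where instead of np.where plus a second masking pass, and the nonzero name is introduced here for illustration:

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    'date': [f'2023-09-{d:02d}' for d in range(1, 11)],
    'discount': [30, 25, 0, 10, 15, 15, 0, 25, 30, 0],
})

groups = df['discount'].eq(0).shift(fill_value=True).cumsum()
nonzero = df['discount'].ne(0)

for i in range(1, int(groups.max()) + 1):
    # Keep the value only where the row belongs to group i and is non-zero;
    # Series.where puts NaN everywhere else
    df[f'split{i}'] = df['discount'].where((groups == i) & nonzero)

print(df)
```

Series.where keeps values where the condition is True and inserts NaN otherwise, so a single masking step per column is enough.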