I have a dataframe. It looks like this:
prod_id  prod_type  timestamp1  timestamp2
1        a1         2023-12-02  2023-12-01
2        a2         2023-10-10  2023-09-02
3        a1         2023-12-11  2023-12-22
4        a3         2023-05-11  2023-06-21
.....
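I need to put prod_ids into the same group (a new group_id column) if they have the same "prod_type". In addition, the timestamp1 dates within a group must not span more than one month (so the difference between the max and min within a group_id must not exceed 30 days). Likewise, the timestamp2 dates within a group must not span more than one month (again, the difference between the max and min within a group_id must not exceed 30 days). I need to maximize the average number of prod_ids per group_id.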
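To make the two constraints concrete, here is a minimal sketch of the validity check for a single candidate group, written with only the standard library. The names `within_spread`, `group_is_valid` and `MAX_SPREAD_DAYS` are hypothetical helpers introduced for illustration; they do not appear in the original post. Note that because the total number of prod_ids is fixed, maximizing the average number of prod_ids per group_id amounts to using as few groups as possible.

```python
from datetime import date

MAX_SPREAD_DAYS = 30  # limit taken from the problem statement


def within_spread(dates, max_days=MAX_SPREAD_DAYS):
    """Return True if the latest and earliest dates are at most max_days apart."""
    return (max(dates) - min(dates)).days <= max_days


def group_is_valid(rows, max_days=MAX_SPREAD_DAYS):
    """rows: list of (timestamp1, timestamp2) pairs that share one prod_type."""
    ts1 = [r[0] for r in rows]
    ts2 = [r[1] for r in rows]
    return within_spread(ts1, max_days) and within_spread(ts2, max_days)


# prod_id 1 and 3 from the table above (both prod_type a1):
# timestamp1 spans 9 days and timestamp2 spans 21 days.
rows = [
    (date(2023, 12, 2), date(2023, 12, 1)),
    (date(2023, 12, 11), date(2023, 12, 22)),
]
print(group_is_valid(rows))  # True

# 2023-01-01 to 2023-02-01 is 31 days, so "one calendar month" can already
# exceed the 30-day limit.
print(within_spread([date(2023, 1, 1), date(2023, 2, 1)]))  # False
```

The second call shows why the 30-day rule and "one month" are not interchangeable: the rule as stated counts days, so two dates one calendar month apart may or may not be allowed in the same group.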
I tried this, but I still get group_ids where the timestamp1 dates span more than 30 days:
# Convert timestamp columns to datetime objects
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

# Function to check if the range of dates within a group exceeds 30 days
def check_date_range(group):
    if (group['timestamp1'].max() - group['timestamp1'].min()).days > 30:
        return True
    if (group['timestamp2'].max() - group['timestamp2'].min()).days > 30:
        return True
    return False

# Group by 'prod_type' and create new 'group_id' satisfying conditions
group_id = {}
current_group = 1
for _, group in df.groupby('prod_type'):
    group = group.sort_values(by=['timestamp1', 'timestamp2'])
    if check_date_range(group):
        current_group += 1
    for index, row in group.iterrows():
        group_id[row['prod_id']] = current_group

# Add 'group_id' column to DataFrame
df['group_id'] = df['prod_id'].map(group_id)
How do I do this correctly?
P.S.
# Larger Sample DataFrame
data = {
    'prod_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'prod_type': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3', 'a3'],
    'timestamp1': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15',
                   '2023-01-10', '2023-02-05', '2023-03-01', '2023-03-20',
                   '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'timestamp2': ['2023-01-05', '2023-01-20', '2023-02-10', '2023-02-25',
                   '2023-01-15', '2023-02-10', '2023-03-05', '2023-03-25',
                   '2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10']
}
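You are almost on the right track, but I think what you need to do is split the data within each prod_type into sub-groups. Here is one possible approach. It would be nice to see your expected output: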
import pandas as pd

data = {
    'prod_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'prod_type': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3', 'a3'],
    'timestamp1': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15',
                   '2023-01-10', '2023-02-05', '2023-03-01', '2023-03-20',
                   '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'timestamp2': ['2023-01-05', '2023-01-20', '2023-02-10', '2023-02-25',
                   '2023-01-15', '2023-02-10', '2023-03-05', '2023-03-25',
                   '2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10']
}
df = pd.DataFrame(data)
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

def split_into_groups(group):
    # Sort by date so that each sub-group is a run of consecutive rows
    group = group.sort_values(by=['timestamp1', 'timestamp2'])
    group_ids = []
    current_group_id = 1
    start_idx = 0
    while start_idx < len(group):
        sub_group = group.iloc[start_idx:]
        end_idx = 1
        # Extend the run while both date spreads stay within 30 days
        while end_idx < len(sub_group):
            sub_sub_group = sub_group.iloc[:end_idx+1]
            if (sub_sub_group['timestamp1'].max() - sub_sub_group['timestamp1'].min()).days <= 30 and \
               (sub_sub_group['timestamp2'].max() - sub_sub_group['timestamp2'].min()).days <= 30:
                end_idx += 1
            else:
                break
        group_ids.extend([current_group_id] * end_idx)
        current_group_id += 1
        start_idx += end_idx
    # Use the sorted index so the IDs stay aligned with their rows
    # even when the input rows were not already in date order
    return pd.Series(group_ids, index=group.index)

# Note: group_id restarts at 1 within each prod_type
grouped_df = df.groupby('prod_type').apply(split_into_groups)
df['group_id'] = grouped_df.reset_index(level=0, drop=True)
print(df)
This gives:
    prod_id prod_type timestamp1 timestamp2  group_id
0         1        a1 2023-01-01 2023-01-05         1
1         2        a1 2023-01-15 2023-01-20         1
2         3        a1 2023-02-01 2023-02-10         2
3         4        a1 2023-02-15 2023-02-25         2
4         5        a2 2023-01-10 2023-01-15         1
5         6        a2 2023-02-05 2023-02-10         1
6         7        a2 2023-03-01 2023-03-05         2
7         8        a2 2023-03-20 2023-03-25         2
8         9        a3 2023-01-01 2023-01-10         1
9        10        a3 2023-02-01 2023-02-10         2
10       11        a3 2023-03-01 2023-03-10         2
11       12        a3 2023-04-01 2023-04-10         3
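Since group_id restarts at 1 for every prod_type in this output, a group is really identified by the pair (prod_type, group_id). The sketch below, which rebuilds the printed result by hand so that it runs on its own, checks that no such group exceeds the 30-day limit and derives a single numbering across all types. The column name `global_group_id` is an assumption made for illustration and is not part of the original answer.

```python
import pandas as pd

# The result printed above, rebuilt by hand so this check is self-contained.
df = pd.DataFrame({
    "prod_id": range(1, 13),
    "prod_type": ["a1"] * 4 + ["a2"] * 4 + ["a3"] * 4,
    "timestamp1": pd.to_datetime([
        "2023-01-01", "2023-01-15", "2023-02-01", "2023-02-15",
        "2023-01-10", "2023-02-05", "2023-03-01", "2023-03-20",
        "2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"]),
    "timestamp2": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2023-02-10", "2023-02-25",
        "2023-01-15", "2023-02-10", "2023-03-05", "2023-03-25",
        "2023-01-10", "2023-02-10", "2023-03-10", "2023-04-10"]),
    "group_id": [1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3],
})

# Spread (max - min) of each timestamp column per (prod_type, group_id).
per_group = df.groupby(["prod_type", "group_id"])[["timestamp1", "timestamp2"]]
spreads = per_group.max() - per_group.min()

# Groups where either column spans more than 30 days; expected to be empty.
violations = spreads[(spreads > pd.Timedelta(days=30)).any(axis=1)]
print(violations.empty)  # True

# Hypothetical column: one numbering across all prod_types instead of a
# counter that restarts for each type.
df["global_group_id"] = df.groupby(["prod_type", "group_id"]).ngroup() + 1

# The quantity the question asks to maximize: average prod_ids per group.
print(df.groupby("global_group_id").size().mean())
```

Keep in mind that extending runs of rows sorted by timestamp1 is a greedy heuristic. When the two timestamp columns are ordered very differently, it is not guaranteed to produce the fewest possible groups, so the average group size it yields may fall short of the true maximum.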