How to group values based on timestamp conditions?

Question

I have a DataFrame. It looks like this:

prod_id  prod_type  timestamp1    timestamp2
1         a1        2023-12-02   2023-12-01 
2         a2        2023-10-10   2023-09-02
3         a1        2023-12-11   2023-12-22
4         a3        2023-05-11   2023-06-21
.....

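I need to put prod_id values into the same group (a new group_id column) if they have the same "prod_type". In addition, the spread of timestamp1 dates within a group must not exceed one month (so the difference between the max and min within a group_id must not be greater than 30 days). Likewise, the spread of timestamp2 dates must not exceed one month (so the difference between the max and min within a group_id must not exceed 30 days). I need to maximize the average number of prod_id values per group_id.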

I tried this, but I still get group_id values where the timestamp1 date spread exceeds 30 days:

# Convert timestamp columns to datetime objects
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

# Function to check if the range of dates within a group exceeds 30 days
def check_date_range(group):
    if (group['timestamp1'].max() - group['timestamp1'].min()).days > 30:
        return True
    if (group['timestamp2'].max() - group['timestamp2'].min()).days > 30:
        return True
    return False

# Group by 'prod_type' and create new 'group_id' satisfying conditions
group_id = {}
current_group = 1
for _, group in df.groupby('prod_type'):
    group = group.sort_values(by=['timestamp1', 'timestamp2'])
    if check_date_range(group):
        current_group += 1
    for index, row in group.iterrows():
        group_id[row['prod_id']] = current_group

# Add 'group_id' column to DataFrame
df['group_id'] = df['prod_id'].map(group_id)

How do I do this correctly?

P.S.

# Larger Sample DataFrame
data = {
    'prod_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'prod_type': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3', 'a3'],
    'timestamp1': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15', 
                   '2023-01-10', '2023-02-05', '2023-03-01', '2023-03-20',
                   '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'timestamp2': ['2023-01-05', '2023-01-20', '2023-02-10', '2023-02-25', 
                   '2023-01-15', '2023-02-10', '2023-03-05', '2023-03-25',
                   '2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10']
}

Answer 1

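You are almost on the right track, but I think what you need to do is split the rows of each prod_type into groups. It would help to see your expected output, but here is one possible approach: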

import pandas as pd

data = {
    'prod_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'prod_type': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3', 'a3'],
    'timestamp1': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15', 
                   '2023-01-10', '2023-02-05', '2023-03-01', '2023-03-20',
                   '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
    'timestamp2': ['2023-01-05', '2023-01-20', '2023-02-10', '2023-02-25', 
                   '2023-01-15', '2023-02-10', '2023-03-05', '2023-03-25',
                   '2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10']
}
df = pd.DataFrame(data)

df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

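def split_into_groups(group):
    # Work in date order; adding a row can only keep or widen a group's spread
    group = group.sort_values(by=['timestamp1', 'timestamp2'])
    group_ids = []
    current_group_id = 1
    start_idx = 0
    
    while start_idx < len(group):
        # Rows that have not been assigned to a group yet
        sub_group = group.iloc[start_idx:]
        end_idx = 1  # the first remaining row always opens a new group
        
        # Extend the group one row at a time while both spreads stay within 30 days
        while end_idx < len(sub_group):
            sub_sub_group = sub_group.iloc[:end_idx+1]
            if (sub_sub_group['timestamp1'].max() - sub_sub_group['timestamp1'].min()).days <= 30 and \
               (sub_sub_group['timestamp2'].max() - sub_sub_group['timestamp2'].min()).days <= 30:
                end_idx += 1
            else:
                break
                
        # Label the accepted rows, then continue with the rows after them
        group_ids.extend([current_group_id] * end_idx)
        current_group_id += 1
        start_idx += end_idx
    
    # The labels follow the sorted row order (timestamp1, then timestamp2)
    return group_ids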

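# split_into_groups returns labels in sorted row order, so attach them to the
# sorted index; using x.index directly would misalign unsorted input.
sort_cols = ['timestamp1', 'timestamp2']
grouped_df = df.groupby('prod_type').apply(
    lambda x: pd.Series(split_into_groups(x), index=x.sort_values(by=sort_cols).index)
)
df['group_id'] = grouped_df.reset_index(level=0, drop=True)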

print(df)

This gives:

    prod_id prod_type timestamp1 timestamp2  group_id
0         1        a1 2023-01-01 2023-01-05         1
1         2        a1 2023-01-15 2023-01-20         1
2         3        a1 2023-02-01 2023-02-10         2
3         4        a1 2023-02-15 2023-02-25         2
4         5        a2 2023-01-10 2023-01-15         1
5         6        a2 2023-02-05 2023-02-10         1
6         7        a2 2023-03-01 2023-03-05         2
7         8        a2 2023-03-20 2023-03-25         2
8         9        a3 2023-01-01 2023-01-10         1
9        10        a3 2023-02-01 2023-02-10         2
10       11        a3 2023-03-01 2023-03-10         2
11       12        a3 2023-04-01 2023-04-10         3
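
A quick way to sanity-check labels like the ones above is to measure, for each (prod_type, group_id) pair, the spread of both timestamp columns. The sketch below assumes the sample data and the group_id values printed above. The names keys, spread, ok and global_group_id are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows and labels copied from the output shown above.
df = pd.DataFrame({
    'prod_id': list(range(1, 13)),
    'prod_type': ['a1'] * 4 + ['a2'] * 4 + ['a3'] * 4,
    'timestamp1': pd.to_datetime([
        '2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15',
        '2023-01-10', '2023-02-05', '2023-03-01', '2023-03-20',
        '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']),
    'timestamp2': pd.to_datetime([
        '2023-01-05', '2023-01-20', '2023-02-10', '2023-02-25',
        '2023-01-15', '2023-02-10', '2023-03-05', '2023-03-25',
        '2023-01-10', '2023-02-10', '2023-03-10', '2023-04-10']),
    'group_id': [1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3],
})

# group_id restarts at 1 for every prod_type, so a group is identified
# by the pair (prod_type, group_id).
keys = ['prod_type', 'group_id']
cols = ['timestamp1', 'timestamp2']

# Spread (max - min) of each timestamp column inside every group.
grouped = df.groupby(keys)[cols]
spread = grouped.max() - grouped.min()

# Illustrative helper column: True where both spreads are within 30 days.
limit = pd.Timedelta(days=30)
spread['ok'] = (spread['timestamp1'] <= limit) & (spread['timestamp2'] <= limit)

# Optional: one label that is unique across the whole frame.
df['global_group_id'] = df.groupby(keys).ngroup() + 1

print(spread)
print(df[['prod_id', 'prod_type', 'group_id', 'global_group_id']])
```

Because group_id restarts at 1 within each prod_type, global_group_id is only a convenience for cases where a single label per group is needed. Note also that a greedy pass over sorted rows is not guaranteed to produce the fewest groups when both timestamps constrain the window, so the labels may not strictly maximize the average group size.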
