How to group rows in a dataset by the range of two timestamps?


I have a dataframe with two timestamps:

timestamp1             timestamp2
2022-02-18            2023-01-02    
2022-02-19            2023-01-04    
2022-02-21            2023-01-11    
2022-03-11            2024-02-05    
2022-03-12            2024-02-06    
2022-03-30            2024-02-07        

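I want to group the rows by timestamp range (creating a new column, group). Within each group, the date range of each timestamp must not exceed 4 days (that is, the difference between the maximum and minimum value within a group must not exceed 4 days). So the desired result could look like this: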

timestamp1             timestamp2       group
2022-02-18            2023-01-02            1
2022-02-19            2023-01-04            1
2022-02-21            2023-01-11            2
2022-03-11            2024-02-05            3
2022-03-12            2024-02-06            3
2022-03-30            2024-02-07            4

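As you can see, the third row is not in group 1 because its timestamp2 is 2023-01-11 (9 days later than the minimum timestamp2 in group 1). The last row is not in group 3 because its timestamp1 is 19 days later than the minimum timestamp1 in group 3.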

The goal is to have as many rows as possible in each group, subject to the constraint above.

How can I do this? I tried the following, but it does not give the desired result, especially on larger datasets:

import pandas as pd

# Sample DataFrame
data = {
    'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
    'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07']
}
df = pd.DataFrame(data)
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

# Function to assign groups
def assign_groups(df):
    groups = []
    current_group = 1
    start_timestamp2 = df.iloc[0]['timestamp2']
    for i, row in df.iterrows():
        if (row['timestamp2'] - start_timestamp2).days > 4:
            current_group += 1
            start_timestamp2 = row['timestamp2']
        groups.append(current_group)
    return groups

# Assign groups
df['group'] = assign_groups(df)
python python-3.x dataframe algorithm function
1 Answer

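Your approach only checks timestamp2, comparing each row with the first timestamp2 of the current group. To satisfy the constraint, you also need to check the difference between the current row's timestamp1 and the group's minimum timestamp1 (here, the timestamp1 of the group's first row, which assumes the data is sorted).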

import pandas as pd

# Sample DataFrame
data = {
    'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
    'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07']
}
df = pd.DataFrame(data)
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

# Function to assign groups
def assign_groups(df):
    groups = []
    current_group = 1
    start_timestamp2 = df.iloc[0]['timestamp2']
    start_timestamp1 = df.iloc[0]['timestamp1']
    for i, row in df.iterrows():
        if (row['timestamp2'] - start_timestamp2).days > 4 or (row['timestamp1'] - start_timestamp1).days > 4:
            current_group += 1
            start_timestamp2 = row['timestamp2']
            start_timestamp1 = row['timestamp1']
        groups.append(current_group)
    return groups

# Assign groups
df['group'] = assign_groups(df)

print(df)
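
The function above compares each row only with the first row of its group, so it relies on both columns being in ascending order. If that is not guaranteed, one possible variation is to track the running minimum and maximum of each column inside the current group and open a new group as soon as either span would exceed the limit. The sketch below is only an illustration under some assumptions: the names assign_groups_minmax and MAX_SPAN are made up for this example, the rows are processed in their existing order (sort them first if needed), and a single greedy pass like this is not guaranteed to produce the largest possible groups on every dataset.

```python
import pandas as pd

# Assumed limit, taken from the question: max - min within a group <= 4 days.
MAX_SPAN = pd.Timedelta(days=4)

def assign_groups_minmax(df, cols=("timestamp1", "timestamp2"), max_span=MAX_SPAN):
    """Greedy single pass (hypothetical helper): a row joins the current group
    only if every column's max - min stays within max_span afterwards."""
    groups = []
    current_group = 0
    lo, hi = {}, {}  # running per-column min / max of the current group
    for values in df[list(cols)].itertuples(index=False, name=None):
        # Span the current group would have if this row were added to it.
        new_lo = {c: min(lo.get(c, v), v) for c, v in zip(cols, values)}
        new_hi = {c: max(hi.get(c, v), v) for c, v in zip(cols, values)}
        fits = current_group > 0 and all(
            new_hi[c] - new_lo[c] <= max_span for c in cols
        )
        if not fits:
            # Start a new group whose min and max are this row's values.
            current_group += 1
            new_lo = dict(zip(cols, values))
            new_hi = dict(zip(cols, values))
        lo, hi = new_lo, new_hi
        groups.append(current_group)
    return groups

df = pd.DataFrame({
    "timestamp1": pd.to_datetime(["2022-02-18", "2022-02-19", "2022-02-21",
                                  "2022-03-11", "2022-03-12", "2022-03-30"]),
    "timestamp2": pd.to_datetime(["2023-01-02", "2023-01-04", "2023-01-11",
                                  "2024-02-05", "2024-02-06", "2024-02-07"]),
})
df["group"] = assign_groups_minmax(df)

# Sanity check: per-group span (max - min) of both columns.
grouped = df.groupby("group")[["timestamp1", "timestamp2"]]
spans = grouped.max() - grouped.min()
assert (spans <= MAX_SPAN).all().all()

print(df)
```

On the sample data this assigns the groups 1, 1, 2, 3, 3, 4, matching the desired result in the question. Comparing against pd.Timedelta rather than .days also avoids truncation if the timestamps carry a time of day.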