How do I group rows in a dataset by the range size of two timestamps?

Question

I have a DataFrame with two timestamp columns:

timestamp1             timestamp2
2022-02-18            2023-01-02    
2022-02-19            2023-01-04    
2022-02-21            2023-01-11    
2022-03-11            2024-02-05    
2022-03-12            2024-02-06    
2022-03-30            2024-02-07

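I want to group the rows by timestamp range (creating a new group column). Within each group, the date range of each timestamp must not exceed 4 days (so the difference between the maximum and minimum value within a group must not exceed 4 days). The desired result would look like this: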

timestamp1             timestamp2       group
2022-02-18            2023-01-02            1
2022-02-19            2023-01-04            1
2022-02-21            2023-01-11            2
2022-03-11            2024-02-05            3
2022-03-12            2024-02-06            3
2022-03-30            2024-02-07            4

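As you can see, the third row is not in group 1, because its timestamp2 is 2023-01-11 (9 days later than the minimum timestamp2 in group 1). The last row is not in group 3, because its timestamp1 is 19 days later than the minimum timestamp1 in group 3.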

The goal is to have as many rows as possible in each group, subject to the constraint above.

How can I do this? I tried the following, but it does not give the desired result, especially on larger datasets:

import pandas as pd

# Sample DataFrame
data = {
    'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
    'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07']
}
df = pd.DataFrame(data)
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

# Function to assign groups
def assign_groups(df):
    groups = []
    current_group = 1
    start_timestamp2 = df.iloc[0]['timestamp2']
    for i, row in df.iterrows():
        if (row['timestamp2'] - start_timestamp2).days > 4:
            current_group += 1
            start_timestamp2 = row['timestamp2']
        groups.append(current_group)
    return groups

# Assign groups
df['group'] = assign_groups(df)

Answer 1

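Your approach only checks timestamp2, comparing each row against the first timestamp2 of the current group. To satisfy the constraint, you also need to check the difference between the current row's timestamp1 and the minimum timestamp1 within the group. The version below does that, treating the first row of each group as its minimum, which holds for data sorted in ascending order like the sample.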

import pandas as pd

# Sample DataFrame
data = {
    'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
    'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07']
}
df = pd.DataFrame(data)
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

# Function to assign groups
def assign_groups(df):
    groups = []
    current_group = 1
    start_timestamp2 = df.iloc[0]['timestamp2']
    start_timestamp1 = df.iloc[0]['timestamp1']
    for i, row in df.iterrows():
        if (row['timestamp2'] - start_timestamp2).days > 4 or (row['timestamp1'] - start_timestamp1).days > 4:
            current_group += 1
            start_timestamp2 = row['timestamp2']
            start_timestamp1 = row['timestamp1']
        groups.append(current_group)
    return groups

# Assign groups
df['group'] = assign_groups(df)

print(df)
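
The code above compares each row only with the first row of its group, so it relies on both columns being in ascending order. As a minimal sketch for data where that may not hold, the variant below tracks the running minimum and maximum of each column in the current group and starts a new group as soon as either span would exceed the limit. The names assign_groups_minmax, MAX_SPAN, cols and max_span are made up for this illustration and are not part of pandas. It is still a greedy pass over the rows in their given order, so it is not guaranteed to produce the largest possible groups in every case.

```python
import pandas as pd

# Assumed limit, taken from the question: max - min within a group <= 4 days.
MAX_SPAN = pd.Timedelta(days=4)


def assign_groups_minmax(df, cols=("timestamp1", "timestamp2"), max_span=MAX_SPAN):
    """Greedy grouping over rows in their current order.

    A row joins the current group only if, after adding it, the max - min
    span of every column in `cols` stays within `max_span`.
    """
    groups = []
    current_group = 0
    lo = hi = None  # running per-column min and max of the current group
    for values in df[list(cols)].itertuples(index=False, name=None):
        fits = False
        if lo is not None:
            new_lo = [min(a, b) for a, b in zip(lo, values)]
            new_hi = [max(a, b) for a, b in zip(hi, values)]
            fits = all(h - l <= max_span for l, h in zip(new_lo, new_hi))
        if fits:
            lo, hi = new_lo, new_hi
        else:
            # Start a new group seeded with this row.
            current_group += 1
            lo, hi = list(values), list(values)
        groups.append(current_group)
    return groups


# Sample DataFrame from the question
df = pd.DataFrame({
    'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
    'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07'],
})
df['timestamp1'] = pd.to_datetime(df['timestamp1'])
df['timestamp2'] = pd.to_datetime(df['timestamp2'])

df['group'] = assign_groups_minmax(df)

# Sanity check: per-group span (max - min) of each timestamp column.
grouped = df.groupby('group')[['timestamp1', 'timestamp2']]
spans = grouped.max() - grouped.min()
print(df)
print(spans)
```

On the sample data this assigns the groups 1, 1, 2, 3, 3, 4, matching the desired result in the question. The spans table is a quick way to confirm on a larger dataset that no group exceeds the limit. Note that the comparison uses full Timedelta values rather than .days, so a span of 4 days and 1 hour counts as exceeding the limit here.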
