I have a DataFrame with two timestamp columns:
timestamp1 timestamp2
2022-02-18 2023-01-02
2022-02-19 2023-01-04
2022-02-21 2023-01-11
2022-03-11 2024-02-05
2022-03-12 2024-02-06
2022-03-30 2024-02-07
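I want to group the rows by timestamp range (creating a new column, group). Within each group, the date range of each timestamp column must not exceed 4 days (that is, the difference between the maximum and minimum value within the group must not exceed 4 days). So the desired result could look like this: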
timestamp1 timestamp2 group
2022-02-18 2023-01-02 1
2022-02-19 2023-01-04 1
2022-02-21 2023-01-11 2
2022-03-11 2024-02-05 3
2022-03-12 2024-02-06 3
2022-03-30 2024-02-07 4
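As you can see, the third row is not in group 1 because its timestamp2 is 2023-01-11 (9 days later than the smallest timestamp2 in group 1). The last row is not in group 3 because its timestamp1 is 19 days later than the smallest timestamp1 in group 3.
The goal is to have as many rows as possible in each group, subject to the constraint above.
How can I do this? I tried the following, but it does not give the desired result, especially on larger datasets: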
    import pandas as pd

    # Sample DataFrame
    data = {
        'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
        'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07']
    }
    df = pd.DataFrame(data)
    df['timestamp1'] = pd.to_datetime(df['timestamp1'])
    df['timestamp2'] = pd.to_datetime(df['timestamp2'])

    # Function to assign groups
    def assign_groups(df):
        groups = []
        current_group = 1
        start_timestamp2 = df.iloc[0]['timestamp2']
        for i, row in df.iterrows():
            if (row['timestamp2'] - start_timestamp2).days > 4:
                current_group += 1
                start_timestamp2 = row['timestamp2']
            groups.append(current_group)
        return groups

    # Assign groups
    df['group'] = assign_groups(df)
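Your approach only checks timestamp2, comparing each row against the first timestamp2 of the current group. To satisfy the constraint, you also need to check the difference between the current row's timestamp1 and the smallest timestamp1 in the group. The version below does both; it assumes the rows are sorted so that the first row of a group holds the group's minimum in each column: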
    import pandas as pd

    # Sample DataFrame
    data = {
        'timestamp1': ['2022-02-18', '2022-02-19', '2022-02-21', '2022-03-11', '2022-03-12', '2022-03-30'],
        'timestamp2': ['2023-01-02', '2023-01-04', '2023-01-11', '2024-02-05', '2024-02-06', '2024-02-07']
    }
    df = pd.DataFrame(data)
    df['timestamp1'] = pd.to_datetime(df['timestamp1'])
    df['timestamp2'] = pd.to_datetime(df['timestamp2'])

    # Function to assign groups
    def assign_groups(df):
        groups = []
        current_group = 1
        # The first row of each group serves as the reference for both columns
        start_timestamp2 = df.iloc[0]['timestamp2']
        start_timestamp1 = df.iloc[0]['timestamp1']
        for i, row in df.iterrows():
            # Start a new group as soon as either column drifts more than 4 days
            if (row['timestamp2'] - start_timestamp2).days > 4 or (row['timestamp1'] - start_timestamp1).days > 4:
                current_group += 1
                start_timestamp2 = row['timestamp2']
                start_timestamp1 = row['timestamp1']
            groups.append(current_group)
        return groups

    # Assign groups
    df['group'] = assign_groups(df)
    print(df)
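
If the rows are not guaranteed to be sorted in both columns, comparing against the first row of the group can miss violations, because a later value may be smaller than the group's first value. A minimal sketch of a variant that tracks the running minimum and maximum of each column is shown below. The function name assign_groups_minmax and its cols and max_days parameters are made up for this illustration. It is still a greedy single pass in row order, so it keeps every group within the 4-day limit but is not guaranteed to produce the largest possible groups for every input.

```python
import pandas as pd


def assign_groups_minmax(df, cols=("timestamp1", "timestamp2"), max_days=4):
    """Greedy pass: open a new group whenever adding the row would push any
    column's (max - min) within the current group above max_days."""
    limit = pd.Timedelta(days=max_days)
    groups = []
    current_group = 0
    lo = hi = None  # per-column running min / max of the current group
    for row in df[list(cols)].itertuples(index=False):
        fits = False
        if lo is not None:
            new_lo = [min(a, b) for a, b in zip(lo, row)]
            new_hi = [max(a, b) for a, b in zip(hi, row)]
            fits = all(h - l <= limit for l, h in zip(new_lo, new_hi))
        if fits:
            lo, hi = new_lo, new_hi
        else:
            # Row does not fit (or this is the very first row): start a new group
            current_group += 1
            lo, hi = list(row), list(row)
        groups.append(current_group)
    return groups


df = pd.DataFrame({
    "timestamp1": ["2022-02-18", "2022-02-19", "2022-02-21",
                   "2022-03-11", "2022-03-12", "2022-03-30"],
    "timestamp2": ["2023-01-02", "2023-01-04", "2023-01-11",
                   "2024-02-05", "2024-02-06", "2024-02-07"],
})
df["timestamp1"] = pd.to_datetime(df["timestamp1"])
df["timestamp2"] = pd.to_datetime(df["timestamp2"])
df["group"] = assign_groups_minmax(df)

# Sanity check: no column spans more than 4 days inside any group
grouped = df.groupby("group")[["timestamp1", "timestamp2"]]
spans = grouped.max() - grouped.min()
all_within_limit = bool((spans <= pd.Timedelta(days=4)).all().all())
print(df)
print(all_within_limit)
```

On the sample data this assigns the same groups as the desired result in the question. For real data it is probably worth sorting first, for example with df.sort_values(['timestamp1', 'timestamp2']), since the grouping depends on row order.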