Suppose I have a dataframe with id and datetime columns:
import pandas as pd

df = pd.DataFrame({
    "id": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "a3", "a3", "a3", "a3"],
    "datetime": ["2016-01-01 00:01:00.156",
                 "2016-01-01 12:00:00.425",
                 "2016-01-02 00:59:00.123",
                 "2016-01-02 14:16:00.548",
                 "2016-01-01 12:00:00.147",
                 "2016-01-01 13:59:00.123",
                 "2016-01-02 08:01:00.147",
                 "2016-01-02 18:49:00.123",
                 "2016-02-01 12:00:00.147",
                 "2016-02-01 13:59:00.123",
                 "2016-02-02 08:01:00.147",
                 "2016-02-02 18:49:00.123"]})
# Parse the strings into datetime64 values so the .dt accessor is available
df["datetime"] = pd.to_datetime(df["datetime"])
df
This is the dataframe:
id datetime
0 a1 2016-01-01 00:01:00.156
1 a1 2016-01-01 12:00:00.425
2 a1 2016-01-02 00:59:00.123
3 a1 2016-01-02 14:16:00.548
4 a2 2016-01-01 12:00:00.147
5 a2 2016-01-01 13:59:00.123
6 a2 2016-01-02 08:01:00.147
7 a2 2016-01-02 18:49:00.123
8 a3 2016-02-01 12:00:00.147
9 a3 2016-02-01 13:59:00.123
10 a3 2016-02-02 08:01:00.147
11 a3 2016-02-02 18:49:00.123
I want to generate a timedelta column. This is the output I expect:
id datetime datetime_baseline timedelta
0 a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156 0
1 a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156 719
2 a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123 0
3 a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123 797
4 a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147 0
5 a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147 119
6 a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147 0
7 a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147 648
8 a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147 0
9 a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147 119
10 a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147 0
11 a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147 648
Here is how the timedelta values are computed: 1) the code needs to identify the first datetime within the same id and date ('YYYY-MM-DD'), and 2) use it as the baseline (datetime_baseline) to compute the timedelta (in minutes, rounded to a whole number) relative to the other datetimes within the same id and the same date. For id='a1' and date='2016-01-01', datetime_baseline='2016-01-01 00:01:00.156'. So at index=0, timedelta=0, because '2016-01-01 00:01:00.156' - datetime_baseline = 0. At index=1, timedelta=719, because '2016-01-01 12:00:00.425' - datetime_baseline = 719 (minutes). At index=2, the id is the same as before but the date is now '2016-01-02', so a new baseline is used: '2016-01-02 00:59:00.123'. There, timedelta = '2016-01-02 00:59:00.123' - datetime_baseline = 0. At index=3, timedelta = '2016-01-02 14:16:00.548' - datetime_baseline = 797.
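The arithmetic for index=1 can be checked by hand. This is only a minimal sketch using plain pd.Timestamp values; the names baseline, later, minutes and rounded are illustrative and not part of the dataframe:

```python
import pandas as pd

# Baseline and second timestamp for id='a1' on 2016-01-01 (taken from the table above)
baseline = pd.Timestamp("2016-01-01 00:01:00.156")
later = pd.Timestamp("2016-01-01 12:00:00.425")

# Subtracting two Timestamps gives a Timedelta of 11:59:00.269
delta = later - baseline

# Convert to minutes (about 719.0045), then round to a whole number
minutes = delta.total_seconds() / 60
rounded = round(minutes)
print(rounded)
```

The raw difference is slightly above 719 minutes because of the millisecond parts, which is why the expected output only matches after rounding.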
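Although I know how the timedelta values should be computed (timedelta = datetime - datetime_baseline), I don't know how to determine the baseline values (i.e. how to generate the datetime_baseline column). Please let me know if you need any further explanation.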
P.S. The actual dataframe has 500K+ rows.
Use GroupBy.transform to build the baseline:
df["datetime_baseline"] = (df.groupby(["id", df["datetime"].dt.date])
["datetime"].transform("first"))
Use dt.total_seconds to compute the timedelta:
df["timedelta"] = ((df["datetime"].sub(df["datetime_baseline"]))
.dt.total_seconds().div(60).round(0).astype(int))
Output:
print(df)
id datetime datetime_baseline timedelta
0 a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156 0
1 a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156 719
2 a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123 0
3 a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123 797
4 a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147 0
5 a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147 119
6 a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147 0
7 a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147 648
8 a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147 0
9 a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147 119
10 a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147 0
11 a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147 648
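One caveat: transform("first") returns the first row of each group as stored, which is the earliest datetime only when the rows are already in chronological order within each id and date. The sketch below uses a small hypothetical two-row frame (demo) that is deliberately out of order, and groups on dt.normalize() instead of dt.date. That swap is an assumption made here to keep the key as datetime64 rather than Python date objects; it is not something the question requires:

```python
import pandas as pd

# Hypothetical frame, deliberately NOT sorted by datetime
demo = pd.DataFrame({
    "id": ["a1", "a1"],
    "datetime": pd.to_datetime(["2016-01-01 12:00:00.425",
                                "2016-01-01 00:01:00.156"]),
})

# Midnight of each row's date, used as the per-day grouping key
day = demo["datetime"].dt.normalize()
grouped = demo.groupby(["id", day])["datetime"]

first_baseline = grouped.transform("first")  # first row as stored in the group
min_baseline = grouped.transform("min")      # earliest timestamp in the group

print(first_baseline.tolist())
print(min_baseline.tolist())
```

If the real data is not guaranteed to be sorted, either sort with df.sort_values(["id", "datetime"]) first or use "min" in place of "first".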
Try:
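import numpy as np

# Earliest datetime within each (id, calendar date) group
df['datetime_baseline'] = df.groupby(['id', df['datetime'].dt.date])["datetime"].transform('min')
# .dt.seconds is only the whole-seconds component of each delta; that is enough here
# because every delta stays within a single day
df['timedelta'] = np.round((df['datetime'] - df['datetime_baseline']).dt.seconds / 60)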
print(df)
Prints:
id datetime datetime_baseline timedelta
0 a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156 0.0
1 a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156 719.0
2 a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123 0.0
3 a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123 797.0
4 a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147 0.0
5 a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147 119.0
6 a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147 0.0
7 a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147 648.0
8 a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147 0.0
9 a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147 119.0
10 a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147 0.0
11 a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147 648.0
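Note that .dt.seconds and .dt.total_seconds() are not interchangeable in general. .dt.seconds is only the whole-seconds component of each timedelta, so it drops both the days and the sub-second part. A small sketch with two made-up timedeltas (the series name deltas is illustrative):

```python
import pandas as pd

# One delta under a day, and one hypothetical delta longer than a day
deltas = pd.Series(pd.to_timedelta(["0 days 11:59:00.269", "1 days 00:30:00"]))

seconds_component = deltas.dt.seconds   # seconds part only: ignores days and milliseconds
total = deltas.dt.total_seconds()       # full duration in seconds, as floats

print(seconds_component.tolist())
print(total.tolist())
```

Since the baseline here is always on the same calendar date as the row, both approaches give the same rounded minutes for this data. total_seconds() is the safer choice if the grouping ever spans more than one day.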