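In the code below, I create a DataFrame df containing sample data with values and timestamps. I also add a new column "value_timespan" and initialize it with -1. I then iterate over the DataFrame to compute the time span between consecutive positive values in the "value" column.
Two points are worth noting.
First, positive values are consumed in pairs: even when there are several consecutive positive values, only the time difference from the first positive value to the second is computed. Tracking then resets, so the next positive value starts a new pair instead of being measured against the previous one.
Second, there can be any number of zeros between the two positive values of a pair. A negative value, by contrast, resets the tracking.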
import pandas as pd
from datetime import datetime
# Sample data
data = {
    'datetime': [
        datetime(2023, 11, 11, 8, 0, 0),
        datetime(2023, 11, 11, 8, 5, 0),
        datetime(2023, 11, 11, 8, 10, 0),
        datetime(2023, 11, 11, 8, 15, 0),
        datetime(2023, 11, 11, 8, 20, 0),
        datetime(2023, 11, 11, 8, 25, 0),
        datetime(2023, 11, 11, 8, 30, 0),
        datetime(2023, 11, 11, 8, 35, 0),
        datetime(2023, 11, 11, 8, 40, 0),
        datetime(2023, 11, 11, 8, 45, 0),
        datetime(2023, 11, 11, 8, 50, 0),
    ],
    'value': [1, 3, 1, 2, -1, 1, 0, 2, -3, 0, -3],
}
# Create the DataFrame
df = pd.DataFrame(data)
df['value_timespan'] = -1
# Initialize variables to keep track of the last positive value and its timestamp
last_positive_value = None
last_positive_timestamp = None
# Iterate through the DataFrame
for index, row in df.iterrows():
    if row['value'] > 0:
        if last_positive_value is not None:
            # Calculate the time span between the current positive value and the last positive value
            time_difference = (row['datetime'] - last_positive_timestamp).total_seconds()
            df.at[index, 'value_timespan'] = time_difference
            last_positive_value = None
            last_positive_timestamp = None
        else:
            last_positive_value = row['value']
            last_positive_timestamp = row['datetime']
    if row['value'] < 0:
        last_positive_value = None
        last_positive_timestamp = None
print(df)
This prints the following result:
datetime | value | value_timespan
--------------------------------------------
0 2023-11-11 08:00:00 | 1 | -1
1 2023-11-11 08:05:00 | 3 | 300
2 2023-11-11 08:10:00 | 1 | -1
3 2023-11-11 08:15:00 | 2 | 300
4 2023-11-11 08:20:00 | -1 | -1
5 2023-11-11 08:25:00 | 1 | -1
6 2023-11-11 08:30:00 | 0 | -1
7 2023-11-11 08:35:00 | 2 | 600
8 2023-11-11 08:40:00 | -3 | -1
9 2023-11-11 08:45:00 | 0 | -1
10 2023-11-11 08:50:00 | -3 | -1
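Now I would like to vectorize my code. How can I do this correctly?
The code below shows a vectorized approach. Because of the complex logic you defined, it looks rather "clunky", but I can't immediately think of a smarter way.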
import pandas as pd
from datetime import datetime
import numpy as np
data = {
    'datetime': [
        datetime(2023, 11, 11, 8, 0, 0),
        datetime(2023, 11, 11, 8, 5, 0),
        datetime(2023, 11, 11, 8, 10, 0),
        datetime(2023, 11, 11, 8, 15, 0),
        datetime(2023, 11, 11, 8, 20, 0),
        datetime(2023, 11, 11, 8, 25, 0),
        datetime(2023, 11, 11, 8, 30, 0),
        datetime(2023, 11, 11, 8, 35, 0),
        datetime(2023, 11, 11, 8, 40, 0),
        datetime(2023, 11, 11, 8, 45, 0),
        datetime(2023, 11, 11, 8, 50, 0),
    ],
    'value': [1, 3, 1, 2, -1, 1, 0, 2, -3, 0, -3],
}
df = pd.DataFrame(data)
# form new df with non-zero 'value'
df2 = df[df['value'].ne(0)].reset_index()
#create temp column to mark rows to be changed
df2['m'] = (df2['value'].gt(0)) & (df2['value'].shift(1).ge(0) & (df2['value'].shift(-1) < df2['value']))
#add timespan to marked rows or else -1 if not marked
df2['timespan'] = np.where(df2['m'], (df2['datetime']-df2['datetime'].shift(1)).dt.total_seconds(), -1)
#drop marked column
df2 = df2.drop('m', axis = 1)
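# merge timespan results back into the original DF, filling the 'value' == 0 rows with -1
# (a left merge keeps df's row order; an outer merge may sort rows by the join keys in newer pandas versions)
df = pd.merge(df, df2, on=['value', 'datetime'], how='left').fillna(-1).drop('index', axis=1)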
print(df)
which gives:
datetime value timespan
0 2023-11-11 08:00:00 1 -1.0
1 2023-11-11 08:05:00 3 300.0
2 2023-11-11 08:10:00 1 -1.0
3 2023-11-11 08:15:00 2 300.0
4 2023-11-11 08:20:00 -1 -1.0
5 2023-11-11 08:25:00 1 -1.0
6 2023-11-11 08:30:00 0 -1.0
7 2023-11-11 08:35:00 2 600.0
8 2023-11-11 08:40:00 -3 -1.0
9 2023-11-11 08:45:00 0 -1.0
10 2023-11-11 08:50:00 -3 -1.0
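
The mask above compares each value with its neighbours, so it may not reproduce the loop on other inputs (for example, a run of equal positive values). As a rough alternative sketch, not part of the original answer, the loop's pairing rule can be expressed with a cumulative count: negatives split the data into runs, positives are numbered within each run, and every even-numbered positive is measured against the positive just before it. The function name positive_pair_timespans is hypothetical, and the column names 'datetime' and 'value' are assumed to match the question.

```python
import pandas as pd


def positive_pair_timespans(df):
    """Seconds between the 1st and 2nd positive of each pair, else -1.

    Hypothetical helper; assumes columns 'datetime' and 'value'.
    """
    pos = df["value"] > 0
    # Every negative value starts a new run, mirroring the reset in the loop.
    run_id = (df["value"] < 0).cumsum()
    # Number the positives 1, 2, 3, ... within each run (zeros do not count).
    rank = pos.astype(int).groupby(run_id).cumsum()
    # Even-numbered positives close a pair opened by the previous positive.
    is_second = pos & (rank % 2 == 0)
    # Time gap to the previous positive row, computed on positive rows only.
    pos_times = df.loc[pos, "datetime"]
    gap = (pos_times - pos_times.shift()).dt.total_seconds()
    # Align back to the full index and keep the gap only where a pair closes.
    return gap.reindex(df.index).where(is_second, -1.0)


# Same sample data as in the question, built with date_range for brevity.
sample = pd.DataFrame({
    "datetime": pd.date_range("2023-11-11 08:00", periods=11, freq="5min"),
    "value": [1, 3, 1, 2, -1, 1, 0, 2, -3, 0, -3],
})
sample["value_timespan"] = positive_pair_timespans(sample)
print(sample)
# value_timespan: [-1, 300, -1, 300, -1, -1, -1, 600, -1, -1, -1] (as floats)
```

This avoids the intermediate merge, and zeros are skipped without a separate filtering step because they are neither counted as positives nor used to start a new run.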