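In the code below I create a DataFrame df with sample data containing values and timestamps. I also add a new column 'value_timespan' and initialize it to -1. I then iterate over the DataFrame to compute the time span between consecutive positive values in the 'value' column.
Two points to note:
First, even when there are several consecutive positive values, the time difference is computed only for positive values that pair up; a trailing positive value without a partner is not counted (see the example below).
Second, there can be any number of zeros between the two positive values of a pair.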
import pandas as pd
from datetime import datetime
# Sample data
data = {
    'datetime': [
        datetime(2023, 11, 11, 8, 0, 0),
        datetime(2023, 11, 11, 8, 5, 0),
        datetime(2023, 11, 11, 8, 10, 0),
        datetime(2023, 11, 11, 8, 15, 0),
        datetime(2023, 11, 11, 8, 20, 0),
        datetime(2023, 11, 11, 8, 25, 0),
        datetime(2023, 11, 11, 8, 30, 0),
        datetime(2023, 11, 11, 8, 35, 0),
        datetime(2023, 11, 11, 8, 40, 0),
        datetime(2023, 11, 11, 8, 45, 0),
        datetime(2023, 11, 11, 8, 50, 0),
    ],
    'value': [1, 3, 4, 2, -1, 1, 0, 2, -3, 0, -3],
}
# Create the DataFrame
df = pd.DataFrame(data)
df['value_timespan'] = -1
# Initialize variables to keep track of the last positive value and its timestamp
last_positive_value = None
last_positive_timestamp = None
# Iterate through the DataFrame
for index, row in df.iterrows():
    if row['value'] > 0:
        if last_positive_value is not None:
            # Calculate the time span between the current positive value and the last positive value
            time_difference = (row['datetime'] - last_positive_timestamp).total_seconds()
            df.at[index, 'value_timespan'] = time_difference
            # The pair is complete, so start looking for a new first item
            last_positive_value = None
            last_positive_timestamp = None
        else:
            last_positive_value = row['value']
            last_positive_timestamp = row['datetime']
    if row['value'] < 0:
        # A negative value discards any pending first item
        last_positive_value = None
        last_positive_timestamp = None
print(df)
This prints the following result, in which (1, 3), (4, 2), and (1, 2) are treated as pairs:
datetime | value | value_timespan
--------------------------------------------
0 2023-11-11 08:00:00 | 1 | -1
1 2023-11-11 08:05:00 | 3 | 300
2 2023-11-11 08:10:00 | 4 | -1
3 2023-11-11 08:15:00 | 2 | 300
4 2023-11-11 08:20:00 | -1 | -1
5 2023-11-11 08:25:00 | 1 | -1
6 2023-11-11 08:30:00 | 0 | -1
7 2023-11-11 08:35:00 | 2 | 600
8 2023-11-11 08:40:00 | -3 | -1
9 2023-11-11 08:45:00 | 0 | -1
10 2023-11-11 08:50:00 | -3 | -1
Now I want to vectorize my code. How can I do that correctly?
Update 2023/12/07
For example, for 'value': [1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3],
the correct result is the following, because (1, 3) and (1, 3) form pairs:
datetime | value | timespan
-----------------------------------------
0 2023-11-11 08:00:00 | 1 | -1.0
1 2023-11-11 08:05:00 | 3 | 300.0
2 2023-11-11 08:10:00 | 1 | -1.0
3 2023-11-11 08:15:00 | -2 | -1.0
4 2023-11-11 08:20:00 | -1 | -1.0
5 2023-11-11 08:25:00 | 1 | -1.0
6 2023-11-11 08:30:00 | 0 | -1.0
7 2023-11-11 08:35:00 | 0 | -1.0
8 2023-11-11 08:40:00 | 3 | 900.0
9 2023-11-11 08:45:00 | 0 | -1.0
10 2023-11-11 08:50:00 | -3 | -1.0
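I hope my requirement is clear.
I don't think the code can be fully vectorized, because it needs sequential pairwise processing with progressive look-back, which is why you use the last_ values in your own code. First, note that itertuples is faster than iterrows because it does not build a Pandas Series for each row. Better still is to loop only where necessary, and over an extracted Series rather than over the DataFrame rows. The code below demonstrates this. It processes 1 million rows in about 0.5 seconds, which I imagine is fast enough for a standalone application.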
import pandas as pd
from datetime import datetime
CHANGE = 1
NOCHANGE = -1
MAYBE = 0
# Sample data
data = {
    'datetime': [
        datetime(2023, 11, 11, 8, 0, 0),
        datetime(2023, 11, 11, 8, 5, 0),
        datetime(2023, 11, 11, 8, 10, 0),
        datetime(2023, 11, 11, 8, 15, 0),
        datetime(2023, 11, 11, 8, 20, 0),
        datetime(2023, 11, 11, 8, 25, 0),
        datetime(2023, 11, 11, 8, 30, 0),
        datetime(2023, 11, 11, 8, 35, 0),
        datetime(2023, 11, 11, 8, 40, 0),
        datetime(2023, 11, 11, 8, 45, 0),
        datetime(2023, 11, 11, 8, 50, 0),
    ],
    'value': [1, 3, 4, 2, -1, 1, 0, 2, -3, 0, -3]
    # 'value': [1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3]
}
df = pd.DataFrame(data)
# form working df with only non-zero 'value' using copy to maintain index for later re-insertion
df2 = df[df['value'].ne(0)].copy()
#create temp column and mark rows with negative values as NOCHANGE and others as MAYBE
df2['markers'] = NOCHANGE
df2['markers'] = df2['markers'].mask(df2['value'].gt(0), MAYBE)
#loop through df column and mark values to be changed with CHANGE, others with NOCHANGE
prev = CHANGE
res = [] #temp store for modified marks
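for entry in df2['markers']:
    if entry == NOCHANGE:
        # negative value: never changed, and it resets any pending pair
        res.append(NOCHANGE)
        prev = NOCHANGE
    elif entry == MAYBE and prev == MAYBE:
        # second positive of a pair: this row gets the timespan
        res.append(CHANGE)
        prev = CHANGE
    else:
        # first positive of a potential pair: left unchanged for now
        res.append(NOCHANGE)
        prev = MAYBE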
df2['markers'] = res
#add timespan to rows marked with CHANGE
df2['markers'] = df2['markers'].mask(df2['markers'].eq(CHANGE), (df2['datetime']-df2['datetime'].shift(1)).dt.total_seconds())
#merge timespan results back into original DF using indices then fill rows missing from DF2 (value 0) with -1
df['timespan'] = df2['markers']
df['timespan'] = df['timespan'].fillna(NOCHANGE).astype(int)
print(df)
This gives:
datetime value timespan
0 2023-11-11 08:00:00 1 -1
1 2023-11-11 08:05:00 3 300
2 2023-11-11 08:10:00 4 -1
3 2023-11-11 08:15:00 2 300
4 2023-11-11 08:20:00 -1 -1
5 2023-11-11 08:25:00 1 -1
6 2023-11-11 08:30:00 0 -1
7 2023-11-11 08:35:00 2 600
8 2023-11-11 08:40:00 -3 -1
9 2023-11-11 08:45:00 0 -1
10 2023-11-11 08:50:00 -3 -1
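For comparison, the same single pass can be written over plain arrays without the intermediate marker column. This is only a minimal sketch, not the code from the answer above: the function name pair_timespans and the idea of passing timestamps as seconds since the first row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def pair_timespans(values, seconds):
    """Return the timespan at the second item of each positive pair, else -1.

    values  : sequence of numbers (the 'value' column)
    seconds : sequence of timestamps expressed in seconds
    """
    out = np.full(len(values), -1.0)
    first = None  # position of a positive value still waiting for its partner
    for i, v in enumerate(values):
        if v > 0:
            if first is None:
                first = i                                # first item of a new pair
            else:
                out[i] = seconds[i] - seconds[first]     # pair completed here
                first = None
        elif v < 0:
            first = None                                 # a negative discards the pending item
        # zeros are skipped and do not break a pair
    return out

t = pd.date_range('2023-11-11 08:00:00', freq='5min', periods=11)
seconds = (t - t[0]).total_seconds().to_numpy()

result1 = pair_timespans([1, 3, 4, 2, -1, 1, 0, 2, -3, 0, -3], seconds)
result2 = pair_timespans([1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3], seconds)
print(result1)
print(result2)
```

Because the loop touches only two flat arrays, it avoids the per-row overhead of iterating over DataFrame rows while keeping the pairing rules easy to read.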
First, a simplified setup:
import numpy as np
import pandas as pd
value = [1, 3, 4, 2, -1, 1, 0, 2, -3, 0, -3]
t = pd.date_range('2023-11-11 08:00:00', freq='5min', periods=len(value))
df = pd.DataFrame({'t': t, 'value': value})
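Second, a fully vectorized solution:
def pred(g):
    # positions of the positive values within this group
    ix = np.where(g.to_numpy() > 0)[0]
    b = ix[1::2]          # second item of each pair
    a = ix[::2][:len(b)]  # first item of each pair; drop an unpaired trailing positive
    v = np.zeros(len(g), dtype=int)
    v[b] = b - a          # how far back the first item of the pair is
    return v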
off = df.groupby((df['value'] < 0).cumsum())['value'].transform(pred)
t = df['t'].to_numpy().astype('int64') // 1e9
newdf = df.assign(timespan=np.where(off > 0, (t - t[np.arange(len(t)) - off]), -1))
# on your example:
>>> newdf
t value timespan
0 2023-11-11 08:00:00 1 -1.0
1 2023-11-11 08:05:00 3 300.0
2 2023-11-11 08:10:00 4 -1.0
3 2023-11-11 08:15:00 2 300.0
4 2023-11-11 08:20:00 -1 -1.0
5 2023-11-11 08:25:00 1 -1.0
6 2023-11-11 08:30:00 0 -1.0
7 2023-11-11 08:35:00 2 600.0
8 2023-11-11 08:40:00 -3 -1.0
9 2023-11-11 08:45:00 0 -1.0
10 2023-11-11 08:50:00 -3 -1.0
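The groupby forms groups of consecutive non-negative values, each usually starting with a single negative value.
off is the offset: how far back we need to look to find the first item of a pair. It is 0 if the current row does not complete a pair: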
>>> df.assign(off=off)
t value off
0 2023-11-11 08:00:00 1 0
1 2023-11-11 08:05:00 3 1
2 2023-11-11 08:10:00 4 0
3 2023-11-11 08:15:00 2 1
4 2023-11-11 08:20:00 -1 0
5 2023-11-11 08:25:00 1 0
6 2023-11-11 08:30:00 0 0
7 2023-11-11 08:35:00 2 2
8 2023-11-11 08:40:00 -3 0
9 2023-11-11 08:45:00 0 0
10 2023-11-11 08:50:00 -3 0
After that, it is just some arithmetic.
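To check the approach against the 2023/12/07 example from the question, where one group holds an odd number of positive values, the steps can be wrapped in a function. This is a hedged sketch rather than the definitive implementation: the names pair_offsets and timespans are made up for illustration, the first-item positions are truncated to the number of completed pairs so that an unpaired trailing positive is ignored, and the seconds are obtained by dividing by np.timedelta64(1, 's') instead of integer arithmetic on nanoseconds.

```python
import numpy as np
import pandas as pd

def pair_offsets(g):
    """Per group: rows to look back to reach the pair's first item (0 if none)."""
    ix = np.flatnonzero(g.to_numpy() > 0)   # positions of positive values
    b = ix[1::2]                            # second item of each pair
    a = ix[::2][:len(b)]                    # first items, ignoring an unpaired last one
    v = np.zeros(len(g), dtype=int)
    v[b] = b - a
    return v

def timespans(df):
    """Seconds between the items of each positive pair, -1 elsewhere."""
    groups = (df['value'] < 0).cumsum()     # a new group starts at every negative value
    off = df.groupby(groups)['value'].transform(pair_offsets).to_numpy().astype(int)
    t = df['t'].to_numpy()
    pos = np.arange(len(t))
    span = (t - t[pos - off]) / np.timedelta64(1, 's')
    return np.where(off > 0, span, -1.0)

value = [1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3]
t = pd.date_range('2023-11-11 08:00:00', freq='5min', periods=len(value))
df = pd.DataFrame({'t': t, 'value': value})
result = timespans(df)
print(df.assign(timespan=result))
```

Here rows 1 and 8 receive 300 and 900 seconds respectively, and the unpaired 1 at row 2 stays at -1, which matches the expected table in the question's update.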