DataFrame vectorization

Question  Votes: 0  Answers: 2

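In the code below, I create a DataFrame df with sample data consisting of values and timestamps. I also add a new column 'value_timespan' and initialize it with -1. I then iterate over the DataFrame to compute the time span between consecutive positive values in the 'value' column.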

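Two things to note:
First, even when there are several consecutive positive values, the time difference is computed only for positive values that form a pair; a trailing positive value without a partner is not computed (see the example below).
Second, any number of zeros may sit between the two positive values of a pair.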

import pandas as pd
from datetime import datetime

# Sample data
data = {    
    'datetime': [
        datetime(2023, 11, 11, 8, 0, 0),
        datetime(2023, 11, 11, 8, 5, 0),
        datetime(2023, 11, 11, 8, 10, 0),
        datetime(2023, 11, 11, 8, 15, 0),
        datetime(2023, 11, 11, 8, 20, 0),
        datetime(2023, 11, 11, 8, 25, 0),
        datetime(2023, 11, 11, 8, 30, 0),
        datetime(2023, 11, 11, 8, 35, 0),
        datetime(2023, 11, 11, 8, 40, 0),
        datetime(2023, 11, 11, 8, 45, 0),
        datetime(2023, 11, 11, 8, 50, 0),
    ],
    'value': [1,  3, 4, 2, -1, 1, 0, 2, -3, 0, -3],                   
}

# Create the DataFrame
df = pd.DataFrame(data)

df['value_timespan'] = -1
# Initialize variables to keep track of the last positive value and its timestamp
last_positive_value = None
last_positive_timestamp = None
# Iterate through the DataFrame
for index, row in df.iterrows():
    if row['value'] > 0:
        if last_positive_value is not None:
            # Calculate the time span between the current positive value and the last positive value
            time_difference = (row['datetime'] - last_positive_timestamp).total_seconds()
            df.at[index, 'value_timespan'] = time_difference
            last_positive_value = None
            last_positive_timestamp = None
        else:
            last_positive_value = row['value']
            last_positive_timestamp = row['datetime']
    if row['value'] < 0:
        last_positive_value = None
        last_positive_timestamp = None        
print(df)

This prints the following result, where (1, 3), (4, 2) and (1, 2) are treated as pairs:

            datetime        | value | value_timespan
        --------------------------------------------
    0   2023-11-11 08:00:00 |   1   | -1
    1   2023-11-11 08:05:00 |   3   | 300
    2   2023-11-11 08:10:00 |   4   | -1
    3   2023-11-11 08:15:00 |   2   | 300
    4   2023-11-11 08:20:00 |  -1   | -1
    5   2023-11-11 08:25:00 |   1   | -1
    6   2023-11-11 08:30:00 |   0   | -1
    7   2023-11-11 08:35:00 |   2   | 600
    8   2023-11-11 08:40:00 |  -3   | -1
    9   2023-11-11 08:45:00 |   0   | -1
    10  2023-11-11 08:50:00 |  -3   | -1

Now I would like to vectorize my code. How can I do this properly?

Update 2023/12/07
For example, for 'value': [1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3],
the correct result is the following, because (1, 3) and (1, 3) form pairs:

            datetime    | value | timespan
-----------------------------------------
0   2023-11-11 08:00:00 |  1    | -1.0
1   2023-11-11 08:05:00 |  3    | 300.0
2   2023-11-11 08:10:00 |  1    | -1.0
3   2023-11-11 08:15:00 | -2    | -1.0
4   2023-11-11 08:20:00 | -1    | -1.0
5   2023-11-11 08:25:00 |  1    | -1.0
6   2023-11-11 08:30:00 |  0    | -1.0
7   2023-11-11 08:35:00 |  0    | -1.0
8   2023-11-11 08:40:00 |  3    | 900.0
9   2023-11-11 08:45:00 |  0    | -1.0
10  2023-11-11 08:50:00 | -3    | -1.0

I hope my requirement is clear.
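
The pairing rule described above can also be written as a small loop-based reference function, which is handy for checking any vectorized attempt against both examples. This is only a sketch: the function name pair_timespans and its list-based return value are assumptions made for illustration, not part of the original code.

```python
import pandas as pd


def pair_timespans(values, timestamps):
    """Loop-based reference: seconds between the two members of each positive pair, else -1."""
    result = [-1.0] * len(values)
    first_ts = None  # timestamp of a positive value still waiting for its partner
    for i, (v, ts) in enumerate(zip(values, timestamps)):
        if v > 0:
            if first_ts is None:
                first_ts = ts      # first member of a potential pair
            else:
                result[i] = (ts - first_ts).total_seconds()
                first_ts = None    # pair complete, start over
        elif v < 0:
            first_ts = None        # a negative value discards a pending first member
        # zeros leave the state untouched
    return result


ts = pd.date_range("2023-11-11 08:00:00", freq="5min", periods=11)
original = pair_timespans([1, 3, 4, 2, -1, 1, 0, 2, -3, 0, -3], ts)
update = pair_timespans([1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3], ts)
print(original)
# [-1.0, 300.0, -1.0, 300.0, -1.0, -1.0, -1.0, 600.0, -1.0, -1.0, -1.0]
print(update)
# [-1.0, 300.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 900.0, -1.0, -1.0]
```

Both printed lists match the two result tables shown in the question.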

python pandas dataframe vectorization
2 Answers
0 votes

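I don't think the code can be fully vectorized, because it needs sequential pairwise processing with progressive look-back, which is why you use the last_ values in your own code. First, note that itertuples is faster than iterrows, because a pandas Series is not built for each row. Better still, loop only over the extracted Series rather than over the DataFrame rows. The code below demonstrates this. It processes 1 million rows in about 0.5 seconds, which I would guess is adequate for a stand-alone application.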

import pandas as pd
from datetime import datetime

CHANGE = 1
NOCHANGE = -1
MAYBE = 0

# Sample data
data = {    
    'datetime': [
        datetime(2023, 11, 11, 8, 0, 0),
        datetime(2023, 11, 11, 8, 5, 0),
        datetime(2023, 11, 11, 8, 10, 0),
        datetime(2023, 11, 11, 8, 15, 0),
        datetime(2023, 11, 11, 8, 20, 0),
        datetime(2023, 11, 11, 8, 25, 0),
        datetime(2023, 11, 11, 8, 30, 0),
        datetime(2023, 11, 11, 8, 35, 0),
        datetime(2023, 11, 11, 8, 40, 0),
        datetime(2023, 11, 11, 8, 45, 0),
        datetime(2023, 11, 11, 8, 50, 0),
    ],
   'value': [1,  3, 4, 2, -1, 1, 0, 2, -3, 0, -3]
#  'value': [1, 3, 1, -2, -1, 1, 0, 0, 3, 0, -3]
}

df = pd.DataFrame(data)

# form working df with only non-zero 'value' using copy to maintain index for later re-insertion
df2 = df[df['value'].ne(0)].copy()

#create temp column and mark rows with negative values as NOCHANGE and others as MAYBE
df2['markers'] = NOCHANGE
df2['markers'] = df2['markers'].mask(df2['value'].gt(0), MAYBE)

#loop through df column and mark values to be changed with CHANGE, others with NOCHANGE
prev = CHANGE
res = []        #temp store for modified marks
for entry in df2['markers']:
    if entry == NOCHANGE:
        res.append(NOCHANGE)
        prev = NOCHANGE
    elif entry == MAYBE and prev == MAYBE:
        res.append(CHANGE)
        prev = CHANGE
    else:
        res.append(NOCHANGE)
        prev = MAYBE
df2['markers'] = res

#add timespan to rows marked with CHANGE
df2['markers'] = df2['markers'].mask(df2['markers'].eq(CHANGE), (df2['datetime']-df2['datetime'].shift(1)).dt.total_seconds())

#merge timespan results back into original DF using indices then fill rows missing from DF2 (value 0) with -1
df['timespan'] = df2['markers']
df['timespan'] = df['timespan'].fillna(NOCHANGE).astype(int)

print(df)

This gives:

              datetime  value  timespan
0  2023-11-11 08:00:00      1        -1
1  2023-11-11 08:05:00      3       300
2  2023-11-11 08:10:00      4        -1
3  2023-11-11 08:15:00      2       300
4  2023-11-11 08:20:00     -1        -1
5  2023-11-11 08:25:00      1        -1
6  2023-11-11 08:30:00      0        -1
7  2023-11-11 08:35:00      2       600
8  2023-11-11 08:40:00     -3        -1
9  2023-11-11 08:45:00      0        -1
10 2023-11-11 08:50:00     -3        -1
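
To check the timing claim on your own machine, one option is to wrap the same idea (loop over one extracted column, not over DataFrame rows) in a function and time it. The sketch below is not the exact code from this answer: the function name timespans_series_loop, the boolean bookkeeping and the random test data are assumptions made for illustration, and the measured time will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def timespans_series_loop(df):
    """Loop over one extracted column to find the rows that complete a pair."""
    nz = df[df["value"].ne(0)]                   # zeros never affect pairing
    positive = nz["value"].gt(0).to_numpy()
    closes_pair = np.zeros(len(nz), dtype=bool)
    pending = False                              # True while a first member awaits its partner
    for i, is_pos in enumerate(positive.tolist()):
        if not is_pos:
            pending = False                      # a negative value resets the state
        elif pending:
            closes_pair[i] = True                # second member of a pair
            pending = False
        else:
            pending = True                       # first member of a potential pair
    # With zeros removed, a pair's first member is always the previous row.
    gap = nz["datetime"].diff().dt.total_seconds()
    out = pd.Series(-1.0, index=df.index)
    out.loc[nz.index[closes_pair]] = gap[closes_pair].to_numpy()
    return out


# Hypothetical benchmark data: 1 million rows with random values in [-3, 4].
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    "datetime": pd.date_range("2023-11-11", freq="5min", periods=n),
    "value": rng.integers(-3, 5, size=n),
})
start = time.perf_counter()
result = timespans_series_loop(big)
print(f"{n:,} rows in {time.perf_counter() - start:.2f} s")
```

The only Python-level loop runs over a plain list of booleans; the timestamp arithmetic is done once with diff() on the whole column.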

0 votes

First, a simplified setup:

import numpy as np
import pandas as pd

value = [1,  3, 4, 2, -1, 1, 0, 2, -3, 0, -3]
t = pd.date_range('2023-11-11 08:00:00', freq='5min', periods=len(value))
df = pd.DataFrame({'t': t, 'value': value})

Second, a fully vectorized solution:

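def pred(g):
    # positions (within the group) of the positive values
    ix = np.where(g.to_numpy() > 0)[0]
    b = ix[1::2]          # second member of each pair
    a = ix[::2][:len(b)]  # first member of each pair; drops a trailing unpaired positive
    v = np.zeros(len(g), dtype=int)
    v[b] = b - a          # how many rows back the first member sits
    return v

off = df.groupby((df['value'] < 0).cumsum())['value'].transform(pred)
# seconds since the epoch (assumes the column is datetime64[ns])
t = df['t'].to_numpy().astype('int64') // 1e9
newdf = df.assign(timespan=np.where(off > 0, (t - t[np.arange(len(t)) - off]), -1))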

# on your example:
>>> newdf
                     t  value  timespan
0  2023-11-11 08:00:00      1      -1.0
1  2023-11-11 08:05:00      3     300.0
2  2023-11-11 08:10:00      4      -1.0
3  2023-11-11 08:15:00      2     300.0
4  2023-11-11 08:20:00     -1      -1.0
5  2023-11-11 08:25:00      1      -1.0
6  2023-11-11 08:30:00      0      -1.0
7  2023-11-11 08:35:00      2     600.0
8  2023-11-11 08:40:00     -3      -1.0
9  2023-11-11 08:45:00      0      -1.0
10 2023-11-11 08:50:00     -3      -1.0

Explanation

groupby forms groups of consecutive non-negative values, each usually starting with a single negative value.
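
As a quick illustration of that grouping key, the sketch below prints the group id assigned to each row and the resulting groups for the sample values. The variable names group_id and groups are chosen here for illustration only.

```python
import pandas as pd

value = pd.Series([1, 3, 4, 2, -1, 1, 0, 2, -3, 0, -3])

# Every negative value bumps the running count, so it opens a new group.
group_id = (value < 0).cumsum()
print(group_id.tolist())
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3]

groups = [g.tolist() for _, g in value.groupby(group_id)]
print(groups)
# [[1, 3, 4, 2], [-1, 1, 0, 2], [-3, 0], [-3]]
```

Because a negative value always sits at the start of its group, positives on either side of it can never be paired with each other.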

off is the offset: how far back we need to look to find the first item of a pair. It is 0 if the current row does not complete a pair:

>>> df.assign(off=off)
                     t  value  off
0  2023-11-11 08:00:00      1    0
1  2023-11-11 08:05:00      3    1
2  2023-11-11 08:10:00      4    0
3  2023-11-11 08:15:00      2    1
4  2023-11-11 08:20:00     -1    0
5  2023-11-11 08:25:00      1    0
6  2023-11-11 08:30:00      0    0
7  2023-11-11 08:35:00      2    2
8  2023-11-11 08:40:00     -3    0
9  2023-11-11 08:45:00      0    0
10 2023-11-11 08:50:00     -3    0

After that, it's just some arithmetic.
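
That arithmetic can be spelled out step by step. The sketch below hard-codes the offsets from the table above and uses elapsed seconds since the first row instead of epoch seconds, which is an assumption made here to avoid depending on the datetime resolution; the names seconds and partner are illustrative.

```python
import numpy as np
import pandas as pd

t = pd.date_range("2023-11-11 08:00:00", freq="5min", periods=11)
off = np.array([0, 1, 0, 1, 0, 0, 0, 2, 0, 0, 0])  # offsets from the table above

seconds = (t - t[0]).total_seconds().to_numpy()     # elapsed seconds for each row
partner = np.arange(len(off)) - off                 # row index of each pair's first member
timespan = np.where(off > 0, seconds - seconds[partner], -1)
print(timespan.tolist())
# [-1.0, 300.0, -1.0, 300.0, -1.0, -1.0, -1.0, 600.0, -1.0, -1.0, -1.0]
```

Rows with off equal to 0 point at themselves in partner, so their difference is 0 and np.where replaces it with -1.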
