Python - Efficient calculation where one row's end value is the next row's start value

Question  Votes: 0  Answers: 4

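I want to perform a simple calculation on a rolling basis, but I run into serious performance problems when I try to solve it with nested for loops. I need to run this on very large data, and I am restricted to standard Python (including Pandas).

I have a pd.DataFrame (df1) that contains, for each combination of some dimensions (call them key1 and key2), a start column, an end column, and some operation columns in between that should be used to calculate the end column from the start column.

Basically, the simple logic is: start + plus - minus = end, where each row's end value is the next row's start value.

This needs to be done per pair of keys, i.e. separately for AX, AY and BX.

df2 shows the expected result, but I don't know how to get there efficiently, without exhausting my memory, when this has to be done on a much larger table.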

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df1
  key1 key2 start plus minus end
0    A    X     3    6     4   0
1    A    X     0    2     1   0
2    A    X     0    5     7   0
3    A    Y     8    3     1   0
4    A    Y     0    2     3   0
5    B    X     4    4     2   0
6    B    X     0    1     0   0
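
One caveat about the construction above: because the nested list mixes strings and numbers, np.array coerces every element to a string, so start, plus, minus and end hold text rather than integers. The following is a minimal sketch, on a two-row excerpt of the data, of how one might check for this and convert the columns before doing arithmetic. The name num_cols is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Two-row excerpt of the sample data, built the same way as df1.
df1 = pd.DataFrame(
    np.array([["A", "X", 3, 6, 4, 0], ["A", "X", 0, 2, 1, 0]]),
    columns=["key1", "key2", "start", "plus", "minus", "end"],
)

# Mixed str/int input makes NumPy store every element as a string.
was_text = isinstance(df1.loc[0, "start"], str)

# Convert the numeric columns so that + and - mean arithmetic.
num_cols = ["start", "plus", "minus", "end"]  # illustrative name
df1[num_cols] = df1[num_cols].astype(int)
```

After the conversion, expressions such as df1["plus"] - df1["minus"] operate on integers.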
    

df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,1,6], ["A", "X", 6,5,7,4], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 0,4,2,2], ["B", "X", 2,1,0,3]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df2
  key1 key2 start plus minus end
0    A    X     3    6     4   5
1    A    X     5    2     1   6
2    A    X     6    5     7   4
3    A    Y     8    3     1  10
4    A    Y    10    2     3   9
5    B    X     0    4     2   2
6    B    X     2    1     0   3
python pandas performance for-loop apply
4 Answers

0 votes

You can combine astype, df.iterrows() and a for loop to do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

# Convert columns to integer
df[['start', 'plus', 'minus', 'end']] = df[['start', 'plus', 'minus', 'end']].astype(int)

# Start the row iterator
row_iterator = df.iterrows()
# take first item from row_iterator
_, last = next(row_iterator)
# Modify the first element
last['end'] = last['start'] + last['plus'] - last['minus']
df.loc[0, :] = last
# Iterate through the rest of the rows
for i, row in row_iterator:
    # Check the keys match
    if row['key1'] == last['key1'] and row['key2'] == last['key2']:
        # Add the end of last to the start of the next row
        row['start'] = last['end']
    # Calculate new end for row
    row['end'] = row['start'] + row['plus'] - row['minus']
    # Ensure the changes are shown in the original dataframe
    df.loc[i, :] = row
    # Last row is now the current row
    last = row

After running this, df is now:

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7

Note: your df2 contains a mistake. Following the logic you describe, the start entry in row 5 should be 4, not 0.
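
Since the question is about performance, here is a minimal sketch of the same forward pass that avoids writing back into the DataFrame on every iteration. It walks the rows with itertuples, collects the results in plain lists, and assigns them once at the end. It assumes, like the loop above, that rows belonging to the same key1/key2 pair are adjacent. The sample data is built directly with integer columns, and the names starts, ends, prev_key and prev_end are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key1":  ["A", "A", "A", "A", "A", "B", "B"],
    "key2":  ["X", "X", "X", "Y", "Y", "X", "X"],
    "start": [3, 0, 0, 8, 0, 4, 0],
    "plus":  [6, 2, 5, 3, 2, 4, 1],
    "minus": [4, 1, 7, 1, 3, 2, 0],
})

starts, ends = [], []
prev_key, prev_end = None, None
for row in df.itertuples(index=False):
    key = (row.key1, row.key2)
    # Within a group, carry the previous end forward; otherwise use the row's own start.
    start = prev_end if key == prev_key else row.start
    end = start + row.plus - row.minus
    starts.append(start)
    ends.append(end)
    prev_key, prev_end = key, end

# Assign the results once, instead of one df.loc write per row.
df["start"] = starts
df["end"] = ends
```

This is still a Python-level loop, so treat it as a way to trim per-row overhead rather than as a vectorized solution.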


0 votes

Because the apply method works row by row, it can be used as follows:

# a dictionary to follow-up keys and start value
d = {'AX': None,
     'AY': None,
     'BX': None}

def helper(row):
    # modify d inside this function
    global d
    # get key by concatenating key1+key2
    key = row.key1+row.key2
    # if key is already seen, use the stored value as start value
    if d[key] is not None:
        start = d[key]
    # if key is unseen, use the df1 start value
    else:
        start=row.start
    
    # calculate end value
    end = start + row.plus - row.minus
    
    # store the end value in dictionary
    # so that it can be used as start in next corresponding row
    d[key] = end
    # update
    return start,end

# update df1 start and end row-wise
df1[['start','end']] = df1.apply(helper,axis=1,result_type='expand')

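In the end, the updated df1 matches your df2, except in the B/X group, where this code keeps the start value of 4 from df1 instead of the 0 shown in df2.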
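
A variant sketch of the same idea, assuming the numeric columns already hold integers: the state lives in a closure instead of a global, and the dictionary is keyed by the (key1, key2) tuple, so the keys do not have to be listed in advance and a stored end value of 0 is still treated as seen. The names make_helper and last_end are hypothetical. Like any stateful function passed to apply, it relies on each row being visited exactly once and in order.

```python
import pandas as pd

def make_helper():
    last_end = {}  # (key1, key2) -> end value of the previous row in that group

    def helper(row):
        key = (row["key1"], row["key2"])
        # Use the stored end if this group was seen before, else the row's own start.
        start = last_end.get(key, row["start"])
        end = start + row["plus"] - row["minus"]
        last_end[key] = end
        return start, end

    return helper

df1 = pd.DataFrame({
    "key1":  ["A", "A", "A", "A", "A", "B", "B"],
    "key2":  ["X", "X", "X", "Y", "Y", "X", "X"],
    "start": [3, 0, 0, 8, 0, 4, 0],
    "plus":  [6, 2, 5, 3, 2, 4, 1],
    "minus": [4, 1, 7, 1, 3, 2, 0],
})

# result_type='expand' turns the returned tuples into columns 0 and 1.
out = df1.apply(make_helper(), axis=1, result_type="expand")
df1["start"] = out[0]
df1["end"] = out[1]
```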


0 votes

Code (plus some math)

keys = ['key1', 'key2']

cs = df1.groupby(keys)[['plus', 'minus']].cumsum()
start = df1.groupby(keys)['start'].transform('first')

df1['end'] = start + cs['plus'] - cs['minus']

Result

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      0     2      1    6
2    A    X      0     5      7    4
3    A    Y      8     3      1   10
4    A    Y      0     2      3    9
5    B    X      4     4      2    6
6    B    X      0     1      0    7

Explanation

Let's write out the formula for each row's end value:

end1 = `start1 + plus1 - minus1`
end2 = `end1 + plus2 - minus2` 
     = `start1 + (plus1 + plus2) - (minus1 + minus2)`
end3 = `end2 + plus3 - minus3`
     = `start1 + (plus1 + plus2 + plus3) - (minus1 + minus2 + minus3)`
....

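Looking at these formulas, a pattern emerges: each end value equals the group's first start value, plus the cumulative sum of the plus column, minus the cumulative sum of the minus column, up to and including that row.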
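
The pattern can be checked numerically on a single group. The sketch below uses the A/X rows from the question (start 3, then plus 6, 2, 5 and minus 4, 1, 7) and compares the row-by-row recurrence with the cumulative-sum form. The variable names are chosen here for illustration.

```python
import numpy as np

start1 = 3
plus = np.array([6, 2, 5])
minus = np.array([4, 1, 7])

# Row-by-row recurrence: each end becomes the next row's start.
ends_loop = []
current = start1
for p, m in zip(plus, minus):
    current = current + int(p) - int(m)
    ends_loop.append(current)

# Closed form: first start + cumulative plus - cumulative minus.
ends_closed = (start1 + np.cumsum(plus) - np.cumsum(minus)).tolist()
```

Both computations give the same end values for this group, which is what lets the loop be replaced by groupby(...).cumsum().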


0 votes

Use:

s = df1.groupby(['key1','key2'])['start'].transform('first')
df1['end'] = df1['plus'].sub(df1['minus']).groupby([df1['key1'],df1['key2']]).cumsum().add(s)
df1['start'] = df1.groupby(['key1','key2'])['end'].shift().fillna(df1['start']).astype(int)
print(df1)
  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7
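
For readability, the chained expressions above can be unpacked into separate steps. The following is a sketch under the assumption that the numeric columns are already integers. The function name roll_forward is hypothetical, and the sample rows are deliberately interleaved to illustrate that the groupby-based version does not need rows of the same group to be adjacent.

```python
import pandas as pd

def roll_forward(df, keys):
    """Hypothetical helper: fill 'start' and 'end' per group without a Python loop."""
    out = df.copy()
    group_keys = [out[k] for k in keys]
    # First start value of each group, broadcast to every row of that group.
    first_start = out.groupby(keys)["start"].transform("first")
    # Net change per row, accumulated within each group.
    net = out["plus"] - out["minus"]
    out["end"] = first_start + net.groupby(group_keys).cumsum()
    # Each row's start is the previous end in its group; first rows keep their own start.
    prev_end = out.groupby(keys)["end"].shift()
    out["start"] = prev_end.fillna(out["start"]).astype(out["end"].dtype)
    return out

df = pd.DataFrame({
    "key1":  ["A", "A", "A", "A", "A"],
    "key2":  ["X", "Y", "X", "Y", "X"],
    "start": [3, 8, 0, 0, 0],
    "plus":  [6, 3, 2, 2, 5],
    "minus": [4, 1, 1, 3, 7],
})
result = roll_forward(df, ["key1", "key2"])
```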