Python - Efficient calculation where one row's end value is the next row's start value

Question

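I want to perform a simple calculation on a rolling basis, but I run into serious performance problems when I try to solve it with nested for loops. I need to do this on very large data, and I have to use standard Python (including Pandas).

I have a pd.DataFrame (df1) that contains, broken down by some dimensions (call them key1 and key2), a start column, an end column, and some operation columns in between that should be used to compute the end column from the start column.

Basically, the logic is simply: start + plus - minus = end, where each row's end value is the next row's start value.

This needs to be done per pair of keys, i.e. separately for AX, AY and BX.

df2 shows the expected result, but I don't know how to get there efficiently, without blowing up my memory, when this is done on much larger tables.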

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df1
  key1 key2 start plus minus end
0    A    X     3    6     4   0
1    A    X     0    2     1   0
2    A    X     0    5     7   0
3    A    Y     8    3     1   0
4    A    Y     0    2     3   0
5    B    X     4    4     2   0
6    B    X     0    1     0   0
    

df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,1,6], ["A", "X", 6,5,7,4], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 0,4,2,2], ["B", "X", 2,1,0,3]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df2
  key1 key2 start plus minus end
0    A    X     3    6     4   5
1    A    X     5    2     1   6
2    A    X     6    5     7   4
3    A    Y     8    3     1  10
4    A    Y    10    2     3   9
5    B    X     0    4     2   2
6    B    X     2    1     0   3

Answer 1

You can combine astype, df.iterrows() and a for loop to do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

# Convert columns to integer
df[['start', 'plus', 'minus', 'end']] = df[['start', 'plus', 'minus', 'end']].astype(int)

# Start the row iterator
row_iterator = df.iterrows()
# take first item from row_iterator
_, last = next(row_iterator)
# Modify the first element
last['end'] = last['start'] + last['plus'] - last['minus']
df.loc[0, :] = last
# Iterate through the rest of the rows
for i, row in row_iterator:
    # Check the keys match
    if row['key1'] == last['key1'] and row['key2'] == last['key2']:
        # Add the end of last to the start of the next row
        row['start'] = last['end']
    # Calculate new end for row
    row['end'] = row['start'] + row['plus'] - row['minus']
    # Ensure the changes are shown in the original dataframe
    df.loc[i, :] = row
    # Last row is now the current row
    last = row

After running this, df is now:

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7

Note: your df2 has an error. If we follow the logic you provided, the start entry of row 5 should be 4, not 0.

Answer 2

Since the apply method works row by row, it can be used as follows:

# a dictionary to follow-up keys and start value
d = {'AX': None,
     'AY': None,
     'BX': None}

def helper(row):
    # modify d inside this function
    global d
    # get key by concatenating key1+key2
    key = row.key1+row.key2
    # if key is already seen, use the stored value as start value
    # (compare against None so that a stored end value of 0 is still used)
    if d[key] is not None:
        start = d[key]
    # if key is unseen, use the df1 start value
    else:
        start=row.start
    
    # calculate end value
    end = start + row.plus - row.minus
    
    # store the end value in dictionary
    # so that it can be used as start in next corresponding row
    d[key] = end
    # update
    return start,end

# update df1 start and end row-wise
df1[['start','end']] = df1.apply(helper,axis=1,result_type='expand')

In the end, the updated df1 is equivalent to your df2.
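The same row-by-row idea can be written without a module-level dictionary. The following is only a sketch under a few assumptions: `make_helper` and `last_end` are hypothetical names, the numeric columns are already integers (the `np.array` constructor in the question produces strings), and pandas calls the function exactly once per row, which is the case in recent versions.

```python
import pandas as pd

# Same data as df1 in the question, but built with integer columns.
df = pd.DataFrame(
    {
        "key1": ["A", "A", "A", "A", "A", "B", "B"],
        "key2": ["X", "X", "X", "Y", "Y", "X", "X"],
        "start": [3, 0, 0, 8, 0, 4, 0],
        "plus": [6, 2, 5, 3, 2, 4, 1],
        "minus": [4, 1, 7, 1, 3, 2, 0],
        "end": [0, 0, 0, 0, 0, 0, 0],
    }
)


def make_helper():
    # Hypothetical factory: the running end value per (key1, key2)
    # lives in a closure instead of a global dictionary.
    last_end = {}

    def helper(row):
        key = (row["key1"], row["key2"])
        # Membership test rather than truthiness, so a stored end
        # value of 0 is still carried forward as the next start.
        start = last_end[key] if key in last_end else row["start"]
        end = start + row["plus"] - row["minus"]
        last_end[key] = end
        return start, end

    return helper


# Rows are processed in order, so each group's state rolls forward.
df[["start", "end"]] = df.apply(make_helper(), axis=1, result_type="expand")
```

This is still a Python-level loop over rows, so it is unlikely to scale as well as a grouped, vectorized approach on very large tables.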

Answer 3

Code (+ some math)

keys = ['key1', 'key2']

cs = df1.groupby(keys)[['plus', 'minus']].cumsum()
start = df1.groupby(keys)['start'].transform('first')

df1['end'] = start + cs['plus'] - cs['minus']

Result

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      0     2      1    6
2    A    X      0     5      7    4
3    A    Y      8     3      1   10
4    A    Y      0     2      3    9
5    B    X      4     4      2    6
6    B    X      0     1      0    7

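Explanation

Let's work out each row's value using the formula

end1 = `start1 + plus1 - minus1`
end2 = `end1 + plus2 - minus2`
     = `start1 + (plus1 + plus2) - (minus1 + minus2)`
end3 = `end2 + plus3 - minus3`
     = `start1 + (plus1 + plus2 + plus3) - (minus1 + minus2 + minus3)`
....

If you look at the formulas, there is a visible pattern: within each group, the end value equals the group's first start value plus the cumulative sum of the plus column minus the cumulative sum of the minus column.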
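To check the pattern on the sample data, the closed form can be compared against the recurrence applied one row at a time. This is a minimal sketch that assumes integer columns; `end_vectorized` and `end_loop` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key1": ["A", "A", "A", "A", "A", "B", "B"],
        "key2": ["X", "X", "X", "Y", "Y", "X", "X"],
        "start": [3, 0, 0, 8, 0, 4, 0],
        "plus": [6, 2, 5, 3, 2, 4, 1],
        "minus": [4, 1, 7, 1, 3, 2, 0],
    }
)
keys = ["key1", "key2"]

# Closed form: group start + cumulative plus - cumulative minus.
cs = df.groupby(keys)[["plus", "minus"]].cumsum()
first_start = df.groupby(keys)["start"].transform("first")
end_vectorized = first_start + cs["plus"] - cs["minus"]

# Reference: apply end = start + plus - minus row by row in each group,
# feeding each end value into the next row.
end_loop = pd.Series(0, index=df.index)
for _, group in df.groupby(keys):
    running = group["start"].iloc[0]
    for idx, plus, minus in zip(group.index, group["plus"], group["minus"]):
        running = running + plus - minus
        end_loop.loc[idx] = running

print(end_vectorized.tolist())  # [5, 6, 4, 10, 9, 6, 7]
```

Both series hold the same values, which is what the algebra above predicts.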

Answer 4

Use:

s = df1.groupby(['key1','key2'])['start'].transform('first')
df1['end'] = df1['plus'].sub(df1['minus']).groupby([df1['key1'],df1['key2']]).cumsum().add(s)
df1['start'] = df1.groupby(['key1','key2'])['end'].shift().fillna(df1['start']).astype(int)
print (df1)
  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7
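One detail worth noting: the `np.array` constructor in the question stores every value as a string, so the numeric columns have to be cast before any grouped arithmetic works. Below is a sketch that wraps the steps above in a function that leaves its input untouched; `roll_forward` is a hypothetical name, not part of pandas.

```python
import numpy as np
import pandas as pd


def roll_forward(df, keys=("key1", "key2")):
    """Hypothetical helper: return a copy with start/end rolled forward per group."""
    keys = list(keys)
    # Cast the string columns produced by np.array to integers (returns a copy).
    out = df.astype({"start": int, "plus": int, "minus": int})
    first = out.groupby(keys)["start"].transform("first")
    net = out["plus"] - out["minus"]
    # end = first start of the group + cumulative net change within the group
    out["end"] = net.groupby([out[k] for k in keys]).cumsum() + first
    # start = previous end within the group; the first row keeps its own start
    prev_end = out.groupby(keys)["end"].shift()
    out["start"] = prev_end.fillna(out["start"]).astype(int)
    return out


df1 = pd.DataFrame(
    np.array([["A", "X", 3, 6, 4, 0], ["A", "X", 0, 2, 1, 0], ["A", "X", 0, 5, 7, 0],
              ["A", "Y", 8, 3, 1, 0], ["A", "Y", 0, 2, 3, 0],
              ["B", "X", 4, 4, 2, 0], ["B", "X", 0, 1, 0, 0]]),
    columns=["key1", "key2", "start", "plus", "minus", "end"],
)
result = roll_forward(df1)
```

Because everything is expressed as grouped column operations, no Python-level loop over rows is involved, which is usually what matters for large tables.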
