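I want to do a simple calculation on a rolling basis, but I run into serious performance problems when I try to solve it with nested for loops. I need to perform this kind of operation on very large data, and I have to use standard Python (including Pandas).
I have a pd.DataFrame (df1) that contains, for a set of dimensions (let's call them key1 and key2), a start column, an end column, and some operation columns in between, which should be used to calculate the end column from the start column.
Basically, the simple logic is: start + plus - minus = end, where each row's end value is the next row's start value.
This needs to be done per pair of keys, i.e. separately for AX, AY and BX.
df2 shows the expected result, but I don't know how to get there efficiently, without blowing up my memory, when this is done on a much larger table.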
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])
>>> df1
key1 key2 start plus minus end
0 A X 3 6 4 0
1 A X 0 2 1 0
2 A X 0 5 7 0
3 A Y 8 3 1 0
4 A Y 0 2 3 0
5 B X 4 4 2 0
6 B X 0 1 0 0
df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,1,6], ["A", "X", 6,5,7,4], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 0,4,2,2], ["B", "X", 2,1,0,3]]),
columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])
>>> df2
key1 key2 start plus minus end
0 A X 3 6 4 5
1 A X 5 2 1 6
2 A X 6 5 7 4
3 A Y 8 3 1 10
4 A Y 10 2 3 9
5 B X 0 4 2 2
6 B X 2 1 0 3
You can combine astype, df.iterrows() and a for loop to do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])
# Convert columns to integer
df[['start', 'plus', 'minus', 'end']] = df[['start', 'plus', 'minus', 'end']].astype(int)
# Start the row iterator
row_iterator = df.iterrows()
# Take the first item from row_iterator
_, last = next(row_iterator)
# Modify the first element
last['end'] = last['start'] + last['plus'] - last['minus']
df.loc[0, :] = last
# Iterate through the rest of the rows
for i, row in row_iterator:
    # Check the keys match
    if row['key1'] == last['key1'] and row['key2'] == last['key2']:
        # Carry the end of the last row over as the start of this row
        row['start'] = last['end']
    # Calculate the new end for this row
    row['end'] = row['start'] + row['plus'] - row['minus']
    # Ensure the changes are shown in the original dataframe
    df.loc[i, :] = row
    # The last row is now the current row
    last = row
After this runs, df is:
key1 key2 start plus minus end
0 A X 3 6 4 5
1 A X 5 2 1 6
2 A X 6 5 7 4
3 A Y 8 3 1 10
4 A Y 10 2 3 9
5 B X 4 4 2 6
6 B X 6 1 0 7
Note: your df2 has an error. If we follow the logic you provided, the start entry of row 5 should be 4, not 0.
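The astype(int) step above matters because np.array picks a single dtype for the whole array, so mixed rows of letters and numbers end up as strings. A minimal sketch of the difference, using only the first two rows of the question's data (the names from_array and from_rows are made up for this illustration):

```python
import numpy as np
import pandas as pd

cols = ["key1", "key2", "start", "plus", "minus", "end"]
rows = [["A", "X", 3, 6, 4, 0], ["A", "X", 0, 2, 1, 0]]

# Via np.array: one common dtype is chosen for the whole array, here a string type
from_array = pd.DataFrame(np.array(rows), columns=cols)
# Directly from the list of rows: pandas infers a dtype per column
from_rows = pd.DataFrame(rows, columns=cols)

first_from_array = from_array.loc[0, "start"]  # a string
first_from_rows = from_rows.loc[0, "start"]    # an integer
```

Building the frame straight from the list of rows is one way to skip the conversion step; whether that is possible depends on how the real data is loaded.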
Since the apply method works row by row, it can be used as follows:
# df1 was built from a mixed-type NumPy array, so its numeric columns hold strings; convert them first
num_cols = ['start', 'plus', 'minus', 'end']
df1[num_cols] = df1[num_cols].astype(int)
# a dictionary to keep track of each key and its latest end value
d = {'AX': None,
     'AY': None,
     'BX': None}
def helper(row):
    # d is modified inside this function
    global d
    # get key by concatenating key1+key2
    key = row.key1+row.key2
    # if key is already seen, use the stored value as start value
    # (compare with None explicitly, because a stored end value of 0 is falsy)
    if d[key] is not None:
        start = d[key]
    # if key is unseen, use the df1 start value
    else:
        start = row.start
    # calculate end value
    end = start + row.plus - row.minus
    # store the end value in dictionary
    # so that it can be used as start in next corresponding row
    d[key] = end
    # return both values for this row
    return start, end
# update df1 start and end row-wise
df1[['start','end']] = df1.apply(helper,axis=1,result_type='expand')
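In the end, the updated df1 matches your df2, except for the B/X rows: df1 starts that group at 4, whereas your df2 starts it at 0.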
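The same dictionary idea can be written without a global and without listing the keys up front. The sketch below is a variant, not the code above: it swaps apply for a plain loop over itertuples, assumes the numeric columns are already integers, and uses a hypothetical function name carry_forward.

```python
import pandas as pd

def carry_forward(df):
    """Fill start/end row by row, carrying each group's end into its next start."""
    last_end = {}  # (key1, key2) -> end value of the previous row in that group
    starts, ends = [], []
    for row in df.itertuples(index=False):
        key = (row.key1, row.key2)
        # First row of a group keeps its own start; later rows take the previous end
        start = last_end.get(key, row.start)
        end = start + row.plus - row.minus
        last_end[key] = end
        starts.append(start)
        ends.append(end)
    # Return a new frame instead of modifying the input
    return df.assign(start=starts, end=ends)

sample = pd.DataFrame(
    [["A", "X", 3, 6, 4, 0], ["A", "X", 0, 2, 1, 0], ["A", "X", 0, 5, 7, 0],
     ["A", "Y", 8, 3, 1, 0], ["A", "Y", 0, 2, 3, 0],
     ["B", "X", 4, 4, 2, 0], ["B", "X", 0, 1, 0, 0]],
    columns=["key1", "key2", "start", "plus", "minus", "end"],
)
result = carry_forward(sample)
```

This is still a Python-level loop over every row, so it is unlikely to be the fastest option on a very large table.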
keys = ['key1', 'key2']
cs = df1.groupby(keys)[['plus', 'minus']].cumsum()
start = df1.groupby(keys)['start'].transform('first')
df1['end'] = start + cs['plus'] - cs['minus']
key1 key2 start plus minus end
0 A X 3 6 4 5
1 A X 0 2 1 6
2 A X 0 5 7 4
3 A Y 8 3 1 10
4 A Y 0 2 3 9
5 B X 4 4 2 6
6 B X 0 1 0 7
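Let's write out the formula for each row's value:
end1 = `start1 + plus1 - minus1`
end2 = `end1 + plus2 - minus2`
     = `start1 + (plus1 + plus2) - (minus1 + minus2)`
end3 = `end2 + plus3 - minus3`
     = `start1 + (plus1 + plus2 + plus3) - (minus1 + minus2 + minus3)`
....
If you look at the formulas, there is a visible pattern: the end value equals the group's first start value, plus the cumulative sum of the plus column, minus the cumulative sum of the minus column.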
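As a quick check of that pattern, here is a small sketch on the A/X group from the question (start 3, plus 6, 2, 5, minus 4, 1, 7). It compares the row-by-row recurrence with the cumulative-sum form; the variable names are made up for this illustration.

```python
import numpy as np

start1 = 3
plus = np.array([6, 2, 5])
minus = np.array([4, 1, 7])

# Row-by-row recurrence: each end becomes the next start
ends_loop = []
current = start1
for p, m in zip(plus, minus):
    current = current + p - m
    ends_loop.append(int(current))

# Closed form: first start + cumulative plus - cumulative minus
ends_cumsum = (start1 + np.cumsum(plus) - np.cumsum(minus)).tolist()

print(ends_loop)    # [5, 6, 4]
print(ends_cumsum)  # [5, 6, 4]
```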
Use:
s = df1.groupby(['key1','key2'])['start'].transform('first')
df1['end'] = df1['plus'].sub(df1['minus']).groupby([df1['key1'],df1['key2']]).cumsum().add(s)
df1['start'] = df1.groupby(['key1','key2'])['end'].shift().fillna(df1['start']).astype(int)
print (df1)
key1 key2 start plus minus end
0 A X 3 6 4 5
1 A X 5 2 1 6
2 A X 6 5 7 4
3 A Y 8 3 1 10
4 A Y 10 2 3 9
5 B X 4 4 2 6
6 B X 6 1 0 7
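On a large table it can help to check the result without looping. The sketch below is one way to do that, assuming integer columns and the column names from the question; check_rolling is a hypothetical helper name. It tests that every row satisfies start + plus - minus = end and that, within each key pair, every start equals the previous row's end.

```python
import pandas as pd

def check_rolling(df, keys=("key1", "key2")):
    """Return True if the start/end chain is consistent within every group."""
    keys = list(keys)
    # Each row on its own: start + plus - minus must equal end
    row_ok = (df["start"] + df["plus"] - df["minus"]).eq(df["end"])
    # Within a group: start must equal the previous row's end (first rows have no previous row)
    prev_end = df.groupby(keys)["end"].shift()
    chain_ok = prev_end.isna() | prev_end.eq(df["start"])
    return bool(row_ok.all() and chain_ok.all())

good = pd.DataFrame(
    [["A", "X", 3, 6, 4, 5], ["A", "X", 5, 2, 1, 6], ["A", "X", 6, 5, 7, 4],
     ["B", "X", 4, 4, 2, 6], ["B", "X", 6, 1, 0, 7]],
    columns=["key1", "key2", "start", "plus", "minus", "end"],
)
bad = good.copy()
bad.loc[1, "start"] = 0  # break the chain: start no longer equals the previous end

good_ok = check_rolling(good)
bad_ok = check_rolling(bad)
```

Note that this only checks internal consistency, so it cannot tell whether the first start value of each group is the intended one.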