Replacing a FOR loop that multiplies df1['value'] by param['factor'] when the key columns of 'param' and 'df1' match

Question

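For performance reasons, I am trying to replace a FOR loop with "apply" or another suitable function. Serge Ballesta explained why the code snippet with "apply" below does not work, thanks to him. The first snippet uses a FOR loop and gives the expected result. The aim of the FOR loop is to multiply df1['value'] by param['factor'] into a new column df1['new_value'] whenever the combination of param['table'], param['y'], param['x'], param['t'] matches df1['table'], df1['y'], df1['x'], df1['t'].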

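When 'df1' contains 5 million rows and 'param' 1000 rows, it takes 30 minutes to run. Maybe with a dictionary or a map? The second snippet uses the "apply" function.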

import numpy as np
import pandas as pd

df1 = pd.DataFrame( {
   'date': ['31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019'],
   'id': ['X1','X1','X1','X1','X2','X2','X2','X2','X1','X1','X1','X1','X2','X2','X2','X2'],
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310,300,310,300,310,300,310,300,310,300,310,300,310],
   'x': [10,20,10,10,20,20,10,10,20,20,40,40,40,10,10,10],
   't': ['o','o','o','o','o','o','o','o','o','o','o','o','o','o','o','o'],
   'value': [0.37,0.98,3,45,0.76,12,14,31,51,1.7,12,14,12,19,123,43]
    } );

param = pd.DataFrame( {
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310],
   'x': [10,20,30,10],
   't': ['o','o','o','o'],
   'factor': [12,34,22,43]
    } );


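# Float column, so that assigning value * factor does not change its dtype
df1['new_value'] = 0.0


def CALC(df, table, y, x, t, factor):
    # Rows of df whose key columns match this parameter combination
    mask = ((df['table'] == table) & (df['y'] == y)
            & (df['x'] == x) & (df['t'] == t))
    df.loc[mask, 'new_value'] = df.loc[mask, 'value'] * factor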

# Access the parameter columns by name rather than by position, so the
# loop does not depend on the column order of param.
for row in param.itertuples(index=False):
    CALC(df1, row.table, row.y, row.x, row.t, row.factor)

# second snippet: with the apply

import numpy as np
import pandas as pd 

df1 = pd.DataFrame( {
   'date': ['31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019'],
   'id': ['X1','X1','X1','X1','X2','X2','X2','X2','X1','X1','X1','X1','X2','X2','X2','X2'],
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310,300,310,300,310,300,310,300,310,300,310,300,310],
   'x': [10,20,10,10,20,20,10,10,20,20,40,40,40,10,10,10],
   't': ['o','o','o','o','o','o','o','o','o','o','o','o','o','o','o','o'],
   'value': [0.37,0.98,3,45,0.76,12,14,31,51,1.7,12,14,12,19,123,43]
    } );

param = pd.DataFrame( {
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310],
   'x': [10,20,30,10],
   't': ['o','o','o','o'],
   'factor': [12,34,22,43]
    } );


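# Float column, so that assigning value * factor does not change its dtype
df1['new_value'] = 0.0


def CALC(df, table, y, x, t, factor):
    # Rows of df whose key columns match this parameter combination
    mask = ((df['table'] == table) & (df['y'] == y)
            & (df['x'] == x) & (df['t'] == t))
    df.loc[mask, 'new_value'] = df.loc[mask, 'value'] * factor
    return df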

df1['new_value'] = param.apply(lambda row: CALC(df1,param['table'],param['y'],param['x'],param['t'],param['factor']))
python performance for-loop apply
1 Answer

Oops, it looks like you have not understood what apply does!

First, to apply a function to each row of a dataframe, you must pass axis=1 as an argument, because by default apply works on columns.
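To make the difference concrete, here is a minimal sketch on a small stand-in for param (the frame and the lambdas are made up for illustration, they are not part of the question's data):

```python
import pandas as pd

# Small illustrative frame, not the question's full param table
param = pd.DataFrame({
    'table': ['TABLE1', 'TABLE2'],
    'y': [300, 310],
    'factor': [12, 43],
})

# Default (axis=0): the function receives each COLUMN as a Series
per_column = param.apply(lambda col: col.name)

# axis=1: the function receives each ROW as a Series indexed by column name
per_row = param.apply(lambda row: row['y'] * row['factor'], axis=1)

print(per_column.tolist())  # ['table', 'y', 'factor']
print(per_row.tolist())     # [3600, 13330]
```

With the default axis, row['table'] inside the lambda would be looked up in a column's index, not in a row, which is why the original call cannot work as intended.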

Next (with axis=1), the function is applied to each row, and the returned values are used to build the Series or DataFrame that apply returns. So, since CALC returns df1,

df1['new_value'] = param.apply(lambda row: CALC(df1,row['table'],row['y'],row['x'],
                                                row['t'],row['factor']), axis=1)

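will write a reference to df1 in every row of the new_value column! Because of the circular reference, any attempt to just print df1 will cause a stack overflow error...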

Since CALC already modifies df1['new_value'] in place, you must not assign the result like that. Assign it to a dummy variable instead:

_ = param.apply(lambda row: CALC(df1,row['table'],row['y'],row['x'], row['t'],row['factor']),
                axis=1)
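
Note that apply with axis=1 still calls CALC once per row of param, so it is unlikely to be much faster than the FOR loop. As a possible alternative (this is a different technique from the apply fix above, sketched here on a shortened copy of the data), a left merge on the key columns does the matching in one vectorized step. It assumes that each (table, y, x, t) combination appears at most once in param; validate='many_to_one' makes pandas raise an error if that does not hold.

```python
import pandas as pd

# Shortened illustrative data with the same columns as in the question
df1 = pd.DataFrame({
    'table': ['TABLE1', 'TABLE1', 'TABLE2', 'TABLE2'],
    'y': [300, 310, 300, 310],
    'x': [10, 20, 10, 10],
    't': ['o', 'o', 'o', 'o'],
    'value': [0.5, 2.0, 3.0, 45.0],
})
param = pd.DataFrame({
    'table': ['TABLE1', 'TABLE1', 'TABLE2', 'TABLE2'],
    'y': [300, 310, 300, 310],
    'x': [10, 20, 30, 10],
    't': ['o', 'o', 'o', 'o'],
    'factor': [12, 34, 22, 43],
})

keys = ['table', 'y', 'x', 't']

# Left merge keeps every row of df1; rows without a match get factor = NaN.
# Assumption: key combinations are unique in param (checked by validate).
merged = df1.merge(param, on=keys, how='left', validate='many_to_one')

# Unmatched rows get 0, like the df1['new_value'] = 0 initialisation
merged['new_value'] = (merged['value'] * merged['factor']).fillna(0)
result = merged.drop(columns='factor')

print(result['new_value'].tolist())  # [6.0, 68.0, 0.0, 1935.0]
```

The third row has no matching x in param, so it keeps 0, which is what the loop version leaves there as well. I have not timed this on 5 million rows, so treat the speed-up as something to measure rather than a guarantee.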