Replacing a FOR loop that multiplies df1['value'] by param['factor'] when the key combinations in 'param' and 'df1' match


I would like to replace a FOR loop with a more suitable approach, for performance reasons.

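The purpose of the FOR loop is to multiply df1['value'] by param['factor'] and store the result in a new column, df1['new_value'], whenever the combination of param['table'], param['y'], param['x'], param['t'] matches the combination of df1['table'], df1['y'], df1['x'], df1['t'].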

The first code snippet uses the FOR loop and gives the expected result.

When 'df1' contains 5 million rows and 'param' 1,000 rows, the run takes 30 minutes. Perhaps a dictionary or a mapping would help?

The second code snippet uses the 'apply' function. Serge Ballesta explained why that snippet does not work with 'apply'; thanks to him.

import numpy as np
import pandas as pd 

df1 = pd.DataFrame( {

   'date': ['31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019'],
   'id': ['X1','X1','X1','X1','X2','X2','X2','X2','X1','X1','X1','X1','X2','X2','X2','X2'],
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310,300,310,300,310,300,310,300,310,300,310,300,310],
   'x': [10,20,10,10,20,20,10,10,20,20,40,40,40,10,10,10],
   't': ['o','o','o','o','o','o','o','o','o','o','o','o','o','o','o','o'],
   'value': [0.37,0.98,3,45,0.76,12,14,31,51,1.7,12,14,12,19,123,43]
    } );

param = pd.DataFrame( {
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310],
   'x': [10,20,30,10],
   't': ['o','o','o','o'],
   'factor': [12,34,22,43]
    } );


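# Use a float default so the products can be stored without a dtype change.
df1['new_value'] = 0.0


def CALC(df, table, y, x, t, factor):
    # Select the rows whose key columns match this parameter row.
    mask = ((df['table'] == table) & (df['y'] == y)
            & (df['x'] == x) & (df['t'] == t))
    df.loc[mask, 'new_value'] = df.loc[mask, 'value'] * factor

# Read the fields by name: positional indices into param.values.tolist()
# depend on the column order of param and are easy to get wrong.
for row in param.itertuples(index=False):
    CALC(df1, row.table, row.y, row.x, row.t, row.factor)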

# second snippet: with the apply

import numpy as np
import pandas as pd 

df1 = pd.DataFrame( {
   'date': ['31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2018','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019','31/12/2019'],
   'id': ['X1','X1','X1','X1','X2','X2','X2','X2','X1','X1','X1','X1','X2','X2','X2','X2'],
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2','TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310,300,310,300,310,300,310,300,310,300,310,300,310],
   'x': [10,20,10,10,20,20,10,10,20,20,40,40,40,10,10,10],
   't': ['o','o','o','o','o','o','o','o','o','o','o','o','o','o','o','o'],
   'value': [0.37,0.98,3,45,0.76,12,14,31,51,1.7,12,14,12,19,123,43]
    } );

param = pd.DataFrame( {
   'table': ['TABLE1','TABLE1','TABLE2','TABLE2'],
   'y': [300,310,300,310],
   'x': [10,20,30,10],
   't': ['o','o','o','o'],
   'factor': [12,34,22,43]
    } );


df1['new_value'] = 0


def CALC(df, table, y, x, t, factor):
    df.loc[(df['table'] == table ) & 
            (df['y']== y) 
            & (df['x']== x ) & (df['t']== t),['new_value']] = df['value']*factor
    return(df)

df1['new_value'] = param.apply(lambda row: CALC(df1,param['table'],param['y'],param['x'],param['t'],param['factor']))
python performance for-loop apply
1 Answer

Oops, it looks like you have not yet understood what apply does.

First, to apply a function to every row of a DataFrame, you must pass axis=1, because by default apply works on columns.
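To make the difference between the two axes concrete, here is a minimal sketch. The frame `frame` and its columns `a` and `b` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the axis argument.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default (axis=0): the function receives each COLUMN as a Series.
per_column = frame.apply(lambda col: col.sum())

# axis=1: the function receives each ROW as a Series indexed by column name.
per_row = frame.apply(lambda row: row["a"] * row["b"], axis=1)

print(per_column)  # one value per column
print(per_row)     # one value per row
```

With the default axis, `row['table']` inside the lambda would be looked up on a whole column rather than on a single record, which is why the call in the question cannot work as written.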

Then (with axis=1), the function is applied to every row, and the return values are used to build the Series or DataFrame that apply returns. So, since CALC returns df1,

df1['new_value'] = param.apply(lambda row: CALC(df1,row['table'],row['y'],row['x'],
                                                row['t'],row['factor']), axis=1)

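will put a reference to df1 in every row of the new_value column! Any attempt to merely print df1 will then cause a stack overflow error because of the circular references...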

Because CALC modifies df1['new_value'] in place, you must not assign the result that way; assign it to a dummy variable instead:

_ = param.apply(lambda row: CALC(df1,row['table'],row['y'],row['x'], row['t'],row['factor']),
                axis=1)

The apply call above gives the expected result, but it is no more efficient than the first snippet. In fact, the idiomatic way is to use a merge:

df1['new_value'] = df1['value'] * df1.merge(
    param, on=['table', 'y', 'x', 't'], how='left')['factor'].fillna(0.)

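No outer loop is involved here. Using timeit, it is 3 times faster for a 15-row DataFrame, and the benefit is probably higher for larger ones. This is not really a surprise, because apply generally involves a loop at the Python level.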
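As a hedged sketch of the merge approach, the version below wraps it in a helper. The name `scale_with_merge` and the small demo frames are assumptions made for illustration, not part of the original answer. It adds two safeguards: `validate="many_to_one"` makes pandas raise an error if `param` contains duplicate key combinations (which would otherwise duplicate rows of the left frame), and the factor is applied positionally so the result does not rely on the left frame having a default index.

```python
import pandas as pd

KEYS = ["table", "y", "x", "t"]


def scale_with_merge(df, param, keys=KEYS):
    """Return df['value'] * factor where the keys match, 0 elsewhere.

    Hypothetical helper: a sketch of the merge idea, not a drop-in API.
    """
    # Left merge keeps one output row per row of df, in df's order.
    # validate="many_to_one" raises if param has duplicate key rows.
    merged = df[keys].merge(param, on=keys, how="left",
                            validate="many_to_one")
    # Rows without a match get NaN as factor; treat that as 0.
    factor = merged["factor"].fillna(0.0).to_numpy()
    # Multiplying by a NumPy array is positional, so df's index is kept.
    return df["value"] * factor


# Small made-up frames with a non-default index on purpose.
demo_df = pd.DataFrame({
    "table": ["TABLE1", "TABLE1", "TABLE2"],
    "y": [300, 310, 300],
    "x": [10, 20, 10],
    "t": ["o", "o", "o"],
    "value": [0.5, 2.0, 3.0],
}, index=[10, 11, 12])

demo_param = pd.DataFrame({
    "table": ["TABLE1", "TABLE1"],
    "y": [300, 310],
    "x": [10, 20],
    "t": ["o", "o"],
    "factor": [12, 34],
})

demo_result = scale_with_merge(demo_df, demo_param)
print(demo_result)
```

On the question's data this would be used as `df1['new_value'] = scale_with_merge(df1, param)`. Whether it brings the 30-minute run down to an acceptable time on 5 million rows would need to be measured.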
