How can I speed up my code that iterates over two DataFrames with 100K rows, to reduce processing time in Python?

Question. Votes: -1. Answers: 2

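Could you please take a look at my code and give me some advice on improving it so that it takes less time to run? The main goal is to go through every row (ID) of the test table and look for the same ID in the list table. If there is a match, I compute the time difference between the two rows with that ID and mark whether it is less than 1 hour (3600 s) or not. Thanks in advance.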

test.csv has two columns (ID, Time) and 100K rows. list.csv has two columns (ID, Time) and 40K rows.

Sample data:

ID                 Time
83d-36615fa05fb0   2019-12-11 10:41:48

a = -1
for row_index,row in test.iterrows():
   a = a + 1

   for row_index2,row2 in list.iterrows():

       if row['ID'] == row2['ID']:
           time_difference = row['Time'] - row2['Time']
           time_difference = time_difference.total_seconds() 

           if time_difference < 3601 and time_difference > 0:
               test.loc[a, 'result'] = "short waiting time"
python dataframe for-loop if-statement bigdata
2 Answers
0
votes

The code can be rewritten as follows.

def find_match(id_val, time_val, df):
    " This uses your algorithm for generating a string based upon time difference "
    # Find rows with matching ID in Dataframe df 
    #   (this will be list_ref in our usage)
    matches = df[df['ID'] == id_val]

    if not matches.empty:
        # Use iterrows on matched rows
        # This replaced your inner loop but only apply to rows with the same ID
        for row_index2, row2 in matches.iterrows():
            time_difference = time_val - row2['Time']
            time_difference = time_difference.total_seconds()
            if 0 < time_difference < 3601:
                return "short waiting time"  # Exit since found a time in window

    return ""  # Didn't find a time in our window

# Test
#     test - your test Dataframe
#     list_ref - (your list DataFrame, but using list_ref since list is a bad 
#                name for a Python variable)


# Create result using Apply
# More info on Apply available https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

test['result'] = test.apply(lambda row: find_match(row['ID'], row['Time'], list_ref), axis=1)
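The mask df['ID'] == id_val inside find_match still scans all of list_ref once for every row of test. One possible refinement, sketched below, is to group the reference times by ID a single time and look them up from a dictionary. The sample data is made up for illustration, and the names times_by_id and has_short_wait are hypothetical, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the real frames would be read from test.csv and list.csv.
test = pd.DataFrame({
    "ID": ["a", "a", "b", "c"],
    "Time": pd.to_datetime([
        "2019-12-11 10:41:48",
        "2019-12-11 15:00:00",
        "2019-12-11 09:00:00",
        "2019-12-11 12:00:00",
    ]),
})
list_ref = pd.DataFrame({
    "ID": ["a", "b", "b"],
    "Time": pd.to_datetime([
        "2019-12-11 10:00:00",
        "2019-12-11 09:30:00",
        "2019-12-11 08:59:00",
    ]),
})

# Build the ID -> array-of-times lookup once, instead of filtering per row.
times_by_id = {
    id_val: times.to_numpy()
    for id_val, times in list_ref.groupby("ID")["Time"]
}


def has_short_wait(id_val, time_val):
    """Return the label if any reference time lies within the window before time_val."""
    ref_times = times_by_id.get(id_val)
    if ref_times is None:
        return ""  # ID does not occur in list_ref
    # Differences in seconds between this test time and every reference time.
    diff_seconds = (time_val - ref_times) / np.timedelta64(1, "s")
    in_window = (diff_seconds > 0) & (diff_seconds < 3601)
    return "short waiting time" if in_window.any() else ""


# Iterate over plain column values rather than building a Series per row.
test["result"] = [
    has_short_wait(id_val, time_val)
    for id_val, time_val in zip(test["ID"], test["Time"].to_numpy())
]
print(test)
```

This keeps the same 0 < difference < 3601 window as the question. Whether it is faster on the real data depends on how many rows share each ID, so it is worth timing on a sample first.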

0
votes

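You shouldn't iterate over the rows of a DataFrame at all. Instead, combine the test and list DataFrames with the merge or join method, using 'ID' as the 'on' join key. That leaves you with all the data in one big table, where you create a new column by subtracting the second table's time (Time_y) from the first table's time (Time_x). Call this new column 'timediff', for example. Then filter the resulting DataFrame on that column with timediff < 3601.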

Assume the DataFrames are named testdf and listdf.

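# Both frames have a 'Time' column, so merge renames them Time_x (testdf) and Time_y (listdf)
joindf = testdf.merge(listdf, on='ID')
# Convert the Timedelta column to seconds so it can be compared with a number
joindf['timediff'] = (joindf['Time_x'] - joindf['Time_y']).dt.total_seconds()
# Same window as in the question: more than 0 and less than 3601 seconds
joindf.loc[(joindf['timediff'] > 0) & (joindf['timediff'] < 3601), 'result'] = 'short waiting time'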
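The merged table has one row per matching (test, list) pair, so the label still has to be carried back to the rows of testdf. A minimal end-to-end sketch of one way to do that follows. The sample data is made up, the suffixes _test and _list are chosen here for readability, and an empty string for rows without a match is an assumption borrowed from the other answer.

```python
import pandas as pd

# Made-up sample data; the real frames would be read from test.csv and list.csv.
testdf = pd.DataFrame({
    "ID": ["a", "a", "b", "c"],
    "Time": pd.to_datetime([
        "2019-12-11 10:41:48",
        "2019-12-11 15:00:00",
        "2019-12-11 09:00:00",
        "2019-12-11 12:00:00",
    ]),
})
listdf = pd.DataFrame({
    "ID": ["a", "b", "b"],
    "Time": pd.to_datetime([
        "2019-12-11 10:00:00",
        "2019-12-11 09:30:00",
        "2019-12-11 08:59:00",
    ]),
})

# reset_index() keeps the original testdf row label in a column named "index",
# so each merged pair remembers which test row it came from.
joindf = testdf.reset_index().merge(listdf, on="ID", suffixes=("_test", "_list"))

# Time difference in seconds for every (test row, list row) pair with the same ID.
joindf["timediff"] = (joindf["Time_test"] - joindf["Time_list"]).dt.total_seconds()

# Keep pairs inside the window used in the question: 0 < timediff < 3601.
in_window = (joindf["timediff"] > 0) & (joindf["timediff"] < 3601)
matched_rows = joindf.loc[in_window, "index"].unique()

# Label the test rows that have at least one pair inside the window.
testdf["result"] = ""
testdf.loc[matched_rows, "result"] = "short waiting time"
print(testdf)
```

If a single ID appears many times in both files, the merged table can grow much larger than either input, so memory use is worth checking on the real data before relying on this.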