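Could you please take a look at my code and give me some advice on improving it to reduce the processing time? The main goal is to go through every row (ID) of the test table, find the same ID in the list table, and, when there is a match, look at the time difference between the two rows with the same ID and label whether it is less than 1 hour (3600 s) or not. Thanks in advance.
test.csv has two columns (ID, Time) and 100K rows. list.csv has two columns (ID, Time) and 40K rows.
Sample data:
ID                Time
83d-36615fa05fb0  2019-12-11 10:41:48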
a = -1
for row_index, row in test.iterrows():
    a = a + 1
    for row_index2, row2 in list.iterrows():
        if row['ID'] == row2['ID']:
            time_difference = row['Time'] - row2['Time']
            time_difference = time_difference.total_seconds()
            if time_difference < 3601 and time_difference > 0:
                test.loc[a, 'result'] = "short waiting time"
The code can be rewritten as follows.
def find_match(id_val, time_val, df):
    " This uses your algorithm for generating a string based upon time difference "
    # Find rows with matching ID in Dataframe df
    # (this will be list_ref in our usage)
    matches = df[df['ID'] == id_val]
    if not matches.empty:
        # Use iterrows on matched rows
        # This replaced your inner loop but only apply to rows with the same ID
        for row_index2, row2 in matches.iterrows():
            time_difference = time_val - row2['Time']
            time_difference = time_difference.total_seconds()
            if 0 < time_difference < 3601:
                return "short waiting time"  # Exit since found a time in window
    return ""  # Didn't find a time in our window

# Test
# test - your test Dataframe
# list_ref - (your list DataFrame, but using list_ref since list is a bad
#             name for a Python variable)
# Create result using Apply
# More info on Apply available https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
test['result'] = test.apply(lambda row: find_match(row['ID'], row['Time'], list_ref), axis=1)
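
As a further variation on the apply approach, the remaining inner iterrows loop could be replaced by a vectorized comparison, and the reference times could be grouped by ID once up front instead of filtering list_ref for every test row. The sketch below is only an illustration under assumptions: find_match_vectorized and times_by_id are hypothetical names, the sample rows are made up, and the Time columns are assumed to already hold datetimes.

```python
import pandas as pd

def find_match_vectorized(id_val, time_val, times_by_id):
    """Hypothetical variant of find_match that checks all matching times at once."""
    times = times_by_id.get(id_val)
    if times is None:
        return ""  # ID does not occur in the reference table
    # Seconds from each reference time to the test time
    seconds = (time_val - times).dt.total_seconds()
    in_window = (seconds > 0) & (seconds < 3601)
    return "short waiting time" if in_window.any() else ""

# Made-up sample data standing in for test.csv and list.csv
test = pd.DataFrame({
    "ID": ["a", "a", "b", "c"],
    "Time": pd.to_datetime([
        "2019-12-11 10:41:48",
        "2019-12-11 15:00:00",
        "2019-12-11 09:00:00",
        "2019-12-11 12:00:00",
    ]),
})
list_ref = pd.DataFrame({
    "ID": ["a", "b", "b"],
    "Time": pd.to_datetime([
        "2019-12-11 10:00:00",
        "2019-12-11 09:30:00",
        "2019-12-11 08:59:00",
    ]),
})

# Group the reference times by ID once, so each lookup is a dictionary access
times_by_id = {id_val: times for id_val, times in list_ref.groupby("ID")["Time"]}

test["result"] = test.apply(
    lambda row: find_match_vectorized(row["ID"], row["Time"], times_by_id),
    axis=1,
)
```

This keeps the same window test (0 < difference < 3601 seconds) as find_match; only the lookup and the comparison are organized differently.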
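You should not iterate over the rows of a DataFrame at all. Instead, combine the test and list data frames with the merge or join method, using 'ID' as the 'on' join key. You then have the data in one big table, where you create a new column holding the difference between Time_x (from the first table) and Time_y (from the second table). Call the new column 'timediff', for example. Then filter the resulting data frame on this column with timediff < 3601.
Assume the data frames are named testdf and listdf.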
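joindf = testdf.merge(listdf, on='ID')
# Convert the Timedelta difference to seconds so it can be compared with a number
joindf['timediff'] = (joindf['Time_x'] - joindf['Time_y']).dt.total_seconds()
# Add (joindf['timediff'] > 0) & ... to the mask to also require a positive difference, as in the question
joindf.loc[joindf['timediff'] < 3601, 'result'] = 'short waiting time'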
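
One thing the merge leaves open is how to get the label back onto the original test rows, since the joined table has one row per matching pair rather than one per test row. A minimal sketch of one way to do that follows. It assumes the Time columns are already datetimes (for example read with parse_dates), uses made-up sample rows, and picks the suffixes _test and _list purely for readability.

```python
import pandas as pd

# Made-up sample data standing in for test.csv and list.csv
testdf = pd.DataFrame({
    "ID": ["a", "a", "b", "c"],
    "Time": pd.to_datetime([
        "2019-12-11 10:41:48",
        "2019-12-11 15:00:00",
        "2019-12-11 09:00:00",
        "2019-12-11 12:00:00",
    ]),
})
listdf = pd.DataFrame({
    "ID": ["a", "b", "b"],
    "Time": pd.to_datetime([
        "2019-12-11 10:00:00",
        "2019-12-11 09:30:00",
        "2019-12-11 08:59:00",
    ]),
})

# reset_index() keeps each test row's label in an "index" column,
# so matches can be mapped back after the merge.
joined = testdf.reset_index().merge(listdf, on="ID", suffixes=("_test", "_list"))

# Seconds elapsed from the list timestamp to the test timestamp
seconds = (joined["Time_test"] - joined["Time_list"]).dt.total_seconds()

# Labels of test rows with at least one list time inside the window
hits = joined.loc[(seconds > 0) & (seconds < 3601), "index"].unique()

testdf["result"] = ""
testdf.loc[hits, "result"] = "short waiting time"
```

Note that an inner merge on ID can produce many rows when IDs repeat heavily in both tables, so memory use is worth checking on the full 100K and 40K row files.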