Replacing nested loops over a pandas DataFrame with a lambda function

Problem description

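I have data at the user and page-view level, and I am trying to build a matrix that holds, for each pair of pages, the percentage of the time both pages were viewed by the same user. Not surprisingly, the nested for loop I wrote is extremely inefficient. I know I should be using a lambda function here, but I am having trouble coming up with one that actually computes what I need: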

import numpy as np
import pandas as pd

data = [['tom', '1'], ['nick', '1'], ['julie', '1'], ['tom', '2'], ['julie', '2'], ['tom', '3'], ['julie', '3']]
visits_comb = pd.DataFrame(data, columns=['USER_ID', 'PAGE_CLICKED'])
visits_comb

# One row per distinct page; the row index doubles as the matrix index.
page_id = pd.DataFrame(visits_comb.PAGE_CLICKED.unique(), columns=['PAGE_CLICKED'])
page_id

sim_mat = np.zeros(shape=(len(page_id), len(page_id)))

for index, row in page_id.iterrows():
    base_page = row['PAGE_CLICKED']

    for index2, row2 in page_id.iterrows():
        comparison_page = row2['PAGE_CLICKED']
        if base_page < comparison_page:
            # For each user who viewed either page, count how many of the two pages they viewed (1 or 2).
            sessions = visits_comb[visits_comb['PAGE_CLICKED'].map(lambda x: x in [base_page, comparison_page])].groupby('USER_ID')['PAGE_CLICKED'].apply(lambda x: x.unique().shape[0])
            # Share of those users who viewed both pages; .get avoids a KeyError when nobody did.
            sim_mat[index, index2] = sessions.value_counts(normalize=True).get(2, 0.0)

print(sim_mat)
python for-loop lambda nested-loops
1 answer

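I have found a way to cut the run time by about 60%. It is still a very inefficient solution, though, because it keeps the double for loop. I would love to hear anyone's ideas: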

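import numpy as np
import pandas as pd

data = [['tom', '1'], ['nick', '1'], ['julie', '1'], ['tom', '2'], ['julie', '2'], ['tom', '3'], ['julie', '3']]
visits_comb = pd.DataFrame(data, columns=['USER_ID', 'PAGE_CLICKED'])
visits_comb

# One row per distinct page; the row index doubles as the matrix index.
page_id = pd.DataFrame(visits_comb.PAGE_CLICKED.unique(), columns=['PAGE_CLICKED'])
page_id

sim_mat = np.zeros(shape=(len(page_id), len(page_id)))

for index, row in page_id.iterrows():
    base_page = row['PAGE_CLICKED']

    for index2, row2 in page_id.iterrows():
        comparison_page = row2['PAGE_CLICKED']
        if base_page < comparison_page:
            # Rows per user among the two pages; this equals the number of distinct
            # pages only if each (user, page) pair appears at most once in the data.
            sessions = visits_comb['USER_ID'][visits_comb['PAGE_CLICKED'].map(lambda x: x in [base_page, comparison_page])].value_counts()
            # Share of those users who viewed both pages; .get avoids a KeyError when nobody did.
            sim_mat[index, index2] = sessions.value_counts(normalize=True).get(2, 0.0)

print(sim_mat)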
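
One possible way to drop both loops, offered as a sketch rather than a benchmarked replacement: each cell is the number of users who viewed both pages divided by the number who viewed either one, so the whole matrix can be derived from a binary user-by-page table with a single matrix product. The names `viewed`, `both`, `either` and `sim_mat_vec` are my own and do not appear in the code above. The sketch also assumes the matrix should follow sorted page order, which is what `pd.crosstab` produces and what matches the sample data.

```python
import numpy as np
import pandas as pd

data = [['tom', '1'], ['nick', '1'], ['julie', '1'], ['tom', '2'],
        ['julie', '2'], ['tom', '3'], ['julie', '3']]
visits_comb = pd.DataFrame(data, columns=['USER_ID', 'PAGE_CLICKED'])

# Binary user x page table: 1 if the user viewed the page at least once.
# Thresholding at > 0 makes repeated (user, page) rows harmless.
viewed = (pd.crosstab(visits_comb['USER_ID'], visits_comb['PAGE_CLICKED']) > 0).astype(int)
B = viewed.to_numpy()

# both[i, j]: number of users who viewed page i AND page j.
both = B.T @ B

# either[i, j]: number of users who viewed page i OR page j (inclusion-exclusion).
per_page = np.diag(both)
either = per_page[:, None] + per_page[None, :] - both

# Keep only the strict upper triangle, mirroring the base_page < comparison_page test.
sim_mat_vec = np.triu(both / either, k=1)

# Optional: label rows and columns with the page ids for readability.
sim_df = pd.DataFrame(sim_mat_vec, index=viewed.columns, columns=viewed.columns)
print(sim_df)
```

On the sample data this gives 2/3 for the page pairs (1, 2) and (1, 3) and 1.0 for (2, 3), the same values the loops produce. For a large number of pages the dense `B.T @ B` product may need a sparse matrix instead, which I have not tried here.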