How can I optimize this function to reduce its run time?


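My function is meant to create a DataFrame with three columns: the bigram phrase, the count (of the bigram), and the PMI score (of the bigram). Because I want to run it on a large dataset with more than a million phrases, it takes a very long time to compute. I realize that the nested for loops and the matching condition are what make it so expensive. Is there another way to do the same thing with a shorter run time? Here is my code: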


import pandas as pd

def pmi_count_phrase_create(pmi_tups, freq_list):
    """Build a DataFrame with the columns Phrase, PMI and Count.

    pmi_tups is the result of running
        pmi_tups = [i for i in finder.score_ngrams(bigram_measures.pmi)]
    i.e. a list of tuples of the form [((phrase), pmi), ...]
    freq_list is the result of running
        freq_list = finder.ngram_fd.items()
    """
    pmi3_list = []
    count3_list = []
    phrase3_list = []
    for phrase, pmi in pmi_tups:
        # every phrase is compared against every (ngram, count) pair
        for ngram, count in freq_list:
            if ngram == phrase:
                pmi3_list.append(pmi)
                count3_list.append(count)
                phrase3_list.append(phrase)

    # create dataframe
    df = pd.DataFrame({'Phrase': phrase3_list, 'PMI': pmi3_list, 'Count': count3_list})
    return df

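I ran this code on my pmi_tups and freq_list, and it is still running after more than 1000 minutes. I am also open to using a different library to compute the bigram phrases, PMI and frequencies.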

python performance optimization nlp nltk
1 Answer

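I ended up changing my function: I convert freq_list to a dictionary and use list comprehension instead of the nested for loops. This code returns a DataFrame immediately: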

import pandas as pd

def quicker_func(pmi_tups, freq_list):
    freq_dict = dict(freq_list)  # Create a dictionary for faster lookups

    # Keep only phrases that have a count, in a single pass over pmi_tups
    rows = [(phrase, pmi, freq_dict[phrase])
            for phrase, pmi in pmi_tups if phrase in freq_dict]

    df = pd.DataFrame(rows, columns=['Phrase', 'PMI', 'Count'])
    return df
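
To see why the dictionary helps, here is a minimal sketch that checks the lookup-based join against the original nested loops on toy data. The helper names `join_pmi_with_counts` and `join_nested` and the sample values are made up for illustration, and the sketch assumes each phrase appears only once in freq_list (which should hold when it comes from a frequency distribution). The nested version does roughly len(pmi_tups) × len(freq_list) comparisons, while the dictionary version does one average constant-time lookup per phrase.

```python
def join_pmi_with_counts(pmi_tups, freq_list):
    """Join PMI scores with counts via a dict lookup (hypothetical helper)."""
    freq_dict = dict(freq_list)  # phrase -> count
    rows = []
    for phrase, pmi in pmi_tups:
        count = freq_dict.get(phrase)
        if count is not None:  # skip phrases that have no count
            rows.append({"Phrase": phrase, "PMI": pmi, "Count": count})
    return rows


def join_nested(pmi_tups, freq_list):
    """Same join with the original nested loops, kept only for comparison."""
    rows = []
    for phrase, pmi in pmi_tups:
        for ngram, count in freq_list:
            if ngram == phrase:
                rows.append({"Phrase": phrase, "PMI": pmi, "Count": count})
    return rows


# Toy data, invented for this example
pmi_tups = [(("new", "york"), 7.5), (("of", "the"), 0.3), (("not", "counted"), 1.0)]
freq_list = [(("of", "the"), 120), (("new", "york"), 15)]

rows = join_pmi_with_counts(pmi_tups, freq_list)
# Both versions produce the same rows, in the order of pmi_tups
assert rows == join_nested(pmi_tups, freq_list)
```

Passing `rows` to `pd.DataFrame(rows)` should give the same three columns as the function above.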