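My function is meant to create a three-column data frame: the bigram phrase, its count, and its PMI score. I want to run it on a large dataset with more than a million phrases, and the computation time is very long. I recognize that the nested for loops and the matching condition are what make it so expensive. Is there another way to do the same thing with a shorter run time? Here is my code: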
def pmi_count_phrase_create(pmi_tups, freq_list):
    """pmi_tups is the result of running pmi_tups = [i for i in finder.score_ngrams(bigram_measures.pmi)]
    freq_list is the result of running freq_list = finder.ngram_fd.items()
    -> df made up of columns for pmi list, count list, phrase list"""
    import pandas as pd
    pmi3_list = []
    count3_list = []
    phrase3_list = []
    for phrase, pmi in pmi_tups:  # pmi_tups is a list of tuples of the form [((phrase), pmi), ...]
        for item in freq_list:  # scans the whole frequency list for every phrase
            quadgram, count = item
            if quadgram == phrase:
                pmi3_list.append(pmi)
                count3_list.append(count)
                phrase3_list.append(phrase)
    # create dataframe
    df = pd.DataFrame({'Phrase': phrase3_list, 'PMI': pmi3_list, 'Count': count3_list})
    return df
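I ran this code on my pmi_tups and freq_list; it is still running and has already taken more than 1000 minutes. I am also open to using a different library to compute the bigrams, PMI, and frequencies.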
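For a rough sense of why the nested loops scale so badly: every phrase in pmi_tups is compared against every entry in freq_list, so the number of comparisons is the product of the two lengths. The sketch below only counts those comparisons on made-up toy data; the helper name `count_comparisons_nested` and the toy lists are hypothetical and not part of the code above.

```python
def count_comparisons_nested(pmi_tups, freq_list):
    """Count the equality checks a nested-loop match would perform."""
    comparisons = 0
    for phrase, _pmi in pmi_tups:
        for ngram, _count in freq_list:
            comparisons += 1  # one `ngram == phrase` test per inner iteration
    return comparisons

# Toy data (made up): 1,000 bigrams with arbitrary scores and counts.
toy_pmi = [((f"w{i}", f"w{i + 1}"), 1.0) for i in range(1000)]
toy_freq = [((f"w{i}", f"w{i + 1}"), 1) for i in range(1000)]

nested = count_comparisons_nested(toy_pmi, toy_freq)
dict_lookups = len(toy_pmi)  # a dict needs only one hash lookup per phrase
```

With 1,000 entries in each list the nested version already performs a million comparisons, against 1,000 lookups for a dictionary. With a million entries in each list, the nested version would need on the order of 10^12 comparisons.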
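In the end I changed my function to convert freq_list into a dictionary and to use list comprehensions instead of the for loops. This code returns a data frame immediately: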
def quicker_func(pmi_tups, freq_list):
    import pandas as pd
    freq_dict = dict(freq_list)  # Create a dictionary for faster lookups
    pmi_list = [pmi for phrase, pmi in pmi_tups if phrase in freq_dict]
    count_list = [freq_dict[phrase] for phrase, pmi in pmi_tups if phrase in freq_dict]
    phrase_list = [phrase for phrase, pmi in pmi_tups if phrase in freq_dict]
    df = pd.DataFrame({'Phrase': phrase_list, 'PMI': pmi_list, 'Count': count_list})
    return df
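The three comprehensions above each walk pmi_tups once. A possible variant, shown here only as a sketch, collects all three values in a single pass. The function name `pmi_count_rows` and the toy inputs are hypothetical, and the block uses only the standard library so that it runs without pandas; it assumes each phrase appears at most once in freq_list, as it would if freq_list comes from a frequency distribution.

```python
def pmi_count_rows(pmi_tups, freq_list):
    """Return (phrase, pmi, count) rows for phrases present in both inputs."""
    freq_dict = dict(freq_list)  # phrase -> count, constant-time average lookup
    rows = []
    for phrase, pmi in pmi_tups:
        count = freq_dict.get(phrase)  # None when the phrase has no count
        if count is not None:
            rows.append((phrase, pmi, count))
    return rows

# Toy inputs (made up), in the same shape as the lists described in the docstring.
pmi_tups = [(("new", "york"), 5.2), (("of", "the"), 0.3), (("not", "seen"), 1.0)]
freq_list = [(("new", "york"), 12), (("of", "the"), 250)]

rows = pmi_count_rows(pmi_tups, freq_list)
# With pandas installed, the rows can be wrapped as:
# pd.DataFrame(rows, columns=["Phrase", "PMI", "Count"])
```

The result should hold the same rows as quicker_func, in the order of pmi_tups. Whether the single pass is noticeably faster in practice is untested here; the dictionary lookup is where most of the speedup comes from.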