Faster way to get unique word frequencies in NLTK


My DataFrame has 2.3 million rows. I am trying to find the 100 most frequently used words in it. I don't need punctuation, verbs, numbers, or stopwords like ('a', 'the', 'an'). I am using the following code in Python, but it takes a very long time to produce a result. Is there a faster way?

    import re
    import nltk
    from sklearn.feature_extraction.text import CountVectorizer

    # Download NLTK data if you haven't already
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['Comments_Final'])
    unique_words = sorted(vectorizer.get_feature_names())

    def count_words_without_punctuation_and_verbs(text):
        words = re.findall(r'\b\w+\b', text.lower())
        # Use NLTK to tag words and exclude verbs (VB* tags) and digits (CD tags)
        tagged_words = nltk.pos_tag(words)
        filtered_words = [word for word, pos in tagged_words
                          if not pos.startswith('VB') and pos != 'CD']
        return len(filtered_words)

    # Create a dictionary to store word frequencies
    word_frequencies = {}
    for word in unique_words:
        count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
        word_frequencies[word] = count

    # Sort the words by frequency in descending order
    sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)

    # Print the top 100 words
    for word, frequency in sorted_words[:100]:
        print(f"{word}: {frequency}")
    
python nlp nltk
1 Answer

Yes, there is a faster way. If you clean up your code a little, you will find it runs much faster.

  1. `def count_words_without_punctuation_and_verbs(text)`

Notice how this function is later called in a `for` loop:

    for word in unique_words:
        count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
        word_frequencies[word] = count

Calling `count_words_without_punctuation_and_verbs()` on every iteration means you redundantly tokenize and POS-tag the *entire* DataFrame once per unique word, which is obviously extremely inefficient.
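For illustration, here is a minimal sketch of the single-pass alternative (assuming the same `df['Comments_Final']` column): the document-term matrix that `CountVectorizer` builds can be summed column-wise once, giving the corpus-wide count of every word with no per-word loop over the DataFrame. Note that `get_feature_names_out()` is the scikit-learn 1.x name; older versions used `get_feature_names()`.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # Fit once: X is a sparse documents-by-words count matrix
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['Comments_Final'])

    # One column-wise sum yields every word's corpus-wide frequency;
    # nothing is re-tokenized or re-tagged per word
    counts = np.asarray(X.sum(axis=0)).ravel()
    word_frequencies = dict(zip(vectorizer.get_feature_names_out(), counts))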

  2. `return len(filtered_words)`

This is also redundant. `CountVectorizer` can already produce these counts for you, since you are using it to obtain the words in the first place.
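Putting the two points together, here is a hedged sketch of a faster end-to-end pipeline: POS-tag each vocabulary word exactly once (a few thousand tags, instead of tagging every token in 2.3 million rows), filter out verbs, numbers, and stopwords, then take the top 100. One caveat: tagging words in isolation is less accurate than tagging them in sentence context, which is the trade-off for the speedup.

    import nltk
    from nltk.corpus import stopwords

    nltk.download('stopwords')
    nltk.download('averaged_perceptron_tagger')

    # Tag each unique vocabulary word once, not every token in the corpus.
    # Caveat: out-of-context tagging is less accurate than in-context tagging.
    tagged = nltk.pos_tag(list(word_frequencies))

    stop = set(stopwords.words('english'))  # covers 'a', 'the', 'an', etc.
    filtered = {
        word: word_frequencies[word]
        for word, pos in tagged
        if not pos.startswith('VB')  # drop verbs
        and pos != 'CD'              # drop numbers
        and word not in stop
    }

    # Print the 100 most frequent remaining words
    for word, freq in sorted(filtered.items(), key=lambda kv: kv[1], reverse=True)[:100]:
        print(f"{word}: {freq}")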
