My dataframe has 2.3 million rows. I am trying to find the 100 most frequently used words in it. I don't want punctuation, verbs, numbers, or ('a', 'the', 'an'). I am using the following code in Python, but it takes a very long time to produce a result. Is there a faster way?
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Download NLTK data if you haven't already
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Comments_Final'])
unique_words = sorted(vectorizer.get_feature_names())

def count_words_without_punctuation_and_verbs(text):
    words = re.findall(r'\b\w+\b', text.lower())
    # Use NLTK to tag words and exclude verbs (VB* tags) and digits (CD tags)
    tagged_words = nltk.pos_tag(words)
    filtered_words = [word for word, pos in tagged_words
                      if not pos.startswith('VB') and not pos == 'CD']
    return len(filtered_words)

# Create a dictionary to store word frequencies
word_frequencies = {}
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count

# Sort the words by frequency in descending order
sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)

# Print the top 100 words
for word, frequency in sorted_words[:100]:
    print(f"{word}: {frequency}")
Yes, there is a faster way. If you clean up the code a little, you will find it runs much faster. Consider your function:
def count_words_without_punctuation_and_verbs(text):
Note how this function is later called inside a for loop:
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count
Calling count_words_without_punctuation_and_verbs() on every iteration means you redundantly re-tokenize and re-tag the entire DataFrame once per unique word, which is extremely inefficient. Note also that the result never depends on word at all: the function returns the total number of filtered tokens in a text, so every entry of word_frequencies ends up holding the same value.
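A minimal sketch of that fix, assuming df['Comments_Final'] is the column from the question and the NLTK models are already downloaded: tokenize and tag each comment exactly once, push the surviving tokens through a single collections.Counter, and read the top 100 off it. It also applies the ('a', 'the', 'an') filter asked for in the question, which the original code never did.

import re
from collections import Counter

import nltk

STOPWORDS = {'a', 'the', 'an'}

def filtered_tokens(text):
    # Tokenize once, then drop verbs (VB*), numbers (CD) and the stopwords
    words = re.findall(r'\b\w+\b', text.lower())
    return [word for word, pos in nltk.pos_tag(words)
            if not pos.startswith('VB') and pos != 'CD' and word not in STOPWORDS]

counter = Counter()
for text in df['Comments_Final']:   # a single pass over the DataFrame
    counter.update(filtered_tokens(text))

for word, frequency in counter.most_common(100):
    print(f"{word}: {frequency}")

This tags each row exactly once instead of once per unique word, which removes the factor of len(unique_words) from the running time.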
return len(filtered_words)
This, too, is redundant: CountVectorizer can produce this number for you, since you are already using it to obtain the words.
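One way to act on that, sketched under the assumption that tagging each vocabulary entry on its own is close enough for your filter (nltk.pos_tag is designed for full sentences, so single-word tags are approximate), is to sum the CountVectorizer matrix column-wise and tag only the unique vocabulary once. It uses get_feature_names_out(), the current name for get_feature_names() in recent scikit-learn:

import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Let CountVectorizer drop the unwanted stopwords up front
vectorizer = CountVectorizer(stop_words=['a', 'the', 'an'])
X = vectorizer.fit_transform(df['Comments_Final'])

vocab = vectorizer.get_feature_names_out()   # one entry per unique word
counts = np.asarray(X.sum(axis=0)).ravel()   # total occurrences of each word

# Tag each unique word once, instead of every token of every row
tagged = nltk.pos_tag(list(vocab))
keep = [i for i, (word, pos) in enumerate(tagged)
        if not pos.startswith('VB') and pos != 'CD']

top = sorted(((counts[i], vocab[i]) for i in keep), reverse=True)[:100]
for count, word in top:
    print(f"{word}: {count}")

Here the expensive tagging step runs over the vocabulary (tens of thousands of words at most) rather than over every token of 2.3 million rows.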