My dataframe has 2.3 million rows. I am trying to find the 100 most frequently used words in it. I don't want punctuation, verbs, numbers, or ('a', 'the', 'an'). I am using the following code in Python, but it takes a very long time to produce a result. Is there a faster way?
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Download NLTK data if you haven't already
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Comments_Final'])
unique_words = sorted(vectorizer.get_feature_names())

def count_words_without_punctuation_and_verbs(text):
    words = re.findall(r'\b\w+\b', text.lower())
    # Use NLTK to tag words and exclude verbs (VB* tags) and digits (CD tags)
    tagged_words = nltk.pos_tag(words)
    filtered_words = [word for word, pos in tagged_words
                      if not pos.startswith('VB') and not pos == 'CD']
    return len(filtered_words)

# Create a dictionary to store word frequencies
word_frequencies = {}
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count

# Sort the words by frequency in descending order
sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)

# Print the top 100 words
for word, frequency in sorted_words[:100]:
    print(f"{word}: {frequency}")
Yes, there is a faster way. If you clean up the code a little, you will find it runs much faster. Consider your function:
def count_words_without_punctuation_and_verbs(text):
Note how this function is later called inside a for loop:
for word in unique_words:
    count = df['Comments_Final'].apply(count_words_without_punctuation_and_verbs).sum()
    word_frequencies[word] = count
Calling count_words_without_punctuation_and_verbs() on every iteration means you redundantly re-tokenize and re-tag the entire DataFrame once per unique word, which is extremely inefficient. Note also that the result never depends on word at all: the function returns the total number of filtered tokens in a text, so every entry of word_frequencies ends up holding the same value.
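A minimal sketch of that fix, assuming df['Comments_Final'] is the column from the question and the NLTK models are already downloaded: tokenize and tag each comment exactly once, push the surviving tokens through a single collections.Counter, and read the top 100 off it. It also applies the ('a', 'the', 'an') filter asked for in the question, which the original code never did.

import re
from collections import Counter

import nltk

STOPWORDS = {'a', 'the', 'an'}

def filtered_tokens(text):
    # Tokenize once, then drop verbs (VB*), numbers (CD) and the stopwords
    words = re.findall(r'\b\w+\b', text.lower())
    return [word for word, pos in nltk.pos_tag(words)
            if not pos.startswith('VB') and pos != 'CD' and word not in STOPWORDS]

counter = Counter()
for text in df['Comments_Final']:   # a single pass over the DataFrame
    counter.update(filtered_tokens(text))

for word, frequency in counter.most_common(100):
    print(f"{word}: {frequency}")

This tags each row exactly once instead of once per unique word, which removes the factor of len(unique_words) from the running time.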
return len(filtered_words)
This, too, is redundant: CountVectorizer can produce this number for you, since you are already using it to obtain the words.
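One way to act on that, sketched under the assumption that tagging each vocabulary entry on its own is close enough for your filter (nltk.pos_tag is designed for full sentences, so single-word tags are approximate), is to sum the CountVectorizer matrix column-wise and tag only the unique vocabulary once. It uses get_feature_names_out(), the current name for get_feature_names() in recent scikit-learn:

import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Let CountVectorizer drop the unwanted stopwords up front
vectorizer = CountVectorizer(stop_words=['a', 'the', 'an'])
X = vectorizer.fit_transform(df['Comments_Final'])

vocab = vectorizer.get_feature_names_out()   # one entry per unique word
counts = np.asarray(X.sum(axis=0)).ravel()   # total occurrences of each word

# Tag each unique word once, instead of every token of every row
tagged = nltk.pos_tag(list(vocab))
keep = [i for i, (word, pos) in enumerate(tagged)
        if not pos.startswith('VB') and pos != 'CD']

top = sorted(((counts[i], vocab[i]) for i in keep), reverse=True)[:100]
for count, word in top:
    print(f"{word}: {count}")

Here the expensive tagging step runs over the vocabulary (tens of thousands of words at most) rather than over every token of 2.3 million rows.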