I'm learning Python and trying to use CountVectorizer to remove some words.
What I want is to replace count_vectorizer = CountVectorizer(stop_words='english')
and read the stop words from a file instead.
Here is my code:
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    import matplotlib.pyplot as plt
    words = count_vectorizer.get_feature_names()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts+=t.toarray()[0]
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words))
    plt.figure(2, figsize=(15, 15/1.6180))
    plt.subplot(title='10 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    sns.barplot(x_pos, counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90)
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(papers['Abstract'])
# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)
Thanks.
First, read the stop words from the file and turn them into a list with the .split() method:
with open("name_of_your_stop_words_file") as stop_words:
    your_stop_words_list = stop_words.read().split()
Then pass this list in place of the string 'english':
count_vectorizer = CountVectorizer(stop_words=your_stop_words_list)
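This assumes that the stop words in your file are separated only by whitespace characters (e.g. spaces, tabs, or newlines), including when a line holds more than one stop word.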
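To make the whitespace assumption concrete, here is a minimal sketch of the file-reading step wrapped in a helper. The name load_stop_words, the demo file, and its contents are made up for illustration. The lowercasing step is an addition of mine: it assumes you keep CountVectorizer's default lowercase=True, in which case tokens are lowercased before they are compared against the stop word list.

```python
import os
import tempfile


def load_stop_words(path, encoding="utf-8"):
    """Read whitespace-separated stop words from a file (hypothetical helper)."""
    with open(path, encoding=encoding) as stop_words_file:
        # split() with no arguments splits on any run of spaces, tabs, or newlines
        words = stop_words_file.read().split()
    # Assumption: lowercase and de-duplicate so the list matches lowercased tokens
    return sorted({word.lower() for word in words})


# Demo with a throwaway file that mixes spaces, tabs, and newlines
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stop_words.txt")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("the a\tan\nThe  of\n")
    your_stop_words_list = load_stop_words(demo_path)

print(your_stop_words_list)  # ['a', 'an', 'of', 'the']
```

The resulting list can be passed as CountVectorizer(stop_words=your_stop_words_list). If your file uses another separator, such as commas, a plain split() will not break the words apart, and you would need to split on that separator instead.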