Adding stop words


I am learning Python and need to add some words to the stop-word list used by CountVectorizer. Here is the full code:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# Helper function
def plot_10_most_common_words(count_data, count_vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    words = count_vectorizer.get_feature_names_out()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts+=t.toarray()[0]

    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 

    plt.figure(2, figsize=(15, 15/1.6180))
    plt.subplot(title='10 most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    # Recent seaborn versions require x and y as keyword arguments
    sns.barplot(x=x_pos, y=counts, palette='husl')
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()

# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the processed titles
count_data = count_vectorizer.fit_transform(papers['Abstract'])

# Visualise the 10 most common words
plot_10_most_common_words(count_data, count_vectorizer)

In the line count_vectorizer = CountVectorizer(stop_words='english'), I would like to add a few more words, for example "new", "file", and "author".

Thanks


python stop-words countvectorizer
1 Answer

According to the documentation, stop_words can also be a list of words, not only the string 'english'.
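One way to do this, as a minimal sketch: scikit-learn exposes the built-in English list as ENGLISH_STOP_WORDS in sklearn.feature_extraction.text, so it can be merged with the extra words from the question and passed back in as a list. The two sample documents below are made up for illustration and stand in for papers['Abstract'], which is not shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Extra words taken from the question; adjust them to your own corpus.
extra_stop_words = ["new", "file", "author"]

# ENGLISH_STOP_WORDS is the frozenset used when stop_words='english'.
# Convert the merged set to a list, since stop_words expects a list.
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

count_vectorizer = CountVectorizer(stop_words=custom_stop_words)

# Hypothetical sample documents standing in for papers['Abstract'].
docs = [
    "The author uploaded a new file with results",
    "A new model improves results on the benchmark",
]
count_data = count_vectorizer.fit_transform(docs)

# The remaining vocabulary no longer contains the extra stop words.
print(sorted(count_vectorizer.vocabulary_))
```

In the original code, replacing CountVectorizer(stop_words='english') with CountVectorizer(stop_words=custom_stop_words) should be the only change needed; the rest of the plotting code can stay as it is.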
