I built a naive Bayes classifier that uses tweets from different politicians to predict their political party. I used sklearn's MultinomialNB. Here is my implementation:
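from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# senator_tweets is assumed to be a pandas DataFrame with 'text' and 'party' columns
Senators_Vectorizer = CountVectorizer(decode_error='replace')
senator_counts = Senators_Vectorizer.fit_transform(senator_tweets['text'].values)
senator_targets = senator_tweets['party'].values
senator_counts_train, senator_counts_test, senator_targets_train, senator_targets_test = train_test_split(senator_counts, senator_targets, test_size=.1)
senator_party_clf = MultinomialNB()
senator_party_clf.fit(senator_counts_train, senator_targets_train)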
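How can I find the words the naive Bayes classifier uses to make its predictions? Is there a way to find which words are most likely to appear in Democratic/Republican tweets?
I want the probability of each word in Senators_Vectorizer, not the probability that a particular tweet comes from a particular party.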
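Use feature_log_prob_ to get the log probability of each feature given each class. Apply np.exp to the values if you need plain probabilities.
feature_log_prob_: ndarray of shape (n_classes, n_features). Empirical log probability of features given a class, P(x_i | y).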
This tutorial helped me.
A quick example of getting the top features for each class:
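import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data)
clf = MultinomialNB(alpha=.01).fit(vectors, newsgroups_train.target)

def show_top10(classifier, vectorizer, categories):
    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # feature_log_prob_[i] holds log P(word | class i); newer releases no longer expose coef_ on MultinomialNB
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, newsgroups_train.target_names)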
alt.atheism: islam does religion atheism say just think don people god
comp.graphics: windows does looking program know file image files thanks graphics
sci.space: earth think shuttle orbit moon just launch like nasa space
talk.religion.misc: objective think just bible don christians christian people jesus god