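I have a KMeans clustering script that groups some documents based on their text content. Each document falls into 1 of 3 clusters, but the assignment seems very all-or-nothing; I would like to see how strongly each document belongs to its cluster.
E.g. Document A is in cluster 1 with a 90% match, while Document B is also in cluster 1 but with only a 45% match.
That way I could set some kind of threshold and say I only want documents at 80% or higher.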
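import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

dict_of_docs = {
    'Document A': 'some text content',
    # ...
    'Document Z': 'some more text content',
}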
# Vectorizing the data, my data is held in a Dict, so I just want the values.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
X = X.toarray()
# 3 Clusters as I know that there are 3, otherwise use Elbow method
# Then fit the model on the vectorized data
NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)
# First: for every document we get its corresponding cluster
clusters = km.predict(X)
# We train the PCA on the dense version of the tf-idf.
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X)
scatter_x = two_dim[:, 0]  # first principal component
scatter_y = two_dim[:, 1]  # second principal component
plt.style.use('ggplot')
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}
# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)
ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Print out top terms for each cluster
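# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(NUMBER_OF_CLUSTERS):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()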
for doc in dict_of_docs:
    text = dict_of_docs[doc]
    Y = vectorizer.transform([text])
    prediction = km.predict(Y)
    print(prediction, doc)
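I don't believe it's possible to do exactly what you want, because k-means is not really a probabilistic model, and its scikit-learn implementation (which I assume is what you are using) simply doesn't provide the right interface.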
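One option I'd suggest is to use the KMeans.score method, which does not give a probabilistic output but gives a score that is closer to zero the closer a point is to its nearest cluster centre. Note that score returns a single value for the whole input (the negative sum of squared distances to the nearest centres), so call it on one document at a time. You can then threshold on it, for example by saying "Document A is in cluster 1 with a score of -0.01, so I keep it" or "Document B is in cluster 2 with a score of -1000, so I ignore it".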
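As a rough sketch of that idea, the snippet below scores each document separately and keeps only those above a cutoff. The toy corpus, the `SCORE_THRESHOLD` value, and the `doc_scores` / `kept` names are assumptions made up for illustration; a sensible cutoff depends entirely on your data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for dict_of_docs from the question.
dict_of_docs = {
    'Document A': 'cats and dogs are popular pets',
    'Document B': 'dogs chase cats around the garden',
    'Document C': 'stock markets fell as interest rates rose',
    'Document D': 'investors watch interest rates and stock prices',
    'Document E': 'the team won the football match',
    'Document F': 'football fans cheered as the team scored',
}

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())

km = KMeans(n_clusters=3, init='k-means++', n_init=10,
            max_iter=500, random_state=0)
km.fit(X)

# Assumed, arbitrary cutoff: scores are <= 0, and closer to 0 means
# the document sits closer to its cluster centre.
SCORE_THRESHOLD = -0.5

doc_scores = {}
for doc, text in dict_of_docs.items():
    Y = vectorizer.transform([text])
    cluster = int(km.predict(Y)[0])
    # With a single row, score() is the negative squared distance
    # from that document to its nearest centroid.
    score = float(km.score(Y))
    doc_scores[doc] = (cluster, score)

kept = {doc: cs for doc, cs in doc_scores.items()
        if cs[1] >= SCORE_THRESHOLD}

for doc, (cluster, score) in doc_scores.items():
    status = 'keep' if doc in kept else 'ignore'
    print(f'{doc}: cluster {cluster}, score {score:.3f} -> {status}')
```

The scores are not percentages, so an "80%" style threshold has no direct equivalent here; you would pick the cutoff by looking at the distribution of scores in your own corpus.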
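Another option is to use the GaussianMixture model instead. A Gaussian mixture is a model very similar to k-means, and it provides the probabilities you want through GaussianMixture.predict_proba.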
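A minimal sketch of the Gaussian mixture route might look like the following. Several choices here are my own assumptions rather than part of the question: the toy corpus, reducing the tf-idf matrix to 2 dimensions with PCA before fitting (a mixture model on raw high-dimensional tf-idf with few documents tends to be poorly conditioned), the `covariance_type='spherical'` setting, and the 0.8 cutoff borrowed from the "80%" in the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

# Hypothetical toy corpus standing in for dict_of_docs from the question.
dict_of_docs = {
    'Document A': 'cats and dogs are popular pets',
    'Document B': 'dogs chase cats around the garden',
    'Document C': 'stock markets fell as interest rates rose',
    'Document D': 'investors watch interest rates and stock prices',
    'Document E': 'the team won the football match',
    'Document F': 'football fans cheered as the team scored',
}

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values()).toarray()

# Assumption: reduce dimensionality first so the mixture is easier to fit.
reduced = PCA(n_components=2).fit_transform(X)

gmm = GaussianMixture(n_components=3, covariance_type='spherical',
                      random_state=0)
gmm.fit(reduced)

# One row per document, one column per mixture component; rows sum to 1.
proba = gmm.predict_proba(reduced)
labels = proba.argmax(axis=1)
confidence = proba.max(axis=1)

PROBA_THRESHOLD = 0.8  # assumed cutoff, taken from the question's "80%"

confident = [doc for doc, p in zip(dict_of_docs, confidence)
             if p >= PROBA_THRESHOLD]

for doc, label, p in zip(dict_of_docs, labels, confidence):
    print(f'{doc}: component {label}, probability {p:.2f}')
```

Bear in mind that these probabilities come from the fitted mixture, so they can be very close to 0 or 1 when the clusters are well separated; they measure membership under the model, not a "percentage match" of the text.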