I am working on text clustering and need to plot the data using different colours. I used the kmeans method for clustering and tf-idf for similarity.
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])
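Currently my output is a plain scatter plot. It contains only a few points because it is a test. I need to add labels (they are strings) and to distinguish the points by cluster: each cluster should have its own colour so that the chart is easy for the reader to analyse.
Could you please tell me how to change my code so that it includes both labels and colours? Any example would be helpful.
A sample of my dataset is shown below (the output above was generated from a different sample).
Sentences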
Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.
We can use an example dataset:
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc', 'sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target
# Vectorize with tf-idf; toarray() returns a plain ndarray, which PCA accepts
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
X = pipeline.fit_transform(X_train).toarray()
# Project onto two principal components for plotting
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
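Then run KMeans as you did and get the clusters and centers, so all that is left is to add a name for each cluster:
kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
# sorted() keeps the names in the same order as the rows of cluster_centers_
cluster_name = ["Cluster" + str(i) for i in sorted(set(labels))]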
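You can add colours by passing the cluster labels to "c=" and choosing a colormap from cm, or by defining your own map:
plt.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap='Set3', alpha=0.7)
# Write each cluster name at its projected center
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i, 0], centers2D[i, 1], s=txt, ha="center", va="center")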
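As a minimal sketch of the "define your own map" option, a ListedColormap can pin one colour to each cluster id. The random data2D and labels below are stand-ins for the real arrays, and the colour choice is only an assumption for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Synthetic stand-ins for data2D and labels (assumptions, not the real data).
rng = np.random.default_rng(0)
data2D = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)

# One colour per cluster id, in order: 0 -> red, 1 -> blue, 2 -> green.
custom_cmap = ListedColormap(["red", "blue", "green"])

fig, ax = plt.subplots()
# vmin/vmax pin the id-to-colour mapping even if one cluster happens to be empty.
points = ax.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap=custom_cmap,
                    vmin=0, vmax=2, alpha=0.7)
```

Call plt.show() afterwards to display the figure.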
You can also consider using seaborn (recent versions require x and y to be passed as keyword arguments):
sns.scatterplot(x=data2D[:, 0], y=data2D[:, 1], hue=labels, legend='full', palette="Set1")
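If the string labels from the question are meant to appear next to individual points rather than at the cluster centers, something like the following sketch may help. The helper annotate_points, the point_names strings, and the random coordinates are all hypothetical stand-ins, not part of the original code:

```python
import numpy as np
import matplotlib.pyplot as plt


def annotate_points(ax, coords, texts, offset=(4, 4)):
    """Write one string next to each 2D point and return the annotations."""
    return [
        ax.annotate(txt, (x, y), xytext=offset,
                    textcoords="offset points", fontsize=8)
        for txt, (x, y) in zip(texts, coords)
    ]


# Synthetic stand-ins for data2D, the cluster labels, and the string labels.
rng = np.random.default_rng(0)
data2D = rng.normal(size=(6, 2))
labels = np.array([0, 0, 1, 1, 2, 2])
point_names = [f"sentence {i}" for i in range(len(data2D))]

fig, ax = plt.subplots()
ax.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap="Set1")
annotations = annotate_points(ax, data2D, point_names)
```

With many points the text will overlap, so this is probably only practical for small test samples like the one in the question.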
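Building on your code, try the following:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# X_train is the DataFrame from the question, with a 'Sentences' column
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_
# Colour and legend label for each cluster id
cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}
fig, ax = plt.subplots()
# Draw one scatter call per cluster so each gets its own legend entry
for g in np.unique(group):
    ix = np.where(group == g)
    ax.scatter(data2D[:, 0][ix], data2D[:, 1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()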
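I am assuming that your kmeans has n_clusters=3. The cdict and ldict need to be set up to match the number of clusters. In this case, cluster 0 will be red with label label_1, cluster 1 will be blue with label label_2, and so on.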
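For a different number of clusters, the two dictionaries could also be generated rather than typed by hand. This is only a sketch: build_cluster_dicts is a hypothetical helper, and the "tab10" colormap and the label_N naming are assumptions chosen to mirror the example above:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex


def build_cluster_dicts(group, cmap_name="tab10"):
    """Return (cdict, ldict) with one colour and one label per cluster id."""
    cluster_ids = np.unique(group)
    cmap = plt.get_cmap(cmap_name)
    # Hex strings are unambiguous when passed to scatter's c= argument;
    # colours repeat if there are more clusters than entries in the colormap.
    cdict = {int(g): to_hex(cmap(i % cmap.N)) for i, g in enumerate(cluster_ids)}
    ldict = {int(g): f"label_{int(g) + 1}" for g in cluster_ids}
    return cdict, ldict


# Stand-in for kmeans.labels_ with four clusters.
group = np.array([0, 1, 1, 2, 3, 0])
cdict, ldict = build_cluster_dicts(group)
```

The resulting cdict and ldict can be used in the plotting loop unchanged, since they are keyed by the same cluster ids.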
EDIT: I changed the keys of cdict to start from 0.
EDIT 2: Added labels.