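Clustering with different colors and labels

Question description  Votes: 0  Answers: 1

I am working on text clustering and need to plot the data using different colors. I used k-means clustering and TF-IDF for similarity.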

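import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# X_train is a DataFrame with a 'Sentences' column
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

plt.scatter(data2D[:, 0], data2D[:, 1])

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_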

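Currently my output looks like this: [image]. It contains only a few points because it is a test. I need to add labels (they are strings) and distinguish the points by cluster: each cluster should have its own color so that the chart is easy for readers to analyze.

Could you tell me how to modify my code so that it includes both labels and colors? Any example would be helpful.

A sample of my dataset (the output above was generated from a different sample):

Sentences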

Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.
python matplotlib cluster-analysis k-means tf-idf
1 Answer
2
votes

We can use an example dataset:

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc', 'sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target

# TF-IDF features, converted to a dense array so PCA can consume them
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
X = pipeline.fit_transform(X_train).toarray()

# Project to 2 dimensions for plotting
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

Then run KMeans as you did to get the clusters and centers, and simply add a name for each cluster:

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
cluster_name = ["Cluster" + str(i) for i in sorted(set(labels))]

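You can add color by passing the cluster labels to "c=" and picking a colormap from matplotlib's cm module, or by defining your own map: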

plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")

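For the "define your own map" option, a minimal sketch could look like the following. The synthetic data2D and labels arrays, the three colors, and the output file name are assumptions made up for illustration; in practice you would use the arrays computed above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Synthetic stand-ins for data2D and labels (assumed values).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
data2D = rng.normal(size=(60, 2)) + labels[:, None] * 3.0

# Your own map: one color per cluster, in cluster-index order.
own_cmap = ListedColormap(["tab:red", "tab:blue", "tab:green"])

fig, ax = plt.subplots()
points = ax.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap=own_cmap, alpha=0.7)

# legend_elements() builds one legend handle per distinct value passed to c=.
handles, _ = points.legend_elements()
ax.legend(handles, ["Cluster0", "Cluster1", "Cluster2"], title="Clusters")
fig.savefig("clusters_custom_cmap.png")
```

Because the colormap has exactly as many entries as there are clusters, each integer label maps to one fixed color, and the legend gives the reader the color-to-cluster mapping without text drawn on top of the points.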

You could also consider using seaborn:

sns.scatterplot(x=data2D[:, 0], y=data2D[:, 1], hue=labels, legend='full', palette="Set1")



1
vote

Based on your code, try the following:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_

# One color and one legend label per cluster id
cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}

fig, ax = plt.subplots()
for g in np.unique(group):
    ix = np.where(group == g)
    ax.scatter(data2D[:, 0][ix], data2D[:, 1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()

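I assume your kmeans uses n_clusters=3. cdict and ldict need to be set up to match the number of clusters. In this case, cluster 0 will be red with the label label_1, cluster 1 will be blue with the label label_2, and so on.

EDIT: I changed cdict to start from 0. EDIT 2: Added labels.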
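If the number of clusters changes often, the two dictionaries can be derived from the cluster ids rather than typed by hand. The following is a rough sketch, not part of the original answer: the synthetic data2D and group arrays, the four clusters, the tab10 colormap, and the output file name are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for data2D and group (assumed values, 4 clusters).
rng = np.random.default_rng(1)
group = np.repeat(np.arange(4), 15)
data2D = rng.normal(size=(60, 2)) + group[:, None] * 3.0

clusters = np.unique(group)
palette = plt.get_cmap("tab10")  # qualitative colormap with 10 distinct colors

# Build both dictionaries from the cluster ids.
cdict = {int(g): palette(i % 10) for i, g in enumerate(clusters)}
ldict = {int(g): f"label_{i + 1}" for i, g in enumerate(clusters)}

fig, ax = plt.subplots()
for g in clusters:
    mask = group == g  # boolean mask selecting the points of this cluster
    ax.scatter(data2D[mask, 0], data2D[mask, 1],
               color=cdict[int(g)], label=ldict[int(g)], s=100)
ax.legend()
fig.savefig("clusters_auto_dicts.png")
```

With more than ten clusters the colors from tab10 would repeat, so a larger palette would be needed in that case.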
