Color the plot by author, but cluster with k-means/TF-IDF (Python)


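Hey!

This is my first time working with k-means, TF-IDF, and document clustering. I clustered a set of text files with k-means on TF-IDF vectors, and it works well. I plot the result (after PCA) and can see the clusters clearly.

Now I would like the points colored by the author of each text instead of by cluster/topic. Does anyone know how to do this?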


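import glob
import os

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

file_list = glob.glob(os.path.join(os.getcwd(), "myFiles", "*.txt"))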

dataset = []

for file_path in file_list:
    with open(file_path) as f_input:
        dataset.append(f_input.read())

vectorizer = TfidfVectorizer(stop_words='english')


vectorized_documents = vectorizer.fit_transform(dataset) 
  
pca = PCA(n_components=2) 
reduced_data = pca.fit_transform(vectorized_documents.toarray()) 
  
num_clusters = 7
kmeans = KMeans(n_clusters=num_clusters, n_init=5, 
                max_iter=500, random_state=42) 
kmeans.fit(vectorized_documents) 
  

# create a dataframe to store the results 
results = pd.DataFrame() 
results['document'] = dataset 
results['cluster'] = kmeans.labels_ 


# plot the results 
colors = ['black', 'red', 'green', 'yellow', 'blue', 'orange', 'purple'] 
cluster = ['0', '1','2', '3', '4', '5', '6'] 
for i in range(num_clusters): 
    plt.scatter(reduced_data[kmeans.labels_ == i, 0], 
                reduced_data[kmeans.labels_ == i, 1],  
                s=10, color=colors[i],  
                label=f' {cluster[i]}') 
plt.legend() 
plt.show()

matplotlib plot cluster-analysis k-means tf-idf
1 Answer

So just extract the author name for each document.

Map each author to a color, then modify the plotting code as shown below:

import os
import glob
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Assuming file_list and dataset are already defined as in your code.
# Extract author names from filenames
# Modify this according to your actual filename format
authors = [os.path.basename(file).split('_')[0] for file in file_list]

vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(dataset)

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())

num_clusters = 7
kmeans = KMeans(n_clusters=num_clusters, n_init=5, max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)

# Create a color map for authors
# (sorted, so colors and legend order stay the same from run to run)
unique_authors = sorted(set(authors))
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_authors)))
color_map = dict(zip(unique_authors, colors))

# Plot the PCA results, colored by author
for author in unique_authors:
    idx = [i for i, a in enumerate(authors) if a == author]
    plt.scatter(reduced_data[idx, 0], reduced_data[idx, 1], color=color_map[author], label=author)

plt.legend()
plt.show()
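
If you also want to keep the cluster information visible, one option is to let the color encode the author and the marker shape encode the k-means cluster. The following is only a sketch under assumptions: the helper names (author_from_path, assign_styles, plot_author_and_cluster), the "author_title.txt" naming scheme, and the chosen palettes are all hypothetical and not part of the original code.

```python
import os
from itertools import cycle


def author_from_path(path, sep="_"):
    """Return the text before the first `sep` in the file name (extension dropped).

    Assumes a hypothetical naming scheme such as "austen_emma.txt".
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.split(sep)[0]


def assign_styles(values, palette):
    """Map each distinct value, in sorted order, to an entry of `palette`.

    The palette is cycled if there are more distinct values than entries.
    """
    return dict(zip(sorted(set(values)), cycle(palette)))


def plot_author_and_cluster(points, authors, labels):
    """Scatter 2-D points: color encodes the author, marker shape the cluster."""
    # Imported here so the helpers above can be used without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib.lines import Line2D

    author_color = assign_styles(
        authors,
        ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple",
         "tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan"],
    )
    cluster_marker = assign_styles(labels, ["o", "s", "^", "D", "v", "P", "X"])

    # Group coordinates by (author, cluster) so each group is drawn once.
    groups = {}
    for (x, y), author, label in zip(points, authors, labels):
        xs, ys = groups.setdefault((author, label), ([], []))
        xs.append(x)
        ys.append(y)

    fig, ax = plt.subplots()
    for (author, label), (xs, ys) in groups.items():
        ax.scatter(xs, ys, s=10,
                   color=author_color[author], marker=cluster_marker[label])

    # Two legends built from proxy handles: one for authors, one for clusters.
    author_handles = [
        Line2D([], [], linestyle="", marker="o", color=color, label=str(author))
        for author, color in author_color.items()
    ]
    cluster_handles = [
        Line2D([], [], linestyle="", marker=marker, color="gray", label=str(label))
        for label, marker in cluster_marker.items()
    ]
    first_legend = ax.legend(handles=author_handles, title="author", loc="upper left")
    ax.add_artist(first_legend)  # keep the first legend when adding the second
    ax.legend(handles=cluster_handles, title="cluster", loc="lower right")
    return ax
```

With the variables from the code above, this could be used as authors = [author_from_path(p) for p in file_list], then plot_author_and_cluster(reduced_data, authors, kmeans.labels_), followed by plt.show(). With more than ten authors or more than seven clusters the styles repeat, so you would need a larger palette.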