I am running this code.
import pandas as pd
titanic = pd.read_csv('titanic.csv')
titanic.head()
# Import the required modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
documents = titanic['Name']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
# initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)
# fit the model
kmeans.fit(X)
# store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
titanic.tail()
Finally...
from sklearn.decomposition import PCA
# reuse the TF-IDF matrix X that was computed above
# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())
# save our two dimensions into x0 and x1
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]
# assign clusters and pca vectors to our dataframe
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1
titanic.head()
import plotly.express as px
fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', text='Name')
fig.show()
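This is the plot I see.
I think it is working... but my question is: how can we spread the text out more and/or remove outliers so that the chart is more meaningful? I assume the clustering is correct, since I am not doing anything special here, but is there any way to make the clusters more significant or meaningful?
The data comes from here.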
https://www.kaggle.com/competitions/titanic/data?select=test.csv
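You can have the name shown only when the mouse hovers over a data point. Currently, you are trying to draw each passenger's name next to its data point. Because many data points lie close to one another, putting the names directly on the plot causes them to be drawn on top of each other. You can fix this by changing the plotting code to something like the following: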
fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()
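Essentially, the only thing we changed in the code above is which parameter is used to pass the 'Name' information. Here is what it looks like after this change:
Now the names are displayed only when you hover over a data point.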
With the above change in mind, here is your complete code:
# Import the required modules
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
# Paths to the downloaded data files on the local machine
train_data_filepath = '/Users/erikingwersen/Downloads/train.csv'
test_data_filepath = '/Users/erikingwersen/Downloads/test.csv'
# Read the train data from downloaded file
titanic = pd.read_csv(train_data_filepath)
documents = titanic['Name']
X = TfidfVectorizer(stop_words='english').fit_transform(documents)
# Initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)
# Fit the model
kmeans.fit(X)
# Store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
# Reuse the TF-IDF matrix X computed above
# Initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# Pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())
# Save our two dimensions into x0 and x1
x0, x1 = pca_vecs[:, 0], pca_vecs[:, 1]
# Assign clusters and pca vectors to our dataframe
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1
titanic.head()
fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()