How can I make KMeans clustering more meaningful for the Titanic data?


I am running this code.

import pandas as pd
titanic = pd.read_csv('titanic.csv')
titanic.head()


# Import required modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)
# fit the model
kmeans.fit(X)
# store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
titanic.tail()

Finally...

from sklearn.decomposition import PCA

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# save our two dimensions into x0 and x1
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]

# assign clusters and pca vectors to our dataframe 
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1

titanic.head()

import plotly.express as px

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', text='Name')
fig.show()

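This is the plot I see.

I think it's working... but my question is: how can we spread the text out more and/or remove outliers so that the plot makes more sense? I assume the clustering itself is correct, since I'm not doing anything special here, but is there any way to make the clusters more significant or meaningful?

The data comes from here.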

https://www.kaggle.com/competitions/titanic/data?select=test.csv

python python-3.x cluster-analysis k-means
1 Answer

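You can have the name shown only when the mouse hovers over a data point. Currently, you are trying to draw every passenger's name next to its data point. Because many points lie close together, putting the names directly on the plot causes them to be drawn on top of one another. You can fix this by changing the plotting code to something like the following: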

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()

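Basically, the only thing we changed in the code above is which parameter carries the "Name" information: hover_name='Name' instead of text='Name'. This is what it looks like after the change: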

Now the name is shown only when you hover over a data point.

Full code

With the above change applied, here is your complete code:

# Import required modules
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Where our data is located on our machine
train_data_filepath = '/Users/erikingwersen/Downloads/train.csv'
test_data_filepath = '/Users/erikingwersen/Downloads/test.csv'

# Read the train data from downloaded file
titanic = pd.read_csv(train_data_filepath)

documents = titanic['Name']

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

# Initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)

# Fit the model
kmeans.fit(X)

# Store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
# X (the TF-IDF matrix computed above) is reused for PCA

# Initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)

# Pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# Save our two dimensions into x0 and x1
x0, x1 = pca_vecs[:, 0], pca_vecs[:, 1]

# Assign clusters and PCA vectors to our dataframe
# (separate assignments keep the cluster labels as integers)
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1


titanic.head()

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()
