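I have a dataset with x and y variables. There is also a z column that specifies which group each x and y pair belongs to. There are 11 groups. I used K-means clustering to build a model that assigns the x and y variables to the correct group. I then plotted these x and y variables on a scatter plot, where K-means colors each point with one of 11 unique colors. I now want to compare this with the ground truth (in this case, the original x and y variables). I would like it shown on a third scatter plot that highlights in red the K-means data points that do not agree with the ground truth, and in green the data points that do.
How do I code this?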
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
df = pd.read_csv("FILENAME")
print(df)
x = df['height_mean']
y = df['weight_mean']
points = df[['height_mean', 'weight_mean']].values
#np.array([Values here])
# Number of clusters
n_clusters = 11
# Fit the KMeans model
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(points)
# Get cluster assignments
labels = kmeans.labels_
# Get cluster centers
centers = kmeans.cluster_centers_
# Plot the clusters
plt.scatter(x, y, c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=100) # Plot cluster centers as red X's
plt.xlabel('height')
plt.ylabel('weight')
plt.title("K-Means Clustering")
plt.show()
print(df)
This is the K-means scatter plot. How can I modify it to compare it with a ground-truth scatter plot?
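To compare the K-means cluster assignments with the ground truth (given in the "z" column), you can create a new scatter plot in which each point is colored according to whether its K-means label matches the ground truth. Here is an example based on the code you provided: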
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
# Read data
df = pd.read_csv("FILENAME")
print(df)
# Extract data
x = df['height_mean']
y = df['weight_mean']
z = df['group_column_name'] # Replace 'group_column_name' with the actual name of the column that specifies the group
points = df[['height_mean', 'weight_mean']].values
# Number of clusters
n_clusters = 11
# Fit the KMeans model
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(points)
# Get cluster assignments
labels = kmeans.labels_
# K-means cluster ids (0-10) are arbitrary and will not line up with the group ids in z,
# so map each cluster to the most common true group among its points (majority vote)
mapped = z.groupby(labels).transform(lambda s: s.mode().iloc[0])
# Compare the mapped cluster labels with the ground truth
match = (mapped == z).to_numpy()
# Create an array of colors for each point: green for a match, red for a mismatch
colors = np.where(match, 'g', 'r')
# Plot the points using the color array to show matches/mismatches
plt.scatter(x, y, c=colors, alpha=0.5)
plt.xlabel('height')
plt.ylabel('weight')
plt.title("K-Means Clustering vs Ground Truth")
plt.show()
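z holds the ground-truth group of each point.
colors holds the color of each point ('g' for green, 'r' for red).
The comparison fills colors: a point is green if its K-means label matches the ground truth, and red otherwise.
The final scatter plot draws each point in the color given by colors.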
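One caveat is worth illustrating: K-means numbers its clusters arbitrarily, so cluster 3 need not correspond to group 3 in the ground truth, and comparing the raw ids directly can mark almost every point red. The sketch below is one possible way to handle this, not the only one: it maps each cluster to the most common true group among its points (a majority vote) before comparing. The helper name match_colors and the toy data are made up for illustration.

```python
from collections import Counter, defaultdict


def match_colors(cluster_labels, true_groups):
    """Return 'g' or 'r' per point after mapping each cluster to its majority true group.

    Hypothetical helper for illustration; not part of scikit-learn or matplotlib.
    """
    # Collect the true groups observed inside each cluster
    members = defaultdict(list)
    for cluster, group in zip(cluster_labels, true_groups):
        members[cluster].append(group)
    # Majority vote: map each cluster id to its most common true group
    cluster_to_group = {
        cluster: Counter(groups).most_common(1)[0][0]
        for cluster, groups in members.items()
    }
    # Green where the mapped cluster agrees with the ground truth, red otherwise
    return [
        'g' if cluster_to_group[cluster] == group else 'r'
        for cluster, group in zip(cluster_labels, true_groups)
    ]


# Toy data (made up): the cluster ids are a shuffled version of the group ids
true_groups = [0, 0, 1, 1, 2, 2]
cluster_labels = [2, 2, 0, 0, 1, 0]

# Comparing raw ids marks every point as a mismatch, although the clustering is mostly right
naive = ['g' if c == g else 'r' for c, g in zip(cluster_labels, true_groups)]

colors = match_colors(cluster_labels, true_groups)
accuracy = colors.count('g') / len(colors)
print(naive)     # all 'r'
print(colors)    # five 'g' followed by one 'r'
print(accuracy)
```

With the variables from the code above, plt.scatter(x, y, c=match_colors(labels, z)) should draw the red/green plot. Note that a majority vote can map two clusters to the same group; if a strict one-to-one pairing is needed, scipy.optimize.linear_sum_assignment applied to the cluster-by-group count table is a common alternative.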