I am trying to run unsupervised learning on a dataset for feature extraction, to find out which data points cluster together and what the main features (the centroid) of each group are. So I plan to use KMeans to get the weights of each centroid. Before applying KMeans, I use t-SNE to reduce the dimensionality of the data so that it can be shown as a scatter plot. My goal is to obtain the centroids of the best-condition and worst-condition data points. Here is a sample of my code.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Set a seed for reproducibility
np.random.seed(42)
# Generate dummy data with random values
num_rows = 1000
# Create a DataFrame with random values and specific column names
dummy_data = pd.DataFrame({
    'Name': np.random.choice(
        ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry',
         'Isabella', 'Jack', 'Kate', 'Liam', 'Mia', 'Noah', 'Olivia', 'Peter',
         'Quinn', 'Rachel', 'Sam', 'Taylor'], size=num_rows),
'Condition': np.random.choice(['Good', 'Bad'], size=num_rows),
'Latency_Wifi': np.random.normal(loc=1, scale=0.2, size=num_rows), # note: not actually conditioned on 'Condition' in this dummy data
'Loss_Wifi': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
'Latency_Gaming': np.random.normal(loc=1, scale=0.2, size=num_rows),
'Loss_Gaming': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
'Latency_Video': np.random.normal(loc=1, scale=0.2, size=num_rows),
'Loss_Video': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
'Latency_WFH': np.random.normal(loc=1, scale=0.2, size=num_rows),
'Loss_WFH': np.random.normal(loc=0.05, scale=0.02, size=num_rows),
})
features = dummy_data.drop(['Name',
'Condition'], axis=1)
# Standardize the data to have zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(features)
# kpca = KernelPCA(n_components=10, kernel='rbf', gamma=0.1)
# data_kpca = kpca.fit_transform(data_scaled)
# Apply t-SNE for further dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
data_tsne = tsne.fit_transform(data_scaled)
df = dummy_data
columns_of_interest = features.columns.to_list()
# Apply K-means on the t-SNE components
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(data_tsne)
# Add t-SNE components and cluster labels to the original DataFrame
df['TSNE_Component_1'] = data_tsne[:, 0]
df['TSNE_Component_2'] = data_tsne[:, 1]
df['Cluster'] = labels
# Get the centroid coordinates
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=columns_of_interest)
# Display the main features for each centroid
for cluster_num in range(n_clusters):
    centroid_features = centroids.iloc[cluster_num]
    main_features = centroid_features.abs().sort_values(ascending=False).head(3)  # top 3 features
    print(f"Cluster {cluster_num + 1}: Main Features - {main_features.index.tolist()}")
# Count the number of users in each cluster
cluster_counts = df['Cluster'].value_counts().reset_index()
cluster_counts.columns = ['Cluster', 'Number_of_Users']
# Select the top 10 clusters based on the highest number of users
top_clusters = cluster_counts.nlargest(10, 'Number_of_Users')['Cluster'].tolist()
# Filter the DataFrame for the top clusters
df_top_clusters = df[df['Cluster'].isin(top_clusters)]
But when I run the code above, I get this error:

centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=columns_of_interest)

ValueError: operands could not be broadcast together with shapes (10,2) (8,) (10,2)
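I can reproduce the broadcast failure in isolation: the scaler was fit on the 8 original feature columns, but `kmeans.cluster_centers_` lives in the 2-D t-SNE space, so `inverse_transform` has nothing sensible to map back. A minimal sketch with random placeholder data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # 8 original feature columns
scaler = StandardScaler().fit(X)        # scaler now expects 8 columns

centers_2d = rng.normal(size=(10, 2))   # stand-in for cluster centers in t-SNE space
try:
    scaler.inverse_transform(centers_2d)
    failed = False
except ValueError as e:
    failed = True
    print(e)                             # shape mismatch: (10, 2) vs 8 feature columns
```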
A friend of mine suggested I use a different function to reduce the dimensionality of the data from nonlinear to linear. But I thought that was the whole point of using t-SNE?
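For what it's worth, one workaround I am considering is to skip `inverse_transform` entirely and instead average the original (unscaled) features over each cluster label, since the labels themselves are valid regardless of which space KMeans ran in. A minimal sketch with placeholder column names `f0`..`f7` and random data in place of my real DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(200, 8)),
                        columns=[f'f{i}' for i in range(8)])

scaled = StandardScaler().fit_transform(features)
emb = TSNE(n_components=2, random_state=42).fit_transform(scaled)
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(emb)

# Average the ORIGINAL features per cluster instead of inverse-transforming
# the t-SNE-space centers; each row is a centroid in the original units.
centroids = features.groupby(labels).mean()
print(centroids.shape)  # (5, 8)
```

Would this give centroids that are meaningful in the original feature units, or am I losing something by clustering in the t-SNE embedding?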