Clustering data - poor results, feature extraction

Question

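I have measurement data (vibration) from a wind turbine operating under varying conditions. My dataset consists of the operating conditions and of features that I extracted from the measurements.

Dataset shape: (423, 15). Each of the 423 data points represents one measurement per day, in chronological order over 423 days.

I now want to cluster the data to see whether the measurements change at all. Specifically, I want to check whether the vibration changes over time (which could indicate a fault in the turbine gearbox).

What I am currently doing:

Scale the data to the range [0, 1] ->
Perform PCA (reducing from 15 to 5 dimensions)
Because I do not know the number of clusters, cluster with DBSCAN. I am using this code to find the optimal epsilon (eps) for DBSCAN: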

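# optimal epsilon (distance): plot the sorted nearest-neighbour distances
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X_pca = principalDf.values  # principalDf: presumably the DataFrame of the 5 PCA components
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]  # column 0 is each point's zero distance to itself
plt.plot(distances, color="#0F215A")
plt.grid(True)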
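For reference, the three steps described above (scaling, PCA, DBSCAN) can be sketched end to end roughly as follows. This is only an illustration under stated assumptions: the feature matrix is synthetic random data with an artificial shift after day 300, standing in for the real (423, 15) dataset, and picking eps as the 90th percentile of the k-distances is an arbitrary stand-in for reading the elbow off the plot. The final loop prints which range of days each cluster covers, which is one simple way to see whether the clusters line up with time.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real (423, 15) feature matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(423, 15))
X[300:] += 3.0  # hypothetical shift after day 300, for illustration only

# Step 1: scale every feature to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Step 2: reduce from 15 to 5 dimensions.
X_pca = PCA(n_components=5).fit_transform(X_scaled)

# Step 3: k-distance heuristic for eps, then DBSCAN.
min_samples = 5  # placeholder value, not taken from the post
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_pca)
distances, _ = nn.kneighbors(X_pca)
k_dist = np.sort(distances[:, -1])       # distance to the k-th neighbour (self included)
eps = float(np.percentile(k_dist, 90))   # arbitrary cut-off instead of a visual elbow

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_pca)

# Rows are in chronological order, so the row index is the day number.
days = np.arange(len(labels))
for label in np.unique(labels):
    idx = days[labels == label]
    print(f"cluster {label}: {idx.size} points, days {idx.min()}..{idx.max()}")
```

A label of -1 marks points that DBSCAN treats as noise. If a cluster covers only the later days, that would hint at a change over time; if every cluster spans the whole period, the clustering does not separate early from late measurements.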

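The results so far do not clearly show any change in the data over time.

Of course, it may simply be that there is no change in the data across these data points. But what else could I try? This is a somewhat open-ended question, but I have run out of ideas.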

Answer 1

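First of all, with KMeans, if the dataset does not partition naturally, you may get some very strange results! Because KMeans is unsupervised, you can basically feed in all kinds of numeric variables, set a target variable, and let the machine do the work for you. Here is a simple example using the canonical iris dataset. You can easily modify it to fit your specific dataset. Just change the "X" variables (everything except the target variable) and the "y" variable (the single target variable). Give it a try and report back.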

import pandas as pd
import matplotlib.pyplot as plt
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")


from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4]  # take all four features
y = iris.target



from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)

X_scaled.sample(5)


# try clustering on the 4d data and see if can reproduce the actual clusters.

# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.

# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.

from sklearn.cluster import KMeans

nclusters = 3 # this is the k in kmeans
seed = 0

km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)

# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans


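# use seaborn to make a scatter plot showing the species of each sample
# (`data` was undefined in the snippet, so it is built from iris here)
data = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
data["species"] = iris.target_names[y]
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend()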
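The comments above talk about checking whether the predicted cluster for each point matches the label in y, but the snippet stops short of that comparison. A minimal sketch of one way to do it is below, assuming a cross-tabulation and the adjusted Rand index are acceptable measures; neither is part of the original answer. Cluster numbers are arbitrary, so the table is read by looking for one dominant cluster per species rather than for matching numbers.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# n_init is set explicitly because its default differs between scikit-learn versions.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X_scaled)

# Rows: true species, columns: KMeans cluster id.
species = iris.target_names[iris.target]
table = pd.crosstab(species, cluster_labels, rownames=["species"], colnames=["cluster"])
print(table)

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(iris.target, cluster_labels)
print(f"adjusted Rand index: {ari:.2f}")
```

For the turbine data there is no species label, but the same cross-tabulation could be applied with a time bucket (for example, the month of each measurement) in place of the species column.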
