Clustering data - bad results, feature extraction

Question  Votes: 0  Answers: 1

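I have measurement data (vibration) from a wind turbine running under different operating conditions. My dataset consists of operating-condition measurement features that I extracted from the measurement data.

Dataset shape: (423, 15). Each of the 423 data points represents one measurement per day, in chronological order over 423 days.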


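I now want to cluster the data to see whether the measurements change at all. Specifically, I want to check whether the vibration changes over time (which could indicate a fault in the turbine gearbox).

What I am currently doing:

  1. Scale the data to the range [0, 1]
  2. Perform PCA (reducing from 15 to 5 dimensions)
  3. Because I don't know the number of clusters, cluster with DBSCAN. I am using this code to find the optimal epsilon (eps) for DBSCAN: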
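import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# optimal epsilon (distance): sorted nearest-neighbor distances
X_pca = principalDf.values  # principalDf: the PCA output from step 2
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]  # column 0 is each point's zero distance to itself
plt.plot(distances, color="#0F215A")
plt.grid(True)
plt.show()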
  4. The results so far do not clearly show that the data changes over time:


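Of course, it may simply be that the data does not change across these data points. But what else could I try? It's a bit of an open-ended question, but I have run out of ideas.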
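
The steps above can be sketched end to end as follows. This is only a minimal sketch under stated assumptions: the random array `X` stands in for the real (423, 15) feature table, `min_samples = 5` is an arbitrary choice, and taking `eps` from the 90th percentile of the k-distance curve is a stand-in heuristic for reading the elbow off the plot by eye. Because the rows are in chronological order, the last part compares how the cluster labels are distributed in the first and second half of the period, which is one simple way to look for a change over time.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Placeholder data (assumption): same shape as the real table, 423 days x 15 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(423, 15))

# Step 1: scale every feature to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Step 2: reduce from 15 to 5 dimensions with PCA.
X_pca = PCA(n_components=5).fit_transform(X_scaled)

# Step 3: k-distance curve. Querying the training data returns each point
# itself at distance 0, so the last column is the distance to the
# min_samples-th point counting the point itself, as DBSCAN counts it.
min_samples = 5  # assumed value, not taken from the question
neigh = NearestNeighbors(n_neighbors=min_samples).fit(X_pca)
distances, _ = neigh.kneighbors(X_pca)
k_distances = np.sort(distances[:, -1])

# Assumed heuristic: use a high percentile instead of an elbow picked by eye.
eps = float(np.percentile(k_distances, 90))

# Cluster; DBSCAN marks noise points with the label -1.
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_pca)

# Compare the label counts in the first and second half of the time series.
half = len(labels) // 2
for name, part in (("first half", labels[:half]), ("second half", labels[half:])):
    values, counts = np.unique(part, return_counts=True)
    print(name, dict(zip(values.tolist(), counts.tolist())))
```

If one cluster (or the noise label -1) appears mostly in the later half, that would hint at a drift; on random placeholder data no such pattern is expected.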

python cluster-analysis signal-processing unsupervised-learning
1 Answer
0
votes

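First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! Since KMeans is unsupervised, you basically dump in all kinds of numeric variables and let the machine do the lifting for you; a target variable, if you have one, is only used afterwards to compare against the clusters. Here is a simple example using the canonical Iris dataset. You can easily modify it to fit your own dataset: just change the 'X' variables (everything except the target variable) and the 'y' variable (the single target variable). Try it and report back.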

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")


from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4]  # take all four features.
y = iris.target



from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)

X_scaled.sample(5)


# try clustering on the 4d data and see if we can reproduce the actual clusters.

# i.e. imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. we could set an arbitrary number of clusters
# and try dividing them up into similar clusters.

# we happen to know there are 3 species, so let's find 3 clusters and see
# if the prediction for each point matches the label in y.

from sklearn.cluster import KMeans

nclusters = 3 # this is the k in kmeans
seed = 0

km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)

# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans


# use seaborn to make scatter plot showing species for each sample
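# put the features and species names in a DataFrame for plotting
data = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
data["species"] = iris.target_names[y]
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend();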
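
The comments in the code above talk about checking whether the predicted cluster for each point matches the label in y, but the snippet stops before doing so. A minimal sketch of that comparison, assuming the same scaled Iris data and three clusters, could look like this. The choice of `adjusted_rand_score` and `silhouette_score` is mine, not something taken from the answer; the silhouette score needs no labels, so it is the one that would carry over to an unlabeled dataset such as the turbine features.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Same setup as the example above: standardized Iris features, k = 3.
iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(X_scaled)

# Cross-tabulate species against cluster id. Cluster ids are arbitrary,
# so look for one dominant cluster per species rather than equal numbers.
table = pd.crosstab(
    pd.Series(iris.target_names[y], name="species"),
    pd.Series(clusters, name="cluster"),
)
print(table)

# Agreement with the known labels (only possible when labels exist).
ari = adjusted_rand_score(y, clusters)

# Cluster separation measured without any labels.
sil = silhouette_score(X_scaled, clusters)

print(f"adjusted Rand index: {ari:.3f}")
print(f"silhouette score:    {sil:.3f}")
```

A low silhouette score is one sign that the data is not naturally partitioned into the chosen number of clusters.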
