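I have measurement data (vibration) from a wind turbine operating under different operating conditions. My dataset consists of the operating conditions and the features I extracted from the measurements.
Dataset shape: (423, 15). Each of the 423 data points represents one measurement taken on one day, in chronological order over 423 days.
I now want to cluster the data to see whether the measurements change at all. Specifically, I want to check whether the vibration changes over time (which could indicate a fault in the turbine gearbox).
What I am currently doing: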
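import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# optimal Epsilon (distance):
X_pca = principalDf.values  # principalDf: my DataFrame of (presumably PCA-reduced) features
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
# column 0 is each point's distance to itself (0); column 1 is its nearest neighbour
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances,color="#0F215A")
plt.grid(True)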
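A sorted nearest-neighbour distance curve like this is commonly read at its "elbow" to choose `eps` for DBSCAN. Below is a minimal sketch of that next step, not the method actually used on the turbine data. Assumptions: `X_demo` is synthetic stand-in data, `min_samples = 5` is an arbitrary choice, and the 95th percentile is used as a crude substitute for reading the elbow off the plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the reduced features (assumption, not the turbine data):
# 300 "early" points around one operating state, 123 "late" points around another.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(300, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(123, 2)),
])

min_samples = 5  # arbitrary choice for this sketch

# kneighbors on the training data returns the point itself in column 0,
# so the last column is the distance to the (min_samples - 1)-th other point.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Crude stand-in for the elbow of the sorted k-distance curve (assumption).
eps = float(np.percentile(k_dist, 95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
n_noise = int((labels == -1).sum())
print("eps:", round(eps, 3), "clusters:", n_clusters, "noise points:", n_noise)
```

Because the rows are in chronological order, plotting `labels` against the row index would show whether cluster membership shifts at some point in time.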
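Of course, it may simply be that nothing changes across these data points. But what else could I try? It is a somewhat open-ended question, but I have run out of ideas.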
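First off, with KMeans, if the dataset is not naturally partitioned, you can end up with some very strange results! Since KMeans is unsupervised, you basically feed in all of your numeric variables and let the machine do the work for you; the target variable is not used for fitting, only for checking the clusters afterwards. Here is a simple example using the canonical Iris dataset. You can easily modify it to fit your own dataset: just change the "X" variables (everything except the target variable) and the "y" variable (the single target variable). Give it a try and report back.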
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # use all four features.
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
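# use seaborn to make scatter plot showing species for each sample
# build the DataFrame the plot needs: two sepal features plus species names
data = pd.DataFrame(X[:, :2], columns=["sepal_length", "sepal_width"])
data["species"] = iris.target_names[y]
sns.FacetGrid(data, hue="species", height=4) \
.map(plt.scatter, "sepal_length", "sepal_width") \
.add_legend();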
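The comments in the example talk about checking whether each point's predicted cluster matches its label in `y`. Cluster ids are arbitrary, so comparing them directly with `==` is not meaningful. Here is a minimal sketch of one way to do the comparison, using a contingency table and the adjusted Rand index. The names `table` and `ari`, and the choice of `n_init=10`, are assumptions made for this illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
y = iris.target

km = KMeans(n_clusters=3, n_init=10, random_state=0)
y_cluster = km.fit_predict(X_scaled)

# Rows: true species, columns: cluster ids. A good clustering puts most of
# each row into a single column, whichever id that column happens to have.
table = pd.crosstab(iris.target_names[y], y_cluster,
                    rownames=["species"], colnames=["cluster"])
print(table)

# Adjusted Rand index: 1.0 for identical partitions, around 0 for random ones.
ari = adjusted_rand_score(y, y_cluster)
print("adjusted Rand index:", round(ari, 3))
```

For time-ordered data without labels, the same kind of table could be built against a time bucket (for example, the row index divided into blocks of 30 days) instead of `y`, to see whether cluster membership drifts over time.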