Anomaly detection with DBSCAN

Question

I am running DBSCAN on my training dataset to find outliers and remove them before training a model. My training set has 7697 rows and 8 columns.

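from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize the features, then cluster; DBSCAN labels noise points as -1
X = StandardScaler().fit_transform(X_train[all_features])
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(model)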

# Keep only the rows that were not flagged as noise, then rebuild the index
X_train_1 = X_train[model.labels_ != -1].reset_index(drop=True)

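Q-1 Of these 7 features, some are discrete and some are continuous. Is it OK to scale both the discrete and the continuous features, or should I scale only the continuous ones?

Q-2 Do I need to map the clusters onto the test data, since they were learned from the training data?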

machine-learning deep-learning cluster-analysis outliers hdbscan
1 Answer

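DBSCAN will handle those outliers for you. That is what it was built for: points that do not fall in any dense region are labeled as noise (-1). See the example below, and post back if you have further questions.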

import seaborn as sns
import pandas as pd

# Load the Titanic sample data and drop rows with any missing value
titanic = sns.load_dataset('titanic')
titanic = titanic.dropna()
titanic['age'].plot.hist(
  bins = 50,
  title = "Histogram of the age variable"
)

from scipy.stats import zscore

# Univariate check for comparison: flag ages at least 2.5 standard deviations from the mean
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].abs() >= 2.5
titanic[titanic["is_outlier"]]

ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")
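
On Q-1 from the question: one common option is to scale only the continuous columns and pass the discrete ones through unchanged. Below is a minimal sketch using scikit-learn's ColumnTransformer. The column names and toy values are made up for illustration, and whether the discrete columns should also be scaled (or encoded some other way) depends on your data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two continuous columns and one discrete column.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "pclass": [3, 1, 3, 1, 1],
})
continuous_cols = ["age", "fare"]
discrete_cols = ["pclass"]

# Scale only the continuous columns; discrete columns are passed through as-is.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
X_scaled = preprocess.fit_transform(df)

# Output column order: transformed columns first, then the passthrough ones.
X_scaled = pd.DataFrame(X_scaled, columns=continuous_cols + discrete_cols)
print(X_scaled.round(2))
```

Keep in mind that DBSCAN with a Euclidean metric is distance-based, so any column left with a much larger range than the others will dominate the eps neighborhood.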

from sklearn.cluster import DBSCAN

outlier_detection = DBSCAN(
  eps = 0.5,
  metric = "euclidean",
  min_samples = 3,
  n_jobs = -1)
# fit_predict returns one cluster label per row; -1 marks noise (outliers)
clusters = outlier_detection.fit_predict(ageAndFare)
clusters
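
Since DBSCAN marks points that belong to no cluster with the label -1, the outliers can be separated from the rest with a boolean mask on the label array. Here is a self-contained sketch on synthetic data; the data and the eps and min_samples values are assumptions chosen for illustration, not the Titanic values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic data: a dense blob of 200 points plus 3 far-away points.
blob = rng.normal(loc=0.0, scale=0.1, size=(200, 2))
far_points = np.array([[3.0, 3.0], [-3.0, 2.5], [2.5, -3.0]])
X = np.vstack([blob, far_points])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Noise points (outliers) carry the label -1.
outlier_mask = labels == -1
X_outliers = X[outlier_mask]
X_clean = X[~outlier_mask]

print("outliers found:", outlier_mask.sum())
print("rows kept:", len(X_clean))
```

The same mask works on a DataFrame, for example `df[clusters != -1]`, as long as the rows of the DataFrame are in the same order as the array that was clustered.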

import matplotlib.pyplot as plt

# cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap still works
cmap = plt.get_cmap('Accent')
ageAndFare.plot.scatter(
  x = "age",
  y = "fare",
  c = clusters,
  cmap = cmap,
  colorbar = False
)
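
On Q-2 from the question: scikit-learn's DBSCAN has no predict method, so clusters learned on the training data are not applied to new data automatically. If you do want labels for test points, one possible approach is to give each new point the label of its nearest core sample when that sample lies within eps, and -1 otherwise. The sketch below assumes this rule; `dbscan_assign` is a hypothetical helper, not part of scikit-learn, and it assumes the fitted model found at least one core sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def dbscan_assign(model, X_new):
    """Label new points by their nearest core sample; -1 if farther than eps.

    Hypothetical helper for illustration, not a scikit-learn API.
    """
    core_labels = model.labels_[model.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(model.components_)
    dist, idx = nn.kneighbors(X_new)
    labels = core_labels[idx[:, 0]]          # fancy indexing returns a copy
    labels[dist[:, 0] > model.eps] = -1      # too far from every core sample
    return labels

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # synthetic training data
X_test = np.array([[0.1, -0.2], [8.0, 8.0]])              # one typical point, one extreme

scaler = StandardScaler().fit(X_train)       # fit the scaler on training data only
model = DBSCAN(eps=0.3, min_samples=10).fit(scaler.transform(X_train))

test_labels = dbscan_assign(model, scaler.transform(X_test))
print(test_labels)
```

If the only goal is to clean the training set before fitting another model, you may not need cluster labels for the test set at all. What should be reused in that case is the scaler fitted on the training data.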
