Is there data leakage in my code (the ROC curves give an AUC score of 1.00)?


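I have been told that there is data leakage in the way I split my data into training and test sets. My goal is to plot receiver operating characteristic (ROC) curves using logistic regression on the original data (after standardization) and on a PCA model.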

However, for every plot the ROC curve reports an area under the curve of exactly 1.00. Here is the code that assigns my features and labels:

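import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize

# Import wine dataset.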
df_1 = pd.read_csv ('./wine.data', delimiter =',', header = None, nrows=200)

# Feature Selection (dropping strongly correlated features):
df_1 = df_1.drop(df_1.columns[7], axis='columns')
df_1

# Separating out the features, columns 1 to 13.
x = df_1.iloc[:, 1:13].values 

# Separating out the target, column 0, with classes 1, 2, and 3.
y = df_1.iloc[:, 0].values 

# Standardizing the features (centering and scaling).
x = StandardScaler().fit_transform(x)

# Dataset is split into training set (70%) and testing set (30%).
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=123)

#PCA is performed for 2 components
pca_2 = PCA(n_components=2)
X_train_pca_2 = pca_2.fit_transform(X_train)
X_test_pca_2 = pca_2.fit_transform(X_test)

Here is the rest of my code, in case data leakage is not actually the problem:

# Create logistic regression function for one vs all classifier.
logreg = LogisticRegression(random_state=0, multi_class='ovr')

# Train the model using the original pre-processed dataset.
model = logreg.fit(X_train, y_train)

# Predict the class of each wine in the testing data.
y_pred = model.predict(X_test)
y_pred

array([3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 3, 3, 1, 1, 3, 2, 2, 1, 2, 3, 3,
       3, 3, 2, 3, 3, 2, 1, 1, 1, 1, 2, 2, 3, 2, 3, 1, 2, 2, 3, 3, 1, 1,
       2, 1, 1, 2, 1, 2, 2, 3, 3, 2], dtype=int64)

# Train a second model using the PCA.
model_PCA = logreg.fit(X_train_pca_2, y_train)

# Predict the class of each wine in the testing data.
y_pred_PCA = model_PCA.predict(X_test_pca_2)
y_pred_PCA

array([3, 2, 2, 2, 2, 3, 1, 2, 3, 2, 3, 3, 3, 1, 1, 3, 2, 2, 1, 2, 2, 3,
       3, 2, 2, 3, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 2, 2, 3, 3, 1, 1,
       2, 1, 1, 2, 1, 1, 2, 2, 3, 2], dtype=int64)

# Binarize the labels
y_train_binary = label_binarize(y_train, classes=[1, 2, 3])
y_test_binary = label_binarize(y_test, classes=[1, 2, 3])
n_classes = y_train_binary.shape[1]

# Score for One vs Rest Classifier
y_score = model.fit(X_train, y_train).decision_function(X_test)

# Score for PCA model
y_score_PCA = model_PCA.fit(X_train, y_train).decision_function(X_test)

fpr1 = dict()
tpr1 = dict()
roc_auc1 = dict()
for i in range(n_classes):
    fpr1[i], tpr1[i], _ = roc_curve(y_test_binary[:, i], y_score[:, i])
    roc_auc1[i] = auc(fpr1[i], tpr1[i])

fpr2 = dict()
tpr2 = dict()
roc_auc2 = dict()
for j in range(n_classes):
    fpr2[j], tpr2[j], _ = roc_curve(y_test_binary[:, j], y_score_PCA[:, j])
    roc_auc2[j] = auc(fpr2[j], tpr2[j])

# Plotting ROC Curve of all 3 classes using logistic regression on the original data set and the PCA model
plt.figure()
lw = 1
plt.plot(fpr1[0], tpr1[0], color='purple',
         lw=lw, label='LogReg Class 1 (area = %0.2f)' % roc_auc1[0])
plt.plot(fpr1[1], tpr1[1], color='blue',
         lw=lw, label='LogReg Class 2 (area = %0.2f)' % roc_auc1[1])
plt.plot(fpr1[2], tpr1[2], color='aqua',
         lw=lw, label='LogReg Class 3 (area = %0.2f)' % roc_auc1[2])
plt.plot(fpr2[0], tpr2[0], color='orange',
         lw=lw, label='PCA Class 1 (area = %0.2f)' % roc_auc2[0])
plt.plot(fpr2[1], tpr2[1], color='green',
         lw=lw, label='PCA Class 2 (area = %0.2f)' % roc_auc2[1])
plt.plot(fpr2[2], tpr2[2], color='red',
         lw=lw, label='PCA Class 3 (area = %0.2f)' % roc_auc2[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

python logistic-regression pca roc
1 Answer

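You may genuinely be getting perfect results. I have not worked with this dataset, so I don't know how likely that is with logistic regression. However, your code has one major flaw: you overwrite the PCA model with the model trained on the full feature space. Both `model` and `model_PCA` come from calling `fit` on the same `logreg` object, and `fit` returns that object, so the two names refer to a single estimator. The line `y_score_PCA = model_PCA.fit(X_train, y_train).decision_function(X_test)` then refits it on the full feature set, which means both sets of ROC curves are computed from the same model and the PCA features are never scored.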
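A minimal sketch of how the two models could be kept separate is shown here. It is not the asker's exact setup: it assumes scikit-learn's bundled `load_wine` data as a stand-in for `./wine.data` (its classes are labelled 0, 1, 2 rather than 1, 2, 3), keeps all 13 features, adds `stratify=y` to the split, and uses `OneVsRestClassifier` in place of `multi_class='ovr'`, which as far as I know is deprecated in recent scikit-learn releases. The variable names are illustrative. It also fits the scaler and PCA on the training rows only and applies `transform` (not `fit_transform`) to the test rows, which addresses the leakage concern raised in the question.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler, label_binarize

# Assumption: the bundled wine data stands in for ./wine.data (labels 0, 1, 2).
X, y = load_wine(return_X_y=True)

# Split first, so nothing fitted afterwards ever sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Fit PCA on the training rows only; the test rows are only transformed.
pca = PCA(n_components=2).fit(X_train_std)
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Two separate estimator objects, so fitting one cannot overwrite the other.
model_full = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model_pca = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model_full.fit(X_train_std, y_train)
model_pca.fit(X_train_pca, y_train)

# Score each model on the feature space it was trained on.
score_full = model_full.decision_function(X_test_std)
score_pca = model_pca.decision_function(X_test_pca)

# Per-class one-vs-rest AUC from the binarized test labels.
classes = [0, 1, 2]
y_test_bin = label_binarize(y_test, classes=classes)
auc_full = [roc_auc_score(y_test_bin[:, i], score_full[:, i]) for i in classes]
auc_pca = [roc_auc_score(y_test_bin[:, i], score_pca[:, i]) for i in classes]

print("AUC per class, all features:", auc_full)
print("AUC per class, 2 PCA components:", auc_pca)
```

With this structure the two sets of curves can differ, because `score_pca` really comes from the model trained on the two principal components. An AUC near 1.00 on the full feature set would not by itself indicate leakage; as noted above, the result may simply be genuine.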
