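I have been told that there is data leakage in the way I split my data into training and test sets. My goal is to plot ROC curves for logistic regression on the original data (after standardization) and on a PCA-reduced version of it.
However, for every plot the ROC curve reports an area under the curve (AUC) of exactly 1.00. Here is the code that assigns my features and labels: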
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize

# Import wine dataset.
df_1 = pd.read_csv('./wine.data', delimiter=',', header=None, nrows=200)
# Feature Selection (dropping strongly correlated features):
df_1 = df_1.drop(df_1.columns[7], axis='columns')
df_1
# Separating out the features, columns 1 to 13.
x = df_1.iloc[:, 1:13].values
# Separating out the target, column 0, with classes 1, 2, and 3.
y = df_1.iloc[:, 0].values
# Standardizing the features (centering and scaling).
x = StandardScaler().fit_transform(x)
# Dataset is split into training set (70%) and testing set (30%).
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=123)
# PCA is performed for 2 components
pca_2 = PCA(n_components=2)
X_train_pca_2 = pca_2.fit_transform(X_train)
X_test_pca_2 = pca_2.fit_transform(X_test)
Here is the rest of my code, in case data leakage is not actually the problem:
# Create logistic regression function for one vs all classifier.
logreg = LogisticRegression(random_state=0, multi_class='ovr')
# Train the model using the original pre-processed dataset.
model = logreg.fit(X_train, y_train)
# Predict the class of each wine in the testing data.
y_pred = model.predict(X_test)
y_pred
array([3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 3, 3, 1, 1, 3, 2, 2, 1, 2, 3, 3,
       3, 3, 2, 3, 3, 2, 1, 1, 1, 1, 2, 2, 3, 2, 3, 1, 2, 2, 3, 3, 1, 1,
       2, 1, 1, 2, 1, 2, 2, 3, 3, 2], dtype=int64)
# Train a second model using the PCA.
model_PCA = logreg.fit(X_train_pca_2, y_train)
# Predict the class of each wine in the testing data.
y_pred_PCA = model_PCA.predict(X_test_pca_2)
y_pred_PCA
array([3, 2, 2, 2, 2, 3, 1, 2, 3, 2, 3, 3, 3, 1, 1, 3, 2, 2, 1, 2, 2, 3,
       3, 2, 2, 3, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 2, 2, 3, 3, 1, 1,
       2, 1, 1, 2, 1, 1, 2, 2, 3, 2], dtype=int64)
# Binarize the labels
y_train_binary = label_binarize(y_train, classes=[1, 2, 3])
y_test_binary = label_binarize(y_test, classes=[1, 2, 3])
n_classes = y_train_binary.shape[1]
# Score for One vs Rest Classifier
y_score = model.fit(X_train, y_train).decision_function(X_test)
# Score for PCA model
y_score_PCA = model_PCA.fit(X_train, y_train).decision_function(X_test)
fpr1 = dict()
tpr1 = dict()
roc_auc1 = dict()
for i in range(n_classes):
    fpr1[i], tpr1[i], _ = roc_curve(y_test_binary[:, i], y_score[:, i])
    roc_auc1[i] = auc(fpr1[i], tpr1[i])
fpr2 = dict()
tpr2 = dict()
roc_auc2 = dict()
for j in range(n_classes):
    fpr2[j], tpr2[j], _ = roc_curve(y_test_binary[:, j], y_score_PCA[:, j])
    roc_auc2[j] = auc(fpr2[j], tpr2[j])
# Plotting ROC curves of all 3 classes using logistic regression on the original data set and the PCA model
plt.figure()
lw = 1
plt.plot(fpr1[0], tpr1[0], color='purple',
         lw=lw, label='LogReg Class 1 (area = %0.2f)' % roc_auc1[0])
plt.plot(fpr1[1], tpr1[1], color='blue',
         lw=lw, label='LogReg Class 2 (area = %0.2f)' % roc_auc1[1])
plt.plot(fpr1[2], tpr1[2], color='aqua',
         lw=lw, label='LogReg Class 3 (area = %0.2f)' % roc_auc1[2])
plt.plot(fpr2[0], tpr2[0], color='orange',
         lw=lw, label='PCA Class 1 (area = %0.2f)' % roc_auc2[0])
plt.plot(fpr2[1], tpr2[1], color='green',
         lw=lw, label='PCA Class 2 (area = %0.2f)' % roc_auc2[1])
plt.plot(fpr2[2], tpr2[2], color='red',
         lw=lw, label='PCA Class 3 (area = %0.2f)' % roc_auc2[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
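You may actually be getting perfect results. I have not worked with this dataset, so I don't know how plausible that is with logistic regression. However, your code has one major flaw: you overwrite the PCA model with a model trained on the full feature space. See: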
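To spell out the overwrite: `logreg.fit(...)` returns the same `logreg` object, so `model` and `model_PCA` in the question are one estimator, and the `y_score_PCA` line refits it on `X_train` and scores `X_test` instead of the PCA-transformed arrays. The following is only a minimal sketch of one way to keep the two models apart, not a definitive fix. It makes several assumptions: scikit-learn's bundled `load_wine` data stands in for `./wine.data` (its classes are labelled 0, 1, 2 and no column is dropped), `OneVsRestClassifier` replaces `multi_class='ovr'`, `stratify=y` is added to the split, and `per_class_auc` is a hypothetical helper. It also splits before scaling and calls `transform` rather than `fit_transform` on the test set, which addresses the leakage concern raised in the question.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler, label_binarize

# Assumption: the bundled wine data replaces './wine.data' (classes 0, 1, 2).
X, y = load_wine(return_X_y=True)

# Split first, so no test-set statistics reach the scaler or the PCA.
# Assumption: stratify=y keeps all three classes present in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Fit the PCA on the training data only; the test set is only transformed.
pca_2 = PCA(n_components=2).fit(X_train_std)
X_train_pca_2 = pca_2.transform(X_train_std)
X_test_pca_2 = pca_2.transform(X_test_std)

# Two separate estimator objects, each fitted exactly once on its own inputs.
model = OneVsRestClassifier(LogisticRegression(random_state=0))
model.fit(X_train_std, y_train)
model_PCA = OneVsRestClassifier(LogisticRegression(random_state=0))
model_PCA.fit(X_train_pca_2, y_train)

# Score each model on the matching representation of the test set.
y_score = model.decision_function(X_test_std)
y_score_PCA = model_PCA.decision_function(X_test_pca_2)

y_test_binary = label_binarize(y_test, classes=[0, 1, 2])


def per_class_auc(y_true_binary, scores):
    """Return the one-vs-rest ROC AUC for each class column (hypothetical helper)."""
    areas = []
    for i in range(y_true_binary.shape[1]):
        fpr, tpr, _ = roc_curve(y_true_binary[:, i], scores[:, i])
        areas.append(auc(fpr, tpr))
    return areas


roc_auc1 = per_class_auc(y_test_binary, y_score)
roc_auc2 = per_class_auc(y_test_binary, y_score_PCA)
print("full features:", roc_auc1)
print("2 PCA components:", roc_auc2)
```

If the two sets of AUC values still come out identical after a change like this, the cause is no longer the shared estimator. The wine classes are fairly well separated, so high AUC values on the full feature set would not by themselves indicate leakage.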