Finding the optimal K value for accuracy using KNN


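I'm new to KNN and am trying to find the best value of k when what we care about most is the mean accuracy over 4 folds. I know the best value should be 12, but I keep getting 7 as the output. Can anyone help? My code runs, but it doesn't produce the expected output. I'm using a Jupyter notebook, by the way. Maybe I've misunderstood the algorithm.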

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer

# Load data
institutions_df = pd.read_csv('institutions.csv')
banklist_df = pd.read_csv('banklist.csv', encoding='cp1252')

# Merge the two dataframes based on the 'cert' key
merged_df = pd.merge(institutions_df, banklist_df, on='cert', how='left')

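# Create 'failure' column indicating whether the bank has failed or not.
# After the left merge, 'closing' is NaN for banks absent from banklist.csv
# (assumed here to be the list of failed banks), so a non-null value marks a failure.
merged_df["failure"] = merged_df["closing"].notnull().astype(int)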

# Extract the relevant features and normalize them using min-max normalization
features = ['ASSET', 'DEP', 'DEPDOM', 'NETINC', 'OFFDOM', 'ROA', 'ROAPTX', 'ROE']
scaler = MinMaxScaler()
merged_df[features] = scaler.fit_transform(merged_df[features])

# Handle missing values using SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
merged_df[features] = imputer.fit_transform(merged_df[features])

# Define the target variable (y) and features (X)
y = merged_df['failure']
X = merged_df[features]

# Define the range of k values to test
k_values = list(range(4, 19))

# Initialize variables to store results
accuracy_results = []

# Iterate over different values of k
for k in k_values:
    # Create a KNN classifier with the current value of k
    knn = KNeighborsClassifier(n_neighbors=k)

    # Use KFold cross-validation to evaluate the classifier
    kfold = KFold(n_splits=4, shuffle=True, random_state=0)
    scores = cross_val_score(estimator=knn, X=X, y=y, cv=kfold)

    # Store mean accuracy for each k
    accuracy_results.append(scores.mean())

# Find the optimal k based on accuracy
optimal_k_accuracy = k_values[accuracy_results.index(max(accuracy_results))]

# Print the results
print(f"Optimal k for accuracy: {optimal_k_accuracy}")
python jupyter-notebook classification knn
1 Answer

First, have you considered that there may be more than one optimal value of k?
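To make the point about ties concrete: accuracy_results.index(max(accuracy_results)) returns only the first position of the maximum, so if several k values share the top mean accuracy, only the smallest one is reported. Below is a minimal sketch that lists every tied k. The helper best_k_values and the accuracy numbers are made up for illustration; they are not taken from your data.

```python
import math


def best_k_values(k_values, accuracies, tol=1e-12):
    """Return every k whose mean accuracy ties the maximum (within tol)."""
    best = max(accuracies)
    return [
        k
        for k, acc in zip(k_values, accuracies)
        if math.isclose(acc, best, rel_tol=0.0, abs_tol=tol)
    ]


# Hypothetical mean accuracies for k = 4..18 (made-up numbers, not real results).
k_values = list(range(4, 19))
accuracies = [0.90, 0.91, 0.91, 0.93, 0.92, 0.92, 0.92, 0.92,
              0.93, 0.92, 0.91, 0.91, 0.90, 0.90, 0.89]

# What the code in the question reports: the first maximum only.
first_best = k_values[accuracies.index(max(accuracies))]

# Every k that reaches the same top accuracy.
all_best = best_k_values(k_values, accuracies)

print(first_best, all_best)
```

If something like this turns out to be the case for your real accuracy_results, then 7 and 12 could both be "optimal", and which one gets reported depends only on how ties are broken.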

Otherwise, your code looks correct to me. I'm not familiar with pandas, but at least from the line where you define k_values to the end, the code is correct.

It is always a good idea to print all the important objects (here probably accuracy_results, X, y, etc.) to check whether your code really does what you want it to do.
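As a rough sketch of that kind of inspection: instead of keeping only scores.mean(), you could store the per-fold scores for every k (for example scores_by_k[k] = scores inside your loop) and print them ranked by mean. The helper summarize_scores and the numbers below are hypothetical and only show the shape of the output.

```python
def summarize_scores(scores_by_k):
    """Return (k, mean, per-fold scores) rows sorted by mean accuracy, best first."""
    rows = []
    for k, fold_scores in scores_by_k.items():
        mean = sum(fold_scores) / len(fold_scores)
        rows.append((k, mean, list(fold_scores)))
    # Highest mean first; ties are broken in favour of the smaller k.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical per-fold accuracies (4 folds) for three k values; made-up numbers.
scores_by_k = {
    7: [0.75, 1.0, 0.5, 0.75],
    12: [0.75, 0.75, 0.75, 1.0],
    15: [0.5, 0.75, 0.75, 0.5],
}

rows = summarize_scores(scores_by_k)
for k, mean, folds in rows:
    print(f"k={k:2d}  mean={mean:.4f}  folds={folds}")
```

Looking at the full table usually shows whether the top values are nearly identical (in which case the choice of random_state or preprocessing order could easily flip the winner). It is also worth printing y.value_counts() and X.isna().sum() to confirm that the labels and features look the way you expect.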
