I have a dataset with both string and float columns. I want to train a KNN model on it, but it raises a ValueError saying "could not convert string to float".
inputs=data.drop(columns=['HeartDisease'])
output=data['HeartDisease']
import sklearn
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(inputs,output,train_size=0.8)
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=31)
model.fit(x_train,y_train)
I have also attached an image of the dataset.
I expected the model to train on this dataset.
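No machine learning model can use string data as is. You have to preprocess your inputs to convert them to numeric types. Outside of natural language processing, your text columns probably take only a few distinct values (categorical features).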
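As a minimal sketch of why the error appears, the snippet below fits a KNN classifier on a tiny table that still contains a text column. The rows are made up for illustration and are not taken from the real dataset; only the column names mirror it.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (invented for illustration): one text column, one numeric column
X = pd.DataFrame({
    "ChestPainType": ["ATA", "NAP", "ASY", "TA", "ASY", "NAP"],
    "MaxHR": [172, 156, 98, 108, 122, 170],
})
y = [0, 1, 1, 0, 1, 0]

model = KNeighborsClassifier(n_neighbors=3)
try:
    # scikit-learn tries to cast every feature to float and fails on the text column
    model.fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message may differ between library versions, but the cause is the same: the text column has to be encoded before fitting.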
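Take the 'ChestPainType' column as an example: it should contain only 4 values, ['ATA', 'NAP', 'ASY', 'TA']. You now have to convert these strings to numbers, for example 'ATA': 0, 'NAP': 1, 'ASY': 2, 'TA': 3. In Pandas you can use pd.factorize or pd.get_dummies to do that, but if you use sklearn, try LabelEncoder (especially for the y target, when required) or OneHotEncoder (sometimes OrdinalEncoder). To apply an encoder to selected columns only, use ColumnTransformer.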
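The encoding options named above can be compared side by side on a small column. This is only a sketch: the five sample values are made up, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column (invented values) using the four categories from the dataset
chest_pain = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY", "TA", "ASY"]})

# pandas: integer codes assigned in order of first appearance
codes, uniques = pd.factorize(chest_pain["ChestPainType"])

# pandas: one indicator column per category
dummies = pd.get_dummies(chest_pain["ChestPainType"])

# scikit-learn: integer codes assigned in sorted category order
ordinal = OrdinalEncoder().fit_transform(chest_pain)

# scikit-learn: one-hot encoding (sparse by default, converted to dense here)
one_hot = OneHotEncoder().fit_transform(chest_pain).toarray()

print(codes, list(uniques))
print(ordinal.ravel())
print(one_hot.shape)
```

Note that pd.factorize and OrdinalEncoder assign different integers to the same category, so whichever you choose, fit it once and reuse it for both training and test data. For a distance-based model such as KNN, one-hot encoding is often the safer choice for unordered categories, because integer codes imply an ordering that does not exist.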
Reproducible example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
# https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
data = pd.read_csv('heart.csv')
features = data.drop(columns=['HeartDisease'])
target = data['HeartDisease']
# Text features to convert to numeric codes; OrdinalEncoder sorts categories, e.g. Sex: 'F' -> 0.0, 'M' -> 1.0
feat_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
ct = ColumnTransformer(
    transformers=[('le', OrdinalEncoder(), feat_cols)],
    remainder='passthrough'
)
# Convert your data as numeric values
X = ct.fit_transform(features)
y = target.to_numpy()
# Create 2 datasets for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
# Missing step, use `StandardScaler` to normalize numeric values
# Train your model
model = KNeighborsClassifier(n_neighbors=31)
model.fit(X_train, y_train)
# Evaluate your model (63% here)
model.score(X_test, y_test)
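The missing scaling step mentioned in the comments could look like the sketch below. It is not the code from the answer: it swaps in OneHotEncoder for the text columns, adds StandardScaler for the numeric ones, and wraps everything in a pipeline. The data frame is a random stand-in so the snippet runs without heart.csv; only the column names mirror the real dataset, and the resulting accuracy is meaningless.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random stand-in for heart.csv (assumption: values are invented, for illustration only)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Age": rng.integers(30, 80, size=n),
    "MaxHR": rng.integers(80, 200, size=n),
    "Sex": rng.choice(["M", "F"], size=n),
    "ChestPainType": rng.choice(["ATA", "NAP", "ASY", "TA"], size=n),
    "HeartDisease": rng.integers(0, 2, size=n),
})

features = demo.drop(columns=["HeartDisease"])
target = demo["HeartDisease"]

text_cols = ["Sex", "ChestPainType"]
num_cols = ["Age", "MaxHR"]

# Encode text columns and scale numeric columns in one step
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), text_cols),
        ("scale", StandardScaler(), num_cols),
    ]
)

# The pipeline fits the preprocessing on the training split only
pipeline = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=31))

X_train, X_test, y_train, y_test = train_test_split(
    features, target, train_size=0.8, random_state=0
)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Scaling matters for KNN because distances are otherwise dominated by the columns with the largest numeric range. Putting the transformers inside the pipeline also keeps test-set statistics out of the fitted scaler.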