ValueError: could not convert string to float: 'Curtis RIngraham Directge'


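I'm doing data splitting and cross-validation. For the data split, I need to extract only the test dataset and keep the rest of the data for cross-validation. At the end of cross-validation I get the error ValueError: could not convert string to float: 'Curtis RIngraham Directge'. How can I fix this?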

Data splitting

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = features_df.to_numpy()
labels = labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_splitter = KFold(n_splits=k)

folds_data = [] # this is an inefficient way but still do it

fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))

Cross-validation

from sklearn.metrics import accuracy_score

best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():

    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]

    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []

    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate model on training data
        y_pred_train = model.predict(x_train)

        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)

        # Compute training accuracy
        train_acc = accuracy_score(y_pred_train , y_train.flatten())

        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)

        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)

    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k

    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")

    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k

    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")

    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

I tried looking for any remaining string values in x_train, y_train, x_valid and y_valid, but couldn't find any.

machine-learning classification cross-validation k-fold
1 answer

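This is probably because some columns in your dataset contain categorical data. scikit-learn estimators expect numeric input, so you can first convert those columns to numbers using Method 1 or Method 2.

Method 1: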

# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categories = ["col1", "col2", "col3", "col4"]  # columns with categorical values

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categories)], remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X
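To see Method 1 end to end, here is a minimal sketch on made-up data. The column names `name` and `age`, the toy values, and the choice of `LogisticRegression` are all assumptions for illustration, not taken from the question. Note that selecting columns by name in `ColumnTransformer` needs a DataFrame, so the encoding has to happen before calling `to_numpy()`. Wrapping the encoder and the model in a `Pipeline` also means the encoder is refit whenever the model is fit, which fits naturally into a per-fold loop.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "name" holds strings, "age" is already numeric.
X = pd.DataFrame({
    "name": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "age": [20, 35, 41, 29, 52, 33, 47, 38],
})
y = [0, 1, 0, 1, 1, 0, 1, 0]

categorical_cols = ["name"]

# One-hot encode the string column, pass the numeric column through unchanged.
# handle_unknown="ignore" avoids errors on categories unseen during fit.
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

# Encoder and classifier are fit together, so model.fit accepts the raw DataFrame.
model = Pipeline([
    ("encode", transformer),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
predictions = model.predict(X)

# Inspect the encoded feature matrix: 3 one-hot columns plus "age".
encoded = model.named_steps["encode"].transform(X)
print(encoded.shape)
```

With a pipeline like this, each entry of `all_models` could be a pipeline instead of a bare estimator, provided the folds are built from DataFrame slices rather than NumPy arrays.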

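Method 2: you can convert the values of a specific column to int

import warnings
import pandas as pd

warnings.filterwarnings('ignore')

# Note: assigning the result back to a single column only works when col1 has
# exactly two categories, so that drop_first=True leaves one dummy column.
df['col1'] = pd.get_dummies(df['col1'], drop_first=True)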
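If it is not obvious which columns hold strings, the following sketch may help. It assumes the features start out as a pandas DataFrame, as `features_df` does in the question; the column names and values here are invented for illustration. Calling `to_numpy()` on mixed columns produces an `object` array, which can make the strings easy to miss, so the check is done on the DataFrame itself.

```python
import pandas as pd

# Hypothetical stand-in for features_df with two string columns.
features_df = pd.DataFrame({
    "director": ["Curtis", "Ann", "Curtis"],
    "budget": [1.5, 2.0, 3.25],
    "genre": ["drama", "comedy", "drama"],
})

# Mixed columns collapse to dtype=object once converted to NumPy.
print(features_df.to_numpy().dtype)

# List every column that is not numeric; these are what the model chokes on.
non_numeric_cols = features_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)

# Encode all of them at once; this also handles columns with more than two categories.
encoded_df = pd.get_dummies(features_df, columns=non_numeric_cols, dtype=float)
print(encoded_df.dtypes)
```

After this step, `encoded_df.to_numpy()` yields a purely numeric array that can be passed to `train_test_split` and `KFold` as in the question.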
