Getting "ValueError: X has 6 features, but LinearRegression is expecting 7 features as input." possibly due to the ColumnTransformer (pipeline) step

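import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic = pd.read_csv("train.csv")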
titanic_test = pd.read_csv("test.csv")
titanic_train_labels = titanic['Survived'].copy()
titanic = titanic.drop(columns = 'Survived')

# Pipeline
titanic_num = ['Age', 'Fare']
titanic_cat = ['Sex', 'Embarked']

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy='median')),
        ("std_scaler", StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ("enc", OneHotEncoder(drop='if_binary'))
    ])

def full_pipeline(num_attribs, cat_attribs):
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])

titanic_prepared = full_pipeline(titanic_num, titanic_cat)
titanic_clean = titanic_prepared.fit_transform(titanic)

# Here, I'm preparing the test data via the same pipeline
titanic_test_num = titanic_num
titanic_test_cat = titanic_cat
titanic_test_prepared = full_pipeline(titanic_test_num, titanic_test_cat)
titanic_test_clean = titanic_test_prepared.fit_transform(titanic_test)
final_model.fit(titanic_clean, titanic_train_labels)

The code that raises the error in the title:

final_model.predict(titanic_test_clean)

Printing some information that may hint at the problem:

titanic_clean[0] -> array([-0.56573646, -0.50244517,  1.        ,  0.        ,  0.        ,
        1.        ,  0.        ]) # 7 items
titanic_test_clean[0] -> array([ 0.38623105, -0.49741333,  1.        ,  0.        ,  1.        ,
        0.        ]) # 6 items

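From the output above, I think the problem is a mismatch in the number of columns produced by the `OneHotEncoder`. I suspect the training and test sets have a different number of categorical values, yet they appear to be the same.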

Link to the dataset -> https://github.com/minsuk-heo/kaggle-titanic/blob/master/input/test.csv

machine-learning scikit-learn pipeline encoder
1 Answer

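The error you're seeing is indeed caused by the `OneHotEncoder`. Because the test set goes through a newly created `ColumnTransformer` and `fit_transform`, the encoder learns its categories from the test data alone, so it can emit a different number of columns than it did for the training data.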
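To make the mismatch concrete, here is a minimal sketch on toy data. The values are hypothetical and not taken from train.csv or test.csv. It assumes the training `Embarked` column contains a missing value while the test column does not, which is one plausible way to end up with 7 versus 6 columns. Recent scikit-learn versions treat `NaN` in a categorical column as a category of its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy stand-ins for the categorical columns of the real data.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],  # assumed: one missing value in training
})
test = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "C"],          # assumed: no missing values in test
})

# Encoder fitted on the training data: Embarked has C, Q, S and NaN.
enc_train = OneHotEncoder(drop="if_binary").fit(train)

# The mistake from the question: a second encoder fitted on the test data.
enc_test = OneHotEncoder(drop="if_binary").fit(test)

print(enc_train.categories_)
print(enc_test.categories_)

n_train_cols = enc_train.transform(train).shape[1]   # Sex (1) + Embarked (4)
n_refit_cols = enc_test.transform(test).shape[1]     # Sex (1) + Embarked (3)
n_reused_cols = enc_train.transform(test).shape[1]   # same width as training

print(n_train_cols, n_refit_cols, n_reused_cols)
```

Comparing `categories_` of the two encoders on the real data should show which column differs between the training and test sets.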

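However, I'd like to point out a more important issue: wrapping the pipeline in a function is not good practice, since every call returns a new, unfitted transformer. Usually we assign the pipeline to a variable and then call `fit` or `fit_transform` on it: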

# Define the pipelines for numerical and categorical attributes
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary'))
])

# Combine pipelines in a ColumnTransformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, titanic_num),
    ("cat", cat_pipeline, titanic_cat)
])

# Fit and transform the training data
titanic_clean = full_pipeline.fit_transform(titanic)

# Transform the test data using the same transformations
titanic_test_clean = full_pipeline.transform(titanic_test)

# Model fitting and prediction
final_model.fit(titanic_clean, titanic_train_labels)
predictions = final_model.predict(titanic_test_clean)

This approach ensures that the same transformations are applied to both datasets, which keeps the feature set consistent.

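The `OneHotEncoder` inside the `ColumnTransformer` learns the categories from the training data and applies the same encoding to the test data, which resolves the feature mismatch.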
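One way to make it hard to re-fit on the test data by accident is to chain the preprocessing and the estimator into a single `Pipeline`. The sketch below is an illustration on hypothetical toy data, not the asker's actual setup. `LinearRegression` is assumed from the error message. The categorical imputer and `handle_unknown="ignore"` are additions that are not part of the original answer, and `drop="if_binary"` is left out for simplicity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with the same column names as in the question.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
})
labels = pd.Series([0, 1, 1, 1, 0, 0])
test = pd.DataFrame({
    "Age": [34.5, 47.0, np.nan],
    "Fare": [7.83, 7.00, np.nan],
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "C"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    # Assumption: fill missing categories with the most frequent value
    # instead of letting NaN become a category of its own.
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Assumption: encode categories unseen during fit as all zeros
    # instead of raising an error.
    ("enc", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex", "Embarked"]),
])

# Preprocessing and estimator are fitted together, on the training data only.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

model.fit(train, labels)
predictions = model.predict(test)  # the test data is only transformed, never fitted
print(predictions)
```

With this layout, `fit` only ever sees the training data and `predict` reuses the fitted preprocessing, so the number of features stays the same by construction.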
