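import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic = pd.read_csv("train.csv")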
titanic_test = pd.read_csv("test.csv")
titanic_train_labels = titanic['Survived'].copy()
titanic = titanic.drop(columns = 'Survived')
# Pipeline
titanic_num = ['Age', 'Fare']
titanic_cat = ['Sex', 'Embarked']
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary'))
])
def full_pipeline(num_attribs, cat_attribs):
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])
titanic_prepared = full_pipeline(titanic_num, titanic_cat)
titanic_clean = titanic_prepared.fit_transform(titanic)
# Here, I'm preparing the test data via the same pipeline
titanic_test_num = titanic_num
titanic_test_cat = titanic_cat
titanic_test_prepared = full_pipeline(titanic_test_num, titanic_test_cat)
titanic_test_clean = titanic_test_prepared.fit_transform(titanic_test)
final_model.fit(titanic_clean, titanic_train_labels)
The code that raises the error given in the title:
final_model.predict(titanic_test_clean)
Printing some information that may hint at the problem:
titanic_clean[0] -> array([-0.56573646, -0.50244517, 1. , 0. , 0. ,
1. , 0. ]) # 7 items
titanic_test_clean[0] -> array([ 0.38623105, -0.49741333, 1. , 0. , 1. ,
0. ]) # 6 items
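From the output above, I think the problem is a mismatch in the number of columns produced by OneHotEncoder. I suspected that the training and test sets have different numbers of categorical values, but they appear to have the same ones.
Link to the dataset -> https://github.com/minsuk-heo/kaggle-titanic/blob/master/input/test.csv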
The error you are seeing is indeed caused by OneHotEncoder.
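To make the mismatch concrete, here is a minimal sketch on toy data (hypothetical values, not the Kaggle files). It assumes scikit-learn 0.24 or later, where OneHotEncoder treats NaN as a category of its own, and it assumes the training set has a missing value in a categorical column that the test set lacks. I have not verified that against the linked files, but it would explain one extra column on the training side.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the real frames (hypothetical values, not the Kaggle data).
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", np.nan],  # assumed: a missing value in train only
}, dtype=object)
toy_test = pd.DataFrame({
    "Sex": ["female", "male", "male"],
    "Embarked": ["Q", "S", "C"],
}, dtype=object)

# Fit a separate encoder on each frame, which is what calling
# fit_transform on both the training and the test set amounts to.
train_enc = OneHotEncoder(drop="if_binary").fit(toy_train)
test_enc = OneHotEncoder(drop="if_binary").fit(toy_test)

# Each encoder learned its own category list, so the widths can differ.
print(train_enc.categories_)
print(test_enc.categories_)

train_width = train_enc.transform(toy_train).shape[1]
test_width = test_enc.transform(toy_test).shape[1]
print(train_width, test_width)
```

On your real data, comparing `titanic[col].unique()` with `titanic_test[col].unique()` for each categorical column should show whether this is what is happening.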
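However, I would like to point out a more important issue: wrapping the pipeline in a function is not good practice here. Each call to full_pipeline(...) builds a new, unfitted ColumnTransformer, so calling fit_transform on the test set learns the categories from the test data alone. Usually we assign the pipeline to a variable, call fit_transform on the training data, and then call transform on the test data: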
# Define the pipelines for numerical and categorical attributes
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary'))
])
# Combine pipelines in a ColumnTransformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, titanic_num),
    ("cat", cat_pipeline, titanic_cat)
])
# Fit and transform the training data
titanic_clean = full_pipeline.fit_transform(titanic)
# Transform the test data using the same transformations
titanic_test_clean = full_pipeline.transform(titanic_test)
# Model fitting and prediction
final_model.fit(titanic_clean, titanic_train_labels)
predictions = final_model.predict(titanic_test_clean)
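This approach ensures that the same transformations are applied to both datasets, so the feature set stays consistent. The OneHotEncoder inside the ColumnTransformer learns the categories from the training data and applies the same encoding to the test data, which resolves the feature mismatch.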
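If you want to check this yourself, the sketch below fits the transformer once on toy data (hypothetical values, not the Kaggle files), confirms that both outputs have the same width, and prints the categories the encoder learned. The names toy_train, toy_test and preprocess are mine, and I am assuming scikit-learn 0.24 or later so that NaN is accepted as a category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames (hypothetical values); train has a missing Embarked, test does not.
toy_train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    "Embarked": pd.Series(["S", "C", "Q", np.nan], dtype=object),
})
toy_test = pd.DataFrame({
    "Age": [34.5, 47.0, 62.0],
    "Fare": [7.83, np.nan, 9.69],
    "Sex": pd.Series(["female", "male", "male"], dtype=object),
    "Embarked": pd.Series(["Q", "S", "C"], dtype=object),
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([("enc", OneHotEncoder(drop="if_binary"))])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex", "Embarked"]),
])

# Fit on the training frame only, then reuse the fitted transformer.
train_clean = preprocess.fit_transform(toy_train)
test_clean = preprocess.transform(toy_test)
print(train_clean.shape, test_clean.shape)

# Categories learned from the training data and reused for the test data.
learned = preprocess.named_transformers_["cat"].named_steps["enc"].categories_
print(learned)
```

Note that this only guarantees matching widths. If the test set contained a category that never appears in the training set, transform would raise an error by default; passing handle_unknown="ignore" to OneHotEncoder is one way to handle that case.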