I am trying to write a machine learning model that uses RandomForestClassifier to predict breast cancer. The code is as follows:
from sklearn.model_selection import train_test_split
print("Shape of training set:", x_train.shape)
print("Shape of test set:", x_test.shape)
Shape of training set: (292, 30)
Shape of test set: (91, 29)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(x_train)
X_test = ss.fit_transform(x_test)
Instantiating RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
rand_clf = RandomForestClassifier(criterion = 'entropy', max_depth = 11, max_features = 'auto', min_samples_leaf = 2, min_samples_split = 3, n_estimators = 130)
rand_clf.fit(X_train, y_train)
This is where I am stuck:
y_pred = rand_clf.predict(X_test)
The error shown is:
ValueError: X has 29 features, but RandomForestClassifier is expecting 30 features as input
How can I fix this? As it stands, x_train and x_test do not have the same number of columns.
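The problem is here:
Shape of training set: (292, 30)
Shape of test set: (91, 29)
The training and test sets must have the same number of features: either 29 for both or 30 for both. The classifier was fitted on 30 columns, so predict() rejects a test set with 29. Check the step that builds x_train and x_test (not shown in the question) for a column that is dropped from only one of them.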
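For illustration, here is a minimal sketch of a pipeline where both sets end up with the same columns. The question does not show how the data is loaded or split, so the synthetic data from make_classification and the random_state values below are assumptions standing in for the real dataset. Two further changes are mine rather than the asker's: the scaler is fitted on the training set only and then reused with transform() on the test set, and max_features = 'auto' is left out because recent scikit-learn releases no longer accept that value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the real data (383 rows, 30 feature columns).
X, y = make_classification(n_samples=383, n_features=30, random_state=0)

# Split features and labels in one call, so both subsets keep all 30 columns.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=91, random_state=0
)

# Fail early if the column counts ever diverge.
assert x_train.shape[1] == x_test.shape[1], "train/test feature counts differ"

# Fit the scaler on the training data only, then reuse it for the test data.
ss = StandardScaler()
X_train = ss.fit_transform(x_train)
X_test = ss.transform(x_test)

rand_clf = RandomForestClassifier(
    criterion="entropy",
    max_depth=11,
    min_samples_leaf=2,
    min_samples_split=3,
    n_estimators=130,
    random_state=0,
)
rand_clf.fit(X_train, y_train)

# Works because X_test has the same 30 features the model was fitted on.
y_pred = rand_clf.predict(X_test)
```

If the real data is a pandas DataFrame, the same idea applies: drop the target column (and any ID column) once, before calling train_test_split, rather than separately on each subset.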