ValueError: The minimum number of groups for any class cannot be less than 2

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

This is the error I get from the code below:

# Imports required by the code below
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.linear_model import (LogisticRegression, RidgeClassifier, SGDClassifier,
                                  PassiveAggressiveClassifier)
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, BaggingClassifier,
                              ExtraTreesClassifier)
from xgboost import XGBClassifier

# List of machine learning algorithms that will be used for predictions
estimator = [('Logistic Regression', LogisticRegression), ('Ridge Classifier', RidgeClassifier), 
             ('SGD Classifier', SGDClassifier), ('Passive Aggressive Classifier', PassiveAggressiveClassifier), 
             ('SVC', SVC), ('Linear SVC', LinearSVC), ('Nu SVC', NuSVC), 
             ('K-Neighbors Classifier', KNeighborsClassifier),
             ('Gaussian Naive Bayes', GaussianNB), ('Multinomial Naive Bayes', MultinomialNB), 
             ('Bernoulli Naive Bayes', BernoulliNB), ('Complement Naive Bayes', ComplementNB), 
             ('Decision Tree Classifier', DecisionTreeClassifier), 
             ('Random Forest Classifier', RandomForestClassifier), ('AdaBoost Classifier', AdaBoostClassifier), 
             ('Gradient Boosting Classifier', GradientBoostingClassifier), ('Bagging Classifier', BaggingClassifier), 
             ('Extra Trees Classifier', ExtraTreesClassifier), ('XGBoost', XGBClassifier)]

# Separating independent features and dependent feature from the dataset
#X_train = titanic.drop(columns='Survived')
#y_train = titanic['Survived']

# Creating a dataframe to compare the performance of the machine learning models
comparison_cols = ['Algorithm', 'Training Time (Avg)', 'Accuracy (Avg)', 'Accuracy (3xSTD)']
comparison_df = pd.DataFrame(columns=comparison_cols)

# Generating training/validation dataset splits for cross validation
cv_split = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

# Performing cross-validation to estimate the performance of the models
for idx, est in enumerate(estimator):

    cv_results = cross_validate(est[1](), X, y, cv=cv_split)

    comparison_df.loc[idx, 'Algorithm'] = est[0]
    comparison_df.loc[idx, 'Training Time (Avg)'] = cv_results['fit_time'].mean()
    comparison_df.loc[idx, 'Accuracy (Avg)'] = cv_results['test_score'].mean()
    comparison_df.loc[idx, 'Accuracy (3xSTD)'] = cv_results['test_score'].std() * 3

comparison_df.set_index(keys='Algorithm', inplace=True)
comparison_df.sort_values(by='Accuracy (Avg)', ascending=False, inplace=True)

I guess the cv_split part is what is causing the problem. I found a solution that uses train_test_split, but that does not return results the way cv_split does.

The strange thing is that this code works fine with another Kaggle problem, so I tried comparing the shapes of the data in the two kernels.

The kernel where everything works fine:

print(X.shape)   # (891, 9)
print(y.shape)   # (891,)

y
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, ...])

=============================================================

The kernel with the problem (the error):

print(X.shape)   # (15035, 24)
print(y.shape)   # (15035,)

y
array([221900., 180000., 510000., ..., 360000., 400000., 325000.])

The shapes of the two kernels look the same to me, and I can't see what the difference between X and y is in the two kernels.
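For reference, a quick check that would make the difference visible beyond the shapes (a minimal sketch, assuming y is available as a NumPy array or pandas Series in both kernels):

import numpy as np

# A stratified splitter treats every distinct value of y as a class,
# so it needs a discrete target with at least 2 samples per class.
print(y.dtype, np.unique(y).size)

In the first kernel y holds only the two values 0 and 1; the continuous prices in the second kernel would report far more distinct values.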

Does anyone have any idea why this error occurs?

python data-science kaggle feature-engineering
1 Answer

It could be that your y is picking up the index values, although I am not sure. You could try StratifiedKFold instead; the following worked for me:

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)   # shuffle=True is needed when random_state is set in recent scikit-learn
results = cross_val_score(model, X_train, y_train, cv=kfold)
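For completeness, here is a minimal, self-contained sketch of that suggestion; the make_classification dataset and the LogisticRegression model are purely illustrative stand-ins for X_train, y_train and model:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic classification data (a hypothetical stand-in for the real X_train / y_train)
X_train, y_train = make_classification(n_samples=891, n_features=9,
                                        n_informative=5, random_state=0)

model = LogisticRegression(max_iter=1000)

# StratifiedKFold keeps the class proportions of y in every fold; like
# StratifiedShuffleSplit, it needs a discrete target with at least n_splits
# samples per class.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X_train, y_train, cv=kfold)

print(results.mean(), results.std())

Note that any stratified splitter assumes a classification target; for a continuous target such as the house prices in the second kernel, a plain KFold or ShuffleSplit would be the usual choice.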
