The class_weight parameter does not affect RandomForestClassifier results on an imbalanced dataset

Question

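I'm fairly new to machine learning, and I'm currently predicting employee attrition on a medium-sized dataset. Everything runs smoothly, but because the dataset is imbalanced I have been trying to add class weights to the model so that, by giving up some precision, I get more recall on the positive class. The problem appears when I try to do this with scikit-learn's RandomForestClassifier. I have tried different approaches (building a separate dictionary for the weights, passing the dictionary directly to the parameter) and it does not affect the model at all. The results always stay the same, and the majority class always scores better than the minority class.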

I have no problem at all with other models.

Am I doing something wrong here?

(This is the dataset I'm using, in case it helps anyone: https://www.kaggle.com/datasets/bhanupratapbiswas/hr-analytics-case-study)

Model code:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

#Running the model with the best hyperparameters
weight_dict = {0: 0.59, 1: 3.12}

model = RandomForestClassifier(bootstrap=False, criterion='gini', max_depth=24, max_features='log2', min_samples_leaf=1, min_samples_split=2, n_estimators=200, class_weight=weight_dict)
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test_outliers)

#Printing the results
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC-ROC Score:', roc_auc_score(y_test, y_pred))
print('Classification Report:', classification_report(y_test, y_pred))

#Plotting the confusion matrix
plt.figure()
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.xticks(rotation=45)
plt.show()

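I expected the minority class to gain recall, with the majority class losing some precision and recall.

I have checked past questions and answers and applied the solutions from several of them, but without success.

Thanks!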

Answer 1

#Running the model with the best hyperparameters

weight_dict = {0: 0.59, 1: 3.12}

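It looks to me as though you ran a hyperparameter search to find the best class weights, which is an interesting approach but not a common one. Class weights are usually computed from the inverse of the class frequencies, or a similar function, to help the model pay more attention to the minority class: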

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights each class inversely to its frequency in y_train
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = dict(zip(np.unique(y_train), weights))
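To make the 'balanced' rule concrete, here is a small self-contained sketch. The 84/16 label split is an assumption chosen purely for illustration, not a figure taken from the dataset in the question. It checks the library's output against the formula n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 84 negatives and 16 positives (illustrative only)
y = np.array([0] * 84 + [1] * 16)

classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)

# The 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual = len(y) / (len(classes) * np.bincount(y))

weight_dict = dict(zip(classes.tolist(), weights.tolist()))
print(weight_dict)
```

Passing class_weight='balanced' directly to RandomForestClassifier applies the same rule without building the dictionary by hand.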

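Since you are working with an imbalanced dataset, you may want to try BalancedRandomForestClassifier from the imbalanced-learn library. With it, there is no need to put weights on the minority class, because for each tree in the forest the majority class is under-sampled again to match the size of the minority class. The result is a forest in which each tree is fitted on a different under-sample.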
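The per-tree under-sampling idea can be sketched roughly with plain scikit-learn and NumPy. This is a simplified illustration on synthetic data, not imbalanced-learn's actual implementation; the dataset, the number of trees, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 90% / 10%), for illustration only
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

trees = []
for _ in range(25):
    # Draw a fresh majority under-sample the same size as the minority class
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx),
                                  replace=False)
    idx = np.concatenate([minority_idx, sampled_majority])
    tree = DecisionTreeClassifier(max_features='sqrt',
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[idx], y[idx]))

# Average the per-tree positive-class probabilities, as a forest would
proba = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
y_pred = (proba >= 0.5).astype(int)
```

In practice you would use the library class itself, imported with from imblearn.ensemble import BalancedRandomForestClassifier, which follows the usual fit/predict interface.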

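Regarding metrics, traditional ones such as accuracy and AUC-ROC may not give a complete picture of model performance, especially for the minority class. Instead, I suggest using the area under the precision-recall curve (AUC-PR). Precision-recall curves are more informative on imbalanced datasets because they focus specifically on how the minority class is handled. I have discussed this in more detail in this answer.

from sklearn.metrics import precision_recall_curve, auc

# Use predicted probabilities of the positive class, not hard labels
precision, recall, _ = precision_recall_curve(y_test, model.predict_proba(X_test)[:, 1])
auc_pr = auc(recall, precision)
print('AUC-PR Score:', auc_pr)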
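For a runnable end-to-end version, the sketch below computes AUC-PR on synthetic data and compares it with the positive-class prevalence, which is roughly what a no-skill classifier would score. The dataset and model settings are assumptions for illustration only. Note that curve-based metrics (including roc_auc_score) are meant to receive scores or probabilities; feeding them hard 0/1 predictions collapses the curve to a single operating point.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 85% / 15%), for illustration only
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                               random_state=0)
model.fit(X_train, y_train)

# Probability of the positive class for each test sample
scores = model.predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, scores)
auc_pr = auc(recall, precision)                       # trapezoidal area
avg_precision = average_precision_score(y_test, scores)  # step-wise summary

# Reference point: fraction of positives in the test set
baseline = y_test.mean()
print('AUC-PR:', auc_pr, 'AP:', avg_precision, 'baseline:', baseline)
```

The two summaries usually land close to each other; average_precision_score avoids the linear interpolation that the trapezoidal rule applies between points.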

Also, instead of scikit-learn's classification_report, consider using imblearn's classification_report_imbalanced, which includes metrics designed specifically for imbalanced data:

from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))
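Among the extra columns in that report are specificity and the geometric mean of sensitivity and specificity. If it helps to see what those numbers mean, here is a hand-computed sketch using only scikit-learn's confusion_matrix. The labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # recall of the positive (minority) class
specificity = tn / (tn + fp)  # recall of the negative (majority) class
g_mean = np.sqrt(sensitivity * specificity)

print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('G-mean:', g_mean)
```

A high accuracy with a low geometric mean is a typical sign that the model is doing well on the majority class only.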

Hope this helps.
