Cross-validation using StratifiedKFold with an exogenous group feature

Question

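Good morning/afternoon, I would like to use cross-validation in sklearn to predict a continuous variable.

I referred to the "Visualizing cross-validation behavior in scikit-learn" page to choose the cross-validation method that fits my problem. https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py

I would like to use StratifiedKFold, but it does not offer a way to stratify on a variable other than the target (the "class"), as in the example on that page.

What I want is to stratify on the "group" variable.

Currently, what I do is: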

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)
cross_val_score(regr, X, y, cv=skf.split(training, groups))

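where regr is my regressor, X is my features, y is my target, and groups is a pandas Series holding the variable I want to stratify on. I have checked that skf.split(training, groups) gives splits that suit my needs, i.e. training and test sets that preserve the original distribution of my groups.

However, I have not been able to check that the cross-validation itself behaves the way I expect. Is my approach right? Is there a way to check it?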

Answer 1

Your approach looks correct to me, even though it is rather uncommon.

You can check that the stratification works with this code:

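import numpy as np
from sklearn.model_selection import StratifiedKFold

# Set up StratifiedKFold, just as you did
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)

# Convert to a NumPy array so the fold indices are applied positionally,
# even if `groups` is a pandas Series with a non-default index
groups_array = np.asarray(groups)

# Check the distribution in each fold
for train_index, test_index in skf.split(X, groups_array):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

    # Distribution of `groups` in the train and test split
    # (np.bincount expects non-negative integer group labels)
    train_groups_distribution = np.bincount(groups_array[train_index])
    test_groups_distribution = np.bincount(groups_array[test_index])

    print("Train Groups Distribution:", train_groups_distribution)
    print("Test Groups Distribution:", test_groups_distribution)
    print("-----")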
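To check that cross_val_score really uses those folds, one option is to materialize the splits as a list, pass that list as cv, and compare the result with scores computed by hand on the same folds. The following is only a sketch: the synthetic X, y and groups and the LinearRegression model are stand-ins assumed for illustration, not the data or regressor from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins (assumptions): 100 samples, 3 features, 4 groups
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
groups = pd.Series(rng.integers(0, 4, size=100))
regr = LinearRegression()

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)

# Materialize the splits so the same folds can be inspected and reused
splits = list(skf.split(X, groups))

# Scores from cross_val_score on the precomputed splits
cv_scores = cross_val_score(regr, X, y, cv=splits)

# Scores from fitting and scoring each fold by hand on the same splits
manual_scores = []
for train_index, test_index in splits:
    model = LinearRegression().fit(X[train_index], y[train_index])
    manual_scores.append(model.score(X[test_index], y[test_index]))
manual_scores = np.array(manual_scores)

# True when both approaches evaluated the same folds
print(np.allclose(cv_scores, manual_scores))
```

Storing the splits in a list also avoids relying on a generator, which can only be consumed once.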

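I would not use this approach if the groups variable has too many distinct/unique values. If each group has only a few samples, StratifiedKFold may warn or raise an error because there are not enough samples to create stratified folds.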
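A quick way to see whether this is a concern is to count the samples per group and compare against n_splits before splitting. The sketch below uses made-up group labels. Pooling rare groups into a single "other" label is one possible workaround assumed here for illustration; it is not part of the original answer.

```python
import pandas as pd

n_splits = 5

# Hypothetical group labels: "c" and "d" have fewer samples than n_splits
groups = pd.Series(["a"] * 20 + ["b"] * 10 + ["c"] * 3 + ["d"] * 2)

# Number of samples per group
counts = groups.value_counts()

# Groups that cannot place at least one sample in every fold
too_small = counts[counts < n_splits]
print(too_small)

# Assumed workaround: replace rare group labels with a pooled "other" label
strat_labels = groups.where(~groups.isin(too_small.index), "other")
print(strat_labels.value_counts())
```

Whether pooling is acceptable depends on what the groups mean; if the pooled label is still too small, a lower n_splits may be the simpler choice.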
