Creating train/test/validation splits with StratifiedKFold


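I am trying to use StratifiedKFold to create train/test/validation splits for a non-sklearn machine learning workflow, so the DataFrame needs to be split once and then stay that way.

I am using .values as shown below because I am passing pandas DataFrames: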

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]

This fails with:

ValueError: not enough values to unpack (expected 3, got 2).

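I read through all of the sklearn documentation and ran the example code, but I still don't understand how to use stratified k-fold splits outside of an sklearn cross-validation scenario.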

Edit:

I also tried this:

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)

This seems to work, although I suspect I am messing up the stratification by doing it this way.

python pandas scikit-learn cross-validation data-science
4 Answers
3 votes

StratifiedKFold can only split a dataset into two parts per fold. You get the error because the split() method yields only a tuple of train_index and test_index (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94).
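To make the two-element tuples visible, here is a minimal sketch on toy data (the arrays below are made up for illustration and are not from the question). Each item produced by split() unpacks into exactly two index arrays, which is why unpacking three names raises the ValueError:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 12 samples, 2 features, two balanced classes
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3)
splits = list(skf.split(X, y))

# Every fold is a (train_index, test_index) pair; there is no third element
for train_index, test_index in splits:
    print(len(train_index), len(test_index))
```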

For this use case, you should first split the data into a validation set and the rest, then split the rest again into test and train sets, like this:

X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
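As a sanity check, the same two-step split can be run on a small made-up DataFrame to confirm the resulting sizes and class balance. This is only a sketch: the column names, the 30% positive rate, and random_state=0 are assumptions added for reproducibility. Passing DataFrames directly (without .values) returns DataFrames, so the parts can be kept as they are.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows, 30% of them in the positive class
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 30 + [0] * 70, name="label")

# Step 1: hold out 20% of all rows as the validation set
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: 25% of the remaining 80% is another 20% of the full data
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Print the size and positive-class share of each part
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 2))
```

With these numbers the parts come out as 60/20/20 rows, each with the same 30% positive share as the full data.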

2 votes

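I'm not quite sure whether this question is about KFold or just about a stratified split, but I wrote this quick wrapper around StratifiedKFold that also yields a cross-validation set.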

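import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):
    """StratifiedKFold that yields (train, validation, test) index arrays."""

    def split(self, X, y, groups=None):
        y = np.asarray(y)  # allow positional indexing for pandas Series too
        for train_indxs, test_indxs in super().split(X, y, groups):
            # Carve a validation fold, the same size as the test fold, out of
            # the training indices. Requires n_splits >= 3 so test_size < 1.
            train_indxs, cv_indxs = train_test_split(
                train_indxs, stratify=y[train_indxs],
                test_size=1 / (self.n_splits - 1))
            yield train_indxs, cv_indxs, test_indxs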

It can be used like this:

import numpy as np

X = np.random.rand(100)
y = np.random.choice([0,1],100)
g = StratifiedKFold3(10).split(X,y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))
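Because split() yields positional indices, a DataFrame would be sliced with .iloc rather than with plain brackets. The sketch below does the same validation carving without subclassing, on a made-up frame whose index deliberately does not start at 0; the column names, n_splits=5, and random_state=0 are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy frame (assumption) with a non-default index, to show why .iloc matters
df = pd.DataFrame(
    {"feature": np.arange(100.0), "label": [0, 1] * 50},
    index=np.arange(1000, 1100),
)
X, y = df[["feature"]], df["label"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = []
for rest_idx, test_idx in skf.split(X, y):
    # Carve a validation set out of the non-test rows, stratified on their labels
    train_idx, valid_idx = train_test_split(
        rest_idx, test_size=0.25, stratify=y.iloc[rest_idx], random_state=0)
    # Positional indices go through .iloc, so each part stays a DataFrame
    folds.append((df.iloc[train_idx], df.iloc[valid_idx], df.iloc[test_idx]))
```

Each entry of folds is a (train, validation, test) triple of DataFrames that can be handed to a non-sklearn workflow as is.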

0 votes

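Pass the target you want to stratify on to the stratify parameter. In the first split, pass the complete target array (y in my example). Then, in the next split, pass the target of the part being split (y_train in my example):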

X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
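Note that the second test_size is a fraction of the remaining training data, not of the full dataset, so 0.2 followed by 0.2 gives roughly a 64/16/20 train/val/test split. A small sketch of the arithmetic for converting overall ratios into the two test_size values (two_stage_sizes is a hypothetical helper name, not part of scikit-learn):

```python
def two_stage_sizes(val_ratio, test_ratio):
    """Return the test_size values for two chained train_test_split calls.

    Both ratios are fractions of the full dataset.
    """
    first = test_ratio                        # test set is cut from the full data
    second = val_ratio / (1.0 - test_ratio)   # validation is cut from what remains
    return first, second

# Overall 64/16/20 corresponds to test_size=0.2 in both calls
first, second = two_stage_sizes(val_ratio=0.16, test_ratio=0.2)

# Overall 70/15/15 needs a larger fraction in the second call
first_b, second_b = two_stage_sizes(val_ratio=0.15, test_ratio=0.15)
```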

0 votes

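Here is my attempt, which nests another StratifiedGroupKFold inside the first split. First, we work out how many splits are needed to obtain the training indices, then we look at the ratio between val and test and split accordingly.

Note that there are some caveats I have not checked. For example, when the number of groups is low, we may "run out" of groups before reaching the test/validation split. With 10 groups and a 0.9/0.05/0.05 split, the train set will use up 9 groups, leaving only 1 group to be shared between test and val.

Also, this code does not work if the requested train ratio is not the largest one. In that case you should invert train and val-test again, as I did for the inner val and test split.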

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# set the ratios for train, validation, and test splits
train_ratio = 0.5
val_ratio = 0.1
test_ratio = 0.4

assert train_ratio >= 0.5, "This code only works when train_ratio is the biggest"

num_splits = int(1 / (val_ratio + test_ratio))
N = 10000
X = np.random.rand(N, 10)
groups = np.random.randint(0, 100, N)
y = np.random.randint(0, 10, N)

num_folds = 3
for fold in range(num_folds):
    # We instantiate a new one every time since we control the number of folds ourselves
    sgkf = StratifiedGroupKFold(n_splits=num_splits, random_state=fold, shuffle=True)
    for train_indices, val_test_indices in sgkf.split(X, y, groups):

        X_train = X[train_indices]
        y_train = y[train_indices]
        groups_train = groups[train_indices]

        X_val_test = X[val_test_indices]
        y_val_test = y[val_test_indices]
        groups_val_test = groups[val_test_indices]

        # Now we have to split it based on the ratio between test and val
        split_ratio = test_ratio / val_ratio
        test_val_order = True
        if split_ratio < 1: # In this case we invert the ratio and the assignment of test-val / val-test
            test_val_order = False
            split_ratio = 1 / split_ratio

        split_ratio = int(split_ratio) + 1
        sgkf2 = StratifiedGroupKFold(n_splits=split_ratio)
        i1, i2 = next(sgkf2.split(X_val_test, y_val_test, groups_val_test))
        if test_val_order:
            test_indices = i1
            val_indices = i2
        else:
            test_indices = i2
            val_indices = i1

        X_val = X_val_test[val_indices]
        groups_val = groups_val_test[val_indices]

        X_test = X_val_test[test_indices]
        groups_test = groups_val_test[test_indices]

        print("train groups = ", np.unique(groups_train))
        print("val groups =", np.unique(groups_val))
        print("test groups =", np.unique(groups_test))
        print(X_train.shape, X_val.shape, X_test.shape)

    print()