Creating train/test/validation splits with StratifiedKFold

Question

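I am trying to use StratifiedKFold to create train/test/validation splits for a non-sklearn machine learning workflow. The DataFrame therefore needs to be split once and then stay that way.

I am trying to use it as shown below, with .values because I am passing pandas DataFrames: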

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]

This fails with:

ValueError: not enough values to unpack (expected 3, got 2).

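I read through all of the sklearn documentation and ran the example code, but I still do not understand how to use stratified k-fold splits outside of an sklearn cross-validation scenario.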

Edit:

I also tried this:

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)

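This seems to work, although I suspect I am messing up the stratification by doing it this way, since the second split is not stratified.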

Answer 1

StratifiedKFold can only split a dataset into two parts at a time. You get the error because the split() method yields only a tuple of train_index and test_index on each iteration (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94).
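To make the shape of what split() yields concrete, here is a minimal sketch on toy data. The arrays X and y are made up for illustration and are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=False)
folds = list(skf.split(X, y))

for train_index, test_index in folds:
    # Each item is a 2-tuple of integer index arrays, never a 3-tuple,
    # which is why unpacking into three names raises ValueError.
    print("TRAIN:", train_index, "TEST:", test_index)
```

Every sample appears in the test part of exactly one fold, so the folds partition the data rather than providing a separate validation set.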

For this use case, you should first split the data into a validation part and the rest, and then split the rest again into test and train, like this:

from sklearn.model_selection import train_test_split

X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
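Since the question asks for DataFrames that stay split, the same two-step approach can be applied to a DataFrame directly, because train_test_split returns DataFrames when given one. The sketch below assumes a made-up frame with a column named label; both names are hypothetical. It prints the size and positive-class rate of each part so the stratification can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 3 features, imbalanced binary label (90% / 10%).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
df["label"] = np.repeat([0, 1], [900, 100])

# 20% validation first, then 25% of the remaining 80% as test (20% overall).
df_rest, df_val = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0)
df_train, df_test = train_test_split(
    df_rest, test_size=0.25, stratify=df_rest["label"], random_state=0)

for name, part in [("train", df_train), ("val", df_val), ("test", df_test)]:
    # The mean of a 0/1 label is the share of the positive class in that part.
    print(name, len(part), round(part["label"].mean(), 3))
```

The three frames keep the original index, so they can be saved or passed to another framework without going through .values.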

Answer 2

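I am not quite sure whether this question is about KFold or just about stratified splitting, but I wrote this quick wrapper around StratifiedKFold that also yields a cross-validation set.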

from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):

    def split(self, X, y, groups=None):
        s = super().split(X, y, groups)
        for train_indxs, test_indxs in s:
            y_train = y[train_indxs]
            train_indxs, cv_indxs = train_test_split(train_indxs,stratify=y_train, test_size=(1 / (self.n_splits - 1)))
            yield train_indxs, cv_indxs, test_indxs

It can be used like this:

import numpy as np

X = np.random.rand(100)
y = np.random.choice([0,1],100)
g = StratifiedKFold3(10).split(X,y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))
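A different sketch, not taken from the wrapper above: when only one fixed three-way split is needed, StratifiedKFold can be used to label every row with a fold number, and whole folds can then be assigned to test, validation and train. The DataFrame, its column names and the choice of which folds go where are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 100 rows with a balanced binary label.
df = pd.DataFrame({"x": np.arange(100), "label": np.tile([0, 1], 50)})

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
df["fold"] = -1
for fold_id, (_, held_out) in enumerate(skf.split(df[["x"]], df["label"])):
    # Each row lands in the held-out part of exactly one fold.
    df.loc[df.index[held_out], "fold"] = fold_id

# Fold 0 becomes test, fold 1 validation, the remaining eight folds train.
df_test = df[df["fold"] == 0]
df_val = df[df["fold"] == 1]
df_train = df[df["fold"] >= 2]
print(len(df_train), len(df_val), len(df_test))
```

Because every fold is stratified on its own, any union of folds keeps roughly the class proportions of the full data. The granularity of the split sizes is limited to multiples of 1/n_splits.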

Answer 3

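In the stratify parameter, pass the target you want to stratify on. In the first split, pass the complete target array (in my case, y). Then, in the next split, pass the target of the part that is being split again (in my case, y_train):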

X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
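Note that with test_size=0.2 in both calls, the second split takes 20% of the remaining 80%, so the overall proportions are 64% train, 16% validation and 20% test. If the fractions are meant relative to the full data, the second test_size has to be rescaled. The helper below is a hypothetical sketch of that arithmetic; its name, parameters and the toy data are assumptions, not part of sklearn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.1, test_frac=0.2, seed=42):
    """Hypothetical helper: stratified train/val/test split with overall fractions."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed, stratify=y)
    # val_frac is relative to the full data, so rescale it to the remainder.
    rel_val = val_frac / (1.0 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=seed, stratify=y_rest)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data (hypothetical): three classes with shares 50% / 30% / 20%.
X = np.random.default_rng(0).normal(size=(1000, 4))
y = np.repeat([0, 1, 2], [500, 300, 200])
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(len(y_train), len(y_val), len(y_test))
```

Passing stratify=y_rest in the second call is what keeps the class shares intact in the validation set as well.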

Answer 4

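Here is my attempt, which nests another StratifiedGroupKFold inside the first split. First, we work out how many splits are needed to get the train indices; then we look at the ratio between val and test and split the held-out part accordingly.

Note that there are some caveats I have not checked. For example, when the number of groups is low, we may "run out" of groups before reaching the test/validation split. With 10 groups and a 0.9, 0.05, 0.05 split, the train set will use up 9 groups, leaving only 1 group to be shared between test and val.

Also, this code does not work if the requested train ratio is not the largest one. In that case, you should invert train and val-test in the same way I do for the inner val and test split.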

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# set the ratios for train, validation, and test splits
train_ratio = 0.5
val_ratio = 0.1
test_ratio = 0.4

assert train_ratio >= 0.5, "This code only works when train_ratio is the biggest"

num_splits = int(1 / (val_ratio + test_ratio))
N = 10000
X = np.random.rand(N, 10)
groups = np.random.randint(0, 100, N)
y = np.random.randint(0, 10, N)

num_folds = 3
for fold in range(num_folds):
    # We instantiate a new one every time since we control the number of folds ourselves
    sgkf = StratifiedGroupKFold(n_splits=num_splits, random_state=fold, shuffle=True)
    for train_indices, val_test_indices in sgkf.split(X, y, groups):

        X_train = X[train_indices]
        y_train = y[train_indices]
        groups_train = groups[train_indices]

        X_val_test = X[val_test_indices]
        y_val_test = y[val_test_indices]
        groups_val_test = groups[val_test_indices]

        # Now we have to split it based on the ratio between test and val
        split_ratio = test_ratio / val_ratio
        test_val_order = True
        if split_ratio < 1: # In this case we invert the ratio and the assignment of test-val / val-test
            test_val_order = False
            split_ratio = 1 / split_ratio

        split_ratio = int(split_ratio) + 1
        sgkf2 = StratifiedGroupKFold(n_splits=split_ratio)
        i1, i2 = next(sgkf2.split(X_val_test, y_val_test, groups_val_test))
        if test_val_order:
            test_indices = i1
            val_indices = i2
        else:
            test_indices = i2
            val_indices = i1

        X_val = X_val_test[val_indices]
        groups_val = groups_val_test[val_indices]

        X_test = X_val_test[test_indices]
        groups_test = groups_val_test[test_indices]

        print("train groups = ", np.unique(groups_train))
        print("val groups =", np.unique(groups_val))
        print("test groups =", np.unique(groups_test))
        print(X_train.shape, X_val.shape, X_test.shape)

    print()
