Split a dataset into training and test sets while respecting the class distribution [duplicate]

Question. Votes: -1. Answers: 2.

I want to perform 10 runs of a machine learning algorithm on a given dataset whose class distribution is as follows:

np.unique(x[:,24], return_counts=True)
(array([1., 2.]), array([700, 300]))

That is, 70% of my data belongs to class 1 and 30% to class 2.

Below is a snapshot of my data. The last column holds the class label (1 or 2).

1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1
2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2
4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1
1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1
1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2
4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1
4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1
2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1
4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2
2,12,2,13,1,2,2,1,3,25,3,1,1,1,1,1,0,1,0,1,0,0,0,1,2
1,48,2,43,1,2,2,4,2,24,3,1,1,1,1,0,0,1,0,1,0,0,0,1,2
2,12,2,16,1,3,2,1,3,22,3,1,1,2,1,0,0,1,0,0,1,0,0,1,1
1,24,4,12,1,5,3,4,3,60,3,2,1,1,1,1,0,1,0,0,1,0,1,0,2
1,15,2,14,1,3,2,4,3,28,3,1,1,1,1,1,0,1,0,1,0,0,0,1,1
1,24,2,13,2,3,2,2,3,32,3,1,1,1,1,0,0,1,0,0,1,0,1,0,2
4,24,4,24,5,5,3,4,2,53,3,2,1,1,1,0,0,1,0,0,1,0,0,1,1
1,30,0,81,5,2,3,3,3,25,1,3,1,1,1,0,0,1,0,0,1,0,0,1,1
2,24,2,126,1,5,2,2,4,44,3,1,1,2,1,0,1,1,0,0,0,0,0,0,2
4,24,2,34,3,5,3,2,3,31,3,1,2,2,1,0,0,1,0,0,1,0,0,1,1
4,9,4,21,1,3,3,4,3,48,3,3,1,2,1,1,0,1,0,0,1,0,0,1,1
1,6,2,26,3,3,3,3,1,44,3,1,2,1,1,0,0,1,0,1,0,0,0,1,1
1,10,4,22,1,2,3,3,1,48,3,2,2,1,2,1,0,1,0,1,0,0,1,0,1
2,12,4,18,2,2,3,4,2,44,3,1,1,1,1,0,1,1,0,0,1,0,0,1,1
4,10,4,21,5,3,4,1,3,26,3,2,1,1,2,0,0,1,0,0,1,0,0,1,1
1,6,2,14,1,3,3,2,1,36,1,1,1,2,1,0,0,1,0,0,1,0,1,0,1
4,6,0,4,1,5,4,4,3,39,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
3,12,1,4,4,3,2,3,1,42,3,2,1,1,1,0,0,1,0,1,0,0,0,1,1
2,7,2,24,1,3,3,2,1,34,3,1,1,1,1,0,0,0,0,0,1,0,0,1,1
1,60,3,68,1,5,3,4,4,63,3,2,1,2,1,0,0,1,0,0,1,0,0,1,2
2,18,2,19,4,2,4,3,1,36,1,1,1,2,1,0,0,1,0,0,1,0,0,1,1
1,24,2,40,1,3,3,2,3,27,2,1,1,1,1,0,0,1,0,0,1,0,0,1,1
2,18,2,59,2,3,3,2,3,30,3,2,1,2,1,1,0,1,0,0,1,0,0,1,1
4,12,4,13,5,5,3,4,4,57,3,1,1,1,1,0,0,1,0,1,0,0,1,0,1
3,12,2,15,1,2,2,1,2,33,1,1,1,2,1,0,0,1,0,0,1,0,0,0,1
2,45,4,47,1,2,3,2,2,25,3,2,1,1,1,0,0,1,0,0,1,0,1,0,2
4,48,4,61,1,3,3,3,4,31,1,1,1,2,1,0,0,1,0,0,0,0,0,1,1

The full dataset can be found here.

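I want to split the data into 90% for training and 10% for testing. However, every split must preserve the class proportions (e.g., in both the training and validation splits, 70% of the data must be class 1 and 30% class 2).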

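I know how to split the data into training and test sets in the simple case, but I don't know how to make the split respect the class distribution described above. How can I do this in Python?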

python machine-learning cross-validation
2 Answers
1
vote

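You can use RepeatedStratifiedKFold, which, as the name suggests, repeats a stratified K-Fold cross-validator n times. To repeat the process 10 times, set n_repeats=10; to get a train/test size ratio of roughly 9:1, set n_splits=10: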

from sklearn.model_selection import RepeatedStratifiedKFold

# `a` is assumed to be a 2-D NumPy array of the data (last column = class label)
X = a[:,:-1]
y = a[:,-1]

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=2)

for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(f'\nClass 1: {((y_train==1).sum()/len(y_train))*100:.0f}%') 
    print(f'\nShape of train: {X_train.shape[0]}')
    print(f'Shape of test: {X_test.shape[0]}')

Output, using only the 37-row snapshot from the question as `a`:

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4
...
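
Note that n_splits=10 with n_repeats=10 yields 100 train/test pairs in total. If exactly ten independent 90%/10% splits are wanted, a different splitter, StratifiedShuffleSplit, may be a closer fit. The sketch below is only an illustration: the arrays X and y are synthetic stand-ins (an assumption, not the asker's real data) that mimic the 700/300 class counts from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the asker's data (assumption): 1000 rows,
# 24 feature columns, labels 700 x class 1 and 300 x class 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))
y = np.array([1.0] * 700 + [2.0] * 300)

# Ten independent runs, each a stratified 90% / 10% split.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.10, random_state=2)

splits = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    splits.append((train_index, test_index))
    # Report split sizes and the share of class 1 on each side.
    print(
        f"train={len(y_train)} test={len(y_test)} "
        f"class 1 in train={(y_train == 1).mean():.2f} "
        f"class 1 in test={(y_test == 1).mean():.2f}"
    )
```

Unlike K-Fold, the test sets drawn this way may overlap between runs, which is usually acceptable when the runs are meant to be independent repetitions.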

0
votes

A well-known way to split data into training and test sets is scikit-learn's train_test_split.

API documentation: sklearn.model_selection.train_test_split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

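You can experiment with the random_state variable (a seed) until the class proportions come out right. By default, train_test_split does not enforce the proportions, although a random split generally follows the proportions in the population.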
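
Rather than searching for a lucky seed, train_test_split also accepts a stratify argument, which makes the split follow the class proportions of the labels passed to it. A minimal sketch, assuming synthetic stand-in arrays X and y (not the asker's real data) with the 700/300 class counts from the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (assumption): 1000 rows,
# 24 feature columns, labels 700 x class 1 and 300 x class 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))
y = np.array([1.0] * 700 + [2.0] * 300)

runs = []
for seed in range(10):  # ten runs, one seed per run
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed
    )
    runs.append((y_train, y_test))
    # Share of class 1 on each side of the split.
    print(
        f"seed={seed} "
        f"train={(y_train == 1).mean():.2f} "
        f"test={(y_test == 1).mean():.2f}"
    )
```

Changing the seed on each run varies which rows land in the test set while the class ratio stays fixed by stratify=y.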
