Parameter "stratify" in method "train_test_split" (scikit-learn)

Question

I'm trying to use train_test_split from the scikit-learn package, but I'm having trouble with the parameter stratify. Here is the code:

from sklearn import cross_validation, datasets

# load the iris dataset used below
iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

cross_validation.train_test_split(X,y,stratify=y)

However, I keep getting the following error:

raise TypeError("Invalid parameters passed: %s" % str(options))
TypeError: Invalid parameters passed: {'stratify': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}

Does anyone know what is going on? Below is the function documentation.

[...]

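stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the labels array.

New in version 0.17: stratify splitting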

[...]

Answer 1

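The stratify parameter makes the split so that the proportion of values in each resulting sample matches the proportion of values in the array passed to stratify.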

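For example, in a binary classification problem, if y is the dependent variable (the target/label column) in the data frame, with the following values:

0: 25% of the data
1: 75% of the data

then stratify=y will make sure that both the training and the test parts of your random split have:

25% of 0's
75% of 1's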
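A minimal sketch of this behavior, assuming scikit-learn 0.18 or later (where train_test_split is imported from sklearn.model_selection) and NumPy. The toy arrays X and y and the 25%/75% label mix are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 25% labelled 0 and 75% labelled 1.
y = np.array([0] * 25 + [1] * 75)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fraction of 1s in each subset. With stratify=y both match the full data;
# here they are exactly 0.75 because the class counts divide evenly.
train_frac = y_train.mean()
test_frac = y_test.mean()
print(train_frac, test_frac)
```

Dropping stratify=y from the call leaves the class mix of each subset to chance, so the two fractions will generally drift away from 0.75.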

Answer 2

For my future self who comes here via Google: train_test_split is now in model_selection, hence:

from sklearn.model_selection import train_test_split

# given:
# features: xs
# ground truth: ys

x_train, x_test, y_train, y_test = train_test_split(xs, ys,
                                                        test_size=0.33,
                                                        random_state=0,
                                                        stratify=ys)

That is how to use it. Setting random_state is desirable for reproducibility.

Answer 3

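Scikit-Learn is just telling you that it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17, as indicated in the documentation you quoted.

So you just need to update Scikit-Learn.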
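A quick way to check which version is installed before upgrading (for example with pip install -U scikit-learn). The helper name supports_stratify is made up for this sketch, and it only inspects the leading "major.minor" part of the version string:

```python
import re

import sklearn

def supports_stratify(version):
    # Hypothetical helper: True if the version string is 0.17 or newer,
    # the release in which the quoted documentation says stratify was added.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 17)

print(sklearn.__version__, supports_stratify(sklearn.__version__))
```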

Answer 4

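In this context, stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset.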
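To make the idea concrete, here is a hand-rolled sketch of a stratified split in plain Python. It only illustrates the concept and is not how scikit-learn implements it. The function name stratified_split and its per-class rounding rule are assumptions of this sketch:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = {c: [labels[i] for i in test_idx].count(c) for c in "abc"}
print(test_counts)  # {'a': 12, 'b': 6, 'c': 2}
```

The test set keeps the 60/30/10 class mix of the input because each class is sampled separately, which is the property the answer describes.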

Answer 5

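The answer I can give is that stratifying preserves the proportions in which the data is distributed in the target column, and reproduces those same proportions in the output of train_test_split.

For example, suppose the problem is a binary classification problem and the target column has the proportions:

80% = yes
20% = no

Since there are 4 times as many 'yes' as 'no' in the target column, splitting into train and test sets without stratifying risks having only 'yes' fall into the training set and all the 'no' fall into the test set (i.e., the training set might have no 'no' in its target column).

With stratifying, the target column of the training set has 80% 'yes' and 20% 'no', and the target column of the test set likewise has 80% 'yes' and 20% 'no'.

Hence, stratify distributes the target (label) across the train and test sets in the same proportions as in the original dataset.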

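from sklearn.model_selection import train_test_split

# features: feature matrix, target: label column
# train_test_split returns the pieces in this order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size = 0.25,
                                                    stratify = target,
                                                    random_state = 43)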

Answer 6

Try running this code, it "just works":

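from sklearn import datasets
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split is now imported from sklearn.model_selection
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=.8, stratify=y)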

y_test

array([0, 0, 0, 0, 2, 2, 1, 0, 1, 2, 2, 0, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 2,
       1, 2, 1, 1, 0, 2, 1])
