Exhaustive feature selection in scikit-learn?

Problem description (votes: 9, answers: 3)

Is there any built-in brute-force feature selection method in scikit-learn, i.e. one that exhaustively evaluates all possible combinations of the input features and then finds the best subset? I am familiar with the "Recursive Feature Elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after another.

scikit-learn
3 Answers

6 votes

No, best subset selection is not implemented. The easiest thing to do is write it yourself. This should get you started:

import numpy as np
from itertools import chain, combinations
from sklearn.model_selection import cross_val_score

def best_subset_cv(estimator, X, y, cv=3):
    # Enumerate every non-empty subset of the feature indices.
    n_features = X.shape[1]
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score

This performs k-fold cross-validation inside the loop, so given data with p features it will fit k·2^p estimators.
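For illustration, a minimal usage sketch of the helper above (the dataset and estimator here are just assumptions; with p features the search evaluates 2^p − 1 subsets, so keep p small):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)        # 4 features -> 15 candidate subsets
clf = LogisticRegression(max_iter=1000)
subset, score = best_subset_cv(clf, X, y, cv=3)
print(subset, score)                      # column indices of the best subset and its mean CV accuracy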


1 vote
Combining Fred Foo's answer with the comments from nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (from the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) of the book "An Introduction to Statistical Learning with Applications in R".

import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score

def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
       estimator must have fit and score functions.
       X must be a DataFrame.'''
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores

See the notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection
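As a rough illustration of how this function might be called (the DataFrame, target and estimator below are made-up placeholders, not data from the notebook):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical predictors and a numeric target; any DataFrame/Series pair works.
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'x2': [2, 1, 4, 3, 6, 5, 8, 7],
                   'x3': [1, 1, 2, 2, 3, 3, 4, 4]})
target = pd.Series([3.1, 2.9, 7.2, 6.8, 11.1, 10.9, 15.2, 14.8])

best_idx, best_cv_score, best_per_size, cv_scores = best_subset(
    LinearRegression(), df, target, max_size=3, cv=2)
print(best_idx, best_cv_score)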

0 votes
You might want to look at MLxtend's Exhaustive Feature Selector. It is obviously not built into scikit-learn (yet?), but it does support its classifier and regressor objects.
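A short sketch of how that selector is typically wired up (the estimator and parameter values here are illustrative choices, not requirements):

from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
efs = ExhaustiveFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                min_features=1, max_features=4,
                                scoring='accuracy', cv=5)
efs = efs.fit(X, y)
print(efs.best_idx_, efs.best_score_)   # indices of the best feature subset and its mean CV score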