Exhaustive feature selection in scikit-learn?

Question

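Is there any built-in brute-force feature selection method in scikit-learn, i.e. one that exhaustively evaluates all possible combinations of the input features and then finds the best subset? I am familiar with the "Recursive feature elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after the other.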

Answer 1

No, best subset selection is not implemented. The easiest way to do it is to write it yourself. This should get you started:

from itertools import chain, combinations

import numpy as np
from sklearn.model_selection import cross_val_score


def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # Every non-empty subset of column indices, from size 1 up to n_features.
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        # X is assumed to be a NumPy array; list(subset) selects the columns.
        score = cross_val_score(estimator, X[:, list(subset)], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score

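This performs k-fold cross-validation inside the loop, so when given data with p features it will fit about k · 2ᵖ estimators (k for each of the 2ᵖ − 1 non-empty subsets).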
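
To get a feel for how quickly that grows, here is a small sketch that is not part of the original answer. It counts the subsets the loop above visits and multiplies by the number of folds. The helper name `count_fits` is made up for illustration, and it only uses the standard library.

```python
from itertools import chain, combinations


def count_fits(n_features, cv):
    """Count the model fits done by an exhaustive search with cv-fold CV.

    Hypothetical helper for illustration: it enumerates every non-empty
    feature subset, the same way the search loop does, and multiplies
    the number of subsets by the number of folds.
    """
    subsets = chain.from_iterable(
        combinations(range(n_features), k + 1) for k in range(n_features)
    )
    n_subsets = sum(1 for _ in subsets)
    # Closed form: each feature is either in or out, minus the empty set.
    assert n_subsets == 2 ** n_features - 1
    return cv * n_subsets


for p in (5, 10, 20):
    print(p, count_fits(p, cv=3))
```

With 3 folds this already means over three million fits at 20 features, so the brute-force approach is only practical for a small number of features.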

Answer 2

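Combining Fred Foo's answer with the comments by nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1, "Best Subset Selection") of the book "An Introduction to Statistical Learning with Applications in R".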

from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score


def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
    estimator must have a fit and score functions.
    X must be a DataFrame.'''

    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores

See the notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection

Answer 3

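You might want to take a look at MLxtend's Exhaustive Feature Selector. It is obviously not built into scikit-learn (yet?), but it does support scikit-learn's classifier and regressor objects.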
