Is there a difference between Python Boruta and R Boruta?

Question

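I am using the Boruta package in both R and Python on the same dataset. Every other step and method I apply is identical, yet Boruta's feature selection results differ between the two: R selects 46 features, while Python selects 20. What is the reason?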

R

library(Boruta)

# Run Boruta on every predictor except the customer ID
M_boruta <- Boruta(is_churn ~ . - cust_id, data = Mobile, doTrace = 2)

print(M_boruta)

# Plot the importance history; the x axis is drawn manually below
plot(M_boruta, xlab = "", xaxt = "n")

lz_2 <- lapply(1:ncol(M_boruta$ImpHistory), function(i)
  M_boruta$ImpHistory[is.finite(M_boruta$ImpHistory[, i]), i])

names(lz_2) <- colnames(M_boruta$ImpHistory)

Labels_2 <- sort(sapply(lz_2, median))
axis(side = 1, las = 2, labels = names(Labels_2),
     at = 1:ncol(M_boruta$ImpHistory), cex.axis = 0.7)

# Keep confirmed attributes only, without tentative ones
M_boruta_attr <- getSelectedAttributes(M_boruta, withTentative = FALSE)

M_boruta_df <- Mobile[, (names(Mobile) %in% M_boruta_attr)]

str(M_boruta_df)

Python

    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    rfc = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight='balanced', max_depth=50)
    boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2)

    # First column is the target, the remaining columns are the features
    churn_gsm_bor_x = churn_gsm_bor.iloc[:, 1:].values
    churn_gsm_bor_y = churn_gsm_bor.iloc[:, 0].values.ravel()
    boruta_selector.fit(churn_gsm_bor_x, churn_gsm_bor_y)

    print("=============BORUTA==============")
    print(boruta_selector.n_features_)
    print(boruta_selector.support_)
    print(boruta_selector.ranking_)

    # Keep only the columns Boruta confirmed
    churn_gsm_bor_x_filter = boruta_selector.transform(churn_gsm_bor_x)
    print(churn_gsm_bor_x_filter)

Answer 1

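This is probably because the parameters you specified for the random forest classifier in Python differ from the defaults used in R (see https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, or, for newer versions of Boruta, ranger: https://cran.r-project.org/web/packages/ranger/ranger.pdf). I would also point out that you set the maximum tree depth in the Python implementation higher than the recommended value (see https://github.com/scikit-learn-contrib/boruta_py). In my experience this has a large effect on how many features are selected.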

Answer 2

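According to the documentation, BorutaPy should behave exactly like R's Boruta when two_step=False and perc=100. I found that while the algorithm is largely the same, the default parameters of the underlying random forest differ, which may explain the discrepancy. Even if every parameter of the underlying RF were held constant between R and Python, the random processes inside RF would still make the results differ slightly. At least one article tries to address this by bootstrapping Boruta.

I also wanted to reproduce the R behavior in Python, and found that the RF in R (using ranger) differs from sklearn's RF in these main respects (there may be more), as @eskrav has already pointed out.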
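To make the role of perc concrete, here is a minimal sketch of the comparison Boruta makes in each iteration: a real feature scores a hit only if its importance exceeds a threshold taken from the shadow (shuffled) features. With perc=100 that threshold is the maximum shadow importance; lower values use a lower percentile and are therefore more permissive. The helper names and the importance values below are made up for illustration and are not part of either package.

```python
import math


def percentile(values, perc):
    """Linear-interpolation percentile (perc in 0..100) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * perc / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def hits(real_importance, shadow_importance, perc=100):
    """Names of real features whose importance beats the shadow threshold."""
    threshold = percentile(shadow_importance, perc)
    return sorted(name for name, imp in real_importance.items() if imp > threshold)


# Hypothetical importances from a single iteration
shadow = [0.01, 0.02, 0.03, 0.05]
real = {"a": 0.04, "b": 0.06, "c": 0.02}

strict = hits(real, shadow, perc=100)   # threshold = max shadow importance
lenient = hits(real, shadow, perc=50)   # threshold = median shadow importance
print(strict, lenient)
```

With these made-up numbers, lowering perc lets feature "a" through as well, which is why perc has to match before comparing feature counts between the two implementations.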

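max.depth / max_depth:

The BorutaPy documentation suggests setting this to 5, but the default used by Boruta in R is NULL, which corresponds to unlimited depth.

num.trees / n_estimators:

Defaults to 100 in Python, but to 500 in R.

mtry / max_features:

"sqrt" means the square root of the number of features, rounded down. R uses this by default when no argument is given. Python's RandomForestRegressor uses 1.0, i.e. all features (RandomForestClassifier already defaults to "sqrt").

min.node.size / min_samples_split:

Defaults to 2 in Python and to 5 in R.

Based on my investigation, this is the closest RF setup: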

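from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

# Use HPs that should "behave exactly like R"
rf_closest_r = RandomForestRegressor(n_jobs=-1, max_depth=None, n_estimators=500, max_features="sqrt", min_samples_split=5)
# BorutaPy's own n_estimators (default 1000) is applied to the estimator, so set it here as well
boruta_closest_r = BorutaPy(rf_closest_r, n_estimators=500, random_state=1, perc=100, two_step=False)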
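As a rough cross-check of the parameter mapping listed in this answer, the sketch below translates ranger-style arguments into sklearn keyword arguments and computes how many features "sqrt" would try per split. The function names are hypothetical, the defaults are simply the R values quoted in this answer, and the floor-of-square-root rule is assumed to apply on both sides.

```python
import math


def ranger_to_sklearn(num_trees=500, mtry=None, min_node_size=5, max_depth=None):
    """Map ranger-style arguments to sklearn random forest keyword arguments.

    Defaults are the R values quoted in this answer (an assumption, not
    read from either library at runtime).
    """
    return {
        "n_estimators": num_trees,
        "max_features": "sqrt" if mtry is None else mtry,
        "min_samples_split": min_node_size,
        "max_depth": max_depth,  # None means unlimited depth
    }


def sqrt_features(n_features):
    """Number of features tried per split under the 'sqrt' rule (rounded down)."""
    return max(1, math.isqrt(n_features))


params = ranger_to_sklearn()
print(params)
print(sqrt_features(46))
```

The resulting dictionary can be unpacked into the estimator, for example RandomForestRegressor(**params).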

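If you rerun the BorutaPy setup with different random_state values, you may still see slightly different features selected, depending on the complexity of the data.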
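One way to quantify that run-to-run variation is to tally how often each feature is selected across seeds. The sketch below is plain Python and assumes you have already collected one boolean mask per run (for BorutaPy, the fitted selector's support_ attribute). The masks, feature names, and the 0.8 cutoff are invented for illustration.

```python
def selection_frequency(masks):
    """Fraction of runs in which each feature was selected."""
    n_runs = len(masks)
    n_features = len(masks[0])
    return [sum(bool(m[j]) for m in masks) / n_runs for j in range(n_features)]


def stable_features(names, masks, min_fraction=0.8):
    """Features selected in at least min_fraction of the runs."""
    freq = selection_frequency(masks)
    return [name for name, f in zip(names, freq) if f >= min_fraction]


# Hypothetical masks from five runs with different random_state values
names = ["f0", "f1", "f2", "f3"]
masks = [
    [True, False, True, False],
    [True, True, True, False],
    [True, False, True, False],
    [True, False, False, False],
    [True, False, True, False],
]

freq = selection_frequency(masks)
stable = stable_features(names, masks)
print(freq)
print(stable)
```

Comparing the stable set from Python against the R result is usually more informative than comparing two single runs, since a single seed can include or drop borderline features.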
