R tidymodels random forest classification: error when predicting the target variable

Problem description

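I have a dataset with 90 variables and 200,000 observations. It is imbalanced: the target variable is 1 in only 4% of cases and 0 in all others.

I split it into two sets: a fitting sample (185,000 observations) and a holdout sample, "df_holdout" (15,000 observations). For model fitting, I took from the fitting sample all cases where the target variable is 1 and the same number of cases where it is 0. (The resulting set, "df", contains 25,000 observations in total.)

The variables are named var_01, var_02, var_03, ... var_90, where var_90 has been renamed to "target".

I have a set of workflows.

Here is the code I use for model fitting: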

library(tidymodels)

# Random forest specification; mtry and min_n are tuned
rf_tune <- parsnip::rand_forest(mode = "classification",
                                mtry = tune(),
                                trees = 1000,
                                min_n = tune()) %>%
  set_engine("ranger", importance = "impurity")

# Polynomial SVM specification; all four hyperparameters are tuned
svm_tune <- parsnip::svm_poly(mode = "classification",
                              engine = "kernlab",
                              cost = tune(),
                              degree = tune(),
                              scale_factor = tune(),
                              margin = tune())

  

  # Create data split object
  df_split <- initial_split(df, prop = 0.75,
                            strata = target)
  
  # Create the training data
  df_train <- df_split %>% 
    training()
 
  df_test <- df_split %>% 
    testing()
   
  # create a recipe
  df_recipe <- recipe(target ~., data = df_train) %>% 
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric()) %>% 
    step_corr(threshold = 0.7) %>% 
    step_dummy(all_nominal_predictors(), -all_outcomes())
  
  df_recipe %>% 
    prep(df_train) %>% 
    bake(df_train)

all_models_set <- 
    workflow_set(preproc = list(df_recipe = df_recipe),
                 models =  list(rf_tune,
                                svm_tune),
                 cross = TRUE)

set.seed(123)

  cv <- vfold_cv(df_train, v = 5, repeats = 1, strata = target)
  
  df_metr <- metric_set(accuracy, roc_auc,sens,spec)
  
  
  all_models <-
    all_models_set %>%
    workflow_map("tune_grid",
                 resamples = cv,
                 grid = 10,
                 control =  control_resamples( save_pred = T, save_workflow = T, verbose = T), 
                 metrics = df_metr
    )
  
 
 # Get the workflow ID for the top model from our workflow set
  best_workflow <-
    rank_results(all_models, rank_metric = "roc_auc", select_best = TRUE) %>% 
    filter(.metric=="roc_auc" & rank==1)
  
  
  # Best hyperparameters for the top-ranked workflow
  final_model <-
    extract_workflow_set_result(all_models, pull(best_workflow, wflow_id)) %>% 
    select_best(metric = "roc_auc") 
  
  
  # Fit final model on Train and predict on Test set
  final_model_pred <- 
    extract_workflow(all_models, pull(best_workflow, wflow_id)) %>% # extract the workflow
    finalize_workflow(final_model) %>% 
    last_fit(df_split) # fit the model on Train and score on Test
  
  # final workflow extraction
  wf_final_model <- extract_workflow(final_model_pred)

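After building the model and training the workflow (wf_final_model), I saved it and wanted to use it to make predictions on the holdout sample. However, when I try to do so, I get an error message: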

predict(wf_final_model, df_holdout)

Error: Missing data in columns: var_02_X4, var_02_X7, var_02_X9, var_02_X10, var_02_X11, var_02_X12, var_02_X13, var_02_X15, var_02_X17, var_02_X18, var_02_X20, var_02_X21, var_02_X22, var_02_X23, var_02_X24, var_02_X25, var_02_X26, var_02_X27, var_02_X28, var_02_X29, var_02_X30, var_02_X31, var_02_X33, var_02_X34, var_30_X2, var_30_X3, var_30_X6, var_30_X7, var_30_X9, var_30_X11, var_30_X13, var_30_X14, var_30_X15, var_30_X16, var_30_X17, var_30_X18, var_30_X19, var_30_X20, var_30_X22, var_30_X23, var_30_X24, var_30_X25, var_30_X26, var_30_X27, var_30_X33, var_30_X43, var_30_X46, var_30_X48, var_30_X49, var_30_X51, var_30_X56, var_30_X57, var_30_X60, var_36_X14, var_36_X18, var_36_X21, var_36_X24, var_36_X28, var_36_X29, var_36_X32, var_36_X44, var_36_X57, var_36_X61, var_36_X63, var_36_X85, var_36_X125, var_36_X130, var_36_X136, var_36_X144, var_36_X147, var_36_X148, var_36_X166, var_36_X169, var_36_X171, var_89_X3, var_89_X4, var_89_X5, var_89_X6, var_89_X7, var_89_X8, var_89_X9, va
In addition: Warning messages:
1: Novel levels found in column 'var_02': '2', '5'. The levels have been removed, and values have been coerced to 'NA'. 
2: Novel levels found in column 'var_30': '39', '41', '42', '47', '54'. The levels have been removed, and values have been coerced to 'NA'. 
3: Novel levels found in column 'var_36': '118'. The levels have been removed, and values have been coerced to 'NA'. 
4: Novel levels found in column 'var_89': '2'. The levels have been removed, and values have been coerced to 'NA'. 
5: There are new levels in a factor: NA 
6: There are new levels in a factor: NA 
7: There are new levels in a factor: NA 
8: There are new levels in a factor: NA 

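I don't have variables with such names in the training set, the test set, or the holdout set. As far as I understand, these variables describe interactions, but I'm not sure how to handle this. Could you help me fix the error so that I can get predictions?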

r machine-learning random-forest tidymodels r-parsnip
1 Answer

The variable names you are seeing, var_02_X4, var_02_X7, var_02_X9, var_02_X10, are created by step_dummy(), because var_02 has levels X4, X7, X9, X10, and so on.
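To make the naming concrete, here is a minimal base R sketch of how such column names can arise. It assumes the dummy names follow the pattern "original column, underscore, syntactically valid level name"; the level values below are hypothetical and not taken from the asker's data.

```r
# Hypothetical levels of a factor such as var_02 (stored as character labels)
lvls <- c("4", "7", "9", "10")

# make.names() turns each label into a syntactically valid R name;
# labels that start with a digit get an "X" prefix
suffix <- make.names(lvls)

# Assumed naming scheme: <original column>_<valid level name>
dummy_cols <- paste("var_02", suffix, sep = "_")
print(dummy_cols)
```

So these columns are indicator variables for individual factor levels, not interaction terms.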

The way to fix this is to add step_unknown() before step_dummy():
  # create a recipe
  df_recipe <- recipe(target ~., data = df_train) %>% 
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric()) %>% 
    step_corr(threshold = 0.7) %>% 
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors()) 

You don't need -all_outcomes(), because all_nominal_predictors() does not select the outcome.
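The idea behind the fix can be sketched in base R, without tidymodels. The vectors below are toy stand-ins for a single categorical predictor, and the "unknown" label is only an illustrative choice; the sketch shows the mechanism and is not a reproduction of what the recipe steps do internally.

```r
# Toy stand-ins for one categorical predictor (not the asker's data)
train_x   <- factor(c("4", "7", "9", "4"))
holdout_x <- c("4", "2", "9", "5")

# Levels present in the holdout sample but never seen during training
novel <- setdiff(unique(holdout_x), levels(train_x))
print(novel)

# Coercing to the training levels turns the novel values into NA,
# which is the situation the "Novel levels found" warnings describe
coerced <- factor(holdout_x, levels = levels(train_x))

# Give NA its own level so that dummy columns built afterwards
# contain no missing values
with_unknown <- as.character(coerced)
with_unknown[is.na(with_unknown)] <- "unknown"
with_unknown <- factor(with_unknown, levels = c(levels(train_x), "unknown"))
print(with_unknown)
```

Running the setdiff() check on each nominal column of df_holdout against df_train is also a quick way to see which predictors carry unseen levels before calling predict().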
