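I have a dataset with 90 variables and 200000 observations. It is imbalanced: the target variable is 1 in only 4% of cases and 0 in all others.
I split it into two sets: a fitting sample (185000 observations) and a holdout sample, "df_holdout" (15000 observations). For model fitting, I then took all cases from the fitting sample where the target variable = 1 and the same number of cases where the target variable = 0. (The resulting set, "df", contains 25000 observations in total.)
The variables are named var_01, var_02, var_03, ... var_90, where var_90 was renamed to "target".
I have a set of workflows.
Here is the code I used for model fitting: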
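library(tidymodels)

# Random forest: tune mtry and min_n, keep the number of trees fixed
rf_tune <- parsnip::rand_forest(mode = "classification",
                                mtry = tune(),
                                trees = 1000,
                                min_n = tune()) %>%
  set_engine("ranger",
             importance = "impurity")

# Polynomial SVM: tune cost, degree, scale_factor and margin
svm_tune <- parsnip::svm_poly(mode = "classification",
                              engine = "kernlab",
                              cost = tune(),
                              degree = tune(),
                              scale_factor = tune(),
                              margin = tune())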
# Create data split object
df_split <- initial_split(df, prop = 0.75,
                          strata = target)

# Create the training data
df_train <- df_split %>%
  training()

# Create the test data
df_test <- df_split %>%
  testing()

# create a recipe
df_recipe <- recipe(target ~ ., data = df_train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric()) %>%
  step_corr(threshold = 0.7) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes())

# Inspect the preprocessed training data
df_recipe %>%
  prep(df_train) %>%
  bake(df_train)
all_models_set <-
  workflow_set(preproc = list(df_recipe = df_recipe),
               models = list(rf_tune,
                             svm_tune),
               cross = TRUE)

set.seed(123)
# 5-fold cross-validation on the training data (df_train, as created above)
cv <- vfold_cv(df_train, v = 5, repeats = 1, strata = target)
df_metr <- metric_set(accuracy, roc_auc, sens, spec)

all_models <-
  all_models_set %>%
  workflow_map("tune_grid",
               resamples = cv,
               grid = 10,
               control = control_resamples(save_pred = TRUE, save_workflow = TRUE, verbose = TRUE),
               metrics = df_metr
  )
# Get the workflow ID for the top model from our workflow set
best_workflow <-
  rank_results(all_models, rank_metric = "roc_auc", select_best = TRUE) %>%
  filter(.metric == "roc_auc" & rank == 1)

# Best hyperparameters for that workflow
final_model <-
  extract_workflow_set_result(all_models, pull(best_workflow, wflow_id)) %>%
  select_best(metric = "roc_auc")

# Fit final model on Train and predict on Test set
final_model_pred <-
  extract_workflow(all_models, pull(best_workflow, wflow_id)) %>% # extract the workflow
  finalize_workflow(final_model) %>%
  last_fit(df_split) # fit the model on Train and score on Test

# final workflow extraction
wf_final_model <- extract_workflow(final_model_pred)
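After creating the model and training the workflow (wf_final_model), I saved it and wanted to use it to make predictions on the holdout sample. But when I try to do that, I get an error message: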
predict(wf_final_model, df_holdout)
Error: Missing data in columns: var_02_X4, var_02_X7, var_02_X9, var_02_X10, var_02_X11, var_02_X12, var_02_X13, var_02_X15, var_02_X17, var_02_X18, var_02_X20, var_02_X21, var_02_X22, var_02_X23, var_02_X24, var_02_X25, var_02_X26, var_02_X27, var_02_X28, var_02_X29, var_02_X30, var_02_X31, var_02_X33, var_02_X34, var_30_X2, var_30_X3, var_30_X6, var_30_X7, var_30_X9, var_30_X11, var_30_X13, var_30_X14, var_30_X15, var_30_X16, var_30_X17, var_30_X18, var_30_X19, var_30_X20, var_30_X22, var_30_X23, var_30_X24, var_30_X25, var_30_X26, var_30_X27, var_30_X33, var_30_X43, var_30_X46, var_30_X48, var_30_X49, var_30_X51, var_30_X56, var_30_X57, var_30_X60, var_36_X14, var_36_X18, var_36_X21, var_36_X24, var_36_X28, var_36_X29, var_36_X32, var_36_X44, var_36_X57, var_36_X61, var_36_X63, var_36_X85, var_36_X125, var_36_X130, var_36_X136, var_36_X144, var_36_X147, var_36_X148, var_36_X166, var_36_X169, var_36_X171, var_89_X3, var_89_X4, var_89_X5, var_89_X6, var_89_X7, var_89_X8, var_89_X9, va
In addition: Warning messages:
1: Novel levels found in column 'var_02': '2', '5'. The levels have been removed, and values have been coerced to 'NA'.
2: Novel levels found in column 'var_30': '39', '41', '42', '47', '54'. The levels have been removed, and values have been coerced to 'NA'.
3: Novel levels found in column 'var_36': '118'. The levels have been removed, and values have been coerced to 'NA'.
4: Novel levels found in column 'var_89': '2'. The levels have been removed, and values have been coerced to 'NA'.
5: There are new levels in a factor: NA
6: There are new levels in a factor: NA
7: There are new levels in a factor: NA
8: There are new levels in a factor: NA
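I don't have any variables with such names in the training set, the test set, or the holdout set. As I understand it, these variables describe interactions, but I'm not sure how to deal with them. Could you help me fix the error so that I can get predictions?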
The variable names you are seeing, var_02_X4, var_02_X7, var_02_X9, var_02_X10, are created by step_dummy(); for example, var_02 has the levels X4, X7, X9, X10, and so on.
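To make the naming concrete, here is a minimal sketch on toy data. The data frame `toy` and its values are made up for illustration; only the column name `var_02` is borrowed from the question. It assumes the recipes package is installed and that the factor levels are digit strings, which is what the warnings in the question suggest.

```r
library(recipes)

# Hypothetical toy data: a factor predictor whose levels are digit strings
toy <- data.frame(
  var_02 = factor(c("1", "4", "7", "4")),
  y      = c(0.5, 1.2, 3.1, 0.7)
)

# Build a recipe that only creates dummy variables
rec <- recipe(y ~ ., data = toy)
rec <- step_dummy(rec, all_nominal_predictors())

# Estimate the recipe and return the processed training data
baked <- bake(prep(rec), new_data = NULL)

# The original factor column is replaced by one column per non-reference level,
# named <variable>_<level>, with the level made into a syntactically valid name
names(baked)
```

The column names printed at the end should follow the same `var_02_X4` pattern as the ones in the error message, which is why they look unfamiliar even though no such variables exist in the raw data.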
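The way to fix this is to add step_unknown() before step_dummy(). As the warnings show, levels that were not seen during training are coerced to NA at prediction time; step_unknown() assigns those missing values to a dedicated "unknown" level, so step_dummy() no longer produces missing values in the dummy columns.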
# create a recipe
df_recipe <- recipe(target ~ ., data = df_train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric()) %>%
  step_corr(threshold = 0.7) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())
You don't need -all_outcomes(), because all_nominal_predictors() does not select the outcome.
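As a rough illustration of why the extra step helps, the sketch below bakes a recipe on a new row whose factor value is NA. This mimics what the warnings in the question describe (novel levels coerced to NA before the recipe runs); it is not the actual workflow. The data frames `train` and `new_obs` are hypothetical, and the example assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical training data with a factor predictor
train <- data.frame(
  var_02 = factor(c("1", "4", "7", "4")),
  y      = c(0.5, 1.2, 3.1, 0.7)
)

# Hypothetical new observation whose level has already been coerced to NA
new_obs <- data.frame(
  var_02 = factor(NA_character_, levels = c("1", "4", "7"))
)

# step_unknown() first, then step_dummy(), as in the recipe above
rec <- recipe(y ~ ., data = train)
rec <- step_unknown(rec, all_nominal_predictors())
rec <- step_dummy(rec, all_nominal_predictors())

# Apply the estimated recipe to the new observation
baked <- bake(prep(rec), new_data = new_obs)

# The NA is mapped to the "unknown" level, so the dummy columns are complete
baked
anyNA(baked)
```

With this ordering, the row with the missing level ends up in its own `var_02_unknown` dummy column instead of producing NA in every dummy column. After changing the recipe, the workflow has to be tuned and fitted again before calling predict() on df_holdout.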