Bayesian optimization of XGBoost in tidymodels


I am trying to apply Bayesian optimization to a binary classification problem (XGBoost) within the tidymodels framework.

  1. Are there any flaws in my code? The model has now been running on a 72-CPU Linux machine for two days. My dataset is fairly large, roughly 6 GB (1 million rows and 2,000 columns), but I lack the experience to judge whether that computation time is to be expected or whether there is a problem I am not seeing.

In addition, when I run the code on the example mtcars dataset I get the following error, so I suspect I must be missing something:

❯  Generating a set of 10 initial parameter results
→ A | warning: No control observations were detected in `truth` with control level '1'.
There were issues with some computations   A: x1
✓ Initialization complete


── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

i Current best:     roc_auc=0.5 (@iter 0)
i Gaussian process model
! All of the roc_auc values were identical. The Gaussian process model cannot be fit to the data. Try expanding the range of the tuning parameters.
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in `check_gp_failure()`:
! Gaussian process model was not fit.
Run `rlang::last_trace()` to see where the error occurred.
✖ Optimization stopped prematurely; returning current results.
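
Independently of that error, my rough attempt at sanity-checking the two-day runtime was to time a single fit and extrapolate: with 10 folds, 10 initial points, and 30 iterations the search performs about 10 * (10 + 30) = 400 fits. A minimal sketch using the xgb_wf defined in the code below (the hyperparameter values are placeholders, not tuned values):

# Time one fit with fixed placeholder hyperparameters; the full search costs
# roughly n_folds * (initial + iter) = 10 * (10 + 30) = 400 such fits
one_set <- tibble::tibble(
  tree_depth = 6, min_n = 5, learn_rate = 0.1, loss_reduction = 1
)
system.time(fit(finalize_workflow(xgb_wf, one_set), data = df_train))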

More general follow-up questions:

  1. Can I combine the nthread engine argument in the model specification with doParallel to speed up the computation, e.g. nthread = 3 with a 10-worker cluster, so that 3 cores work on each of the 10 processes? (See the sketch after this list.)

  2. Which metric should I use to optimize the model if my target variable is somewhat imbalanced (a ratio of about 1/3)? I would think F1 (f_meas) should be the most relevant?
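
For question 1, what I have in mind is the sketch below, reusing the xgb_spec from the code further down. My understanding is that nthread × workers should not exceed the physical core count, otherwise the threads oversubscribe the CPUs (parsnip sets nthread = 1 by default for the xgboost engine, so this is an experiment rather than a recommendation):

# Hypothetical split: 3 xgboost threads per model fit, 10 resampling workers,
# i.e. 3 * 10 = 30 cores in total
xgb_spec_mt <- xgb_spec %>%
  set_engine("xgboost", nthread = 3)

cl <- parallel::makeCluster(10)
doParallel::registerDoParallel(cl)

For question 2, the tune_bayes() documentation says the first metric in the metric set is the one that gets optimized, so f_meas could go first while other metrics are still recorded along the way:

# Optimize F1 but also track ROC AUC and PR AUC (the latter is often more
# informative than ROC AUC for imbalanced classes)
imbalanced_metrics <- metric_set(f_meas, roc_auc, pr_auc)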

# Load necessary libraries
library(tidymodels)
library(doParallel)
library(xgboost)

# Load example data
data(mtcars)

# Create a binary target
mtcars$am <- as.factor(mtcars$am)

set.seed(123)
df_split <- initial_split(mtcars, strata = am)

df_train <- training(df_split)
df_test <- testing(df_split)
df_train_folds <- vfold_cv(df_train, strata = am)


# /////////////////////////////////////////////////////////////////////////////

# Preprocessing recipe
recipe_df <- recipe(am ~ ., data = df_train) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())

# prep() here only to verify the steps; the untrained recipe goes into the workflow
xgb_prep <- prep(recipe_df, verbose = TRUE)



# Create model

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  learn_rate = tune(),
  loss_reduction = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")


# Set ranges for the tuned parameters. learn_rate() and loss_reduction() are
# defined on the log10 scale by default, so raw ranges need trans = NULL;
# otherwise c(0.01, 0.3) means 10^0.01 to 10^0.3. finalize() is dropped because
# it is only needed for data-dependent parameters such as mtry().
params <- parameters(
  learn_rate(),
  tree_depth(), 
  min_n(), 
  loss_reduction()
) %>%
  update(
    learn_rate = learn_rate(range = c(0.01, 0.3), trans = NULL),
    tree_depth = tree_depth(range = c(3L, 10L)),
    min_n = min_n(range = c(1L, 10L)),
    loss_reduction = loss_reduction(range = c(1, 5), trans = NULL)
  )
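
To double-check that the ranges above are interpreted in raw units rather than on the log10 scale, a small regular grid can be drawn and summarized:

# Eyeball the realized parameter values; learn_rate should span 0.01-0.3
grid_regular(params, levels = 3) %>% summary()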

# Merge into a workflow. Add the untrained recipe (recipe_df), not the prepped
# one, so each resample is prepped on its own analysis set.
xgb_wf <- workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(recipe_df)


# Parallel processing
gc()
numCores <- 30  # parallel::detectCores() returns 72 on the target machine
cl <- parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
options(tidymodels.dark = TRUE)

myxgb_res <- tune_bayes(
  xgb_wf,
  resamples = df_train_folds,
  param_info = params,
  initial = 10,
  iter = 30, 
  metrics = metric_set(roc_auc),
  control = control_bayes(
    no_improve = 5, 
    save_pred = TRUE, 
    verbose = TRUE,
    seed = 123,
    # parallel_over is a control_bayes() option; passed directly to
    # tune_bayes() it falls into `...`, which is forwarded to GPfit::GP_fit(),
    # and the trailing comma there likewise injected an empty argument -- the
    # likely source of the "argument is missing, with no default" error above
    parallel_over = "everything"
  )
)

stopCluster(cl)
doParallel::stopImplicitCluster()
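
Once the search finishes, this is roughly how I plan to extract and refit the best configuration (standard tune helpers):

# Inspect the search, refit the winner on df_train, evaluate once on df_test
show_best(myxgb_res, metric = "roc_auc")
best_params <- select_best(myxgb_res, metric = "roc_auc")
final_res <- last_fit(finalize_workflow(xgb_wf, best_params), df_split)
collect_metrics(final_res)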

Thanks in advance for any advice!
