R partykit::glmtree does not find a solution

Question

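I want to put scores into bins. The score represents the probability of default = 1. The bins should be found automatically using glmtree from the partykit library in R. Each bin should contain scores with a similar default rate. This used to work, but the score has since become better at predicting defaults, and partykit no longer seems to find a solution.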

Here is an example using synthetic data that is close to mine.

library(tidyverse)

sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

n_sample <- 10^5
score <- runif(n_sample, min = -5, max = 5)
defaults <- rbinom(length(score), size = 1, prob = sigmoid(score))

df <- tibble(score = score, default_flag = defaults) 

partykit::glmtree(formula = default_flag ~ score,
                  data = as.data.frame(df),
                  family = binomial)

When the score is cut into 100 equal-width bins, the average default rate per bin looks like this:

df %>% 
   mutate(score_bin = cut(score, breaks = 100)) %>% 
   group_by(score_bin) %>% 
   summarise(default_rate = sum(default_flag)/ n()) %>% 
   plot()

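My intuition is that partykit cannot find a solution because many different cuts work well, i.e. they all split the data into one group with more defaults and one with fewer. Does this make sense?

How can I get partykit::glmtree to find bins for this example?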

I have already tried:

increasing maxit to 100
increasing the minimum node size (minsize)
using ctree instead, which finds a solution quickly
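
For reference, the first two attempts could be passed to glmtree() roughly as sketched below. This is an illustration only: the reduced sample size and the minsize value of 500 are assumptions, not the settings actually used. As far as I understand the partykit interface, maxit goes to the GLM fitter and minsize is handed on to mob_control().

```r
library("partykit")

## synthetic data as above, but with a smaller sample (assumption, to keep the run short)
set.seed(1)
sigmoid <- function(x) 1 / (1 + exp(-x))
n_try <- 5000
d_try <- data.frame(score = runif(n_try, min = -5, max = 5))
d_try$default_flag <- rbinom(n_try, size = 1, prob = sigmoid(d_try$score))

## maxit: maximum number of IWLS iterations per GLM fit
## minsize: minimum number of observations per node (hypothetical value)
tr_try <- glmtree(default_flag ~ score, data = d_try, family = binomial,
                  maxit = 100, minsize = 500)

## number of terminal nodes, i.e. bins
width(tr_try)
```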

Answer 1

In your example, glmtree() does find a tree, corresponding to 23 bins, which is very similar to the 24 bins found by ctree().

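The main difference between the two algorithms is that glmtree() is a lot slower (!) because it does not exploit the fact that you only estimate a mean/intercept within each bin. It always sets up a full GLM, runs the IWLS algorithm (iteratively weighted least squares), and so on. In contrast, ctree() was designed for this constant-fit case and also has many parts of the algorithm coded in C, which makes it much faster for sample sizes as large as in your example.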
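
To get a feeling for the speed difference, one could time both functions on a smaller sample. The following is only a sketch: the sample size of 5000 and all object names are assumptions chosen so that it runs quickly, and the timings will depend on the machine.

```r
library("partykit")

## synthetic data as in the question, but with a smaller n (assumption)
set.seed(1)
sigmoid <- function(x) 1 / (1 + exp(-x))
n_small <- 5000
d_small <- data.frame(score = runif(n_small, min = -5, max = 5))
d_small$default_flag <- rbinom(n_small, size = 1, prob = sigmoid(d_small$score))

## intercept-only GLM tree: fits binomial GLMs via IWLS while searching for splits
t_glmtree <- system.time(
  tr_small <- glmtree(default_flag ~ score, data = d_small, family = binomial)
)

## constant-fit conditional inference tree on the same data
t_ctree <- system.time(
  ct_small <- ctree(factor(default_flag) ~ score, data = d_small)
)

## elapsed seconds and number of terminal nodes (bins) for each tree
rbind(glmtree = c(t_glmtree["elapsed"], bins = width(tr_small)),
      ctree   = c(t_ctree["elapsed"],   bins = width(ct_small)))
```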

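So if you want a constant-fit tree, I would probably use ctree() for sample sizes like this. The results are usually very similar. Alternatively, I would consider a glmtree() with more than just an intercept. If you use a logistic regression of default on score in each node, no splits are needed at all because you simulated the data from a logistic model.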
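
The last point can be checked without any tree: a plain logistic regression on the simulated data should recover an intercept near 0 and a slope near 1, which is the model the data were generated from. A minimal sketch, with the seed and object names being my own assumptions:

```r
## simulate from the same logistic model as in the question
set.seed(1)
sigmoid <- function(x) 1 / (1 + exp(-x))
n_sim <- 10^5
sim <- data.frame(score = runif(n_sim, min = -5, max = 5))
sim$default_flag <- rbinom(n_sim, size = 1, prob = sigmoid(sim$score))

## single logistic regression of default on score, no partitioning
fit <- glm(default_flag ~ score, data = sim, family = binomial)

## estimates should be close to the true values (intercept 0, slope 1)
coef(fit)

## largest gap between fitted and true default probabilities
max(abs(fitted(fit) - sigmoid(sim$score)))
```

Since one global model already describes the data this well, there is no parameter instability left for the tree to split on.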

Let's compare the three trees:

## package and data as data.frame
library("partykit")
df <- as.data.frame(df)

## GLM tree with intercept only (many splits, slow)
tr <- glmtree(default_flag ~ score, data = df, family = binomial)

## GLM tree with score as linear logistic regressor (no splits, fast)
tr2 <- glmtree(default_flag ~ score | score, data = df, family = binomial)

## CTree with intercept only (many splits, fast)
tr3 <- ctree(factor(default_flag) ~ score, data = df)

The resulting fitted curves look like this:

## fitted probabilities
ndf <- data.frame(score = seq(from = -4.5, to = 4.5, by = 0.01))
ndf$prob <- predict(tr, newdata = ndf, type = "response")
ndf$prob2 <- predict(tr2, newdata = ndf, type = "response")
ndf$prob3 <- predict(tr3, newdata = ndf, type = "prob")[,2]
plot(sigmoid(score) ~ score, data = ndf, type = "l", col = "lightgray", lwd = 4)
lines(prob ~ score, data = ndf, col = 2, lwd = 1.5)
lines(prob2 ~ score, data = ndf, col = 3, lwd = 1.5)
lines(prob3 ~ score, data = ndf, col = 4, lwd = 1.5)

Note: I accidentally deleted the figure; I will repost it later today.
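
Since the goal is the bins themselves, here is a sketch of how they might be read off a fitted tree: predict(..., type = "node") gives the terminal node of each observation, which can then be summarised per bin. The smaller sample size and all object names are assumptions for illustration; the same idea should carry over to the trees fitted above.

```r
library("partykit")

## synthetic data as in the question, with a smaller n (assumption)
set.seed(1)
sigmoid <- function(x) 1 / (1 + exp(-x))
n_demo <- 10^4
d_demo <- data.frame(score = runif(n_demo, min = -5, max = 5))
d_demo$default_flag <- rbinom(n_demo, size = 1, prob = sigmoid(d_demo$score))

## constant-fit tree
ct_demo <- ctree(factor(default_flag) ~ score, data = d_demo)

## terminal node id of every observation, used as its bin label
d_demo$bin <- predict(ct_demo, newdata = d_demo, type = "node")

## per-bin score range, size, and observed default rate
bins <- do.call(rbind, lapply(split(d_demo, d_demo$bin), function(b) {
  data.frame(bin = b$bin[1],
             score_min = min(b$score), score_max = max(b$score),
             n = nrow(b), default_rate = mean(b$default_flag))
}))
bins <- bins[order(bins$score_min), ]
bins
```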
