Extracting NA values from the summary output of glm

Question  Votes: 0  Answers: 2

I am running logistic regressions on many different subsets of a large data frame. To do so, I use the following code (using dplyr and purrr):

library(dplyr)
library(tidyr)
library(purrr)

# define model to be run
mod_fun <- function(df) {
  glm(presence ~ transect, data = df, family = "binomial")
}

# nest data and run model
mod.glm <- dat %>%
  nest(-c(region, fYear, species, road)) %>%
  mutate(model = map(data, mod_fun))

# define functions to extract model coefficients
b_fun <- function(mod) {
  coef(summary(mod))[2]
}
p_fun <- function(mod) {
  coef(summary(mod))[8]
}

# extract coefficients
slope<-mod.glm %>% group_by(species, region, fYear, road) %>%
  transmute(beta = map_dbl(model, b_fun),
            p_val = map_dbl(model, p_fun))

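As you can see, I only want to extract the estimate and the p-value for the slope (called transect). To do so, I index into the coefficient table with coef(summary(mod))[2] and so on. The problem is that some subsets of my data frame lead to fits in which a coefficient cannot be estimated and is set to NA. coef(summary(mod))[2] extracts the second value of the coefficient table, and because rows with NA coefficients are dropped from that table, the second value is no longer the estimate for transect that I want. So far I have tried to unsuppress the NA values with coef(summary(mod_2), complete = TRUE) (-> no change at all, the NAs are still not displayed) and to address the values directly with coef(summary(mod_2), complete = TRUE)["transect","Estimate"] (-> throws an error). Does anybody know how I can solve this problem?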

What I have tried so far:

# two example models; mod_2 will result in NAs
mod_1 <- glm(presence ~ transect, data = dat[dat$fYear == 1 & dat$species == "Plantago lanceolata", ], family = "binomial")
mod_2 <- glm(presence ~ transect, data = dat[dat$fYear == 2 & dat$species == "Plantago lanceolata", ], family = "binomial")

coef(summary(mod_1))[2] # works fine
coef(summary(mod_2))[2] # not the value I want

coef(summary(mod_1), complete = TRUE)["transect","Estimate"] # works fine
coef(summary(mod_2), complete = TRUE)["transect","Estimate"] # error

coef(summary(mod_2), complete = TRUE) # NAs for transect are still not displayed

summary(mod_2)$coefficients["transect","Estimate"] # is not working either
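
For illustration, the behaviour described above can be reproduced without the original data. The sketch below uses a made-up data frame (toy, not part of dat) in which transect is constant, so its slope cannot be estimated. It only shows where the NA is kept and where it is dropped.

```r
# Toy data (hypothetical): transect never varies, so its slope is not estimable
toy <- data.frame(presence = c(1, 0, 1, 0), transect = c(1, 1, 1, 1))
fit <- glm(presence ~ transect, data = toy, family = "binomial")

# The raw coefficient vector keeps the aliased term as NA
coef(fit)

# The summary table drops the aliased row, so only "(Intercept)" is left
rownames(coef(summary(fit)))

# Positional indexing therefore lands on an intercept column, not on transect
coef(summary(fit))[2]
```

With only one row left in the table, position 2 is the second column of the intercept row, which is why positional indexing silently returns the wrong number.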

Data:

dput(dat)
structure(list(region = c("HWI", "HWI", "HWI", "HWI", "HWI", 
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", 
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", 
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI"), road = c("MK", "MK", 
"MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", 
"MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", 
"MK", "MK", "MK", "MK", "MK"), transect = c(1L, 1L, 2L, 2L, 3L, 
3L, 4L, 4L, 4L, 4L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 
11L, 11L, 12L, 12L, 13L, 13L, 15L, 15L, 1L), fYear = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), species = c("Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata"
), presence = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -29L), groups = structure(list(
    fYear = c(1L, 1L, 2L), road = c("MK", "MK", "MK"), species = c("Plantago lanceolata", 
    "Poa pratensis", "Plantago lanceolata"), .rows = list(c(1L, 
    3L, 5L, 7L, 9L, 11L, 13L, 15L, 17L, 19L, 21L, 23L, 25L, 27L
    ), c(2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L, 20L, 22L, 24L, 
    26L, 28L), 29L)), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

Thanks for your help!

r na purrr glm
2 Answers
0
votes

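I don't know of a way to extract the NA rows from the coefficient table. Instead, the answer may be to add the NA rows back after extracting the elements you want. This can be done with complete() from tidyr.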

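In this example I use broom::tidy(), which means I have to filter out the intercept row, but you could certainly do something similar with your own functions.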

library(purrr)
library(tidyr)
library(dplyr)

mod.glm %>%
     group_by(species, region, fYear, road) %>%
     transmute(results = map(model, broom::tidy) ) %>%
     unnest(results) %>%
     complete(term = "transect") %>%
     filter(term != "(Intercept)") %>%
     ungroup()

# A tibble: 3 x 9
  species          region fYear road  term    estimate std.error statistic p.value
  <chr>            <chr>  <int> <chr> <chr>      <dbl>     <dbl>     <dbl>   <dbl>
1 Plantago lanceo~ HWI        1 MK    transe~  -42.7   31127.     -0.00137   0.999
2 Plantago lanceo~ HWI        2 MK    transe~   NA        NA      NA        NA    
3 Poa pratensis    HWI        1 MK    transe~   -0.206     0.188  -1.10      0.272
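
To make the role of complete() easier to see in isolation, here is a minimal sketch on a made-up table (res, grp and the estimates are hypothetical, not taken from the models above). Within each group, complete() adds any missing term row and fills the other columns with NA, while rows that already exist are kept.

```r
library(dplyr)
library(tidyr)

# Hypothetical tidy output: group "b" has no transect row
res <- tibble(
  grp      = c("a", "a", "b"),
  term     = c("(Intercept)", "transect", "(Intercept)"),
  estimate = c(0.5, -0.2, 1.1)
)

filled <- res %>%
  group_by(grp) %>%
  complete(term = "transect") %>%    # add a transect row where it is missing
  filter(term != "(Intercept)") %>%  # keep only the slope rows
  ungroup()

filled
```

Group "a" keeps its existing estimate and group "b" gets a transect row whose estimate is NA.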

0
votes

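Taking a completely different route, you could change your extraction functions to return NA when an error occurs. This is a job for something like tryCatch(), but I find possibly() from purrr very convenient for this kind of task.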

possibly() wraps around a function. The otherwise argument states which value to return if an error occurs when the wrapped function is called.
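
As a small sketch of that behaviour, unrelated to the models in the question, the wrapper below (safe_log is a hypothetical name) turns an error from log() into a missing value:

```r
library(purrr)

# Hypothetical wrapper: return NA instead of failing on bad input
safe_log <- possibly(log, otherwise = NA_real_)

ok  <- safe_log(10)   # ordinary result, same as log(10)
bad <- safe_log("a")  # log("a") would throw an error; the wrapper returns NA
```

Using NA_real_ rather than plain NA keeps the fallback a double, which matches what map_dbl() expects.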

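Here are your two functions, wrapped in possibly(). I have changed them to pull specifically the "transect" row of the coefficient summary, so they will throw an error if that row does not exist.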

b_fun <- possibly(
     function(mod) {
          coef(summary(mod))["transect", 1]
          }, otherwise = NA)

p_fun <- possibly(
     function(mod) {
          coef(summary(mod))["transect", 4]
          }, otherwise = NA)

# extract coefficients
mod.glm %>%
     group_by(species, region, fYear, road) %>%
     transmute(beta = map_dbl(model, b_fun),
               p_val = map_dbl(model, p_fun) ) %>%
     ungroup() 

# A tibble: 3 x 6
  species             region fYear road     beta  p_val
  <chr>               <chr>  <int> <chr>   <dbl>  <dbl>
1 Plantago lanceolata HWI        1 MK    -42.7    0.999
2 Poa pratensis       HWI        1 MK     -0.206  0.272
3 Plantago lanceolata HWI        2 MK     NA     NA    
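
Since tryCatch() is mentioned above as the base R alternative, here is a rough sketch of the same idea without purrr. The names b_fun_base, toy_na and toy_ok are made up for this example, and the two small data frames stand in for the real subsets.

```r
# Hypothetical base R variant of b_fun: NA when the transect row is missing
b_fun_base <- function(mod) {
  tryCatch(
    coef(summary(mod))["transect", "Estimate"],
    error = function(e) NA_real_
  )
}

# Toy data: constant transect (slope not estimable) vs. varying transect
toy_na <- data.frame(presence = c(1, 0, 1, 0), transect = c(1, 1, 1, 1))
toy_ok <- data.frame(presence = c(1, 1, 0, 1, 0, 0), transect = 1:6)

fit_na <- glm(presence ~ transect, data = toy_na, family = "binomial")
fit_ok <- glm(presence ~ transect, data = toy_ok, family = "binomial")

b_fun_base(fit_na)  # the row lookup fails, so the handler returns NA
b_fun_base(fit_ok)  # the row exists, so the estimate is returned
```

Looking the cell up by row and column name, rather than by position, is what makes the failure detectable in the first place.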