用dplyr引导置信区间

问题描述 投票:1回答:3

看看答案here我如何估计自举时间间隔?这个问题也在ggplot2列表中被提出。

library(dplyr)
mtcars %>%
group_by(vs) %>%
summarise(mean.mpg = mean(mpg, na.rm = TRUE),
    sd.mpg = sd(mpg, na.rm = TRUE),
    n.mpg = n()) %>%
mutate(se.mpg = sd.mpg / sqrt(n.mpg),
 lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
 upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)
r dplyr confidence-interval
3个回答
5
投票

Hmisc包具有smean.cl.boot函数,可以轻松计算简单的bootstrap置信区间。最难的部分(IMO)将此结果的多个输出(该函数返回一个3元素的数字向量)合并到dplyr工作流程中(请参阅dplyr::mutate to add multiple values

library(Hmisc)  ## optional if using Hmisc:: below
library(dplyr)
mtcars %>%
  group_by(vs) %>%
  do(data.frame(rbind(Hmisc::smean.cl.boot(.$mpg))))

新列标记为MeanLowerUpper,但额外的setNames调用将修复...

如果做了很多这样的事情,

bootf <- function(x,var="mpg") {
    newstuff <- rbind(Hmisc::smean.cl.boot(x[[var]])) %>%
         data.frame %>%
         setNames(paste(var,c("mean","lwr","upr"),sep="_"))
    return(newstuff)
}
mtcars %>% group_by(vs) %>% do(bootf(.))
mtcars %>% group_by(cyl) %>% do(bootf(.))

2
投票

使用上面的代码,

data.frame(boot=1:1000) %>%
  group_by(boot) %>% 
  do(sample_n(mtcars, nrow(mtcars), replace=TRUE)) %>%
  group_by(boot, vs) %>%
dplyr::summarise(mean.mpg = mean(mpg, na.rm = TRUE),
                 sd.mpg = sd(mpg, na.rm = TRUE),
                 n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.1 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.1 / 2), n.mpg - 1) * se.mpg) %>% 
    group_by(vs) %>% summarise_each(funs(mean), vars = -boot)

答案是

# A tibble: 2 x 7
     vs mean.mpg   sd.mpg n.mpg   se.mpg lower.ci.mpg upper.ci.mpg
  <dbl>    <dbl>    <dbl> <dbl>    <dbl>        <dbl>        <dbl>
1     0 16.62142 3.679562 17.97 0.876537     15.09220     18.15063
2     1 24.53193 5.125643 14.03 1.388702     22.05722     27.00663

2
投票

ORIGINAL ANSWER: Bootstrapping a single column

下面的代码包括一个简单的引导函数以及一些返回信息数据框的附加代码:

my_boot = function(x, times=1000) {

   # Get column name from input object
   var = deparse(substitute(x))
   var = gsub("^\\.\\$","", var)

  # Bootstrap 95% CI
  cis = quantile(replicate(times, mean(sample(x, replace=TRUE))), probs=c(0.025,0.975))

  # Return data frame of results
  data.frame(var, n=length(x), mean=mean(x), lower.ci=cis[1], upper.ci=cis[2])
}

mtcars %>%
  group_by(vs) %>%
  do(my_boot(.$mpg))
     vs    var     n     mean lower.ci upper.ci
  <dbl> <fctr> <int>    <dbl>    <dbl>    <dbl>
1     0    mpg    18 16.61667 15.14972 18.06139
2     1    mpg    14 24.55714 22.36357 26.80750

UPDATE: Bootstrapping any selection of columns

根据您的评论,这里有一个更新的方法来获取任何列选择的bootsrapped置信区间:

library(reshape2)
library(tidyr)

my_boot = function(x, times=1000) {

  # Bootstrap 95% CI
  cis = quantile(replicate(times, mean(sample(x, replace=TRUE))), probs=c(0.025,0.975))

  # Return results as a data frame
  data.frame(mean=mean(x), lower.ci=cis[1], upper.ci=cis[2])
}

mtcars %>%
  group_by(vs) %>%
  do(as.data.frame(apply(., 2, my_boot))) %>% 
  melt(id.var="vs") %>%
  separate(variable, sep="\\.", extra="merge", into=c("col","stat")) %>%
  dcast(vs + col ~ stat, value.var="value")
   vs  col    lower.ci        mean    upper.ci
1   0   am   0.1111111   0.3333333   0.5555556
2   0 carb   3.0000000   3.6111111   4.2777778
3   0  cyl   6.8888889   7.4444444   7.8888889
4   0 disp 262.3205556 307.1500000 352.4481944
5   0 drat   3.1877639   3.3922222   3.6011528
6   0 gear   3.2222222   3.5555556   3.9444444
7   0   hp 164.0500000 189.7222222 218.5625000
8   0  mpg  14.9552778  16.6166667  18.3225000
9   0 qsec  16.1888750  16.6938889  17.1744583
10  0   vs   0.0000000   0.0000000   0.0000000
11  0   wt   3.2929569   3.6885556   4.0880069
12  1   am   0.2142857   0.5000000   0.7857143
13  1 carb   1.2857143   1.7857143   2.3571429
14  1  cyl   4.1428571   4.5714286   5.0000000
15  1 disp 105.5703571 132.4571429 161.4657143
16  1 drat   3.5992143   3.8592857   4.1100000
17  1 gear   3.5714286   3.8571429   4.1428571
18  1   hp  79.7125000  91.3571429 103.2142857
19  1  mpg  21.8498214  24.5571429  27.3289286
20  1 qsec  18.7263036  19.3335714  20.0665893
21  1   vs   1.0000000   1.0000000   1.0000000
22  1   wt   2.2367000   2.6112857   2.9745571

Other updates to answer questions in the comments

更新:在@BenBolker的回答中回答你对我的评论:如果你想要sample返回的结果,你可以这样做:

boot.dat = replicate(1000, sample(mtcars$mpg[mtcars$vs==1], replace=TRUE))

这将返回一个包含1000列的矩阵,每个列都是mtcars$mpgvs==1的单独bootstrap样本。你也可以这样做:

boot.by.vs = sapply(split(mtcars, mtcars$vs), function(df) {
   replicate(1000, sample(df$mpg, replace=TRUE))
}, simplify=FALSE)

这将返回一个列表,其中第一个列表元素是vs==0的bootstrap样本矩阵,第二个列表元素是vs==1

更新2:要回答你的第二个评论,这里是如何引导整个数据框(并假设你想要保存所有副本,而不是总结它们。下面的代码返回1000个自举版本的mtcars1的列表。这个列表将是如果您有大量数据,那么您可能只想保留每个引导样本的汇总结果,如列方式。

boot.df = lapply(1:1000, function(i) mtcars[sample(1:nrow(mtcars), replace=TRUE), ])
© www.soinside.com 2019 - 2024. All rights reserved.