Difference between automatic stat_summary statistics and manually computed standard errors


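I am using {ggplot2} to plot the mean and confidence interval of a grouped variable, and I noticed that I get slightly different results depending on whether I compute the confidence interval by hand or use stat_summary().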

Does anyone know what might be causing this difference?

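Reproducible code is below (note: the variable x is deliberately skewed to mimic my actual dataset, and I wonder whether that is what causes the problem).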

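# Packages used below: dplyr for %>%, group_by() and summarise(); ggplot2 for plotting
library(dplyr)
library(ggplot2)

# Generate data
set.seed(123)  # fixed seed so the random draws are reproducible
# Number of observations per group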
n_per_group <- 50

# Generate skewed data (gamma draws are right-skewed)
group1 <- rgamma(n_per_group, shape = 2, scale = 1)
group2 <- rgamma(n_per_group, shape = 3, scale = 1.5)
group3 <- rgamma(n_per_group, shape = 4, scale = 2)

# Combine data into a single data frame
df <- data.frame(
  y = rep(c("Group 1", "Group 2", "Group 3"), each = n_per_group),
  x = c(group1, group2, group3)
)

# Using stat_summary()
df %>%
  ggplot(., aes(x = x, y = y, group = y)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se", 
               geom = "errorbar", 
               width = 0.1) +
  scale_x_continuous(breaks = seq(1, 10, by = 0.5))

# By hand
df %>%
  group_by(y) %>%
  summarise(mean = mean(x, na.rm = TRUE), 
            std.dev = sd(x, na.rm = TRUE), 
            n = n(),
            se = std.dev / sqrt(n)) %>%
  
  ggplot(., aes(y = y)) + 
  geom_errorbar(aes(xmin = mean - 1.96*se, 
                    xmax = mean + 1.96*se), 
                width = 0.1) + 
  geom_point(aes(x = mean)) +
  scale_x_continuous(breaks = seq(1, 10, by = 0.5))
Tags: r, ggplot2, statistics

1 answer

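By default, ggplot2's mean_se() function extends the error bars one standard error on either side of the mean (mult = 1), not 1.96 standard errors as in your manual calculation. You can make the two plots match like this: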

df %>%
  ggplot(., aes(x = x, y = y, group = y)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = \(x) mean_se(x, mult = 1.96),
               geom = "errorbar", 
               width = 0.1) +
  scale_x_continuous(breaks = seq(1, 10, by = 0.5))
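
To see the difference in numbers rather than in a plot, the following is a minimal sketch that compares mean_se() with the by-hand calculation on a single vector. The seed and the sample are assumptions chosen only for illustration; they are not the data from the question.

```r
library(ggplot2)

set.seed(1)  # assumption: arbitrary seed, used only to make the draw repeatable
x <- rgamma(50, shape = 2, scale = 1)

# By-hand interval, following the question's summarise() step
m  <- mean(x)
se <- sd(x) / sqrt(length(x))
manual <- c(ymin = m - 1.96 * se, ymax = m + 1.96 * se)

# mean_se() with its default multiplier: mean +/- 1 standard error
default_se <- mean_se(x)

# mean_se() with mult = 1.96: mean +/- 1.96 standard errors
scaled_se <- mean_se(x, mult = 1.96)

# The default interval is narrower than the by-hand one
(default_se$ymax - default_se$ymin) < (manual[["ymax"]] - manual[["ymin"]])

# With mult = 1.96 the bounds agree with the by-hand values
all.equal(scaled_se$ymin, manual[["ymin"]])
all.equal(scaled_se$ymax, manual[["ymax"]])
```

The skew of the data plays no role here: both approaches use the same mean and the same standard error, and only the multiplier differs.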

You can type mean_se in the console to see its definition:

# mean_se
function (x, mult = 1) 
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    data_frame0(y = mean, ymin = mean - se, ymax = mean + se, 
        .size = 1)
}
# <bytecode: 0x10bd2b7d8>
# <environment: namespace:ggplot2>
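
Since mult is an ordinary argument of mean_se(), it can also be supplied through the fun.args parameter of stat_summary() instead of an anonymous function. This may be convenient on R versions older than 4.1.0, where the \(x) shorthand is not available. The sketch below rebuilds a data frame shaped like the one in the question; the seed is an assumption added for repeatability.

```r
library(ggplot2)

set.seed(1)  # assumption: arbitrary seed, not part of the original question
n_per_group <- 50
df <- data.frame(
  y = rep(c("Group 1", "Group 2", "Group 3"), each = n_per_group),
  x = c(rgamma(n_per_group, shape = 2, scale = 1),
        rgamma(n_per_group, shape = 3, scale = 1.5),
        rgamma(n_per_group, shape = 4, scale = 2))
)

p <- ggplot(df, aes(x = x, y = y, group = y)) +
  stat_summary(fun = "mean", geom = "point") +
  # fun.args is forwarded to mean_se(), so this calls mean_se(x, mult = 1.96)
  stat_summary(fun.data = "mean_se",
               fun.args = list(mult = 1.96),
               geom = "errorbar",
               width = 0.1)
p
```

The resulting error bars should have the same extent as those produced by the anonymous-function version above.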