Summarising means and confidence intervals in R using dplyr and conditional statements


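My df looks like the one below: the x and y variables take the values 0, 1 and NA, while z is a numeric variable ranging from 0 to 5. I want to compute the share of observations equal to 1 in x and in y (a conditional mean), as well as the ordinary mean of z, each with its own confidence interval.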

library(tidyverse)  # provides tribble(), the dplyr verbs and pivot_longer()

df <- tribble(
  ~"name", ~"region", ~x, ~y, ~z,
  "A", "reg1", 0, 1, 1,
  "A", "reg1", 1, 1, NA,
  "B", "reg1", 1, 0, 4,
  "C", "reg2", 1, 0, 2,
  "B", "reg2", 0, NA, 0,
  "C", "reg1", NA, 0, 5,
  "C", "reg1", 0, 1, 2,
  "B", "reg1", NA, 1, 3,
  "B", "reg2", 1, NA, NA,
  "A", "reg2", 1, 1, 1,
  "A", "reg2", 0, 1, 4,
  "A", "reg2", 1, 1, 2,
  "A", "reg1", 0, 1, 3,
)

I would like a final tidy table with columns like these (only two rows are shown, for illustration):

df1 <- tribble(
  ~"name", ~"region", ~"Indicator", ~"mean/prevalence", ~"Upper interval", ~"Lower interval",
  "A", "reg1", "x", 66, 68.5, 62.3,
  "A", "reg1", "z", 2.3, 2.5, 2.1,
)

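My question is how to organise my dplyr verbs. This is what I tried, but it is wrong because of the sample size used in each interval calculation: n() counts every row in the group, including rows where the variable is NA, so all three indicators end up with the same n.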

df %>%
  group_by(name, region) %>%
  summarise(
    meanX = mean(x == 1, na.rm = TRUE)*100,
    nX = n(),
    Xlower_ci = mean(x == 1)*100 - qt(1- 0.05/2, (n() - 1))*sd(x == 1)/sqrt(n()),
    Xupper_ci = mean(x == 1)*100 + qt(1- 0.05/2, (n() - 1))*sd(x == 1)/sqrt(n()),
    meanY = mean(y == 1, na.rm = TRUE)*100,
    nY = n(),
    Ylower_ci = mean(y == 1)*100 - qt(1- 0.05/2, (n() - 1))*sd(y == 1)/sqrt(n()),
    Yupper_ci = mean(y == 1)*100 + qt(1- 0.05/2, (n() - 1))*sd(y == 1)/sqrt(n()),
    meanZ = mean(z, na.rm = TRUE),
    nZ = n(),
    Zlower_ci = mean(z) - qt(1- 0.05/2, (n() - 1))*sd(z)/sqrt(n()),
    Zupper_ci = mean(z) + qt(1- 0.05/2, (n() - 1))*sd(z)/sqrt(n()),
  )
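
One way to avoid the shared n() is to count only the non-missing values of each variable. The following is a minimal sketch rather than a definitive implementation: mean_ci() is a hypothetical helper name, not something from the original post, and it keeps the t-based interval from the attempt above even for the 0/1 indicators, which is only an approximation for proportions.

```r
# Hypothetical helper (an assumption, not part of the original post):
# mean and t-based confidence interval on the non-missing values only.
mean_ci <- function(v, level = 0.95) {
  v <- v[!is.na(v)]   # drop NAs so that n reflects the values actually used
  n <- length(v)
  m <- mean(v)
  # Half-width of the interval; undefined with fewer than two values.
  half <- if (n > 1) {
    qt(1 - (1 - level) / 2, df = n - 1) * sd(v) / sqrt(n)
  } else {
    NA_real_
  }
  c(mean = m, lower = m - half, upper = m + half, n = n)
}

# z values for name "A", region "reg1" in the question's data: 1, NA, 3.
# Only two values are non-missing, so n is 2 rather than 3.
mean_ci(c(1, NA, 3))

# For a 0/1 indicator, convert to percent before calling the helper so the
# mean and the standard error are on the same scale (x for "A"/"reg1").
mean_ci(100 * (c(0, 1, 0) == 1))
```

Inside summarise(), the same idea amounts to replacing n() with sum(!is.na(x)) and passing na.rm = TRUE to both mean() and sd().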

If I could produce the table above, I could then use pivot_longer() to get to df1, which is the final result.
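
A possible alternative, sketched here under stated assumptions, is to reshape first and summarise once: pivot x, y and z into a long format, drop the missing values, and then group by name, region and indicator so that n() is the per-indicator count. The names tidy_ci, n_obs, estimate and half are illustrative only, and the 95% t-interval is carried over from the attempt above rather than being a recommendation.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same data as in the question.
df <- tribble(
  ~"name", ~"region", ~x, ~y, ~z,
  "A", "reg1", 0, 1, 1,
  "A", "reg1", 1, 1, NA,
  "B", "reg1", 1, 0, 4,
  "C", "reg2", 1, 0, 2,
  "B", "reg2", 0, NA, 0,
  "C", "reg1", NA, 0, 5,
  "C", "reg1", 0, 1, 2,
  "B", "reg1", NA, 1, 3,
  "B", "reg2", 1, NA, NA,
  "A", "reg2", 1, 1, 1,
  "A", "reg2", 0, 1, 4,
  "A", "reg2", 1, 1, 2,
  "A", "reg1", 0, 1, 3,
)

tidy_ci <- df %>%
  # Put the 0/1 indicators on a percent scale; z keeps its own scale.
  # Assumes x and y contain only 0, 1 or NA, as stated in the question.
  mutate(across(c(x, y), ~ 100 * .x)) %>%
  pivot_longer(c(x, y, z), names_to = "Indicator", values_to = "value") %>%
  # After dropping NAs, n() is the number of values used per indicator.
  # Groups whose values are all NA disappear from the result.
  filter(!is.na(value)) %>%
  group_by(name, region, Indicator) %>%
  summarise(
    n_obs = n(),
    estimate = mean(value),
    # Interval half-width; NA when a group has a single observation.
    half = if (n_obs > 1) {
      qt(0.975, df = n_obs - 1) * sd(value) / sqrt(n_obs)
    } else {
      NA_real_
    },
    lower = estimate - half,
    upper = estimate + half,
    .groups = "drop"
  ) %>%
  select(-half)

tidy_ci
```

Because the reshape happens before the summary, the result is already in the long layout of df1 and no second pivot is needed.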
