Applying multiple summary functions across columns: summarise_all: 'list' object cannot be coerced to type 'double'

Question

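I am trying to create a table of descriptive statistics that gives, for every column of a data frame, the mean, the standard deviation, and the 10th, 50th and 90th percentiles. I then want to transpose the result so that the columns are the different statistics and each row is one variable from the dataset.

Here is an example dataset: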

dt <- data.frame(id = 1:100,
                 Numeric_Column_1 = rnorm(100),
                 Numeric_Column_2 = rnorm(100),
                 Numeric_Column_3 = rnorm(100),
                 Numeric_Column_4 = rnorm(100),
                 Numeric_Column_5 = rnorm(100))

And the code that should produce the table:


desc_table <- dt %>% select(-id)  %>%
  dplyr::summarise_all(.funs = list(mean=mean(.,na.rm=T), sd=sd(.,na.rm=T), P10=~quantile(., c(0.1), na.rm=T), P50=~quantile(., c(0.5), na.rm=T), P90=~quantile(., c(0.9), na.rm=T)), na.rm=TRUE) %>%
  pivot_longer(cols = everything()) %>%
  separate(name,c("Variable", "Stat"),sep = "_") %>%
  pivot_wider(names_from = "Stat", values_from = "value") %>%
  mutate(mean = round(mean, 2), sd= round(sd, 2))

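However, I get the following error:

Error in is.data.frame(x) : 'list' object cannot be coerced to type 'double'
In addition: Warning message:
In mean.default(., na.rm = T) : argument is not numeric or logical: returning NA

How can I fix this?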

Answer 1

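Try this, which updates your code to the modern idiom and changes the separator in the <colname><separator><statistic> naming pattern from "_" to "." so that it no longer clashes with the underscores in the column names (which may be the source of the error):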

dt %>%
  dplyr::summarise(
    across(
      -id,
      list(
        mean = \(x) mean(x, na.rm = TRUE), 
        sd = \(x) sd(x, na.rm = TRUE), 
        P10 = \(x) quantile(x, 0.1, na.rm = TRUE), 
        P50 = \(x) quantile(x, 0.5, na.rm = TRUE), 
        P90 = \(x) quantile(x, 0.9, na.rm = TRUE)
      ),
      .names = "{.col}.{.fn}"
    ) 
  ) %>%
  pivot_longer(
    everything(), 
    names_sep = "\\.", 
    names_to = c("Variable", "Stat")
  ) %>%
  pivot_wider(names_from = "Stat", values_from = "value") %>%
  mutate(mean = round(mean, 2), sd= round(sd, 2))
# A tibble: 5 × 6
  Variable          mean    sd   P10     P50   P90
  <chr>            <dbl> <dbl> <dbl>   <dbl> <dbl>
1 Numeric_Column_1 -0.04  0.94 -1.20 -0.0872  1.11
2 Numeric_Column_2 -0.15  1.03 -1.46 -0.107   1.07
3 Numeric_Column_3  0.11  1.01 -1.53  0.229   1.14
4 Numeric_Column_4  0.09  1.05 -1.17  0.103   1.53
5 Numeric_Column_5 -0.02  1.02 -1.34 -0.0238  1.38

Using .names in the call to across, together with names_sep in pivot_longer, removes the need for the separate step.
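
To illustrate why the choice of separator matters, here is a small base R sketch. The two name strings are written out by hand to mimic what the "_" and "." naming patterns would produce for one column; they are illustrative only. Splitting on "_" breaks the column name itself apart, while splitting on "." yields exactly a variable and a statistic.

```r
# Hand-written examples of the two naming patterns for one column
name_underscore <- "Numeric_Column_1_mean"
name_dot <- "Numeric_Column_1.mean"

# "_" also occurs inside the column name, so the split yields four pieces
pieces_underscore <- strsplit(name_underscore, "_", fixed = TRUE)[[1]]
print(pieces_underscore)

# "." occurs only between the column name and the statistic
pieces_dot <- strsplit(name_dot, ".", fixed = TRUE)[[1]]
print(pieces_dot)
```

This is why the pivot_longer call above can split the names cleanly into Variable and Stat.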

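In the longer term, it would be better to drop the last step of the pipe and replace it with knitr::kable(digits = 2). That keeps the full precision of the summary internally while formatting it to meet your display requirements.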
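
A minimal sketch of that idea follows, assuming the knitr package is installed. The data frame `stats` and its values are made up for illustration and stand in for the summary table produced by the pipe.

```r
library(knitr)

# Made-up summary values standing in for the real table
stats <- data.frame(
  Variable = c("Numeric_Column_1", "Numeric_Column_2"),
  mean = c(-0.04321, 0.1234),
  sd = c(0.9412, 1.0345)
)

# Rounding happens only in the formatted output
formatted <- kable(stats, digits = 2)
print(formatted)

# The underlying data still holds the unrounded values
print(stats$mean)
```

Because the rounding is applied at display time, any later calculation on `stats` still uses the full-precision numbers.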

Also, see this page for why you should use TRUE and FALSE rather than T and F.
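
One commonly cited reason (which may or may not be the one the linked page gives) is that T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words. The sketch below uses a made-up vector `x` to show how a reassigned T silently changes the result.

```r
x <- c(1, NA, 3)

# T is just a variable, so this assignment is legal
T <- 0

# na.rm = T now means na.rm = 0, so the NA is not removed
shadowed <- mean(x, na.rm = T)
print(shadowed)

# TRUE is a reserved word and always means what it says
safe <- mean(x, na.rm = TRUE)
print(safe)

# Remove the shadowing variable so T refers to the base value again
rm(T)
```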

Answer 2

You should use extract rather than separate, so that you can use a regular expression, and add ~ to the function calls inside across:

dt %>% 
  select(-id)  %>%
  summarise(across(everything(), list(mean = ~mean(., na.rm = TRUE),
                                      sd = ~sd(.,na.rm=TRUE), 
                                      P10 = ~quantile(., c(0.1), na.rm=TRUE), 
                                      P50 = ~quantile(., c(0.5), na.rm=TRUE), 
                                      P90 = ~quantile(., c(0.9), na.rm=TRUE)))) %>% 
  pivot_longer(cols = everything()) %>% 
  extract(name, into = c("Variable", "Stat"), regex =  "^([A-Z].*_\\d+)_(.*)")  %>% 
  pivot_wider(names_from = "Stat", values_from = "value") %>%
  mutate(mean = round(mean, 2), sd= round(sd, 2))

# A tibble: 5 × 6
  Variable          mean    sd   P10     P50   P90
  <chr>            <dbl> <dbl> <dbl>   <dbl> <dbl>
1 Numeric_Column_1  0.09  0.96 -1.17  0.0428  1.42
2 Numeric_Column_2  0.04  1.05 -1.09 -0.0829  1.42
3 Numeric_Column_3  0.09  1.05 -1.33  0.168   1.42
4 Numeric_Column_4  0     1.04 -1.29 -0.118   1.48
5 Numeric_Column_5  0.09  1.02 -1.11  0.0578  1.19
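
To see why the ~ matters, note that without it an expression such as sd(., na.rm = T) is not a function to be applied to each column. It is evaluated immediately, with the dot bound to the whole piped data frame. The base R sketch below, using a small made-up data frame `df`, calls mean and sd directly on a data frame and reproduces the warning and the error from the question.

```r
# A small made-up data frame standing in for the piped data
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# mean() on a data frame only warns and returns NA
mean_result <- suppressWarnings(mean(df, na.rm = TRUE))
print(mean_result)

# sd() on a data frame fails when it tries to coerce the list to double
sd_error <- tryCatch(
  sd(df, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
print(sd_error)

# Applied column by column, both functions work as intended
print(sapply(df, sd, na.rm = TRUE))
```

Writing ~sd(., na.rm = TRUE) or \(x) sd(x, na.rm = TRUE) defers evaluation, so that across can call the function once per column.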

Answer 3

library(dplyr)
library(tidyr)

set.seed(123)
dt <- data.frame(id = 1:100,
                 Numeric_Column_1 = rnorm(100),
                 Numeric_Column_2 = rnorm(100),
                 Numeric_Column_3 = rnorm(100),
                 Numeric_Column_4 = rnorm(100),
                 Numeric_Column_5 = rnorm(100))

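# Return all five statistics for a single column as a named list
my.summary <- \(x) list(mean = mean(x, na.rm = TRUE),
                        sd = sd(x, na.rm = TRUE),
                        P10 = quantile(x, c(0.1), na.rm = TRUE),
                        P50 = quantile(x, c(0.5), na.rm = TRUE),
                        P90 = quantile(x, c(0.9), na.rm = TRUE))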

dt %>% 
  pivot_longer(-id) %>% 
  summarise(stat = list(my.summary(value)), .by = name) %>% 
  unnest_wider(stat)

#> # A tibble: 5 × 6
#>   name                mean    sd   P10      P50   P90
#>   <chr>              <dbl> <dbl> <dbl>    <dbl> <dbl>
#> 1 Numeric_Column_1  0.0904 0.913 -1.07  0.0618   1.26
#> 2 Numeric_Column_2 -0.108  0.967 -1.29 -0.226    1.06
#> 3 Numeric_Column_3  0.120  0.950 -1.03  0.0359   1.55
#> 4 Numeric_Column_4 -0.0362 1.04  -1.34 -0.00351  1.24
#> 5 Numeric_Column_5  0.106  0.989 -1.18  0.165    1.30

Created on 2024-04-22 with reprex v2.0.2
