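I have a data frame, and for each row I want to randomly sample three columns (which three may differ from row to row) and take the mean of the three sampled values. As an added complication, many of my rows are entirely NA (and for other reasons I cannot remove them) or contain only 1 or 2 non-NA values. Based on this question and answer, I tried the following: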
df_new <- df %>%
rowwise %>%
mutate(inflo_mean = mean(sample(na.omit(c_across(everything())), 3)))
This does not work; I get an error from sample():
Error in `mutate()`:
ℹ In argument: `inflo_mean = mean(sample(na.omit(c_across(everything())), 3))`.
ℹ In row 1.
Caused by error in `sample.int()`:
! invalid first argument
I then tried breaking it into smaller steps and handling the different NA cases separately, and came up with the following:
df_new2 <- df %>%
rowwise() %>%
mutate(num_NAs = sum(!is.na(across(starts_with("Col_")))),
v_inflo = list(na.omit(c_across((starts_with("Col_"))))),
inflo_mean = case_when(num_NAs > 2 ~ mean(sample(v_inflo, 3)),
num_NAs == 2 ~ mean(v_inflo),
num_NAs == 1 ~ as.numeric(v_inflo),
num_NAs == 0 ~ NA_real_,
TRUE ~ NA_real_))
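Again, this does not work, and I get the same error. I checked the data types of the columns and they are all integer. What could be the problem here? Or is there another way to do this?
Example data: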
> dput(df)
structure(list(Col_1 = c(NA, 77L, 82L, 172L), Col_2 = c(NA, 79L,
NA, 135L), Col_3 = c(NA, 81L, NA, 131L), Col_4 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_5 = c(NA, NA, NA,
33L), Col_6 = c(NA, NA, NA, 104L), Col_7 = c(NA, NA, NA, 106L
), Col_8 = c(NA, NA, NA, 93L), Col_9 = c(NA, NA, NA, 50L), Col_10 = c(NA,
NA, NA, 48L), Col_11 = c(NA, NA, NA, 96L), Col_12 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_13 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_14 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_15 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
You can use mapply like this:
df$inflo_mean <-
mapply(
\(x, k) mean(sample(na.omit(c(x)), k)),
asplit(df, 1),
pmin(rowSums(!is.na(df)), 3)
)
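One caveat with sampling the values directly: as documented in ?sample, when x is a single number >= 1, sample(x, k) draws from 1:x rather than from x itself, so a row with exactly one non-NA value (such as 82) would probably yield a random number between 1 and 82. The following is a minimal sketch of a variant that samples positions instead. The helper name mean_of_sample and the small toy data frame are hypothetical, introduced only for illustration, and returning NA for all-NA rows is an assumption about the desired result.

```r
# Hypothetical helper (not part of the answer above): mean of up to `n`
# randomly chosen non-NA values of `x`, or NA if `x` has none.
mean_of_sample <- function(x, n = 3) {
  v <- x[!is.na(x)]
  if (length(v) == 0) return(NA_real_)
  # Sample positions, not values, so a lone value such as 82 is never
  # expanded to 1:82 by sample().
  idx <- sample.int(length(v), min(n, length(v)))
  mean(v[idx])
}

# Toy stand-in for the question's data (same pattern, fewer columns).
toy <- data.frame(
  Col_1 = c(NA, 77L, 82L, 172L),
  Col_2 = c(NA, 79L, NA, 135L),
  Col_3 = c(NA, 81L, NA, 131L),
  Col_4 = c(NA, NA, NA, 33L)
)

# Restrict to the Col_ columns so the call can be re-run safely.
cols <- grep("^Col_", names(toy))
toy$inflo_mean <- apply(toy[cols], 1, mean_of_sample)
```

With this sketch, the all-NA row gives NA, the row containing only 82 gives 82, and rows with three or more values give the mean of three randomly chosen ones.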
The problem is the rows that contain only NA values. If you use tryCatch() to catch the error and replace it with NA, your original code will work.
library(dplyr)
df <- structure(list(Col_1 = c(NA, 77L, 82L, 172L), Col_2 = c(NA, 79L,
NA, 135L), Col_3 = c(NA, 81L, NA, 131L), Col_4 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_5 = c(NA, NA, NA,
33L), Col_6 = c(NA, NA, NA, 104L), Col_7 = c(NA, NA, NA, 106L
), Col_8 = c(NA, NA, NA, 93L), Col_9 = c(NA, NA, NA, 50L), Col_10 = c(NA,
NA, NA, 48L), Col_11 = c(NA, NA, NA, 96L), Col_12 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_13 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_14 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), Col_15 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
df_new <- df %>%
rowwise %>%
mutate(inflo_mean = tryCatch(mean(sample(na.omit(c_across(everything())), 3)), error = function(e)NA))
df_new %>% select(inflo_mean, everything())
#> # A tibble: 4 × 16
#> # Rowwise:
#> inflo_mean Col_1 Col_2 Col_3 Col_4 Col_5 Col_6 Col_7 Col_8 Col_9 Col_10 Col_11
#> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 NA NA NA NA NA NA NA NA NA NA NA NA
#> 2 79 77 79 81 NA NA NA NA NA NA NA NA
#> 3 41.7 82 NA NA NA NA NA NA NA NA NA NA
#> 4 95 172 135 131 NA 33 104 106 93 50 48 96
#> # ℹ 4 more variables: Col_12 <int>, Col_13 <int>, Col_14 <int>, Col_15 <int>
Created on 2024-02-08 with reprex v2.0.2
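A note on the output above: row 3 contains a single non-NA value (82), yet its inflo_mean is 41.7. This is presumably because sample() treats a single number x >= 1 as 1:x (see ?sample), so no error is raised and tryCatch() never fires. Rows with exactly two non-NA values would raise an error and therefore become NA rather than the mean of the two values. The sketch below illustrates the three cases in isolation; the variable names are made up for the illustration.

```r
set.seed(42)  # only to make the sketch reproducible

# Zero values (an all-NA row): sample() raises an error, which is the
# case reported in the question.
msg_empty <- tryCatch(sample(integer(0), 3),
                      error = function(e) conditionMessage(e))

# Two values: also an error, so tryCatch() would turn such a row into NA.
msg_two <- tryCatch(sample(c(77L, 79L), 3),
                    error = function(e) conditionMessage(e))

# One value: no error. sample(82L, 3) behaves like sample(1:82, 3).
draws_one <- sample(82L, 3)

# Sampling positions instead of values avoids the special case.
v <- 82L
safe_one <- v[sample.int(length(v), min(3, length(v)))]
```

Here msg_empty and msg_two hold error messages, draws_one holds three numbers from 1:82, and safe_one is just 82. If single-value rows should return that value, indexing with sample.int() as in the last two lines is one possible way to do it.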