I have a data frame of the following form:
# Minimum example
> data.frame(variable = c("A", "B", "C", "A", "B", "C"),
+ quantity1 = c(2,4,5,4,6,7),
+ quantity2 = c(3,5,6,7,8,9),
+ group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B"))
variable quantity1 quantity2 group
1 A 2 3 G_A
2 B 4 5 G_A
3 C 5 6 G_A
4 A 4 7 G_B
5 B 6 8 G_B
6 C 7 9 G_B
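I want to create a dataset that summarises some new calculations. Specifically, for each given number (a new variable, threshold) and each group, I want two columns per quantity: 1) which values of "variable" have quantity1 (or quantity2) above the threshold, and 2) how many values of quantity1 (or quantity2) are above the threshold. The desired output: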
threshold quantity1values quantity1_nvalues quantity2values quantity2_nvalues group
1 2 B, C 2 A, B, C 3 G_A
2 2 A, B, C 3 A, B, C 3 G_B
3 4 C 1 B, C 2 G_A
4 4 B, C 2 A, B, C 3 G_B
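I can get there with a combination of list columns, strings, length() columns and grouped summaries. But I feel there should be a more efficient solution, perhaps a function that generalises this, so that the new values are computed automatically if I add a new threshold or have more than two quantities (e.g. quantity1, quantity2 and quantity3). Getting the previous data frame in long format would also be fine. Base R or tidyverse solutions are appreciated!
Update
I managed to get most of what I want with the following code: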
library(dplyr)
library(tidyr)
# Minimum example
data <- data.frame(variable = c("A", "B", "C", "A", "B", "C"),
quantity1 = c(2,4,5,4,6,7),
quantity2 = c(3,5,6,7,8,9),
group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B"))
data |>
select(variable, quantity1, quantity2, group) |>
mutate(threshold1 = 2,
threshold2 = 4) |>
pivot_longer(cols = starts_with("thres"),
names_to = "threshold", values_to = "threshold_value") |>
pivot_longer(cols = starts_with("quant"),
names_to = "quantity", values_to = "quantity_value") |>
group_by(group, threshold, quantity) |>
mutate(above_threshold = sum(quantity_value > threshold_value))
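However, two problems remain: 1) How do I get the information on which specific values of "variable" are above the threshold, for example as a string? 2) My original question was about efficiency (by which I mean brevity and generality of the code). I guess generality is no longer an issue, since extending the previous code to more thresholds is straightforward.
Here is an approach using purrr (part of the tidyverse).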
library(tidyverse)
# Toy data --------------------
my_df <- tibble::tribble(
~var, ~q1, ~q2, ~group, ~threshold,
"A", 2, 3, "G_A", 2L,
"B", 4, 5, "G_A", 3L,
"C", 5, 6, "G_A", 0L,
"A", 4, 7, "G_B", 5L,
"B", 6, 8, "G_B", 7L,
"C", 7, 9, "G_B", 4L)
# Select all `q` columns and prefix them with "pos_"
new_df <- rename_with(select(my_df, starts_with("q")), \(col) str_glue("pos_{col}"))
# Find where `q` is greater than `threshold` (`pos_` columns are list columns of integer row positions)
new_df <- map_dfr(my_df$threshold, \(t) map(new_df, \(q_col) list(which(q_col > t))))
new_df <- bind_cols(my_df, new_df)
# Calculate how many `q`s are greater than each `threshold` and with which values
new_df <- mutate(
new_df,
across(
starts_with("pos"),
.fns = list(
n = \(pos_q) map_int(pos_q, length),
val = \(pos_q) map2_chr(pos_q, list(var), \(pos, var) str_flatten_comma(var[pos]))),
.names = "{.fn}_{str_extract(.col, 'q.$')}"))
# Just sorting columns
new_df <- select(new_df, var, group, threshold, matches("\\d$"), -starts_with("pos"))
Here is the output:
# Output
new_df
# A tibble: 6 × 9
var group threshold q1 q2 n_q1 val_q1 n_q2 val_q2
<chr> <chr> <int> <dbl> <dbl> <int> <chr> <int> <chr>
1 A G_A 2 2 3 5 "B, C, A, B, C" 6 A, B, C, A, B, C
2 B G_A 3 4 5 5 "B, C, A, B, C" 5 B, C, A, B, C
3 C G_A 0 5 6 6 "A, B, C, A, B, C" 6 A, B, C, A, B, C
4 A G_B 5 4 7 2 "B, C" 4 C, A, B, C
5 B G_B 7 6 8 0 "" 2 B, C
6 C G_B 4 7 9 3 "C, B, C" 5 B, C, A, B, C
If the empty strings ("") in the val_q columns are undesirable, just add mutate(across(starts_with("val_q"), \(x) if_else(x == "", NA_character_, x))).
Hope that helps!
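As a possible shorthand for the same replacement, dplyr::na_if() turns a given value into NA. This is only a sketch on a made-up two-row tibble (`toy` and its contents are illustrative, not taken from the output above):

```r
library(dplyr)

# Hypothetical stand-in for the val_q columns of new_df
toy <- tibble(
  val_q1 = c("B, C", ""),
  val_q2 = c("", "A")
)

# Replace empty strings with NA in every val_q column
toy <- mutate(toy, across(starts_with("val_q"), \(x) na_if(x, "")))
toy
```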
Created on 2024-04-30 with reprex v2.1.0
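Note that the code above compares each threshold against all rows, regardless of group. If the per-group table from the question is what is needed, one alternative is to go long first and summarise by threshold, group and quantity. The following is a rough sketch, not the method used above: the helper name `summarise_above` is made up for illustration, and it assumes every quantity column starts with "quantity".

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  variable  = c("A", "B", "C", "A", "B", "C"),
  quantity1 = c(2, 4, 5, 4, 6, 7),
  quantity2 = c(3, 5, 6, 7, 8, 9),
  group     = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B")
)

# Hypothetical helper: one row per threshold and group,
# two columns (values, nvalues) per quantity column
summarise_above <- function(df, thresholds) {
  df |>
    # one row per (variable, group, quantity)
    pivot_longer(starts_with("quantity"),
                 names_to = "quantity", values_to = "value") |>
    # repeat every row once per threshold
    expand_grid(threshold = thresholds) |>
    group_by(threshold, group, quantity) |>
    summarise(
      values  = paste(variable[value > threshold], collapse = ", "),
      nvalues = sum(value > threshold),
      .groups = "drop"
    ) |>
    # back to wide: quantity1_values, quantity1_nvalues, ...
    pivot_wider(names_from = quantity,
                values_from = c(values, nvalues),
                names_glue = "{quantity}_{.value}")
}

result <- summarise_above(data, thresholds = c(2, 4))
result
```

Adding a threshold means extending the `thresholds` vector, and a quantity3 column would be picked up by `starts_with("quantity")` without further changes. Dropping the final `pivot_wider()` step leaves the result in long format.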