How do I compute which values, and how many values, of a given variable satisfy a condition on another variable?

Question

I have a data frame of the following form:

# Minimum example
> data.frame(variable = c("A", "B", "C", "A", "B", "C"),
+            quantity1 = c(2,4,5,4,6,7),
+            quantity2 = c(3,5,6,7,8,9),
+            group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B"))
  variable quantity1 quantity2 group
1        A         2         3   G_A
2        B         4         5   G_A
3        C         5         6   G_A
4        A         4         7   G_B
5        B         6         8   G_B
6        C         7         9   G_B

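I would like to create a dataset containing a summary of new calculations. Specifically, for each group and for each given number (a new variable: threshold), I want columns indicating 1) which values in "variable" have quantity1 and quantity2 above that number, and 2) how many values of quantity1 and quantity2 are above that number. The expected output looks like this: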

  threshold quantity1values quantity1_nvalues quantity2values quantity2_nvalues group
1         2            B, C                 2         A, B, C                 3   G_A
2         2         A, B, C                 3         A, B, C                 3   G_B
3         4               C                 1            B, C                 2   G_A
4         4            B, C                 2         A, B, C                 3   G_B

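I can get there with a combination of list-columns, strings, length() columns, and grouped summaries. But I feel there should be an efficient solution, perhaps a function that generalizes this, so that the new values are computed if I add a new threshold or have more than two quantities (i.e., quantity1, quantity2, and quantity3). Getting the previous data frame in long format would also be fine. Base R or tidyverse solutions are appreciated!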

Update

I managed to get most of what I wanted with the following code:

# Minimum example
library(dplyr)  # select(), mutate(), group_by()
library(tidyr)  # pivot_longer()
data <- data.frame(variable = c("A", "B", "C", "A", "B", "C"),
           quantity1 = c(2,4,5,4,6,7),
           quantity2 = c(3,5,6,7,8,9),
           group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B")) 
data |> 
  select(variable, quantity1, quantity2, group) |> 
  mutate(threshold1 = 2,
         threshold2 = 4) |> 
  pivot_longer(cols = starts_with("thres"),
               names_to = "threshold", values_to = "threshold_value") |> 
  pivot_longer(cols = starts_with("quant"),
               names_to = "quantity", values_to = "quantity_value") |> 
  group_by(group, threshold, quantity) |>
  mutate(above_threshold = sum(quantity_value > threshold_value))

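However, two issues remain: 1) How can I get the information on which specific values in "variable" are above the threshold, for example as a string? 2) My original question was about efficiency (by which I mean brevity and generality of the code). I guess generality is no longer an issue, since extending the previous code to more thresholds is straightforward.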

Answer 1

Here is a tidyverse::purrr approach.

library(tidyverse)

# Toy data --------------------
my_df <- tibble::tribble(
  ~var, ~q1, ~q2, ~group, ~threshold,
  "A",   2,   3,  "G_A",        2L,
  "B",   4,   5,  "G_A",        3L,
  "C",   5,   6,  "G_A",        0L,
  "A",   4,   7,  "G_B",        5L,
  "B",   6,   8,  "G_B",        7L,
  "C",   7,   9,  "G_B",        4L)

# Select all `q` columns and prefix them with "pos_"
new_df <- rename_with(select(my_df, starts_with("q")), \(col) str_glue("pos_{col}"))

# Find where `q` is greater than `threshold` (`pos_` columns are list-columns of integer row positions)
new_df <- map_dfr(my_df$threshold, \(t) map(new_df, \(q_col) list(which(q_col > t)))) 
new_df <- bind_cols(my_df, new_df) 

# Calculate how many `q`s are greater than each `threshold` and with which values
new_df <- mutate(
  new_df,
  across(
    starts_with("pos"),
    .fns = list(
        n = \(pos_q)  map_int(pos_q, length),
      val = \(pos_q) map2_chr(pos_q, list(var), \(pos, var) str_flatten_comma(var[pos]))), 
    .names = "{.fn}_{str_extract(.col, 'q.$')}"))

# Just sorting columns
new_df <- select(new_df, var, group, threshold, matches("\\d$"), -starts_with("pos"))

Here is the output:

# Output
> new_df
# A tibble: 6 × 9
  var   group threshold    q1    q2  n_q1 val_q1              n_q2 val_q2          
  <chr> <chr>     <int> <dbl> <dbl> <int> <chr>              <int> <chr>           
1 A     G_A           2     2     3     5 "B, C, A, B, C"        6 A, B, C, A, B, C
2 B     G_A           3     4     5     5 "B, C, A, B, C"        5 B, C, A, B, C   
3 C     G_A           0     5     6     6 "A, B, C, A, B, C"     6 A, B, C, A, B, C
4 A     G_B           5     4     7     2 "B, C"                 4 C, A, B, C      
5 B     G_B           7     6     8     0 ""                     2 B, C            
6 C     G_B           4     7     9     3 "C, B, C"              5 B, C, A, B, C

If the "" in the val_q columns is undesirable, just replace it with NA:

new_df <- mutate(new_df, across(starts_with("val_q"), \(x) if_else(x == "", NA_character_, x)))

Hope that helps!

Created on 2024-04-30 with reprex v2.1.0
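
If the summary should instead be computed within each group and for a fixed set of thresholds, as in the expected table in the question, one possible sketch is shown below. It assumes dplyr >= 1.1.0 (for cross_join() and .by) and tidyr >= 1.2.0 (for names_vary). summarise_above() and its prefix argument are hypothetical names introduced here for illustration; they are not part of the answer above or of any package.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  variable  = c("A", "B", "C", "A", "B", "C"),
  quantity1 = c(2, 4, 5, 4, 6, 7),
  quantity2 = c(3, 5, 6, 7, 8, 9),
  group     = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B")
)

# Hypothetical helper: one row per threshold and group, with a
# `<quantity>_values` and `<quantity>_nvalues` column for every column
# whose name starts with `prefix`.
summarise_above <- function(df, thresholds, prefix = "quantity") {
  df |>
    # Long format: one row per (variable, group, quantity)
    pivot_longer(cols = starts_with(prefix),
                 names_to = "quantity", values_to = "value") |>
    # Pair every row with every threshold
    cross_join(data.frame(threshold = thresholds)) |>
    summarise(
      values  = paste(variable[value > threshold], collapse = ", "),
      nvalues = sum(value > threshold),
      .by = c(threshold, group, quantity)
    ) |>
    # Back to one row per threshold and group
    pivot_wider(
      names_from  = quantity,
      values_from = c(values, nvalues),
      names_glue  = "{quantity}_{.value}",
      names_vary  = "slowest"
    ) |>
    arrange(threshold, group)
}

result <- summarise_above(data, thresholds = c(2, 4))
result
```

With this shape, adding a threshold only means extending the thresholds vector, and a quantity3 column would be picked up by the prefix match. Dropping the pivot_wider() step leaves the result in long format.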
