How do I create a flag based on multiple conditions?


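I have a data frame with 2 columns (ID and Year). I want to create a third column called "Flag" whose value is based on the following conditions (all evaluated per ID):

  • If data exists for 2020 or 2021 (but not both), output "Gap"
  • If data exists for both 2020 and 2021, output "Ap21"
  • If data exists for 2020, 2021 and 2022, output "Ap22"
  • If data exists for neither 2020 nor 2021 (or in any other case not covered by the 3 conditions above), output "notupdated"

Here is an example of what I want the data frame to look like in the end.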

  data <- data.frame(
    "ID" = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D", "D"),
    "Year" = c(2019, 2021, 2022, 2019, 2020, 2021, 2019, 2020, 2021, 2022, 2018, 2019),
    "Flag" = c("Gap", "Gap", "Gap", "Ap21", "Ap21", "Ap21", "Ap22", "Ap22", "Ap22", "Ap22",
               "notupdated", "notupdated")
  )
r dataframe
3 answers
5
votes

A dplyr solution:

library(dplyr)

data |>
  mutate(
    Flag = case_when(
      all(c(2020, 2021, 2022) %in% Year) ~ "Ap22",
      all(c(2020, 2021) %in% Year) ~ "Ap21",
      any(c(2020, 2021) %in% Year) ~ "Gap",
      .default = "notupdated"
    ),
    .by = ID
  )

Output:

   ID Year       Flag
1   A 2019        Gap
2   A 2021        Gap
3   A 2022        Gap
4   B 2019       Ap21
5   B 2020       Ap21
6   B 2021       Ap21
7   C 2019       Ap22
8   C 2020       Ap22
9   C 2021       Ap22
10  C 2022       Ap22
11  D 2018 notupdated
12  D 2019 notupdated
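
The order of the branches matters here: case_when() returns the value of the first condition that is TRUE, so the most specific rule ("Ap22") has to be tested before "Ap21", and "Ap21" before "Gap". As a rough illustration, the same first-match-wins logic can be sketched in base R. The helper name flag_for_years and the variable df are made up for this sketch and are not part of the answer.

```r
# Hypothetical helper (not from the original answer): returns one flag for
# the years belonging to a single ID. Branches run from the most specific
# rule to the least specific one, mirroring the case_when() order.
flag_for_years <- function(years) {
  if (all(c(2020, 2021, 2022) %in% years)) {
    "Ap22"
  } else if (all(c(2020, 2021) %in% years)) {
    "Ap21"
  } else if (any(c(2020, 2021) %in% years)) {
    "Gap"
  } else {
    "notupdated"
  }
}

# Same sample data as in the question, without the expected Flag column.
df <- data.frame(
  ID   = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D", "D"),
  Year = c(2019, 2021, 2022, 2019, 2020, 2021, 2019, 2020, 2021, 2022, 2018, 2019)
)

# Compute one flag per ID, then look that flag up for every row.
flags_by_id <- vapply(split(df$Year, df$ID), flag_for_years, character(1))
df$Flag <- unname(flags_by_id[as.character(df$ID)])
```

If the "Gap" branch were tested first, IDs B and C would also be flagged "Gap", because any() is TRUE for them as well.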

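1
vote

If the rules should work independently of one another, here is an alternative.
In particular, the rule "if data exists for 2020 or 2021 (but not both), output "Gap"" needs more than just the following, which is also TRUE when both years are present: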

any(c(2020, 2021) %in% Year)

Example:

library(dplyr)

data %>% 
  mutate(one = Year == 2020, two = Year == 2021, three = Year == 2022, 
         Flag = case_when(
                  (!any(one) | !any(two)) & (any(one) | any(two)) ~ "Gap"), 
         .by = ID) %>%
  select(-c(one, two, three))
   ID Year Flag
1   A 2019  Gap
2   A 2021  Gap
3   A 2022  Gap
4   B 2019 <NA>
5   B 2020 <NA>
6   B 2021 <NA>
7   C 2019 <NA>
8   C 2020 <NA>
9   C 2021 <NA>
10  C 2022 <NA>
11  D 2018 <NA>
12  D 2019 <NA>
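
The "one but not both" condition is an exclusive or, so it could probably also be written with base R's xor(). The following is only a sketch of that variant, not the answer's own code; the names df and gap_only are made up for the illustration, and .by needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Same sample data as in the question, without the expected Flag column.
df <- data.frame(
  ID   = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D", "D"),
  Year = c(2019, 2021, 2022, 2019, 2020, 2021, 2019, 2020, 2021, 2022, 2018, 2019)
)

# xor() is TRUE when exactly one of its two arguments is TRUE, which is
# the "2020 or 2021, but not both" rule evaluated once per ID.
gap_only <- df |>
  mutate(
    Flag = case_when(xor(2020 %in% Year, 2021 %in% Year) ~ "Gap"),
    .by = ID
  )
```

Rows that match no branch get NA, as in the example above, because no default value is supplied.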

Applied to all rules:

library(dplyr)

data %>% 
  mutate(one = Year == 2020, two = Year == 2021, three = Year == 2022, 
         Flag = case_when(
                  (!any(one) | !any(two)) & (any(one) | any(two)) ~ "Gap",
                  any(one) & any(two) & !any(three) ~ "Ap21", 
                  any(one) & any(two) & any(three) ~ "Ap22", 
                  !any(one) & !any(two) & !any(three) ~ "notupdated"), .by = ID) %>% 
  select(-c(one, two, three))
   ID Year       Flag
1   A 2019        Gap
2   A 2021        Gap
3   A 2022        Gap
4   B 2019       Ap21
5   B 2020       Ap21
6   B 2021       Ap21
7   C 2019       Ap22
8   C 2020       Ap22
9   C 2021       Ap22
10  C 2022       Ap22
11  D 2018 notupdated
12  D 2019 notupdated
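
One thing to watch with fully explicit rules: an ID that matches none of the four conditions gets NA. For instance, an ID with data for 2022 only fails the last condition because any(three) is TRUE. The question asks for "notupdated" in every remaining case, so a catch-all may be closer to the intent. A minimal sketch, assuming dplyr 1.1.0 or later for .default and .by; the ID "E" and the names edge and edge_flagged are invented for this illustration.

```r
library(dplyr)

# Hypothetical ID "E" (not in the original data) with data for 2022 only.
edge <- data.frame(ID = c("D", "D", "E"), Year = c(2018, 2019, 2022))

edge_flagged <- edge |>
  mutate(
    one = Year == 2020, two = Year == 2021, three = Year == 2022,
    Flag = case_when(
      (!any(one) | !any(two)) & (any(one) | any(two)) ~ "Gap",
      any(one) & any(two) & !any(three) ~ "Ap21",
      any(one) & any(two) & any(three) ~ "Ap22",
      .default = "notupdated"  # catches every remaining case, including "E"
    ),
    .by = ID
  ) |>
  select(-c(one, two, three))
```

With the explicit fourth condition from the answer, the row for "E" would be NA instead of "notupdated".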

0
votes

I think @jan's answer is great; this is just another way of reasoning about the problem, and a more complicated one.

library(cgwtools) # provides an rle for sequences, `seqle`, very handy...
length(cgwtools::seqle(data$Year[which(data$ID == 'A')])$lengths)
[1] 2 
# any gap is a gap
(cgwtools::seqle(data$Year[which(data$ID == 'B')])$values)+(cgwtools::seqle(data$Year[which(data$ID == 'B')])$lengths) -1
[1] 2021
(cgwtools::seqle(data$Year[which(data$ID == 'C')])$values)+(cgwtools::seqle(data$Year[which(data$ID == 'C')])$lengths) -1
[1] 2022
(cgwtools::seqle(data$Year[which(data$ID == 'D')])$values)+(cgwtools::seqle(data$Year[which(data$ID == 'D')])$lengths) -1
[1] 2019
# and use for Flag logic

It's not base R, but seqle really does come in handy.
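
The answer stops at "use for Flag logic". One possible way to finish the idea is sketched below in base R, so it does not need cgwtools. The helpers run_ends and flag_from_runs are hypothetical names, and the mapping from runs to flags is an assumption that reproduces the expected flags for the sample data only. It is not equivalent to the rules in the question: for example, an ID with only 2021 and 2022 would come out as "Ap22" here, and an ID with 2018 plus 2020 to 2022 would come out as "Gap".

```r
# Hypothetical helper: last year of each run of consecutive years.
run_ends <- function(years) {
  years <- sort(unique(years))
  # A new run starts wherever the step from the previous year is not 1.
  run_id <- cumsum(c(TRUE, diff(years) != 1))
  unname(vapply(split(years, run_id), max, numeric(1)))
}

# Hypothetical mapping from runs to flags (an assumption, see the text above).
flag_from_runs <- function(years) {
  ends <- run_ends(years)
  if (length(ends) > 1) return("Gap")  # "any gap is a gap"
  if (ends == 2022) return("Ap22")
  if (ends == 2021) return("Ap21")
  "notupdated"
}

# Same sample data as in the question, without the expected Flag column.
df <- data.frame(
  ID   = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D", "D"),
  Year = c(2019, 2021, 2022, 2019, 2020, 2021, 2019, 2020, 2021, 2022, 2018, 2019)
)

# One flag per ID, looked up for every row.
flags_by_id <- vapply(split(df$Year, df$ID), flag_from_runs, character(1))
df$Flag <- unname(flags_by_id[as.character(df$ID)])
```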
