I'm new to dplyr and can't figure out what I'm doing wrong. I have the following dataset:
INSTRUMENT_USED Year UniqueCount
1 QUEST_A 2015 1
2 QUEST_A 2016 1
3 QUEST_A 2017 1
4 QUEST_A 2018 1
5 QUEST_A 2019 1
6 QUEST_A 2020 1
7 QUEST_A 2021 0
8 QUEST_A 2022 0
9 QUEST_A 2023 0
10 QUEST_B 2015 1
11 QUEST_B 2016 1
12 QUEST_B 2017 1
13 QUEST_B 2018 1
14 QUEST_B 2019 0
15 QUEST_B 2020 0
16 QUEST_B 2021 1
17 QUEST_B 2022 0
18 QUEST_B 2023 0
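I want to create a variable "dbreak" that indicates a break when the following two conditions are met:
So, in the example above, "dbreak" should be "No" everywhere except for QUEST_A 2021, where it should be "Yes".
I managed to get it working manually, but as soon as I try to make the `summarise` conditional on the variable `Year`, it no longer seems to work.
When I run the code below, I get the expected result: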
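df <- df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED, Year) %>%
  mutate(
    # previous year's value within each instrument
    prev = lag(UniqueCount),
    # break: 0 now, 1 the year before, all 1 through 2020, all 0 from 2021 on
    dbreak = ifelse(UniqueCount == 0 & prev == 1 &
                      all(UniqueCount[Year <= 2020] == 1) &
                      all(UniqueCount[Year >= 2021] == 0), "Yes", "No"))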
INSTRUMENT_USED Year UniqueCount prev dbreak
<fct> <dbl> <dbl> <dbl> <chr>
1 QUEST_A 2015 1 NA No
2 QUEST_A 2016 1 1 No
3 QUEST_A 2017 1 1 No
4 QUEST_A 2018 1 1 No
5 QUEST_A 2019 1 1 No
6 QUEST_A 2020 1 1 No
7 QUEST_A 2021 0 1 Yes
8 QUEST_A 2022 0 0 No
9 QUEST_A 2023 0 0 No
10 QUEST_B 2015 1 NA No
11 QUEST_B 2016 1 1 No
12 QUEST_B 2017 1 1 No
13 QUEST_B 2018 1 1 No
14 QUEST_B 2019 0 1 No
15 QUEST_B 2020 0 0 No
16 QUEST_B 2021 1 0 No
17 QUEST_B 2022 0 1 No
18 QUEST_B 2023 0 0 No
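But when I replace the hard-coded years with the `Year` variable, it no longer works, or rather, I can't find what is wrong with my condition. If I keep the condition exactly the same, everything evaluates to "No". If I drop the "=" sign in the second `all()` statement, it also flags two breaks for QUEST_B (2019 and 2022):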
df <- df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED, Year) %>%
  mutate(
    prev = lag(UniqueCount),
    dbreak = ifelse(UniqueCount == 0 & prev == 1 &
                      all(UniqueCount[Year <= Year - 1] == 1) &
                      all(UniqueCount[Year > Year] == 0), "Yes", "No"))
INSTRUMENT_USED Year UniqueCount prev dbreak
<fct> <dbl> <dbl> <dbl> <chr>
1 QUEST_A 2015 1 NA No
2 QUEST_A 2016 1 1 No
3 QUEST_A 2017 1 1 No
4 QUEST_A 2018 1 1 No
5 QUEST_A 2019 1 1 No
6 QUEST_A 2020 1 1 No
7 QUEST_A 2021 0 1 Yes
8 QUEST_A 2022 0 0 No
9 QUEST_A 2023 0 0 No
10 QUEST_B 2015 1 NA No
11 QUEST_B 2016 1 1 No
12 QUEST_B 2017 1 1 No
13 QUEST_B 2018 1 1 No
14 QUEST_B 2019 0 1 Yes
15 QUEST_B 2020 0 0 No
16 QUEST_B 2021 1 0 No
17 QUEST_B 2022 0 1 Yes
18 QUEST_B 2023 0 0 No
Any ideas?
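A likely reason the second attempt misbehaves: inside `mutate()`, `Year` is the whole column for the current group, not the current row's value. `Year <= Year - 1` and `Year > Year` are therefore element-wise comparisons that are never TRUE, the subset is empty, and `all()` of an empty vector is TRUE, so only the `UniqueCount == 0 & prev == 1` part remains. With `Year >= Year` every element is selected, so `all(UniqueCount == 0)` is FALSE for both groups. The sketch below shows this and one possible per-row version of the condition; the data frame is retyped from the question, and `res` and `clean_split` are hypothetical names used only for illustration.

```r
library(dplyr)

# Example data retyped from the question
df <- data.frame(
  INSTRUMENT_USED = rep(c("QUEST_A", "QUEST_B"), each = 9),
  Year = rep(2015:2023, times = 2),
  UniqueCount = c(1, 1, 1, 1, 1, 1, 0, 0, 0,
                  1, 1, 1, 1, 0, 0, 1, 0, 0)
)

# Year is a whole vector, so the comparison is element-wise and never TRUE;
# all() of an empty selection is TRUE.
yr <- 2015:2023
any(yr <= yr - 1)  # FALSE
all(logical(0))    # TRUE

res <- df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED, Year) %>%
  mutate(
    prev = lag(UniqueCount),
    # clean_split (hypothetical helper column): for each row's year y,
    # every earlier year is 1 and every year from y onward is 0
    clean_split = vapply(
      Year,
      function(y) all(UniqueCount[Year < y] == 1) && all(UniqueCount[Year >= y] == 0),
      logical(1)
    ),
    # !is.na(prev) keeps the first row of each group from producing NA
    dbreak = ifelse(UniqueCount == 0 & !is.na(prev) & prev == 1 & clean_split,
                    "Yes", "No")
  ) %>%
  ungroup()
```

On the example data this should flag only QUEST_A 2021. The `vapply()` loop is quadratic in the group size, which is fine for a handful of years per instrument but may be slow for long series.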
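I would compute the lead difference as a helper column, then test whether (a) exactly one difference equals 1, (b) all other differences are 0, and (c) the current row's difference is 1. If so, 'Yes', otherwise 'No':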
df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED, Year) %>%
  mutate(
    diff = UniqueCount - lead(UniqueCount, default = 0),
    dbreak = ifelse(
      sum(diff == 1) == 1 & sum(diff == 0) == (n() - 1) & diff == 1,
      "Yes", "No"
    )
  )
# # A tibble: 18 × 5
# # Groups: INSTRUMENT_USED [2]
# INSTRUMENT_USED Year UniqueCount diff dbreak
# <chr> <int> <int> <int> <chr>
# 1 QUEST_A 2015 1 0 No
# 2 QUEST_A 2016 1 0 No
# 3 QUEST_A 2017 1 0 No
# 4 QUEST_A 2018 1 0 No
# 5 QUEST_A 2019 1 0 No
# 6 QUEST_A 2020 1 1 Yes
# 7 QUEST_A 2021 0 0 No
# 8 QUEST_A 2022 0 0 No
# 9 QUEST_A 2023 0 0 No
# 10 QUEST_B 2015 1 0 No
# 11 QUEST_B 2016 1 0 No
# 12 QUEST_B 2017 1 0 No
# 13 QUEST_B 2018 1 1 No
# 14 QUEST_B 2019 0 0 No
# 15 QUEST_B 2020 0 -1 No
# 16 QUEST_B 2021 1 1 No
# 17 QUEST_B 2022 0 0 No
# 18 QUEST_B 2023 0 0 No
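Note that with `lead()` the flag lands on the last year before the drop (2020 for QUEST_A), whereas the question asks for the first zero year (2021). A possible variant, sketched here under the assumption that the same three tests are wanted, takes the difference against the previous row with `lag()` instead. The data frame is retyped from the question and `res` is a hypothetical name.

```r
library(dplyr)

# Example data retyped from the question
df <- data.frame(
  INSTRUMENT_USED = rep(c("QUEST_A", "QUEST_B"), each = 9),
  Year = rep(2015:2023, times = 2),
  UniqueCount = c(1, 1, 1, 1, 1, 1, 0, 0, 0,
                  1, 1, 1, 1, 0, 0, 1, 0, 0)
)

res <- df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED, Year) %>%
  mutate(
    # previous value minus current value; the first row is compared with
    # itself, so its difference is 0
    diff = lag(UniqueCount, default = first(UniqueCount)) - UniqueCount,
    # exactly one drop from 1 to 0, no other change, and this row is the drop
    dbreak = ifelse(
      sum(diff == 1) == 1 & sum(diff == 0) == (n() - 1) & diff == 1,
      "Yes", "No"
    )
  ) %>%
  ungroup()
```

On the example data this should mark QUEST_A 2021 as "Yes" and everything else as "No". Using the group's first value as the `lag()` default also means a group that is 1 throughout is not flagged.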
See if this works for your dataset.
df %>%
mutate(dbreak = if_else(max(consecutive_id(UniqueCount)) == 2 &
lag(UniqueCount, default=UniqueCount[1]) != UniqueCount &
UniqueCount[1] == 1, "Yes", "No"),
.by = INSTRUMENT_USED)
INSTRUMENT_USED Year UniqueCount dbreak
1 QUEST_A 2015 1 No
2 QUEST_A 2016 1 No
3 QUEST_A 2017 1 No
4 QUEST_A 2018 1 No
5 QUEST_A 2019 1 No
6 QUEST_A 2020 1 No
7 QUEST_A 2021 0 Yes
8 QUEST_A 2022 0 No
9 QUEST_A 2023 0 No
10 QUEST_B 2015 1 No
11 QUEST_B 2016 1 No
12 QUEST_B 2017 1 No
13 QUEST_B 2018 1 No
14 QUEST_B 2019 0 No
15 QUEST_B 2020 0 No
16 QUEST_B 2021 1 No
17 QUEST_B 2022 0 No
18 QUEST_B 2023 0 No
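To see why `max(consecutive_id(UniqueCount)) == 2` can stand in for "the series switches exactly once", it may help to look at the run ids on the two example series. This is only an illustration; `a` and `b` are hypothetical names holding the values from the question, and `consecutive_id()` and `.by` need dplyr 1.1.0 or later.

```r
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

a <- c(1, 1, 1, 1, 1, 1, 0, 0, 0)  # QUEST_A, 2015 to 2023
b <- c(1, 1, 1, 1, 0, 0, 1, 0, 0)  # QUEST_B, 2015 to 2023

# The id increases by one every time the value changes
consecutive_id(a)  # 1 1 1 1 1 1 2 2 2
consecutive_id(b)  # 1 1 1 1 2 2 3 4 4

# Exactly two runs means exactly one switch
max(consecutive_id(a)) == 2  # TRUE
max(consecutive_id(b)) == 2  # FALSE
```

Combined with `UniqueCount[1] == 1`, two runs can only mean a block of 1s followed by a block of 0s, and the `lag()` comparison then picks out the row where the value changes.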