I have the following df:
df <- tibble(country = c("US", "US", "US", "US", "US", "US", "US", "US", "US", "Mex", "Mex"),
year = c(1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2000, 2001),
score = c(NA, NA, NA, NA, 426, NA, NA, 430, NA, 450, NA))
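What I want to do: create a new variable, years_from_implementation, that is 0 in the first year in which a country has a non-NA value for score, and that gives, for every other row, the number of years away from that 0 (negative before, positive after).
In other words, hard-coded, I would like it to return the following df: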
df <- tibble(country = c("US", "US", "US", "US", "US", "US", "US", "US", "US", "Mex", "Mex"),
year = c(1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2000, 2001),
score = c(NA, NA, NA, NA, 426, NA, NA, 430, NA, 450, NA),
years_from_implementation = c(-4,-3,-2,-1,0,1,2,3,4,0,1))
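All of this should be done grouped by country.
I tried combining df <- mutate(df, before_after = case_when(!is.na(score) ~ 0)) with the fill command, but could not get anywhere.
A tidyverse solution would be preferred, but really any help would be much appreciated.
Thanks in advance!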
Here is a dplyr option:
library(dplyr)
df %>%
group_by(country) %>%
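# subtract the row position of the first non-NA score from each row's position;
# indexing on !is.na(score) avoids multiple matches if the same score value repeats
mutate(years_from_implementation = row_number() - which(!is.na(score))[1]) %>%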
ungroup()
# # A tibble: 11 x 4
#    country  year score years_from_implementation
#    <chr>   <dbl> <dbl>                     <int>
#  1 US       1999    NA                        -4
#  2 US       2000    NA                        -3
#  3 US       2001    NA                        -2
#  4 US       2002    NA                        -1
#  5 US       2003   426                         0
#  6 US       2004    NA                         1
#  7 US       2005    NA                         2
#  8 US       2006   430                         3
#  9 US       2007    NA                         4
# 10 Mex      2000   450                         0
# 11 Mex      2001    NA                         1
We can find the row index of the first non-NA score and then, for each group, create a sequence from 1 - index to n() - index.
library(dplyr)
df %>%
group_by(country) %>%
mutate(index = which.max(!is.na(score)),
years_from_implementation = (1 - index[1]):(n() - index[1])) %>%
select(-index)
#    country  year score years_from_implementation
#    <chr>   <dbl> <dbl>                     <int>
#  1 US       1999    NA                        -4
#  2 US       2000    NA                        -3
#  3 US       2001    NA                        -2
#  4 US       2002    NA                        -1
#  5 US       2003   426                         0
#  6 US       2004    NA                         1
#  7 US       2005    NA                         2
#  8 US       2006   430                         3
#  9 US       2007    NA                         4
# 10 Mex      2000   450                         0
# 11 Mex      2001    NA                         1
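Both options above count rows, so they rely on each country having one row per year with no gaps. If that cannot be guaranteed, a possible variant is to compute the offset from the year column itself. This is only a sketch: it assumes rows are sorted by year within each country, and the name result is just for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(
  country = c(rep("US", 9), "Mex", "Mex"),
  year    = c(1999:2007, 2000, 2001),
  score   = c(NA, NA, NA, NA, 426, NA, NA, 430, NA, 450, NA)
)

result <- df %>%
  group_by(country) %>%
  # year of the first non-NA score in the group (assumes rows sorted by year);
  # if a group has no non-NA score, the index is NA and the whole group becomes NA
  mutate(years_from_implementation = year - year[which(!is.na(score))[1]]) %>%
  ungroup()

result
```

Note that the new column is a double here rather than an integer, because year is stored as a double in the example data.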