Edit: this question originally asked for a data.table solution. Solutions using any package would be of interest.
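I'm a bit stuck on a particular variation of a more general problem. I have panel data that I work with in data.table, and I would like to use data.table's group-by functionality to fill in some missing values. Unfortunately the values are not numeric, so I can't simply interpolate; I can only fill them in based on a condition. Is it possible to perform a kind of conditional na.locf in data.table?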
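Essentially, I only want to fill in an NA if the next observation after the NA is the same as the observation before it, though the more general question is how to fill in NAs conditionally.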
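For example, in the following data I would like to fill in the associatedid variable within each id group. For id==1, year==2003 would be filled in as ABC123, because that is the value both before and after the NA, but year==2000 for the same id would not, since it has no preceding value. id==2 would not change, because the value after the NA differs from the value before it. For id==3, both 2003 and 2004 would be filled in.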
mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 <NA>
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 <NA>
#> 17 3 2004 <NA>
#> 18 3 2005 ABC123
library(data.table)
dt <- data.table(mydf, key = "id")
Desired output:
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
This is all about writing a modified na.locf function. Once you have it, you can plug it into data.table like any other function.
new.locf <- function(x){
  n <- length(x)
  # nothing to fill unless there is at least one interior position
  # (this also avoids 2:(n-1) counting backwards for short vectors)
  if (n < 3) return(x)
  # loop through the interior observations of the vector x; the first and
  # last positions are never filled because they lack a value on one side
  for (i in 2:(n - 1)) {
    # if the current value is not NA, great, leave it alone
    if (!is.na(x[i])) next
    # find the next non-NA value (stops at n if there isn't one)
    nextval <- i
    while (nextval < n && is.na(x[nextval])) {
      nextval <- nextval + 1
    }
    # if the current value is NA, the previous value is known, and the
    # next non-NA value is the same as the previous value, use that value.
    # isTRUE() treats an NA comparison (NA previous value, or a vector
    # ending in NA) as "do not fill" instead of making if() fail.
    if (isTRUE(x[i - 1] == x[nextval])) {
      x[i] <- x[nextval]
    }
    # otherwise the value simply stays NA
  }
  # return the new vector
  x
}
Once you have that function, you can use data.table as usual:
dt2 <- dt[, list(year = year,
                 # when I read your data in, associatedid read as factor
                 associatedid = new.locf(as.character(associatedid))
                 ),
          by = "id"
          ]
This returns:
> dt2
id year associatedid
1: 1 2000 NA
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 NA
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 NA
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 NA
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
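As far as I can tell, this is exactly what you are looking for.
I have hedged a few points in the comments of the new.locf definition, so you may still need to think through some edge cases, but this should get you started.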
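The same condition can also be expressed without an explicit loop, by comparing a forward fill against a backward fill and keeping the filled value only where the two agree. The following is a minimal sketch in base R, not a tested replacement for the function above. The helper names `locf` and `fill_if_bracketed` are made up for this illustration; `locf` is a small stand-in for what `zoo::na.locf(x, na.rm = FALSE)` would do.

```r
# Carry the last non-NA value forward; leading NAs stay NA.
# (Hypothetical helper, a base R stand-in for zoo::na.locf with na.rm = FALSE.)
locf <- function(x) {
  seen <- !is.na(x)
  c(NA, x[seen])[cumsum(seen) + 1]
}

# Fill an NA only when the nearest known value before it and the nearest
# known value after it are identical. (Hypothetical helper name.)
fill_if_bracketed <- function(x) {
  x <- as.character(x)
  fwd <- locf(x)            # previous known value at each position
  bwd <- rev(locf(rev(x)))  # next known value at each position
  same <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[same] <- fwd[same]
  x
}

# Small example mirroring the question's data, applied per id with ave().
ids  <- rep(1:3, each = 6)
vals <- c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123",
          NA, "ABC123", "ABC123", NA, "DEF456", "DEF456",
          NA, "ABC123", "ABC123", NA, NA,       "ABC123")
filled <- ave(vals, ids, FUN = fill_if_bracketed)
```

Inside data.table the same function could be passed in the `by = "id"` call in place of new.locf. Because it works on whole vectors, leading and trailing NAs need no special casing: they have no known value on one side and are simply left alone.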
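I have been trying a two-pass approach: the first pass would change each NA to the preceding value (within id) with "p_" pasted on the front, and a second pass would then check whether the last value in such a sequence agrees with the next real value. I'm offering my code so far, but it is not really an answer, so please don't hold it against me. (It might be easier to rename associatedid to asid.)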
lapply(split(mydf, mydf$id), function(d) {
  d$associatedid <- as.character(d$associatedid)
  # positions of the missing values within this id
  missloc <- which(is.na(d$associatedid))
  for (n in missloc) {
    # a leading NA, or one whose predecessor is still NA, has nothing to copy
    if (n == 1 || is.na(d$associatedid[n - 1])) next
    # previous value with any tentative "p_" marker stripped
    prev <- sub("^p_", "", d$associatedid[n - 1])
    nxt <- d$associatedid[n + 1]  # NA when n is the last row
    if (!is.na(nxt) && nxt == prev) {
      d$associatedid[n] <- prev
    } else {
      # tentative NA replacement
      d$associatedid[n] <- paste0("p_", prev)
    }
  }
  d
})
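For completeness, here is one way the two passes described above might be finished. This is only a sketch under my own assumptions, not the code the approach was originally heading toward: it works on a single character vector, and it tracks tentative fills in a logical vector rather than with the "p_" prefix, so that real values starting with "p_" cannot be mistaken for markers. The name `two_pass_fill` is hypothetical.

```r
two_pass_fill <- function(x) {
  x <- as.character(x)
  n <- length(x)
  tentative <- logical(n)  # TRUE where a value was filled provisionally

  # Pass 1 (forward): carry the previous value into each NA and flag it.
  for (i in seq_len(n)[-1]) {
    if (is.na(x[i]) && !is.na(x[i - 1])) {
      x[i] <- x[i - 1]
      tentative[i] <- TRUE
    }
  }

  # Pass 2 (backward): keep a tentative value only if the value after it
  # is still present and identical; otherwise restore the NA.
  for (i in rev(seq_len(n))) {
    if (tentative[i]) {
      confirmed <- i < n && !is.na(x[i + 1]) && x[i + 1] == x[i]
      if (!confirmed) x[i] <- NA
    }
  }
  x
}

# One group from the question's data (id == 3): both inner NAs are bracketed
# by the same value.
id3 <- two_pass_fill(c(NA, "ABC123", "ABC123", NA, NA, "ABC123"))
```

Going backward in the second pass means a run of several NAs is resolved from its far end first, so each tentative value only ever has to look at its immediate right-hand neighbour. The function could be applied per id with something like `ave()` or a grouped data.table call.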