Conditional NA filling by group

Question    Votes: 3    Answers: 2

Edit: this question was originally asked about data.table. Solutions using any package would be of interest.


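I am stuck on a particular variation of a more general problem. I have panel data that I work with in data.table, and I would like to fill in some missing values using data.table's group-by functionality. Unfortunately, the values are not numeric, so I cannot simply interpolate; they should only be filled in when a condition holds. Is it possible to perform a kind of conditional na.locf in data.table?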

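Essentially, I only want to fill in an NA if the next non-missing observation after it equals the observation before it, although the more general question is how to fill in NAs conditionally.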

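For example, in the following data I want to fill in the associatedid variable within each id group. For id == 1, year == 2003 would be filled in with ABC123, because that is the value both before and after the NA, but year == 2000 for the same id would not be filled. id == 2 would not change, because the next value differs from the value before the NA. id == 3 would have 2003 and 2004 filled in.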

mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))

mydf
#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003         <NA>
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003         <NA>
#> 17  3 2004         <NA>
#> 18  3 2005       ABC123

library(data.table)
dt <- data.table(mydf, key = "id")

Desired output

#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003       ABC123
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003       ABC123
#> 17  3 2004       ABC123
#> 18  3 2005       ABC123
r dplyr data.table plyr na
2 Answers
2 votes

This all comes down to writing a modified na.locf function. After that, you can plug it into data.table like any other function.

new.locf <- function(x) {
  n <- length(x)
  # Nothing to fill unless there is at least one interior element.
  if (n < 3) return(x)
  # Loop through the interior observations of the vector x. The first
  # and last elements can never have both a previous and a next value.
  for (i in 2:(n - 1)) {
    # If the current value is not NA, great: leave it alone.
    if (!is.na(x[i])) next
    # Find the position of the next non-NA value, stopping at n.
    nextval <- i
    while (nextval < n && is.na(x[nextval])) {
      nextval <- nextval + 1
    }
    # Fill only if the previous value exists, a next value exists, and
    # the two are the same. Because x is updated as we go, a run of NAs
    # is filled one element at a time from the left.
    # Caveat: trailing NAs have no next value, so they stay NA.
    # Caveat: only tested on character vectors; you might need to add
    # another case for other inputs.
    if (!is.na(x[i - 1]) && !is.na(x[nextval]) && x[i - 1] == x[nextval]) {
      x[i] <- x[nextval]
    }
  }
  # Return the new vector.
  x
}

Once you have that function, you can use data.table as usual:

dt2 <- dt[,list(year = year,
                # when I read your data in, associatedid read as factor
                associatedid = new.locf(as.character(associatedid))
                ),
          by = "id"
          ]

This returns:

> dt2
    id year associatedid
 1:  1 2000           NA
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000           NA
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003           NA
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000           NA
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

As far as I can tell, this is exactly what you are looking for.

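I left a few caveats in the comments of the new.locf definition, so you may still need to think through the edge cases, but this should get you started.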
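If the loop turns out to be slow on a large panel, the same rule can probably be written without a loop. The following is only a sketch in base R, not something taken from the question: fill forward, fill backward, and keep a filled value only where the two fills agree. The names fill_forward and fill_if_bracketed are made up for this example, and the data are retyped as a plain character vector.

```r
# Forward fill in base R: carry the last non-NA value forward.
fill_forward <- function(x) {
  seen <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # slot 1 is the NA used before any value
}

# Hypothetical helper: fill an NA only when the nearest non-NA values
# on both sides exist and are equal.
fill_if_bracketed <- function(x) {
  before <- fill_forward(x)              # nearest value to the left
  after  <- rev(fill_forward(rev(x)))    # nearest value to the right
  agree  <- is.na(x) & !is.na(before) & !is.na(after) & before == after
  x[agree] <- before[agree]
  x
}

# The example data from the question, as plain vectors.
id <- rep(1:3, each = 6)
associatedid <- c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123",
                  NA, "ABC123", "ABC123", NA, "DEF456", "DEF456",
                  NA, "ABC123", "ABC123", NA, NA, "ABC123")

# ave() applies the helper separately within each id group.
filled <- ave(associatedid, id, FUN = fill_if_bracketed)
```

With data.table, the helper could be plugged in the same way as new.locf, for example dt[, associatedid_filled := fill_if_bracketed(as.character(associatedid)), by = id], where associatedid_filled is an assumed name for a new column.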


0 votes

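I have been trying a two-pass approach. The first pass changes each NA by pasting "p_" onto the front of the preceding value (within an id), and the second pass checks whether the end of that provisional sequence agrees with the next real value. I am posting my code so far, but it is not really an answer, so please do not hold it against me. (Renaming associatedid to asid might make this easier.)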

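lapply(split(mydf, mydf$id), function(d) {
  d$associatedid <- as.character(d$associatedid)
  # Positions of the missing values within this id group.
  missloc <- which(is.na(d$associatedid))
  for (n in missloc) {
    # Skip NAs that have no previous or no next observation.
    if (n == 1 || n == nrow(d)) next
    # Previous value with any provisional "p_" prefix removed.
    prev <- sub("^p_", "", d$associatedid[n - 1])
    if (is.na(prev)) next
    if (d$associatedid[n + 1] %in% prev) {
      # The next value agrees with the previous one: fill it in.
      d$associatedid[n] <- prev
    } else {
      # Tentative NA replacement, to be resolved by the second pass
      # (not written yet).
      d$associatedid[n] <- paste0("p_", prev)
    }
  }
  d
})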
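Since the second pass is still missing, here is a rough sketch of one way the same idea might be finished without "p_" placeholders. This swaps in a different technique, base R's rle(): locate each run of NAs, then compare the value just before the run with the value just after it. The function name fill_na_runs is made up for this example, and it assumes a character vector for a single id.

```r
# Hypothetical helper: fill each run of NAs whose neighbouring values agree.
fill_na_runs <- function(x) {
  runs   <- rle(is.na(x))
  ends   <- cumsum(runs$lengths)        # last position of each run
  starts <- ends - runs$lengths + 1     # first position of each run
  for (k in which(runs$values)) {       # visit the NA runs only
    s <- starts[k]
    e <- ends[k]
    # A run touching either end of x has no value on one side: skip it.
    if (s == 1 || e == length(x)) next
    # The neighbours of a maximal NA run are non-NA by construction.
    if (x[s - 1] == x[e + 1]) x[s:e] <- x[s - 1]
  }
  x
}

# The id == 3 group from the question: both interior NAs are filled,
# while the leading NA is left alone.
x3 <- c(NA, "ABC123", "ABC123", NA, NA, "ABC123")
filled3 <- fill_na_runs(x3)
```

Applied per group, this could replace the body of the function passed to lapply above, for example d$associatedid <- fill_na_runs(as.character(d$associatedid)).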