Conditional NA filling with data.table

Question (votes: 3, answers: 2)

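I am stuck on a specific variation of a more general problem. I have panel data that I work with in data.table, and I would like to use data.table's group-by functionality to fill in some missing values. Unfortunately, they are not numeric, so I cannot simply interpolate; they can only be filled based on a condition. Is it possible to do a kind of conditional na.locf in data.table?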

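Essentially, I only want to fill in an NA if the next observation after the NA equals the observation before it, although the more general question is how to fill in NAs conditionally.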

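For example, in the data below I would like to fill in the associatedid variable within each id group. For id == 1, year == 2003 would be filled in as ABC123, because that is the value both before and after the NA, but year 2000 for the same id would not be filled. id == 2 would not change, because the next value differs from the value before the NA. For id == 3, both 2003 and 2004 would be filled in.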

df = read.table(header=T, text = "id year associatedid
            1 2000 NA
            1 2001 ABC123
            1 2002 ABC123
            1 2003 NA
            1 2004 ABC123
            1 2005 ABC123
            2 2000 NA
            2 2001 ABC123
            2 2002 ABC123
            2 2003 NA
            2 2004 DEF456
            2 2005 DEF456
            3 2000 NA
            3 2001 ABC123
            3 2002 ABC123
            3 2003 NA
            3 2004 NA
            3 2005 ABC123
            ")

library(data.table)
dt = data.table(df,key = c("id"))

Any advice or suggestions would be much appreciated. Thanks!

r data.table plyr na
2 Answers
2
votes

This is all about writing a modified na.locf function. After that, you can plug it into data.table like any other function.

new.locf <- function(x){
  n <- length(x)
  # nothing to fill unless there is at least one interior element;
  # this also avoids 2:(n - 1) counting backwards for short vectors
  if (n < 3) return(x)
  # loop through the interior observations of the vector x;
  # the first and last elements are never filled
  for (i in 2:(n - 1)) {
    # if the current value is not NA, great! leave it alone
    if (!is.na(x[i])) next
    # find the next non-NA value after position i
    nextval <- i
    while (nextval < n && is.na(x[nextval])) {
      nextval <- nextval + 1
    }
    # fill only if the last value is known and the next non-NA value
    # is the same as the last value; isTRUE() guards against a
    # trailing run of NAs, where x[nextval] is itself NA
    if (!is.na(x[i - 1]) && isTRUE(x[i - 1] == x[nextval])) {
      x[i] <- x[nextval]
    }
  }
  # return the new vector
  x
}

Once you have that function, you can use data.table as usual:

dt2 <- dt[,list(year = year,
                # when I read your data in, associatedid read as factor
                associatedid = new.locf(as.character(associatedid))
                ),
          by = "id"
          ]

This returns:

> dt2
    id year associatedid
 1:  1 2000           NA
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000           NA
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003           NA
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000           NA
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

As far as I can tell, this is exactly what you are looking for.

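I put some hedges into the new.locf definition (see the comments), so you may still need to do some thinking about edge cases, but this should get you started.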
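
As a loop-free alternative, here is a minimal base R sketch of the same rule: fill an NA only when carrying the last value forward and carrying the next value backward agree. The helper names fill_forward and fill_if_bracketed are hypothetical, not part of data.table or zoo, and the sketch assumes a character vector for a single id.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only).
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # idx == 0 maps to the leading NA
}

# Hypothetical helper: fill an NA only if the nearest non-NA values
# on both sides exist and are equal.
fill_if_bracketed <- function(x) {
  fwd <- fill_forward(x)               # last observation carried forward
  bwd <- rev(fill_forward(rev(x)))     # next observation carried backward
  fillable <- is.na(x) & !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[fillable] <- fwd[fillable]
  x
}

# The id == 3 values from the question
print(fill_if_bracketed(c(NA, "ABC123", "ABC123", NA, NA, "ABC123")))
```

Assuming the dt object from the question, this could presumably be applied per group in the same way as new.locf, for example with dt[, associatedid := fill_if_bracketed(as.character(associatedid)), by = id].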


0
votes

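I have been trying a two-pass approach: the first pass changes each NA to the preceding value (within an id) with "p_" pasted on the front, and a second pass then checks that the last entry in such a run agrees with the next real value. I am posting my code as it stands, but it is not really an answer, so please do not judge it as one. (It might be easier to rename associatedid to asid.)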

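lapply( split(df, df$id),
    function(d){ d$associatedid <- as.character(d$associatedid)
    # positions of the missing values within this id
    missloc <- which(is.na(d$associatedid))
    for (n in missloc) {
      # an NA with no usable previous value stays NA
      if (n == 1 || is.na(d$associatedid[n-1])) next
      # previous value with any tentative "p_" marker stripped
      prev <- gsub("^p_", "", d$associatedid[n-1])
      nxt  <- if (n < nrow(d)) d$associatedid[n+1] else NA
      if (!is.na(nxt) && nxt == prev) {
        # previous value agrees with the next real value
        d$associatedid[n] <- prev
      } else {
        # tentative NA replacement
        d$associatedid[n] <- paste0("p_", prev)
      }
    }
    # first pass only: "p_" entries still need the second-pass check
    d
 })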
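
To make the two-pass idea concrete, here is a minimal sketch of how both passes might look on the values of a single id. It is not the answerer's code: the function name two_pass_fill is hypothetical, and it swaps the "p_" string prefix for a logical flag vector that marks tentative fills.

```r
# Hypothetical helper: two-pass conditional fill for one id's values.
two_pass_fill <- function(x) {
  x <- as.character(x)
  tentative <- rep(FALSE, length(x))
  # Pass 1: carry the previous value into each NA and flag it as tentative
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i]) && !is.na(x[i - 1])) {
      x[i] <- x[i - 1]
      tentative[i] <- TRUE
    }
  }
  # Pass 2: walk backwards; keep a tentative fill only if the entry after
  # it is real (or already confirmed) and equal, otherwise revert it to NA
  for (i in rev(seq_along(x))) {
    if (!tentative[i]) next
    confirmed <- i < length(x) && !tentative[i + 1] &&
      !is.na(x[i + 1]) && x[i + 1] == x[i]
    if (confirmed) tentative[i] <- FALSE else x[i] <- NA
  }
  x
}

# The id == 2 values from the question
print(two_pass_fill(c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456")))
```

Walking backwards in the second pass means a run of several NAs is confirmed or reverted as a whole, since each entry checks the one after it only once that entry has been settled.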