用行尾值填充NA值乘以增长率吗?

问题描述 投票:10回答:5

用先前值乘以NA来填充(1 + growth)值的一种好方法是什么?

df <- data.frame(
  year = 0:6,
  price1 = c(1.1, 2.1, 3.2, 4.8, NA, NA, NA),
  price2 = c(1.1, 2.1, 3.2, NA, NA, NA, NA)
)
growth <- .02

在这种情况下,我想用price14.8*1.024.8*1.02^2填充4.8*1.02^3中的缺失值。同样,我想用price23.2*1.023.2*1.02^23.2*1.02^3填充3.2*1.02^4中的缺失值。

我已经尝试过,但是我认为需要将其设置为以某种方式重复[apply?):

library(dplyr)
df %>%
  mutate(price1 = ifelse(is.na(price1),
    lag(price1) * (1 + growth), price1
  ))

我还没有将dplyr用于其他任何事情,因此,欢迎您使用基数R或plyr或类似内容。

r plyr dplyr apply na
5个回答
3
投票

[[0]似乎无法处理访问新分配的滞后值。即使dplyr位于列的中间,这也是一种可行的解决方案。

NA

7
投票

仅假定结尾的NA:

df <- apply(
  df, 2, function(x){
    if(sum(is.na(x)) == 0){return(x)}
    ## updated with optimized portion from @josilber
    r <- rle(is.na(x))
    na.loc <- which(r$values)
    b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc])
    lastValIs <- 1:length(x)
    lastValI[is.na(x)] <- b
    x[is.na(x)] <-
      sapply(which(is.na(x)), function(i){
        return(x[lastValIs[i]]*(1 + growth)^(i - lastValIs[i]))
      })
    return(x)
  })

如果也有内部NAgrow <- function(x,growth=0.02) { isna <- is.na(x) lastval <- tail(x[!isna],1) x[isna] <- lastval*(1+growth)^seq(sum(isna)) return(x) } 值,这会有点棘手。

适用于除第一列之外的所有列:

NA

6
投票

可以使用df[-1] <- lapply(df[-1],NAgrow) ## year price1 price2 ## 1 0 1.100000 1.100000 ## 2 1 2.100000 2.100000 ## 3 2 3.200000 3.200000 ## 4 3 4.800000 3.264000 ## 5 4 4.896000 3.329280 ## 6 5 4.993920 3.395866 ## 7 6 5.093798 3.463783 获得紧凑的R基解决方案:

Reduce

给予:

growthfun <- function(x, y) if (is.na(y)) (1+growth)*x else y
replace(df, TRUE, lapply(df, Reduce, f = growthfun, acc = TRUE))

注:问题中的数据没有非尾随的NA值,但是如果有的话,我们可以使用Zoo中的 year price1 price2 1 0 1.100000 1.100000 2 1 2.100000 2.100000 3 2 3.200000 3.200000 4 3 4.800000 3.264000 5 4 4.896000 3.329280 6 5 4.993920 3.395866 7 6 5.093798 3.463783 首先用特殊值(例如NaN)替换尾随的NA。寻找它而不是NA:

na.fill

5
投票

以下基于library(zoo) DF <- as.data.frame(na.fill(df, c(NA, NA, NaN))) growthfun <- function(x, y) if (is.nan(y)) (1+growth)*x else y replace(DF, TRUE, lapply(DF, Reduce, f = growthfun, acc = TRUE)) 的解决方案可在任何位置使用NA,并且不依赖于循环来填充缺失值:

rle

我将使用for循环添加另外两个解决方案,一个在基本R中,一个在Rcpp中:

NAgrow.rle <- function(x) {
  if (is.na(x[1]))  stop("Can't have NA at beginning")
  r <- rle(is.na(x))
  na.loc <- which(r$values)
  b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc])
  x[is.na(x)] <- ave(x[b], b, FUN=function(y) y[1]*(1+growth)^seq_along(y))
  x
}
df[,-1] <- lapply(df[,-1], NAgrow.rle)
#   year   price1   price2
# 1    0 1.100000 1.100000
# 2    1 2.100000 2.100000
# 3    2 3.200000 3.200000
# 4    3 4.800000 3.264000
# 5    4 4.896000 3.329280
# 6    5 4.993920 3.395866
# 7    6 5.093798 3.463783

基于NAgrow.for <- function(x) { for (i in which(is.na(x))) { x[i] <- x[i-1] * (1+growth) } x } library(Rcpp) cppFunction( "NumericVector NAgrowRcpp(NumericVector x, double growth) { const int n = x.size(); NumericVector y(x); for (int i=1; i < n; ++i) { if (R_IsNA(x[i])) { y[i] = (1.0 + growth) * y[i-1]; } } return y; }") 的解决方案(rlecrimson)花费的时间大约是基于for循环的简单解决方案(josilber.rle)的两倍,并且正如所期望的那样,Rcpp解决方案是最快的,可以在大约0.002秒。

josilber.for

1
投票

您可以尝试这种功能

set.seed(144)
big.df <- data.frame(ID=1:100000,
                     price1=sample(c(1:10, NA), 100000, replace=TRUE),
                     price2=sample(c(1:10, NA), 100000, replace=TRUE))
crimson <- function(df) apply(df[,-1], 2, function(x){
  if(sum(is.na(x)) == 0){return(x)}
  ## updated with optimized portion from @josilber
  r <- rle(is.na(x))
  na.loc <- which(r$values)
  b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc])
  lastValIs <- 1:length(x)
  lastValIs[is.na(x)] <- b
  x[is.na(x)] <-
    sapply(which(is.na(x)), function(i){
      return(x[lastValIs[i]]*(1 + growth)^(i - lastValIs[i]))
    })
  return(x)
})
ggrothendieck <- function(df) {
  growthfun <- function(x, y) if (is.na(y)) (1+growth)*x else y
  lapply(df[,-1], Reduce, f = growthfun, acc = TRUE)
}
josilber.rle <- function(df) lapply(df[,-1], NAgrow.rle)
josilber.for <- function(df) lapply(df[,-1], NAgrow.for)
josilber.rcpp <- function(df) lapply(df[,-1], NAgrowRcpp, growth=growth)
library(microbenchmark)
microbenchmark(crimson(big.df), ggrothendieck(big.df), josilber.rle(big.df), josilber.for(big.df), josilber.rcpp(big.df))
# Unit: milliseconds
#                   expr        min         lq       mean     median         uq         max neval
#        crimson(big.df)  98.447546 131.063713 161.494366 152.477661 183.175840  379.643222   100
#  ggrothendieck(big.df) 437.015693 667.760401 822.530745 817.864707 925.974019 1607.352929   100
#   josilber.rle(big.df)  59.678527 115.220519 132.874030 127.476340 151.665657  262.003756   100
#   josilber.for(big.df)  21.076516  57.479169  73.860913  72.959536  84.846912  178.412591   100
#  josilber.rcpp(big.df)   1.248793   1.894723   2.373469   2.190545   2.697246    5.646878   100

[1] 1.100000 2.100000 3.200000 4.800000 4.896000 4.993920 5.093798

© www.soinside.com 2019 - 2024. All rights reserved.