I have about 4 million rows of data on individuals, like this:
names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA_real_
df1 <- data.frame(names,luckyToday,luckyYesterday)
df1
# names luckyToday luckyYesterday
# 1 Peter 0 NA
# 2 Peter 0 NA
# 3 Peter 0 NA
# 4 Peter NA NA
# 5 Peter 0 NA
# 6 Peter 0 NA
# 7 Peter 1 NA
# 8 Lisa NA NA
# 9 Bert 1 NA
# 10 Carine NA NA
# 11 Carine 0 NA
# 12 Carine 0 NA
# 13 Carine 0 NA
# 14 Carine 1 NA
# 15 Carine 1 NA
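The data contains observations of people (some observed once, some more often) and how lucky they felt (1 = lucky, 0 = unlucky, NA = no information). As a lagged variable, I would like to introduce a new variable ("luckyYesterday") that tells me whether the person was lucky at their previous observation. So I would like the data to look like this: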
df2
# names luckyToday luckyYesterday
# 1 Peter 0 NA
# 2 Peter 0 0
# 3 Peter 0 0
# 4 Peter NA 0
# 5 Peter 0 0
# 6 Peter 0 0
# 7 Peter 1 0
# 8 Lisa NA NA
# 9 Bert 1 NA
# 10 Carine NA NA
# 11 Carine 0 0
# 12 Carine 0 0
# 13 Carine 0 0
# 14 Carine 1 0
# 15 Carine 1 1
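I know R is not the perfect tool for this kind of data wrangling, but I have to use it here.
I would like to take the following into account:
I tried it myself with 2 for loops, but that takes far too long on my 4 million rows. Can anyone help me with a faster solution, e.g. using data.table or an apply function? I'd be very grateful!
Cheers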
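You can use the shift function from data.table to look at yesterday's value, and the na.locf function from zoo to fill each NA with the previous or the next observation, depending on whether the fromLast argument is FALSE or TRUE. If you don't want to mix observations from different people, you can also group by names: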
library(data.table); library(zoo)
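# na.rm = FALSE keeps unfillable NAs, so each group's result keeps its length (e.g. Lisa, whose only value is NA)
setDT(df1)[, luckyYesterday := shift(na.locf(luckyToday, fromLast = TRUE, na.rm = FALSE)), by = names]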
df1
# names luckyToday luckyYesterday
# 1: Peter 0 NA
# 2: Peter 0 0
# 3: Peter 0 0
# 4: Peter NA 0
# 5: Peter 0 0
# 6: Peter 0 0
# 7: Peter 1 0
# 8: Lisa NA NA
# 9: Bert 1 NA
# 10: Carine NA NA
# 11: Carine 0 0
# 12: Carine 0 0
# 13: Carine 0 0
# 14: Carine 1 0
# 15: Carine 1 1
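To see where those values come from, it may help to run the two steps separately on Peter's values. This is only an illustrative sketch: the variable names peter, filled and lagged are made up here, and na.rm = FALSE is passed so that the filled vector keeps its original length.

```r
library(data.table)  # for shift()
library(zoo)         # for na.locf()

# Peter's luckyToday values from the example data
peter <- c(0, 0, 0, NA, 0, 0, 1)

# Step 1: replace each NA with the next non-NA value (fromLast = TRUE);
# na.rm = FALSE keeps NAs that cannot be filled, so the length is unchanged
filled <- na.locf(peter, fromLast = TRUE, na.rm = FALSE)
filled
# [1] 0 0 0 0 0 0 1

# Step 2: move every value down one position; the first slot becomes NA
lagged <- shift(filled)
lagged
# [1] NA  0  0  0  0  0  0
```

With fromLast = FALSE the NA in the fourth position would instead be filled from the third position, which happens to give the same result for Peter but not for Carine, whose first value is NA.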
names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter",
"Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA
df1 <- data.frame(names,luckyToday,luckyYesterday)
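# New code: lag luckyToday by one row within each person
# head(x, -1) drops the last element and is empty for single-row groups,
# whereas x[1:(.N-1)] would return the first element again when .N is 1
library(data.table)
data.table(df1)[, list(luckyToday, luckyYesterday = c(NA_real_, head(luckyToday, -1))), by = list(names)]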
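With hand-rolled lags of the form c(NA, x[1:(n-1)]), groups with a single row (Lisa and Bert here) need some care, because in R 1:0 counts down to c(1, 0) rather than being empty. The snippet below is a small base-R sketch of that behaviour; x and n are hypothetical names used only for illustration.

```r
x <- 5            # a group with a single observation
n <- length(x)

1:(n - 1)         # counts down: 1 0
x[1:(n - 1)]      # index 0 is ignored, so this still returns 5
head(x, -1)       # numeric(0): nothing is left once the last element is dropped

# A lag built with head() has the same length as x
lag_x <- c(NA_real_, head(x, -1))
lag_x
# [1] NA
```

Using NA_real_ rather than plain NA also keeps the result numeric in every group, which matters when the values are combined by group.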