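I have a question about creating new data frames based on a combination of two existing data frames.
For the example below, for each row in df1 (which has a unique combination of var1 and var2), a new data frame should be created from df2 (which has the same values for var1 and var2, but multiple IDs for each combination). The newly created data frames should only contain records with date_ID < date_agg. Obviously, my real dataframes are much larger so I'm hoping for something not too computationally intensive ;-)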
By the way, the new dataframes should contain all records from df2 that follow the date_ID < date_agg rule, not just records with the same var1/var2 combination.
Sample data:
set.seed(1)
var1 <- c("A","B","C","D")
var2 <- c("X","Y","Z")
df1 <- expand.grid(var1,var2)
df1$date_agg = sample(seq(as.Date('2000/01/01'), as.Date('2023/01/01'), by="day"), 12, replace = TRUE)
df2 <- data.frame(ID = sample(1:1000, replace=FALSE),
var1 = sample(c("A","B","C","D"),1000, replace = TRUE),
var2 = sample(c("X","Y","Z"),1000, replace = TRUE),
date_ID = sample(seq(as.Date('2000/01/01'), as.Date('2023/01/01'), by="day"), 1000, replace = TRUE))
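There is no desired output, so this is hard to verify, but I believe you are looking for a data.table approach like the one below. Note that expand.grid() names the columns of df1 Var1 and Var2, which is why the join maps var1 to Var1 and var2 to Var2.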
library(data.table)
# set to data.table format
setDT(df1); setDT(df2)
# split df1 to individual rows
L <- split(df1, f = seq.int(nrow(df1)))
# perform joins, resulting in a list of data.tables
L.joins <- lapply(L, function(x) {
df2[x, .(ID, var1, var2, date_ID = x.date_ID, date_agg = i.date_agg),
on = .(var1 = Var1 , var2 = Var2, date_ID < date_agg)]
})
# the first entry of the list looks like
L.joins[[1]]
# ID var1 var2 date_ID date_agg
# <int> <char> <char> <Date> <Date>
# 1: 940 A X 2001-02-16 2002-10-13
# 2: 500 A X 2000-09-02 2002-10-13
# 3: 829 A X 2000-03-23 2002-10-13
# 4: 25 A X 2000-10-25 2002-10-13
# 5: 835 A X 2000-06-10 2002-10-13
# 6: 710 A X 2000-03-19 2002-10-13
# 7: 409 A X 2000-05-18 2002-10-13
# 8: 551 A X 2002-07-07 2002-10-13
# 9: 21 A X 2001-06-10 2002-10-13
#10: 496 A X 2002-06-10 2002-10-13
#11: 160 A X 2001-07-19 2002-10-13
#12: 773 A X 2000-06-21 2002-10-13
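Since the question mentions much larger data, it may be worth trying a single non-equi join over all rows of df1 instead of splitting df1 and joining row by row. The sketch below also shows a second variant for the literal reading of the question's note (keep every df2 record with date_ID < date_agg, regardless of var1/var2). It runs on small made-up data rather than the sample above, and `res`, `by_group` and `by_date` are illustrative names, so treat it as a starting point under those assumptions rather than a tested solution for the real data.

```r
library(data.table)

# Small made-up data (not the question's sample); column names mirror df1/df2
df1 <- data.table(Var1 = c("A", "B"), Var2 = c("X", "Y"),
                  date_agg = as.Date(c("2001-01-01", "2002-01-01")))
df2 <- data.table(ID = 1:5,
                  var1 = c("A", "A", "B", "B", "A"),
                  var2 = c("X", "X", "Y", "Y", "Y"),
                  date_ID = as.Date(c("2000-06-01", "2001-06-01", "2001-06-01",
                                      "2002-06-01", "2000-01-01")))

# Variant 1: one non-equi join for all rows of df1, then split by group.
# x.date_ID keeps df2's own date; nomatch = NULL drops df1 rows without a match.
res <- df2[df1,
           .(ID, var1, var2, date_ID = x.date_ID, date_agg = i.date_agg),
           on = .(var1 == Var1, var2 == Var2, date_ID < date_agg),
           nomatch = NULL]
by_group <- split(res, by = c("var1", "var2"))

# Variant 2: ignore var1/var2 and keep every df2 record before each date_agg
by_date <- lapply(seq_len(nrow(df1)), function(k) {
  cutoff <- df1$date_agg[k]
  df2[date_ID < cutoff]
})
```

Variant 1 gives one data.table per var1/var2 combination that has at least one match; variant 2 gives one data.table per row of df1, in the same order as df1.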