Using the data.table package in R, I am trying to create the Cartesian product of two data.tables with merge, the same way it is done in base R.
The following works in base R:
#assume this order data
orders <- data.frame(date = as.POSIXct(c('2012-08-28','2012-08-29','2012-09-01')),
first.name = as.character(c('John','George','Henry')),
last.name = as.character(c('Doe','Smith','Smith')),
qty = c(10,50,6))
#and these dates
dates <- data.frame(date = seq(from = as.POSIXct('2012-08-28'),
to = as.POSIXct('2012-09-07'), by = 'day'))
#get the unique customers
cust<-unique(orders[,c('first.name','last.name')])
#using merge from base R, get the cartesian product
merge(dates, cust, by = integer(0))
However, the same technique does not work with data.table, and it throws this error:
"Error in merge.data.table(dates.dt, cust.dt, by = integer(0)) : A non-empty vector of column names for `by` is required."
#data.table approach
library(data.table)
orders.dt <- data.table(orders)
dates.dt <- data.table(dates)
cust.dt <- unique(orders.dt[, list(first.name, last.name)])
#try to use merge (data.table) in the same manner as base
merge(dates.dt, cust.dt, by = integer(0))
Error in merge.data.table(dates.dt, cust.dt, by = integer(0)) : A non-empty vector of column names for `by` is required.
I would like the result to contain every customer name for every date, just as in base R, but done in a data.table-centric way. Is this possible?
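If you first build full names from the first and last names in the cust data frame, you can use CJ (cross join). You cannot pass all three vectors, because that gives 99 rows (11 dates x 3 first names x 3 last names) and pairs first names with last names they do not belong to.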
> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99
This returns the desired data.table object:
> CJ(dates$date,paste(cust$first.name, cust$last.name) )
V1 V2
1: 2012-08-28 George Smith
2: 2012-08-28 Henry Smith
3: 2012-08-28 John Doe
4: 2012-08-29 George Smith
5: 2012-08-29 Henry Smith
6: 2012-08-29 John Doe
7: 2012-08-30 George Smith
8: 2012-08-30 Henry Smith
9: 2012-08-30 John Doe
10: 2012-08-31 John Doe
11: 2012-08-31 George Smith
12: 2012-08-31 Henry Smith
13: 2012-09-01 John Doe
14: 2012-09-01 George Smith
15: 2012-09-01 Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02 Henry Smith
18: 2012-09-02 John Doe
19: 2012-09-03 Henry Smith
20: 2012-09-03 John Doe
21: 2012-09-03 George Smith
22: 2012-09-04 Henry Smith
23: 2012-09-04 John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05 Henry Smith
27: 2012-09-05 John Doe
28: 2012-09-06 George Smith
29: 2012-09-06 Henry Smith
30: 2012-09-06 John Doe
31: 2012-09-07 George Smith
32: 2012-09-07 Henry Smith
33: 2012-09-07 John Doe
V1 V2
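If the first and last names need to stay in separate columns, one possible variation (a sketch, not part of the answer above) is to cross join row numbers instead of values and then use them to pick rows from each table. The names idx, di, ci, and res are made up for illustration, and the example tables are rebuilt here with an assumed UTC time zone so the snippet runs on its own.

```r
library(data.table)

# Rebuild the example tables (tz = "UTC" is an assumption, not in the question)
dates.dt <- data.table(date = seq(from = as.POSIXct("2012-08-28", tz = "UTC"),
                                  to   = as.POSIXct("2012-09-07", tz = "UTC"),
                                  by   = "day"))
cust.dt <- data.table(first.name = c("John", "George", "Henry"),
                      last.name  = c("Doe", "Smith", "Smith"))

# Cross join the row numbers of both tables: 11 * 3 = 33 index pairs
idx <- CJ(di = seq_len(nrow(dates.dt)), ci = seq_len(nrow(cust.dt)))

# Use each index pair to pick one date row and one customer row
res <- data.table(dates.dt[idx$di], cust.dt[idx$ci])
```

Because only whole rows of cust.dt are selected, a first name can never end up next to the wrong last name.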
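merge.data.table(x, y) is a convenience function that wraps a call to x[y], so the merge has to be based on columns present in both data.tables. (That is what the error message is trying to tell you.)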
One workaround is to add a dummy column to both data.tables, whose sole purpose is to make the merge possible:
## Add a column "k", and append it to each data.table's vector of keyed columns.
setkeyv(cust.dt[,k:=1], c(key(cust.dt), "k"))
setkeyv(dates.dt[,k:=1], c(key(dates.dt), "k"))
## Merge and then remove the dummy column
res <- merge(dates.dt, cust.dt, by="k")
head(res[,k:=NULL])
# date first.name last.name
# 1: 2012-08-28 George Smith
# 2: 2012-08-28 Henry Smith
# 3: 2012-08-28 John Doe
# 4: 2012-08-29 George Smith
# 5: 2012-08-29 Henry Smith
# 6: 2012-08-29 John Doe
## Maybe also clean up cust.dt and dates.dt
# cust.dt[,k:=NULL]
# dates.dt[,k:=NULL]
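On recent data.table versions the merge above may stop with an error saying the join results in more rows than nrow(x)+nrow(i), because joins that grow like this are rejected unless allow.cartesian = TRUE is passed. Here is a minimal sketch of the same dummy-column idea with that argument. It works on copies so the original tables are left alone. The names dates.k and cust.k are invented here, and the UTC time zone is an assumption.

```r
library(data.table)

# Rebuild the example tables (tz = "UTC" is an assumption, not in the question)
dates.dt <- data.table(date = seq(from = as.POSIXct("2012-08-28", tz = "UTC"),
                                  to   = as.POSIXct("2012-09-07", tz = "UTC"),
                                  by   = "day"))
cust.dt <- data.table(first.name = c("John", "George", "Henry"),
                      last.name  = c("Doe", "Smith", "Smith"))

# Add the dummy column to copies, so dates.dt and cust.dt are not modified
dates.k <- copy(dates.dt)[, k := 1L]
cust.k  <- copy(cust.dt)[, k := 1L]

# Every row matches every row on k; allow.cartesian = TRUE permits the
# 33-row result. The dummy column is dropped afterwards.
res <- merge(dates.k, cust.k, by = "k", allow.cartesian = TRUE)[, k := NULL]
```

Using copies also removes the need for the cleanup step at the end.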
@JoshO'Brien's solution uses merge; below is a similar alternative that (as far as I know) does not.
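If I understand the documentation in ?data.table::merge correctly, X[Y] should be slightly faster than data.table::merge(X,Y) (as of version 1.8.7). The documentation points to FAQ 2.12 on this, but the FAQ is a bit confusing. First, the correct reference should be 1.12, not 2.12. Second, it does not say whether it means the base version of merge, the data.table version, or both. So this may just be an equivalent solution that looks messier, or it may be faster.
[EDIT from Matthew] Thanks: now improved in v1.8.7 (?merge.data.table, FAQ 1.12, and new FAQ 2.24 added).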
DT_orders<-data.table(date=as.POSIXct(c('2012-08-28','2012-08-29','2012-08-29','2012-09-01')),
first.name=as.character(c('John','John','George','Henry')),
last.name=as.character(c('Doe','Doe','Smith','Smith')),
qty=c(10,2,50,6),
key="first.name,last.name")
# Note that I added a second record to the orders table for John Doe, to make sure it could handle duplicate first/last name combinations.
DT_dates<-data.table(date=seq(from=as.POSIXct('2012-08-28'),
to=as.POSIXct('2012-09-07'),by='day'),
key="date")
DT_custdates<-data.table(k=1,unique(DT_dates),key="k")[unique(DT_orders)[,list(k=1,first.name,last.name)]][,k:=NULL]
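As far as I can tell, two things have changed since this was written: unique() on a data.table now compares all columns by default rather than only the key columns, and a join that grows this much needs allow.cartesian = TRUE. Below is a hedged sketch of the same X[Y] idea under those assumptions, using on= instead of keys. X, Y, and res are illustrative names, and the UTC time zone is an assumption.

```r
library(data.table)

# Same example data as above, without keys (tz = "UTC" is an assumption)
DT_orders <- data.table(
  date       = as.POSIXct(c("2012-08-28", "2012-08-29", "2012-08-29", "2012-09-01"),
                          tz = "UTC"),
  first.name = c("John", "John", "George", "Henry"),
  last.name  = c("Doe", "Doe", "Smith", "Smith"),
  qty        = c(10, 2, 50, 6))
DT_dates <- data.table(date = seq(from = as.POSIXct("2012-08-28", tz = "UTC"),
                                  to   = as.POSIXct("2012-09-07", tz = "UTC"),
                                  by   = "day"))

# One row per customer: deduplicate on the name columns only, then add dummy k
Y <- unique(DT_orders[, .(first.name, last.name)])[, k := 1L]
# Dates with the same dummy column
X <- DT_dates[, .(k = 1L, date)]

# Join on the dummy column; each customer row matches all 11 date rows
res <- X[Y, on = "k", allow.cartesian = TRUE][, k := NULL]
```

The duplicate John Doe order collapses to a single customer row before the join, so the result still has 33 rows.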
Two more options:
Force base::merge(..):
base::merge.data.frame(dates.dt, cust.dt, by = integer(0))
# date first.name last.name
# 1 2012-08-28 John Doe
# 2 2012-08-29 John Doe
# 3 2012-08-30 John Doe
# 4 2012-08-31 John Doe
# 5 2012-09-01 John Doe
# 6 2012-09-02 John Doe
# 7 2012-09-03 John Doe
# 8 2012-09-04 John Doe
# 9 2012-09-05 John Doe
# 10 2012-09-06 John Doe
# 11 2012-09-07 John Doe
# 12 2012-08-28 George Smith
# 13 2012-08-29 George Smith
# 14 2012-08-30 George Smith
# 15 2012-08-31 George Smith
# 16 2012-09-01 George Smith
# 17 2012-09-02 George Smith
# 18 2012-09-03 George Smith
# 19 2012-09-04 George Smith
# 20 2012-09-05 George Smith
# 21 2012-09-06 George Smith
# 22 2012-09-07 George Smith
# 23 2012-08-28 Henry Smith
# 24 2012-08-29 Henry Smith
# 25 2012-08-30 Henry Smith
# 26 2012-08-31 Henry Smith
# 27 2012-09-01 Henry Smith
# 28 2012-09-02 Henry Smith
# 29 2012-09-03 Henry Smith
# 30 2012-09-04 Henry Smith
# 31 2012-09-05 Henry Smith
# 32 2012-09-06 Henry Smith
# 33 2012-09-07 Henry Smith
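Note that merge.data.frame converts its inputs with as.data.frame, so the result above is a plain data.frame. Since the question asks for a data.table result, it may be worth converting it back explicitly. This is a small sketch with the example tables rebuilt so it runs on its own. The UTC time zone is an assumption and res is an illustrative name.

```r
library(data.table)

# Rebuild the example tables (tz = "UTC" is an assumption, not in the question)
dates.dt <- data.table(date = seq(from = as.POSIXct("2012-08-28", tz = "UTC"),
                                  to   = as.POSIXct("2012-09-07", tz = "UTC"),
                                  by   = "day"))
cust.dt <- data.table(first.name = c("John", "George", "Henry"),
                      last.name  = c("Doe", "Smith", "Smith"))

# Base method builds the Cartesian product as a data.frame;
# as.data.table() turns the result back into a data.table
res <- as.data.table(merge.data.frame(dates.dt, cust.dt, by = integer(0)))
```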
Another data.table trick:
dates.dt[, as.list(cust.dt), by = dates.dt]
# date first.name last.name
# <POSc> <char> <char>
# 1: 2012-08-28 John Doe
# 2: 2012-08-28 George Smith
# 3: 2012-08-28 Henry Smith
# 4: 2012-08-29 John Doe
# 5: 2012-08-29 George Smith
# 6: 2012-08-29 Henry Smith
# 7: 2012-08-30 John Doe
# 8: 2012-08-30 George Smith
# 9: 2012-08-30 Henry Smith
# 10: 2012-08-31 John Doe
# ---
# 24: 2012-09-04 Henry Smith
# 25: 2012-09-05 John Doe
# 26: 2012-09-05 George Smith
# 27: 2012-09-05 Henry Smith
# 28: 2012-09-06 John Doe
# 29: 2012-09-06 George Smith
# 30: 2012-09-06 Henry Smith
# 31: 2012-09-07 John Doe
# 32: 2012-09-07 George Smith
# 33: 2012-09-07 Henry Smith