Consider the data.table dt:
id boro block date end_date
1: 1 1 1 01/01/1991 01/01/1992
2: 1 1 2 01/01/1991 01/01/1992
3: 1 2 3 01/01/1991 01/01/1992
4: 1 2 4 01/01/1991 NA
5: 2 1 1 01/01/1992 01/01/1993
6: 2 1 2 01/01/1992 01/01/1993
7: 2 2 3 01/01/1992 NA
8: 2 2 5 01/01/1992 NA
9: 3 1 1 01/01/1993 NA
10: 3 1 2 01/01/1993 NA
11: 3 2 6 01/01/1993 NA
12: 3 2 7 01/01/1993 NA
The output of str(dt) is:
Classes ‘data.table’ and 'data.frame': 12 obs. of 5 variables:
$ id: num 1 1 1 1 2 2 2 2 3 3 ...
$ boro: num 1 1 2 2 1 1 2 2 1 1 ...
$ block: num 1 2 3 4 1 2 3 5 1 2 ...
$ date: Date, format: "1991-01-01" "1991-01-01" "1991-01-01" "1991-01-01"...
$ end_date: Date, format: "1992-01-01" "1992-01-01" "1992-01-01" NA ...
- attr(*, ".internal.selfref")=<externalptr>
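I am trying to expand each row into one row per quarter over the date range given by date and end_date. For example, I would like the first row to expand to: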
id boro block qtr
1: 1 1 1 1991-01-01
2: 1 1 1 1991-04-01
3: 1 1 1 1991-07-01
4: 1 1 1 1991-10-01
If end_date is NA, I would like to return a single row containing the fields id, boro, block and the quarter corresponding to date. For example, row 4 should return:
id boro block qtr
1: 1 2 4 1991-01-01
Following the suggestions given for a similar question, I tried:
dt[,.(id,boro,block,qtr = seq(date, end_date, by = "quarter")),by = 1:nrow(dt)]
But I get the following error:
Error in seq.int(r1$mon, 12 * (to0$year - r1$year) + to0$mon, by) :
'to' must be a finite number
To work around the fact that end_date may be NA, I tried:
dt[,ifelse(!(is.na(end_date)),
.(id,boro,block,qtr = seq(date, end_date, by = "quarter")),
.(id,boro,block,qtr = seq(date,date, by = "quarter"))),
by = 1:nrow(dt)]
But, for reasons I do not understand, this outputs:
nrow V1
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 2
7: 7 2
8: 8 2
9: 9 3
10: 10 3
11: 11 3
12: 12 3
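A likely explanation for the output above, offered as a sketch rather than a definitive diagnosis: ifelse() returns a result shaped like its test, and here the test is a single logical value per row, so only the first element of the chosen list (id) survives and ends up in V1. The base R snippet below reproduces that truncation with made-up values; a plain if / else returns the whole branch instead.

```r
# ifelse() shapes its result like `test`; with a length-1 test it keeps
# only the first element of whichever branch is selected.
yes_branch <- list(id = 1, boro = 1, block = 1)   # illustrative values only
no_branch  <- list(id = 9, boro = 9, block = 9)

truncated <- ifelse(TRUE, yes_branch, no_branch)
print(length(truncated))   # a single element: the id

# if / else evaluates exactly one branch and returns it unchanged.
full <- if (TRUE) yes_branch else no_branch
print(length(full))        # all three elements
```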
Note: my actual data has 19 million rows and 70 columns, so efficiency matters, which is why I am using data.table.
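# convert both date columns to Date and add a row index gr to group by
dat[, c("gr", "date", "end_date") :=
      c(.(.I), lapply(.SD[, 4:5], as.Date, format = '%d/%m/%Y'))]
# one quarterly sequence per row (end_date falls back to date when NA),
# joined back to the original rows on gr
dat[dat[, seq(date, as.Date(ifelse(is.na(end_date), date, end_date), '1970-01-01'),
          'quarter'), gr], on = 'gr']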
id boro block date end_date gr V1
1: 1 1 1 1991-01-01 1992-01-01 1 1991-01-01
2: 1 1 1 1991-01-01 1992-01-01 1 1991-04-01
3: 1 1 1 1991-01-01 1992-01-01 1 1991-07-01
4: 1 1 1 1991-01-01 1992-01-01 1 1991-10-01
5: 1 1 1 1991-01-01 1992-01-01 1 1992-01-01
6: 1 1 2 1991-01-01 1992-01-01 2 1991-01-01
7: 1 1 2 1991-01-01 1992-01-01 2 1991-04-01
8: 1 1 2 1991-01-01 1992-01-01 2 1991-07-01
9: 1 1 2 1991-01-01 1992-01-01 2 1991-10-01
10: 1 1 2 1991-01-01 1992-01-01 2 1992-01-01
11: 1 2 3 1991-01-01 1992-01-01 3 1991-01-01
12: 1 2 3 1991-01-01 1992-01-01 3 1991-04-01
13: 1 2 3 1991-01-01 1992-01-01 3 1991-07-01
14: 1 2 3 1991-01-01 1992-01-01 3 1991-10-01
15: 1 2 3 1991-01-01 1992-01-01 3 1992-01-01
16: 1 2 4 1991-01-01 <NA> 4 1991-01-01
17: 2 1 1 1992-01-01 1993-01-01 5 1992-01-01
18: 2 1 1 1992-01-01 1993-01-01 5 1992-04-01
19: 2 1 1 1992-01-01 1993-01-01 5 1992-07-01
20: 2 1 1 1992-01-01 1993-01-01 5 1992-10-01
21: 2 1 1 1992-01-01 1993-01-01 5 1993-01-01
22: 2 1 2 1992-01-01 1993-01-01 6 1992-01-01
23: 2 1 2 1992-01-01 1993-01-01 6 1992-04-01
24: 2 1 2 1992-01-01 1993-01-01 6 1992-07-01
25: 2 1 2 1992-01-01 1993-01-01 6 1992-10-01
26: 2 1 2 1992-01-01 1993-01-01 6 1993-01-01
27: 2 2 3 1992-01-01 <NA> 7 1992-01-01
28: 2 2 5 1992-01-01 <NA> 8 1992-01-01
29: 3 1 1 1993-01-01 <NA> 9 1993-01-01
30: 3 1 2 1993-01-01 <NA> 10 1993-01-01
31: 3 2 6 1993-01-01 <NA> 11 1993-01-01
32: 3 2 7 1993-01-01 <NA> 12 1993-01-01
id boro block date end_date gr V1
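The output above includes the quarter that starts on end_date itself (1992-01-01 for the first row), whereas the question asks for 1991-01-01 through 1991-10-01 only. A minimal sketch of a per-row variant that keeps only quarters strictly before end_date follows. The three-row table is invented for illustration, and grouping by row over 19 million rows may well be slow, so this shows the shape of the result rather than a tuned solution.

```r
library(data.table)

# Small illustrative table; the columns mirror the question's dt.
dt <- data.table(
  id = c(1, 1, 2), boro = c(1, 2, 2), block = c(1, 4, 3),
  date     = as.Date(c("1991-01-01", "1991-01-01", "1992-01-01")),
  end_date = as.Date(c("1992-01-01", NA, "1992-07-01"))
)

dt[, gr := .I]  # row index, so every input row forms its own group

out <- dt[, {
  if (is.na(end_date)) {
    .(qtr = date)                     # open-ended row: keep only its own quarter
  } else {
    s <- seq(date, end_date, by = "quarter")
    .(qtr = s[s < end_date])          # drop any quarter starting on end_date
  }
}, by = .(gr, id, boro, block)]

out[, gr := NULL]                     # the helper index is no longer needed
print(out)
```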
Here is a possible approach using a data.table non-equi join:
dtcols <- c("date", "end_date")
dt[, (dtcols) := lapply(.SD, as.Date, format="%m/%d/%Y"), .SDcols=dtcols]
#create the quarters
quarters <- dt[,.(qtr=seq(min(date), max(end_date, na.rm=TRUE), by="quarter"))]
#perform non-equi join and then handle NA end_date
quarters[dt, .(id, boro, block, x.qtr, i.date, i.end_date),
by=.EACHI, on=.(qtr>=date, qtr<end_date)][,
.(id, boro, block,
qtr=as.Date(ifelse(is.na(i.end_date), i.date, x.qtr), origin="1970-01-01"))]
Output:
id boro block qtr
1: 1 1 1 1991-01-01
2: 1 1 1 1991-04-01
3: 1 1 1 1991-07-01
4: 1 1 1 1991-10-01
5: 1 1 2 1991-01-01
6: 1 1 2 1991-04-01
7: 1 1 2 1991-07-01
8: 1 1 2 1991-10-01
9: 1 2 3 1991-01-01
10: 1 2 3 1991-04-01
11: 1 2 3 1991-07-01
12: 1 2 3 1991-10-01
13: 1 2 4 1991-01-01
14: 2 1 1 1992-01-01
15: 2 1 1 1992-04-01
16: 2 1 1 1992-07-01
17: 2 1 1 1992-10-01
18: 2 1 2 1992-01-01
19: 2 1 2 1992-04-01
20: 2 1 2 1992-07-01
21: 2 1 2 1992-10-01
22: 2 2 3 1992-01-01
23: 2 2 5 1992-01-01
24: 3 1 1 1993-01-01
25: 3 1 2 1993-01-01
26: 3 2 6 1993-01-01
27: 3 2 7 1993-01-01
id boro block qtr
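To see why the join needs the extra NA handling step, here is a small sketch on toy data. The tables spans and grid and their column names are hypothetical, not part of the original data. As I understand it, a row whose upper bound is NA matches nothing in the grid and comes back with an NA quarter, which is then replaced by the row's own start date.

```r
library(data.table)

# Hypothetical toy data: two spans, the second with an open (NA) end.
spans <- data.table(
  name   = c("a", "b"),
  s_date = as.Date(c("1991-01-01", "1991-01-01")),
  e_date = as.Date(c("1991-07-01", NA))
)
grid <- data.table(
  qtr = seq(as.Date("1991-01-01"), as.Date("1991-10-01"), by = "quarter")
)

# For each row of spans, pick the grid quarters with s_date <= qtr < e_date.
res <- grid[spans, .(name, qtr = x.qtr, s_date = i.s_date),
            on = .(qtr >= s_date, qtr < e_date)]

# Unmatched rows carry NA in qtr; fall back to the span's own start date.
res[is.na(qtr), qtr := s_date]
print(res[, .(name, qtr)])
```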
Data:
library(data.table)
dt <- fread("id boro block date end_date
1 1 1 01/01/1991 01/01/1992
1 1 2 01/01/1991 01/01/1992
1 2 3 01/01/1991 01/01/1992
1 2 4 01/01/1991 NA
2 1 1 01/01/1992 01/01/1993
2 1 2 01/01/1992 01/01/1993
2 2 3 01/01/1992 NA
2 2 5 01/01/1992 NA
3 1 1 01/01/1993 NA
3 1 2 01/01/1993 NA
3 2 6 01/01/1993 NA
3 2 7 01/01/1993 NA")