Expand a data.table by date range, handling the case where one of the dates is NA


Consider the data.table dt:

    id boro block       date   end_date
 1:  1    1     1 01/01/1991 01/01/1992
 2:  1    1     2 01/01/1991 01/01/1992
 3:  1    2     3 01/01/1991 01/01/1992
 4:  1    2     4 01/01/1991         NA
 5:  2    1     1 01/01/1992 01/01/1993
 6:  2    1     2 01/01/1992 01/01/1993
 7:  2    2     3 01/01/1992         NA
 8:  2    2     5 01/01/1992         NA
 9:  3    1     1 01/01/1993         NA
10:  3    1     2 01/01/1993         NA
11:  3    2     6 01/01/1993         NA
12:  3    2     7 01/01/1993         NA

where str(dt) outputs:

Classes ‘data.table’ and 'data.frame':  12 obs. of  5 variables:
 $ id      : num  1 1 1 1 2 2 2 2 3 3 ...
 $ boro    : num  1 1 2 2 1 1 2 2 1 1 ...
 $ block   : num  1 2 3 4 1 2 3 5 1 2 ...
 $ date    : Date, format: "1991-01-01" "1991-01-01" "1991-01-01" "1991-01-01" ...
 $ end_date: Date, format: "1992-01-01" "1992-01-01" "1992-01-01" NA ...
 - attr(*, ".internal.selfref")=<externalptr>

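I am trying to expand each row over the date range given by date and end_date, one row per quarter. For example, I want the first row to expand to: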

     id boro block        qtr
 1:    1    1     1 1991-01-01
 2:    1    1     1 1991-04-01
 3:    1    1     1 1991-07-01
 4:    1    1     1 1991-10-01

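If end_date is NA, I want to return a single row containing the fields id, boro and block, plus the quarter corresponding to date. For example, row 4 should return: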

     id boro block        qtr
 1:    1    2    4 1991-01-01

Following suggestions from similar questions asked here, I tried:

dt[,.(id,boro,block,qtr = seq(date, end_date, by = "quarter")),by = 1:nrow(dt)]

but I get the following error:

Error in seq.int(r1$mon, 12 * (to0$year - r1$year) + to0$mon, by) : 
  'to' must be a finite number
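
The failure can be reproduced outside data.table, since seq() on Date values needs a non-missing end point. The following is a minimal base R sketch, not the asker's code; `safe_end` is a hypothetical helper introduced here only for illustration, and it substitutes the start date when the end date is NA.

```r
from <- as.Date("1991-01-01")
end_na <- as.Date(NA)

# seq() cannot build a sequence towards a missing end point, so this call fails.
failed <- tryCatch(
  { seq(from, end_na, by = "quarter"); FALSE },
  error = function(e) TRUE
)

# Hypothetical helper (not part of data.table): use `from` when `to` is NA.
safe_end <- function(from, to) if (is.na(to)) from else to

# With an NA end date the sequence collapses to the start date alone.
single <- seq(from, safe_end(from, end_na), by = "quarter")
# With a real end date the sequence runs up to and including that date.
full <- seq(from, safe_end(from, as.Date("1992-01-01")), by = "quarter")
```

Note that `full` includes the end date itself, so a further filter is needed if the end of the range should be exclusive, as in the desired output above.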

To deal with the fact that end_date may be NA, I tried:

dt[,ifelse(!(is.na(end_date)),
               .(id,boro,block,qtr = seq(date, end_date, by = "quarter")),
               .(id,boro,block,qtr = seq(date,date, by = "quarter"))),
       by = 1:nrow(dt)]

but, for reasons I do not understand, this outputs:

    nrow V1
 1:    1  1
 2:    2  1
 3:    3  1
 4:    4  1
 5:    5  2
 6:    6  2
 7:    7  2
 8:    8  2
 9:    9  3
10:   10  3
11:   11  3
12:   12  3
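
One likely explanation for this output: ifelse() returns a result shaped like its test, so a length-one test keeps only the first element of the list passed as `yes` or `no`, which here is id. A plain if/else avoids that. The sketch below uses a two-row sample of the data and a hypothetical helper, `quarter_starts`, that treats end_date as exclusive to match the desired output; it assumes end_date is never earlier than date.

```r
library(data.table)

dt <- data.table(
  id = c(1, 1), boro = c(1, 2), block = c(1, 4),
  date = as.Date(c("1991-01-01", "1991-01-01")),
  end_date = as.Date(c("1992-01-01", NA))
)

# ifelse() with a length-one test keeps only the first element of the list.
collapsed <- ifelse(TRUE, list(id = 1, boro = 1, block = 1), list(id = 2))

# Hypothetical helper: quarter starts from `from` up to, but not including, `to`;
# falls back to the start date alone when `to` is NA.
quarter_starts <- function(from, to) {
  if (is.na(to)) return(from)
  q <- seq(from, to, by = "quarter")
  q[q < to | q == from]
}

# One group per source row; the length-one columns are recycled against qtr.
res <- dt[, .(id, boro, block, qtr = quarter_starts(date, end_date)),
          by = .(row = seq_len(nrow(dt)))]
res[, row := NULL]
```

This still calls seq() once per row, so it addresses correctness rather than speed.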

Note: my actual data has 19 million rows and 70 columns, so efficiency matters, which is why I am using data.table.
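
At that scale, a per-row seq() call may become the bottleneck. One possible alternative, sketched here under the assumption that every date falls on the first day of a month and that end_date is exclusive, is to compute the number of quarters per row with vector arithmetic and then repeat the row indices. The variable names are illustrative, and fcoalesce() requires data.table 1.12.4 or later.

```r
library(data.table)

dt <- data.table(
  id = c(1, 1), boro = c(1, 2), block = c(1, 4),
  date = as.Date(c("1991-01-01", "1991-01-01")),
  end_date = as.Date(c("1992-01-01", NA))
)

# Whole months between date and end_date (NA where end_date is NA).
months_span <- dt[, 12L * (year(end_date) - year(date)) +
                    (month(end_date) - month(date))]

# Quarters to emit per row: end_date is exclusive; NA or empty ranges give one row.
n_qtr <- pmax(1L, fcoalesce(as.integer(ceiling(months_span / 3)), 1L))

# Repeat each source row n_qtr times and build month offsets 0, 3, 6, ...
row_idx <- rep(seq_len(nrow(dt)), n_qtr)
offset <- 3L * (sequence(n_qtr) - 1L)

res <- dt[row_idx, .(id, boro, block, date)]

# Month counter since year 0, shifted by the offset, then turned back into a date.
m <- res[, 12L * year(date) + month(date) - 1L] + offset
res[, qtr := as.Date(sprintf("%04d-%02d-01", m %/% 12L, m %% 12L + 1L))]
res[, date := NULL]
```

The design choice is to trade the generality of seq() for arithmetic on whole vectors; dates that do not fall on the first of a month would need different handling.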

r date data.table
2 Answers

2 votes
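# dat: the question's table; date and end_date are parsed from dd/mm/yyyy text
dat[dat[, c("gr", "date", "end_date") :=
            c(.(.I), lapply(.SD[, 4:5], as.Date, format = '%d/%m/%Y'))][,
        # one sequence per row (gr); an NA end_date falls back to date
        seq(date, as.Date(ifelse(is.na(end_date), date, end_date), '1970-01-01'),
            'quarter'), gr], on = 'gr']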



    id boro block       date   end_date gr         V1
 1:  1    1     1 1991-01-01 1992-01-01  1 1991-01-01
 2:  1    1     1 1991-01-01 1992-01-01  1 1991-04-01
 3:  1    1     1 1991-01-01 1992-01-01  1 1991-07-01
 4:  1    1     1 1991-01-01 1992-01-01  1 1991-10-01
 5:  1    1     1 1991-01-01 1992-01-01  1 1992-01-01
 6:  1    1     2 1991-01-01 1992-01-01  2 1991-01-01
 7:  1    1     2 1991-01-01 1992-01-01  2 1991-04-01
 8:  1    1     2 1991-01-01 1992-01-01  2 1991-07-01
 9:  1    1     2 1991-01-01 1992-01-01  2 1991-10-01
10:  1    1     2 1991-01-01 1992-01-01  2 1992-01-01
11:  1    2     3 1991-01-01 1992-01-01  3 1991-01-01
12:  1    2     3 1991-01-01 1992-01-01  3 1991-04-01
13:  1    2     3 1991-01-01 1992-01-01  3 1991-07-01
14:  1    2     3 1991-01-01 1992-01-01  3 1991-10-01
15:  1    2     3 1991-01-01 1992-01-01  3 1992-01-01
16:  1    2     4 1991-01-01       <NA>  4 1991-01-01
17:  2    1     1 1992-01-01 1993-01-01  5 1992-01-01
18:  2    1     1 1992-01-01 1993-01-01  5 1992-04-01
19:  2    1     1 1992-01-01 1993-01-01  5 1992-07-01
20:  2    1     1 1992-01-01 1993-01-01  5 1992-10-01
21:  2    1     1 1992-01-01 1993-01-01  5 1993-01-01
22:  2    1     2 1992-01-01 1993-01-01  6 1992-01-01
23:  2    1     2 1992-01-01 1993-01-01  6 1992-04-01
24:  2    1     2 1992-01-01 1993-01-01  6 1992-07-01
25:  2    1     2 1992-01-01 1993-01-01  6 1992-10-01
26:  2    1     2 1992-01-01 1993-01-01  6 1993-01-01
27:  2    2     3 1992-01-01       <NA>  7 1992-01-01
28:  2    2     5 1992-01-01       <NA>  8 1992-01-01
29:  3    1     1 1993-01-01       <NA>  9 1993-01-01
30:  3    1     2 1993-01-01       <NA> 10 1993-01-01
31:  3    2     6 1993-01-01       <NA> 11 1993-01-01
32:  3    2     7 1993-01-01       <NA> 12 1993-01-01
    id boro block       date   end_date gr         V1

1 vote

Here is a possible approach using a data.table non-equi join:

dtcols <- c("date", "end_date")
dt[, (dtcols) := lapply(.SD, as.Date, format="%m/%d/%Y"), .SDcols=dtcols]

#create the quarters
quarters <- dt[,.(qtr=seq(min(date), max(end_date, na.rm=TRUE), by="quarter"))]

#perform non-equi join and then handle NA end_date
quarters[dt, .(id, boro, block, x.qtr, i.date, i.end_date), 
    by=.EACHI, on=.(qtr>=date, qtr<end_date)][,
        .(id, boro, block, 
            qtr=as.Date(ifelse(is.na(i.end_date), i.date, x.qtr), origin="1970-01-01"))]

Output:

    id boro block        qtr
 1:  1    1     1 1991-01-01
 2:  1    1     1 1991-04-01
 3:  1    1     1 1991-07-01
 4:  1    1     1 1991-10-01
 5:  1    1     2 1991-01-01
 6:  1    1     2 1991-04-01
 7:  1    1     2 1991-07-01
 8:  1    1     2 1991-10-01
 9:  1    2     3 1991-01-01
10:  1    2     3 1991-04-01
11:  1    2     3 1991-07-01
12:  1    2     3 1991-10-01
13:  1    2     4 1991-01-01
14:  2    1     1 1992-01-01
15:  2    1     1 1992-04-01
16:  2    1     1 1992-07-01
17:  2    1     1 1992-10-01
18:  2    1     2 1992-01-01
19:  2    1     2 1992-04-01
20:  2    1     2 1992-07-01
21:  2    1     2 1992-10-01
22:  2    2     3 1992-01-01
23:  2    2     5 1992-01-01
24:  3    1     1 1993-01-01
25:  3    1     2 1993-01-01
26:  3    2     6 1993-01-01
27:  3    2     7 1993-01-01
    id boro block        qtr

Data:

library(data.table)
dt <- fread("id boro block       date   end_date
1    1     1 01/01/1991 01/01/1992
1    1     2 01/01/1991 01/01/1992
1    2     3 01/01/1991 01/01/1992
1    2     4 01/01/1991         NA
2    1     1 01/01/1992 01/01/1993
2    1     2 01/01/1992 01/01/1993
2    2     3 01/01/1992         NA
2    2     5 01/01/1992         NA
3    1     1 01/01/1993         NA
3    1     2 01/01/1993         NA
3    2     6 01/01/1993         NA
3    2     7 01/01/1993         NA")