Create 2 new fields using data.table

Question (votes: 0, answers: 2)

Looking for some advice on getting the R code below to produce the desired result. Any help would be much appreciated.

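Below are the logic, the code, the sample data, and the desired output. The current code does not produce the desired result. I can get the desired result with a for loop, but that takes too long.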

Logic:

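  1. If this is the first row encountered for an ID, then group = 1 and groupdate = date.
  2. Otherwise, if it is not the first row of the ID and date - previous date > 10 or date - previous groupdate > 10, then group = previous group # + 1 and groupdate = date.
  3. Otherwise, if it is not the first row of the ID and date - previous date <= 10 and date - previous groupdate <= 10, then group = previous group # and groupdate = previous groupdate.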
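
The rules above can be read as one sequential pass per ID. A minimal base R sketch of that reading follows; `assign_groups` is a hypothetical helper name that does not appear in the original post, and the input is assumed to be the dates of a single ID in ascending order. It is the slow loop-style version, so it is mainly useful as a reference to check faster solutions against.

```r
# Hypothetical helper (not from the original post): a plain sequential
# reading of rules 1-3 for the dates of ONE ID, assumed sorted ascending.
assign_groups <- function(d) {
  n <- length(d)
  grp <- integer(n)  # group counter per row
  gdt <- d           # group date per row; defaults to the row's own date
  for (i in seq_len(n)) {
    if (i == 1L) {
      grp[i] <- 1L                                   # rule 1
    } else if (as.numeric(d[i] - d[i - 1L]) > 10 ||
               as.numeric(d[i] - gdt[i - 1L]) > 10) {
      grp[i] <- grp[i - 1L] + 1L                     # rule 2: open a new group
    } else {
      grp[i] <- grp[i - 1L]                          # rule 3: stay in the group
      gdt[i] <- gdt[i - 1L]
    }
  }
  data.frame(DATE = d, GROUP = grp, GROUPDATE = gdt)
}

# Dates of ID 2 from the sample data
id2 <- as.Date(c("1/13/2015", "1/20/2015", "1/28/2015", "2/28/2015", "3/20/2015"),
               format = "%m/%d/%Y")
res <- assign_groups(id2)
print(res)  # GROUP column: 1 1 2 3 4
```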

Sample Code:

df1 <- read.table(header=T,text='ID  DATE        ITEM
           1   1/1/2014    P1
           1   1/15/2014   P2
           1   1/20/2015   P3
           1   1/22/2015   P4
           1   3/10/2015   P5
           2   1/13/2015   P1
           2   1/20/2015   P2
           2   1/28/2015   P3
           2   2/28/2015   P4
           2   3/20/2015   P5')
df1

library(data.table)
setDT(df1)[, GROUP:={
     dt <- as.Date(DATE, "%m/%d/%Y")
     gr1 <-cumsum((dt-shift(dt, fill=dt[1L]))>10)+1L; list(gr1)} ,
        by =  ID]
df1[, GROUPDATE := DATE[1L] , by = .(GROUP, ID)]

df1

===============

# Desired output.
ID  DATE        ITEM    GROUP   GROUPDATE
1   1/1/2014    P1  1   1/1/2014
1   1/15/2014   P2  2   1/15/2014
1   1/20/2014   P3  2   1/15/2014
1   1/22/2014   P4  2   1/15/2014
1   3/10/2015   P5  3   3/10/2015
2   1/13/2015   P1  1   1/13/2015
2   1/20/2015   P2  1   1/13/2015
2   1/28/2015   P3  2   1/28/2015
2   2/28/2015   P4  3   2/28/2015
2   3/20/2015   P5  4   3/20/2015
r data.table
2 Answers
Votes: 1

This answer describes two approaches: an iterative one and one that uses a non-equi join.

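Although both return the desired output for the sample dataset provided by the OP, I feel they should be tested more thoroughly with a larger dataset.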

Iterative approach

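The OP asks for the group counter to be incremented by one

  1. if there is a gap of more than 10 days between the current date and the previous date, i.e., the date of the previous row, - or -
  2. if there is a gap of more than 10 days between the current date and the first day of the group (the "groupdate")

within each ID.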

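Condition (1) can be computed directly in one go. Unfortunately, the first day of the group referenced in condition (2) may change while condition (2) is being evaluated. Therefore, several passes may be required to find the final solution.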
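
To see why condition (1) alone is not enough, here is a small base R sketch using the dates of ID 2 from the sample data. The variable names are made up for this illustration.

```r
# Dates of ID 2 from the sample data
d <- as.Date(c("2015-01-13", "2015-01-20", "2015-01-28", "2015-02-28", "2015-03-20"))

# Condition (1) only: gap to the previous row (the first row gets a gap of 0)
gap_prev <- c(0, diff(as.numeric(d)))
grp_cond1 <- cumsum(gap_prev > 10) + 1L
print(grp_cond1)  # 1 1 1 2 3

# Row 3 is only 8 days after row 2, but 15 days after the group's first day,
# so condition (2) has to open a new group there (desired: 1 1 2 3 4)
gap_to_first <- as.numeric(d[3] - d[1])
print(gap_to_first)  # 15
```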

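The answer below evaluates condition (1) first and then uses this provisional solution to evaluate condition (2) iteratively, until the group assignment no longer changes.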

The OP has provided two different data tables in the question. Here, we take the second one, which contains the expected result:

library(data.table)
df2 <- fread("ID  DATE        ITEM    GROUP   GROUPDATE
1   1/1/2014    P1  1   1/1/2014
1   1/15/2014   P2  2   1/15/2014
1   1/20/2014   P3  2   1/15/2014
1   1/22/2014   P4  2   1/15/2014
1   3/10/2015   P5  3   3/10/2015
2   1/13/2015   P1  1   1/13/2015
2   1/20/2015   P2  1   1/13/2015
2   1/28/2015   P3  2   1/28/2015
2   2/28/2015   P4  3   2/28/2015
2   3/20/2015   P5  4   3/20/2015")

The DATE and GROUPDATE columns need to be coerced to a Date class beforehand.

cols <- c("DATE", "GROUPDATE")
df2[, (cols) := lapply(.SD, as.IDate, "%m/%d/%Y"), .SDcols = cols]

Note that there is an implicit assumption that the rows are ordered by DATE within each ID.
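
If that ordering is not guaranteed, it can be checked and enforced first. The sketch below uses a small made-up table `x` (not the OP's data) that is deliberately out of order.

```r
library(data.table)

# Toy data, made up for illustration and deliberately out of order
x <- data.table(
  ID   = c(2L, 1L, 1L, 2L),
  DATE = as.IDate(c("2015-01-20", "2014-01-15", "2014-01-01", "2015-01-13"))
)

# Check per ID whether the dates are already in ascending order
chk <- x[, .(sorted = !is.unsorted(as.integer(DATE))), by = ID]
print(chk)

# Sort by reference: by ID first, then by DATE within each ID
setorder(x, ID, DATE)
print(x)
```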

Now, we can start the computation. To be able to compare the results, I have chosen different column names.

dt <- copy(df2) # just for convenience to easily switch between df1 and df2
# create group count for gaps of more than 10 days
dt[, grp := cumsum((DATE - shift(DATE, fill = 0)) > 10L), by = ID]
# set group date
dt[, gdt := first(DATE), by = .(ID, grp)]
# update group count according to condition (2)
tmp <- dt[, cumsum((DATE - shift(gdt, fill = 0)) > 10L), by = ID][, V1]
# repeat as long as there are changes in group counts
while (dt[, any(grp != tmp)]) {
  # complete update of group count
  dt[, grp := tmp]
  # set new group date
  dt[, gdt := first(DATE), by = .(ID, grp)]
  # update group count according to condition (2)
  tmp <- dt[, cumsum((DATE - shift(gdt, fill = 0)) > 10L), by = ID][, V1]
}
dt
    ID       DATE ITEM GROUP  GROUPDATE grp        gdt
 1:  1 2014-01-01   P1     1 2014-01-01   1 2014-01-01
 2:  1 2014-01-15   P2     2 2014-01-15   2 2014-01-15
 3:  1 2014-01-20   P3     2 2014-01-15   2 2014-01-15
 4:  1 2014-01-22   P4     2 2014-01-15   2 2014-01-15
 5:  1 2015-03-10   P5     3 2015-03-10   3 2015-03-10
 6:  2 2015-01-13   P1     1 2015-01-13   1 2015-01-13
 7:  2 2015-01-20   P2     1 2015-01-13   1 2015-01-13
 8:  2 2015-01-28   P3     2 2015-01-28   2 2015-01-28
 9:  2 2015-02-28   P4     3 2015-02-28   3 2015-02-28
10:  2 2015-03-20   P5     4 2015-03-20   4 2015-03-20

Non-equi join

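The condition can be rewritten: a row belongs to a group if the gap to the previous row is no more than 10 days and the gap to the first row of the group is no more than 10 days.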

dt <- copy(df2)
# append row id 
dt[, rn := .I]
tmp <- 
  # non-equi join to find all rows which lie within a 10 days interval
  dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
     on = .(ID, DATE >= start, DATE <= end)][
       # aggregate for row id to find the earliest start date
       , min(DATE), by = .(ID, rn)][
         # create group id (starting at one for each ID)
         , rleid(V1), by = ID]
# append group id to original data.table
dt[, grp := tmp$V1][
  # set group date
  , gdt := first(DATE), by = .(ID, grp)]
dt
    ID       DATE ITEM GROUP  GROUPDATE rn grp        gdt
 1:  1 2014-01-01   P1     1 2014-01-01  1   1 2014-01-01
 2:  1 2014-01-15   P2     2 2014-01-15  2   2 2014-01-15
 3:  1 2014-01-20   P3     2 2014-01-15  3   2 2014-01-15
 4:  1 2014-01-22   P4     2 2014-01-15  4   2 2014-01-15
 5:  1 2015-03-10   P5     3 2015-03-10  5   3 2015-03-10
 6:  2 2015-01-13   P1     1 2015-01-13  6   1 2015-01-13
 7:  2 2015-01-20   P2     1 2015-01-13  7   1 2015-01-13
 8:  2 2015-01-28   P3     2 2015-01-28  8   2 2015-01-28
 9:  2 2015-02-28   P4     3 2015-02-28  9   3 2015-02-28
10:  2 2015-03-20   P5     4 2015-03-20 10   4 2015-03-20

Step-by-step explanation

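The idea is to find, in a single operation, the "groupdate" that each row belongs to. This is done by finding all 10-day intervals a row may fall into and then aggregating by taking the earliest start date for each row. The result is then used to create the group ids with the rleid() function.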

The expression below creates a helper data.table containing the 10-day intervals:

dt[, .(ID, start = DATE, end = DATE + 10L)]
    ID      start        end
 1:  1 2014-01-01 2014-01-11
 2:  1 2014-01-15 2014-01-25
 3:  1 2014-01-20 2014-01-30
 4:  1 2014-01-22 2014-02-01
 5:  1 2015-03-10 2015-03-20
 6:  2 2015-01-13 2015-01-23
 7:  2 2015-01-20 2015-01-30
 8:  2 2015-01-28 2015-02-07
 9:  2 2015-02-28 2015-03-10
10:  2 2015-03-20 2015-03-30

The non-equi join finds all rows that lie within each 10-day interval:

dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
     on = .(ID, DATE >= start, DATE <= end)]
    ID       DATE ITEM GROUP  GROUPDATE rn     DATE.1
 1:  1 2014-01-01   P1     1 2014-01-01  1 2014-01-11
 2:  1 2014-01-15   P2     2 2014-01-15  2 2014-01-25
 3:  1 2014-01-15   P3     2 2014-01-15  3 2014-01-25
 4:  1 2014-01-15   P4     2 2014-01-15  4 2014-01-25
 5:  1 2014-01-20   P3     2 2014-01-15  3 2014-01-30
 6:  1 2014-01-20   P4     2 2014-01-15  4 2014-01-30
 7:  1 2014-01-22   P4     2 2014-01-15  4 2014-02-01
 8:  1 2015-03-10   P5     3 2015-03-10  5 2015-03-20
 9:  2 2015-01-13   P1     1 2015-01-13  6 2015-01-23
10:  2 2015-01-13   P2     1 2015-01-13  7 2015-01-23
11:  2 2015-01-20   P2     1 2015-01-13  7 2015-01-30
12:  2 2015-01-20   P3     2 2015-01-28  8 2015-01-30
13:  2 2015-01-28   P3     2 2015-01-28  8 2015-02-07
14:  2 2015-02-28   P4     3 2015-02-28  9 2015-03-10
15:  2 2015-03-20   P5     4 2015-03-20 10 2015-03-30

For example, the interval [2014-01-15, 2014-01-25] includes rows 2, 3, and 4. On the other hand, row 4 (DATE: 2014-01-22) belongs to three different intervals: [2014-01-15, 2014-01-25], [2014-01-20, 2014-01-30], and [2014-01-22, 2014-02-01].
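
The interval membership of row 4 can be reproduced without a join. This is a small base R sketch over the ID 1 dates; the variable names are made up for the illustration.

```r
# Start and end of the 10-day intervals for ID 1 (one interval per row)
start <- as.Date(c("2014-01-01", "2014-01-15", "2014-01-20", "2014-01-22", "2015-03-10"))
end   <- start + 10

# Which intervals contain the date of row 4?
row4 <- as.Date("2014-01-22")
covering <- start[start <= row4 & row4 <= end]
print(covering)       # start dates of the three intervals containing row 4

# The earliest of these start dates is the candidate group date for row 4
earliest <- min(covering)
print(earliest)       # "2014-01-15"
```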

Now, for each row, we pick the interval with the earliest start date:

dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
   on = .(ID, DATE >= start, DATE <= end)][
     , min(DATE), by = .(ID, rn)]
    ID rn         V1
 1:  1  1 2014-01-01
 2:  1  2 2014-01-15
 3:  1  3 2014-01-15
 4:  1  4 2014-01-15
 5:  1  5 2015-03-10
 6:  2  6 2015-01-13
 7:  2  7 2015-01-13
 8:  2  8 2015-01-20
 9:  2  9 2015-02-28
10:  2 10 2015-03-20

Subsequent rows with the same V1 belong to the same group. Thus, we can use the rleid() function to create the group ids.
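
As a quick illustration of what rleid() does, the sketch below applies it to the V1 values of ID 2 from the table above: the id increases by one whenever the value changes from one row to the next.

```r
library(data.table)

# V1 values (earliest interval start per row) for ID 2, copied from the output above
v1 <- as.IDate(c("2015-01-13", "2015-01-13", "2015-01-20", "2015-02-28", "2015-03-20"))

# Run-length ids: consecutive equal values share one id
ids <- rleid(v1)
print(ids)  # 1 1 2 3 4
```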

tmp <- 
  dt[dt[, .(ID, start = DATE, end = DATE + 10L)], 
     on = .(ID, DATE >= start, DATE <= end)][
       , min(DATE), by = .(ID, rn)][
         , rleid(V1), by = ID]

tmp
    ID V1
 1:  1  1
 2:  1  2
 3:  1  2
 4:  1  2
 5:  1  3
 6:  2  1
 7:  2  1
 8:  2  2
 9:  2  3
10:  2  4

The final steps are:

# append group id to original data.table
dt[, grp := tmp$V1][
  # set group date
  , gdt := first(DATE), by = .(ID, grp)]

Votes: 0

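I think the easiest way is to capture the date-related logic in a vectorized function; see groupDates(x) below. I have used recursive logic for this function; there may be more efficient ways to implement it.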

For the field that counts the changes, you can use cumsum.
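
A minimal sketch of the cumsum idea, using the ID 1 dates and the group dates from the desired output. The variable names are made up here and differ from the column names used in the code further down.

```r
# Dates and (already known) group dates for ID 1
date       <- as.Date(c("2014-01-01", "2014-01-15", "2014-01-20", "2014-01-22", "2015-03-10"))
group_date <- as.Date(c("2014-01-01", "2014-01-15", "2014-01-15", "2014-01-15", "2015-03-10"))

# Group date of the previous row (the first row is compared with itself)
previous <- c(date[1], head(group_date, -1))

# TRUE marks a row that starts a new group; cumsum turns the marks into a counter
is_new <- abs(as.numeric(date - previous)) > 10
group  <- cumsum(is_new) + 1L
print(group)  # 1 2 2 2 3
```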

df1 <- read.table(header=T,text='ID  DATE        ITEM
           1   1/1/2014    P1
           1   1/15/2014   P2
           1   1/20/2014   P3
           1   1/22/2014   P4
           1   3/10/2015   P5
           2   1/13/2015   P1
           2   1/20/2015   P2
           2   1/28/2015   P3
           2   2/28/2015   P4
           2   3/20/2015   P5')

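library(data.table)
dt <- as.data.table(df1)

# Assumed definition: safe.ifelse() is not defined in the original answer.
# This version is ifelse() that keeps the class of `yes` (here: Date).
safe.ifelse <- function(cond, yes, no) {
  structure(ifelse(cond, yes, no), class = class(yes))
}

groupDates <- function(x) {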
  x.prev <- c(head(x, 1), head(x, -1))
  x.diff <- abs(c(0, diff(x)))
  x.diff.big <- x.diff > 10 | x.diff == 0
  x.diff.prev.big <- c(TRUE, head(x.diff.big, -1))
  x[!x.diff.big & x.diff.prev.big] <- NA
  x <- safe.ifelse(!is.na(x), x, x.prev)
  d <- diff(x)
  if(min(d[d > 0]) < 10) {
    groupDates(x)
  }
  else {
    x
  }
}

dt[, date := as.Date(DATE, format = "%m/%d/%Y")]
dt[, group.date := groupDates(date), ID]
dt[, previous.date := shift(group.date, fill = first(date)), ID]
dt[, group.i := 1:.N, ID]   
dt[, previous.date.interval := abs(date -previous.date) > 10, ID]
dt[, group := cumsum(previous.date.interval) + 1L, ID]
dt[, .(ID, DATE, group.date, group)]
#    ID      DATE group.date group
# 1:  1  1/1/2014 2014-01-01     1
# 2:  1 1/15/2014 2014-01-15     2
# 3:  1 1/20/2014 2014-01-15     2
# 4:  1 1/22/2014 2014-01-15     2
# 5:  1 3/10/2015 2015-03-10     3
# 6:  2 1/13/2015 2015-01-13     1
# 7:  2 1/20/2015 2015-01-13     1
# 8:  2 1/28/2015 2015-01-28     2
# 9:  2 2/28/2015 2015-02-28     3
#10:  2 3/20/2015 2015-03-20     4