R - Count the number of items over time using start and end dates

Question  Votes: 6  Answers: 5

I want to count the number of items over time, based on each item's start date and end date.

Some sample data:

START <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
END <- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
df <- data.frame(START,END)
df

       START        END
1 2014-01-01 2014-01-04
2 2014-01-02 2014-01-03
3 2014-01-03 2014-01-03
4 2014-01-03 2014-01-04

A table showing the count of these items over time (based on their start and end dates) would look like this:

DATETIME    COUNT
2014-01-01   1 
2014-01-02   2 
2014-01-03   4 
2014-01-04   2 

Can this be done in R, ideally with dplyr? Many thanks.

r duration dplyr
5 Answers
6
votes

This will do it. You can change the column names as needed.

as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
#         Var1 Freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2

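As pointed out in the comments, Var1 in the solution above is now a factor, not a Date. To keep the Date class in the first column, you can either do a little more work on the solution above or use plyr::count instead of as.data.frame(table(...)):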

library(plyr)
count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
#            x freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2
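
For the "do a little more work" route, a minimal base-R sketch might look like the following. It is one possible way to do it, not necessarily what the answer's author had in mind; the names all_days and counts are made up for illustration.

```r
START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
END   <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
df <- data.frame(START, END)

# Expand every interval to its individual days; c() on Dates keeps the Date class
all_days <- do.call(c, Map(seq, df$START, df$END, by = 1))

# table() turns the dates into labels, so read them back as character
# (not factor) and convert them to Date again
counts <- as.data.frame(table(all_days), stringsAsFactors = FALSE)
names(counts) <- c("DATETIME", "COUNT")
counts$DATETIME <- as.Date(counts$DATETIME)
counts
```

This avoids the plyr dependency at the cost of one explicit conversion step.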

3
votes

You can use data.table:

library(data.table)
DT <- setDT(df)[, list(DATETIME = seq(START, END, by = 1)), by = 1:nrow(df)][
  , list(COUNT = .N), by = DATETIME]
DT
#      DATETIME COUNT
# 1: 2014-01-01     1
# 2: 2014-01-02     2
# 3: 2014-01-03     4
# 4: 2014-01-04     2

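From version 1.9.4 onwards you can also use the foverlaps() function to perform an "overlap join". It is more efficient because it does not have to expand the dates of every row first and then count. Here is how: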

require(data.table) ## 1.9.4
setDT(df) ## convert your data.frame to data.table by reference

## 1. Some preprocessing:
# create a lookup - the dates for which you need the count, and set key
dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
lookup = data.table(START=dates, END=dates, key=c("START", "END"))

## 2. Now find overlapping coordinates 
# for each row in `df` get all the rows it overlaps with in `lookup`
ans = foverlaps(df, lookup, type="any", which=TRUE)

Now we only need to group by yid (= the row index in lookup) and count:

## 3. count
ans[, .N, by=yid]
#    yid N
# 1:   1 1
# 2:   2 2
# 3:   3 4
# 4:   4 2

The first column corresponds to the row number in lookup. If a row number is missing from the result, the count for that date is 0.
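
To turn the yid indices back into dates, and to make zero counts explicit instead of leaving them out, something like this sketch could be used. It assumes the data.table package is installed; the name result and the use of tabulate() are illustrative choices, not part of the original answer.

```r
library(data.table)

df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

# Same overlap join as above: one (xid, yid) pair per overlapping row pair
ans <- foverlaps(df, lookup, type = "any", which = TRUE)

# tabulate() counts how often each row index 1..nrow(lookup) occurs,
# so a date that no interval covers shows up with a count of 0
result <- data.table(DATETIME = lookup$START,
                     COUNT    = tabulate(ans$yid, nbins = nrow(lookup)))
result
```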


1
votes

Using dplyr and grouped data:

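library(dplyr)

# Build the sample data, then duplicate it under two groups, 'a' and 'b'
# (tibble()/as_tibble() replace the deprecated data_frame()/as_data_frame())
df <- tibble(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
df <- rbind(cbind(group = 'a', df), cbind(group = 'b', df)) %>% as_tibble()
df

# Within each group, expand every interval to its days and tabulate them
df %>%
  group_by(group) %>%
  do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))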

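This is a common problem if, for example, you want to find the number of logins on different pages, machines, etc., given each user's time intervals.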

> df
Source: local data frame [8 x 3]

  group      START        END
  (chr)     (date)     (date)
1     a 2014-01-01 2014-01-04
2     a 2014-01-02 2014-01-03
3     a 2014-01-03 2014-01-03
4     a 2014-01-03 2014-01-04
5     b 2014-01-01 2014-01-04
6     b 2014-01-02 2014-01-03
7     b 2014-01-03 2014-01-03
8     b 2014-01-03 2014-01-04
> 
> df %>% 
+   group_by(.,group) %>% 
+   do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
Source: local data frame [8 x 3]
Groups: group [2]

  group       Var1  Freq
  (chr)     (fctr) (int)
1     a 2014-01-01     1
2     a 2014-01-02     2
3     a 2014-01-03     4
4     a 2014-01-04     2
5     b 2014-01-01     1
6     b 2014-01-02     2
7     b 2014-01-03     4
8     b 2014-01-04     2
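
In the output above, Var1 is a factor again. A possible variant that keeps the dates as Date values is sketched below. It assumes dplyr 1.1.0 or later (for reframe()) and is an alternative formulation, not the answer author's code; the names grouped and res are made up here.

```r
library(dplyr)

base <- tibble(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
grouped <- bind_rows(a = base, b = base, .id = "group")

res <- grouped %>%
  group_by(group) %>%
  # one output row per day covered by each interval; c() keeps the Date class
  reframe(DATETIME = do.call(c, Map(seq, START, END, by = 1))) %>%
  # count how many intervals cover each day within each group
  count(group, DATETIME, name = "COUNT")
res
```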

0
votes

Using dplyr and foreach:

library(dplyr)
library(foreach)

df <- data.frame(START = as.Date(c("2014-01-01",
                                   "2014-01-02",
                                   "2014-01-03",
                                   "2014-01-03")),
                 END = as.Date(c("2014-01-04",
                                 "2014-01-03",
                                 "2014-01-03",
                                 "2014-01-04")))
df

r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
             .combine = rbind) %do% {
  df %>%
    filter(DATETIME >= START & DATETIME <= END) %>%
    summarise(DATETIME, COUNT = n())
}
r
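
The same idea (for each day, count the rows whose interval contains that day) can also be sketched in base R without any packages. This is an illustrative equivalent rather than part of the answer; the names days and COUNT are chosen to match the desired output.

```r
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Every day between the earliest start and the latest end
days <- seq(min(df$START), max(df$END), by = 1)

# For each day, count the rows with START <= day <= END
COUNT <- vapply(seq_along(days),
                function(i) sum(df$START <= days[i] & days[i] <= df$END),
                integer(1))

r <- data.frame(DATETIME = days, COUNT = COUNT)
r
```

Like the foreach version, this does one pass over df per day, so it fits short date ranges better than very long ones.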

0
votes

I just came up with another solution, based on lubridate, that is faster for larger data frames with wide date ranges; it is in a newer, related SO post here.
