I have a dataset organized like this:
ID Species DateTime
P1 A 2015-03-16 18:42:00
P2 A 2015-03-16 19:34:00
P3 A 2015-03-16 19:58:00
P4 A 2015-03-16 21:02:00
P5 B 2015-03-16 21:18:00
P6 A 2015-03-16 21:19:00
P7 A 2015-03-16 21:33:00
P8 B 2015-03-16 21:35:00
P9 B 2015-03-16 23:43:00
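For each species, I want to select independent pictures (i.e. pictures at least 1 h apart from each other) from this dataset using R.
In this example, for species A, I only want to keep P1, P3 and P4. P2 is not kept because it falls within the 1 h period that starts at P1. P3 is kept because its DateTime (19:58) falls after 19:42, the end of that period. The next 1 h period then starts at P3 and lasts until 20:58. For species B, only P5 and P9 are kept.
So after this filter, my dataset would look like this: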
ID Species DateTime
P1 A 2015-03-16 18:42:00
P3 A 2015-03-16 19:58:00
P4 A 2015-03-16 21:02:00
P5 B 2015-03-16 21:18:00
P9 B 2015-03-16 23:43:00
Does anyone know how to do this in R?
There may be a more elegant way to do it, but this works:
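library(dplyr)
# Returns TRUE for each timestamp that is at least one hour after the
# most recently kept timestamp. Expects dt sorted in ascending order.
isHourApart <- function(dt) {
  secs <- as.numeric(dt)            # POSIXct -> seconds since the epoch
  last_kept <- -Inf                 # so the first timestamp is always kept
  keeps <- logical(length(secs))    # preallocated, all FALSE
  for (i in seq_along(secs)) {
    if (secs[i] >= last_kept + 60 * 60) {
      last_kept <- secs[i]
      keeps[i] <- TRUE
    }
  }
  keeps
}
df %>%
  arrange(Species, DateTime) %>%
  group_by(Species) %>%
  filter(isHourApart(DateTime))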
> df
# A tibble: 5 x 3
# Groups: Species [2]
ID Species DateTime
<chr> <fct> <dttm>
1 P1 A 2015-03-16 18:42:00
2 P3 A 2015-03-16 19:58:00
3 P4 A 2015-03-16 21:02:00
4 P5 B 2015-03-16 21:18:00
5 P9 B 2015-03-16 23:43:00
Note that the DateTime column must be of class POSIXct.
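The same sequential scan can also be written without dplyr. The following is a minimal sketch in base R, not part of the answer above: the helper name `keep_hour_apart` and its `gap_secs` parameter are made up for illustration, and the time zone is set to UTC only to make the example reproducible.

```r
# Sample data from the question.
df <- read.table(text = "ID Species DateTime
P1 A '2015-03-16 18:42:00'
P2 A '2015-03-16 19:34:00'
P3 A '2015-03-16 19:58:00'
P4 A '2015-03-16 21:02:00'
P5 B '2015-03-16 21:18:00'
P6 A '2015-03-16 21:19:00'
P7 A '2015-03-16 21:33:00'
P8 B '2015-03-16 21:35:00'
P9 B '2015-03-16 23:43:00'", header = TRUE, stringsAsFactors = FALSE)
df$DateTime <- as.POSIXct(df$DateTime, tz = "UTC")  # UTC is an assumption

# Hypothetical helper: TRUE for each timestamp (in seconds) that is at least
# `gap_secs` after the most recently kept one. Expects ascending input.
keep_hour_apart <- function(secs, gap_secs = 3600) {
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in seq_along(secs)) {
    if (secs[i] - last_kept >= gap_secs) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# Sort first, then apply the helper within each species.
df <- df[order(df$Species, df$DateTime), ]
keep <- as.logical(ave(as.numeric(df$DateTime), df$Species,
                       FUN = keep_hour_apart))
result <- df[keep, ]
print(result)
```

The loop is hard to vectorize because whether a row is kept depends on which earlier row was last kept, not just on the previous row.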
Here is one way to do this using data.table:
library(data.table)
library(lubridate)
df1 <- read.table(text = "ID Species DateTime
P1 A '2015-03-16 18:42:00'
P3 A '2015-03-16 19:58:00'
P4 A '2015-03-16 21:02:00'
P5 B '2015-03-16 21:18:00'
P9 B '2015-03-16 23:43:00'",
header = TRUE, stringsAsFactors = FALSE)
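setDT(df1)
df1[, DateTime := ymd_hms(DateTime)]
# End of the 1 h window that starts at each picture
df1[, date_range := DateTime + 60 * 60]
# Copy with the timestamp renamed to `date` for the non-equi join
df2 <- copy(df1)
df2[, date := DateTime]
df2[, DateTime := NULL]
# Non-equi self-join: rows of df2 whose `date` falls inside a df1 window
df <- df2[df1,
          .(ID, Species, date = x.date, DateTime, date_range),
          on = .(ID, Species, date >= DateTime, date <= date_range),
          nomatch = 0L, allow.cartesian = TRUE]
df[, c("date", "date_range") := NULL]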
ID Species DateTime
1: P1 A 2015-03-16 18:42:00
2: P3 A 2015-03-16 19:58:00
3: P4 A 2015-03-16 21:02:00
4: P5 B 2015-03-16 21:18:00
5: P9 B 2015-03-16 23:43:00
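As far as I can tell, the join above starts from the five already-filtered rows and includes ID in its keys, so each row can only match itself; on the full nine rows it would return all of them. The following is a minimal sketch, not the code from the answer, that applies the sequential 1 h rule per species with data.table. The helper `hour_apart` and the column `keep` are hypothetical names, and UTC is assumed for the timestamps.

```r
library(data.table)

# Full sample data from the question (all nine rows).
dt <- data.table(
  ID = paste0("P", 1:9),
  Species = c("A", "A", "A", "A", "B", "A", "A", "B", "B"),
  DateTime = as.POSIXct(c(
    "2015-03-16 18:42:00", "2015-03-16 19:34:00", "2015-03-16 19:58:00",
    "2015-03-16 21:02:00", "2015-03-16 21:18:00", "2015-03-16 21:19:00",
    "2015-03-16 21:33:00", "2015-03-16 21:35:00", "2015-03-16 23:43:00"
  ), tz = "UTC")
)

# Hypothetical helper: TRUE where a timestamp (in seconds) is at least
# `gap_secs` after the most recently kept one. Expects ascending input.
hour_apart <- function(secs, gap_secs = 3600) {
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in seq_along(secs)) {
    if (secs[i] - last_kept >= gap_secs) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

setorder(dt, Species, DateTime)                               # sort in place
dt[, keep := hour_apart(as.numeric(DateTime)), by = Species]  # flag per species
result <- dt[keep == TRUE, .(ID, Species, DateTime)]
print(result)
```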
Here is a dplyr solution:
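library(dplyr)
df %>%
  mutate(DateTime = as.POSIXct(DateTime)) %>%
  arrange(Species, DateTime) %>%
  group_by(Species) %>%
  mutate(
    # Minutes since the previous picture of the same species. The units are
    # set explicitly because difftime otherwise picks them automatically.
    diff = as.numeric(difftime(DateTime, lag(DateTime), units = "mins")),
    diff = ifelse(is.na(diff), 0, diff),
    # Index of the 60-minute bin, counted from the first picture of the species
    cumdiff = cumsum(diff) %/% 60,
    x = abs(lag(cumdiff) - cumdiff)) %>%
  # Keep the first picture and every picture that starts a new bin
  filter(is.na(x) | x > 0) %>%
  select(ID, Species, DateTime) %>%
  ungroup() %>%
  as.data.frame()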
# ID Species DateTime
#1 P1 A 2015-03-16 18:42:00
#2 P3 A 2015-03-16 19:58:00
#3 P4 A 2015-03-16 21:02:00
#4 P5 B 2015-03-16 21:18:00
#5 P9 B 2015-03-16 23:43:00
# Sample data
df <- read.table(text = "ID Species DateTime
P1 A '2015-03-16 18:42:00'
P2 A '2015-03-16 19:34:00'
P3 A '2015-03-16 19:58:00'
P4 A '2015-03-16 21:02:00'
P5 B '2015-03-16 21:18:00'
P6 A '2015-03-16 21:19:00'
P7 A '2015-03-16 21:33:00'
P8 B '2015-03-16 21:35:00'
P9 B '2015-03-16 23:43:00'", header = TRUE)
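We can simply create a new column of 60-minute intervals and then keep the first occurrence within each interval for each Species. This assumes DateTime has already been converted to POSIXct and that the rows are in chronological order.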
df %>%
mutate(by60 = cut(DateTime, "60 min")) %>%
group_by(Species, by60) %>%
slice(1)
Output 1
# A tibble: 5 x 4
# Groups: Species, by60 [5]
ID Species DateTime by60
<chr> <chr> <dttm> <fct>
1 P1 A 2015-03-16 18:42:00 2015-03-16 18:42:00
2 P3 A 2015-03-16 19:58:00 2015-03-16 19:42:00
3 P4 A 2015-03-16 21:02:00 2015-03-16 20:42:00
4 P5 B 2015-03-16 21:18:00 2015-03-16 20:42:00
5 P9 B 2015-03-16 23:43:00 2015-03-16 23:42:00
If we want to drop that dummy column:
df %>%
mutate(by60 = cut(DateTime, "60 min")) %>%
group_by(Species, by60) %>%
slice(1) %>%
ungroup() %>%
select(-by60)
Output 2
# A tibble: 5 x 3
ID Species DateTime
<chr> <chr> <dttm>
1 P1 A 2015-03-16 18:42:00
2 P3 A 2015-03-16 19:58:00
3 P4 A 2015-03-16 21:02:00
4 P5 B 2015-03-16 21:18:00
5 P9 B 2015-03-16 23:43:00
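Fixed 60-minute bins give the expected result on the sample data, but they are not quite the same rule as in the question, where each 1 h window starts at the last kept picture. The following base R sketch uses three made-up timestamps (not from the question) to show a case where the two can differ; the variable names are illustrative only.

```r
# Made-up timestamps for a single species: 10:00, 11:50, 12:05.
times <- as.POSIXct(c("2015-03-16 10:00:00",
                      "2015-03-16 11:50:00",
                      "2015-03-16 12:05:00"), tz = "UTC")

# Fixed bins: [10:00, 11:00), [11:00, 12:00), [12:00, 13:00).
# Keeping the first picture per bin keeps all three.
bins <- cut(times, "60 min")
fixed_keep <- !duplicated(bins)

# Sequential rule: a picture is kept only if it is at least 1 h after the
# last kept picture. 12:05 is only 15 minutes after 11:50, so it is dropped.
secs <- as.numeric(times)
seq_keep <- logical(length(secs))
last_kept <- -Inf
for (i in seq_along(secs)) {
  if (secs[i] - last_kept >= 3600) {
    seq_keep[i] <- TRUE
    last_kept <- secs[i]
  }
}

print(data.frame(times, fixed_keep, seq_keep))
```

If pictures close together on either side of a bin boundary should count as dependent, the sequential scan is the safer choice; the binning approach is shorter and may be good enough when that case cannot occur.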