Suppose I have a df like this:
df <- data.frame(id = c(rep(1:5, each = 2)),
time1 = c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12", "2007-05-30", "2003-09-29","2003-09-29", "2003-04-01", "2003-04-01"),
time2 = c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))
id time1 time2
1 1 2008-10-12 2009-03-20
2 1 2008-08-10 2009-06-15
3 2 2006-01-09 2006-02-13
4 2 2008-03-13 2008-04-17
5 3 2008-09-12 2008-10-17
6 3 2007-05-30 2007-07-04
7 4 2003-09-29 2004-01-15
8 4 2003-09-29 2004-01-15
9 5 2003-04-01 2003-07-04
10 5 2003-04-01 2003-07-04
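What I am trying to do is, first, create a lubridate interval between the variables "time1" and "time2". Second, I want to group by "id" and compare whether the next row is the same as the current row, and whether the current row is the same as the previous row. I can do that as follows: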
library(tidyverse)
library(lubridate)  # provides interval(); not attached by tidyverse versions before 2.0.0
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0))
id time1 time2 overlap cond1 cond2
<int> <date> <date> <S4: Interval> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 1 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 1
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 1 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 1
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1
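As you can see, the problem is that for id == 2 and id == 3 both conditions evaluate to TRUE even though the intervals are not the same. For id == 1 it correctly evaluates to FALSE, and for id == 4 and id == 5 it correctly evaluates to TRUE.
Now, when I convert the interval to character, everything is evaluated correctly: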
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = as.character(interval(time1, time2))) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0))
id time1 time2 overlap cond1 cond2
<int> <date> <date> <chr> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 0 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 0
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 0 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 0
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1
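The question is: why does it evaluate some intervals as the same when they are not?
UPDATE
If you look at the code of the Interval class, you will see that when an object is created it stores the start date, then computes the difference between start and end and stores that as .Data.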
interval <- function(start, end = NULL, tzone = tz(start)) {
if (is.null(tzone)) {
tzone <- tz(end)
if (is.null(tzone))
tzone <- "UTC"
}
if (is.character(start) && is.null(end)) {
return(parse_interval(start, tzone))
}
if (is.Date(start)) start <- date_to_posix(start)
if (is.Date(end)) end <- date_to_posix(end)
start <- as_POSIXct(start, tzone)
end <- as_POSIXct(end, tzone)
span <- as.numeric(end) - as.numeric(start)
starts <- start + rep(0, length(span))
if (tzone != tz(starts)) starts <- with_tz(starts, tzone)
new("Interval", span, start = starts, tzone = tzone)
}
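In other words, the returned object has no concept of an "end date". The default value of the end argument is NULL, which means you can even create an interval without an end date.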
interval("2019-03-29")
[1] 2019-03-29 UTC--NA
The "end date" is just text produced by a calculation that happens when the Interval object is formatted for printing.
format.Interval <- function(x, ...) {
if (length(x@.Data) == 0) return("Interval(0)")
paste(format(x@start, tz = x@tzone, usetz = TRUE), "--",
format(x@start + x@.Data, tz = x@tzone, usetz = TRUE), sep = "")
}
int_end <- function(int) int@start + int@.Data
Both code snippets come from https://github.com/tidyverse/lubridate/blob/f7a7c2782ba91b821f9af04a40d93fbf9820c388/R/intervals.r.
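To make that concrete, here is a minimal sketch using lubridate's exported accessors int_start(), int_length() and int_end(). The variable name x is only for illustration. It shows that the end of an interval can be rebuilt as start plus length in seconds:

```r
library(lubridate)

# One interval from the example data (id == 5)
x <- interval(as.Date("2003-04-01"), as.Date("2003-07-04"))

# Length of the interval in seconds (94 days here)
int_length(x)

# The end point is recoverable as start + length
int_start(x) + int_length(x) == int_end(x)
```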
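Accessing the underlying slots of overlap lets you do the comparison without converting to character. You have to check that both start and .Data are equal. Converting to character is cleaner, but if you want to avoid it, this is how you do it.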
ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0)
Taken in full:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2),
overlap_char = as.character(interval(time1, time2))) %>%
group_by(id) %>%
mutate(cond1_original = ifelse(lead(overlap_char) == overlap_char, 1, 0),
cond1_new = ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0),
cond2_original = ifelse(lag(overlap_char) == overlap_char, 1, 0),
cond2_new = ifelse(lag(overlap@start) == overlap@start & lag(overlap@.Data) == overlap@.Data, 1, 0))
id time1 time2 overlap overlap_char cond1_original cond1_new cond2_original cond2_new
<int> <date> <date> <S4: Interval> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 2008-10-12 UTC--2009-03-20 UTC 0 0 NA NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC 2008-08-10 UTC--2009-06-15 UTC NA NA 0 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 2006-01-09 UTC--2006-02-13 UTC 0 0 NA NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC 2008-03-13 UTC--2008-04-17 UTC NA NA 0 0
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 2008-09-12 UTC--2008-10-17 UTC 0 0 NA NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC 2007-05-30 UTC--2007-07-04 UTC NA NA 0 0
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 1 1 NA NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC NA NA 1 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC 1 1 NA NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC NA NA 1 1
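The same check can be written without reaching into the slots directly. The following is only a sketch: same_interval() is a hypothetical helper name, not part of lubridate, and it relies on the exported accessors int_start() and int_end().

```r
library(lubridate)

# Hypothetical helper: two intervals match only if both end points match
same_interval <- function(x, y) {
  int_start(x) == int_start(y) & int_end(x) == int_end(y)
}

# Rows 3 and 4 of the example data: same length (35 days), different dates
a3 <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))
a4 <- interval(as.Date("2008-03-13"), as.Date("2008-04-17"))

same_interval(a3, a4)  # different start and end dates
same_interval(a3, a3)  # an interval compared with itself
```

Comparing both end points is equivalent to comparing start and .Data, because the end is derived from those two values.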
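You can read more about Intervals here: https://lubridate.tidyverse.org/reference/Interval-class.html
I believe your exact case has to do with the == comparison. As shown above, overlap is not a plain atomic vector. From ?==, it says:
At least one of x and y must be an atomic vector, but if the other is a list R attempts to coerce it to the type of the atomic vector: this will succeed if the list is made up of elements of length one that can be coerced to the correct type.
If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.
We can coerce overlap to numeric and to character to see the difference.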
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0)) %>%
mutate(overlap.n = as.numeric(overlap),
overlap.c = as.character(overlap))
# A tibble: 10 x 8
# Groups: id [5]
id time1 time2 overlap cond1 cond2 overlap.n overlap.c
<int> <date> <date> <S4: Interval> <dbl> <dbl> <dbl> <chr>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA 13737600 2008-10-12 U…
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0 26697600 2008-08-10 U…
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 1 NA 3024000 2006-01-09 U…
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 1 3024000 2008-03-13 U…
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 1 NA 3024000 2008-09-12 U…
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 1 3024000 2007-05-30 U…
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA 9331200 2003-09-29 U…
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1 9331200 2003-09-29 U…
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA 8121600 2003-04-01 U…
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1 8121600 2003-04-01 U…
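Based on the output above, I think that using == coerces the overlap intervals to a numeric vector, which results in the comparison of durations that @hmhensen mentioned above. When you force the coercion to character instead of numeric, you get the result you want.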
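A small sketch of the two coercions side by side, using rows 3 and 4 of the example data. The names x and y are illustrative. int_length() is lubridate's accessor for the length in seconds, and as.character() is the conversion used in the question.

```r
library(lubridate)

x <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))
y <- interval(as.Date("2008-03-13"), as.Date("2008-04-17"))

# Numeric view: both intervals are 35 days = 3024000 seconds long
int_length(x) == int_length(y)

# Character view: the printed start and end dates differ
as.character(x) == as.character(y)
```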
I think this has to do with what lubridate actually computes.
When I compute the difference between time1 and time2, this is what happens:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = time2 - time1)
id time1 time2 overlap
1 1 2008-10-12 2009-03-20 159 days
2 1 2008-08-10 2009-06-15 309 days
3 2 2006-01-09 2006-02-13 35 days
4 2 2008-03-13 2008-04-17 35 days
5 3 2008-09-12 2008-10-17 35 days
6 3 2007-05-30 2007-07-04 35 days
7 4 2003-09-29 2004-01-15 108 days
8 4 2003-09-29 2004-01-15 108 days
9 5 2003-04-01 2003-07-04 94 days
10 5 2003-04-01 2003-07-04 94 days
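So we can tell that, within id 2 and id 3, the intervals have the same length in days.
Now, what is overlap actually computing? To find out, I changed the code slightly to report the lead and lag values instead of 1.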
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, lead(overlap), 0),
cond2 = ifelse(lag(overlap) == overlap, lag(overlap), 0))
# A tibble: 10 x 6
# Groups: id [5]
id time1 time2 overlap cond1 cond2
<int> <date> <date> <S4: Interval> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 3024000 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 3024000
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 3024000 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 3024000
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 9331200 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 9331200
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 8121600 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 8121600
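Here we see that the comparison with lead and lag is actually made on the length of the time span in seconds, not on the interval's actual start and end dates. That is why it treats some intervals as equal while the strings are treated as unequal.
Let's look at the object that interval produces.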
a <- interval(df$time1, df$time2)
str(a)
#Formal class 'Interval' [package "lubridate"] with 3 slots
#..@ .Data: num [1:10] 13737600 26697600 3024000 3024000 3024000 ...
#..@ start: POSIXct[1:10], format: "2008-10-12" "2008-08-10" "2006-01-09" ...
#..@ tzone: chr "UTC"
This is an S4 class with three slots: .Data, start and tzone.
Calling a shows us the intervals.
a
[1] 2008-10-12 UTC--2009-03-20 UTC 2008-08-10 UTC--2009-06-15 UTC 2006-01-09 UTC--2006-02-13 UTC
[4] 2008-03-13 UTC--2008-04-17 UTC 2008-09-12 UTC--2008-10-17 UTC 2007-05-30 UTC--2007-07-04 UTC
[7] 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 2003-04-01 UTC--2003-07-04 UTC
[10] 2003-04-01 UTC--2003-07-04 UTC
But when you do computations on a, they are done on .Data, which holds the length of each interval in seconds, counted from its start date (see ?interval).
a@.Data
#[1] 13737600 26697600 3024000 3024000 3024000 3024000 9331200 9331200 8121600 8121600
For the start dates of the intervals, we need to access the start slot.
a@start
#[1] "2008-10-12 UTC" "2008-08-10 UTC" "2006-01-09 UTC" "2008-03-13 UTC" "2008-09-12 UTC"
#[6] "2007-05-30 UTC" "2003-09-29 UTC" "2003-09-29 UTC" "2003-04-01 UTC" "2003-04-01 UTC"
And the time zone...
a@tzone
#[1] "UTC"
We can also look at how the elements relate to each other. The last element and the second-to-last element have the same interval.
a[9] == a[10]
#[1] TRUE
And they are identical objects.
identical(a[9], a[10])
#[1] TRUE
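But what is really being checked when you test elements for equality? Elements 3 and 4 have the same time difference, but they are not the same interval. So when you check whether their lag/lead is equal, it returns TRUE. Since they have different interval dates, it should not. Only when we check whether they are identical do we get what we expect.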
a[3] == a[4]
#[1] TRUE
a[3]@.Data == a[4]@.Data
#[1] TRUE
identical(a[3], a[4])
#[1] FALSE
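So what is going on? What a[3] == a[4] really checks is a[3]@.Data == a[4]@.Data, so it is checking whether 3024000 equals 3024000. It does, so it returns TRUE. identical, however, checks all the slots and finds that the objects are not the same, because start differs between the two.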
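The same behaviour can be reproduced without lubridate. This is a sketch under stated assumptions: "Span" is a made-up toy class, not lubridate's Interval, and it defines no comparison method of its own, which appears to be the situation in the Interval source quoted earlier.

```r
library(methods)

# Toy S4 class: a numeric data part plus an extra slot, like Interval's .Data + start
setClass("Span", contains = "numeric", slots = c(start = "numeric"))

p <- new("Span", 10, start = 1)
q <- new("Span", 10, start = 2)

# With no Compare method defined, == falls back to the numeric data part
p == q

# identical() also compares the start slot, so the objects differ
identical(p, q)
```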
Then I considered using identical with lead/lag so that we could put that logic into the code, but look at this.
a[9]
#[1] 2003-04-01 UTC--2003-07-04 UTC
# now lead
lead(a[9])
#2003-04-01 UTC--NA
The output does not look like the a[10] we expected.
#now lag
lag(a[9])
#[1] NA
#attr(,"start")
#[1] "2003-04-01 UTC"
#attr(,"tzone")
#[1] "UTC"
#attr(,"class")
#[1] "Interval"
#attr(,"class")attr(,"package")
#[1] "lubridate"
So lead and lag have different effects on S4 objects. To get a better handle on the output of your first attempt, I did this:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = lead(overlap),
cond2 = lag(overlap))
I get a lot of warning messages:
#In mutate_impl(.data, dots) :
# Vectorizing 'Interval' elements may not preserve their attributes
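I don't know enough about R objects to understand how data is stored in S4 classes, but it certainly looks different from typical S3 objects.
It seems that using as.character is the way to go.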
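If the goal is only to detect identical start and end dates within a group, another option is to skip the Interval object and compare the two Date columns directly, since lead and lag work on Date vectors. This is a sketch of that alternative, not something proposed in the answers above. df2 is a hypothetical subset of the example data (ids 2 and 4), and the column names same_as_next and same_as_prev are made up.

```r
library(dplyr)

# Hypothetical subset of the example data
df2 <- data.frame(
  id    = c(2, 2, 4, 4),
  time1 = as.Date(c("2006-01-09", "2008-03-13", "2003-09-29", "2003-09-29")),
  time2 = as.Date(c("2006-02-13", "2008-04-17", "2004-01-15", "2004-01-15"))
)

res <- df2 %>%
  group_by(id) %>%
  # Both end points must match the neighbouring row within the same id
  mutate(same_as_next = time1 == lead(time1) & time2 == lead(time2),
         same_as_prev = time1 == lag(time1) & time2 == lag(time2)) %>%
  ungroup()

res
```

The first and last row of each group get NA, because there is no neighbouring row to compare against.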