I created this example data frame:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
                 car = c("subaru", "audi", "subaru", "toyota", "toyota", "audi", "subaru", "nissan", "nissan"),
                 buy_date = c("01/01/2000", "01/01/2001", "01/02/2001", "01/01/2000", "01/05/2000", "01/01/2005", "01/03/2005", "01/01/2000", "02/01/2000"))
df$buy_date <- as.Date(df$buy_date, format="%d/%m/%Y") # it doesn't work properly, but that is not the issue at the moment
This produces the following df:
id | car | buy_date |
---|---|---|
1 | subaru | 2000-01-01 |
1 | audi | 2001-01-01 |
1 | subaru | 2001-02-01 |
2 | toyota | 2000-01-01 |
2 | toyota | 2004-12-01 |
2 | audi | 2005-01-01 |
2 | subaru | 2005-03-01 |
3 | nissan | 2000-01-01 |
3 | nissan | 2000-01-02 |
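I want to filter as follows: for each id, if that person bought two different types of car within 180 days of each other, I want to keep those rows for that id. So it should return something like this:
id | car | buy_date |
---|---|---|
1 | audi | 2001-01-01 |
1 | subaru | 2001-02-01 |
2 | toyota | 2004-12-01 |
2 | audi | 2005-01-01 |
2 | subaru | 2005-03-01 |
I hope you can help. Thanks in advance.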
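You can do this with two inner joins. Note that in your example data, the second toyota bought by id #2 does not match the "2004-12-01" date shown in your table, so I changed it (see the input below).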
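library(dplyr)
df$buy_date <- as.Date(df$buy_date, format="%d/%m/%Y")
# Outer join: keep only the original rows that appear in the filtered pairs
inner_join(
  df,
  # Inner self-join: pair every purchase with every other purchase of the same id
  inner_join(df, df, by="id", multiple="all") %>%
    # different cars, bought fewer than 180 days apart
    filter(car.x!=car.y, abs(buy_date.x-buy_date.y)<180) %>%
    select(id, car=car.x, buy_date=buy_date.x) %>%
    distinct()
)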
Output:
id car buy_date
1 1 audi 2001-01-01
2 1 subaru 2001-02-01
3 2 toyota 2004-12-01
4 2 audi 2005-01-01
5 2 subaru 2005-03-01
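For readers without dplyr, the same self-join idea can be sketched in base R with `merge()`. This is only an illustration under assumptions: it uses the corrected date for id 2 ("01/12/2004"), and the names `pairs`, `gap_days`, `keep` and `result` are made up for this sketch and are not part of the answer.

```r
# Sketch only: the self-join approach using base R instead of dplyr.
# Uses the corrected purchase date for id 2 (01/12/2004).
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  car = c("subaru", "audi", "subaru", "toyota", "toyota",
          "audi", "subaru", "nissan", "nissan"),
  buy_date = as.Date(
    c("01/01/2000", "01/01/2001", "01/02/2001", "01/01/2000", "01/12/2004",
      "01/01/2005", "01/03/2005", "01/01/2000", "02/01/2000"),
    format = "%d/%m/%Y"
  )
)

# Pair every purchase with every purchase made by the same id.
pairs <- merge(df, df, by = "id", suffixes = c(".x", ".y"))

# Keep pairs of different cars bought fewer than 180 days apart.
gap_days <- abs(as.numeric(pairs$buy_date.x - pairs$buy_date.y))
pairs <- pairs[pairs$car.x != pairs$car.y & gap_days < 180, ]

# Reduce back to the distinct original rows that took part in such a pair.
keep <- unique(data.frame(id = pairs$id,
                          car = pairs$car.x,
                          buy_date = pairs$buy_date.x))
result <- keep[order(keep$id, keep$buy_date), ]
rownames(result) <- NULL
result
```

Because every row is compared with every other row of the same id, this does not depend on the purchases being sorted by date, at the cost of building all within-id pairs.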
Input (with the corrected date for id #2):
structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3), car = c("subaru",
"audi", "subaru", "toyota", "toyota", "audi", "subaru", "nissan",
"nissan"), buy_date = c("01/01/2000", "01/01/2001", "01/02/2001",
"01/01/2000", "01/12/2004", "01/01/2005", "01/03/2005", "01/01/2000",
"02/01/2000")), class = "data.frame", row.names = c(NA, -9L))
Construct the conditions, then `filter`:
library(dplyr)
df %>%
  mutate(dif = c(diff.Date(buy_date), 0) < 180,
         dup = ifelse(dif, car, NA), .by = id) %>%
  filter(!(duplicated(dup) | duplicated(dup, fromLast=TRUE) & !is.na(dup)) &
           dif, .by = id) %>%
  select(-c(dif, dup))
id car buy_date
1 1 audi 2001-01-01
2 1 subaru 2001-02-01
3 2 toyota 2000-01-01
4 2 audi 2005-01-01
5 2 subaru 2005-03-01