我有4个数据帧。
数据帧:取货地址
ID x1 x2 x3 x4 x5 x6.....x1090
1001 Place1 Place2 NA NA NA NA
1002 Place1 Place2 Place 3 Place4 NA NA
1003 Place5 Place3 Place 2 Place 2 NA NA
1004 Place6 Place7 NA NA NA NA
数据框。投递地址
ID x1 x2 x3 x4 x5 x6.....x1090
1001 Place2 Place1 NA NA NA NA
1002 Place2 Place1 Place4 Place3 NA NA
1003 Place3 Place5 Place6 Place7 NA NA
1004 Place7 Place6 NA NA NA NA
数据框架:日期
ID x1 x2 x3 x4 x5 x6.....x1090
1001 4/1/2020 4/1/2020 NA NA NA NA
1002 4/1/2020 4/1/2020 4/3/2020 4/3/2020 NA NA
1003 4/3/2020 4/3/2020 4/4/2020 4/5/2020 NA NA
1004 4/5/2020 4/5/2020 NA NA NA NA
数据帧。时间
ID x1 x2 x3 x4 x5 x6.....x1090
1001 8:00:00 AM 4:00:00 PM NA NA NA NA
1002 9:00:05 AM 12:35:05 PM 11:00:00 AM 4:00:00 AM NA NA
1003 3:00:00 PM 6:00:00 PM 7:00:00 AM 3:00:00 PM NA NA
1004 7:00:00 AM 3:00:00 PM NA NA NA NA
我想把这些行程按日期分组,然后找到行程链,输出结果是这样的。
ID Date Tripchain Time
1001 4/1/2020 place 1- place 2- place 1 Morning- Afternoon
1002 4/1/2020 place 1- place 2- place 1 Morning -Afternoon
1002 4/3/2020 place 3- place 4- place 3 Morning -Afternoon
1003 4/3/2020 place 5- place 3- place 5 Afternoon-Afternoon
1003 4/4/2020 place 2- place 7 Morning
1003 4/5/2020 place 2- place 7 Afternoon
1004 4/5/2020 place 6- place 7- place 6 Morning-Afternoon
首先,我认为你的折叠格式为 place1-place2-...
在一个字符串中,会有可怕的规模。如果只是为了报告(绝不是 使用 该数据的格式),那么好吧,但除此之外,我建议你把它分开。
这里要做的第一件事就是从单个帧转换为组合长格式。我再进一步,将datestimes转换成R-native的 POSIXt
对象。
dat <- list(
pivot_longer(pickup, -ID, names_to = "x", values_to = "pickup"),
pivot_longer(dropoff, -ID, names_to = "x", values_to = "dropoff"),
pivot_longer(dates, -ID, names_to = "x", values_to = "date"),
pivot_longer(times, -ID, names_to = "x", values_to = "time")
) %>%
Reduce(function(a, b) full_join(a, b, by = c("ID", "x")), .) %>%
filter(complete.cases(.)) %>%
mutate(
timestamp = as.POSIXct(paste(date, time), format = "%m/%d/%Y %I:%M:%S_%p"),
date = as.Date(timestamp)
) %>%
select(-x, -time)
dat
# # A tibble: 12 x 5
# ID pickup dropoff date timestamp
# <int> <chr> <chr> <date> <dttm>
# 1 1001 Place1 Place2 2020-04-01 2020-04-01 08:00:00.000
# 2 1001 Place2 Place1 2020-04-01 2020-04-01 16:00:00.000
# 3 1002 Place1 Place2 2020-04-01 2020-04-01 09:00:05.000
# 4 1002 Place2 Place1 2020-04-01 2020-04-01 12:35:05.000
# 5 1002 Place3 Place4 2020-04-03 2020-04-03 11:00:00.000
# 6 1002 Place4 Place3 2020-04-03 2020-04-03 04:00:00.000
# 7 1003 Place5 Place3 2020-04-03 2020-04-03 15:00:00.000
# 8 1003 Place3 Place5 2020-04-04 2020-04-03 18:00:00.000
# 9 1003 Place2 Place6 2020-04-04 2020-04-04 07:00:00.000
# 10 1003 Place2 Place7 2020-04-05 2020-04-05 15:00:00.000
# 11 1004 Place6 Place7 2020-04-05 2020-04-05 07:00:00.000
# 12 1004 Place7 Place6 2020-04-05 2020-04-05 15:00:00.000
从这里开始,为了到达你想去的地方,我想我们可以把它们排序(按时间戳),把它们分组(按id),然后合并。哦,并转换成人类可读的 "时间-日期"。
time2human <- function(x) {
as.character(
cut(as.numeric(format(x, format = "%H")), c(-1, 4, 12, 17, 20, 25),
labels = c("Night", "Morning", "Afternoon", "Evening", "Night"))
)
}
dat %>%
arrange(timestamp) %>%
group_by(ID, date) %>%
summarize(
Tripchain = paste(c(pickup, last(dropoff)), collapse = " - "),
Timeframe = paste(time2human(min(timestamp)), time2human(max(timestamp)),
sep = " - "),
Time1 = first(timestamp), Time2 = last(timestamp)
) %>%
ungroup()
# # A tibble: 7 x 6
# ID date Tripchain Timeframe Time1 Time2
# <int> <date> <chr> <chr> <dttm> <dttm>
# 1 1001 2020-04-01 Place1 - Place2 - Place1 Morning - Afternoon 2020-04-01 08:00:00.000 2020-04-01 16:00:00.000
# 2 1002 2020-04-01 Place1 - Place2 - Place1 Morning - Morning 2020-04-01 09:00:05.000 2020-04-01 12:35:05.000
# 3 1002 2020-04-03 Place4 - Place3 - Place4 Night - Morning 2020-04-03 04:00:00.000 2020-04-03 11:00:00.000
# 4 1003 2020-04-03 Place5 - Place3 Afternoon - Afternoon 2020-04-03 15:00:00.000 2020-04-03 15:00:00.000
# 5 1003 2020-04-04 Place3 - Place2 - Place6 Evening - Morning 2020-04-03 18:00:00.000 2020-04-04 07:00:00.000
# 6 1003 2020-04-05 Place2 - Place7 Afternoon - Afternoon 2020-04-05 15:00:00.000 2020-04-05 15:00:00.000
# 7 1004 2020-04-05 Place6 - Place7 - Place6 Morning - Afternoon 2020-04-05 07:00:00.000 2020-04-05 15:00:00.000
我保留了 Time1
和 Time2
以防您仍然需要完整的时间戳。您将需要更新 time2human
功能,这样切割就会更符合你的喜好。(我的 "小时 "为-1和+25,以确保我完全绑定了所有可能的小时;我可以更精确和小心,但这已经足够安全了。)
最后,如果你喜欢用你开始使用的格式来显示日期和时间,可以使用 format(..., format=...)
以获得你想要的东西;我更喜欢(并建议)把它们放在 Date
和 POSIXt
格式,因为它们本质上是数字格式,因此很容易进行比较排序。一旦它们转换为字符串,类似的操作就变得不那么清晰了(比如试图对以下内容进行排序 4/1/2019
之前 3/1/2020
例如,)。)