How to group data from multiple data frames


I have 4 data frames.

Data frame: pickup addresses

ID    x1     x2      x3       x4         x5    x6.....x1090
1001  Place1 Place2  NA       NA         NA    NA
1002  Place1 Place2  Place 3  Place4     NA    NA
1003  Place5 Place3  Place 2  Place 2    NA    NA
1004  Place6 Place7  NA       NA         NA    NA 

Data frame: drop-off addresses

ID    x1     x2      x3       x4         x5    x6.....x1090
1001  Place2 Place1  NA       NA         NA    NA
1002  Place2 Place1  Place4   Place3     NA    NA
1003  Place3 Place5  Place6   Place7     NA    NA
1004  Place7 Place6  NA       NA         NA    NA 

Data frame: dates

ID    x1         x2         x3        x4         x5    x6.....x1090
1001  4/1/2020   4/1/2020   NA        NA         NA    NA
1002  4/1/2020   4/1/2020   4/3/2020  4/3/2020   NA    NA
1003  4/3/2020   4/3/2020   4/4/2020  4/5/2020   NA    NA
1004  4/5/2020   4/5/2020   NA        NA         NA    NA 

Data frame: times

ID    x1           x2           x3            x4           x5    x6.....x1090
1001  8:00:00 AM   4:00:00 PM   NA            NA           NA    NA
1002  9:00:05 AM  12:35:05 PM   11:00:00 AM   4:00:00 AM   NA    NA
1003  3:00:00 PM   6:00:00 PM   7:00:00 AM    3:00:00 PM   NA    NA
1004  7:00:00 AM   3:00:00 PM   NA            NA           NA    NA
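
For reproducibility, the four tables above can be entered as R data frames. This is only a sketch, not the asker's original code: it includes just x1 to x4, it removes the stray spaces in "Place 3" and "Place 2", and it assumes the names pickup, dropoff, dates, and times so that it lines up with the answer's code.

```r
# Sketch of the question's data; only x1..x4 are kept (the source runs to x1090).
# Assumption: "Place 3" / "Place 2" are normalized to "Place3" / "Place2".
ids <- c(1001L, 1002L, 1003L, 1004L)

pickup <- data.frame(
  ID = ids,
  x1 = c("Place1", "Place1", "Place5", "Place6"),
  x2 = c("Place2", "Place2", "Place3", "Place7"),
  x3 = c(NA, "Place3", "Place2", NA),
  x4 = c(NA, "Place4", "Place2", NA),
  stringsAsFactors = FALSE
)

dropoff <- data.frame(
  ID = ids,
  x1 = c("Place2", "Place2", "Place3", "Place7"),
  x2 = c("Place1", "Place1", "Place5", "Place6"),
  x3 = c(NA, "Place4", "Place6", NA),
  x4 = c(NA, "Place3", "Place7", NA),
  stringsAsFactors = FALSE
)

dates <- data.frame(
  ID = ids,
  x1 = c("4/1/2020", "4/1/2020", "4/3/2020", "4/5/2020"),
  x2 = c("4/1/2020", "4/1/2020", "4/3/2020", "4/5/2020"),
  x3 = c(NA, "4/3/2020", "4/4/2020", NA),
  x4 = c(NA, "4/3/2020", "4/5/2020", NA),
  stringsAsFactors = FALSE
)

times <- data.frame(
  ID = ids,
  x1 = c("8:00:00 AM", "9:00:05 AM", "3:00:00 PM", "7:00:00 AM"),
  x2 = c("4:00:00 PM", "12:35:05 PM", "6:00:00 PM", "3:00:00 PM"),
  x3 = c(NA, "11:00:00 AM", "7:00:00 AM", NA),
  x4 = c(NA, "4:00:00 AM", "3:00:00 PM", NA),
  stringsAsFactors = FALSE
)
```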

I want to group these trips by date and then find the trip chain for each group. The output should look like this:

ID    Date      Tripchain                      Time                 
1001  4/1/2020  place 1- place 2- place 1      Morning- Afternoon
1002  4/1/2020  place 1- place 2- place 1      Morning -Afternoon
1002  4/3/2020  place 3- place 4- place 3      Morning -Afternoon
1003  4/3/2020  place 5- place 3- place 5      Afternoon-Afternoon
1003  4/4/2020  place 2- place 7               Morning
1003  4/5/2020  place 2- place 7               Afternoon
1004  4/5/2020  place 6- place 7- place 6      Morning-Afternoon

Tags: r, grouping

1 Answer

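First, I think collapsing your places into a single string such as place1-place2-... is going to scale horribly. If it is only for reporting (and never the format in which you actually *use* the data), then fine, but otherwise I suggest you keep them separate.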
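
To illustrate the concern, here is a minimal sketch using a made-up `chains` vector (hypothetical values, not taken from the real data). Once the places are pasted together, any question about an individual stop has to begin by splitting the string apart again.

```r
# Hypothetical collapsed chains, for illustration only.
chains <- c("Place1 - Place2 - Place1", "Place5 - Place3")

# Every per-stop question first needs the string split back into its parts.
stops <- strsplit(chains, " - ", fixed = TRUE)

# Number of stops in each chain.
n_stops <- lengths(stops)

# Does each chain visit Place3?
visits_place3 <- vapply(stops, function(s) "Place3" %in% s, logical(1))
```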

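The first thing to do here is convert from the individual frames into one combined long format. I'll go a step further and convert the dates and times into R-native POSIXt objects.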

library(dplyr)
library(tidyr)

# pivot each wide frame to long form, keeping the column name in "x"
dat <- list(
  pivot_longer(pickup, -ID, names_to = "x", values_to = "pickup"),
  pivot_longer(dropoff, -ID, names_to = "x", values_to = "dropoff"),
  pivot_longer(dates, -ID, names_to = "x", values_to = "date"),
  pivot_longer(times, -ID, names_to = "x", values_to = "time")
) %>%
  # join the four long frames on ID and the original column name
  Reduce(function(a, b) full_join(a, b, by = c("ID", "x")), .) %>%
  # drop the all-NA padding cells
  filter(complete.cases(.)) %>%
  mutate(
    timestamp = as.POSIXct(paste(date, time), format = "%m/%d/%Y %I:%M:%S %p"),
    date = as.Date(timestamp)
  ) %>%
  select(-x, -time)
dat
# # A tibble: 12 x 5
#       ID pickup dropoff date       timestamp              
#    <int> <chr>  <chr>   <date>     <dttm>                 
#  1  1001 Place1 Place2  2020-04-01 2020-04-01 08:00:00.000
#  2  1001 Place2 Place1  2020-04-01 2020-04-01 16:00:00.000
#  3  1002 Place1 Place2  2020-04-01 2020-04-01 09:00:05.000
#  4  1002 Place2 Place1  2020-04-01 2020-04-01 12:35:05.000
#  5  1002 Place3 Place4  2020-04-03 2020-04-03 11:00:00.000
#  6  1002 Place4 Place3  2020-04-03 2020-04-03 04:00:00.000
#  7  1003 Place5 Place3  2020-04-03 2020-04-03 15:00:00.000
#  8  1003 Place3 Place5  2020-04-04 2020-04-03 18:00:00.000
#  9  1003 Place2 Place6  2020-04-04 2020-04-04 07:00:00.000
# 10  1003 Place2 Place7  2020-04-05 2020-04-05 15:00:00.000
# 11  1004 Place6 Place7  2020-04-05 2020-04-05 07:00:00.000
# 12  1004 Place7 Place6  2020-04-05 2020-04-05 15:00:00.000

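From here, to get where you want to go, I think we can sort them (by timestamp), group them (by ID and date), and then combine them. Oh, and convert the times into a human-readable "time of day".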

time2human <- function(x) {
  as.character(
    cut(as.numeric(format(x, format = "%H")), c(-1, 4, 12, 17, 20, 25),
        labels = c("Night", "Morning", "Afternoon", "Evening", "Night"))
  )
}

dat %>%
  arrange(timestamp) %>%
  group_by(ID, date) %>%
  summarize(
    Tripchain = paste(c(pickup, last(dropoff)), collapse = " - "),
    Timeframe = paste(time2human(min(timestamp)), time2human(max(timestamp)),
                      sep = " - "),
    Time1 = first(timestamp), Time2 = last(timestamp)
  ) %>%
  ungroup()
# # A tibble: 7 x 6
#      ID date       Tripchain                Timeframe             Time1                   Time2                  
#   <int> <date>     <chr>                    <chr>                 <dttm>                  <dttm>                 
# 1  1001 2020-04-01 Place1 - Place2 - Place1 Morning - Afternoon   2020-04-01 08:00:00.000 2020-04-01 16:00:00.000
# 2  1002 2020-04-01 Place1 - Place2 - Place1 Morning - Morning     2020-04-01 09:00:05.000 2020-04-01 12:35:05.000
# 3  1002 2020-04-03 Place4 - Place3 - Place4 Night - Morning       2020-04-03 04:00:00.000 2020-04-03 11:00:00.000
# 4  1003 2020-04-03 Place5 - Place3          Afternoon - Afternoon 2020-04-03 15:00:00.000 2020-04-03 15:00:00.000
# 5  1003 2020-04-04 Place3 - Place2 - Place6 Evening - Morning     2020-04-03 18:00:00.000 2020-04-04 07:00:00.000
# 6  1003 2020-04-05 Place2 - Place7          Afternoon - Afternoon 2020-04-05 15:00:00.000 2020-04-05 15:00:00.000
# 7  1004 2020-04-05 Place6 - Place7 - Place6 Morning - Afternoon   2020-04-05 07:00:00.000 2020-04-05 15:00:00.000

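I kept Time1 and Time2 in case you still need the full timestamps. You will want to update the time2human function so that the cut points are more to your liking. (My "hours" of -1 and +25 are there to make sure I fully bound every possible hour; I could be more precise and careful, but this is safe enough.)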
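
To make the cut points easier to experiment with, here is a small sketch. `hour2label` is a hypothetical helper (not part of the answer) that takes plain hours instead of timestamps, and the two-way split at noon is only an assumed alternative that is closer to the Morning/Afternoon labels in the question.

```r
# Hypothetical helper: same idea as time2human(), but on plain hour numbers.
hour2label <- function(h, breaks, labels) {
  as.character(cut(h, breaks = breaks, labels = labels))
}

hours <- c(0, 4, 5, 12, 13, 17, 18, 20, 21, 23)

# The answer's buckets. cut() intervals are right-closed by default,
# so hour 4 is still "Night" and hour 12 is still "Morning".
five_way <- hour2label(hours, c(-1, 4, 12, 17, 20, 25),
                       c("Night", "Morning", "Afternoon", "Evening", "Night"))

# Assumed alternative: hours 0-11 are "Morning", 12-23 are "Afternoon".
two_way <- hour2label(hours, c(-1, 11, 25), c("Morning", "Afternoon"))
```

Repeating a label ("Night" at both ends) relies on `factor()` accepting duplicated labels, which needs R 3.4.0 or later.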

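Finally, if you prefer to display the dates and times in the format you started with, use format(..., format=...) to get what you want. I prefer (and recommend) keeping them as Date and POSIXt objects, because they are inherently numeric and therefore easy to compare and sort. Once they are converted to strings, such operations become much less clear (for example, trying to sort 4/1/2019 before 3/1/2020).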
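
A quick sketch of that last point, using the two dates from the example: sorted as strings they come out in the wrong order, while sorted as Date objects they are chronological and can still be formatted for display afterwards.

```r
x <- c("4/1/2019", "3/1/2020")

# Character sort compares text, so "3/1/2020" lands before "4/1/2019".
as_text <- sort(x)

# Date sort is chronological: 2019-04-01 comes before 2020-03-01.
as_dates <- sort(as.Date(x, format = "%m/%d/%Y"))

# Convert back to a display string only at the very end.
# Note that %m and %d are zero-padded: "04/01/2019" "03/01/2020".
shown <- format(as_dates, format = "%m/%d/%Y")
```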
