Matching values on multiple conditions using two data frames

Question (votes: 0, answers: 2)

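I am still new to R and need some help. I have two data frames with fairly similar information. The first contains information about an airline's misconnections, and the other is the same airline's full timetable. I now need to add a new column to the misconnection data.frame containing the flight from the timetable that could replace the delayed flight in the connection.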

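The replacement flight has to meet a set of conditions: it must fall within a certain time window, operate on the same weekday, and fly to the same destination. In addition, I want R to pick the flight that is closest in time to the new arrival time in transit (taken from the misconnection data.frame).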

The misconnection data.frame looks like this (1620 rows in total):

miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
                     Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),  
                     arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),    
                     next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
                     weekday = c("4", "7", "7", "3"))

view(miscon)

        flight.date    Outbound.airport    arr.time    next.pos.dep    weekday
1       2019-08-05     MXP                 19:25:00    19:36:00        4
2       2019-10-03     KRK                 20:52:00    21:17:00        7
3       2019-07-21     KLU                 07:33:00    07:58:00        7
4       2019-05-29     OTP                 18:49:00    19:14:00        3

The timetable data.frame looks like this:

tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
                 end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
                 weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
                 Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),  
                 dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))    

view(tt)

    start.date    end.date     weekday     Outbound.airport    dep.time
1   2019-03-25    2019-10-21   1234567     KLU                 12:20:00
2   2019-05-02    2019-10-27   1.3..67     KLU                 15:55:00
3   2019-07-30    2019-08-26   1.34567     MXP                 19:55:00
4   2019-05-29    2019-06-01   ..3.5..     OTP                 20:34:00

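In Excel I have already solved this with INDEX/MATCH. However, the problem is a bit too large for Excel, which is why I need to move it to R. I tried the match and mutate functions in R, but it seems the values I want to match must be exactly equal, and I do not expect mine to be.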

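I also found an interesting solution to a similar problem that uses the DescTools package, but I have not managed to implement it successfully.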

get_close2 <- function(xx=tt, yy=miscon) {
  pos <- vector(mode = "numeric")
  for(i in 1:dim(yy)[1]) {
    pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
    #print(pos[i])
    yy$new.flight[i] <- pos[i]
  }
  out <- yy
  return(out)
}

get_close2()

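With this I only tried a single condition. It produced a column, but one containing only NA. Clearly I am still far off, which is why I am asking for help. I hope the question is clear. The final result should ideally look like this: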

miscon
        flight.date    Outbound.airport    arr.time    next.pos.dep    weekday   new.flight.time
1       2019-08-05     MXP                 19:25:00    19:36:00        4         19:55:00
2       2019-10-03     KRK                 20:52:00    21:17:00        7         NA
3       2019-07-21     KLU                 07:33:00    07:58:00        7         12:20:00
4       2019-05-29     OTP                 18:49:00    19:14:00        3         20:34:00
r dataframe merge match closest
2 Answers

0 votes

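OK, this isn't pretty, but you have a fairly complex problem, and it isn't entirely clear to me what you are looking for, so you will need to check this on a larger dataset than the small example provided. To start: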

# setup
library(data.table)
setDT(tt)
setDT(miscon)

# make tt long format splitting weekdays out
tt <- melt(tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"], measure.vars = paste("V", 1:7, sep = ""))[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]

# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
  , new.dep.time := as.numeric(dep.time - arr.time)][
  , c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
  , new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
  new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]

0 votes

I think you can proceed as follows. First, I would reshape the weekday column so that there is one row for each weekday on which a flight operates:

library(data.table)
library(dplyr)
library(tidyr)

tt <- tt %>% separate(weekday, into = as.character(1:7), sep = 1:6) %>% 
  gather(key="key", value="weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
  filter(weekday %in% 1:7) %>%
  select(-key)
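
To make the reshaping step concrete, here is a minimal base R sketch of the same idea without tidyr. The helper name expand_weekdays and the two-row tt_small are made up for illustration and are not part of the answer's code.

```r
# Hypothetical helper: turn a pattern such as "1.3..67" into the
# weekdays on which the flight operates, dropping the "." placeholders.
expand_weekdays <- function(pattern) {
  chars <- strsplit(pattern, "")[[1]]
  chars[chars != "."]
}

# Small illustrative timetable (subset of the columns in the question).
tt_small <- data.frame(
  Outbound.airport = c("KLU", "OTP"),
  weekday = c("1.3..67", "..3.5.."),
  stringsAsFactors = FALSE
)

# One row per (flight, operating weekday).
days <- lapply(tt_small$weekday, expand_weekdays)
tt_long <- data.frame(
  Outbound.airport = rep(tt_small$Outbound.airport, lengths(days)),
  weekday = unlist(days),
  stringsAsFactors = FALSE
)
print(tt_long)
```

Each timetable row is repeated once per operating day, which is what makes the later join on airport and weekday an ordinary equality join.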

Then I would left join miscon and tt on airport and weekday:

tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]

You now have a data.frame of all possible connections. The only thing left is to find, for each connection, the shortest time between the two flights.

df[,timediff:= dep.time-arr.time, by=.(weekday, Outbound.airport)]

Now you can filter the rows by the smallest time delay (timediff):

df = df[ , .SD[which.min(timediff)],  by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")

> df
   weekday Outbound.airport flight.date            arr.time        next.pos.dep start.date   end.date     new.flight.time   timediff
1:       7              KLU  2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2:       4              MXP  2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00  1800 secs
3:       3              OTP  2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00  6300 secs

The solution is a mix of dplyr and data.table.
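
The join on airport and weekday does not check the timetable's validity dates, nor that the replacement leaves after the earliest possible departure, both of which the question asks for. Below is a minimal base R sketch that adds those two filters. It rests on some assumptions: the weekday column in miscon is taken at face value, "closest" is read as the earliest departure at or after next.pos.dep, and the unspecified "time window" is not implemented. The names to_time, op.days, id, cand and best are introduced here for illustration only.

```r
# Anchor times to a dummy date so comparisons do not depend on the run date.
to_time <- function(x) as.POSIXct(paste("1970-01-01", x), tz = "UTC")

# Example data copied from the question (arr.time omitted for brevity).
miscon <- data.frame(
  flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29")),
  Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
  next.pos.dep = to_time(c("19:36:00", "21:17:00", "07:58:00", "19:14:00")),
  weekday = c("4", "7", "7", "3"),
  stringsAsFactors = FALSE
)
tt <- data.frame(
  start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29")),
  end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01")),
  op.days = c("1234567", "1.3..67", "1.34567", "..3.5.."),  # "weekday" in the question
  Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
  dep.time = to_time(c("12:20:00", "15:55:00", "19:55:00", "20:34:00")),
  stringsAsFactors = FALSE
)

miscon$id <- seq_len(nrow(miscon))

# All misconnection x timetable pairs that share an airport.
cand <- merge(miscon, tt, by = "Outbound.airport")

# Keep pairs where the flight date is inside the validity window, the
# flight operates on that weekday, and it leaves at or after next.pos.dep.
pos <- as.integer(cand$weekday)
keep <- cand$flight.date >= cand$start.date &
  cand$flight.date <= cand$end.date &
  substr(cand$op.days, pos, pos) == cand$weekday &
  cand$dep.time >= cand$next.pos.dep
cand <- cand[keep, ]

# Earliest remaining departure per misconnection; NA when none qualifies.
cand <- cand[order(cand$id, cand$dep.time), ]
best <- cand[!duplicated(cand$id), c("id", "dep.time")]
miscon$new.flight.time <- best$dep.time[match(miscon$id, best$id)]

print(format(miscon$new.flight.time, "%H:%M:%S", tz = "UTC"))
```

On the four example rows this should reproduce the new.flight.time column requested in the question, with NA for KRK because the timetable has no KRK flight. On the full 1620-row data the intermediate cand table can get large, so a data.table non-equi join may be worth considering instead.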
