If/else if: selecting the first matching record within a set distance only after the first condition is not met, in R

Question

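I would like to select the closest owner within a set distance, but only when the first search condition is not met. The locations are called reflo (reference location), and each has corresponding x and y coordinates (called locx and locy, respectively).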

Conditions:

If lifetime_census$reflo == owners$reflo.x[i], the condition is met.
If lifetime_census$reflo != owners$reflo.x[i], find the next closest record (within 30 meters).
If there is no record within 30 meters, assign NA.

All previous owners (> 20,000) are stored in a data set called lifetime_census. Here is a sample of the data:

id         previous_id  reflo  locx    locy   lifespan  
16161      5587         -310    -3     10     1810    
16848      5101         Q1      17.3   0.8    55    
21815      6077         M2      13     1.8    979
23938      6130         -49     -4     9      374
29615      7307         B.1     2.5    1      1130

I then have an owners data set (here is a sample):

squirrel_id      spr_census reflo.x    spring_locx      spring_locy 
6391              2005       M3           13             2.5  
6130              2005       -310         -3             10    
23586             2019       B9           2              9

To illustrate what I am trying to achieve:

squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner   
6391        2004       M3       13         2.5         6077            
6130        2005       -310     -3         10          5587   
23586       2019       B9       2          9           NA

What I am currently trying is:

n <- length(owners$squirrel_id)
distance <- 30 #This can be easily changed to bigger or smaller values

for(i in 1:n) {
  last_owner <- subset(lifetime_census,
    lifetime_census$reflo==owners$reflo.x[i] & #using the exact location
((30*owners$spring_locx[i]-30* lifetime_census$locx)^2+(30* owners$spring_locy[i]-30* lifetime_census$locy)^2<=(distance)^2)) #this sets the search limit

owners[i,"previous_owner"] <- last_owner$previous_id[i]

}

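I cannot figure out how to make the loop work through the conditions in order, so that it selects a record within the search limit only if no exact match was found.

Any ideas?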

Edit (in response to the posted answers):

Based on the suggestions below, one workable solution seems to be:

distance_xy = function (x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 -y1)^2)
}

previous_id_fn <- function(v, w){
  dists <- map2_dbl(lifetime_census$locx, lifetime_census$locy, ~distance_xy(.x, .y, v, w)) 
  id <- lifetime_census$previous_id[[which.min(dists)]]
  if (min(dists, na.rm = TRUE) > 30) { id <- NA }
  return(id)
}

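For the code above:

Because this runs before all of my conditional loops, I cannot add any other modifiers to make sure the previous owner falls within a specific time frame. Is there a way to limit how far back in time this previous_id looks (i.e., rather than finding just any owner, select the owner that is both nearest in distance and most recent in time)? As it stands, it even finds "future" owners, which is not good.
Is there a way to bring along all the other columns associated with previous_id?

Adding this to my loop: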

for(i in 1:n) {
    print(i)
  last_spring_owner <- subset(lifetime_census,
    lifetime_census$previous_id != owners$squirrel_id[i] &
    lifetime_census$reflo== owners$reflo.x[i] &
    lifetime_census$census_year <= owners$spr_census[i])

    owners[i,"spr_owner"] <- last_spring_owner$previous_id[i]
    owners[i, "prev_spr_census"] <- last_spring_owner$census_year[i]
}

And here, is there a way to make sure the mutate function pulls out not only previous_id but also all the other associated values (such as prev_spr_census in the loop above)?

Answer 1

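Since you have two sets of conditions, I would suggest splitting the task into two parts as well. Also, when combining two data frames, I always recommend finding a suitable join.

For the exact match, dplyr::inner_join will give you the right rows.

For the next part, you can exclude the exact matches and use distance_left_join from the fuzzyjoin package to match the remaining rows. It also comes with an option for a maximum distance.

Then you can simply bind the two results.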

library(data.table)
lifetime_census <- fread('id         previous_id  reflo  locx    locy   lifespan  
16161      5587         -310    -3     10     1810    
16848      5101         Q1      17.3   0.8    55    
21815      6077         M2      13     1.8    979
23938      6130         -49     -4     9      374
29615      7307         B.1     3      1      1130')
lifetime_census
#>       id previous_id reflo locx locy lifespan
#> 1: 16161        5587  -310 -3.0 10.0     1810
#> 2: 16848        5101    Q1 17.3  0.8       55
#> 3: 21815        6077    M2 13.0  1.8      979
#> 4: 23938        6130   -49 -4.0  9.0      374
#> 5: 29615        7307   B.1  3.0  1.0     1130
owners <- fread('squirrel_id      spr_census reflo.x    spring_locx      spring_locy 
6391              2005       M3           13             2.5  
6130              2005       -310         -3             10    
23586             2019       B9           2              9')
owners
#>    squirrel_id spr_census reflo.x spring_locx spring_locy
#> 1:        6391       2005      M3          13         2.5
#> 2:        6130       2005    -310          -3        10.0
#> 3:       23586       2019      B9           2         9.0

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(fuzzyjoin)

# Search for exact match
df1 <- inner_join(owners,lifetime_census ,by=c(reflo.x='reflo')) %>% 
  select(squirrel_id:spring_locy,previous_id)
df1
#>   squirrel_id spr_census reflo.x spring_locx spring_locy previous_id
#> 1        6130       2005    -310          -3          10        5587


df2 <- 
  owners %>% 
  anti_join(df1,by=c('squirrel_id')) %>% # Remove rows with exact matches
    distance_left_join(lifetime_census,
                       by=c(spring_locx='locx',spring_locy='locy'), # Match columns
                       max_dist=1, # Since you want a maximum distance of 30m = 1 unit
                       distance_col='dist') %>% # Optional, if you want to see the distance
    select(squirrel_id:spring_locy,previous_id,dist)

bind_rows(df1,df2)  
#>   squirrel_id spr_census reflo.x spring_locx spring_locy previous_id dist
#> 1        6130       2005    -310          -3        10.0        5587   NA
#> 2        6391       2005      M3          13         2.5        6077  0.7
#> 3       23586       2019      B9           2         9.0          NA   NA

Created on 2020-03-02 by the reprex package (v0.3.0)

Answer 2

The following solves the problem.

A function to calculate the distance:

distance_xy = function (x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 -y1)^2)
}
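As a quick check of this helper, the sketch below applies it to the sample coordinates from the question. Because the arithmetic is vectorised, passing whole columns returns one distance per record. The names census_x and census_y are illustrative only, and the result is in the same units as the coordinates, which may or may not be meters.

```r
distance_xy <- function(x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 - y1)^2)
}

# Sample coordinates copied from the lifetime_census example in the question
census_x <- c(-3, 17.3, 13, -4, 2.5)
census_y <- c(10, 0.8, 1.8, 9, 1)

# Distance from the owner at (13, 2.5) to every census record at once
dists <- distance_xy(13, 2.5, census_x, census_y)

# Index of the closest record (here the third one, at distance 0.7)
nearest <- which.min(dists)
```

With vectorised input like this, map2_dbl is not strictly needed to compute the distances.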

Determine the previous ID within 30 m. If all distances are greater than 30 m, the id is set to NA.

library(tidyverse)

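previous_id_fn <- function(v, w, years){
   # Distance from the point (v, w) to every record in lifetime_census
   dists <- map2_dbl(lifetime_census$locx, lifetime_census$locy, ~distance_xy(.x, .y, v, w)) 
   # Keep only records whose lifespan is below the given threshold
   df <- data.frame(previous = lifetime_census$previous_id, 
                    dist = dists, 
                    life = lifetime_census$lifespan) %>% 
               filter(life < years)
   # No candidate left after filtering, or the nearest one is too far away
   if (nrow(df) == 0 || min(df$dist, na.rm = TRUE) > 30) { return(NA) }
   df$previous[[which.min(df$dist)]]
}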

First join the data.frame owners with the data.frame lifetime_census to get a column with previous_id. Then apply the function defined above to each row of the data.frame.

owners %>%
  left_join(., lifetime_census, by = c("reflo.x" = "reflo")) %>%
  select(squirrel_id:spring_locy, previous_id) %>%
  rowwise() %>%
  mutate(previous_id = ifelse(is.na(previous_id), 
                            previous_id_fn(spring_locx, spring_locy, 1000), 
  previous_id))

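Edit:

I added a years argument to the function previous_id_fn(). Records whose lifespan is not below years are now filtered out, so the function returns NA when no remaining record lies within 30 m.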
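For the follow-up about excluding "future" owners and keeping the other columns, a minimal base R sketch could filter on the census year before searching and return the whole matching row. It assumes lifetime_census has a census_year column, as in the question's second loop. The toy data, the census_year values, and the function name nearest_past_record are invented for illustration.

```r
distance_xy <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)

# Toy census; coordinates and census_year values are made up for this sketch
census <- data.frame(
  previous_id = c(5587, 6077, 7307),
  locx        = c(-3, 13, 13.2),
  locy        = c(10, 1.8, 2.4),
  census_year = c(2001, 2003, 2010)
)

# Return the full row of the nearest record that is not later than `year`
# and lies within `max_dist`; a single all-NA row if nothing qualifies
nearest_past_record <- function(x, y, year, census, max_dist = 30) {
  cand <- census[census$census_year <= year, , drop = FALSE]
  cand$dist <- distance_xy(x, y, cand$locx, cand$locy)
  cand <- cand[!is.na(cand$dist) & cand$dist <= max_dist, , drop = FALSE]
  if (nrow(cand) == 0) {
    return(cand[NA_integer_, , drop = FALSE])
  }
  # Closest first; ties are broken by the most recent census year
  cand[order(cand$dist, -cand$census_year)[1], , drop = FALSE]
}

# The 2010 record is the closest one but is skipped because it lies in the future
hit  <- nearest_past_record(13, 2.5, year = 2005, census = census)
none <- nearest_past_record(13, 2.5, year = 1990, census = census)
```

Because the function returns a row rather than a single id, values such as census_year travel with previous_id and can be copied into owners in the same step.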

Answer 3

I would suggest this (assuming locx is in the same units as distance):

distance = 30

distance_xy = function (x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 -y1)^2)
}

for (i in seq_len(nrow(owners))) {
  if (owners$reflo.x[i] %in% lifetime_census$reflo) {
    # Exact match on the reference location
    owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
  } else {
    # No exact match: fall back to the nearest record within `distance`
    dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)
    if (any(dt <= distance)) {
      owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
    } else {
      owners$previous_owner[i] = NA
    }
  }
}

This gives:

   squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner
1        6391       2005      M3          13         2.5           6077
2        6130       2005    -310          -3        10.0           5587
3       23586       2019      B9           2         9.0           5587

Note that this will fail if reflo has multiple matches.
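If taking the first matching record is acceptable, one possible way around the multiple-match limitation is match(), which returns a single position per lookup value. This is only a sketch: the toy data are invented, and the second "-310" record with id 6001 is hypothetical.

```r
# Toy census in which reflo "-310" occurs twice (id 6001 is hypothetical)
census <- data.frame(
  previous_id = c(5587, 6001, 6077),
  reflo       = c("-310", "-310", "M2"),
  stringsAsFactors = FALSE
)
owners <- data.frame(
  squirrel_id = c(6130, 6391),
  reflo.x     = c("-310", "M3"),
  stringsAsFactors = FALSE
)

# match() gives the position of the first matching reflo (NA if there is none),
# so every owner receives exactly one previous_id
idx <- match(owners$reflo.x, census$reflo)
owners$previous_owner <- census$previous_id[idx]
```

Which record counts as "first" depends on the row order of the census, so sorting it beforehand (for example by year) decides which match is kept. Owners left with NA can then go through the distance search.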
