Extracting the street name from an address in R


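I'm trying to extract the street name from street addresses. I want to strip the street/house number from the start of the string, and strip the last street suffix (RD, ST, DR, HWY, etc.) along with everything after it.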

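Two caveats:

  1. If there are two street suffixes, I want to keep the first and remove the second (e.g., South Pkwy Dr becomes South Pkwy, not just South).

  2. If the street address ends with a number, I want to keep the suffix. For example, if the address is 123 County Rd 53, I want it to return County Rd 53. If the street address is State Rte 22, I want it to return State Rte 22.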

Below is sample data (input) and the desired result (output):

library(tibble)

d <- tibble(input=c(
  '505 BLACKBERRY',
  '135 BEARDSLEY ST',
  '15 HUNT CLUB DR',
  '1223 STATE ROUTE 103',
  '455 STATE RTE 43',
  '206 COUNTY RD 4710',
  '17 E 250TH ST',
  '158 BALLINGER AVE SE',
  '150 BALLINGER AVE S',
  '18 BALLINGER AVE T' ,
  '1272 ORANGE SUN TRL',
  '291 S MORELAND BLVD',
  '615 RUSSET WOOD LN',
  '1165 MORROCCO CT',
  '1321 S PKWY DR',
  '250 COUNTY RD 25A S',
  '22 SANSTONE RIDGE WAY',
  '55070 MENDOZA TRL',
  '1609 HUNTSMERE AVE DOWN',
  '243 MISTY WOODS CV S',
  '2292 BAYBERRY CMNS',
  '16 KILDEER CRK',
  '40 BEDFORD XING',
  '4 LEXINGTON SQ',
  '113 SPARROWS CRST',
  '1082 MATHOM LNDG',
  '1050 WILLOW RIDGE LOOP',
  '660 REDTOP LOOP',
  '8 MOUNT ROYAL LOOP',
  '805 SIERRA OVAL',
  '3012 NANTUCKET ROW',
  '6 WOODROW AVE',
  '943 DARROW PARK DR',
  '743 BELVEDERE TER',
  '189 WINCHESTER RD',
  '19 WHITE OAK TRCE',
  '890 BLACKJACK RD EXT',
  '767 N EXCALIBUR DR',
  '109 VININGS FOREST LN SE',
  '508 E 141ST ST',
  '85 ROSE LN ST SW'),
  output= c(
  'BLACKBERRY',
  'BEARDSLEY',
  'HUNT CLUB',
  'STATE ROUTE 103',
  'STATE RTE 43',
  'COUNTY RD 4710',
  'E 250TH',
  'BALLINGER',
  'BALLINGER',
  'BALLINGER' ,
  'ORANGE SUN',
  'S MORELAND',
  'RUSSET WOOD',
  'MORROCCO',
  'S PKWY',
  'COUNTY',
  'SANSTONE RIDGE WAY',
  'MENDOZA',
  'HUNTSMERE',
  'MISTY WOODS',
  'BAYBERRY',
  'KILDEER',
  'BEDFORD',
  'LEXINGTON',
  'SPARROWS',
  'MATHOM',
  'WILLOW RIDGE',
  'REDTOP',
  'MOUNT ROYAL',
  'SIERRA',
  'NANTUCKET',
  'WOODROW',
  'DARROW PARK',
  'BELVEDERE',
  'WINCHESTER',
  'WHITE OAK',
  'BLACKJACK',
  'N EXCALIBUR',
  'VININGS FOREST',
  'E 141ST',
  'ROSE LN'))

Here is what I have tried:

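I foolishly built the pattern on Regex 101 without realizing that what works there does not carry over to R. Before I realized that, I basically had it working here: https://regex101.com/r/UdK6pB/1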

Then I tried several ways to get it working in R, including this lovely piece of code:

d$input <- str_extract(d$input, "(?:(?<=[0-9]{1,5}\\b)).*\\b[[:digit:]]+$|(^[0-9]{1,5}\\b)(.*)(?:(?=\\bAVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE$))")
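One likely reason a pattern like this misbehaves is operator precedence: in an ungrouped alternation, `\\b` attaches only to the first alternative and `$` only to the last, so the alternatives in between can match anywhere in the string. The sketch below uses base R and a shortened suffix list purely for illustration; the street name "ORCHARD HILL" is made up and is not one of the inputs above.

```r
# "ORCHARD HILL" is a hypothetical street name, not from the question's data.
ungrouped <- "\\bAVE|RD|RTE$"        # reads as: (\bAVE) or (RD) or (RTE$)
grouped   <- "\\b(?:AVE|RD|RTE)$"    # \b and $ now apply to every alternative

# The bare "RD" alternative matches inside the word ORCHARD.
hit_ungrouped <- grepl(ungrouped, "ORCHARD HILL", perl = TRUE)

# With the group, a suffix must be a whole word at the end of the string.
hit_grouped   <- grepl(grouped, "ORCHARD HILL", perl = TRUE)
hit_real      <- grepl(grouped, "WINCHESTER RD", perl = TRUE)

print(c(hit_ungrouped, hit_grouped, hit_real))
```

Grouping alone does not produce the desired output for every address; it only makes each suffix behave as a whole word anchored where intended.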

I also tried something like this:

d$output <- str_remove(d$input,"^[[:digit:]]+\\b")
d$output <- str_remove(d$output, "\\b['AVE'|'ST'|'DR'|'RD'|'LN'|'TRL'|'BLVD'|'CT'|'PKWY'|'JCT'|'SQ'|'HWY'|'WAY'|'CV'|'CMNS'|'CRK'|'XING'|'CRST'|'LNDG'|'LOOP'|'OVAL'|'ROW'|'TER'|'TRCE'|'RTE']$")

And then this:

d$output <- sub("^[[:digit:]]+[[:space:]]", '', d$input, perl = TRUE)
d$output <- sub("[[:space:]]+[AVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE]$", '', d$output, perl=TRUE)
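The two attempts above put the suffix list inside square brackets, which defines a character class (a set of single characters) rather than a list of alternative words. A minimal base R sketch of the difference, using a shortened suffix list and two of the inputs with the house number already removed:

```r
# Two inputs from the question, house number already stripped.
x <- c("BEARDSLEY ST", "HUNT CLUB DR")

# Character class: matches whitespace followed by exactly ONE character from
# the set {A, V, E, |, S, T, D, R} at the end of the string, so a
# two-letter suffix such as " ST" is never matched.
broken <- sub("[[:space:]]+[AVE|ST|DR]$", "", x)

# Non-capturing group with alternation: matches a whole suffix.
fixed <- sub("[[:space:]]+(?:AVE|ST|DR)$", "", x, perl = TRUE)

print(broken)
print(fixed)
```

This only illustrates the syntax; it does not handle trailing directions (SE, SW), double suffixes, or addresses that end in a number.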

I'm at my wits' end and hope someone is willing and able to help. Thank you.

r regex stringr
1 Answer

This is not a concise, elegant solution, but it is still a solution (85 ROSE LN ST SW was a bit tricky). Just some basic regex for you. Take a look.

Suffix pattern helper:

library(tidyverse)

# Suffix pattern
suffix <- c(
   "AVE", "BLVD", "CMNS", "CRK", "CRST", 
    "CT",   "CV",   "DR",  "LN", "LNDG", 
  "LOOP", "OVAL",   "RD", "ROW",   "SQ", 
    "ST",  "TER", "TRCE", "TRL", "XING")

suffix <- paste0("\\b", suffix, "\\b")
suffix <- str_flatten(suffix, "|")
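To see what the helper produces, here is a small sketch that builds the same kind of pattern from a shortened list and tests it against individual words. It uses base R equivalents (`paste` for `str_flatten`, `grepl` for `str_detect`) so it runs without extra packages; the names `suffix_demo`, `pattern_demo`, and `tokens` are illustrative only.

```r
# Shortened suffix list, for illustration only.
suffix_demo <- c("AVE", "ST", "DR")

# Wrap each suffix in word boundaries, then join with "|".
pattern_demo <- paste(paste0("\\b", suffix_demo, "\\b"), collapse = "|")

# The word boundaries stop "ST" from matching inside STATE.
tokens <- c("BEARDSLEY", "ST", "250TH", "STATE")
is_suffix <- grepl(pattern_demo, tokens, perl = TRUE)

print(pattern_demo)
print(is_suffix)
```

Because each alternative carries its own `\\b` on both sides, the pattern can be used directly on single words without any extra grouping.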

Code:

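new_d <- d %>% 
  rowid_to_column("id") %>% 
  mutate(
    .by = id, 
    # number = str_extract(input, "^\\d+"), # if you want it
    # drop the leading house number
    my_output = str_remove(input, "^\\d+\\s+")) %>% 
  
  # one row per word
  separate_rows(my_output, sep = "\\s") %>% 
  
  mutate(
    .by = id, 
    # running count of suffixes seen so far; never incremented
    # when the address ends in a number
    index = cumsum(if_else(
      str_detect(my_output, suffix) & !last(str_detect(my_output, "^\\d+$")),
      1, 0))) %>% 
  
  # keep the words that come before the last suffix
  filter(.by = id, index == 0 | index < max(index)) %>% 

  # glue the remaining words back together, one row per address
  summarise(
    .by = -c(my_output, index),
    my_output = str_flatten(my_output, " "))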
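The indexing step in the pipeline is easier to follow on a single address. The sketch below re-expresses the same idea in base R as a small function; `keep_before_last_suffix` and the shortened `pattern` are hypothetical names introduced here, not part of the answer's code.

```r
# Hypothetical helper mirroring the cumsum/filter logic for one address
# whose house number has already been removed.
keep_before_last_suffix <- function(street, pattern) {
  tokens <- strsplit(street, "\\s+")[[1]]

  # Which words are street suffixes?
  is_suffix <- grepl(pattern, tokens, perl = TRUE)

  # If the last word is purely numeric, no suffix is counted at all.
  ends_in_number <- grepl("^\\d+$", tokens[length(tokens)], perl = TRUE)

  # Running count of suffixes seen so far, e.g. ROSE LN ST SW -> 0 1 2 2.
  index <- cumsum(is_suffix & !ends_in_number)

  # Keep everything before the last suffix (index below the maximum),
  # or everything when no suffix was counted (all zeros).
  keep <- index == 0 | index < max(index)
  paste(tokens[keep], collapse = " ")
}

pattern <- "\\b(?:LN|ST|RD)\\b"   # shortened suffix pattern, for illustration

r1 <- keep_before_last_suffix("ROSE LN ST SW", pattern)
r2 <- keep_before_last_suffix("COUNTY RD 4710", pattern)
r3 <- keep_before_last_suffix("BEARDSLEY ST", pattern)
print(c(r1, r2, r3))
```

For "ROSE LN ST SW" the index runs 0 1 2 2, so only the words with index 0 or 1 survive, which keeps the first suffix and drops the second along with the trailing direction.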

Output:

> new_d %>% tudo()
# A tibble: 41 × 4
      id input                    output             my_output         
   <int> <chr>                    <chr>              <chr>             
 1     1 505 BLACKBERRY           BLACKBERRY         BLACKBERRY        
 2     2 135 BEARDSLEY ST         BEARDSLEY          BEARDSLEY         
 3     3 15 HUNT CLUB DR          HUNT CLUB          HUNT CLUB         
 4     4 1223 STATE ROUTE 103     STATE ROUTE 103    STATE ROUTE 103   
 5     5 455 STATE RTE 43         STATE RTE 43       STATE RTE 43      
 6     6 206 COUNTY RD 4710       COUNTY RD 4710     COUNTY RD 4710    
 7     7 17 E 250TH ST            E 250TH            E 250TH           
 8     8 158 BALLINGER AVE SE     BALLINGER          BALLINGER         
 9     9 150 BALLINGER AVE S      BALLINGER          BALLINGER         
10    10 18 BALLINGER AVE T       BALLINGER          BALLINGER         
11    11 1272 ORANGE SUN TRL      ORANGE SUN         ORANGE SUN        
12    12 291 S MORELAND BLVD      S MORELAND         S MORELAND        
13    13 615 RUSSET WOOD LN       RUSSET WOOD        RUSSET WOOD       
14    14 1165 MORROCCO CT         MORROCCO           MORROCCO          
15    15 1321 S PKWY DR           S PKWY             S PKWY            
16    16 250 COUNTY RD 25A S      COUNTY             COUNTY            
17    17 22 SANSTONE RIDGE WAY    SANSTONE RIDGE WAY SANSTONE RIDGE WAY
18    18 55070 MENDOZA TRL        MENDOZA            MENDOZA           
19    19 1609 HUNTSMERE AVE DOWN  HUNTSMERE          HUNTSMERE         
20    20 243 MISTY WOODS CV S     MISTY WOODS        MISTY WOODS       
21    21 2292 BAYBERRY CMNS       BAYBERRY           BAYBERRY          
22    22 16 KILDEER CRK           KILDEER            KILDEER           
23    23 40 BEDFORD XING          BEDFORD            BEDFORD           
24    24 4 LEXINGTON SQ           LEXINGTON          LEXINGTON         
25    25 113 SPARROWS CRST        SPARROWS           SPARROWS          
26    26 1082 MATHOM LNDG         MATHOM             MATHOM            
27    27 1050 WILLOW RIDGE LOOP   WILLOW RIDGE       WILLOW RIDGE      
28    28 660 REDTOP LOOP          REDTOP             REDTOP            
29    29 8 MOUNT ROYAL LOOP       MOUNT ROYAL        MOUNT ROYAL       
30    30 805 SIERRA OVAL          SIERRA             SIERRA            
31    31 3012 NANTUCKET ROW       NANTUCKET          NANTUCKET         
32    32 6 WOODROW AVE            WOODROW            WOODROW           
33    33 943 DARROW PARK DR       DARROW PARK        DARROW PARK       
34    34 743 BELVEDERE TER        BELVEDERE          BELVEDERE         
35    35 189 WINCHESTER RD        WINCHESTER         WINCHESTER        
36    36 19 WHITE OAK TRCE        WHITE OAK          WHITE OAK         
37    37 890 BLACKJACK RD EXT     BLACKJACK          BLACKJACK         
38    38 767 N EXCALIBUR DR       N EXCALIBUR        N EXCALIBUR       
39    39 109 VININGS FOREST LN SE VININGS FOREST     VININGS FOREST    
40    40 508 E 141ST ST           E 141ST            E 141ST           
41    41 85 ROSE LN ST SW         ROSE LN            ROSE LN 

That's it.

Created on 2024-05-08 with reprex v2.1.0
