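I am trying to extract the street name from a street address. I want to remove only the house number at the start of the string and everything from the last street suffix (RD, ST, DR, HWY, etc.) onward.
Two caveats:
If there are two street suffixes, I want to keep the first and drop the second (e.g., South Pkwy Dr becomes South Pkwy, not just South).
If the street address ends in a number, I want to keep the suffix. For example, if the address is 123 County Rd 53, I want it to return County Rd 53. If the street address is State Rte 22, I want it to return State Rte 22.
Below is sample data (input) and the desired output (output):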
d <- tibble(input=c(
'505 BLACKBERRY',
'135 BEARDSLEY ST',
'15 HUNT CLUB DR',
'1223 STATE ROUTE 103',
'455 STATE RTE 43',
'206 COUNTY RD 4710',
'17 E 250TH ST',
'158 BALLINGER AVE SE',
'150 BALLINGER AVE S',
'18 BALLINGER AVE T' ,
'1272 ORANGE SUN TRL',
'291 S MORELAND BLVD',
'615 RUSSET WOOD LN',
'1165 MORROCCO CT',
'1321 S PKWY DR',
'250 COUNTY RD 25A S',
'22 SANSTONE RIDGE WAY',
'55070 MENDOZA TRL',
'1609 HUNTSMERE AVE DOWN',
'243 MISTY WOODS CV S',
'2292 BAYBERRY CMNS',
'16 KILDEER CRK',
'40 BEDFORD XING',
'4 LEXINGTON SQ',
'113 SPARROWS CRST',
'1082 MATHOM LNDG',
'1050 WILLOW RIDGE LOOP',
'660 REDTOP LOOP',
'8 MOUNT ROYAL LOOP',
'805 SIERRA OVAL',
'3012 NANTUCKET ROW',
'6 WOODROW AVE',
'943 DARROW PARK DR',
'743 BELVEDERE TER',
'189 WINCHESTER RD',
'19 WHITE OAK TRCE',
'890 BLACKJACK RD EXT',
'767 N EXCALIBUR DR',
'109 VININGS FOREST LN SE',
'508 E 141ST ST',
'85 ROSE LN ST SW'),
output= c(
'BLACKBERRY',
'BEARDSLEY',
'HUNT CLUB',
'STATE ROUTE 103',
'STATE RTE 43',
'COUNTY RD 4710',
'E 250TH',
'BALLINGER',
'BALLINGER',
'BALLINGER' ,
'ORANGE SUN',
'S MORELAND',
'RUSSET WOOD',
'MORROCCO',
'S PKWY',
'COUNTY',
'SANSTONE RIDGE WAY',
'MENDOZA',
'HUNTSMERE',
'MISTY WOODS',
'BAYBERRY',
'KILDEER',
'BEDFORD',
'LEXINGTON',
'SPARROWS',
'MATHOM',
'WILLOW RIDGE',
'REDTOP',
'MOUNT ROYAL',
'SIERRA',
'NANTUCKET',
'WOODROW',
'DARROW PARK',
'BELVEDERE',
'WINCHESTER',
'WHITE OAK',
'BLACKJACK',
'N EXCALIBUR',
'VININGS FOREST',
'E 141ST',
'ROSE LN'))
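Here is what I have tried:
I foolishly used Regex 101 without realizing it is not compatible with R. Before I realized that, I basically had it working here: https://regex101.com/r/UdK6pB/1
Then I tried to get it working in R in several ways, including this lovely piece of code: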
d$input <- str_extract(d$input, "(?:(?<=[0-9]{1,5}\\b)).*\\b[[:digit:]]+$|(^[0-9]{1,5}\\b)(.*)(?:(?=\\bAVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE$))")
I also tried something like this:
d$output <- str_remove(d$input,"^[[:digit:]]+\\b")
d$output <- str_remove(d$output, "\\b['AVE'|'ST'|'DR'|'RD'|'LN'|'TRL'|'BLVD'|'CT'|'PKWY'|'JCT'|'SQ'|'HWY'|'WAY'|'CV'|'CMNS'|'CRK'|'XING'|'CRST'|'LNDG'|'LOOP'|'OVAL'|'ROW'|'TER'|'TRCE'|'RTE']$")
And then this:
d$output <- sub("^[[:digit:]]+[[:space:]]", '', d$input, perl = TRUE)
d$output <- sub("[[:space:]]+[AVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE]$", '', d$output, perl=TRUE)
I am at my wit's end and hope someone is willing and able to help. Thank you.
This is not a neat, elegant regex solution, but it is a solution nonetheless (85 ROSE LN ST SW is a bit tricky).
Suffix pattern helper:
library(tidyverse)
# Suffix pattern
suffix <- c(
"AVE", "BLVD", "CMNS", "CRK", "CRST",
"CT", "CV", "DR", "LN", "LNDG",
"LOOP", "OVAL", "RD", "ROW", "SQ",
"ST", "TER", "TRCE", "TRL", "XING")
suffix <- paste0("\\b", suffix, "\\b")
suffix <- str_flatten(suffix, "|")
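To illustrate why the helper wraps each suffix in `\\b` word boundaries, here is a minimal sketch. It uses base R (`paste`, `grepl`) instead of stringr so that it runs without any packages, and the names `suffix_demo` and `pattern_demo` are made up for this illustration only.

```r
# Shortened suffix list, for illustration only (not the full list above)
suffix_demo <- c("AVE", "DR", "ST")
pattern_demo <- paste(paste0("\\b", suffix_demo, "\\b"), collapse = "|")

words <- c("STONE", "250TH", "ST")

# With word boundaries, "ST" matches only as a whole word
with_boundaries <- grepl(pattern_demo, words, perl = TRUE)

# Without them, the "ST" inside "STONE" would count as a suffix too
without_boundaries <- grepl("AVE|DR|ST", words, perl = TRUE)

print(data.frame(words, with_boundaries, without_boundaries))
```

Since the pipeline splits each address into single words before testing them, the boundaries mainly guard against street names that merely contain a suffix as a substring.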
Code:
new_d <- d %>%
rowid_to_column("id") %>%
mutate(
.by = id,
# number = str_extract(input, "^\\d+"), # if you want it
my_output = str_remove(input, "^\\d+\\s+")) %>%
separate_rows(my_output, sep = "\\s") %>%
mutate(
.by = id,
index = cumsum(if_else(
str_detect(my_output, suffix) & !last(str_detect(my_output, "^\\d+$")),
1, 0))) %>%
filter(.by = id, index == 0 | index < max(index)) %>%
summarise(
.by = -c(my_output, index),
my_output = str_flatten(my_output, " "))
Output:
> new_d %>% print(n = Inf)
# A tibble: 41 × 4
      id input                    output             my_output
   <int> <chr>                    <chr>              <chr>
 1     1 505 BLACKBERRY           BLACKBERRY         BLACKBERRY
 2     2 135 BEARDSLEY ST         BEARDSLEY          BEARDSLEY
 3     3 15 HUNT CLUB DR          HUNT CLUB          HUNT CLUB
 4     4 1223 STATE ROUTE 103     STATE ROUTE 103    STATE ROUTE 103
 5     5 455 STATE RTE 43         STATE RTE 43       STATE RTE 43
 6     6 206 COUNTY RD 4710       COUNTY RD 4710     COUNTY RD 4710
 7     7 17 E 250TH ST            E 250TH            E 250TH
 8     8 158 BALLINGER AVE SE     BALLINGER          BALLINGER
 9     9 150 BALLINGER AVE S      BALLINGER          BALLINGER
10    10 18 BALLINGER AVE T       BALLINGER          BALLINGER
11    11 1272 ORANGE SUN TRL      ORANGE SUN         ORANGE SUN
12    12 291 S MORELAND BLVD      S MORELAND         S MORELAND
13    13 615 RUSSET WOOD LN       RUSSET WOOD        RUSSET WOOD
14    14 1165 MORROCCO CT         MORROCCO           MORROCCO
15    15 1321 S PKWY DR           S PKWY             S PKWY
16    16 250 COUNTY RD 25A S      COUNTY             COUNTY
17    17 22 SANSTONE RIDGE WAY    SANSTONE RIDGE WAY SANSTONE RIDGE WAY
18    18 55070 MENDOZA TRL        MENDOZA            MENDOZA
19    19 1609 HUNTSMERE AVE DOWN  HUNTSMERE          HUNTSMERE
20    20 243 MISTY WOODS CV S     MISTY WOODS        MISTY WOODS
21    21 2292 BAYBERRY CMNS       BAYBERRY           BAYBERRY
22    22 16 KILDEER CRK           KILDEER            KILDEER
23    23 40 BEDFORD XING          BEDFORD            BEDFORD
24    24 4 LEXINGTON SQ           LEXINGTON          LEXINGTON
25    25 113 SPARROWS CRST        SPARROWS           SPARROWS
26    26 1082 MATHOM LNDG         MATHOM             MATHOM
27    27 1050 WILLOW RIDGE LOOP   WILLOW RIDGE       WILLOW RIDGE
28    28 660 REDTOP LOOP          REDTOP             REDTOP
29    29 8 MOUNT ROYAL LOOP       MOUNT ROYAL        MOUNT ROYAL
30    30 805 SIERRA OVAL          SIERRA             SIERRA
31    31 3012 NANTUCKET ROW       NANTUCKET          NANTUCKET
32    32 6 WOODROW AVE            WOODROW            WOODROW
33    33 943 DARROW PARK DR       DARROW PARK        DARROW PARK
34    34 743 BELVEDERE TER        BELVEDERE          BELVEDERE
35    35 189 WINCHESTER RD        WINCHESTER         WINCHESTER
36    36 19 WHITE OAK TRCE        WHITE OAK          WHITE OAK
37    37 890 BLACKJACK RD EXT     BLACKJACK          BLACKJACK
38    38 767 N EXCALIBUR DR       N EXCALIBUR        N EXCALIBUR
39    39 109 VININGS FOREST LN SE VININGS FOREST     VININGS FOREST
40    40 508 E 141ST ST           E 141ST            E 141ST
41    41 85 ROSE LN ST SW         ROSE LN            ROSE LN
That's it.
Created on 2024-05-08 with reprex v2.1.0
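For comparison, the same word-by-word logic can be sketched in base R as a plain function. This is not part of the answer's pipeline: `extract_street` and `street_suffixes` are hypothetical names introduced here, and the sketch matches words to suffixes exactly with `%in%` rather than with the regex pattern above. The rule it follows is the one the pipeline encodes: if the address ends in a number or contains no suffix, keep everything after the house number; otherwise drop the last suffix and everything after it.

```r
# Hypothetical helper (not from the answer above): same rule, base R only.
extract_street <- function(address, suffixes) {
  vapply(address, function(a) {
    # Drop the leading house number, then split the rest into words
    rest <- sub("^\\d+\\s+", "", a, perl = TRUE)
    tokens <- strsplit(rest, "\\s+", perl = TRUE)[[1]]
    is_suffix <- tokens %in% suffixes
    ends_in_number <- grepl("^\\d+$", tokens[length(tokens)], perl = TRUE)
    # Leave the address alone if it ends in a number or has no suffix
    if (ends_in_number || !any(is_suffix)) {
      return(paste(tokens, collapse = " "))
    }
    # Otherwise keep only the words before the last suffix
    last_suffix <- max(which(is_suffix))
    paste(tokens[seq_len(last_suffix - 1)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Assumed plain suffix vector (a subset, for illustration)
street_suffixes <- c("AVE", "DR", "LN", "RD", "ST")

extract_street(
  c("135 BEARDSLEY ST", "1321 S PKWY DR", "206 COUNTY RD 4710",
    "85 ROSE LN ST SW", "250 COUNTY RD 25A S", "505 BLACKBERRY"),
  street_suffixes
)
# "BEARDSLEY" "S PKWY" "COUNTY RD 4710" "ROSE LN" "COUNTY" "BLACKBERRY"
```

One design difference to be aware of: an address whose first word after the house number is its only suffix comes back as an empty string here, whereas the pipeline drops that row entirely.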