Extracting the house number from an (address) string using R

Question  Votes: 3  Answers: 4

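I want to parse (extract) addresses into a house number and a street name. Afterwards I should be able to write the extracted values into new columns (shops$HouseNumber and shops$Streetname).

So let's say I have a data frame called "shops":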

> shops
      Name                 city        street
 1    Something            Fakecity    New Street 3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet 1-3
 3    SomethingDifferent   Fakecity    Fake Street 14a

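Is there a way to split the street column into two lists, one with the street names and one with the house numbers, including cases like "1-3" and "14a", so that the result can finally be assigned to the data frame and look like this?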

 > shops
      Name                 city        Streetname                    HouseNumber
 1    Something            Fakecity    New Street                    3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet   1-3
 3    SomethingDifferent   Fakecity    Fake Street                   14a 

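Example: Easyfakestreet 5 -> Easyfakestreet, 5

This is a little complicated by the fact that some of my street strings contain hyphenated street names and house numbers with non-numeric components.

Examples:

New Street 3 -> ['New Street', '3']
Some-complicated-Casestreet 1-3 -> ['Some-complicated-Casestreet', '1-3']
Fake Street 14a -> ['Fake Street', '14a']

I would appreciate any help!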

r split street-address
4 Answers
8
votes

A possible tidyr solution:

library(tidyr)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#                 Name     city                   Streetname HouseNumber
# 1          Something Fakecity                  New Street            3
# 2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
# 3 SomethingDifferent Fakecity                 Fake Street          14a
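One detail that is easy to miss in the printed output: the first capture group `(\\D+)` also swallows the space in front of the house number. The sketch below reproduces the same pattern with base R's `sub` (not with tidyr itself, so whether `extract()` output carries the same trailing space is an assumption worth checking) and trims the result with `trimws`. The variable names are illustrative only.

```r
# Street strings taken from the question's example data
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

# Same pattern as in the tidyr call above
pat <- "(\\D+)(\\d.*)"

# Group 1 is every non-digit up to the first digit, including the separating space
raw_name <- sub(pat, "\\1", street)

# trimws() removes the leading/trailing whitespace left over from the split
name <- trimws(raw_name)

# Group 2 starts at the first digit and runs to the end of the string
number <- sub(pat, "\\2", street)

print(data.frame(raw_name = sprintf("[%s]", raw_name), name, number))
```

If the trailing space matters for later joins or comparisons, trimming the street name column after the split is one simple option.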

5
votes

You could try:

shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
shops$HouseNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)

Data

shops$street
#[1] "New Street 3"                    "Some-Complicated-Casestreet 1-3" "Fake Street 14a"

Result

shops$Streetname
#[1] "New Street"                  "Some-Complicated-Casestreet" "Fake Street"

shops$HouseNumber
#[1] "3"   "1-3" "14a"
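These two patterns split on the last space regardless of what follows it, so an address without a house number (for example "New Street") would be split into "New" and "Street". The following is a minimal sketch of one way to guard against that, assuming a house number always starts with a digit; the `has_number` helper vector and the extra test address are hypothetical and not part of the original data.

```r
# Two addresses from the question plus one hypothetical address without a number
street <- c("New Street 3", "Fake Street 14a", "New Street")

# TRUE only when the last space-separated token starts with a digit
has_number <- grepl("\\s\\d[^ ]*$", street)

# Drop the last token only where a house number was detected
name <- ifelse(has_number, sub("\\s[^ ]+$", "", street), street)

# Keep the last token as the house number, otherwise NA
number <- ifelse(has_number, sub(".*\\s", "", street), NA_character_)

print(data.frame(street, name, number))
```

Rows without a detectable number keep the full string as the street name and get NA as the house number, which makes them easy to find afterwards.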

2
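votes

Create a pattern with back-references (capture groups) that match both the street and the number, then use sub to replace the whole string with each back-reference in turn. No packages are needed: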

pat <- "(.*) (\\d.*)"
transform(shops,
   street = sub(pat, "\\1", street), 
   HouseNumber = sub(pat, "\\2", street)
)

giving:

                Name     city                      street  HouseNumber
1          Something Fakecity                  New Street            3
2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
3 SomethingDifferent Fakecity                 Fake Street          14a
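As an alternative to calling `sub` once per capture group, both groups can be pulled out in a single pass with base R's `regexec` and `regmatches`. This is only a sketch using the same `pat`; it assumes every address matches the pattern (a non-matching string yields an empty vector, which `rbind` would not handle correctly), and the `Streetname` column name follows the question rather than the answer above.

```r
# Example data as given in the question
shops <- data.frame(
  Name = c("Something", "SomethingOther", "SomethingDifferent"),
  city = "Fakecity",
  street = c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a"),
  stringsAsFactors = FALSE
)

pat <- "(.*) (\\d.*)"

# regexec() records the full match plus each capture group;
# regmatches() returns them as one character vector per input string
m <- regmatches(shops$street, regexec(pat, shops$street))

# Stack the vectors: column 1 = full match, column 2 = street, column 3 = number
parts <- do.call(rbind, m)

shops$Streetname <- parts[, 2]
shops$HouseNumber <- parts[, 3]
print(shops)
```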

Here is a visualization of pat:

(.*) (\d.*)

[Image: regular expression visualization, Debuggex Demo]

Notes:

1) We used this for shops:

shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3", 
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name", 
"city", "street"), class = "data.frame", row.names = c(NA, -3L))

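2) David Arenburg's pattern could be used here instead. Just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, whereas David's has the advantage that the space before the house number may be missing.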
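The trade-off described in note 2 can be checked with a small sketch. The two test addresses ("Route 66 Road 12" and "Mainstreet5") and the `split_with` helper are made up for illustration and are not part of the original data.

```r
pat_space <- "(.*) (\\d.*)"    # pattern from this answer: requires a space before the number
pat_david <- "(\\D+)(\\d.*)"   # pattern from the tidyr answer: splits at the first digit

# Hypothetical helper: apply one pattern and return both capture groups
split_with <- function(pat, x) {
  c(street = sub(pat, "\\1", x), number = sub(pat, "\\2", x))
}

# Street name with an embedded number
embedded_space <- split_with(pat_space, "Route 66 Road 12")
embedded_david <- split_with(pat_david, "Route 66 Road 12")

# No space between street name and house number
nospace_space <- split_with(pat_space, "Mainstreet5")
nospace_david <- split_with(pat_david, "Mainstreet5")

print(rbind(embedded_space, embedded_david, nospace_space, nospace_david))
```

With these inputs, the space-based pattern splits "Route 66 Road 12" at the last space followed by a digit, while the first-digit pattern cuts it at "66". For "Mainstreet5" the space-based pattern does not match at all, so `sub` returns the input unchanged for both groups, whereas the first-digit pattern yields "Mainstreet" and "5".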


0
votes

You can use the package unglue:

library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#>                 Name     city                      street value
#> 1          Something Fakecity                  New Street     3
#> 2     SomethingOther Fakecity Some-Complicated-Casestreet   1-3
#> 3 SomethingDifferent Fakecity                 Fake Street   14a

Created on 2019-10-08 by the reprex package (v0.3.0)
