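I would like to parse (extract) addresses into a house number and a street name. Afterwards I should be able to write the extracted values into new columns (shops$HouseNumber and shops$Streetname).
So let's say I have a data frame called "shops":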
> shops
Name city street
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
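Is there a way to split the street column into two lists, one with the street names and one with the house numbers, including cases such as "1-3" and "14a", so that the result can be assigned back to the data frame and look like this: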
> shops
Name city Streetname HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
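Example: Easyfakestreet 5 -> Easyfakestreet, 5
This is a bit complicated because some of my street strings contain hyphens and some house numbers have non-numeric components.
Examples:
New Street 3 -> ['New Street', '3']
Some-Complicated-Casestreet 1-3 -> ['Some-Complicated-Casestreet', '1-3']
Fake Street 14a -> ['Fake Street', '14a']
Any help would be appreciated!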
A possible tidyr solution:
library(tidyr)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
# Name city Streetname HouseNumber
# 1 Something Fakecity New Street 3
# 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
# 3 SomethingDifferent Fakecity Fake Street 14a
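One detail worth checking with this pattern: `\\D+` is greedy and a space is a non-digit, so the first capture group appears to keep the space that precedes the house number. Below is a minimal base R sketch of that behaviour. It uses `regmatches()` and `regexec()` rather than tidyr so it runs without extra packages, and the variable names are illustrative only.

```r
# Sample streets from the question
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

# Same regex as in the tidyr call above
pat <- "(\\D+)(\\d.*)"

# Each list element holds: full match, group 1, group 2
m <- regmatches(street, regexec(pat, street))

raw_name     <- vapply(m, function(g) g[2], character(1))
house_number <- vapply(m, function(g) g[3], character(1))

# Group 1 still ends with the separating space, so trim it
street_name <- trimws(raw_name)
```

If the trailing space matters downstream (for joins or comparisons, say), applying `trimws()` to the extracted column is probably the simplest fix.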
You could try:
shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
shops$HouseNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)
Data
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
Result
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
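Note that these two patterns split on the last space whatever follows it, so a street without a house number would lose its last word. A hedged sketch of one way to guard against that, assuming a house number always starts with a digit ("Market Square" is a made-up row, not part of the question's data):

```r
# "Market Square" is a hypothetical row without a house number
street <- c("New Street 3", "Market Square")

# Patterns from the answer above: split at the last space
name_raw <- gsub("(.+)\\s[^ ]+$", "\\1", street)
num_raw  <- gsub(".+\\s([^ ]+)$", "\\1", street)
# Without a guard, "Square" would be treated as the house number

# Guard (an assumption): accept the last token only if it starts with a digit
has_number <- grepl("\\s\\d[^ ]*$", street)

street_name  <- ifelse(has_number, name_raw, street)
house_number <- ifelse(has_number, num_raw, NA_character_)
```

Rows that fail the guard keep their full street string and get `NA` as the house number.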
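Create a pattern with two capture groups, one matching the street and one matching the number, then use sub to replace the whole string with each back-reference in turn. No packages are needed: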
pat <- "(.*) (\\d.*)"
transform(shops,
street = sub(pat, "\\1", street),
HouseNumber = sub(pat, "\\2", street)
)
giving:
Name city street HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Here is a visualization of pat:
(.*) (\d.*)
Notes:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
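2) David Arenburg's pattern could be used here instead. Just set pat to it. The advantage of the pattern above is that it allows street names with embedded digits, whereas the advantage of David's is that the space before the house number may be missing.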
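To make the trade-off concrete, here is a small sketch on two made-up streets. It assumes "David's pattern" means the `(\\D+)(\\d.*)` regex from the tidyr answer. Neither the sample strings nor the variable names come from the question.

```r
# Hypothetical edge cases, not taken from the question
x <- c("5th Avenue 12", "Mainstreet5")

pat_space    <- "(.*) (\\d.*)"   # pattern from this answer
pat_nondigit <- "(\\D+)(\\d.*)"  # assumed to be David's pattern

# Embedded digit in the street name: the space-based pattern splits cleanly
space_name <- sub(pat_space, "\\1", x[1])        # "5th Avenue"
space_num  <- sub(pat_space, "\\2", x[1])        # "12"

# Missing space: no match, so sub() returns the input unchanged
space_nomatch <- sub(pat_space, "\\2", x[2])     # "Mainstreet5"

# The non-digit pattern copes with the missing space ...
nondigit_name <- sub(pat_nondigit, "\\1", x[2])  # "Mainstreet"
nondigit_num  <- sub(pat_nondigit, "\\2", x[2])  # "5"

# ... but it is unanchored, so the leading "5" of "5th" stays outside the match
nondigit_bad <- sub(pat_nondigit, "\\2", x[1])   # "512"
```

Which pattern fits better depends on the data: unmatched rows come back unchanged from `sub()`, so it may be worth checking with `grepl(pat, street)` before trusting the result.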
You can use the unglue package:
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#> Name city street value
#> 1 Something Fakecity New Street 3
#> 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
#> 3 SomethingDifferent Fakecity Fake Street 14a
Created on 2019-10-08 by the reprex package (v0.3.0)