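I have a database in which some entries are duplicated, with (inconsistent) additional information attached. I want to strip that information and keep the simplest version of each entry.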
db <- data.frame(company=c("ENTRY_X","ENTRY_X COUNTY_1","COUNTY_2 ENTRY_X","ENTRY_Y"))
db_desiderata <- data.frame(company=c(rep("ENTRY_X",3),"ENTRY_Y"))
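Entries can be lengthy strings (some contain spaces). One example is "General Motors Company" and "General Motors". I have managed to isolate all the entries that need to be replaced by their substring (in db$included). I plan to run this recursively.
Attempted code (all of it works; I am stuck on how to continue):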
db$included <- lapply(db$company, function(x) c(grep(x,db$company,value=T)))
db$lenght <- lapply(db$included, function(x) length(unlist(x)))
db$included <- ifelse(db$lenght==1,NA,db$included)
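If the data strictly follow these patterns, the following should work.
I'll use a variant of Chuck P's data to show how it works, and what goes wrong when the patterns are not followed.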
db <- data.frame(company = c("General Foods", "More General Foods",
                             "General Foods Cereal Division", "General Auto",
                             "General Motors Company", "General Motors",
                             "European General Motors Company",
                             "General", "Asia General Toys"))
# Carry the previous name forward whenever it occurs inside the current one.
companies <- Reduce(f = function(y, x) if (grepl(pattern = y, x = x)) y else x,
                    x = db$company, accumulate = TRUE)
This gives:
companies
[1] General Foods General Foods General Foods General Auto General Motors Company
[6] General Motors General Motors General General
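The output above depends on row order: "General Motors Company" comes before "General Motors", so it is left as it is. A minimal sketch of that order dependence follows; the helper name collapse_run is hypothetical, and fixed = TRUE is an assumption added here so that names are matched literally rather than as regular expressions.

```r
# Hypothetical helper wrapping the Reduce() idea shown above.
# fixed = TRUE (an assumption) treats each name as a literal string.
collapse_run <- function(x) {
  Reduce(function(prev, cur) if (grepl(prev, cur, fixed = TRUE)) prev else cur,
         x, accumulate = TRUE)
}

# Short name first: the longer variant is collapsed onto it.
short_first <- collapse_run(c("General Motors", "General Motors Company"))
short_first
#> [1] "General Motors" "General Motors"

# Long name first: nothing is collapsed, because each entry is only
# compared with the value carried over from the previous row.
long_first <- collapse_run(c("General Motors Company", "General Motors"))
long_first
#> [1] "General Motors Company" "General Motors"
```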
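I think I understand your situation a bit better after your comment, but I would still be very cautious about a fully automated solution: one slip, or one term that is too general (pun intended), and you're hosed...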
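To make that risk concrete, here is a rough sketch of one possible fully automatic rule: replace each entry with the shortest entry it contains. The function name shortest_contained is hypothetical and this is only an illustration, not a recommended method. It behaves well until a single very short entry shows up.

```r
# Hypothetical helper: for each string, return the shortest entry of x
# that occurs inside it (every string contains itself, so there is
# always at least one candidate). Matching is literal (fixed = TRUE).
shortest_contained <- function(x) {
  vapply(x, function(s) {
    is_inside <- vapply(x, function(p) grepl(p, s, fixed = TRUE), logical(1))
    candidates <- x[is_inside]
    candidates[which.min(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

ok <- shortest_contained(c("General Foods", "More General Foods", "General Motors"))
ok
#> [1] "General Foods"  "General Foods"  "General Motors"

# One overly general entry swallows everything.
bad <- shortest_contained(c("General Foods", "More General Foods",
                            "General Motors", "General"))
bad
#> [1] "General" "General" "General" "General"
```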
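I've renamed your earlier work a bit. Think of your original length as more of a measure of potential. I would eyeball the potential column and then pick out what to replace. I'd use stringr::str_replace_all. If you use a named vector as I show below, you should be able to handle a wide range of cases with cut and paste. "^.*General Motors.*$" means: match if you find it anywhere in the string, whatever comes before or after it. You can work iteratively, just adding to the named vector until you have it cleaned up.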
library(dplyr)
library(stringr)
db <- data.frame(company = c("General Foods","More General Foods","General Foods Cereal Division","General", "General Auto", "General Motors Company", "General Motors", "European General Motors Company"))
db$similar_company <- sapply(db$company, function(x) c(grep(x, db$company, value=T)), simplify = TRUE)
db$potential <- sapply(db$similar_company, function(x) length(unlist(x)), simplify = TRUE)
glimpse(db)
#> Rows: 8
#> Columns: 3
#> $ company <chr> "General Foods", "More General Foods", "General Foods…
#> $ similar_company <named list> [<"General Foods", "More General Foods", "Gene…
#> $ potential <int> 3, 1, 1, 8, 1, 2, 3, 1
db %>% arrange(desc(potential)) %>% select(-similar_company)
#>                           company potential
#> 1                         General         8
#> 2                   General Foods         3
#> 3                  General Motors         3
#> 4          General Motors Company         2
#> 5              More General Foods         1
#> 6   General Foods Cereal Division         1
#> 7                    General Auto         1
#> 8 European General Motors Company         1
db$newcompany <-
  str_replace_all(db$company, c("^.*General Foods.*$" = "General Foods",
                                "^.*General Motors.*$" = "General Motors"))
db %>% select(company, newcompany)
#>                           company     newcompany
#> 1                   General Foods  General Foods
#> 2              More General Foods  General Foods
#> 3   General Foods Cereal Division  General Foods
#> 4                         General        General
#> 5                    General Auto   General Auto
#> 6          General Motors Company General Motors
#> 7                  General Motors General Motors
#> 8 European General Motors Company General Motors
Created on 2020-05-08 by the reprex package (v0.3.0)
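If the list of hand-picked names grows, typing each "^.*name.*$" pattern by hand gets tedious. A small sketch of one way to build the named vector from the canonical names follows; make_rules is a hypothetical helper, not part of stringr, and it assumes the names contain no regex metacharacters such as . or (.

```r
library(stringr)

# Hypothetical helper: turn hand-picked canonical names into a named
# vector where each name is the pattern "^.*<name>.*$" and each value
# is the replacement. Assumes no regex metacharacters in the names.
make_rules <- function(canonical) {
  setNames(canonical, paste0("^.*", canonical, ".*$"))
}

rules <- make_rules(c("General Foods", "General Motors"))

cleaned <- str_replace_all(
  c("More General Foods", "European General Motors Company", "General Auto"),
  rules
)
cleaned
#> [1] "General Foods"  "General Motors" "General Auto"
```

Names that are not in the list, such as "General Auto", pass through unchanged, so the vector can be extended one reviewed name at a time.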