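I have a data table with company names and address information. I want to remove legal-entity terms and the most common words from the company names, so I wrote a function and applied it to my data.table.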
library(data.table)
library(stringr)
search_for_default <- c("inc", "corp", "co", "llc", "se", "\\&", "holding", "professionals",
                        "services", "international", "consulting", "the", "for")
clean_strings <- function(string, search_for = search_for_default) {
  # replace punctuation with spaces and collapse repeated whitespace
  clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " "))
  # lower-case and split into tokens
  clean_step2 <- unlist(str_split(tolower(clean_step1), " "))
  # drop tokens that start with a geographical prefix
  clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")]
  # drop legal-entity terms and common words, then rejoin
  res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], sep = "", collapse = " "))
  # drop duplicate tokens and paste the string back together
  res <- paste(unique(unlist(str_split(res, " "))), collapse = " ")
  return(res)
}
datatable[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]
The script works well, but on a large dataset (>3b rows) it takes a very long time. Is there a more efficient way to do this?
Example:
Input:
Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.", "American Test Company for Consulting")
Expected:
Company_name_clean <- c("walmart", "amazon.com", "apple", "test company")
I would do it like this:
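library(stringr)
# whole-word matches only, so "co" does not hit "company" or "amazon.com"
words <- paste0("\\b",
                c("corp", "co", "llc", "se", "holding", "professionals", "services",
                  "international", "consulting", "the", "for", "american", "canadian"),
                "\\b", collapse = "|")
# literal fragments; "&" lives here because \\b&\\b would miss an "&" surrounded by spaces
others <- paste(c("inc\\.", ",", "&"), collapse = "|")
Company_Name |>
  tolower() |>
  str_remove_all(pattern = paste0(words, "|", others)) |>
  str_squish() # trims the ends and collapses gaps left by removed words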
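Since the question is about speed on a data.table, here is a minimal sketch of how the same pattern could be applied there. It assumes the column is called COMPANY_NAME as in the question; `stop_words`, `pattern`, `lookup` and the sample rows are names and data made up for illustration. The idea is to clean each distinct name once in a single vectorised call and then copy the result back with an update join, instead of calling an R function once per group. I have not benchmarked this on billions of rows, so treat the speed-up as something to measure on your own data.

```r
library(data.table)
library(stringr)

# Hypothetical stop-word list and sample data, mirroring the example in the question
stop_words <- c("corp", "co", "llc", "se", "holding", "professionals", "services",
                "international", "consulting", "the", "for", "american", "canadian")
pattern <- paste0(paste0("\\b", stop_words, "\\b", collapse = "|"), "|\\binc\\.|,|&")

datatable <- data.table(
  COMPANY_NAME = c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.",
                   "American Test Company for Consulting", "Apple Inc.")
)

# Clean every distinct name exactly once; the stringr calls are vectorised
lookup <- unique(datatable[, .(COMPANY_NAME)])
lookup[, COMPANY_NAME_clean := str_squish(str_remove_all(tolower(COMPANY_NAME), pattern))]

# Update join: write the cleaned name back onto every matching row by reference
datatable[lookup, on = "COMPANY_NAME", COMPANY_NAME_clean := i.COMPANY_NAME_clean]
```

If the number of distinct names is small relative to the number of rows, most of the regex work happens on the small `lookup` table and the join does the rest.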
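A note:
Don't strip all punctuation, otherwise the dot in amazon.com is removed as well. Just match the things you actually need to remove.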
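To make the punctuation point concrete, here is a small sketch comparing blanket removal with a targeted pattern on one of the example names. The variable names `name`, `strip_all` and `strip_some` are only for illustration.

```r
library(stringr)

name <- "Amazon.com, Inc."

# Blanket removal: every punctuation character becomes a space, including the dot in the domain
strip_all <- str_squish(str_replace_all(tolower(name), "[[:punct:]]", " "))

# Targeted removal: only the comma and the token "inc." are matched
strip_some <- str_squish(str_remove_all(tolower(name), ",|\\binc\\."))

print(strip_all)   # "amazon com inc"
print(strip_some)  # "amazon.com"
```

The first version also leaves "inc" behind as a separate token that then has to be filtered out, which is one more reason to match only what needs to go.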