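I have a df with a column containing country names, but some of them are misspelled.
Some of the errors are "Xile", "Espanya", "Mejiko", etc.
I would like to know whether there is a function that can loop over the column and correct the errors, instead of fixing them one by one.
Any ideas?
Thanks
I have tried different libraries, but I don't really understand them (I'm still learning).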
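As @MrFlick pointed out, there is no out-of-the-box solution in R, and in the end you should check each value individually to confirm it is correct.
That said, writing your own function can help you correct the names.
Below is one way to approach this (assuming the country names should be entered in Spanish). It is based on computing the edit distance between your words and a list of (correct) country names.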
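Before the full solution, a quick look at what edit distance means in practice. Base R's `adist()` (from the utils package) returns the generalized Levenshtein distance between strings. This is only a minimal sketch: `candidates` and `misspelled` are made-up vectors, and the candidates are written without accents to sidestep encoding issues.

```r
# Hypothetical mini list of correct names (ASCII only, for illustration)
candidates <- c("Chile", "Espana", "Mexico")
misspelled <- c("Xile", "Espanya", "Mejiko")

# adist() returns a matrix: one row per misspelled name, one column per candidate
d <- adist(misspelled, candidates)
dimnames(d) <- list(misspelled, candidates)
d

# For each row, the column with the smallest distance is the closest candidate
closest <- candidates[apply(d, 1, which.min)]
closest
```

The full approach below applies the same idea to a complete list of Spanish country names.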
# The rio package handles a variety of file formats and encodings. Here, it is useful to
# handle special characters in the Spanish language, i.e., the diaeresis, the tilde, and
# the acute accent.
# install.packages("rio")  # CRAN release; the line below installs the development version
devtools::install_github("leeper/rio")
# load Spanish country names
countries <- rio::import(
  paste0("https://gist.githubusercontent.com/hanoii/5b2ce60a7e5baba857bed3ec45435987/",
         "raw/b359ca0be01f93a079f8af485c94e7d31dd1094a/paises-espa%25C3%25B1ol.txt"),
  sep = "\n", header = FALSE
)[, 1]
# Define a function to find the 'closest' country name based on Levenshtein distance.
# The distance is the minimum number of single-character operations (insertion,
# deletion, substitution) needed to transform one word into the other.
match_country <- function(x, countries) {
  # which.min() returns the first minimum, so ties go to the earlier list entry
  for (i in seq_along(x)) {
    x[i] <- countries[which.min(adist(x[i], countries))]
  }
  return(x)
}
# test it
df <- data.frame(incorrect_names = c("Xile", "Espanya", "Mejiko", "Irxx"))
df$correct_names <- match_country(df$incorrect_names, countries)
df
#   incorrect_names correct_names
# 1            Xile         Chile
# 2         Espanya        España
# 3          Mejiko        México
# 4            Irxx          Irán
# "Irán" and "Iraq" are equally distant but "Irán" comes first in the list!
Please keep in mind the limitation of this approach: ambiguous matches caused by words that are equally distant.
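One way to make that limitation visible is to return the distance alongside each match and flag ties, so you know which rows to check by hand. This is a rough sketch rather than a definitive fix: `match_country_checked` and its `max_dist` cutoff are hypothetical names introduced here, and the default of 2 is an arbitrary assumption you would need to tune.

```r
# Sketch: like match_country(), but also reports the distance and flags ties.
# `match_country_checked` and `max_dist` are illustrative, not part of any package.
match_country_checked <- function(x, countries, max_dist = 2) {
  d <- adist(x, countries)                     # rows: inputs, columns: candidates
  best <- apply(d, 1, min)                     # smallest distance per input
  n_best <- rowSums(d == best)                 # number of candidates sharing that minimum
  match <- countries[apply(d, 1, which.min)]   # first closest candidate
  match[best > max_dist] <- NA                 # too far from anything: review manually
  data.frame(input = x, match = match, distance = best,
             ambiguous = n_best > 1, stringsAsFactors = FALSE)
}

# Usage with a small made-up candidate list (accent-free for simplicity)
match_country_checked(c("Xile", "Irxx", "Zzzzzzzz"), c("Chile", "Iran", "Iraq"))
```

Rows with `ambiguous = TRUE` or `match = NA` are the ones that most need a manual look. The other rows should still be spot-checked, as noted above.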