Changing country names

Question

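I have a df with a column of country names, but some of them are misspelled.

Some of the errors are "Xile", "Espanya", "Mejiko", etc.

I would like to know whether there is a function that can loop through the column and correct the errors, rather than fixing them one by one.

Any ideas?

Thanks

I have tried different libraries, but I don't really understand them (I'm still learning).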

Answer 1

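As @MrFlick pointed out, there is no out-of-the-box solution in R, and in the end you should check each value individually to confirm it is correct.

However, writing your own function can help you correct the names.

Below is one way to approach the problem (assuming the country names should be entered in Spanish). It is based on computing the edit distance between your words and a list of (correct) country names.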

# The rio package handles a variety of file formats and encodings. Here, it is useful to 
# handle special characters in the Spanish language, i.e., the diaeresis, the tilde, and 
# the acute accent.
devtools::install_github("leeper/rio")

# load Spanish country names
countries <- rio::import(
  paste0("https://gist.githubusercontent.com/hanoii/5b2ce60a7e5baba857bed3ec45435987/",
         "raw/b359ca0be01f93a079f8af485c94e7d31dd1094a/paises-espa%25C3%25B1ol.txt"),
  sep = "\n", header = FALSE
)[, 1]

# Define a function to find 'closest' country names based on Levenshtein distance.
# The distance represents the number of operations (e.g. deletion), changing one 
# character at a time, needed to transform one word into the other.
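match_country <- function(x, countries) {
  # Coerce to character so that factor columns (the data.frame default in R < 4.0)
  # can be overwritten with the matched names.
  x <- as.character(x)
  # For each input, pick the first country with the smallest edit distance.
  for (i in seq_along(x)) x[i] <- countries[which.min(adist(x[i], countries))]
  return(x)
}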

# test it
df <- data.frame(incorrect_names = c("Xile", "Espanya", "Mejiko", "Irxx"))

df$correct_names <- match_country(df$incorrect_names, countries)

df
#   incorrect_names correct_names
# 1            Xile         Chile
# 2         Espanya        España
# 3          Mejiko        México
# 4            Irxx          Irán

# "Irán" and "Iraq" are equally distant but "Irán" comes first in the list!

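Please keep in mind the limitation of this approach: matches become ambiguous when several country names are equally distant from the input.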
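One way to make such equally distant matches visible is to return the distance and a tie flag alongside each suggestion, so that doubtful rows can be reviewed by hand. The following is only a minimal sketch, not part of the original answer: the helper name `match_country_checked`, the `max_dist` cutoff of 3, and the short `countries_demo` list are all assumptions made for illustration. It relies only on base R's `adist()`.

```r
# Hypothetical helper (an illustration, not the original answer's function):
# returns the closest name plus its distance, and flags rows that deserve a manual check.
match_country_checked <- function(x, countries, max_dist = 3) {
  x <- as.character(x)
  d <- adist(x, countries)          # matrix: one row per input, one column per country
  best <- apply(d, 1, min)          # smallest distance found for each input
  n_ties <- rowSums(d == best)      # number of candidates sharing that smallest distance
  data.frame(
    input = x,
    match = countries[apply(d, 1, which.min)],  # first closest candidate, as in match_country()
    distance = best,
    ambiguous = n_ties > 1 | best > max_dist,   # tie, or too far away to trust (assumed cutoff)
    stringsAsFactors = FALSE
  )
}

# Small stand-in list (assumed) so the sketch runs without downloading anything
countries_demo <- c("Chile", "España", "México", "Irán", "Irak")
match_country_checked(c("Xile", "Espanya", "Mejiko", "Irxx"), countries_demo)
```

With this stand-in list, the "Irxx" row is flagged as ambiguous because two candidates sit at the same distance, while "Xile" has a single closest candidate. The cutoff is arbitrary and would need tuning against real data.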
