How can I identify all country names mentioned in a string and split accordingly?

Question

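I have a string that contains country names and other regional names. I am only interested in the country names and would ideally like to add several columns, each containing one of the country names listed in the string. Here is example code showing how the data frame is set up: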

df <- data.frame(id = c(1,2,3),
                 country = c("Cote d'Ivoire Africa Developing Economies West Africa",
                              "South Africa United Kingdom Africa BRICS Countries",
                             "Myanmar Gambia Bangladesh Netherlands Africa Asia"))

If I simply split the string on spaces, countries whose names contain a space (e.g. "United Kingdom") are broken apart. See here:

library(tidyr)  # provides separate()
df2 <- separate(df, country, paste0("C",3:8), sep=" ")

So I tried looking up the country names with the world.cities dataset. However, this only seems to walk through the string until a non-country name appears. See here:

library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$c1 <- sapply(str_extract_all(df$country, all_countries), toString)

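I am wondering whether it is possible to use the space as a delimiter but define exceptions (such as "United Kingdom"). This would obviously require some manual work, but it seems like the most feasible solution to me. Does anyone know how to define such exceptions? Of course, I am also open to, and grateful for, any other solution.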
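The "space delimiter with exceptions" idea above could be sketched roughly as follows. This is only an illustration in base R, not a tested solution: the `exceptions` vector and the `protect()` helper are hypothetical names introduced here, and the list of multi-word names would have to be maintained by hand. Spaces inside protected names are temporarily swapped for underscores, so the sketch assumes the input itself contains no underscores.

```r
# Hypothetical, hand-maintained list of multi-word names to keep together
exceptions <- c("Cote d'Ivoire", "South Africa", "United Kingdom", "West Africa")

# Replace the spaces inside each protected name with underscores
protect <- function(x, exceptions) {
  for (name in exceptions) {
    x <- gsub(name, gsub(" ", "_", name, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

x <- "South Africa United Kingdom Africa BRICS Countries"

# Split on single spaces, then restore the spaces inside protected names
tokens <- strsplit(protect(x, exceptions), " ", fixed = TRUE)[[1]]
tokens <- gsub("_", " ", tokens, fixed = TRUE)
tokens
```

The resulting tokens still include non-country entries such as "Africa" or "BRICS", so a filter against a list of country names would still be needed afterwards.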

Update:

I found another solution using the countrycode package:

library(countrycode)
countries <- data.frame(countryname_dict)
countries$continent <- countrycode(sourcevar = countries[["country.name.en"]],
                                   origin = "country.name.en",
                                   destination = "continent")

africa <- countries[ which(countries$continent=='Africa'), ]

library(stringr)
pat <- paste0("\\b", paste(africa$country.name.en , collapse="\\b|\\b"), "\\b")
df$country_list <- str_extract_all(df$country, regex(pat, ignore_case = TRUE))

Answer 1

You can do it like this:

library(stringi)
vec <- stri_trans_general(countrycode::codelist$country.name.en, id = "Latin-ASCII")
stri_extract_all(df$country,regex = sprintf(r"(\b(%s)\b)",stri_c(vec,collapse = "|")))
[[1]]
[1] "Cote d'Ivoire"

[[2]]
[1] "South Africa"   "United Kingdom"

[[3]]
[1] "Gambia"      "Bangladesh"  "Netherlands"
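Since the question asks for one column per country, the list result could then be padded and bound into columns. The following is a minimal base R sketch under some assumptions: `matches` is a hand-typed stand-in for the list printed above, and the column names `country_1`, `country_2`, ... are chosen here purely for illustration.

```r
# Stand-in for the list returned by the extraction step (typed by hand here)
matches <- list("Cote d'Ivoire",
                c("South Africa", "United Kingdom"),
                c("Gambia", "Bangladesh", "Netherlands"))

# Pad every element with NA up to the length of the longest match vector
n_max <- max(lengths(matches))
padded <- lapply(matches, function(x) c(x, rep(NA_character_, n_max - length(x))))

# Bind the padded vectors row-wise into a character matrix and name the columns
wide <- do.call(rbind, padded)
colnames(wide) <- paste0("country_", seq_len(n_max))

# Attach the new columns to the ids from the example data frame
df_wide <- cbind(data.frame(id = c(1, 2, 3)), as.data.frame(wide))
df_wide
```

Rows with fewer matches than the maximum are filled with NA on the right.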

Answer 2

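Does anyone know of a dictionary that also includes the people or language as patterns? For example, for Afghanistan it should include "Afghanistan|Afghan|Pashto"; for France, "France|French", and so on.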
