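Extract country names from strings in R by matching against a country list

Question  Votes: 1  Answers: 3

I have been scraping review data from a website, and in the process I obtained a character vector of strings that contain the user name, number of reviews, review date, and country. They look roughly like this: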

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
"James (10) - - MEXICO - NOV 22, 2017", 
"Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
"Alex (20000) - SOUTH KOREA- MAR 11, 2015")

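So far I can extract the name, the review count, and the date, because they sit in fixed positions or have a consistent format. The problem is that the country name format is not consistent, and the individual data points in each string are not consistently separated by commas or dashes. Extracting only the all-uppercase strings runs into trouble when the country is missing or when its name is split into two parts.

The maps package contains a list of countries. Is there a way to use str_extract_all from stringr to find a match in that country list vector and extract it?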

r web-scraping dplyr stringr data-processing
3 Answers
2
votes

You can do this with the maps library, as follows:

library(maps)

## Loading country data from package maps
data(world.cities)

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
     "James (10) - - MEXICO - NOV 22, 2017", 
     "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
     "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Remove punctuation
raw <- gsub("[[:punct:]\n]","",raw)

# Split each string into words on spaces
raw2 <- strsplit(raw, " ")

# Keep the words that match a country name in world.cities
CountryList_raw <- (lapply(raw2, function(x)x[which(toupper(x) %in% toupper(world.cities$country.etc))]))

do.call(rbind, lapply(CountryList_raw, as.data.frame))

#      X[[i]]
#1        USA
#2     MEXICO
#3    FINLAND

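This works. However, you will need to fix up countries whose names contain more than one word afterwards, for example South Korea in this case. Because strsplit breaks the name into separate words, neither word on its own can be matched against South Korea.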
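
One way around the splitting problem, sketched here rather than taken from the answer above, is to leave each string whole and search it for any name from the list with a single regular expression. The short `countries` vector is a hypothetical stand-in for a real list such as the one in the maps package, whose spellings may differ from those in the data. Names containing regex metacharacters would need escaping first.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Hypothetical stand-in for a real country list
countries <- c("USA", "Mexico", "Finland", "South Korea")

# Put longer names first so a multi-word name wins over a shorter overlap
countries <- countries[order(nchar(countries), decreasing = TRUE)]

# One alternation, anchored at word boundaries so "USA" is not found inside "Susane"
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

# First case-insensitive match per string; NA where nothing matches
hits <- regexpr(pattern, raw, ignore.case = TRUE, perl = TRUE)
found <- rep(NA_character_, length(raw))
found[hits > 0] <- regmatches(raw, hits)
found
```

The same pattern could be handed to stringr::str_extract() wrapped in stringr::regex(pattern, ignore_case = TRUE) if you prefer to stay within stringr.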


1
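vote

TL;DR

I took the raw data and converted it into a data frame. Then, column by column, I extracted the required information using a combination of regular expressions and row-wise iteration.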

Import Necessary Packages and Raw Data

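To follow this tutorial, you will need to install the following packages:

  • BBmisc: miscellaneous helper functions from B. Bischl and others, mostly for package development.
  • maps: draws geographical maps.
  • magrittr: a set of operators that make code more readable.
  • purrr: a complete and consistent functional programming toolkit for R.
  • stringr: string manipulation helpers; used below for stringr::str_to_title().

If you already have all of these installed, there is no need to call install.packages().

install.packages( pkgs = c(  "BBmisc", "maps", "magrittr", "purrr", "stringr" ) )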
library( BBmisc )
library( maps )
library( magrittr )
library( purrr )

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
         "James (10) - - MEXICO - NOV 22, 2017", 
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

Declare Four Columns

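Given the data stored in raw, four columns seem appropriate:

  • user_name: the user's name
  • user_review_number: the identification number associated with the user's review
  • user_country: the user's country
  • user_review_date: the date, in Month Day, Year format, on which the user's review was created

raw <- data.frame( user_name = raw
                   , user_review_number = raw
                   , user_country = raw
                   , user_review_date = raw
                   , stringsAsFactors = FALSE
                   )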

Regular Expressions

Regular expressions allow complex and flexible search/replace using a specific syntax. They are used here to extract the relevant data from the raw data set.
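
As a small illustration of the pattern used in the steps that follow, splitting one of the sample strings on either parenthesis gives three pieces: the name, the number, and everything else. This is base R only, and the variable names `x` and `pieces` are just for the example.

```r
x <- "Anna (1025) - North Carolina, USA - DEC 20, 2017"

# "\\(|\\)" means: split on a literal "(" or a literal ")"
pieces <- strsplit(x, split = "\\(|\\)")[[1]]
pieces
```

The first piece still carries a trailing space and the third a leading one, which is why the steps below call trimws().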

Identify raw$user_name

This column contains the user name, which is the text that precedes the parentheses.

raw$user_name <- strsplit( x = raw$user_name
                           , split = "\\(|\\)"
                           , fixed = FALSE 
                           )
# keep only the first element from each list, then unlist to obtain a character vector
raw$user_name <- 
  purrr::map( .x = raw$user_name, .f = 1 ) %>%
  unlist()

# remove trailing whitespace
raw$user_name <- trimws( x = raw$user_name
                         , which = "right"
                         )

Identify raw$user_review_number

This column contains the user's review number, an integer of 1 to 10 digits between the two parentheses.

raw$user_review_number <- strsplit( x = raw$user_review_number
                                    , split = "\\(|\\)"
                                    , fixed = FALSE 
                                    )
# keep only the second element from each list, then unlist to obtain a character vector
# and cast as integer
raw$user_review_number <- 
  purrr::map( .x = raw$user_review_number, .f = 2 ) %>%
  unlist() %>%
  as.integer()
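
For comparison, the name and the number can also be pulled out in one substitution each. This is only a sketch in base R, and it assumes every string has exactly the "Name (digits)" prefix seen in the sample data. `raw_strings`, `user_name`, and `user_review_number` are standalone names for the example, not the data frame columns above.

```r
raw_strings <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
                 "James (10) - - MEXICO - NOV 22, 2017",
                 "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
                 "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Drop everything from the first "(" onward, then trim the trailing space
user_name <- trimws(sub("\\(.*$", "", raw_strings))

# Keep only the digits captured between the first pair of parentheses
user_review_number <- as.integer(
  sub("^[^(]*\\(([0-9]+)\\).*$", "\\1", raw_strings)
)
```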

Identify raw$user_country

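This column is a bit tricky. Some countries are set off by a comma, others have two-part names (e.g. South Korea), some are abbreviations (e.g. USA), and some include state information (e.g. North Carolina, USA).

There are a hundred ways to do this; the logic I used is laid out step by step in the comments of the code below.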

# first, split by the parentheses
raw$user_country <- strsplit( x = raw$user_country
                                    , split = "\\(|\\) "
                                    , fixed = FALSE 
)
# second, keep only the third elements from each list, then unlist to obtain character vector
raw$user_country <- 
  purrr::map( .x = raw$user_country, .f = 3 ) %>%
  unlist()
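# third, split by the dash marks, either one or two
# note: the two-dash alternative comes first so that "- -" is consumed as a
#       single separator no matter how the regex engine orders alternatives
raw$user_country <- strsplit( raw$user_country
                          , split = "\\- \\-|\\-"
                          , fixed = FALSE
                          )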
# fourth, keep only the second elements from each list, then unlist to obtain character vector
raw$user_country <-
  purrr::map( .x = raw$user_country, .f = 2 ) %>%
  unlist()
# fifth, clear leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
                        , which = "both"
                        )
# sixth, separate states marked by the appearance of a comma
raw$user_country <- strsplit( x = raw$user_country
                         , split = ","
                         , fixed = TRUE
                         ) 
# seventh, make two vectors: 
# one for the first element (which may or may not be a state within a country)
maybe.country <- 
  purrr::map( .x = raw$user_country, .f = 1 ) %>%
  unlist()
# one for the second element (which, when present, is always the country)
# note: a missing second element is returned as NA via `.default`
#       (older purrr releases named this argument `.null`)
definitely.country <-
  purrr::map( .x = raw$user_country, .f = 2, .default = NA ) %>%
  unlist()

# eighth, wherever definitely.country is non-NA,
#         overwrite maybe.country with that value.
# note: this works because both vectors have the same length and order
#       (i.e. the 8th element in maybe.country always lines up
#        with the 8th element in definitely.country )
maybe.country[
  which( !is.na( definitely.country ) )
  ] <- definitely.country[
    which( !is.na( definitely.country )  )
  ]

# ninth, assign the character vector maybe.country to raw$user_country
raw$user_country <- maybe.country

# tenth, remove all leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
                        , which = "both"
                        )
# eleventh, leave 2- and 3-character elements (ISO codes) untouched and
# convert every other element to Title Case. 
# note: This logic comes from the maps::iso3166 data frame, which holds the
#       two- and three-letter ISO 3166 country codes and the matching country names.
raw$user_country <- ifelse( test = nchar( raw$user_country ) == 2 |
                          nchar( raw$user_country ) == 3
                        , yes = raw$user_country
                        , no = stringr::str_to_title( string = raw$user_country ) 
                        )
# twelfth, check that every element is either a 2-character or 3-character
# ISO country code, an ISO country name, the shorter name used in the
# `maps` package, or the sovereign country, by confirming that the number
# of elements meeting these criteria equals the length of raw$user_country
length(
  which( raw$user_country %in%  maps::iso3166$a2 |
         raw$user_country %in% maps::iso3166$a3 |
         raw$user_country %in% maps::iso3166$ISOname |
         raw$user_country %in% maps::iso3166$mapname |
         raw$user_country %in% maps::iso3166$sovereignty
       )
) == length( raw$user_country ) # [1] TRUE
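
The twelve steps can be condensed considerably if you are willing to lean on a few assumptions. The sketch below is base R only and is not the code above. It assumes the country is the last comma-separated piece of the field sitting between the closing parenthesis and the final dash, and that no country name itself contains a dash or a comma. `middle` and `country` are names chosen for the example.

```r
raw_strings <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
                 "James (10) - - MEXICO - NOV 22, 2017",
                 "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
                 "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Drop the leading "Name (n) " part
middle <- sub("^[^)]*\\)[[:space:]]*", "", raw_strings)

# Drop the trailing "- DATE" part (everything from the last dash onward)
middle <- sub("[[:space:]]*-[^-]*$", "", middle)

# Drop any leading dashes and spaces that remain
middle <- sub("^[-[:space:]]+", "", middle)

# Keep only what follows the last comma, if there is one
country <- trimws(sub("^.*,", "", middle))
country
```

A hyphenated name such as GUINEA-BISSAU would break the second step, so the result still deserves a check against maps::iso3166 as in the twelfth step.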

Identify raw$user_review_date

Assuming the user's review date is always the last piece of text stored in each string, here is how to extract the data for this column.

raw$user_review_date <- strsplit( x = raw$user_review_date
                                  , split = "\\-\\s"
                                  , fixed = FALSE
                                  )

# keep only the last element from each list, 
# unlist to obtain a character vector,
# standardize the dates 
# note: assumes no NAs will appear for date
raw$user_review_date <- 
  purrr::map( .x = raw$user_review_date, .f = BBmisc::getLast ) %>%
  unlist() %>%
  as.Date( format = "%b %d, %Y" )
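
Note that %b is matched against month names in the current locale, so the as.Date() call above can return NA on a non-English system. A locale-independent variant, sketched here with base R only, looks the month up in the built-in month.abb constant instead. It assumes the date text is always "MONTH D, YYYY" with an English month name of at least three letters. `date_txt` and the other names are illustrative.

```r
date_txt <- c("DEC 20, 2017", "NOV 22, 2017", "JUNE 1, 2016", "MAR 11, 2015")

# Capture month name, day, and year; each list element is c(full, month, day, year)
parts <- regmatches(
  date_txt,
  regexec("^([A-Za-z]+)[[:space:]]+([0-9]{1,2}),[[:space:]]*([0-9]{4})$", date_txt)
)

# Map the first three letters of the month to its number via month.abb
month <- match(toupper(substr(vapply(parts, `[`, character(1), 2), 1, 3)),
               toupper(month.abb))
day   <- as.integer(vapply(parts, `[`, character(1), 3))
year  <- as.integer(vapply(parts, `[`, character(1), 4))

# Build ISO dates, which as.Date() parses without any locale-dependent format
review_date <- as.Date(sprintf("%04d-%02d-%02d", year, month, day))
review_date
```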

0
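votes

If

  • the country name is always written in uppercase letters
  • it is the first all-uppercase word to appear, i.e. the user name is never written in all uppercase letters and the month field comes after the country field

then we can use the following regular expression to extract the country name: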

"[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

This also works for country names that consist of several parts or that use a dot to mark an abbreviation:

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
         "James (10) - - MEXICO - NOV 22, 2017", 
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015", 
         "Peter (4711) - KINGDOM OF SOUTH NEVERLAND - DEC 24, 2016", 
         "Paul (0815) - REP. OF NORTH NEVERLAND - DEC 31, 2016")
stringr::str_extract(raw, "[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*")
[1] "USA"                        "MEXICO"                     "FINLAND"                   
[4] "SOUTH KOREA"                "KINGDOM OF SOUTH NEVERLAND" "REP. OF NORTH NEVERLAND"

Explanation

"[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

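This looks for a sequence of 2 or more uppercase letters, optionally followed by a dot. That captures country names consisting of a single word.

To capture country names made up of several words, the expression in parentheses looks for any number of subsequences consisting of a whitespace character followed by another uppercase word with an optional dot.

Note that stringr::str_extract() is used so that only the first match is extracted, which avoids capturing the month name.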
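
To see why the first match matters, here is a base R sketch that applies the same expression once with regexpr() (first match only) and once with gregexpr() (all matches). It uses base R instead of stringr purely to stay dependency-free, with [[:space:]] written out in place of \\s. `first_match` and `all_matches` are names chosen for the example.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

pattern <- "[[:upper:]]{2,}[.]?([[:space:]][[:upper:]]{2,}[.]?)*"

# First match per string: the country
first_match <- regmatches(raw, regexpr(pattern, raw))

# All matches per string: the country followed by the month
all_matches <- regmatches(raw, gregexpr(pattern, raw))

first_match
all_matches[[1]]
```

The second result shows the month being picked up as well, and the same thing would happen to the first match if a user name were ever written entirely in capitals.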
