Pattern matching fails because of an apparent encoding problem in scraped text

Question (votes: 2, answers: 1)

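Summary edit for people arriving from Google (if I may): grepl and pattern matching fail on seemingly identical strings. The suspected problem was inconsistent encoding of the scraped text. The real problem was an invisible extra something in one of the spaces, which does not show up in "nchar". The solution is to remove all whitespace with gsub and a regex before attempting the pattern match. The solution was found by smingerson.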

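Original question: I want to do topic modeling on a series of online sermons that I scraped using rvest.

I am using pattern matching, grepl in particular, to clean and organize the text.

The problem is that grepl fails to match seemingly identical strings. The scraped text is a mix of "unknown" and "UTF-8" encodings. Functions such as Encoding, enc2native, enc2utf8, and iconv do not seem to help, and neither does adjusting grepl arguments such as perl = TRUE or useBytes = TRUE. (Not that I fully understand all of these functions.)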

There seem to be several related posts: (1) Troubles with encoding, pattern matching and noisy texts in R; (2) https://community.rstudio.com/t/enconding-solution-for-linux-and-windows-10/2055; (3) R on Windows: character encoding hell; and others.

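Regarding #1, I am working in English rather than Swedish, so I don't see how changing the locale would help. I also can't tell which part of the code credited to Wiktor, in the answer provided by the original poster, fixes the problem.

Regarding #2, as you will see below, I tried to change the encoding with Encoding(), without success.

I include #3 as an example of the many posts that discuss foreign languages, whereas I am working in English. They also discuss the difficulty of encodings on Windows 10 and in RStudio, in case that is relevant.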

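Here is my attempt at reproducible code. Unfortunately, the error seems to come from my original file and cannot be reproduced by copying and pasting what follows. The differing charToRaw results under Edit #1 demonstrate this. Following the comments, I added a file on GitHub that exhibits the error when loaded into a session. I also added the library calls and removed some whitespace from the middle of "scrapedtitle", because the Stack Overflow formatting otherwise introduces a newline in the middle of the "author" variable. At the end of Edit #2, I also try to create a way to copy and paste the problematic encoding using rawToChar, but cannot coerce to "raw". In Edit #3, I discuss RStudio's encoding options and describe how I saved different portions of the scrape with different encoding settings; unfortunately, I did not keep track of which portions. I had hoped this information would be recoverable and reversible, but that does not seem to be the case.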

#Library calls
library(topicmodels)
library(LDAvis)
library(tm)
library(dplyr)
library(magrittr)
library(stringr)

#The scraped title of a sermon
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

#Extract the author from the title
author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))

#Elsewhere, identify the author from another scraped list of sermons and authors:
scrapedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

#attempted grepl: 
which(grepl(author, scrapedvector)) # only returns 2 when it should return 2 and 5

#Exploring:
typed <-"By Elder Brook P. Hales" #This is typed in from my keyboard

typed == scrapedvector[5] # FALSE unexpectedly

grepl(author, typed) #TRUE as you'd expect
grepl(author, scrapedvector[5]) # FALSE unexpectedly

#Checking encoding
Encoding(scrapedvector) #[1] "unknown" "unknown" "unknown" "unknown" "UTF-8"
Encoding(typed) #[1] "unknown"
Encoding(author) #[1] "unknown"

#Attempting to change the encoding:
Encoding(scrapedvector) <- "UTF-8"
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8" # No change
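
A short sketch of why the assignment above appears to do nothing: R never marks a pure-ASCII string with a declared encoding, so such strings always report "unknown". A "UTF-8" mark therefore signals that the element contains at least one non-ASCII character. The variable names below are illustrative and not part of the original session.

```r
# Pure-ASCII strings are never marked, even after an explicit assignment.
ascii_only <- "Brook P. Hales"
Encoding(ascii_only) <- "UTF-8"
Encoding(ascii_only)   # "unknown"

# A string containing a non-ASCII character (here U+00A0) carries the mark.
with_nbsp <- "Brook\u00a0P. Hales"
Encoding(with_nbsp)    # "UTF-8"
```

Read this way, the single "UTF-8" entry in the output above is a hint that scrapedvector[5] contains a character outside ASCII.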

Edit #1:

# Adding charToRaw information: 
charToRaw(typed)
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
charToRaw(scrapedvector[5]) 
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# There's an extra "c2 a0" in the scraped version at the 15th position.

# Results from pasting the vector back into R from this stackoverflow post:
repastedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

charToRaw(repastedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
# The repasted string is identical to what I typed, but not to what I saved after scraping.

# Posting this because it is mentioned in other posts
Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
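
The bytes c2 a0 in the scraped string are the UTF-8 encoding of U+00A0, the no-break space. It prints like an ordinary space and counts as one character, which is why the two strings look identical and have the same nchar. The sketch below rebuilds the situation from scratch; the variable names are illustrative.

```r
# U+00A0 (no-break space) is encoded in UTF-8 as the two bytes c2 a0.
nbsp <- "\u00a0"
charToRaw(nbsp)                               # c2 a0

typed_name   <- "By Elder Brook P. Hales"
scraped_name <- paste0("By Elder Brook", nbsp, "P. Hales")

scraped_name == typed_name                    # FALSE
nchar(scraped_name) == nchar(typed_name)      # TRUE: both are 23 characters

# Code point of the 15th character: 160 (0xA0) rather than 32 (0x20).
utf8ToInt(substr(scraped_name, 15, 15))       # 160
```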

Edit #2:

A sample of the file is available on GitHub: https://github.com/baprisbrey/stackoverflow/releases/tag/vA0 The file is scrapedTalk2.rds.

This is what I see when I load this file into an RStudio session:

scrapedTalk <- readRDS("scrapedTalk2.rds")
grepl(author, scrapedTalk) %>% which() # Result is 8.  It should be 8 and 73

scrapedvector2 <- scrapedTalk[c(7,8,18,72,73)] # This is the same as the scrapedvector from above 

Encoding(scrapedTalk)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [12] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [23] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [34] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [45] "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "unknown" "unknown"
 [56] "UTF-8"   "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [67] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [78] "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [89] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"  
[100] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown"
[111] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"  
[122] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown"
[133] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"

scrapedTalk[73] == "By Elder Brook P. Hales" # FALSE, which is unexpected.

charToRaw(scrapedTalk[73]) # for reference
 [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73

# Can I create the troubled encoding by pasting the charToRaw result above?
# Note: the length of the string and the StackOverflow formatting may introduce an unintended newline "\n" character here. It should be removed.
troubleString <-  "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73" %>%
                   strsplit(. ,split=" ") %>%  # so far so good
                   unlist %>%                  # no troubles
                   as.raw %>%                  # NA's and 0's introduced
                   rawToChar                   # failure!
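
One possible way to make the hex dump round-trip, as a sketch: as.raw() coerces character input through its decimal integer value, so "6c" becomes NA (stored as 00) and "42" is read as decimal 42 rather than hex 0x42. Parsing each token with strtoi() in base 16 first avoids this. The names hex_dump, raw_bytes, and rebuilt are illustrative.

```r
# Hex dump copied from the charToRaw() output above.
hex_dump <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73"

# Split into two-character tokens and parse each one as a base-16 integer.
hex_tokens <- strsplit(hex_dump, " ", fixed = TRUE)[[1]]
raw_bytes  <- as.raw(strtoi(hex_tokens, base = 16L))

# Convert the bytes back to a string and declare them to be UTF-8.
rebuilt <- rawToChar(raw_bytes)
Encoding(rebuilt) <- "UTF-8"

rebuilt == "By Elder Brook P. Hales"   # FALSE, reproducing the mismatch
```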

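Edit #3: Because the problem appears to be one of encoding, I will discuss the RStudio encoding options. Under File >> Save with Encoding, RStudio presents a menu of encoding options.

There are multiple encoding options, and I don't know what the differences between them all are. My first question is why Encoding() does not report all of these options. Presumably the "unknown" bucket covers most of them. Second, because of my difficulties with encoding, I experimented with the Save with Encoding option, and I very likely saved some of the scraped material using one of the other encodings. However, I do not remember which portion of the scraped material I tried this on. I recognize that this adds ambiguity to the question. I would like to know why I cannot recover the correct encoding or convert to another one, but mainly why I cannot get grepl to work.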

r encoding pattern-matching grepl
1 answer
Votes: 1

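There is some kind of uncooperative whitespace in the values. On closer inspection, one of them appears to contain an extra whitespace character, although this is not apparent when it is printed. The first snippet below shows how to replace runs of whitespace with a single space. The second shows how to remove all whitespace characters before making the comparison.

Solution 1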

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
# Replace multiple spaces with a single space.
condensedAuthor <- gsub("\\s+", " ", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("\\s+", " ", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk)
scrapedTalk[indices]
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

Solution 2

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
condensedAuthor <- gsub("[[:space:]]", "", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("[[:space:]]", "", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk) # TRUE at positions 8 and 73
scrapedTalk[indices] # Get the corresponding values from the original vector.
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

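Edit: When I swapped \\s+ for the bracket-expression form of the regex, I ended up with "s" as the replacement instead of "". I have updated the code to use "".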
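
As a complement to the two solutions, here is a sketch that targets the no-break space (U+00A0, the c2 a0 bytes seen in the question) explicitly. Whether \\s or [[:space:]] matches U+00A0 may depend on the platform and regex engine, so replacing the character by name with fixed = TRUE is one way to avoid relying on that. The vector scraped_lines is made-up sample data, not the contents of scrapedTalk2.rds.

```r
nbsp <- "\u00a0"

# Made-up sample data: the second element hides a no-break space.
scraped_lines <- c("Brook P. Hales",
                   paste0("By Elder Brook", nbsp, "P. Hales"),
                   "Russell M. Nelson")
author_name <- "Brook P. Hales"

# Before normalizing, only the first element matches.
which(grepl(author_name, scraped_lines, fixed = TRUE))   # 1

# Replace every no-break space with an ordinary space, then match again.
normalized <- gsub(nbsp, " ", scraped_lines, fixed = TRUE)
which(grepl(author_name, normalized, fixed = TRUE))      # 1 2
```

Using fixed = TRUE for the match also keeps the "." in the author name from being treated as a regex wildcard.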
