Pattern matching fails due to apparent encoding problems in scraped text

Question

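Summary edit for those arriving from Google (if I may): grepl and pattern matching were failing on seemingly identical strings. The suspected problem was irregular encoding of the scraped text. The real problem was an invisible extra something in the whitespace that does not show up in nchar. The solution is to remove all whitespace with gsub and a regex before attempting the pattern match. The solution was found by smingerson.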

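Original question: I want to do topic modeling on a series of online sermons that I scraped with rvest.

I am using pattern matching, grepl in particular, to clean and organize the text.

The problem is that grepl fails to match seemingly identical strings. The scraped text is a mix of "unknown" and "UTF-8" encodings. Functions such as Encoding, enc2native, enc2utf8, and iconv don't seem to help, nor does adjusting grepl arguments such as perl = TRUE or useBytes = TRUE. (Not that I fully understand all of these functions.)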

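There seem to be several related posts: (1) Troubles with encoding, pattern matching and noisy texts in R, (2) https://community.rstudio.com/t/enconding-solution-for-linux-and-windows-10/2055, (3) R on Windows: character encoding hell, and others.

Regarding #1, I am working in English rather than Swedish, so I don't see how changing the locale would help. I also can't tell which part of the code attributed to Wiktor, in the answer the original poster provided, solves the problem.

Regarding #2, as you will see below, I tried making changes with Encoding(), without success.

I include #3 as an example of the many posts that discuss foreign languages, whereas I am working in English. They also discuss the difficulty of encoding on Windows 10 and in RStudio, in case that is relevant.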

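Here is my attempt at reproducible code. Unfortunately, the error seems to come from my original file and cannot be reproduced by copying and pasting what follows. The differing charToRaw results under Edit #1 demonstrate this. Per the comments, I have added a file on GitHub that exhibits the error when loaded into a session. I have also added the library calls, and removed some whitespace from the middle of "scrapedtitle", because the StackOverflow formatting otherwise introduced a line break in the middle of the "author" variable. At the end of Edit #2, I also try to create a way to copy and paste the troublesome encoding using rawToChar, but I can't coerce the values to "raw". In Edit #3, I discuss RStudio's Encoding options and describe how I saved different parts of the scrape with different Encoding settings, but unfortunately I did not keep track of which parts. I had hoped that information would be recoverable and reversible, but apparently it is not.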

#Library calls
library(topicmodels)
library(LDAvis)
library(tm)
library(dplyr)
library(magrittr)
library(stringr)

#The scraped title of a sermon
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

#Extract the author from the title
author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))

#Elsewhere, identify the author from another scraped list of sermons and authors:
scrapedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

#attempted grepl: 
which(grepl(author, scrapedvector)) # only returns 2 when it should return 2 and 5

#Exploring:
typed <-"By Elder Brook P. Hales" #This is typed in from my keyboard

typed == scrapedvector[5] # FALSE unexpectedly

grepl(author, typed) #TRUE as you'd expect
grepl(author, scrapedvector[5]) # FALSE unexpectedly

#Checking encoding
Encoding(scrapedvector) #[1] "unknown" "unknown" "unknown" "unknown" "UTF-8"
Encoding(typed) #[1] "unknown"
Encoding(author) #[1] "unknown"

#Attempting to change the encoding:
Encoding(scrapedvector) <- "UTF-8"
Encoding(scrapedvector) # [1] "unknown" "unknown" "unknown" "unknown" "UTF-8" # No change
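One likely reason the assignment above appears to do nothing: `Encoding<-` only sets a declared-encoding mark and never converts any bytes, and R does not mark pure-ASCII strings, so they keep reporting "unknown". A minimal sketch with made-up strings (`x`, `before`, and `after` are illustrative names, not part of the scraped data):

```r
# Two illustrative strings: one pure ASCII, one with a non-ASCII character.
x <- c("Brook P. Hales", "caf\u00e9")
before <- Encoding(x)

# Declare both elements to be UTF-8. This changes only the mark, not the bytes.
Encoding(x) <- "UTF-8"
after <- Encoding(x)

# The ASCII element still reports "unknown"; the non-ASCII one reports "UTF-8".
print(after)
```

This also suggests why Encoding() never shows the long list of encodings offered by an editor: it reports a small set of marks, not the name of whatever encoding a file was saved in.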

Edit #1:

# Adding charToRaw information: 
charToRaw(typed)
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
charToRaw(scrapedvector[5]) 
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73
# There's an extra "c2 a0" in the scraped version at the 15th position.

# Results from pasting the vector back into R from this stackoverflow post:
repastedvector <- c("Answers to Prayer", "Brook P. Hales", "Church Auditing Department Report, 2018", "Russell M. Nelson", "By Elder Brook P. Hales")

charToRaw(repastedvector[5])
# [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b 20 50 2e 20 48 61 6c 65 73
# The repasted string is identical to what I typed, but not to what I saved after scraping.

# Posting this because it is mentioned in other posts
Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
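The bytes `c2 a0` are the UTF-8 encoding of U+00A0, the no-break space, which prints like an ordinary space. A small sketch of how one might locate such characters follows. The string is rebuilt here with a `\u00a0` escape as a stand-in for the scraped value, and the variable names are illustrative:

```r
# Stand-in for the scraped value: a no-break space sits between "Brook" and "P.".
scraped <- "By Elder Brook\u00a0P. Hales"

# One integer code point per character.
codes <- utf8ToInt(scraped)

# Positions of characters outside the ASCII range.
non_ascii_pos <- which(codes > 127L)
print(non_ascii_pos)

# The offending code points in U+XXXX notation.
print(sprintf("U+%04X", codes[non_ascii_pos]))
```

Because the no-break space takes the place of a regular space rather than being added next to one, the character count is the same as for the typed string, which is consistent with nchar giving no hint.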

Edit #2

A sample of the file is available on GitHub: https://github.com/baprisbrey/stackoverflow/releases/tag/vA0 The file is scrapedTalk2.rds.

This is what I see when I load this file into my RStudio session:

scrapedTalk <- readRDS("scrapedTalk2.rds")
grepl(author, scrapedTalk) %>% which() # Result is 8.  It should be 8 and 73

scrapedvector2 <- scrapedTalk[c(7,8,18,72,73)] # This is the same as the scrapedvector from above 

Encoding(scrapedTalk)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [12] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [23] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [34] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [45] "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "unknown" "unknown"
 [56] "UTF-8"   "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [67] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [78] "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
 [89] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"  
[100] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "unknown"
[111] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"  
[122] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown"
[133] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"

scrapedTalk[73] == "By Elder Brook P. Hales" # FALSE, which is unexpected.

charToRaw(scrapedTalk[73]) # for reference
 [1] 42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73

# Can I create the troubled encoding by pasting the charToRaw result above?
# Note: An unintentional newline character ("\n") may have been introduced below by the StackOverflow formatting because of the length of the string. It should be removed.
troubleString <-  "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73" %>%
                   strsplit(. ,split=" ") %>%  # so far so good
                   unlist %>%                  # no troubles
                   as.raw %>%                  # NA's and 0's introduced
                   rawToChar                   # failure!
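The pipeline above fails because `as.raw` coerces its input as ordinary numbers, so hex text such as "6c" or "c2" is not parsed as hexadecimal. One possible workaround, offered only as a sketch, is to parse the hex digits with `strtoi(..., base = 16L)` first and then mark the result as UTF-8 (`hex` and `bytes` are illustrative names):

```r
hex <- "42 79 20 45 6c 64 65 72 20 42 72 6f 6f 6b c2 a0 50 2e 20 48 61 6c 65 73"

# Split into two-digit tokens, parse each as base 16, then convert to raw.
bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))

# Turn the bytes into a string and declare that they are UTF-8.
troubleString <- rawToChar(bytes)
Encoding(troubleString) <- "UTF-8"

# The rebuilt string carries the c2 a0 bytes, so it differs from the typed one.
print(troubleString == "By Elder Brook P. Hales")  # FALSE
```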

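Edit #3

Since the problem appears to be encoding, I will discuss the RStudio encoding options. Under File >> Save with Encoding in RStudio, there is a menu with multiple encoding options. I don't know the difference between all of them. The first question is, why doesn't Encoding() show all of these options? Surely the "unknown" bucket covers most of them. Second, because of my difficulties with encoding, I experimented with the Encoding option, and it is likely that I saved some of the scraped material using one of the other Encoding options. However, I don't remember which parts of the scraped material I tried that with. I recognize that this introduces ambiguity into the question. I would like to know why I can't recover the right encoding and convert it to another, but mainly why I can't get grepl to work.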

Answer 1

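There is some sort of uncooperative space in the values. On further inspection, it looks like one of them contains an extra space character, although this is not apparent when it is printed. The first bit below shows how to replace multiple spaces with a single space. The second shows how to remove all space characters when making the comparison.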

Solution 1

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
# Replace multiple spaces with a single space.
condensedAuthor <- gsub("\\s+", " ", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("\\s+", " ", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk)
scrapedTalk[indices]
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

Solution 2

library(tidyverse)
scrapedtitle <- "Answers to Prayer\n\t\t\t\t\t\n\t\t\t\t\t\tBrook P. Hales"

author <- scrapedtitle %>% substr(x=.,start=regexpr("\t[[:alpha:]]", .)+1, stop = nchar(.))
condensedAuthor <- gsub("[[:space:]]", "", author)

scrapedTalk <- readRDS("scrapedTalk2.rds")
condensedTalk <- gsub("[[:space:]]", "", scrapedTalk)
indices <- grepl(condensedAuthor, condensedTalk) # `TRUE` at positions 8 and 73
scrapedTalk[indices] # Get the corresponding values from the original vector.
# [1] "Brook P. Hales"          "By Elder Brook P. Hales"

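Edit: I had switched to the regex representation \\s+, and in doing so ended up replacing with "s" instead of " ". I have updated the code to use " ".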
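Whether `\\s` or `[[:space:]]` matches a no-break space may depend on the platform and the regex engine, so a more explicit variant is to replace U+00A0 directly. The following is only a sketch: `normalize_spaces` is a hypothetical helper, and the vector is rebuilt with a `\u00a0` escape rather than read from the .rds file. It also passes `fixed = TRUE` to `grepl` so that the periods in the author's name are not treated as regex wildcards.

```r
# Hypothetical helper: turn every no-break space (U+00A0) into a regular space.
normalize_spaces <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)

# Stand-in for the scraped vector; element 5 contains a no-break space.
scrapedvector <- c("Answers to Prayer", "Brook P. Hales",
                   "Church Auditing Department Report, 2018",
                   "Russell M. Nelson", "By Elder Brook\u00a0P. Hales")
author <- "Brook P. Hales"

# Match the author literally against the normalized text.
hits <- which(grepl(author, normalize_spaces(scrapedvector), fixed = TRUE))
print(hits)  # 2 5
```

Unlike stripping all whitespace, this keeps ordinary spaces intact, so the cleaned text can be reused for later steps rather than only for the comparison.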
