How to resolve the gsub error when using removeWords in R

Question

I have a data frame containing tweets. I am trying to remove stop words, so I used:

stopWords <- stopwords("en")
tweets_sample$text<-removeWords(tweets_sample$text,stopWords)

However, I got:

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  : 
input string 1 is invalid UTF-8

What causes this error?

Answer 1

This looks like an encoding problem. Try Encoding(tweets_sample$text) <- "UTF-8", then call removeWords again.
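To see whether the tip above is enough, it can help to check which strings are actually invalid. The sketch below uses a hypothetical vector tweets_text as a stand-in for tweets_sample$text (the sample strings are made up) and relies on validUTF8(), which is in base R from version 3.3.0. Note that Encoding<- only marks the strings as UTF-8 and does not convert any bytes, so entries that hold genuinely invalid bytes will presumably still trip gsub.

```r
# Hypothetical stand-in for tweets_sample$text; the second element
# contains a stray byte (\xE7) that is not valid UTF-8 on its own.
tweets_text <- c("a clean tweet", "te\xE7xt with a bad byte")

# Declare the strings as UTF-8. This only sets the encoding mark;
# the underlying bytes are left untouched.
Encoding(tweets_text) <- "UTF-8"

# validUTF8() returns FALSE for elements that are not valid UTF-8,
# i.e. the ones gsub() would complain about.
bad <- !validUTF8(tweets_text)
which(bad)
```

If which(bad) returns any positions, those rows need their bytes cleaned or converted rather than just relabelled.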

Answer 2

It looks like your first string contains invalid UTF-8. You can use iconv to strip every byte in the text that cannot be converted:

text <- "te\xE7xt"          # sample string with one invalid byte
Encoding(text) <- "UTF-8"
iconv(text, "UTF-8", "UTF-8", sub = "")   # drop any byte that is not valid UTF-8

"text"
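Applied to the data frame from the question, the same idea might look like the sketch below. The tweets_sample built here is a made-up two-row stand-in, and the removeWords step is guarded so it only runs if the tm package is installed. The handling of sub can differ between iconv implementations, so treat this as an illustration rather than a guaranteed fix on every platform.

```r
# Hypothetical stand-in for the asker's data frame.
tweets_sample <- data.frame(
  text = c("this is a tweet", "te\xE7xt with a bad byte"),
  stringsAsFactors = FALSE
)

# Remove every byte that is not valid UTF-8 (sub = "" deletes it
# instead of substituting a replacement).
tweets_sample$text <- iconv(tweets_sample$text,
                            from = "UTF-8", to = "UTF-8", sub = "")

# Check that no invalid strings remain before calling removeWords.
all_valid <- all(validUTF8(tweets_sample$text))

# Only attempt stop word removal when tm is available.
if (all_valid && requireNamespace("tm", quietly = TRUE)) {
  tweets_sample$text <- tm::removeWords(tweets_sample$text,
                                        tm::stopwords("en"))
}
```

If the tweets were originally in another encoding such as Latin-1, passing that as the from argument instead would keep the accented characters rather than deleting them.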