Separating the wheat from the chaff

Question

As part of text cleaning, is it possible to detect and remove junk like this from sentences?

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")

At the moment I am doing something like this:

str_detect(x, pattern = 'Thisisaverylongexample')

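But the more I look through my data frame, the more sentences I find with this kind of junk. How can I use something like a regex to detect and remove such junk, or the rows that contain it?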

Answer 1

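If the 'junk' can be detected by its unusual length, you can define a rule accordingly. For example, if you want to remove words of 10 or more characters, this extracts them: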

library(stringr)
str_extract_all(x, "\\b\\w{10,}\\b")
[[1]]
[1] "Thisisaverylongexample" "removeitnow"           

[[2]]
[1] "thisisjustjunk"
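
Since the question also asks about dropping whole rows, the same pattern can serve as a filter. The following is a minimal sketch in base R, not part of the original answer. The third example string and the names has_junk and df are illustrative assumptions, and so is the idea that a 10-character threshold suits the data: legitimate long words such as "understanding" would be flagged too.

```r
x <- c("Thisisaverylongexample and I was to removeitnow",
       "thisisjustjunk but I do I remove it",
       "a perfectly fine line")  # extra element added for illustration

# TRUE where an element contains at least one run of 10+ word characters
has_junk <- grepl("\\b\\w{10,}\\b", x)

# Keep only the elements without such a run
x[!has_junk]

# The same test can index the rows of a data frame (df is hypothetical)
df <- data.frame(id = seq_along(x), text = x, stringsAsFactors = FALSE)
df[!grepl("\\b\\w{10,}\\b", df$text), ]
```

Raising or lowering the number in {10,} trades missed junk against falsely flagged real words, so it is worth checking the extracted matches before filtering.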

And this strips the long words out of the strings:

trimws(gsub("\\b\\w{10,}\\b", "", x))
[1] "and I was to"         "but I do I remove it"
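
In the example data the junk sits at the start or end of each string, so trimws() is enough. When a long word sits in the middle of a sentence, removing it leaves two adjacent spaces. A small sketch of one way to handle that, assuming it is acceptable to collapse every run of whitespace to a single space (y and cleaned are illustrative names):

```r
y <- "please drop Thisisaverylongexample from here"

# Remove the long word; this leaves a double space where it used to be
cleaned <- gsub("\\b\\w{10,}\\b", "", y)

# Collapse runs of whitespace to one space, then trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```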

Data:

x <- c("Thisisaverylongexample and I was to removeitnow", "thisisjustjunk but I do I remove it")