Removing stop words in R

Question

I have a data frame with the following structure:

Note.Reco   Review          Review.clean.lower
10          Good Products   good products
9           Nice film       nice film
....        ....

The first column is the film's rating, the second is the customer's review, and the third is the review converted to lowercase.

I am now trying to remove the stop words with this:

Data_clean$Raison.Reco.clean1 <- Corpus(VectorSource(Data_clean$Review.clean.lower))
Data_clean$Review.clean.lower1 <- tm_map(Data_clean$Review.clean.lower1, removeWords, stopwords("english"))

But RStudio crashes.

Can you help me solve this problem?

Thank you.

Edit:

#clean up
# remove grammar/punctuation
Data_clean$Review.clean.lower <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))

Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

Data_clean <- tm_map(Data_corpus,  removeWords, stopwords("french"))

train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]

So I get an error when I run the last two statements.

Answer 1

Try the following. You can do the cleaning on the corpus rather than directly on the column.

Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

Data_clean <- tm_map(Data_corpus, removeWords, stopwords("english"))
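Note that once `Data_clean` holds the result of `tm_map`, it is a corpus and no longer a data frame, which is the likely reason row subsetting such as `Data_clean[train.index, ]` fails. A minimal sketch of one way around this, assuming the tm package is installed: keep the corpus under its own name and copy the cleaned text back into a new column. The sample data, the `Review.nostop` column name, and the index values are hypothetical, and `VCorpus` is used here in place of `Corpus` as a conservative choice.

```r
library(tm)

# Hypothetical stand-in for the asker's data frame.
Data_clean <- data.frame(
  Note.Reco = c(10, 9),
  Review = c("Good Products", "This is a nice film"),
  stringsAsFactors = FALSE
)
Data_clean$Review.clean.lower <- tolower(gsub("[[:punct:]0-9]", " ", Data_clean$Review))

# Build the corpus under its own name so the data frame is not overwritten.
Data_corpus <- VCorpus(VectorSource(Data_clean$Review.clean.lower))
Data_corpus <- tm_map(Data_corpus, removeWords, stopwords("english"))
Data_corpus <- tm_map(Data_corpus, stripWhitespace)

# Copy the cleaned text back into the data frame, one string per row.
# Review.nostop is a hypothetical column name.
Data_clean$Review.nostop <- unname(trimws(sapply(Data_corpus, as.character)))

# Row subsetting works because Data_clean is still a data frame.
train.index <- 1  # hypothetical split
test.index <- 2
train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]
```

With this layout the corpus is only an intermediate object, and the train/test split operates on the data frame as before.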

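Edit: Since you mentioned that you want to be able to access the output after removing the stop words, try the following instead of the above: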

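library(tm)

stopWords <- stopwords("en")

Data_clean$Review.clean.lower <- as.character(Data_clean$Review.clean.lower)
# %nin% is the negation of %in%
`%nin%` <- Negate(`%in%`)
# vapply returns a character vector, so the new column is not a list column
Data_clean$Review.clean.lower1 <- vapply(Data_clean$Review.clean.lower, function(x) {
  chk <- unlist(strsplit(x, " "))   # split the review into words
  p <- chk[chk %nin% stopWords]     # keep only the words that are not stop words
  paste(p, collapse = " ")          # join the remaining words back together
}, character(1), USE.NAMES = FALSE)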

Sample output of the code above:

>  print(Data_clean)
>       note Note.Reco.Review Review.clean.lower Review.clean.lower1
>     1   10    Good Products      good products       good products
>     2    9        Nice film     is a nice film           nice film

Also check the following: R remove stopwords from a character vector using %in%
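As a variation on the word-by-word filtering above, here is a minimal base R sketch that removes stop words with a single regular expression instead of splitting each review. It is a different technique from the `%in%` approach, not a replacement for it. The short `stop_words` vector is an illustrative assumption; in practice you would probably pass `tm::stopwords("en")`, and entries containing apostrophes may need extra care with word boundaries.

```r
# Small illustrative stop word list (an assumption, far shorter than tm's list).
stop_words <- c("is", "a", "the", "this")

reviews <- c("good products", "this is a nice film")

# One alternation pattern with word boundaries so only whole words match.
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Remove the stop words from every review in one vectorized call.
no_stop <- gsub(pattern, "", reviews, perl = TRUE)

# Collapse leftover runs of whitespace and trim both ends.
no_stop <- trimws(gsub("\\s+", " ", no_stop))

print(no_stop)
```

Because `gsub` is vectorized, the result is a plain character vector that can be assigned directly to a data frame column.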
