Text mining in R: counting 2-3 word phrases

Question  Votes: 0  Answers: 1

I found a very useful piece of code on Stack Overflow, in "Finding 2 & 3 word Phrases Using R TM Package" (credit to @patrick perry), that shows how often 2- and 3-word phrases occur in a corpus:

library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
##    term             count support
## 1  of the             336       1
## 2  the scarecrow      208       1
## 3  to the             185       1
## 4  and the            166       1
## 5  said the           152       1
## 6  in the             147       1
## 7  the lion           141       1
## 8  the tin            123       1
## 9  the tin woodman    114       1
## 10 tin woodman        114       1
## 11 i am                84       1
## 12 it was              69       1
## 13 in a                64       1
## 14 the great           63       1
## 15 the wicked          61       1
## 16 wicked witch        60       1
## 17 at the              59       1
## 18 the little          59       1
## 19 the wicked witch    58       1
## 20 back to             57       1
## ⋮  (52511 rows total)

How do you ensure that the frequency counts for "the tin woodman" or "tin woodman" do not include the phrase "the tin"?

Thanks

r text-mining
1 Answer
0
votes

Removing stop words can strip out the noise in the data that causes issues like the one above:

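library(tm)
library(corpus)
library(dplyr)
# gutenberg_corpus() returns a data frame; pass only its text column to tm
corpus <- Corpus(VectorSource(as.character(gutenberg_corpus(55)$text)))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase so stop words match
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop "the", "of", ...
# 20 most frequent 2- and 3-word phrases
term_stats(corpus, ngrams = 2:3) %>% arrange(desc(count)) %>% head(20)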
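
To see what the stop-word step does to the phrase counts, here is a minimal base-R sketch on a toy sentence. It does not use the tm or corpus packages; the sentence, the three-word stop list and the `ngrams()` helper are all made up for illustration, so treat it as a rough picture of the idea rather than what `term_stats()` does internally.

```r
# Toy text standing in for the book (an assumption for illustration)
txt <- "the tin woodman and the scarecrow met the tin woodman of the forest"

# Tiny illustrative stop-word list; a real run would use a fuller list
stop_words <- c("the", "and", "of")

# Lowercase, split on non-letters, then drop empty strings and stop words
tokens <- strsplit(tolower(txt), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Hypothetical helper: all runs of n consecutive tokens, joined by spaces
ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count 2- and 3-word phrases together, most frequent first
counts <- sort(table(c(ngrams(tokens, 2), ngrams(tokens, 3))),
               decreasing = TRUE)
print(head(counts, 5))
```

In this toy input "tin woodman" is counted twice and no phrase containing "the" survives, so "the tin" and "the tin woodman" no longer show up as separate rows. One side effect to keep in mind: dropping stop words before building n-grams joins words that were not adjacent in the original text (here "woodman scarecrow"), so some of the resulting phrases never actually occurred.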