Calculating the frequency of multi-word terms in R?

Problem description

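I am trying to count the frequency of multi-word terms in a given text. For example, consider the following text: "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology". I want the number of times the term "Environmental Research" occurs in the text. Here is the code I tried.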

library(tm)
#Reading the data
text = readLines(file.choose())
text1 = Corpus(VectorSource(text))

#Cleaning the data
text1 = tm_map(text1, content_transformer(tolower))
text1 = tm_map(text1, removePunctuation)
text1 = tm_map(text1, removeNumbers)
text1 = tm_map(text1, stripWhitespace)
text1 = tm_map(text1, removeWords, stopwords("english"))

#Making a document matrix
dtm = TermDocumentMatrix(text1)
m11 = as.matrix(dtm)
freq11 = sort(rowSums(m11), decreasing=TRUE)
d11 = data.frame(word=names(freq11), freq=freq11)
head(d11,9)

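However, this code gives the frequency of each word separately. How can I instead get the number of times "Environmental Research" occurs together in the text? Thanks!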

r tm word-frequency
1 Answer

If you already have a list of multi-word terms and want to count how often they occur in a text, you can use str_extract_all:

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

library(stringr)
str_extract_all(text, "[Ee]nvironmental [Rr]esearch")
[[1]]
[1] "Environmental Research" "Environmental Research" "Environmental Research"

If you want to know how often the multi-word term occurs, you can do this:

length(unlist(str_extract_all(text, "[Ee]nvironmental [Rr]esearch")))
[1] 3
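As a shorter alternative for the single-term case, stringr also provides str_count, which returns the number of matches directly without extracting them first. The sketch below reuses the text and pattern from above; the variable name n_matches is only illustrative.

```r
library(stringr)

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# str_count() returns the number of non-overlapping matches of the pattern
n_matches <- str_count(text, "[Ee]nvironmental [Rr]esearch")
n_matches
```

This should give the same count as the length(unlist(...)) call, since both are based on the same regular-expression matches.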

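If you are interested in extracting all multi-word terms at once, you can proceed as follows:

First define a vector containing all the multi-word terms: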

multiwords <- c("[Ee]nvironmental [Rr]esearch", "study science energy")

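Then use paste0 to collapse them into a single pattern string in which the terms are alternatives separated by "|", and pass that string as the pattern to str_extract_all: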

str_extract_all(text, paste0(multiwords, collapse = "|"))
[[1]]
[1] "Environmental Research" "Environmental Research" "Environmental Research" "study science energy"
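To make the collapsing step concrete, here is a minimal sketch that prints the combined pattern and performs the same extraction with base R only (gregexpr plus regmatches), in case stringr is not available. The names pattern and matches are illustrative and not part of the code above.

```r
text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"
multiwords <- c("[Ee]nvironmental [Rr]esearch", "study science energy")

# Collapse the terms into one regular expression with "|" between alternatives
pattern <- paste0(multiwords, collapse = "|")
pattern

# Base R equivalent of str_extract_all(): locate all matches, then pull them out
matches <- regmatches(text, gregexpr(pattern, text))[[1]]
matches
```

Because the alternatives are tried left to right at each position, it is probably safest to list longer terms before shorter ones when one term is a prefix of another.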

To get the frequency of each multi-word term, you can use table:

table(str_extract_all(text, paste0(multiwords, collapse = "|")))

Environmental Research   study science energy 
                     3                      1
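One limitation of the table approach is that terms with no match do not appear in the result, and differently cased matches are tallied as separate entries. A possible variation, sketched below, counts each term on its own with str_count and a case-insensitive regex. The vector terms (including the made-up term "marine biology", which does not occur in the text) and the name counts are assumptions for illustration only.

```r
library(stringr)

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# Hypothetical list of terms; "marine biology" is deliberately absent from the text
terms <- c("environmental research", "study science energy", "marine biology")

# str_count() is vectorised over the pattern, so this yields one count per term;
# regex(..., ignore_case = TRUE) makes the match case-insensitive
counts <- setNames(str_count(text, regex(terms, ignore_case = TRUE)), terms)
counts
```

The result is a named integer vector with one entry per term, including a zero for any term that never occurs.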