I have a Corpus (tm package) containing a collection of 1,300 different text documents [Content: documents: 1,300].
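My goal now is to find the frequency of a specific list of words in each document. For example, if my word list contains the words "january, february, march, ...", I want to count how often each document mentions these words.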
Example:
Text 1: I like going on holiday in january and not in february.
Text 2: I went on a holiday in march.
Text 3: I like going on vacation.
The result should look like this:
Text 1: 2
Text 2: 1
Text 3: 0
I tried the following code:
library(quanteda)
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks)
dict1 <- dictionary(list(c("january", "february", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
tail(dict_dtm2)
This code was suggested in another thread, but it does not work for me: I get an error saying that it only works on text or corpus elements.
How can I search for my word list using my existing Corpus from the tm package in R?
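To make your quanteda code work, you first have to convert your tm VCorpus object x into a quanteda corpus with corpus(). Two other minor issues also need fixing: dictionary() needs a named list, and because the tokens are stemmed, the dictionary entries have to match the stems (e.g. "januari" rather than "january").
library(tm)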
library(quanteda)
## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
"I went on a holiday in march.",
"I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"
### tm VCorpus object to Quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus" "character"
### continue with tokenization and stemming
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks)
# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
dict_dtm2
#> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 7 docvars.
#>        features
#> docs    months _unmatched
#>   text1      2         10
#>   text2      1          7
#>   text3      0          6
Created on 2023-09-02 with reprex v2.0.2
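If you want the per-document counts in the "Text 1: 2" shape from the question rather than as a dfm, one option is to convert the lookup result to a data frame. This is only a minimal sketch: it rebuilds the three example documents from a named character vector instead of a tm VCorpus (an assumption made to keep it self-contained), and the names `counts` and `result` are illustrative.

```r
library(quanteda)

# Example documents from the question; the vector names become the document names
docs <- c(text1 = "I like going on holiday in january and not in february.",
          text2 = "I went on a holiday in march.",
          text3 = "I like going on vacation.")

# Tokenize and stem, as in the answer above
toks <- tokens_wordstem(tokens(corpus(docs)))

# Glob patterns so that the stemmed forms "januari" / "februari" still match
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))

# Count dictionary hits per document (no "_unmatched" column this time)
counts <- dfm_lookup(dfm(toks), dict1)

# One row per document: a doc_id column plus one column per dictionary key
result <- convert(counts, to = "data.frame")
result
```

The `months` column of `result` then holds the count for each document, which can be merged with other document-level data if needed.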
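As a possible alternative that stays entirely within tm, the word list can be passed as the `dictionary` control option of DocumentTermMatrix(), which restricts the matrix to those terms. This is a sketch under assumptions rather than part of the answer above: it skips stemming, turns on `removePunctuation` so that "february." is counted as "february", and the names `wordlist` and `counts` are illustrative.

```r
library(tm)

# Example documents from the question, wrapped in a tm VCorpus
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))

# Hypothetical word list; extend with the remaining months as needed
wordlist <- c("january", "february", "march")

# Keep only the terms in wordlist; punctuation is removed so that a
# trailing period does not prevent a match
dtm <- DocumentTermMatrix(x, control = list(dictionary = wordlist,
                                            removePunctuation = TRUE))

# Sum across the word-list columns to get one total per document
counts <- rowSums(as.matrix(dtm))
counts
```

Since no stemming is applied here, the word list has to contain the exact (lower-cased) forms that occur in the texts.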