Searching for specific words in a corpus with R (tm package)

Question

I have a `Corpus` (`tm` package) containing a collection of 1,300 different text documents [Content: documents: 1,300].

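My goal is to find, for each document, how often the words from a specific word list occur. For example, if my word list contains the words "january, february, march, ...", I want to count how many times each document mentions any of these words.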

Example: 
Text 1: I like going on holiday in january and not in february.
Text 2: I went on a holiday in march.
Text 3: I like going on vacation.

The result should look like this:

Text 1: 2 
Text 2: 1
Text 3: 0

I tried the following code:

```r
library(quanteda)
toks <- tokens(x)
toks <- tokens_wordstem(toks)

dtm <- dfm(toks)

dict1 <- dictionary(list(c("january", "february", "march")))

dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
tail(dict_dtm2)
```

This code was suggested in another thread, but it does not work with my data: I get an error saying that it only works on text or corpus elements.

How can I search for my word list in my existing `Corpus` from the `tm` package in R?

Answer 1

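To get your quanteda code working, you first have to convert your tm VCorpus object to a quanteda corpus, and fix a couple of other minor issues:

- `dictionary()` needs a named list.
- The English stemmer returns "januari", "februari" instead of "january", "february".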

```r
library(tm)
library(quanteda)

## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"

### tm VCorpus object to quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus"    "character"

### continue with tokenization and stemming
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks)

# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
dict_dtm2
#> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 7 docvars.
#>        features
#> docs    months _unmatched
#>   text1      2         10
#>   text2      1          7
#>   text3      0          6
```

Created on 2023-09-02 with reprex v2.0.2
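If you would rather stay within tm and skip the conversion, something along these lines might also work. This is only a minimal sketch, assuming exact matching on unstemmed, lowercased words is good enough for your use case. The `wordlist` vector and the `counts` name are illustrative and not part of the original code. It relies on the `dictionary` control option of `DocumentTermMatrix()`, which restricts the matrix to the listed terms.

```r
library(tm)

# same example documents as in the reprex above
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))

# illustrative word list (assumption); extend it with the remaining months
wordlist <- c("january", "february", "march")

# keep only the word-list terms as columns;
# punctuation is stripped so that "february." still matches "february"
dtm <- DocumentTermMatrix(x, control = list(dictionary = wordlist,
                                            removePunctuation = TRUE))

# sum over the word-list columns: one total per document
counts <- rowSums(as.matrix(dtm))
counts
```

Since nothing is stemmed here, only the exact forms in `wordlist` are counted. If you need to catch variants, the quanteda dictionary with glob patterns shown above is probably the more convenient route.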
