Searching for specific words in a corpus with R (tm package)


I have a Corpus (tm package) containing a collection of 1,300 different text documents [Content: documents: 1,300].

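My goal is to count how often the words from a specific word list occur in each document. For example, if my wordlist contains the words "january, february, march, ...", I want to analyze how often each document mentions these words.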

Example: 
Text 1: I like going on holiday in january and not in february.
Text 2: I went on a holiday in march.
Text 3: I like going on vacation.

The result should look like this:

Text 1: 2 
Text 2: 1
Text 3: 0

I tried the following code:

library(quanteda)
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 

dtm <- dfm(toks)

dict1 <- dictionary(list(c("january", "february", "march")))

dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
tail(dict_dtm2)  

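This code was suggested in another thread, but it does not work with my data: I get an error saying that it only works on character or corpus objects.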

How can I search for my wordlist in my existing Corpus from the tm package in R?

r text analysis corpus textual
1 Answer

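To make your quanteda code work, you first have to convert your tm VCorpus object x to a quanteda corpus, and fix a couple of other small issues:

  • dictionary() requires a named list
  • the English stemmer returns "januari", "februari" instead of "january", "february".
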
library(tm)
library(quanteda)

## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"

### convert the tm VCorpus object to a quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus"    "character"

### continue with tokenization and stemming
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 
dtm <- dfm(toks)

# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
dict_dtm2
#> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 7 docvars.
#>        features
#> docs    months _unmatched
#>   text1      2         10
#>   text2      1          7
#>   text3      0          6
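
If stemming is not actually needed for the analysis, the wildcard workaround can be avoided by skipping tokens_wordstem() altogether. The following is only a sketch under that assumption: the object names (dict_months, month_counts, result) are made up for illustration, and the texts are passed in as a plain named character vector rather than a converted tm corpus.

```r
library(quanteda)

# Example texts from the question, as a named character vector
docs <- c(text1 = "I like going on holiday in january and not in february.",
          text2 = "I went on a holiday in march.",
          text3 = "I like going on vacation.")

# Tokenize without stemming; punctuation is dropped so that
# "february." is counted as "february"
toks <- tokens(docs, remove_punct = TRUE)

# Without stemming, the dictionary can hold the plain month names
# (the name "months" is an arbitrary label for the key)
dict_months <- dictionary(list(months = c("january", "february", "march")))

# dfm() lowercases by default; dfm_lookup() sums the matches per key
month_counts <- dfm_lookup(dfm(toks), dict_months)

# One row per document, one column per dictionary key
result <- convert(month_counts, to = "data.frame")
result
```

For the three example texts, the months column should hold 2, 1 and 0, which matches the result requested in the question.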

Created on 2023-09-02 with reprex v2.0.2
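
Since the question asks about staying within tm, here is a possible tm-only variant as well. It is a sketch, not part of the reprex above: it assumes that restricting the document-term matrix to the word list via the dictionary control option is acceptable, and the names wordlist and counts are chosen just for this example.

```r
library(tm)

# Same example texts, stored in a tm VCorpus as in the question
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))

# Words to look for (extend with the remaining months as needed)
wordlist <- c("january", "february", "march")

# Build a document-term matrix that only keeps the words in wordlist;
# punctuation is removed so that "february." matches "february"
dtm <- DocumentTermMatrix(x, control = list(dictionary = wordlist,
                                            removePunctuation = TRUE,
                                            tolower = TRUE))

# Total number of wordlist hits per document
counts <- rowSums(as.matrix(dtm))
counts
```

With only a handful of dictionary terms, as.matrix() stays small even for 1,300 documents. The per-word counts for each document are available in as.matrix(dtm) if the totals alone are not enough.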
