Term frequencies from VCorpus and DTM do not match

Question

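I computed term frequencies for my test files in two ways, once from the Corpus and once from the DTM, as shown below. The results do not match. Can anyone tell me where the mismatch comes from? Is it because I am using the wrong method to extract term frequencies?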

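library("tm")
library("stringr")
library("dplyr")
# Read every file in the directory into a volatile corpus
test1 <- VCorpus(DirSource("test_papers"))
# Method 1: split each document on word boundaries and tabulate the tokens
mytable1 <- lapply(test1, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table() %>% sort(decreasing=TRUE)
# Method 2: build a document-term matrix and sum each term's column
test2 <- DocumentTermMatrix(test1)
mytable2 <- apply(test2, 2, sum) %>% sort(decreasing=TRUE)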
head(mytable1)
.
and  of the  to  in  on 
148 116 111  69  61  54 
head(mytable2)
      and       the      this      that       are political 
      145       120        35        34        33        33

Answer 1

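The two methods tokenize the text differently.

str_extract_all with boundary("word") strips punctuation from the sentences. Converting the text into a document-term matrix does not. To get the same numbers, you need to use DocumentTermMatrix(test1, control = list(removePunctuation = TRUE)).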

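In detail:

In the first case, "This is a text." is returned as four words, without the period. In the second case, the document-term matrix keeps the word together with its period ("text."). Now suppose the text reads "text and text.". The first method counts "text" = 2, whereas the document-term matrix treats it as "text" = 1 and "text." = 1.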

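Using removePunctuation removes the period, so these counts will match. Note also that DocumentTermMatrix lowercases terms and, by default, drops terms shorter than three characters (wordLengths = c(3, Inf)), which is why short words such as "of", "to", "in" and "on" do not appear in mytable2.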
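To make the difference concrete, here is a minimal sketch that builds a one-document corpus in memory instead of reading "test_papers". The sentence, the variable names and the helper term_totals are made up for illustration and are not part of the original question.

```r
library(tm)

# One-document corpus standing in for the files in "test_papers" (illustrative text)
docs <- VCorpus(VectorSource("text and text."))

# Hypothetical helper: total count of each term across all documents
term_totals <- function(dtm) colSums(as.matrix(dtm))

# Default settings: the trailing period stays attached, so "text." is its own term
default_counts <- term_totals(DocumentTermMatrix(docs))

# With removePunctuation = TRUE, "text." is cleaned to "text" before counting
clean_counts <- term_totals(
  DocumentTermMatrix(docs, control = list(removePunctuation = TRUE))
)

print(default_counts)
print(clean_counts)
```

With the default settings, "text" and "text." should show up as separate terms, while the second matrix should report a single "text" term with both occurrences.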

You may also want to remove numbers first, because removePunctuation strips the dots and commas out of numbers.
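The effect on numbers can be seen by calling tm's removePunctuation and removeNumbers directly on a character string. This is only a sketch; the example sentence and variable names are invented for illustration.

```r
library(tm)

# Illustrative sentence containing a decimal and a thousands separator
x <- "Sales rose 3.5 percent to 1,200 units."

# Punctuation removal alone fuses the digits on either side of "." and ","
punct_only <- removePunctuation(x)

# Removing numbers first leaves only stray punctuation, which is then stripped
numbers_first <- removePunctuation(removeNumbers(x))

print(punct_only)
print(numbers_first)
```

In a DocumentTermMatrix call, the equivalent would presumably be control = list(removeNumbers = TRUE, removePunctuation = TRUE). The tm documentation states that these options are processed in the order they are listed, so it is worth checking the result on your own data.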
