Applying a function to a textreuse corpus

Question

I have a data frame as follows:

df <- data.frame(
  revtext = c('the dog that chased the cat',
              'the dog which chased the cat',
              'World Cup Hair 2014 very funny.i can change',
              'BowBow',
              'this is'),
  rid = c('r01', 'r02', 'r03', 'r04', 'r05'),
  stringsAsFactors = FALSE
)

                                    revtext rid
                the dog that chased the cat r01
               the dog which chased the cat r02
World Cup Hair 2014 very funny.i can change r03
                                     BowBow r04
                                    this is r05

I am using the textreuse package to convert df into a corpus:

# install.packages("textreuse")
library(textreuse)
d<-df$revtext
names(d)<-df$rid
corpus <- TextReuseCorpus(text = d,
                      tokenizer = tokenize_character, k=3,
                      progress = FALSE,
                      keep_tokens = TRUE)

where tokenize_character is a function I wrote:

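tokenize_character <- function(document, k) {
  # Collect every substring of k consecutive characters (a character shingle)
  shingles <- c()
  # seq_len() avoids the 1:0 trap when the document is shorter than k
  for (i in seq_len(max(nchar(document) - k + 1, 0))) {
    shingles[i] <- substr(document, start = i, stop = i + k - 1)
  }
  # Return each distinct shingle once, in order of first appearance
  return(unique(shingles))
}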

However, I get some warnings: Skipping document with ID 'r04' because it has too few words to create at least two n-grams with n = 3. But note that my tokenizer works at the character level, and the text of r04 is long enough. In fact, if we run tokenize_character('BowBow', 3), we get: "Bow" "owB" "wBo".

Also note that for r01, TextReuseCorpus works as expected. tokens(corpus)$r01 returns: "the" "he " "e d" " do" "dog" "og " "g t" " th" "tha" "hat" "at " "t c" " ch" "cha" "has" "ase" "sed" "ed " "d t" "e c" " ca" "cat"

Any suggestions? I don't know what I am missing here.

Answer 1

From the Details section of the textreuse::TextReuseCorpus documentation:

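If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one with too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3.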

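From this, we know that documents with fewer than 4 words are skipped as short documents (n = 3 in your example). That is what happens with r04 and r05, which have 1 and 2 words respectively. The check counts words regardless of the tokenizer you supply, so a character-level tokenizer does not change it. To keep these documents, use skip_short = FALSE, which returns the expected output: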

corpus <- TextReuseCorpus(text = d, tokenizer = tokenize_character, k = 3,
                          skip_short = FALSE, progress = FALSE, keep_tokens = TRUE)
tokens(corpus)$r04
[1] "Bow" "owB" "wBo"
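
To see ahead of time which documents the word-count rule would flag, here is a minimal base R sketch. It assumes words are whitespace-separated and that a document needs at least n + 1 words, as the documentation's five-gram example suggests. The helper is_too_short is hypothetical and not part of textreuse, whose internal word counting may differ in detail.

```r
# Hypothetical helper (not a textreuse function): approximates the
# skip_short rule by counting whitespace-separated words.
is_too_short <- function(text, n = 3) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]   # drop empty strings from blank input
  length(words) < n + 1           # fewer than n + 1 words: treated as short
}

d <- c(r01 = "the dog that chased the cat",
       r04 = "BowBow",
       r05 = "this is")

# Named logical vector: TRUE marks documents the rule would skip
sapply(d, is_too_short)
```

Under these assumptions r04 and r05 are flagged and r01 is not, which matches the documents discussed above.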
