Applying a function to a textreuse corpus


I have a data frame as follows:

df<-data.frame(revtext=c('the dog that chased the cat', 'the dog which chased the cat', 'World Cup Hair 2014 very funny.i can change', 'BowBow', 'this is'), rid=c('r01','r02','r03','r04','r05'), stringsAsFactors = FALSE)

                                    revtext  rid
                the dog that chased the cat  r01
               the dog which chased the cat  r02
World Cup Hair 2014 very funny.i can change  r03
                                     BowBow  r04
                                    this is  r05

I am using the textreuse package to convert df into a corpus:

# install.packages("textreuse")
library(textreuse)
d<-df$revtext
names(d)<-df$rid
corpus <- TextReuseCorpus(text = d,
                      tokenizer = tokenize_character, k=3,
                      progress = FALSE,
                      keep_tokens = TRUE)

where tokenize_character is a function I wrote:

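tokenize_character <- function(document, k) {
  # Character-level k-shingles: every substring of length k, deduplicated.
  n_shingles <- nchar(document) - k + 1
  if (n_shingles < 1) return(character(0))  # document shorter than k
  shingles <- character(n_shingles)
  for (i in seq_len(n_shingles)) {
    shingles[i] <- substr(document, start = i, stop = i + k - 1)
  }
  unique(shingles)
}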

However, I get warnings such as: "Skipping document with ID 'r04' because it has too few words to create at least two n-grams with n = 3." Note that my tokenizer works at the character level, so the text of r04 is long enough. In fact, running tokenize_character('BowBow', 3) gives: "Bow" "owB" "wBo"
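To check the tokenizer on its own, outside TextReuseCorpus, here is a minimal base-R sketch. The helper name char_shingles is hypothetical (it is not part of textreuse), and it uses vectorized substring() rather than the loop above. It only counts the character shingles each short text yields.

```r
# Hypothetical helper, not part of textreuse: character k-shingles built with
# vectorized substring(); returns character(0) when the text is shorter than k.
char_shingles <- function(document, k = 3) {
  n <- nchar(document)
  if (n < k) return(character(0))
  starts <- seq_len(n - k + 1)
  unique(substring(document, starts, starts + k - 1))
}

# The two documents that trigger the warning in the question.
texts <- c(r04 = "BowBow", r05 = "this is")

# Number of distinct 3-character shingles per document.
shingle_counts <- vapply(texts, function(x) length(char_shingles(x, 3)), integer(1))
print(shingle_counts)
```

Both texts yield more than two character shingles, so the skipping is unlikely to come from the tokenizer itself.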

Also note that TextReuseCorpus works as expected for r01; tokens(corpus)$r01 returns: "the" "he " "e d" " do" "dog" "og " "g t" " th" "tha" "hat" "at " "t c" " ch" "cha" "has" "ase" "sed" "ed " "d t" "e c" " ca" "cat"

Any suggestions? I don't know what I am missing here.

r nlp text-mining corpus
1 Answer

From the Details section of the textreuse::TextReuseCorpus documentation:

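If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one with too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3.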

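From this, we know that documents with fewer than 4 words are skipped as short documents (n = 3 in your example). That is what happens for r04 and r05, which have 1 and 2 words respectively. The check counts words, regardless of the tokenizer you supply. To keep these documents, use skip_short = FALSE, which returns the expected output: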

corpus <- TextReuseCorpus(text = d, tokenizer = tokenize_character, k=3,
                      skip_short = FALSE, progress = FALSE, keep_tokens = TRUE)
tokens(corpus)$r04
[1] "Bow" "owB" "wBo"
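The skipping rule quoted from the documentation can be sketched in base R to see which documents fall below the threshold. This is only an approximation. It assumes that words are counted by splitting on whitespace, which may differ from how textreuse counts them internally. The names word_counts and too_short are illustrative.

```r
# Texts from the question, keyed by document ID.
d <- c(r01 = "the dog that chased the cat",
       r02 = "the dog which chased the cat",
       r03 = "World Cup Hair 2014 very funny.i can change",
       r04 = "BowBow",
       r05 = "this is")

n <- 3  # default n-gram size assumed when no n is supplied

# Assumption: approximate the word count by splitting on runs of whitespace.
word_counts <- lengths(strsplit(trimws(d), "\\s+"))

# Two n-grams need at least n + 1 words, so anything shorter is "very short".
too_short <- word_counts < n + 1
print(names(d)[too_short])
```

Under this assumption, only r04 and r05 are flagged, which matches the warnings reported in the question.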