I have a data frame as follows:
df<-data.frame(revtext=c('the dog that chased the cat', 'the dog which chased the cat', 'World Cup Hair 2014 very funny.i can change', 'BowBow', 'this is'), rid=c('r01','r02','r03','r04','r05'), stringsAsFactors = FALSE)
revtext                                      rid
the dog that chased the cat                  r01
the dog which chased the cat                 r02
World Cup Hair 2014 very funny.i can change  r03
BowBow                                       r04
this is                                      r05
I am using the textreuse package to convert df into a corpus:
# install.packages("textreuse")
library(textreuse)
d <- df$revtext
names(d) <- df$rid
corpus <- TextReuseCorpus(text = d,
                          tokenizer = tokenize_character, k = 3,
                          progress = FALSE,
                          keep_tokens = TRUE)
where tokenize_character is a function I wrote myself:
tokenize_character <- function(document, k) {
  shingles <- c()
  # Slide a window of k characters across the document
  for (i in 1:(nchar(document) - k + 1)) {
    shingles[i] <- substr(document, start = i, stop = (i + k - 1))
  }
  return(unique(shingles))
}
However, I get some warnings such as: Skipping document with ID 'r04' because it has too few words to create at least two n-grams with n = 3.
Note, though, that my tokenizer works at the character level, and the text of r04 is long enough. In fact, if we run tokenize_character('BowBow', 3), we get: "Bow" "owB" "wBo".
Also note that for r01, TextReuseCorpus works as expected. tokens(corpus)$r01 returns:
"the" "he " "e d" " do" "dog" "og " "g t" " th" "tha" "hat" "at " "t c" " ch" "cha" "has" "ase" "sed" "ed " "d t" "e c" " ca" "cat"
Any suggestions? I don't see what I am missing here.
From the Details section of the textreuse::TextReuseCorpus documentation:
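If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one with too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3.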
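To make the rule in that passage concrete, here is a minimal base R sketch of the word-count check it describes. The helper is_too_short is hypothetical and is not part of textreuse; it assumes words are separated by whitespace, which may differ from how the package counts words internally.

```r
# Hypothetical helper, NOT part of textreuse: approximates the documented
# skip_short rule (a document needs at least n + 1 words to yield two n-grams).
# Assumption: words are whitespace-separated.
is_too_short <- function(text, n = 3) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings from empty input
  length(words) < n + 1
}

texts <- c(r01 = "the dog that chased the cat",
           r04 = "BowBow",
           r05 = "this is")
sapply(texts, is_too_short)
# r01 is FALSE (6 words); r04 and r05 are TRUE (1 and 2 words)
```

The point of the sketch is that the check counts words, not the tokens produced by a custom tokenizer, so a character-level tokenizer does not change which documents are considered short.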
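According to that passage, documents with fewer than 4 words are skipped as short documents (with n = 3, as in your example). That is exactly what happens for r04 and r05, which have 1 and 2 words respectively. The check counts words regardless of the tokenizer you supply, so a character-level tokenizer does not avoid it. To keep these documents, set skip_short = FALSE, which returns the expected output: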
corpus <- TextReuseCorpus(text = d, tokenizer = tokenize_character, k = 3,
                          skip_short = FALSE, progress = FALSE, keep_tokens = TRUE)
tokens(corpus)$r04
[1] "Bow" "owB" "wBo"
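One side effect of skip_short = FALSE is that documents shorter than k characters now reach the tokenizer as well. In the question's loop, 1:(nchar(document) - k + 1) counts downward (for example 1, 0) rather than being empty in that case. A minimal sketch of a vectorized variant that guards against this follows; the name tokenize_character_vec is illustrative, and returning an empty character vector for too-short input is an assumption about the desired behavior, not something textreuse prescribes.

```r
# Hypothetical variant of the question's tokenizer (the name is illustrative).
# Returns the unique character k-shingles of a single string.
tokenize_character_vec <- function(document, k) {
  n_shingles <- nchar(document) - k + 1
  # Assumption: a document shorter than k characters yields no shingles
  if (n_shingles < 1) return(character(0))
  starts <- seq_len(n_shingles)
  # substring() is vectorized over the start and stop positions
  unique(substring(document, starts, starts + k - 1))
}

tokenize_character_vec("BowBow", 3)
# [1] "Bow" "owB" "wBo"
tokenize_character_vec("hi", 3)
# character(0)
```

It takes the same (document, k) arguments as the original, so it should be usable as the tokenizer argument in the same way, although how the corpus handles a document with zero tokens is not covered here.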