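I am trying to run a cross-validation analysis to choose an "appropriate" number of topics to estimate. However, my data is encoded in a (fairly large) three-column table. Here is a sample: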
Source;Target;Value
advice;1;1
advice;100047;1
advice;10008;1
advice;100294;1
advice;100379;1
Here is the code I am running, along with some of the output I get.
> #import libraries
> library(tm)
> library(topicmodels)
> library(Matrix)
> library(ldatuning)
> library(doParallel)
> library(ggplot2)
> library(scales)
> #library(tidyverse)
> library(RColorBrewer)
> #library(wordcloud)
>
>
> #import data from csv
> myDF <- read.csv2("ctNoIso.csv", header=TRUE)
>
> #data as factors
> myDF$Source <- as.factor(myDF$Source)
> myDF$Target <- as.factor(myDF$Target)
> myDF$Value <- as.factor(myDF$Value)
>
> str(myDF)
'data.frame': 732764 obs. of 3 variables:
$ Source: Factor w/ 13186 levels "aacsb","abandonment",..: 171 171 171 171 171 171 171 171 171 171 ...
$ Target: Factor w/ 81977 levels "1","2","3","4",..: 1 68569 6480 68815 68896 6524 6551 69322 69523 69538 ...
$ Value : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
>
> mySM <- with(myDF, sparseMatrix(i=as.numeric(Target),
+ j=as.numeric(Source),
+ x=as.numeric(Value),
+ dimnames=list(levels(Target), levels(Source))))
>
> str(mySM)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:732764] 340 381 382 398 406 419 6602 7651 7670 8937 ...
..@ p : int [1:13187] 0 25 94 116 161 167 261 282 461 614 ...
..@ Dim : int [1:2] 81977 13186
..@ Dimnames:List of 2
.. ..$ : chr [1:81977] "1" "2" "3" "4" ...
.. ..$ : chr [1:13186] "aacsb" "abandonment" "abb" "abc" ...
..@ x : num [1:732764] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
> #document term matrix
> myDTM <- as.DocumentTermMatrix(mySM, weighting=weightBin)
> inspect(myDTM)
<<DocumentTermMatrix (documents: 81977, terms: 13186)>>
Non-/sparse entries: 732764/1080215958
Sparsity : 100%
Maximal term length: 49
Weighting : binary (bin)
Sample :
Terms
Docs ideal india innovation_process market_share mediating narrative nursing_management reaction
109466 0 0 0 0 0 0 0 0
14075 0 0 0 0 0 0 0 0
1421 0 0 0 0 0 0 0 0
...so far everything seems to work, but when I run the cross-validation code I get an error:
> #cross-validation
> system.time({
+ tunes <- FindTopicsNumber(
+ myDTM,
+ topics = c(1:10 * 10, 120, 140, 160, 180, 0:3 * 50 + 200),
+ metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
+ method = "Gibbons",
+ control = list(seed = 77),
+ verbose = TRUE
+ )
+ })
fit models...Error in checkForRemoteErrors(val) :
8 nodes produced errors; first error: The DocumentTermMatrix needs to have a term frequency weighting
In addition: Warning messages:
1: In .Internal(gc(verbose, reset, full)) :
closing unused connection 18 (<-DESKTOP-4QO2AE4:11213)
2: In .Internal(gc(verbose, reset, full)) :
closing unused connection 17 (<-DESKTOP-4QO2AE4:11213)
3: In .Internal(gc(verbose, reset, full)) :
closing unused connection 16 (<-DESKTOP-4QO2AE4:11213)
4: In .Internal(gc(verbose, reset, full)) :
closing unused connection 15 (<-DESKTOP-4QO2AE4:11213)
5: In .Internal(gc(verbose, reset, full)) :
closing unused connection 10 (<-DESKTOP-4QO2AE4:11213)
6: In .Internal(gc(verbose, reset, full)) :
closing unused connection 9 (<-DESKTOP-4QO2AE4:11213)
7: In .Internal(gc(verbose, reset, full)) :
closing unused connection 8 (<-DESKTOP-4QO2AE4:11213)
8: In .Internal(gc(verbose, reset, full)) :
closing unused connection 7 (<-DESKTOP-4QO2AE4:11213)
Timing stopped at: 0.57 0.67 13.49
If anyone can help me solve this problem I would be very grateful. Best, vitaliano
I fixed it myself.
The DTM needs weighting = weightTf instead of weighting = weightBin, and there was also a typo in the method argument, which is now set to "Gibbs".
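For anyone hitting the same error, here is a minimal sketch of the two changes on a tiny made-up table. The `toyDF`, `toySM` and `toyDTM` names and the toy values are hypothetical stand-ins for my real data, and I use a single metric with `mc.cores = 1L` only to keep the run small; it is an illustration of the fix, not my full pipeline.

```r
library(tm)
library(topicmodels)
library(Matrix)
library(ldatuning)

# Hypothetical toy table in the same Source;Target;Value layout
toyDF <- data.frame(
  Source = c("advice", "advice", "market", "market",
             "nursing", "nursing", "advice", "market"),
  Target = c("1", "2", "1", "3", "2", "3", "4", "4"),
  Value  = 1
)
toyDF$Source <- factor(toyDF$Source)
toyDF$Target <- factor(toyDF$Target)

# Rows are documents (Target), columns are terms (Source)
toySM <- with(toyDF, sparseMatrix(i = as.integer(Target),
                                  j = as.integer(Source),
                                  x = Value,
                                  dimnames = list(levels(Target), levels(Source))))

# Fix 1: term frequency weighting (weightTf) instead of weightBin
toyDTM <- as.DocumentTermMatrix(toySM, weighting = weightTf)

# Fix 2: method = "Gibbs" instead of the misspelled "Gibbons"
tunes <- FindTopicsNumber(
  toyDTM,
  topics   = 2:3,            # small range chosen only for the toy data
  metrics  = "CaoJuan2009",
  method   = "Gibbs",
  control  = list(seed = 77),
  mc.cores = 1L
)
print(tunes)
```

With the real data, the same two changes go into the original `as.DocumentTermMatrix()` and `FindTopicsNumber()` calls; everything else stays as posted.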