Cross-validation for LDA topic modeling in R from a document-term matrix stored in a CSV file

Question  Votes: 0  Answers: 1

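I am trying to run a cross-validation analysis to choose an "appropriate" number of topics to estimate. However, my data is already encoded as a (fairly large) three-column table. Here is a sample: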

Source;Target;Value
advice;1;1
advice;100047;1
advice;10008;1
advice;100294;1
advice;100379;1

Here is the code I am running, along with some of the output I get.

> #import libraries
> library(tm)
> library(topicmodels)
> library(Matrix)
> library(ldatuning)
> library(doParallel)
> library(ggplot2)
> library(scales)
> #library(tidyverse)
> library(RColorBrewer)
> #library(wordcloud)
> 
> 
> #import data from csv
> myDF <- read.csv2("ctNoIso.csv", header=TRUE)
> 
> #data as factors
> myDF$Source <- as.factor(myDF$Source)
> myDF$Target <- as.factor(myDF$Target)
> myDF$Value <- as.factor(myDF$Value)
> 
> str(myDF)
'data.frame':   732764 obs. of  3 variables:
 $ Source: Factor w/ 13186 levels "aacsb","abandonment",..: 171 171 171 171 171 171 171 171 171 171 ...
 $ Target: Factor w/ 81977 levels "1","2","3","4",..: 1 68569 6480 68815 68896 6524 6551 69322 69523 69538 ...
 $ Value : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
> 
> mySM <- with(myDF, sparseMatrix(i=as.numeric(Target), 
+                      j=as.numeric(Source), 
+                      x=as.numeric(Value),
+                      dimnames=list(levels(Target), levels(Source))))
> 
> str(mySM)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:732764] 340 381 382 398 406 419 6602 7651 7670 8937 ...
  ..@ p       : int [1:13187] 0 25 94 116 161 167 261 282 461 614 ...
  ..@ Dim     : int [1:2] 81977 13186
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:81977] "1" "2" "3" "4" ...
  .. ..$ : chr [1:13186] "aacsb" "abandonment" "abb" "abc" ...
  ..@ x       : num [1:732764] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
> #document term matrix
> myDTM <- as.DocumentTermMatrix(mySM, weighting=weightBin)
> inspect(myDTM)
<<DocumentTermMatrix (documents: 81977, terms: 13186)>>
Non-/sparse entries: 732764/1080215958
Sparsity           : 100%
Maximal term length: 49
Weighting          : binary (bin)
Sample             :
        Terms
Docs     ideal india innovation_process market_share mediating narrative nursing_management reaction
  109466     0     0                  0            0         0         0                  0        0
  14075      0     0                  0            0         0         0                  0        0
  1421       0     0                  0            0         0         0                  0        0

...Up to this point everything seems to work, but when I run the cross-validation code I get an error:

> #cross-validation
> system.time({
+ tunes <- FindTopicsNumber(
+    myDTM,
+    topics = c(1:10 * 10, 120, 140, 160, 180, 0:3 * 50 + 200),
+    metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
+    method = "Gibbons",
+    control = list(seed = 77),
+    verbose = TRUE
+ )
+ })
fit models...Error in checkForRemoteErrors(val) : 
  8 nodes produced errors; first error: The DocumentTermMatrix needs to have a term frequency weighting
In addition: Warning messages:
1: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 18 (<-DESKTOP-4QO2AE4:11213)
2: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 17 (<-DESKTOP-4QO2AE4:11213)
3: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 16 (<-DESKTOP-4QO2AE4:11213)
4: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 15 (<-DESKTOP-4QO2AE4:11213)
5: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 10 (<-DESKTOP-4QO2AE4:11213)
6: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 9 (<-DESKTOP-4QO2AE4:11213)
7: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 8 (<-DESKTOP-4QO2AE4:11213)
8: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 7 (<-DESKTOP-4QO2AE4:11213)
Timing stopped at: 0.57 0.67 13.49

I would be very grateful if anyone could help me solve this. Best, vitaliano
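For anyone reproducing this: the error text refers to the weighting recorded on the DocumentTermMatrix, which is the same label that inspect() prints. A minimal sketch for checking it, assuming tm keeps that label in the matrix's "weighting" attribute; the two-document matrix and the names m, dtmBin and dtmTf are made up for illustration:

```r
library(tm)
library(Matrix)

# Tiny hypothetical matrix: 2 documents x 2 terms
m <- sparseMatrix(i = c(1, 1, 2), j = c(1, 2, 2), x = 1,
                  dimnames = list(c("d1", "d2"), c("advice", "market")))

# Same conversion as in the question, once per weighting
dtmBin <- as.DocumentTermMatrix(m, weighting = weightBin)
dtmTf  <- as.DocumentTermMatrix(m, weighting = weightTf)

# Print the weighting label stored on each matrix
print(attr(dtmBin, "weighting"))
print(attr(dtmTf, "weighting"))
```

Running the same attr() call on myDTM should show which weighting the LDA fit will see.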

sparse-matrix cross-validation lda topic-modeling csv-import
1 Answer
0
votes

I fixed it myself.

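The document-term matrix has to be built with weighting = weightTf instead of weighting = weightBin, since the error says LDA needs a term frequency weighting. There was also a typo in the method argument, which is now set to method = "Gibbs".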
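A minimal sketch of the two corrected calls, run on a small made-up table instead of ctNoIso.csv. The data frame toyDF, the term list, the short topic grid 2:4 and mc.cores = 1L are assumptions chosen so the example finishes quickly; the grid from the question would be used on the real data.

```r
library(tm)
library(topicmodels)
library(Matrix)
library(ldatuning)

# Hypothetical stand-in for the Source;Target;Value table
set.seed(77)
terms <- c("advice", "market", "india", "nursing", "narrative", "ideal")
toyDF <- unique(data.frame(
  Source = factor(sample(terms, 200, replace = TRUE)),
  Target = factor(sample(1:30, 200, replace = TRUE))
))
toyDF$Value <- 1

# Documents (Target) in rows, terms (Source) in columns, as in the question
toySM <- with(toyDF, sparseMatrix(i = as.integer(Target),
                                  j = as.integer(Source),
                                  x = Value,
                                  dimnames = list(levels(Target), levels(Source))))

# Fix 1: term frequency weighting instead of weightBin
toyDTM <- as.DocumentTermMatrix(toySM, weighting = weightTf)

# Fix 2: method = "Gibbs" instead of "Gibbons"
tunes <- FindTopicsNumber(
  toyDTM,
  topics = 2:4,
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 1L,
  verbose = FALSE
)
print(tunes)
```

Because Value is always 1 in this data, the entries of the matrix are the same under either weighting; only the weighting label changes, and that label is what the error message complains about.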
