Do you need to tokenize text to visualize data from an LDA topic model?

Question (votes: 0, answers: 1)

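I am currently using the textmineR package to run an LDA topic model on news articles from 2016-2019. However, I am quite new to R and I don't know how to display the results of the model.

I want to show the prevalence of the 8 topics my model found over the period in which the data was collected. The data is structured in a data frame. Each document is dated by day, in the format %y-%m-%d.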

My LDA model is built like this:

library(textmineR)
library(stringr) # for str_length() below

## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
                 ngram_window = c(1, 2), 
                 doc_names = dat$names,
                 stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
                 lower = T, # lowercase - this is the default value
                 remove_punctuation = T, # punctuation - this is the default
                 remove_numbers = T, # numbers - this is the default
                 verbose = T,
                 cpus = 4)


dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]

############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################

# set the random seed so the model fit is reproducible
set.seed(34838)

model <- FitLdaModel(dtm = dtm, 
                     k = 8,
                     iterations = 500,
                     burnin = 200,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 4) 

# model log-likelihood
plot(model$log_likelihood, type = "l")

# topic coherence
summary(model$coherence)

hist(model$coherence, 
     col= "blue", 
     main = "Histogram of probabilistic coherence")


# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)

t(model$top_terms1)

# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")

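Can anyone tell me how to plot the most prevalent topics the model found over time? Do I need to tokenize the text or something similar?

I hope this makes sense. Best,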

Tags: r visualization lda

1 Answer (votes: 0)

Tokenization happens inside the CreateDtm function, so it doesn't sound like that is your problem.

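You can get the prevalence of topics over a set of documents by taking the mean of each column of theta, a matrix that is part of the resulting model.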
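As a minimal sketch of that idea, the base R snippet below uses a small made-up theta matrix (hypothetical values, not output from a real model) in place of model$theta. Each row is a document, each column is a topic, and colMeans() gives the average share of each topic across the documents.

```r
# Toy stand-in for model$theta: 4 documents (rows) x 3 topics (columns).
# The values are made up for illustration; each row sums to 1.
theta <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4,
                  0.5, 0.1, 0.4),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("doc_", 1:4), paste0("t_", 1:3)))

# Prevalence of each topic over the whole set of documents
prevalence <- colMeans(theta)

# Prevalence over a subset, here only the first two documents
prevalence_subset <- colMeans(theta[1:2, , drop = FALSE])

print(prevalence)
print(prevalence_subset)
```

Because each row of theta sums to 1, colMeans(theta) is the same quantity as colSums(theta) / sum(theta) from the question, just without the factor of 100.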

I can't give you an exact answer for your data, but I can show you a similar example using the nih_sample data that ships with textmineR:

library(textmineR)

# load the NIH sample data
data(nih_sample)

# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, 
                 doc_names = nih_sample$APPLICATION_ID)

m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)

# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")

end_year <- sapply(end_year, function(x) x[length(x)])

end_year <- as.numeric(end_year)

topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x){
    if (is.null(nrow(x))) {
        # if only one row, gets converted to a vector
        # just return that vector
        return(x)
    } else { # if multiple rows, then aggregate
        return(colMeans(x))
    }
})

topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))

# the row names are the years used for grouping; store them as a column
topic_by_year$year <- as.numeric(rownames(topic_by_year))

# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
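For daily data like the question's, the same aggregation can be done per month instead of per year. The sketch below is base R only and uses made-up values: theta stands in for model$theta, and doc_dates is a hypothetical character vector of dates in %y-%m-%d form, one per document, in the same order as the rows of theta. With real data these would come from the fitted model and from the date column of the data frame.

```r
# Hypothetical inputs: 6 documents, 3 topics, dates as "%y-%m-%d" strings.
doc_dates <- c("16-01-05", "16-01-20", "16-02-03",
               "16-02-17", "16-03-01", "16-03-15")
theta <- matrix(c(0.6, 0.3, 0.1,
                  0.4, 0.5, 0.1,
                  0.2, 0.2, 0.6,
                  0.2, 0.4, 0.4,
                  0.1, 0.8, 0.1,
                  0.3, 0.6, 0.1),
                nrow = 6, byrow = TRUE,
                dimnames = list(NULL, paste0("t_", 1:3)))

# Parse the dates and reduce each one to a year-month label
dates <- as.Date(doc_dates, format = "%y-%m-%d")
month <- format(dates, "%Y-%m")

# Mean of each theta column within each month
topic_by_month <- aggregate(as.data.frame(theta),
                            by = list(month = month), FUN = mean)

# One line per topic; months are placed at evenly spaced positions,
# so a month with no documents would simply be absent from the axis
topic_cols <- colnames(theta)
x <- seq_len(nrow(topic_by_month))
matplot(x, as.matrix(topic_by_month[, topic_cols]),
        type = "l", lty = 1, col = seq_along(topic_cols), xaxt = "n",
        xlab = "month", ylab = "mean theta")
axis(1, at = x, labels = topic_by_month$month)
legend("topright", legend = topic_cols, lty = 1,
       col = seq_along(topic_cols))
```

Grouping by month rather than by day keeps each group large enough for the mean to be stable. Swapping the "%Y-%m" label for "%Y" would give yearly groups, as in the nih_sample example.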
