LDA主题模型按年绘制

Question

我试图从这个文件中逐年绘制推文主题

https://www.mediafire.com/file/64lzbt46v01jbe1/cleaned.xlsx/file

工作正常，以获得主题，但当我尝试按年绘制它时，我有一个尺寸问题：

library(readxl)
library(tm)
tweets <- read_xlsx("C:/cleaned.xlsx")

mytextdata <- tweets$textdata

# Convert to tm corpus and use its API 
corpus <- Corpus(VectorSource(mytextdata))  # Create corpus object

dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]

k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
tmResult <- posterior(ldaTopics)
theta <- tmResult$topics
dim(theta)

dim（theta）= 4857我在clean.xls文件中有4876个日期，我需要它们才能运行这个聚合函数

topic_proportion_per_decade <- aggregate(theta, by = list(decade = textdata$decade), mean)

从这里

https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html

我认为问题是clean.xls文件不够干净，这就是为什么theta错过了一些行..但实际上我真的不知道为什么theta错过了一些行..我也不知道如何清理文件更好，如果这是问题，文件看起来不错，有些行只是数字或非英语单词，但我更喜欢保留它们..

Answer 1

问题是ui = unique(dtm$i)删除了几个文件（我不知道你为什么这样做，所以我不会评论那个部分）。因此，您的theta与数据的行数不同。我们可以通过仅保留仍在theta中的行来解决这个问题：

library("dplyr")
library("reshape2")
library("ggplot2")
tweets_clean <- tweets %>% 
  mutate(id = rownames(.)) %>% 
  filter(id %in% rownames(theta)) %>% # keep only rows still in theta
  cbind(theta) %>% # now we can attach the topics to the data.frame
  mutate(year = format(date, "%Y")) # make year variable

然后我使用dplyr函数进行聚合，因为我认为这样可以更容易地读取代码：

tweets_clean_yearly <- tweets_clean %>% 
  group_by(year) %>% 
  summarise_at(vars(as.character(1:7)), funs(mean)) %>% 
  melt(id.vars = "year")

然后我们可以绘制这个：

ggplot(tweets_clean_yearly, aes(x = year, y = value, fill = variable)) + 
  geom_bar(stat = "identity") + 
  ylab("proportion")

注意：我测试了θ和推文是否真的具有相同的文档：

tweets_clean <- tweets %>% 
  mutate(id = rownames(.)) %>% 
  filter(id %in% rownames(theta))

all.equal(tweets_clean$id, rownames(theta))

LDA主题模型按年绘制

问题描述投票：0回答：1

1个回答

最新问题

LDA主题模型按年绘制

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1