Error in aggregate.data.frame(as.data.frame(x), ...): arguments must have same length

Question · votes: 0 · answers: 1

Hi, I am using the last example from this tutorial, topic proportions over time: https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html

I run it on my own data with this code:

library(readxl)
library(tm)
# Import text data

tweets <- read_xlsx("C:/R/data.xlsx")

textdata <- tweets$text

#Load in the library 'stringr' so we can use the str_replace_all function. 
library('stringr')

#Remove URL's 
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")


textdata <- gsub("@\\w+", " ", textdata)  # Remove user names (all proper names if you're wise!)

textdata <- iconv(textdata, to = "ASCII", sub = " ")  # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)

textdata <- gsub("http.+ |http.+$", " ", textdata)  # Remove links

textdata <- gsub("[[:punct:]]", " ", textdata)  # Remove punctuation


#Change all the text to lower case
textdata <- tolower(textdata)



#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))


textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)

# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata))  # Create corpus object


#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)

ui = unique(dtm$i)
dtm.new = dtm[ui,]

#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new   <- datatm[rowTotals> 0, ]

library("ldatuning")
library("topicmodels")

k <- 7

ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)


#####################################################
#topics by year

tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)

tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")

topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)


top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")

# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames


# reshape data frame (melt() comes from the reshape2 package)
library(reshape2)
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")

# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here is the Excel file with the input data: https://www.mediafire.com/file/4w2hkgzzzaaax88/data.xlsx/file

I get the error when I run the line with the aggregate function, and I cannot figure out what is wrong with the aggregation. I created the "decade" variable the same way as the tutorial; when I print it, it looks fine, and the theta variable looks fine too. I have changed the aggregate call several times based on e.g. Error in aggregate.data.frame : arguments must have same length,

but I still get the same error. Please help.

r text-mining topic-modeling
1 Answer

2 votes

I am not sure what you want to achieve with the command

topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)

As far as I can see, you only produce a single decade:

tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
table(tweets$decade)

2010 
3481 
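The decade construction above can be checked on a small toy vector (hypothetical dates, not taken from data.xlsx): `substr()` keeps the first three digits of the year (R treats a start index of 0 as 1), and `paste0()` appends a "0".

```r
# Hypothetical dates illustrating the decade construction
dates <- c("2013-05-01", "2019-11-23", "2008-02-14")

# substr(x, 0, 3) returns the first three characters, e.g. "201";
# paste0 then turns "201" into the decade label "2010"
decade <- paste0(substr(dates, 0, 3), "0")
print(decade)  # "2010" "2010" "2000"
```

So any vector of ISO-style date strings collapses to one label per decade, which is why the table above shows only "2010".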

All the preprocessing from tweets to textdata produces some empty lines. This is where your problem starts. textdata with its new empty lines is the basis of your corpus and your dtm. You get rid of those empty documents with the lines:

ui = unique(dtm$i)
dtm.new = dtm[ui,]

At the same time you are basically deleting the empty rows in the dtm, and thereby changing the length of the object. This new dtm without empty documents is the new basis for the topic model. This comes back to haunt you when you try to use aggregate() on two objects of different lengths: tweets$decade, which still has the old length of 3481, and theta, which is produced by the topic model, which in turn is based on dtm.new, the one with fewer rows.
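A minimal sketch of the mismatch, using a toy theta matrix and made-up decade labels (none of these values come from the actual data): theta has one row per surviving document, while the decade vector still has one entry per original tweet, and aggregate() refuses to combine them.

```r
# 3 documents survived preprocessing: theta has 3 rows (2 topics)
theta_toy  <- matrix(1/2, nrow = 3, ncol = 2)
# ...but there were 4 original tweets, so the grouping vector has 4 entries
decade_all <- c("2010", "2010", "2010", "2010")

# This reproduces the error from the question:
res <- try(aggregate(as.data.frame(theta_toy),
                     by = list(decade = decade_all), mean), silent = TRUE)
inherits(res, "try-error")  # TRUE: "arguments must have same length"

# Keeping only the labels of the surviving documents fixes it
# (hypothetical surviving row indices):
kept <- c(1, 2, 4)
aggregate(as.data.frame(theta_toy),
          by = list(decade = decade_all[kept]), mean)
```

Matching the grouping vector to the surviving documents is exactly what the ID/merge approach below does on the real data.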

I suggest you first add an ID column to tweets. Later you can use the IDs to find out which texts were deleted by the preprocessing, and match the lengths of tweets$decade and theta.

I rewrote your code, so try this:

library(readxl)
library(tm)
# Import text data

tweets <- read_xlsx("data.xlsx")

## Include ID for later
tweets$ID <- 1:nrow(tweets)

textdata <- tweets$text

#Load in the library 'stringr' so we can use the str_replace_all function. 
library('stringr')

#Remove URL's 
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")


textdata <- gsub("@\\w+", " ", textdata)  # Remove user names (all proper names if you're wise!)

textdata <- iconv(textdata, to = "ASCII", sub = " ")  # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)

textdata <- gsub("http.+ |http.+$", " ", textdata)  # Remove links

textdata <- gsub("[[:punct:]]", " ", textdata)  # Remove punctuation

#Change all the text to lower case
textdata <- tolower(textdata)

#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))

textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)

# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata))  # Create corpus object

#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]

#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new   <- datatm[rowTotals> 0, ]

library("ldatuning")
library("topicmodels")

k <- 7

ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)

#####################################################
#topics by year

tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)

id <- data.frame(ID = dtm.new$dimnames$Docs)
colnames(id) <- "ID"
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")

tweets_new <- merge(id, tweets, by.x="ID", by.y = "ID", all.x = T)

topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets_new$decade), mean)