R: How do I fix missing terms from TermDocumentMatrix() and an error from DocumentTermMatrix()?


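I have a Twitter data set of 1000 samples and am trying to run tf and tf-idf analyses on it to measure how important each emoji is in the tweets. In total there are 437 unique emojis and 810 tweets.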

My current problem is that with TermDocumentMatrix not all terms are shown, and with DocumentTermMatrix I get an error I cannot resolve. Here is a working code snippet:

library(dplyr)
library(tidytext)
library(tm)
library(tidyr) 

#These are NOT from my data; they are random fake bios I found online just to make this code snippet
tweets_data <- c("Sharp, adversarial⚔️~pro choice💪~ban Pit Bulls☠️~BSL🕊️~aberant psychology😈~common sense🤔~the Piper will lead us to reason🎵~sealskin woman🐺",
                 "Blocked by Owen, Adonis. Abbott & many #FBPE😃 Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement 🇬🇧",
                 "🇺🇸🇺🇸🇺🇸🇺🇸 #healthy #vegetarian #beatchronicillness fix infrastructure",
                 "LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
                 "💙🖤I #BackTheBlue for my son!🖤💙 Facts Over Feelings. Border Security saves lives! #ThankYouICE",
                 "🇺🇸🇺🇸 I play Pedal Steel @CooderGraw & #CharlieShafter🇺🇸🇺🇸 #GoStars #LiberalismIsAMentalDisorder",
                 "#Englishman  #Londoner  @Chelseafc  🕵️‍♂️ 🥓🚁 🍺 🏴󠁧󠁢󠁥󠁮󠁧󠁿🇬🇧🇨🇿",
                 "F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
                 "🌸🐦❄️Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment.🌸🐿️❄️",
                 "Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro 🇸🇪 | 👋🏼Hello intro on the Minds Link |")

emoticons_data <- c("🤔","🍺","💪","🥓","😃")

TagSet <- data.frame(emoticons_data)
colnames(TagSet) <- "emoticon"

TextSet <- data.frame(tweets_data)
colnames(TextSet) <- "tweet"

myCorpus <- tm::Corpus(tm::VectorSource(TextSet$tweet))

tdm <- tm::TermDocumentMatrix(myCorpus, control= list(stopwords=T))

tdm_onlytags <- tdm[rownames(tdm)%in%TagSet$emoticon, ]

tm::inspect(tdm_onlytags) #Only shows 1 term, not 5
#View(as.matrix(tdm_onlytags[1:tdm_onlytags$nrow, 1:tdm_onlytags$ncol])) #just to see in new window


Also, if I try to compute tf-idf, I only get an error. I have looked around but cannot work out where the error should be fixed.

tdm <- tm::as.DocumentTermMatrix(myCorpus, control= list(weighting= weightTfIdf))
tdm #Original= Error in dim(data) <- dim : dims [product 810] do not match the length of object [3]


I would be very grateful if anyone could help. This is my first time using the tm package. Thanks in advance; I can provide more information if needed.

r utf-8 emoji tf-idf tm
1 Answer

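I changed the original data slightly, because each of your emojis appears only once in the text, which would make every non-zero tf-idf value equal to 1 (see the data below; I just added a few 🤔 at random). I am using quanteda instead of tm because it is faster and has fewer encoding problems.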

library(dplyr)
library(quanteda)

tweets_dfm <- dfm(tokens(TextSet$tweet))  # tokenize, then convert to document-feature matrix (dfm() on raw text is deprecated since quanteda 3)

tweets_dfm %>% 
  dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
  dfm_tfidf() %>%                 # weight with tfidf
  convert("data.frame")           # turn into data.frame to display more easily
#>    document <U+0001F914> <U+0001F4AA> <U+0001F603> <U+0001F953> <U+0001F37A>
#> 1     text1      1.39794            1            0            0            0
#> 2     text2      0.00000            0            1            0            0
#> 3     text3      0.00000            0            0            0            0
#> 4     text4      0.00000            0            0            0            0
#> 5     text5      0.00000            0            0            0            0
#> 6     text6      0.69897            0            0            0            0
#> 7     text7      0.00000            0            0            1            1
#> 8     text8      0.00000            0            0            0            0
#> 9     text9      0.00000            0            0            0            0
#> 10   text10      0.00000            0            0            0            0
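
The 🤔 column in the table above can be checked by hand. The sketch below assumes the `dfm_tfidf()` defaults (raw term count multiplied by `log10(N / df)`); the `tf` vector is typed in manually from the modified data, where 🤔 occurs twice in text1 and once in text6.

```r
# Counts of the thinking-face emoji per document in the modified data
tf <- c(text1 = 2, text2 = 0, text3 = 0, text4 = 0, text5 = 0,
        text6 = 1, text7 = 0, text8 = 0, text9 = 0, text10 = 0)

n_docs <- length(tf)        # 10 documents
df     <- sum(tf > 0)       # the emoji occurs in 2 documents
idf    <- log10(n_docs / df)

tfidf <- tf * idf           # count-based tf times log10 inverse document frequency
round(tfidf, 5)
```

An emoji that occurs once in a single document gets `1 * log10(10 / 1) = 1`, which is why the unmodified data would only produce zeros and ones.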

The column names (i.e. the emojis) display correctly in my viewer, and it should be possible to export the resulting data.frame.
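
For readers who prefer to stay with tm, the error in the question most likely comes from calling `as.DocumentTermMatrix()`, a coercion function, directly on a corpus; the corpus object is then treated as a length-3 list, which would match the "length of object [3]" message. A minimal sketch of the constructor call with tf-idf weighting follows. The toy documents are hypothetical plain-word strings, and the sketch addresses only the weighting call, not how emojis are tokenized. Note that `weightTfIdf` normalizes term frequency by document length and uses a base-2 logarithm by default, so its values will differ from the quanteda output.

```r
library(tm)

# Toy documents (hypothetical), plain words only
docs <- c("apple banana apple", "banana cherry", "cherry cherry durian")

corp <- VCorpus(VectorSource(docs))

# Use the constructor and pass the weighting function through `control`
dtm_tfidf <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))

inspect(dtm_tfidf)
```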

Data

TagSet <- data.frame(emoticon = c("🤔","🍺","💪","🥓","😃"),
                     stringsAsFactors = FALSE)

TextSet <- data.frame(tweet = c("🤔Sharp, adversarial⚔️~pro choice💪~ban Pit Bulls☠️~BSL🕊️~aberant psychology😈~common sense🤔~the Piper will lead us to reason🎵~sealskin woman🐺",
                                "Blocked by Owen, Adonis. Abbott & many #FBPE😃 Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement 🇬🇧",
                                "🇺🇸🇺🇸🇺🇸🇺🇸 #healthy #vegetarian #beatchronicillness fix infrastructure",
                                "LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
                                "💙🖤I #BackTheBlue for my son!🖤💙 Facts Over Feelings. Border Security saves lives! #ThankYouICE",
                                "🤔🇺🇸🇺🇸 I play Pedal Steel @CooderGraw & #CharlieShafter🇺🇸🇺🇸 #GoStars #LiberalismIsAMentalDisorder",
                                "#Englishman  #Londoner  @Chelseafc  🕵️‍♂️ 🥓🚁 🍺 🏴󠁧󠁢󠁥󠁮󠁧󠁿🇬🇧🇨🇿",
                                "F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
                                "🌸🐦❄️Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment.🌸🐿️❄️",
                                "Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro 🇸🇪 | 👋🏼Hello intro on the Minds Link |"),
                      stringsAsFactors = FALSE)
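
As a cross-check that does not depend on any tokenizer, the raw emoji counts can also be obtained in base R by counting literal matches. This is only a sketch: the tweets are shortened, hypothetical versions of the ones above, `count_fixed()` is a helper name made up for illustration, and a UTF-8 session is assumed.

```r
# Toy data (hypothetical, shortened from the tweets in this thread)
tweets <- c("🤔Sharp, adversarial~pro choice💪~common sense🤔",
            "many #FBPE😃 Love seaside",
            "#healthy #vegetarian",
            "🤔 I play Pedal Steel",
            "#Londoner 🥓🚁 🍺")
emojis <- c("🤔", "🍺", "💪", "🥓", "😃")

# Hypothetical helper: number of literal occurrences of `pattern` in each string
count_fixed <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

# Documents in rows, emojis in columns
counts <- sapply(emojis, count_fixed, x = tweets)
rownames(counts) <- paste0("text", seq_along(tweets))
counts
```

Because the match is literal, an emoji glued to a neighbouring word (as in "sense🤔") is still counted, which makes the result a useful baseline when comparing against a tokenizer-based matrix.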