How do I clean Twitter data in R?

Question

I used the twitteR package to extract tweets from Twitter and saved them to a text file.

I ran the following operations on the corpus:

xx <- tm_map(xx, removeNumbers, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, stripWhitespace, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, removePunctuation, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, strip_retweets, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, removeWords, stopwords("english"), lazy = TRUE, mc.cores = 1)

(I use mc.cores = 1 and lazy = TRUE because otherwise R on my Mac runs into errors.)

tdm<-TermDocumentMatrix(xx)

But this term-document matrix contains a lot of strange symbols, meaningless words, and the like. Suppose a tweet is

 RT @Foxtel: One man stands between us and annihilation: @IanZiering.
 Sharknado‚Äã 3: OH HELL NO! - July 23 on Foxtel @SyfyAU

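After cleaning the tweet, I want to be left with only proper, complete English words, i.e. a sentence or phrase stripped of everything else (usernames, shortened words, URLs).

Example: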

One man stands between us and annihilation oh hell no on

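(Note: the transformation commands in the tm package can only remove stop words, punctuation, and whitespace, and convert text to lowercase.)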

Answer 1

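Using gsub and the stringr package

I have found a partial solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.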

  clean_tweet = gsub("&amp", "", unclean_tweet)                        # HTML-escaped ampersands
  clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)  # retweet headers
  clean_tweet = gsub("@\\w+", "", clean_tweet)                        # screen names
  clean_tweet = gsub("[[:punct:]]", "", clean_tweet)                  # punctuation
  clean_tweet = gsub("[[:digit:]]", "", clean_tweet)                  # digits
  clean_tweet = gsub("http\\w+", "", clean_tweet)                     # URLs (punctuation is already gone)
  clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)                   # collapse runs of spaces or tabs into one space
  clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)                  # leading and trailing whitespace

Reference: (Hicks, 2014). After the above, I did the following.

library(stringr)

# Get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet, " {2,}", " ")
# Get rid of URLs (the original pattern "http://t.co/[a-z,A-Z,0-9]*{8}"
# is rejected by newer stringr versions; see Answer 3)
clean_tweet <- str_replace_all(clean_tweet, "https?://t\\.co/[A-Za-z0-9]*", "")
# Take out the retweet header; there is only one
clean_tweet <- str_replace(clean_tweet, "RT @\\w+: ", "")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet, "#\\w+", "")
# Get rid of references to other screen names
clean_tweet <- str_replace_all(clean_tweet, "@\\w+", "")

Reference: (Stanton, 2013)

Before doing any of the above, I collapsed all the tweets into one long character string using the following.

paste(mytweets, collapse=" ")

Unlike the tm_map transformations, this cleaning procedure worked well for me.
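As a rough sketch of how regex steps like these can be bundled into one reusable function, the base R snippet below strips URLs first (while "://" is still intact) and only then removes the remaining non-letters. The function name clean_tweets and the example URL are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper (name and step order are assumptions, not the answer's code).
clean_tweets <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)   # URLs first, before punctuation is touched
  x <- gsub("\\b(RT|via)\\b", " ", x, perl = TRUE)  # retweet markers
  x <- gsub("[@#]\\w+", " ", x, perl = TRUE)        # screen names and hashtags
  x <- gsub("&amp;?", " ", x, perl = TRUE)          # HTML-escaped ampersands
  x <- gsub("[^A-Za-z ]", " ", x, perl = TRUE)      # anything that is not an ASCII letter or a space
  x <- gsub(" +", " ", x, perl = TRUE)              # collapse runs of spaces
  trimws(x)                                         # drop leading and trailing spaces
}

# Sample tweet from the question, with a made-up short link appended.
tweet <- paste(
  "RT @Foxtel: One man stands between us and annihilation: @IanZiering.",
  "Sharknado 3: OH HELL NO! - July 23 on Foxtel @SyfyAU http://t.co/abc12345"
)
cleaned <- clean_tweets(tweet)
print(cleaned)
```

Because the function is vectorised over x, it can be applied to the whole vector of tweets before or after collapsing them with paste().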

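What I am left with now is a set of proper words plus a few improper ones. All that remains is to figure out how to remove the words that are not proper English. I will probably have to check my set of words against a dictionary of English words and drop whatever is not in it.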
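One possible way to do that dictionary check, sketched in base R: split the cleaned text into tokens and keep only those found in a word list. The tiny english_words vector and the function name keep_dictionary_words are placeholders for illustration; a real word list would have to be loaded from a dictionary file of your choice.

```r
# Hypothetical word list; in practice this would be read from a dictionary file.
english_words <- c("one", "man", "stands", "between", "us", "and",
                   "annihilation", "oh", "hell", "no", "on")

# Keep only the tokens that appear in the dictionary (comparison is lowercase).
keep_dictionary_words <- function(text, dictionary) {
  tokens <- strsplit(tolower(text), " +")[[1]]
  paste(tokens[tokens %in% dictionary], collapse = " ")
}

filtered <- keep_dictionary_words(
  "One man stands between us and annihilation Sharknado OH HELL NO July on Foxtel",
  english_words
)
print(filtered)
```

With this toy dictionary, names such as Sharknado and Foxtel are dropped because they are not in the list. How well this works in practice depends entirely on the word list used.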

Answer 2

To remove URLs, you could try the following:

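removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
# Recent versions of tm expect custom functions to be wrapped in content_transformer()
xx <- tm_map(xx, content_transformer(removeURL))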

You can probably define similar functions to transform the text further.
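For instance, a few more functions in the same style might look like the sketch below. The names removeMentions, removeHashtags, and squeezeSpaces and the sample string are made up for illustration. They are plain character functions, so they can be tested on their own before being handed to tm_map (wrapped in content_transformer() on recent versions of tm).

```r
# Hypothetical helpers modelled on removeURL; names are illustrative.
removeMentions <- function(x) gsub("@\\w+", "", x)               # @screen_name references
removeHashtags <- function(x) gsub("#\\w+", "", x)               # #hashtags
squeezeSpaces  <- function(x) trimws(gsub("\\s+", " ", x))       # tidy up leftover whitespace

# Made-up sample text to try the helpers on.
x <- "Great episode #Sharknado thanks @SyfyAU"
result <- squeezeSpaces(removeHashtags(removeMentions(x)))
print(result)
```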

Answer 3

For some reason, this code did not work for me:

# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")

The error was:

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement),  : 
 Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

So, instead, I used

clean_tweet4 <- str_replace_all(clean_tweet3, "https://t.co/[a-z,A-Z,0-9]*","")
clean_tweet5 <- str_replace_all(clean_tweet4, "http://t.co/[a-z,A-Z,0-9]*","")

to get rid of the URLs.
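The two calls can probably be folded into a single pattern, since "https?" matches both schemes. The sketch below uses base R's gsub so that it runs without extra packages; the same pattern string should also be usable with str_replace_all. The function name removeShortLinks and the sample links are invented for illustration.

```r
# Hypothetical helper: one pattern for both http and https t.co links.
# The dot is escaped so that it matches a literal "." only.
removeShortLinks <- function(x) gsub("https?://t\\.co/[A-Za-z0-9]*", "", x)

# Made-up sample strings.
x <- c("read this http://t.co/AbC123xyz now",
       "and this https://t.co/Zz9 too")
no_links <- removeShortLinks(x)
print(no_links)
```

The removed link leaves a double space behind, so a whitespace-collapsing step is still needed afterwards.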

Answer 4

This code does some basic cleaning.

Converting to lowercase

df <- tm_map(df, content_transformer(tolower))

Removing punctuation

df <- tm_map(df, removePunctuation)

Removing numbers

df <- tm_map(df, removeNumbers)

Removing common words

df <- tm_map(df, removeWords, stopwords('english'))

Removing URLs

removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
df <- tm_map(df, content_transformer(removeURL))
