Remove email IDs from a corpus

Question  Votes: 1  Answers: 2

I have a Vector Corpus in R. I want to remove every email ID that appears in that corpus. The email IDs can occur anywhere in the corpus. For example:

1> "Could you mail me the Company policy amendments at [email protected]. Thank you." 

2> "Please send me an invoice copy at [email protected]. Looking forward to your reply". 

So I want only the email IDs "[email protected]" and "[email protected]" to be removed from the corpus.

I tried using:

corpus <- tm_map(corpus,removeWords,"\w*gmail.com\b")
corpus <- tm_map(corpus,removeWords,"\w*yahoo.co.in\b")
r tm
2 Answers
5
votes

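The code below uses a regular expression pattern to remove email IDs from the corpus. I picked up the regex somewhere and cannot recall the source right now; I would have liked to credit it.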

# Sample data from which email ids need to be removed

text <- c("Could you mail me the Company policy amendments at [email protected]. Thank you.",
          "Please send me an invoice copy at [email protected]. Looking forward to your reply." )


# Function containing the regex pattern used to remove email IDs
library(stringr)
RemoveEmail <- function(x) {
  # Replace every email-like token with an empty string
  str_replace_all(x, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+", "")
}

library(tm)
corpus <- Corpus(VectorSource(text))                        # create the corpus
corpus <- tm_map(corpus, content_transformer(RemoveEmail))  # remove email IDs

#Printing the corpus
corpus[[1]]$content
corpus[[2]]$content
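
For anyone who would rather not depend on stringr, here is a minimal base-R sketch of the same idea. The function name `remove_email`, the sample addresses, and the slightly tightened pattern are my own assumptions for illustration, not part of the code above, and the pattern is nowhere near a full email validator. The one deliberate difference is that the domain part cannot end in a dot, so a sentence-ending period right after the address is left in place.

```r
# Base-R sketch: strip email-like tokens without stringr.
# Function name, pattern, and sample data are illustrative assumptions.
remove_email <- function(x) {
  # Local part, "@", then one or more dot-separated domain labels.
  # Each dot must be followed by at least one label character, so a
  # trailing sentence period is not swallowed by the match.
  pattern <- "[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+"
  gsub(pattern, "", x, perl = TRUE)
}

sample_text <- c("Mail the policy to jane.doe@example.com. Thank you.",
                 "Send the invoice to billing@example.co.in today.")
cleaned <- remove_email(sample_text)
print(cleaned)
```

It should plug into the corpus the same way, i.e. `tm_map(corpus, content_transformer(remove_email))`. Note that the whitespace around the removed address stays behind, so a follow-up `tm_map(corpus, stripWhitespace)` may be worth adding.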

0
votes

Remove all rows in R that have an invalid email in a particular column:

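# Keep only the rows whose Column value matches the email pattern (grepl does the regex match)
DF <- subset(DF, grepl("^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+$", Column))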
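
As a self-contained sketch of filtering rows by a regex match, assuming a hypothetical data frame `DF` with a character column named `Column` (the sample values, the `id` column, and the anchored pattern are made up for illustration):

```r
# Hypothetical sample data; column names and values are assumptions.
DF <- data.frame(
  id = 1:3,
  Column = c("jane.doe@example.com", "not-an-email", "billing@example.co.in"),
  stringsAsFactors = FALSE
)

# Anchored with ^ and $ so the whole cell must look like an email address.
email_pattern <- "^[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+$"

# grepl() returns TRUE for rows whose value matches the pattern.
valid <- grepl(email_pattern, DF$Column, perl = TRUE)

# Keep only the rows with a valid-looking email.
DF_valid <- DF[valid, , drop = FALSE]
print(DF_valid)
```

Keep in mind that `!=` compares literal strings, so a pattern has to go through a matching function such as `grepl()` to be treated as a regex. This only checks the shape of the address, not whether it actually exists.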