Reading a file and extracting text in R

Question (votes: 1, answers: 1)

I have the following data in a file:

     Message-ID: <123.juii@jkk>
        Date: Wed, 9 Mar 2002 16:12:51 -0800 (CST)
        From: [email protected]
        To: [email protected], [email protected], [email protected], 
            [email protected], [email protected]

        Subject: Sales details

    Please find attached the latest sales information
    let me know what you can do.

    Thanks,
    jLian

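I want to extract only the body of the email. I could not find any other way, so I tried keeping only the lines that do not contain a ":" character. But this results in: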

    [email protected], [email protected]
    Please find attached the latest sales information and
    let me know what you can do.

    Thanks,
    jLian

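Only the lines from the second one onward are message content; the first line is a continuation of the To: header that happens to contain no colon.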

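library("stringr")
# Open the file and print every line that does not contain ":"
rawData <- file("mail1", "r")
while (TRUE) {
  line <- readLines(rawData, n = 1)
  if (length(line) == 0) {
    break  # end of file reached
  }
  if (!str_detect(line, ":")) {
    print(line)
  }
}
close(rawData)  # release the connection when done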
r stringr
1 answer
0 votes

See if this works:

Data:

mail<-
'Message-ID: <123.juii@jkk>
    Date: Wed, 9 Mar 2002 16:12:51 -0800 (CST)
From: [email protected]
To: [email protected], [email protected], [email protected], 
[email protected], [email protected]

Subject: Sales details

Please find attached the latest sales information
let me know what you can do.

Thanks,
jLian'

Code:

cat(
sub(".*Subject:.*?\n\n","",mail)
)

Result:

#Please find attached the latest sales information
#let me know what you can do.

#Thanks,
#jLian
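
The question starts from a file, while the snippet above starts from a string. Here is a minimal sketch of bridging the two, assuming the file has the layout shown in the question (a blank line right after the Subject: header). The helper name `extract_body`, the temporary file and the example.com addresses are illustrative placeholders, not part of the original answer. The sketch passes `perl = TRUE` with the `(?s)` flag so that "." matching newlines is explicit rather than dependent on the default regex engine.

```r
# Hypothetical helper: read a whole mail file into one string and
# drop everything up to the blank line that follows "Subject:".
extract_body <- function(path) {
  lines <- readLines(path, warn = FALSE)
  mail <- paste(lines, collapse = "\n")
  # (?s) lets "." match newlines under PCRE; the greedy ".*" runs to the
  # last "Subject:", the lazy ".*?" stops at the first blank line after it.
  sub("(?s).*Subject:.*?\n\n", "", mail, perl = TRUE)
}

# Illustration with a temporary file standing in for "mail1".
tmp <- tempfile()
writeLines(c("Message-ID: <123.juii@jkk>",
             "From: sender@example.com",
             "To: a@example.com,",
             "    b@example.com",
             "",
             "Subject: Sales details",
             "",
             "Please find attached the latest sales information",
             "let me know what you can do.",
             "",
             "Thanks,",
             "jLian"), tmp)
body <- extract_body(tmp)
cat(body, "\n")
```

Because the greedy part runs to the last occurrence of "Subject:", a message body that itself contains that text would be cut too far; the sketch assumes that does not happen.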

To use my solution effectively, store each mail as a multi-line string that is one element of a list.

listOfMails <- list(mail, mail, mail) #as many as you have.

fun1 <- function(m) {
  sub(".*Subject:.*?\n\n", "", m)
}

onlyContent <- lapply(listOfMails, fun1)
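
If the mails are kept as line vectors (as `readLines` returns them) rather than single strings, the same idea can be sketched line by line, which is closer to the loop in the question. This is only a sketch under the same assumption that a blank line follows the Subject: header; `body_lines` and the sample data are hypothetical names introduced for illustration.

```r
# Hypothetical line-based variant: find the "Subject:" header and keep
# everything after the first blank line that follows it.
body_lines <- function(lines) {
  subj <- grep("^[[:space:]]*Subject:", lines)[1]
  if (is.na(subj)) return(character(0))    # no Subject header found
  rest <- lines[seq_along(lines) > subj]   # lines after the header
  blank <- which(trimws(rest) == "")[1]    # first blank line after it
  if (is.na(blank)) return(character(0))   # no blank separator found
  rest[seq_along(rest) > blank]
}

# Illustrative data: one mail as a character vector of lines.
oneMail <- c("From: sender@example.com",
             "To: a@example.com,",
             "    b@example.com",
             "",
             "Subject: Sales details",
             "",
             "Please find attached the latest sales information",
             "",
             "Thanks,")
listOfLineVectors <- list(oneMail, oneMail)
bodies <- lapply(listOfLineVectors, body_lines)
print(bodies[[1]])
```

Unlike the colon filter, this does not depend on whether a header continuation line or a body line happens to contain ":".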