Extract only the message body from emails stored in a text file

Question (votes: 0, answers: 2)

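I want to strip all the From, To, Cc, Subject and Sent header lines from this text document and keep only the message bodies, so that I can use them to summarize the document's content. What is the best way to do this in Python? I think it is best to do the extraction first and then apply preprocessing. The part of my code that uses get_payload and is_multipart is where I suspect things go wrong; it is not done correctly, so I have marked that part with comments and need help there. Any suggestions on how to do this would be very helpful.

The code and the .txt file are included below for reference.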

import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.summarization import summarize, keywords

# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        for files in filename:
            file = open(filename, 'r', encoding ='utf-8')
            filecontents = file.read()
            filecontents = re.sub(r'\s+', ' ', filecontents)
            print(filecontents)
            filecontents = filecontents.strip('\n')
            b = email.message_from_string(filecontents)# NEED
            if b.is_multipart():#HELP
                for payload in b.get_payload():#HERE
                    # if payload.is_multipart(): ...#SO
                    print (payload.get_payload())#COMMENTED
            else:#
                print (b.get_payload())#
            summary = summarize(filecontents, ratio =0.10)
            print(summary)
            kw = keywords(filecontents, words=15)
            print(kw)
            break
            #writer.writerow([file, summary, kw])
    except Exception as e:
        pass

Text file

 Stephanie /ANN

From: Mr.A,  <[email protected]>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE:  Holdings: XXXX SPA – mfno.1322

Dear Dr. Tim A. , 

The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other 
than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal 
of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any 
applications submitted. We will send an administrative filing issue letter for both the holder and the agent.  



Thank you! 

Regards, 
 Mr.A 
PRODUCT Master File 
CDER 



Currently, there is no requirement to submit or resubmit NAs in any electronic format.  However, starting May 5, 2018, 
new NAs, as well as any submissions to the existing NAs mANNt be submitted electronically in legal (electronic Common 
Technical Document) format specified by GROUP A in the legal guidance. NA submissions that are not submitted in legal 
format after this date may be subject to rejection. For more information please check the NA website 
www.GROUP A.gov/abc/bca 


This communication is an informal communication consistent with which represents my best judgment 
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the 
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication, 
including any attachments, is intended only for the person or entity to which it is addressed and may contain 
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities 
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the 
sender and delete the material from any computer. Thank you. 

From: [email protected] [mailto:[email protected]]  
Sent: Wednesday, July 25, 2018 2:10 PM 
To: Mr.A,  <[email protected]> 
Cc: [email protected] 
Subject: RE: Holdings: XXXX SPA ‐ dm 013383 

Dear , 


XXXX



2

Thanks for your phone call to clarify your needs and to understand the situation. I have confirmed that Xxxx only does 
direct bANNiness for test  S intermediate with b. and not with the other companies (e, 
x, etc.) that are secondary companies. Based on our discANNsion, I believe that we do not need to 
provide QAs for these secondary companies or mention them in our NA file as they would be covered under a 
separate QA  S.p.A. to them. If this is correct, then I believe you mentioned that we have two options as 
described below: 

Option 1: We can issue a separate QA for each . NA to be specific on which NA is being cross‐referenced 
to our NA 13383. 

Option 2: We can do a single QA for  and mention that they can cross‐reference any of their NAs. This 
would allow them to cross‐reference any of their 

If I have misunderstood or am incorrect in my response and we need to discANNs further, please let me know. 

If not, when you issue your request, can you please send to me and May Abd by email? 

Kind regards. 

Tim 

Tim A. , BsC 
Director, YY SERVICES) 
Xxxx ANN 
Phone/FAX: 2312333 
Cell: 23312123131 
Email: [email protected] 



From: , Tim /ANN  
Sent: Monday, July 23, 2018 7:05 AM 
To: 'Mr.A, ' 
Cc: Abd, May /ANN 
Subject: RE: [EXTERNAL] Holder: XXXX SPA - NA 013383 

Dear , 

May is now on vacation and I am covering for her during her absence. Is there a good time to call you today or later this 
week? Please let me know and we can schedule or please call my cell phone 21313131231 at your convenience. 

Kind regards. 

Tim 

Tim A. , MSC 
Director, PQR 
Xxxx 
Phone/FAX: 2312313313 
Cell: 3142342424 
Email: [email protected] 



XXXX



3


‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐ 
From: "Mr.A, " <[email protected]> 
Date: Jul 20, 2018 9:01 AM 
Subject: [EXTERNAL] Holder: XXXX SPA ‐ NA 013383 
To: "TRETE/ANN" <[email protected]> 
Cc: "mno.com> 

Dear May Abd, 

. I need to talk to you on this.  

Thank you! 

Regards, 
 Mr.A 
PRODUCT Master File 
CDER 


Currently, there is no requirement to submit or resubmit NAs in any electronic format.   
format after this date may be subject to rejection. For more information please check the NA website 
www.GROUP A./cder/NA   


This communication is an informal communication  which represents my best judgment 
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the 
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication, 
including any attachments, is intended only for the person or entity to which it is addressed and may contain 
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities 
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the 
sender and delete the material from any computer. Thank you. 


XXXX
python email summarization document-body
2 Answers
1 vote

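It is not clear which part of the code you need help with, what you want it to do instead of what it currently does, or how the results should be passed on for further processing.

However, I will note that your code has a number of problems.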

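  • You cannot read an email message as UTF-8 text. Regardless of the file extension, an RFC822 message is just a sequence of bytes. Legacy email can use a large number of different encodings, and if you try to force it into UTF-8 you will run into UnicodeDecodeError and other obstacles.
  • As always, a blanket except Exception: is a major bug. Perhaps you only put it there for debugging, but it actually makes debugging harder.
  • A typical modern email message has a complex MIME body structure, which you have to analyze in context before deciding which part you actually want to process. A common arrangement is multipart/alternative, where the same message is provided in different formats so that the recipient can choose whether to render it as HTML, plain text, or occasionally perhaps PDF, RTF, a single image, or whatever else, depending on the application. In addition, the HTML alternative frequently has multiple parts of its own, because the main HTML wants to pull in small images supplied in the MIME structure (company logos, animated emoji, and other insults to the reader). See also What are the "parts" in a multipart email?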
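
To see why the structure has to be inspected before picking a part, here is a minimal sketch that lists the content type of every part in a message. The function names and the sample message built by build_sample() are made up for illustration; a message exported from a real mail client will usually have a deeper tree.

```python
from email import message_from_bytes
from email.message import EmailMessage


def describe_parts(raw_bytes):
    """Return a (depth, content_type) tuple for every part of a raw message."""
    msg = message_from_bytes(raw_bytes)
    parts = []

    def visit(part, depth):
        parts.append((depth, part.get_content_type()))
        if part.is_multipart():
            # For multipart containers, get_payload() is a list of sub-parts
            for child in part.get_payload():
                visit(child, depth + 1)

    visit(msg, 0)
    return parts


def build_sample():
    """Build a hypothetical message with plain-text and HTML alternatives."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "Demo"
    msg.set_content("Plain text version.")
    msg.add_alternative("<p>HTML version.</p>", subtype="html")
    return msg.as_bytes()


if __name__ == "__main__":
    for depth, ctype in describe_parts(build_sample()):
        print("  " * depth + ctype)
```

For the sample message this prints multipart/alternative followed by its two children, text/plain and text/html.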

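Another complication for this answer is that Python's email library went through an overhaul fairly recently. The new functionality was introduced on a provisional basis in Python 3.3, but only became the documented, preferred API in 3.6. Most of the code you find in the wild uses the pre-3.6 facilities, but going forward you probably want to target the new and improved API.

With the legacy API, your code might look like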

import email
import glob
import os

# dirs, summarize and keywords are defined as in the question's code
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    # Not useful; we already have a filename
    #for files in filename:
    # Open in binary mode, don't try to guess encoding
    # Use a context manager so we don't leave the file open
    with open(filename, 'rb') as file:
        # Just let the email library take it from here
        #filecontents = file.read()
        #filecontents = re.sub(r'\s+', ' ', filecontents)
        #print(filecontents)
        #filecontents = filecontents.strip('\n')
        b = email.message_from_binary_file(file)
    main_part = None
    if b.is_multipart():
        # There are a number of things you could do to pick out
        # one or more payloads for analysis, but let's just take
        # the first text/plain part and call it "main_part"
        for part in b.walk():
            if part.get_content_type() == 'text/plain':
                main_part = part.get_payload()
                break
    else:
        main_part = b.get_payload()
    if main_part is None:
        # No text/plain part found; nothing to summarize
        continue
    summary = summarize(main_part, ratio =0.10)
    print(summary)
    kw = keywords(main_part, words=15)
    print(kw)

To use the new 3.6+ API, you would adapt this as follows:

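from email.policy import default as default_email_policy
...
    b = email.message_from_binary_file(file, policy=default_email_policy)
    # get_body() returns a message part (or None), not a string;
    # call .get_content() on a text part to get its text
    main_part = b.get_body(['related', 'plain', 'html'])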

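This yields a new email.message.EmailMessage object, which has some methods the legacy email.message.Message class lacks and which behaves differently in places. The documentation suggests that the default policy may someday be passed by default, at which point old code will switch to the new behavior (possibly with some unpleasant surprises and outright breakage).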

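Note also that the get_body() method is new in 3.6 and lets you easily pick out the "probable main part"; however, if no text/plain part is available, the code above falls back to HTML, which you then need to process further to extract the actual text (perhaps look at BeautifulSoup?).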
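
As a rough sketch of that fallback, the function below prefers the plain-text body and otherwise strips tags from the HTML body. To stay self-contained it uses the standard library's html.parser rather than BeautifulSoup, which is a cruder substitute: it keeps text nodes only and does not skip script or style content. All function and class names here are hypothetical.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default as default_email_policy
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes; a crude stand-in for a real HTML-to-text tool."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_body_text(raw_bytes):
    """Return the main body as a str, or '' if no text body is found."""
    msg = message_from_bytes(raw_bytes, policy=default_email_policy)
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    content = body.get_content()  # decoded str for text/* parts
    if body.get_content_type() == "text/html":
        collector = _TextCollector()
        collector.feed(content)
        collector.close()
        # Join the text nodes and collapse runs of whitespace
        content = " ".join(" ".join(collector.chunks).split())
    return content.strip()


def build_html_only_sample():
    """Build a hypothetical message that only has an HTML body."""
    msg = EmailMessage()
    msg["Subject"] = "Demo"
    msg.set_content("<p>Hello <b>world</b></p>", subtype="html")
    return msg.as_bytes()


if __name__ == "__main__":
    print(extract_body_text(build_html_only_sample()))
```

Leaving 'related' out of the preference list is a deliberate simplification here, so that the returned part is always a text part on which get_content() yields a string.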

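There is no technical, robust, reliable way to separate boilerplate (headers, signatures, and so on) from the actual content of an email message. Some HTML email clients may provide hints in the generated message about which <div>s contain user-typed content, but in the general case you will simply be up to your eyebrows in (frankly, desperate) heuristics.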
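
If the input is a printed or copy-pasted thread like the sample in the question rather than a raw message, one such heuristic is to treat every line starting with "From:" as the beginning of a new message and to drop the header lines that directly follow it. This is only a sketch under that assumption: the header names are taken from the sample, split_thread is a made-up name, and a body line that happens to start with "From:" or a header that wraps onto a second line will confuse it.

```python
import re

# Header labels seen in the sample thread; extend as needed
HEADER_LINE = re.compile(r"^(From|Sent|Date|To|Cc|Subject):", re.IGNORECASE)


def split_thread(text):
    """Split a printed thread into message bodies at each 'From:' line."""
    bodies = []
    current = None      # lines of the message being collected
    in_headers = False  # True while still inside a header block
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("from:"):
            if current is not None:
                bodies.append("\n".join(current).strip())
            current = []
            in_headers = True
            continue
        if current is None:
            continue  # ignore anything before the first 'From:' line
        if in_headers:
            if HEADER_LINE.match(stripped):
                continue  # still a header line; drop it
            in_headers = False  # first non-header line ends the block
        current.append(line.rstrip())
    if current is not None:
        bodies.append("\n".join(current).strip())
    return bodies


if __name__ == "__main__":
    sample = (
        "Name\n\n"
        "From: A\nSent: Monday\nTo: B\nSubject: Hi\n\n"
        "Dear B,\n\nFirst body.\n\n"
        "From: B\nSent: Sunday\nTo: A\nSubject: Hi\n\n"
        "Second body.\n"
    )
    for body in split_thread(sample):
        print(repr(body))
```

Each returned body can then be cleaned further (greetings, sign-offs, disclaimers) before it is summarized.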


0 votes

If you only want to remove the "From", "Sent", "To", "Cc", "Subject" and "Forwarded message" lines from the email, you can use a regular expression.

import re

# Compile the pattern once, outside the loop
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Forwarded message).*)', re.IGNORECASE)

with open('email_input.txt', 'r') as infile:
    for line in infile:
        line = line.strip()
        # Print the lines that the pattern identifies as headers
        if email_component.search(line):
            print(line)

# output
# ‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
# From: Mr.A,  <[email protected]>
# Sent: Wednesday, July 25, 2018 2:27 PM
# To: , Tim /ANN; Abd, May /ANN
# Cc: Mr.A, ; Theoder Jerry,
# Subject: [EXTERNAL] RE:  Holdings: XXXX SPA – mfno.1322

As for removing the content after "Regards": I did not add that to my regex, because an email can be signed off in many ways. Here are some of the most common:

Best,
Best regards,
Best wishes,
Fond regards,
Kind regards,
Regards,
Sincerely,
Sincerely yours,
Thank you,
With appreciation,
With gratitude,
Yours sincerely,  
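
A list like this can still be turned into a rough cut-off rule: stop at the first line that consists of nothing but one of these closings. The sketch below assumes that everything after such a line is signature or disclaimer text, which will not hold for every email. The names CLOSINGS and strip_after_closing are made up, and "Thank you" is matched with optional trailing punctuation so that the sample's "Thank you!" is caught as well.

```python
import re

# Closing phrases from the list above; extend as needed
CLOSINGS = (
    "Best", "Best regards", "Best wishes", "Fond regards", "Kind regards",
    "Regards", "Sincerely", "Sincerely yours", "Thank you",
    "With appreciation", "With gratitude", "Yours sincerely",
)

# A line that is only a closing phrase, optionally followed by , . or !
CLOSING_LINE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(c) for c in CLOSINGS) + r")[,.!]?\s*$",
    re.IGNORECASE,
)


def strip_after_closing(body):
    """Return the text of body up to the first closing line."""
    kept = []
    for line in body.splitlines():
        if CLOSING_LINE.match(line):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


if __name__ == "__main__":
    sample = "The option-2 is fine.\n\nThank you!\n\nRegards,\n Mr.A\nCDER"
    print(strip_after_closing(sample))
```

Because the pattern is anchored to the whole line, a sentence such as "Best practice is to resubmit." is left alone.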

Updated answer

The updated answer below cleans up more of your email input, but the result still needs further cleaning.

import re

with open('email_input.txt', 'r') as infile:
    lines = infile.readlines()

# Strip surrounding whitespace, including the trailing newlines
no_new_lines = [i.strip() for i in lines]

# regex to catch header lines
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Date:|Forwarded message).*)', re.IGNORECASE)
remove_headers = [x for x in no_new_lines if not email_component.search(x)]

# regex to catch greeting lines
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)
remove_greeting = [x for x in remove_headers if not greeting_component.search(x)]

# regex to catch lines with contact details
contact_component = re.compile(r'(Phone.*:)|(Cell:.*)|(Email:.*)', re.IGNORECASE)
remove_contacts = [x for x in remove_greeting if not contact_component.search(x)]

# regex to catch closing (sign-off) lines
email_salutation_component = re.compile(r'Best,(.*?)|Best regards,(.*?)|Best wishes,(.*?)|Fond regards,(.*?)|'
                                        r'Kind regards(.*?)|Regards,(.*?)|Sincerely,(.*?)|Sincerely yours,(.*?)|'
                                        r'Thank you,(.*?)|With appreciation,(.*?)|Yours sincerely,(.*?)', re.IGNORECASE)

remove_salutations = [x for x in remove_contacts if not email_salutation_component.search(x)]

# do something else

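Updated answer two

The updated answer below uses Python's email library. My input file was a raw email exported from my email client. With the code below, I was able to extract the body of every email I tried. I also tested the gensim module, and it works fine.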

import email
from gensim.summarization import summarize, keywords

with open('email_input.txt', 'r') as input:
  email_body = ''
  raw_message = input.read()

  # Return a message object structure from a string
  msg = email.message_from_string(raw_message)

  # iterate over all the parts and subparts of a message object tree
  for part in msg.walk():

    # Return the message’s content type.
    if part.get_content_type() == 'text/plain':
        email_body = part.get_payload()

  summary = summarize(email_body, ratio=0.10)
  print(summary)

  kw = keywords(email_body, words=15)
  print(kw)

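Final answer

This is my final answer to this question. Hopefully one of these four answers fits your requirements.

You will have to do some minor cleanup of the output, because I don't know all of your requirements.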

import re

# Compile the greeting pattern once, outside the loop
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)

with open('email_input.txt') as infile:
    # Boolean state variable to keep track of whether we want to be printing lines or not
    lines_to_keep = False
    for line in infile:

        # Look for lines that start with a greeting
        if line.startswith("Dear"):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True

        # Look for lines that start with a closing
        elif line.startswith("Regards") or line.startswith("Kind regards"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False

        # Print captured lines, skipping the greeting line itself
        if lines_to_keep and not greeting_component.match(line):
            print(line.rstrip('\n'))

# output
# The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
#
# more here....