Regex to remove names, addresses, and designations from email text in Python

Question (votes: -3, answers: 1)

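I have sample email text like the one below. I want to keep only the body of each message and remove names, addresses, designations, company names, and email addresses. To be clear, I only want the content of each email between Dear / Hi / Hello and Sincerely / Regards / Thanks. How can I do this efficiently with a regular expression or some other approach?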

Subject: [EXTERNAL] RE: QUERY regarding supplement 73

Hi Roger,

Yes, an extension until June 22, 2018 is acceptable.

Regards, 
Loren


Subject: [EXTERNAL] RE: QUERY regarding supplement 73

Dear Loren, 
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.


Best Regards,
Mr. Roger
Global Director
[email protected]
78 Ford st.





Subject: [EXTERNAL] RE: QUERY regarding supplement 73

responding by June 15, 2018.check email for updates

Hello,
John Doe 
Senior Director
[email protected]




Subject: [EXTERNAL] RE: QUERY regarding supplement 73


Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.

Feel free to contact me with any questions.

Warm Regards,
Mr. Roger
Global Director
[email protected]
78 Ford st.


Center for Research
Office of New Discoveries
Food and Drug Administration 
[email protected]

From this text, I want only the following output:

    Subject: [EXTERNAL] RE: QUERY regarding supplement 73
    Yes, an extension until June 22, 2018 is acceptable.
    We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
    responding by June 15, 2018.check email for updates
    Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
    symptom. We are currently reviewing your supplements and have
    made additional edits to your label.
    Feel free to contact me with any questions.
Tags: python regex email regex-group

1 Answer (votes: 1)

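Here is an answer that works for your current input. The code will have to be adjusted if you process examples that fall outside the markers listed in the code below.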

import re

# Regex to catch greeting lines such as "dear loren," or "hi roger,"
greeting_component = re.compile(r'(dear|hi)\b.*,', re.IGNORECASE)

with open('email_input.txt') as infile:
    # Read until EOF
    lines = infile.readlines()

# List to store the cleaned lines
clean_lines = []

# Strip surrounding whitespace, including the trailing newline
no_new_lines = [i.strip() for i in lines]

# Convert the input to all lowercase
lowercase_lines = [i.lower() for i in no_new_lines]

# Boolean state variable to keep track of whether we want to keep lines or not
lines_to_keep = False

for line in lowercase_lines:
    # A subject line marks the start of an email: start capturing lines
    if line.startswith('subject: [external]'):
        lines_to_keep = True

    # A sign-off line (or the stray "hello," in the sample) ends the body: stop capturing
    elif line.startswith(("regards,", "warm regards,", "best regards,", "hello,")):
        lines_to_keep = False

    if lines_to_keep:
        # Skip greeting lines and lines that were already captured
        if not greeting_component.match(line) and line not in clean_lines:
            clean_lines.append(line)

for item in clean_lines:
    print(item)

    # output 
    subject: [external] re: query regarding supplement 73

    yes, an extension until june 22, 2018 is acceptable.
    we had initial discussion with the abc team us know if you would be able to 
    extend the response due date to june 22, 2018.
    responding by june 15, 2018.check email for updates
    please refer to your january 12, 2018 data containing labeling supplements 
    to add text regarding this symptom. we are currently reviewing your 
    supplements and have made additional edits to your label.
    feel free to contact me with any questions.
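
If the original casing should be preserved, as in the desired output in the question, one possible variant is sketched below. It keeps the same start/stop idea as the code above, but moves the markers into regular expressions and works on a string instead of a file. The names `extract_bodies`, `SUBJECT`, `GREETING`, and `STOP` are illustrative assumptions, not part of the original answer, and the patterns cover only the markers that occur in this sample.

```python
import re

# Hypothetical patterns mirroring the markers used in the answer above.
# Extend them (e.g. "sincerely", "thanks") for other inputs.
SUBJECT = re.compile(r'subject:', re.IGNORECASE)
GREETING = re.compile(r'(dear|hi)\b.*,$', re.IGNORECASE)
STOP = re.compile(r'((best|warm)\s+)?regards,|hello,$', re.IGNORECASE)


def extract_bodies(text):
    """Return subject and body lines in order, keeping the original casing."""
    kept = []
    seen = set()
    keeping = False
    for raw in text.splitlines():
        line = raw.strip()
        if SUBJECT.match(line):
            # A subject line starts a new email: begin capturing.
            keeping = True
        elif STOP.match(line):
            # A sign-off line ends the body: stop capturing.
            keeping = False
        # Skip lines outside a body, blank lines, and greeting lines.
        if not keeping or not line or GREETING.match(line):
            continue
        # Keep each distinct line only once (the subject repeats per email).
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept
```

Calling `print("\n".join(extract_bodies(text)))` on the sample text prints the subject line once, followed by the body lines. Unlike the code above, this variant drops blank lines instead of keeping the first one, and it still treats "Hello," as a stop marker because that is how the third sample email is laid out.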