Python regex fails to extract text and URLs from an email body


For example, I have a mail in my Outlook folder that contains a subject and a lot of Japanese text and URLs, as shown below.

01 事務用品・機器

大阪府警察大正警察署:指サック等の購入   :大阪市大正区

https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042214

01 事務用品・機器

府立学校大阪わかば高等学校:校内衛生用品7件 ★ :大阪市生野区

https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350041978

01 事務用品・機器

府立学校工芸高等学校:イレパネ 他 購入   :大阪市阿倍野区

https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042117

I want to search for a parent keyword and a list of child keywords that I have configured in a JSON config file, as shown below.

{
  "folder_name": "調達プロジェクト",
  "output_file_path": "E:\\output",
  "output_file_name": "output.txt",
  "parent_keyword": "meeting",
  "child_keywords": ["土木一式工事", "産業用機器", "事務用品・機器"]
}

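Now I am trying to find mails that contain these parent and child keywords, and I want to create a text file with the matched keywords and the information associated with them (the linked text and URLs). For example, for the mail above, if a keyword matches the 通信用機器 keyword, I have to extract the text and URLs below it, that is, the text and URLs associated with that keyword (and likewise for the other matched keywords), as follows.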

keyword: matched keyword
Paragraph text: text associated with the keyword 
Urls: urls associated with the keyword

This is what I tried in Python.

import win32com.client
import os
import json
import logging
import re

def read_config(config_file):
    with open(config_file, 'r', encoding="utf-8") as f:
        config = json.load(f)
    return config

def search_and_save_email(config):
    try:
        folder_name = config.get("folder_name", "")
        output_file_path = config.get("output_file_path", "")
        parent_keyword = config.get("parent_keyword", "")
        child_keywords = config.get("child_keywords", [])

        # Ensure the directory exists
        os.makedirs(output_file_path, exist_ok=True)

        outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
        inbox = outlook.GetDefaultFolder(6)

        # Find the user-created folder within the Inbox
        user_folder = None
        for folder in inbox.Folders:
            if folder.Name == folder_name:
                user_folder = folder
                break

        if user_folder is not None:
            # Search for emails with the parent keyword anywhere in the subject
            parent_keyword_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, parent_keyword.split())) + r')\b', re.IGNORECASE)
            for item in user_folder.Items:
                if parent_keyword_pattern.findall(item.Subject):
                    logging.info(f"Found parent keyword in Subject: {item.Subject}")
                    # Parent keyword found, now search for child keywords in the body
                    body_lower = item.Body.lower()

                    # Initialize output_text outside the child keywords loop
                    output_text = ""

                    for child_keyword in child_keywords:
                        # Search for child keyword in the body using regular expression
                        child_keyword_pattern = re.compile(re.escape(child_keyword), re.IGNORECASE)
                        matches = child_keyword_pattern.finditer(body_lower)

                        for match in matches:
                            logging.info(f"Found child keyword '{child_keyword}' at position {match.start()}-{match.end()}")
                            # Extract the paragraph around the matched position
                            paragraph_start = body_lower.rfind('\n', 0, match.start())
                            paragraph_end = body_lower.find('\n', match.end())
                            paragraph_text = item.Body[paragraph_start + 1:paragraph_end]

                            # Extract URLs from the paragraph using a simple pattern
                            url_pattern = re.compile(r'http[s]?://\S+')
                            urls = url_pattern.findall(paragraph_text)

                            # Append the results to the output_text
                            output_text += f"Child Keyword: {child_keyword}\n"
                            output_text += f"Paragraph Text: {paragraph_text}\n"
                            output_text += f"URLs: {', '.join(urls)}\n\n"

                    # Save the result to a text file
                    output_file = os.path.join(output_file_path, f"{item.Subject.replace(' ', '_')}.txt")
                    with open(output_file, 'w', encoding='utf-8') as f:
                        f.write(output_text)

                    logging.info(f"Saved results to {output_file}")
                else:
                    logging.warning(f"Child keywords not found in folder '{folder_name}'.")

        else:
            logging.warning(f"Folder '{folder_name}' not found.")
    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    # Set up logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    # Specify the path to the configuration file
    config_file_path = "E:\\config2.json"

    # Read configuration from the file
    config = read_config(config_file_path)

    # Search and save email based on the configuration
    search_and_save_email(config)

Unfortunately, the code only gives me the matched keywords, not any of the text and URLs associated with them. My output text file looks like this:

Child Keyword: 土木一式工事

Paragraph Text:  土木一式工事

URLs: 
 
Child Keyword: 産業用機器

Paragraph Text:  19 産業用機器

URLs: 
 
Child Keyword: 産業用機器

Paragraph Text:  19 産業用機器

URLs:

I am pretty sure the problem is in my logic and in the regular expression, but I need some help figuring it out. Sorry for the long post.

python regex web-scraping full-text-search python-re
1 Answer (0 votes)

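The problem is that you are looking for URLs in paragraph_text, which is only the text enclosed by the newlines nearest to the child_keyword you found. That is just a line like 01 事務用品・機器, with no URL in it.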
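
A minimal reproduction of that behaviour, assuming a body shaped like the one in the question (the `body` string and the `example.com` URL are made up for illustration):

```python
import re

# Hypothetical sample shaped like the email body in the question.
body = (
    "01 事務用品・機器\n"
    "\n"
    "大阪府警察大正警察署 指サック等の購入\n"
    "\n"
    "https://example.com/item?id=1\n"
)

match = re.search(re.escape("事務用品・機器"), body)

# Same slicing as in the question: nearest newline before and after the match.
paragraph_start = body.rfind("\n", 0, match.start())
paragraph_end = body.find("\n", match.end())
paragraph_text = body[paragraph_start + 1:paragraph_end]

urls = re.findall(r"https?://\S+", paragraph_text)

print(repr(paragraph_text))  # only the keyword's own line
print(urls)                  # empty: the URL sits further down in the body
```

The slice never reaches past the heading line, so the URL pattern has nothing to match.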

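Instead, you can use a regex pattern with an alternation of the child keywords that captures the paragraph text containing the keyword, the keyword itself, and the URL that follows: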

rf'([^\n]*({"|".join(map(re.escape, child_keywords))})[^\n]*).*?\b(https?://\S+)'

Like this:

for paragraph_text, child_keyword, url in re.findall(
    rf'([^\n]*({"|".join(map(re.escape, child_keywords))})[^\n]*).*?\b(https?://\S+)',
    body_lower,
    re.S
):
    print(f'{paragraph_text=}', f'{child_keyword=}', f'{url=}', sep='\n')

Output:

paragraph_text='01 事務用品・機器'
child_keyword='事務用品・機器'
url='https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042214'
paragraph_text='01 事務用品・機器'
child_keyword='事務用品・機器'
url='https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350041978'
paragraph_text='01 事務用品・機器'
child_keyword='事務用品・機器'
url='https://www.e-nyusatsu.pref.osaka.jp/CALS/Publish/EbController?Shori=SmallKokokuInfo&open_kokoku=01202350042117'

Demo: https://ideone.com/wmuXhH
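
If the description line between the keyword and the URL is also needed, or a keyword can be followed by more than one URL, a line-by-line pass may be easier to follow than a single pattern. The following is only a sketch under assumptions not stated in the question: each record is a keyword line, then description lines, then one or more URL lines, and a record ends at the first non-URL line after its URLs. The names `extract_records`, `format_record` and `sample_body` are hypothetical, and the `example.com` URLs are placeholders. It scans the original body rather than a lowercased copy, so the URLs keep their case.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def extract_records(body, child_keywords):
    """Return one dict per keyword line: keyword, text lines, URLs."""
    if not child_keywords:
        return []
    keyword_re = re.compile("|".join(map(re.escape, child_keywords)))
    records = []
    current = None  # record being filled, or None when outside a record
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        urls = URL_RE.findall(line)
        heading = keyword_re.search(line)
        if heading and not urls:
            # A line containing a configured keyword starts a new record.
            current = {"keyword": heading.group(), "text": [line], "urls": []}
            records.append(current)
        elif current is None:
            continue  # text that belongs to no configured keyword
        elif urls:
            current["urls"].extend(urls)
        elif current["urls"]:
            current = None  # non-URL text after the URLs ends the record
        else:
            current["text"].append(line)
    return records


def format_record(record):
    """Render a record in the layout asked for in the question."""
    return (
        f"keyword: {record['keyword']}\n"
        f"Paragraph text: {' / '.join(record['text'])}\n"
        f"Urls: {', '.join(record['urls'])}\n"
    )


# Made-up body: two configured categories around one unconfigured one.
sample_body = (
    "01 事務用品・機器\n\nItem A description\n\nhttps://example.com/Info?id=1\n\n"
    "99 unconfigured category\n\nItem B description\n\nhttps://example.com/Info?id=2\n\n"
    "19 産業用機器\n\nItem C description\n\nhttps://example.com/Info?id=3\n"
)
records = extract_records(sample_body, ["土木一式工事", "産業用機器", "事務用品・機器"])
for record in records:
    print(format_record(record))
```

One limitation of this sketch: a description line that happens to contain a child keyword is treated as the start of a new record.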
