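How to parse an EML file and extract its metadata

Question  Votes: 0  Answers: 2

I have an EML file with some attachments. I want to read the text content of the EML file and extract metadata such as sender, from, cc, bcc, and subject. I also want to download the attachments. With the code below, I can only extract the text content of the email body.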

import email
from email import policy
from email.parser import BytesParser
import glob
file_list = glob.glob('*.eml') # returns list of files
with open(file_list[2], 'rb') as fp:  # select a specific email file from the list
    msg = BytesParser(policy=policy.default).parse(fp)
text = msg.get_body(preferencelist=('plain',)).get_content()  # note the trailing comma: a one-element tuple
print(text)

There is a Python 2 module named emaildata that does this job.

Extracting metadata

import email
from emaildata.metadata import MetaData

message = email.message_from_file(open('message.eml'))
extractor = MetaData(message)
data = extractor.to_dict()
print data.keys()

Extracting attachments

import email
from emaildata.attachment import Attachment

message = email.message_from_file(open('message.eml'))
for content, filename, mimetype, message in Attachment.extract(message):
    print filename
    with open(filename, 'w') as stream:
        stream.write(content)
    # If message is not None then it is an instance of email.message.Message
    if message:
        print "The file {0} is a message with attachments.".format(filename)

But this library is now deprecated. Is there another library that can extract the metadata and attachment information?

python-3.x metadata email-attachments eml
2 Answers
8
votes

The metadata can be accessed in Python 3.x with the following code:

from email import policy
from email.parser import BytesParser

eml_file = 'message.eml'  # path to your .eml file
with open(eml_file, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)

print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])
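
To collect the fields named in the question (sender, from, cc, bcc, subject) in one place, something like the following may help. It is only a sketch: `extract_metadata` is a name I made up, and the sample message is invented so the snippet runs without a file on disk.

```python
from email import message_from_bytes, policy
from email.utils import getaddresses

def extract_metadata(msg):
    """Collect common header fields from a parsed message into a dict."""
    def addresses(name):
        # get_all() returns every occurrence of the header; getaddresses()
        # splits the values into (display name, address) pairs.
        values = [str(v) for v in msg.get_all(name, [])]
        return [addr for _, addr in getaddresses(values) if addr]

    return {
        'sender': addresses('sender'),
        'from': addresses('from'),
        'to': addresses('to'),
        'cc': addresses('cc'),
        'bcc': addresses('bcc'),
        'subject': str(msg.get('subject', '')),
        'date': str(msg.get('date', '')),
    }

# Invented sample message, used here instead of reading an .eml file.
raw = (b"From: Alice <alice@example.com>\r\n"
       b"To: Bob <bob@example.com>, carol@example.com\r\n"
       b"Cc: dave@example.com\r\n"
       b"Subject: Quarterly report\r\n"
       b"\r\n"
       b"Body text\r\n")
sample = message_from_bytes(raw, policy=policy.default)
metadata = extract_metadata(sample)
print(metadata)
```

Missing headers simply come back as empty lists or empty strings. Bcc in particular is usually stripped before delivery, so expect it to be empty for most received mail.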

Use

msg.keys()

to list the names of the remaining headers.
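
If you want the values as well as the names, `msg.items()` yields one pair per header line. A minimal sketch, assuming you want repeated headers such as Received kept as lists (`headers_to_dict` is a hypothetical helper and the sample message is invented):

```python
from collections import defaultdict
from email import message_from_bytes, policy

def headers_to_dict(msg):
    """Map each lower-cased header name to the list of its values."""
    headers = defaultdict(list)
    # items() yields one (name, value) pair per header line, so headers
    # that occur more than once are all preserved.
    for name, value in msg.items():
        headers[name.lower()].append(str(value))
    return dict(headers)

# Invented sample message with a repeated Received header.
raw = (b"Received: from a.example.com\r\n"
       b"Received: from b.example.com\r\n"
       b"From: alice@example.com\r\n"
       b"Subject: Hello\r\n"
       b"\r\n"
       b"Body\r\n")
all_headers = headers_to_dict(message_from_bytes(raw, policy=policy.default))
for name, values in all_headers.items():
    print(name, values)
```

Note that `msg['received']` would only return the first occurrence, which is why the sketch goes through `items()` instead.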

To download the attachments from an eml file, you can use the following code:

import os
from email import policy
from email.parser import BytesParser

eml_mail = 'your eml file'
output_dir = 'directory where the attachments should be saved'

def parse_message(filename):
    # Read in binary mode so the parser handles the message's own encodings.
    with open(filename, 'rb') as f:
        return BytesParser(policy=policy.default).parse(f)

def find_attachments(message):
    """
    Return a list of (parsed content-disposition dict, message part)
    tuples, one for each attachment found.
    """
    found = []
    for part in message.walk():
        if 'content-disposition' not in part:
            continue
        cdisp = part['content-disposition'].split(';')
        cdisp = [x.strip() for x in cdisp]
        if cdisp[0].lower() != 'attachment':
            continue
        parsed = {}
        for kv in cdisp[1:]:
            if '=' not in kv:
                continue  # skip empty or malformed parameters
            key, val = kv.split('=', 1)  # the value itself may contain '='
            key = key.strip().lower()
            val = val.strip().strip('"\'')  # drop surrounding quotes
            parsed[key] = val
        found.append((parsed, part))
    return found

def run(eml_filename, output_dir):
    msg = parse_message(eml_filename)
    attachments = find_attachments(msg)
    print("Found {0} attachments...".format(len(attachments)))
    os.makedirs(output_dir, exist_ok=True)
    for cdisp, part in attachments:
        # Keep only the final path component so a crafted name such as
        # "../../etc/passwd" cannot escape output_dir.
        cdisp_filename = os.path.basename(cdisp.get('filename', '').replace('\\', '/'))
        if cdisp_filename in ('', '.', '..'):
            continue  # no usable filename in the Content-Disposition header
        towrite = os.path.join(output_dir, cdisp_filename)
        print("Writing " + towrite)
        data = part.get_payload(decode=True)
        if data is None:
            continue  # multipart parts carry no payload of their own
        with open(towrite, 'wb') as fp:
            fp.write(data)


run(eml_mail, output_dir)
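
As a shorter alternative, messages parsed with `policy.default` expose `iter_attachments()` and `get_filename()`, so the Content-Disposition header does not have to be split by hand. This is a rough sketch rather than a drop-in replacement: `save_attachments` and the fallback file name are my own inventions, and the demo message is built in memory with made-up contents.

```python
import os
import tempfile
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

def save_attachments(msg, output_dir):
    """Write every attachment of msg into output_dir; return the paths written."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the last path component of the declared filename.
        name = os.path.basename((part.get_filename() or '').replace('\\', '/'))
        if name in ('', '.', '..'):
            name = 'attachment-{0}.bin'.format(index)  # hypothetical fallback name
        data = part.get_payload(decode=True)
        if data is None:
            continue  # e.g. an attached message/rfc822 part
        path = os.path.join(output_dir, name)
        with open(path, 'wb') as fp:
            fp.write(data)
        written.append(path)
    return written

# Demo: build a message in memory, serialise it, and parse it back the
# same way an .eml file would be parsed.
demo = EmailMessage()
demo['Subject'] = 'Report'
demo.set_content('See attachment.')
demo.add_attachment(b'col1,col2\n1,2\n', maintype='application',
                    subtype='octet-stream', filename='../report.csv')
parsed = BytesParser(policy=policy.default).parsebytes(demo.as_bytes())
out_dir = tempfile.mkdtemp()
saved = save_attachments(parsed, out_dir)
print(saved)
```

One caveat: `iter_attachments()` only looks at the immediate sub-parts of the top-level message, so attachments nested inside a forwarded message still need `walk()` as in the code above.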

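1
vote

Take a look at ParsEML: it bulk-extracts attachments from all eml files in a directory (originally by Stephan Hügel). I used a modified version of MeIOC to easily extract all the metadata in JSON format; I can share it if you want.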
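
For readers who only need the general idea of batch-extracting metadata as JSON, here is a standard-library sketch. To be clear, this is not how ParsEML or MeIOC work internally; the function names, the selected headers, and the sample file are all assumptions made for illustration.

```python
import glob
import json
import os
import tempfile
from email import policy
from email.parser import BytesParser

WANTED = ('from', 'to', 'cc', 'subject', 'date')  # assumed set of headers

def message_to_record(msg):
    """Summarise one parsed message as a JSON-serialisable dict."""
    return {
        'headers': {name: str(msg[name]) for name in WANTED if msg[name] is not None},
        'attachments': [part.get_filename() for part in msg.iter_attachments()],
    }

def directory_to_json(directory):
    """Parse every .eml file in directory and return one JSON document."""
    records = {}
    for path in sorted(glob.glob(os.path.join(directory, '*.eml'))):
        with open(path, 'rb') as fp:
            msg = BytesParser(policy=policy.default).parse(fp)
        records[os.path.basename(path)] = message_to_record(msg)
    return json.dumps(records, indent=2, ensure_ascii=False)

# Demo on a temporary directory holding one invented message.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'sample.eml'), 'wb') as fp:
    fp.write(b"From: alice@example.com\r\n"
             b"To: bob@example.com\r\n"
             b"Subject: Hello\r\n"
             b"\r\n"
             b"Body\r\n")
result = directory_to_json(demo_dir)
print(result)
```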
