How to handle all charsets and content types when reading email with imaplib in Python

Question

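I am reading email in Python with imaplib, extracting the body, and storing it in a database. Sometimes the code raises an error while decoding the body. I check the content type and charset of each part, but I do not understand how to handle all of them: sometimes a part is text/plain with utf-8, while in other mails the charset is ascii, ISO-8859, or windows-1252.

Please help me handle all charsets.

Below is the code I currently use to read the email body; I can provide the rest of the code if needed.

Expected result: convert every email body, whatever its charset, to UTF-8, then to HTML for display on a web portal.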

if email_message.is_multipart():
    html = None
    multipart = True
    for part in email_message.walk():
        print("%s, %s" % (part.get_content_type(), part.get_content_charset()))
        charset = part.get_content_charset()
        if part.get_content_charset() is None:
            # We cannot know the character set, so return decoded "something"
            text = part.get_payload(decode=True)
            continue
        if part.get_content_type() == 'text/plain' and part.get_content_charset() == 'utf-8':
            # print('text--->1')
            text = str(part.get_payload(decode=True))
            # text = html.decode("utf-8")
            # print(part.get_payload(decode=True))
        if part.get_content_type() == 'text/plain' and part.get_content_charset() != 'utf-8':
            # print('text--->2')
            html = part.get_payload(decode=True)
            # text1 = html.decode("utf-8")
            text1 = html.decode(part.get_content_charset()).encode('utf8')
        if part.get_content_type() == 'text/html' and part.get_content_charset() != 'windows-1252':
            html = part.get_payload(decode=True)
            # text1 = html.decode("utf-8")
            text1 = html.decode(part.get_content_charset()).encode('utf8')
        if part.get_content_type() == 'text/html' and part.get_content_charset() == 'windows-1252':
            html = part.get_payload(decode=True)
            text1 = html.decode("cp1252")
        # if part.get_content_type() == 'text/html' and part.get_content_charset() == 'windows-1252':
        #    html = part.get_payload(decode=True)
        #    text1 = html.decode("latin-1")
        # if text is not None:
        # print(text.strip())
        # prin('Rahul')
        # else:
    # print("text")    #    print( html.strip())
    # print(text1.strip())
    # print("text1")
    # print(text1)
    imageCount = 0
    imageKey = ''
    json_data = {}
    filedata = {}
    mydict1 = ''
    value = ''
    params = ''
    filename = ''
    newFileName = ''
    for part in email_message.walk():
        if part.get_content_maintype() == 'multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        if part.get_content_type() == 'message/rfc822':
            part_string = (bytes(str(part), 'utf-8'))
            # part_string = bytes(str(part.get_payload(0)),'utf-8')
            print('EML Part')
            print(part_string)
            filename = part.get_filename()
            # filename = filename.replace('\r', '').replace('\n', '')
            # print(part_string)
            # print(('attachment wala'))
        else:
            part_string = part.get_payload(decode=True)
            # print(part_string)
            # print(('attachment wala'))
            filename = part.get_filename()
            # filename = filename.replace('\r', '').replace('\n', '')
        if filename is not None:
            filepart = []
            try:
                decodefile = email.header.decode_header(filename)
                print('decodefile')
                print(decodefile)
            except HeaderParseError:
                return filename
                #
            for line1, encoding1 in decodefile:
                enc = 'utf-8'
                #        print(encoding)
                if encoding1 is not None:  # else encoding
                    print(type(line1))
                    filepart.append((line1.decode(encoding1)))
                    print('line')
                    print(line1)
                    print(filepart)
                    filename = ''.join(filepart)[:1023]
                else:
                    filename = filename
            dot_position = filename.rfind('.')
            file_prefix = filename[0: dot_position]
            file_suffix = filename[dot_position: len(filename)]
            print(filename)
            print(file_prefix)
            print(file_suffix)
            # filename = filename.decode('utf-8')
            # subject = ''
            file_prefix = file_prefix.replace('/', '_')
            now = datetime.datetime.now()
            timestamp = str(now.strftime("%Y%m%d%H%M%S%f"))
            print('timestamp--->')
            print(timestamp)
            newFileName = file_prefix + "_" + timestamp + file_suffix
            newFileName = newFileName.replace('\r', '').replace('\n', '').replace(',', '')
            filename = filename.replace('\r', '').replace('\n', '').replace(',', '')
            sv_path = os.path.join(svdir, newFileName)
            mydict = filename + '$$' + newFileName
            mydict1 = mydict1 + ',' + mydict
            # print(mydict1)
            value, params = cgi.parse_header(part.get('Content-Disposition'))
            print(value)
            if value == 'inline':
                imageCount = imageCount + 1
                print("newFileName-->" + newFileName)
                filedata[imageCount] = newFileName
                print(filedata)
                json_data = (filedata)
            # inlineImages = inlineImages + ',' + newFileName + '{{' + str(imageCount) + '}}'
            # print(json_data)
            # print('TYPE-->')
            # print(type(raw_email))
            # print(type(part.get_payload(decode=1)))
            # if type(part.get_payload(decode=1)) is None:
            #    print('message Type')
            if not os.path.isfile(sv_path):
                # print('rahul1')
                try:
                    fp = open(sv_path, 'wb')
                    fp.write(part_string)
                    fp.close()
                except TypeError:
                    pass
                    fp.close()

else:
    print("%s, %s" % (email_message.get_content_type(), email_message.get_content_charset()))
    if email_message.get_content_charset() is None:
        # We cannot know the character set, so return decoded "something"
        text = email_message.get_payload(decode=True)
        continue
    if email_message.get_content_type() == 'text/plain' and email_message.get_content_charset() == 'utf-8':
        print('text--->1')
        text = str(email_message.get_payload(decode=True))
        # text = html.decode("utf-8")
        # print(part.get_payload(decode=True))
    if email_message.get_content_type() == 'text/plain' and email_message.get_content_charset() != 'utf-8':
        print('text--->2')
        html = email_message.get_payload(decode=True)
        # text1 = html.decode("utf-8")
        text1 = html.decode(email_message.get_content_charset()).encode('utf8')
    if email_message.get_content_type() == 'text/html' and email_message.get_content_charset() != 'windows-1252':
        html = email_message.get_payload(decode=True)
        # text1 = html.decode("utf-8")
        text1 = html.decode(email_message.get_content_charset()).encode('utf8')
    if email_message.get_content_type() == 'text/html' and email_message.get_content_charset() == 'windows-1252':
        html = email_message.get_payload(decode=True)
        text1 = html.decode("cp1252")

Answer 1


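Short answer: walk all message parts and decode each one with the charset it declares. I see that you already do this (although I would rewrite your if-else cascade into something much simpler, since the stdlib implementation handles this fine; your code is rather messy at the moment). That works for standards-conforming mail content. But as always, there are many broken mail clients that do not care much about standards conformance, from good clients that break under certain circumstances to poorly scripted spam mailers.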
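
As a rough illustration of the "something much simpler" mentioned above, here is a minimal sketch that replaces the if-else cascade with one decoding path. The helper names `decode_part` and `extract_bodies` are made up for this example, and defaulting to UTF-8 when no charset is declared is an assumption, not something the standards require.

```python
import email
from email.message import Message


def decode_part(part: Message) -> str:
    """Decode one text/* part to str using its declared charset."""
    # decode=True undoes base64 / quoted-printable and returns bytes.
    raw = part.get_payload(decode=True) or b""
    # Assumption: treat a missing charset as UTF-8.
    charset = part.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that do not match it:
        # keep whatever is readable instead of raising.
        return raw.decode("utf-8", errors="replace")


def extract_bodies(msg: Message) -> dict:
    """Return the first text/plain and text/html bodies, skipping attachments."""
    bodies = {}
    for part in msg.walk():  # also yields the message itself if not multipart
        if part.get_content_maintype() != "text":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        ctype = part.get_content_type()
        if ctype in ("text/plain", "text/html") and ctype not in bodies:
            bodies[ctype] = decode_part(part)
    return bodies
```

The result is already `str`, so there is no need to call `.encode('utf8')` before storing it; let the database driver handle the UTF-8 encoding.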
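Long answer: it is impossible to get this right for all messages. Decoding will fail for various reasons. Whenever decoding a part fails, the question is what to do about it. You basically have these options:
1. Do nothing special, just use the raw content. You can simply insert the raw bytes into the database and serve that content to the user. That is not very user friendly, and probably not what you want if you have a large user base with business constraints attached to it. It is still the easiest way to deal with broken content, and it is also the fallback if 2. still fails.
2. Try to decode the content with heuristics. The nasty encoding work starts here. Whenever decoding of a part fails, the annotated encoding and the actual content disagree. What can you do? Not much beyond inspecting the content for hints about the actual encoding (for example, pattern matching for UTF-8 bit masks) or even brute-force decoding. A clever heuristic would try frequently seen encoding errors first (for example, test for UTF-8 or 8-bit encodings such as latin-1 early). There is no good rule of thumb here, because messed-up text encoding ranges from a simply wrong declared charset to several mixed 8-bit encodings. The first is likely to be detectable, but the latter cannot be resolved even by the most advanced heuristics, so you should always fall back to the solution in 1.
3. Skip the content. Not recommended, because it will likely withhold important data from the user. Do this only if you are sure the content is garbage.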
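
To make option 2 a bit more concrete, here is one possible brute-force heuristic. It is only a sketch: the function name `decode_heuristic` and the candidate order in `FALLBACK_CHARSETS` are assumptions chosen for illustration, and it cannot repair parts that mix several 8-bit encodings.

```python
from typing import Optional, Tuple

# Assumed candidate order: strict UTF-8 first (it rarely matches by accident),
# then common Western 8-bit encodings. Adjust to what your users actually send.
FALLBACK_CHARSETS = ("utf-8", "cp1252", "latin-1")


def decode_heuristic(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Try the declared charset, then the fallbacks; return (text, charset used)."""
    candidates = ([declared] if declared else []) + list(FALLBACK_CHARSETS)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # LookupError: Python does not know this charset name.
            # UnicodeDecodeError: the bytes are not valid in this charset.
            continue
    # Not reached while latin-1 is a candidate, since latin-1 maps every byte.
    return raw.decode("utf-8", errors="replace"), "utf-8-replace"
```

Because latin-1 accepts any byte sequence, a "successful" decode at that stage may still be mojibake, which is why keeping the raw bytes as in option 1 remains worthwhile.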
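If you want to go with the heuristic approach, I suggest the following:
- Start with standards-conforming handling; any message that follows the standards should be handled correctly (in an ideal world, you would be done here).
- Implement 1. above as the general failover.
- Collect data from your own users about typical failures, or search the internet for typical failures (other mail clients have already identified them and deal with them in some way).
- Implement the heuristics from 2. following the 80/20 rule: implement first what most users will benefit from; everything else is handled by 1.
- Improve the heuristics over time.
- In any case, avoid 3.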
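
The failover and data-collection steps could look roughly like the sketch below. The names `failure_stats` and `decode_or_keep_raw` are hypothetical, and an in-memory counter stands in for whatever logging or metrics system you actually use.

```python
from collections import Counter
from typing import Optional

# Hypothetical in-memory tally of decode failures, keyed by
# (declared charset, exception type). In production this would more
# likely be a log line or a metrics counter.
failure_stats: Counter = Counter()


def decode_or_keep_raw(raw: bytes, declared: Optional[str]) -> Optional[str]:
    """Strictly decode with the declared charset.

    Returns None on failure so the caller can store the raw bytes instead.
    """
    try:
        # Assumption: a part without a declared charset is treated as us-ascii.
        return raw.decode(declared or "us-ascii")
    except (LookupError, UnicodeDecodeError) as exc:
        failure_stats[(declared, type(exc).__name__)] += 1
        return None
```

Reviewing `failure_stats.most_common()` from time to time shows which declared charsets fail most often, which tells you where a new heuristic would pay off first.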
This is a very general answer to your question; if you have a particular issue, you should describe it in more detail.
