Decoding NNTP headers from (UTF-8?)

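I am working on some Python 3 code that fetches NNTP messages, parses the headers, and processes the data. My code runs fine for the first several hundred messages and then raises an exception.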

The exception is:

sys.exc_info()
(<class 'UnicodeDecodeError'>, UnicodeDecodeError('utf-8', b"Ana\xefs's", 3, 4, 'invalid continuation byte'), <traceback object at 0x7fe325261c08>)

The problem comes from trying to parse the subject. The raw content of the message is:

{'subject': 'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died', 'from': 'Fred Williams <[email protected]>', 'date': 'Sun, 05 Aug 2007 18:55:22 -0400', 'message-id': '<[email protected]>', 'references': '<[email protected]>', ':bytes': '1353', ':lines': '14', 'xref': 'number1.nntp.dca.giganews.com rec.pets.cats.community:171958'}

I don't know how to handle that ?UTF-8? part. The snippet of code that chokes on it is:

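import nntplib

# overviews is a list of (article number, overview dict) pairs
for msgId, msg in overviews:
    print(msgId)
    hdrs = {}
    if msgId == 171958:
        print(msg)
    try:
        for k in msg.keys():
            hdrs[k] = nntplib.decode_header(msg[k])
    except UnicodeDecodeError:
        # Catch only the decoding failure instead of every exception
        print('Unicode error!')
        continue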

1 Answer

The problem here is that your input is actually invalid.

This string is the problem:

'Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died'

You can decode it like this:

import email.header
email.header.decode_header('Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died')

The result is:

[(b'Re: Mme. ', None), (b"Ana\xefs's", 'utf-8'), (b' Computer Died', None)]

So the ugly part, =?UTF-8?B?QW5h73Mncw==?=, decodes to b"Ana\xefs's". It is labeled as a UTF-8 string, but it is not valid UTF-8.
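
To see where those bytes come from, the encoded word can be taken apart by hand. This is only a sketch: it assumes the RFC 2047 layout =?charset?encoding?text?= and that the encoding field is B (base64), as it is here. The variable names are illustrative.

```python
import base64

encoded_word = "=?UTF-8?B?QW5h73Mncw==?="

# Strip the leading "=?" and trailing "?=", then split into the three fields.
charset, encoding, payload = encoded_word[2:-2].split("?")

# "B" means the payload is base64; decoding it yields the raw bytes.
raw = base64.b64decode(payload)

print(charset, encoding, raw)  # UTF-8 B b"Ana\xefs's"
```

The charset field is just a label supplied by the sender. Nothing guarantees that the payload bytes actually match it.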

>>> b"Ana\xefs's".decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3: invalid continuation byte

This is the error you are seeing.
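
As background on why byte 0xef is rejected: in UTF-8, a byte of the form 1110xxxx opens a three-byte sequence and must be followed by continuation bytes of the form 10xxxxxx. The sketch below prints the bit patterns of the offending byte and the byte after it. The variable names are illustrative only.

```python
data = b"Ana\xefs's"

# Byte at position 3 (0xef) and the byte that follows it (0x73, the letter "s").
lead_bits = f"{data[3]:08b}"
next_bits = f"{data[4]:08b}"

print(lead_bits)  # 11101111 -> looks like the start of a 3-byte sequence
print(next_bits)  # 01110011 -> plain ASCII, not a 10xxxxxx continuation byte

# For comparison, a correctly UTF-8 encoded "ï" is two bytes.
print("ï".encode("utf-8"))  # b'\xc3\xaf'
```

A lone 0xef byte standing for "ï" is what a single-byte encoding such as ISO-8859-1 or Windows-1252 would produce, which suggests the sender mislabeled the charset.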

Now it is up to you to decide what to do. For example...

Ignore the error:

>>> b"Ana\xefs's".decode('utf-8', errors='ignore')
"Anas's"

Flag it as an error:

>>> b"Ana\xefs's".decode('utf-8', errors='replace')
"Ana�s's"

Make a wild guess at the correct encoding:

>>> b"Ana\xefs's".decode('windows-1252')
"Anaïs's"
>>> b"Ana\xefs's".decode('iso-8859-1')
"Anaïs's"
>>> b"Ana\xefs's".decode('iso-8859-2')
"Anaďs's"
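
These options can be combined into one helper that tries the declared charset first and then falls back to a guess. What follows is a minimal sketch, not part of nntplib or the email package: the function name decode_header_lenient and the fallback list are assumptions chosen for illustration, and the fallback order is a guess that suits Western European text.

```python
import email.header


def decode_header_lenient(value, fallbacks=("windows-1252", "iso-8859-1")):
    """Decode a header value, tolerating mislabeled charsets (hypothetical helper)."""
    parts = []
    for chunk, charset in email.header.decode_header(value):
        if isinstance(chunk, str):
            # decode_header returns str when the value has no encoded words.
            parts.append(chunk)
            continue
        # Try the declared charset first, then each fallback in order.
        for enc in (charset or "ascii", *fallbacks):
            try:
                parts.append(chunk.decode(enc))
                break
            except (UnicodeDecodeError, LookupError):
                continue
        else:
            # Nothing worked: keep what we can and mark the bad bytes.
            parts.append(chunk.decode("utf-8", errors="replace"))
    return "".join(parts)


subject = decode_header_lenient("Re: Mme. =?UTF-8?B?QW5h73Mncw==?= Computer Died")
print(subject)
```

In the loop from the question, calling a helper like this in place of nntplib.decode_header(msg[k]) would let the problem message through instead of skipping it. The cost is that a wrong guess silently produces wrong characters, as the iso-8859-2 result above shows.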