Problem scraping HTML from Gmail


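I am trying to scrape HTML from my Gmail messages, using the email package and Beautiful Soup to get the data. For some reason, when I read the email exactly as the sending company delivered it, the HTML comes back like this: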

PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy93M2MvL2R0ZCB4aHRtbCAxLjAgdHJhbnNpdGlvbmFs
Ly9lbiIgImh0dHA6Ly93d3cudzMub3JnL3RyL3hodG1sMS9kdGQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPjxodG1sIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hl
bHZldGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7
Ym94LXNpemluZzogYm9yZGVyLWJveCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGh0
bWwiPjxoZWFkIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hlbHZl
dGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7Ym94
LXNpemluZzogYm9yZGVyLWJveCI+CiAgICA8bWV0YSBzdHlsZT0ibWFyZ2luOiAwO3BhZGRpbmc6
IDA7Zm9udC1mYW1pbHk6ICdIZWx2ZXRpY2EgTmV1ZScsICdIZWx2ZXRpY2EnLCBIZWx2ZXRpY2Es
IEFyaWFsLCBzYW5zLXNlcmlmO2JveC1zaXppbmc6IGJvcmRlci1ib3giIGh0dHAtZXF1aXY9IkNv
bnRlbnQtVHlwZSIgY29udGVudD0idGV4dC9odG1sOyBjaGFyc2V0PVVURi04IiAvPgogICAgPHRp

This is the code I use to get the data above.

def grab_email(most_recent):
    result2, email_data = mail.uid('fetch', most_recent, '(RFC822)')
    raw_email = email_data[0][1].decode('utf-8')
    email_message = email.message_from_string(raw_email)
    return email_message

def get_data(email_message):
    for part in email_message.walk():
        content_type = part.get_content_type()
        if 'html' in content_type:
            html_ = part.get_payload()
            soup = BeautifulSoup(html_, 'lxml')
            text = soup.get_text()
            print(text)

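When the email comes from the original source, my code returns the first block above, a run of seemingly random letters and digits. However, if I forward the email to myself and run the code on the forwarded copy, it works and extracts the information exactly as expected. Any help with this would be great!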

python python-3.x email beautifulsoup html-email
1 Answer

The data you are seeing is base64-encoded. To decode it, use the base64 module from the standard library:

import base64

txt = '''PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy93M2MvL2R0ZCB4aHRtbCAxLjAgdHJhbnNpdGlvbmFs
Ly9lbiIgImh0dHA6Ly93d3cudzMub3JnL3RyL3hodG1sMS9kdGQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPjxodG1sIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hl
bHZldGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7
Ym94LXNpemluZzogYm9yZGVyLWJveCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGh0
bWwiPjxoZWFkIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hlbHZl
dGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7Ym94
LXNpemluZzogYm9yZGVyLWJveCI+CiAgICA8bWV0YSBzdHlsZT0ibWFyZ2luOiAwO3BhZGRpbmc6
IDA7Zm9udC1mYW1pbHk6ICdIZWx2ZXRpY2EgTmV1ZScsICdIZWx2ZXRpY2EnLCBIZWx2ZXRpY2Es
IEFyaWFsLCBzYW5zLXNlcmlmO2JveC1zaXppbmc6IGJvcmRlci1ib3giIGh0dHAtZXF1aXY9IkNv
bnRlbnQtVHlwZSIgY29udGVudD0idGV4dC9odG1sOyBjaGFyc2V0PVVURi04IiAvPgogICAgPHRp'''
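
# b64decode discards the embedded newlines and returns a bytes object,
# here beginning with b'<!DOCTYPE html PUBLIC ...'
print(base64.b64decode(txt))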
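
Decoding by hand works, but the email package can also undo the transfer encoding itself: calling get_payload(decode=True) on a part decodes it according to its Content-Transfer-Encoding header (base64 or quoted-printable) and returns bytes. That would also explain why the forwarded copy behaved differently, presumably because it was re-sent with another transfer encoding. Below is a minimal sketch of that approach; the function name extract_html, the sample message, and the UTF-8 fallback for a missing charset are assumptions made for illustration, not part of the original code.

```python
import email
from email.mime.text import MIMEText


def extract_html(email_message):
    """Return the decoded HTML of every text/html part as a list of str."""
    html_parts = []
    for part in email_message.walk():
        if part.get_content_type() == 'text/html':
            # decode=True undoes base64 / quoted-printable and returns bytes
            payload = part.get_payload(decode=True)
            # Assumption: fall back to UTF-8 when the part declares no charset
            charset = part.get_content_charset() or 'utf-8'
            html_parts.append(payload.decode(charset, errors='replace'))
    return html_parts


# Hypothetical sample message, built only to demonstrate the function.
# MIMEText with the utf-8 charset stores its body base64-encoded.
sample = MIMEText('<html><body><p>Hello</p></body></html>', 'html', 'utf-8')
raw_bytes = sample.as_bytes()

# Parsing the raw bytes avoids guessing an encoding for the whole message
message = email.message_from_bytes(raw_bytes)
html_list = extract_html(message)
print(html_list[0])
```

In the question's code, the equivalent change would be to parse email_data[0][1] with email.message_from_bytes and to pass the decoded string, rather than the raw result of part.get_payload(), to BeautifulSoup.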