python lxml.html.parse does not read a URL - or how do I get a requests.get result into an lxml.html DOM?

Question Votes: 0 Answers: 1

The same code below works for many web pages, but for some, like the one below, it gives this error:

Error: Error reading file 'http://akademos-garden.com/homeschooling-tips-work-home-parents': failed to load HTTP resource

Python to reproduce:

from lxml.html import parse
import requests
import pprint 

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    parsed_page = parse(page_url)

    dom = parsed_page.getroot()

except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)


print("Try get with requests' default User-Agent")
result = requests.get(page_url).status_code
pprint.pprint(result)

print("Try get with the User-Agent header removed")
result = requests.get(page_url, headers={'User-Agent': None}).status_code
pprint.pprint(result)
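As an aside, note what `headers={'User-Agent': None}` actually does: in requests, a header value of `None` removes that header when the request is prepared through a `Session`, so the second call above goes out with no User-Agent at all rather than a custom one. A small offline sketch (using `example.com` only as a placeholder URL; nothing is sent):

```python
import requests

# requests' own default User-Agent, e.g. 'python-requests/2.x'
session = requests.Session()
print(session.headers['User-Agent'])

# A None value removes the header entirely when the request is prepared:
prepared = session.prepare_request(
    requests.Request('GET', 'http://example.com/',
                     headers={'User-Agent': None})
)
print('User-Agent' in prepared.headers)  # -> False
```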

This post talks about adding a User-Agent, but I don't understand how to do that with lxml. Both requests.get calls above run without error and return HTTP status 200.

python lxml.html.parse does not read the url.

If I have to use requests.get, I can do that, but then how do I get the result into the dom object?
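One way to keep using `lxml.html.parse` while controlling the User-Agent (a sketch, not tested against this particular site) is to do the HTTP fetch with urllib and hand `parse()` the open file-like response, since `parse()` accepts file objects as well as URLs:

```python
from io import BytesIO
from urllib.request import Request, urlopen
from lxml.html import parse

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# Fetch with an explicit browser-like User-Agent, then let lxml parse the
# file-like response (network call commented out here):
req = Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
# with urlopen(req) as resp:
#     dom = parse(resp).getroot()

# parse() works the same on any file-like object, e.g. bytes in memory:
html = b"<html><body><a href='/x'>link text</a></body></html>"
dom = parse(BytesIO(html)).getroot()
print(dom.findall('.//a')[0].text)  # -> link text
```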

python python-3.x lxml.html
1 Answer
0 votes

The following seems to work; I just don't understand why the extra step is needed. If anyone can explain, it would be appreciated.

from lxml import etree
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    # old way of doing it
    # parsed_page = parse(page_url)
    # dom = parsed_page.getroot()

    # goal of the new way is to put the data in the same dom variable
    print("retrieve page using requests.get")
    result = requests.get(page_url, headers={'User-Agent': None})
    print("result.status_code=", result.status_code)
    parser = etree.HTMLParser()
    dom = etree.fromstring(result.content, parser)

    # prove that the dom variable works like it did before
    links = dom.cssselect('a')
    for link in links:
        print("Link:", link.text)
except Exception as e:
    # TODO - log this in some other error table to come back and research
    errMsg = f"Error: {e}"
    print(errMsg)
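As for why the extra step is needed: when `parse()` is given a URL, the download is done by libxml2's own minimal HTTP client, not by Python, and it identifies itself differently from a browser (and handles redirects and compression less capably), so some servers likely reject it even though requests succeeds. The usual pattern is to let requests do the fetching and lxml only the parsing, e.g. with `lxml.html.fromstring`, which also returns an lxml.html element with the same `cssselect` support. A sketch (the network fetch is commented out so it runs offline):

```python
import requests
from lxml import html

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# requests does the HTTP work (headers, redirects, TLS); lxml only parses:
# resp = requests.get(page_url, headers={'User-Agent': 'Mozilla/5.0'})
# dom = html.fromstring(resp.content)

# the parsing step itself, demonstrated offline:
dom = html.fromstring(b"<html><body><a href='/y'>hello</a></body></html>")
print(dom.findall('.//a')[0].text)  # -> hello
```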