python lxml.html.parse does not read a URL - or how do I get a requests.get result into an lxml.html DOM?

Question Votes: 0 Answers: 1

The same code below works for many web pages, but for some, like the one below, it gives this error:

Error: Error reading file 'http://akademos-garden.com/homeschooling-tips-work-home-parents': failed to load HTTP resource

Python to reproduce:

from lxml.html import parse
import requests
import pprint 

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    parsed_page = parse(page_url)

    dom = parsed_page.getroot()

except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e} "
    print(errMsg)


print("Try get with requests' default User-Agent")
result = requests.get(page_url).status_code
pprint.pprint(result)

print("Try get with the User-Agent header removed")
result = requests.get(page_url, headers={'User-Agent': None}).status_code
pprint.pprint(result)
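As an aside, note what `headers={'User-Agent': None}` actually does: in requests, a header value of `None` removes that header when the request is prepared through a `Session`, so the second call above goes out with no User-Agent at all rather than a custom one. A small offline sketch (using `example.com` only as a placeholder URL; nothing is sent):

```python
import requests

# requests' own default User-Agent, e.g. 'python-requests/2.x'
session = requests.Session()
print(session.headers['User-Agent'])

# A None value removes the header entirely when the request is prepared:
prepared = session.prepare_request(
    requests.Request('GET', 'http://example.com/',
                     headers={'User-Agent': None})
)
print('User-Agent' in prepared.headers)  # -> False
```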

This post talks about adding a User-Agent, but I don't understand how to do that with lxml. Both requests.get calls above run without error and return HTTP status 200.

python lxml.html.parse does not read the url.

If I have to use requests.get, I can do that, but then how do I get the result into the dom object?
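One way to keep using `lxml.html.parse` while controlling the User-Agent (a sketch, not tested against this particular site) is to do the HTTP fetch with urllib and hand `parse()` the open file-like response, since `parse()` accepts file objects as well as URLs:

```python
from io import BytesIO
from urllib.request import Request, urlopen
from lxml.html import parse

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# Fetch with an explicit browser-like User-Agent, then let lxml parse the
# file-like response (network call commented out here):
req = Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
# with urlopen(req) as resp:
#     dom = parse(resp).getroot()

# parse() works the same on any file-like object, e.g. bytes in memory:
html = b"<html><body><a href='/x'>link text</a></body></html>"
dom = parse(BytesIO(html)).getroot()
print(dom.findall('.//a')[0].text)  # -> link text
```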

python python-3.x lxml.html
1 Answer
0 votes

The following seems to work; I just don't understand why the extra step is needed. If anyone can explain, it would be appreciated.

from lxml import etree
import requests

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

try:
    # old way of doing it
    # parsed_page = parse(page_url)
    # dom = parsed_page.getroot()

    # goal of the new way is to put the data in the same dom variable
    print("retrieve page using requests.get")
    result = requests.get(page_url, headers={'User-Agent': None})
    print("result.status_code=", result.status_code)
    parser = etree.HTMLParser()
    dom = etree.fromstring(result.content, parser)

    # prove that the dom variable works like it did before
    links = dom.cssselect('a')
    for link in links:
        print("Link:", link.text)
except Exception as e:
    # TODO - log this in some other error table to come back and research
    errMsg = f"Error: {e}"
    print(errMsg)
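As for why the extra step is needed: when `parse()` is given a URL, the download is done by libxml2's own minimal HTTP client, not by Python, and it identifies itself differently from a browser (and handles redirects and compression less capably), so some servers likely reject it even though requests succeeds. The usual pattern is to let requests do the fetching and lxml only the parsing, e.g. with `lxml.html.fromstring`, which also returns an lxml.html element with the same `cssselect` support. A sketch (the network fetch is commented out so it runs offline):

```python
import requests
from lxml import html

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'

# requests does the HTTP work (headers, redirects, TLS); lxml only parses:
# resp = requests.get(page_url, headers={'User-Agent': 'Mozilla/5.0'})
# dom = html.fromstring(resp.content)

# the parsing step itself, demonstrated offline:
dom = html.fromstring(b"<html><body><a href='/y'>hello</a></body></html>")
print(dom.findall('.//a')[0].text)  # -> hello
```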