The same code below works for many web pages, but for some pages like the one below it raises an error:

Error: Error reading file 'http://akademos-garden.com/homeschooling-tips-work-home-parents': failed to load HTTP resource

Python reproduction:
from lxml.html import parse
import requests
import pprint
page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'
try:
    parsed_page = parse(page_url)
    dom = parsed_page.getroot()
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e}"
    print(errMsg)

print("Try get with requests' default User-Agent")
result = requests.get(page_url).status_code
pprint.pprint(result)

print("Try get with the User-Agent header removed")
result = requests.get(page_url, headers={'User-Agent': None}).status_code
pprint.pprint(result)
This article talks about adding a user agent, but I don't understand how to do that with lxml. Both requests.get calls above run without errors and return HTTP status 200.

Python lxml.html.parse does not read the URL.

If I have to use requests.get, I can do that, but then how do I get the result into the dom object?

The following seems to work; I just don't understand why the extra steps are needed. If someone could explain, it would be much appreciated.
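One point worth noting: in requests, passing headers={'User-Agent': None} does not send an empty User-Agent, it removes the header from the request entirely, while a plain requests.get sends requests' own default User-Agent. A small sketch demonstrating this without hitting the network (the example.com URL is just a placeholder):

```python
import requests

session = requests.Session()
# requests sends its own default User-Agent, e.g. 'python-requests/2.x'
print(session.headers['User-Agent'])

# Setting a header to None does not send an empty value -- during request
# preparation it deletes the header from the merged header set entirely.
req = requests.Request('GET', 'http://example.com/',
                       headers={'User-Agent': None})
prepared = session.prepare_request(req)
print('User-Agent' in prepared.headers)  # False
```

So the two requests.get calls above are really "default python-requests User-Agent" and "no User-Agent at all"; the server apparently accepts both.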
from lxml.html import parse
from lxml import etree
import requests
import pprint

page_url = 'http://akademos-garden.com/homeschooling-tips-work-home-parents/'
try:
    # old way of doing it
    # parsed_page = parse(page_url)
    # dom = parsed_page.getroot()
    # so the goal of the new way is to put the data in the same dom variable
    print("retrieve page using requests.get")
    result = requests.get(page_url, headers={'User-Agent': None})
    print("result.status_code=", result.status_code)
    parser = etree.HTMLParser()
    dom = etree.fromstring(result.content, parser)
    # prove that the dom variable works like it did before
    links = dom.cssselect('a')
    for link in links:
        print("Link:", link.text)
except Exception as e:
    # TODO - log this into some other error table to come back and research
    errMsg = f"Error: {e}"
    print(errMsg)
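As I understand it, the extra steps come from the fact that parse() expects a filename, URL, or file-like object and returns an ElementTree (hence .getroot()), while requests hands you the raw bytes. A sketch with a local HTML snippet (the markup here is made up for illustration) showing that lxml.html.fromstring collapses the HTMLParser/fromstring/getroot dance into one call:

```python
import io
from lxml.html import fromstring, parse

html = b'<html><body><a href="/x">First link</a></body></html>'

# parse() wants a filename/URL or a file-like object and returns an
# ElementTree, so you need the extra .getroot() step:
dom_old = parse(io.BytesIO(html)).getroot()

# fromstring() takes bytes you already have (e.g. response.content from
# requests) and returns the root element directly:
dom_new = fromstring(html)

print(dom_old.xpath('//a/text()'))  # ['First link']
print(dom_new.xpath('//a/text()'))  # ['First link']
```

With that, dom = fromstring(requests.get(page_url).content) should behave like the old dom variable.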