Python requests.get returns blank result


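I'm new to web scraping and am trying to scrape some housing information from redfin.com, using the Python requests package to fetch the page source. However, the code only works some of the time: sometimes it returns the full HTML for each URL, and sometimes it returns nothing but a blank response.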

Here is a simplified version of my code:

import requests

headers = {
    'user-agent': 'XXX'  # placeholder: put a real browser User-Agent string here
}
links = ['https://www.redfin.com/ID/Meridian/3642-N-Hollymount-Way-83646/home/106711385',
         'https://www.redfin.com/ID/Meridian/1506-N-Penrith-Pl-83642/home/106700395',
         'https://www.redfin.com/ID/Nampa/34-N-Middleton-Rd-83651/home/117266789',
         'https://www.redfin.com/OR/The-Dalles/1308-Harris-St-97058/home/53055510']
for link in links:
    response = requests.get(link, headers=headers)
    html = response.text
    print(html)  # printed inside the loop so every page is shown, not just the last one

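The status code is always 200. Sometimes I get the HTML, but most of the time the response is just blank. This really confuses me, and I'd greatly appreciate any help with this problem. Thanks!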

python web-scraping python-requests web-crawler
1 Answer

The following code (with a valid User-Agent) works without exception.

However, because of rate limiting, running it many times in a short period may result in HTTP 429 Too Many Requests.

from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

links = [
    "https://www.redfin.com/ID/Meridian/3642-N-Hollymount-Way-83646/home/106711385",
    "https://www.redfin.com/ID/Meridian/1506-N-Penrith-Pl-83642/home/106700395",
    "https://www.redfin.com/ID/Nampa/34-N-Middleton-Rd-83651/home/117266789",
    "https://www.redfin.com/OR/The-Dalles/1308-Harris-St-97058/home/53055510",
]

with Session() as session:
    for link in links:
        try:
            with session.get(link, headers=HEADERS) as response:
                response.raise_for_status()
                print(response.text)
        except Exception as e:
            print(e)
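
If the rate limiting mentioned above does kick in, one option is to retry with a growing delay instead of giving up. The helper below is only a rough sketch: `fetch_with_backoff` is a hypothetical name, and the retry count, delays, and the choice to treat a blank 200 body as retryable are my assumptions, not anything Redfin documents. It expects a requests-style session passed in by the caller.

```python
import time


def fetch_with_backoff(session, url, headers, max_retries=4, base_delay=1.0,
                       sleep=time.sleep):
    """Return the page text, retrying on HTTP 429 or a blank 200 body.

    `session` is any object with a requests-style get(). The retry count and
    delays are illustrative assumptions, not documented limits.
    """
    for attempt in range(max_retries + 1):
        response = session.get(url, headers=headers, timeout=10)
        retryable = response.status_code == 429 or (
            response.status_code == 200 and not response.text.strip()
        )
        if not retryable:
            # Any other error status (403, 500, ...) is raised to the caller.
            response.raise_for_status()
            return response.text
        if attempt == max_retries:
            break
        # Use Retry-After when it is a plain number of seconds;
        # otherwise back off exponentially: 1s, 2s, 4s, ...
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"no usable response from {url} after {max_retries + 1} attempts")
```

Inside the `with Session() as session:` loop you would call `fetch_with_backoff(session, link, HEADERS)` in place of `session.get(...)`. If the body stays blank even after several retries, the site is probably blocking the request in some other way, and waiting longer is unlikely to help.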