I have a Python crawler script that gets stuck on this URL: pulsepoint.com/sellers.json
The bot fetches the content with a standard requests call but gets a 404 back. In a browser the URL works (there is a 301 redirect, but requests can follow redirects). My first guess was a request-header problem, so I copied my browser's configuration. The code looks like this:
import logging
import requests

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info(" %d" % r.status_code)
But I still get a 404.
My next guess:
So how is their server blocking my bot? This is a URL that is meant to be crawled; there is nothing illegitimate about it.
Thanks in advance!
You can go to the final link directly and pull the data, without relying on the 301 redirect to resolve to the correct URL:
import requests

# Request the redirect target directly; verify=False skips SSL certificate validation
headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
    url="https://projects.contextweb.com/sellersjson/sellers.json",
    headers=headers,
    verify=False,
)
You can also work around the SSL certificate error like this:
from urllib.request import urlopen
import ssl
import json

# Workaround for the SSL certificate error: disable verification globally
ssl._create_default_https_context = ssl._create_unverified_context

seller_json_url = 'https://{thehost}/sellers.json'.format(thehost='projects.contextweb.com/sellersjson')
print(seller_json_url)
response = urlopen(seller_json_url).read()

# Print as a dictionary
print(json.loads(response))
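Note that monkey-patching ssl._create_default_https_context disables certificate verification for every HTTPS request in the process. A narrower alternative (a sketch using only the standard library) is to build an unverified context explicitly and pass it to just the one urlopen call that needs it:

```python
import ssl
from urllib.request import urlopen

# Build an explicit context instead of patching the module-wide default;
# verification is disabled only for calls that receive this context.
ctx = ssl.create_default_context()
ctx.check_hostname = False        # must be disabled before verify_mode
ctx.verify_mode = ssl.CERT_NONE   # skip certificate validation

# Pass the context to the single request that needs the workaround:
# data = urlopen('https://projects.contextweb.com/sellersjson/sellers.json',
#                context=ctx).read()
```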
Sample response:
{'contact_email': '[email protected]', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ...
...'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
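Once parsed, the sellers list is plain Python data and can be queried directly. A small sketch using a hand-made sample that mirrors the structure above (the entries are illustrative, not the real feed):

```python
import json

# Minimal sample mirroring the sellers.json structure shown above
sample = json.loads('''
{
  "version": "1.0",
  "sellers": [
    {"seller_id": "508738", "seller_type": "PUBLISHER"},
    {"seller_id": "562225", "name": "EL DIARIO",
     "domain": "impremedia.com", "seller_type": "PUBLISHER"}
  ]
}
''')

# Index sellers by domain, skipping entries that have no domain field
by_domain = {s["domain"]: s for s in sample["sellers"] if "domain" in s}
print(by_domain["impremedia.com"]["seller_id"])  # → 562225
```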