I have a Python crawler script that gets stuck on this URL: pulsepoint.com/sellers.json
The bot fetches the content with a standard requests call but gets a 404 back. In a browser the URL works (there is a 301 redirect, but requests can follow redirects). My first guess was a request-header problem, so I copied my browser's configuration. The code looks like this:
import logging
import requests

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info(" %d" % r.status_code)
But I still get a 404.
My next guess:
So how is their server blocking my bot? This is a URL that is meant to be crawled; there is nothing illegitimate about it.
Thanks in advance!
You can go to the final link directly and pull the data, without relying on the 301 redirect to resolve to the correct URL:
import requests

# Request the redirect target directly; verify=False skips SSL certificate validation
headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
    url="https://projects.contextweb.com/sellersjson/sellers.json",
    headers=headers,
    verify=False,
)
You can also work around the SSL certificate error like this:
from urllib.request import urlopen
import ssl
import json

# Workaround for the SSL certificate error: disable verification globally
ssl._create_default_https_context = ssl._create_unverified_context

seller_json_url = 'https://{thehost}/sellers.json'.format(thehost='projects.contextweb.com/sellersjson')
print(seller_json_url)
response = urlopen(seller_json_url).read()

# Print as a dictionary
print(json.loads(response))
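Note that monkey-patching ssl._create_default_https_context disables certificate verification for every HTTPS request in the process. A narrower alternative (a sketch using only the standard library) is to build an unverified context explicitly and pass it to just the one urlopen call that needs it:

```python
import ssl
from urllib.request import urlopen

# Build an explicit context instead of patching the module-wide default;
# verification is disabled only for calls that receive this context.
ctx = ssl.create_default_context()
ctx.check_hostname = False        # must be disabled before verify_mode
ctx.verify_mode = ssl.CERT_NONE   # skip certificate validation

# Pass the context to the single request that needs the workaround:
# data = urlopen('https://projects.contextweb.com/sellersjson/sellers.json',
#                context=ctx).read()
```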
Sample response:
{'contact_email': '[email protected]', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ...
...'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
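Once parsed, the sellers list is plain Python data and can be queried directly. A small sketch using a hand-made sample that mirrors the structure above (the entries are illustrative, not the real feed):

```python
import json

# Minimal sample mirroring the sellers.json structure shown above
sample = json.loads('''
{
  "version": "1.0",
  "sellers": [
    {"seller_id": "508738", "seller_type": "PUBLISHER"},
    {"seller_id": "562225", "name": "EL DIARIO",
     "domain": "impremedia.com", "seller_type": "PUBLISHER"}
  ]
}
''')

# Index sellers by domain, skipping entries that have no domain field
by_domain = {s["domain"]: s for s in sample["sellers"] if "domain" in s}
print(by_domain["impremedia.com"]["seller_id"])  # → 562225
```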