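I recently started using Python for web scraping. I usually work with requests and HTML parsing. The thing is, I am now scraping the site coches.net.
While inspecting the Network tab under Fetch/XHR, I found a POST request named "listing". Its response contains all the car information in JSON format.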
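Note: the listing request does not appear on the initial page load. It shows up only after carrying out step 2 on any page; once that is done, the listing entry is displayed.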
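There is also a request payload that I think I need to send.
There is also a cookie under the Cookies tab.
I don't know whether I have to send the payload, the headers, or the cookie. I have never worked with GET/POST requests in any depth.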
import requests
# Defined here but never passed to the session, so requests sends its default User-Agent
user_agents = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
s = requests.Session()
s.get("https://www.coches.net/segunda-mano/")
# POST with no headers and no JSON body
s.post("https://web.gw.coches.net/search/listing").text
Output:
'{"timestamp":"2023-06-06T14:24:53.811+00:00","status":400,"error":"Bad Request","path":"/search/listing"}'
(The POST request returns status code 400.)
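One plausible reason for the 400 is that the bare `s.post(...)` call sends no body and no `Content-Type` header, while the browser sends a JSON payload. A minimal sketch to check what requests would put on the wire, without sending anything: it prepares the request through the session and prints the result. The trimmed payload here is only a stand-in for illustration, and whether the missing body is the actual cause of the 400 is an assumption.

```python
import json
import requests

URL = "https://web.gw.coches.net/search/listing"

session = requests.Session()

# Build the requests without sending them, so nothing touches the network.
bare = session.prepare_request(requests.Request("POST", URL))
with_json = session.prepare_request(
    # Stand-in payload for illustration only, not the full one the site sends.
    requests.Request("POST", URL, json={"pagination": {"page": 1, "size": 30}})
)

# What the bare POST would carry: no body, no Content-Type, default User-Agent.
print("bare body:        ", bare.body)
print("bare Content-Type:", bare.headers.get("Content-Type"))
print("bare User-Agent:  ", bare.headers.get("User-Agent"))

# Passing json= serializes the dict and sets Content-Type for us.
print("json body:        ", json.loads(with_json.body))
print("json Content-Type:", with_json.headers.get("Content-Type"))
```

The `user_agents` string in the attempt above never reaches the session either, which is why the prepared request still reports the default `python-requests` User-Agent.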
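Step 1: Once the page has loaded, right-click the listing request in the Network tab of the inspector -> Copy -> Copy as cURL.
Step 2: After removing the unnecessary headers, you are left with the following cURL request: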
curl 'https://web.gw.coches.net/search/listing' \
-H 'authority: web.gw.coches.net' \
-H 'accept: application/json, text/plain, */*' \
-H 'accept-language: en-US,en;q=0.9' \
-H 'content-type: application/json' \
-H 'origin: https://www.coches.net' \
-H 'referer: https://www.coches.net/' \
-H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
-H 'x-adevinta-page-url: https://www.coches.net/km-0/' \
-H 'x-adevinta-referer: https://www.coches.net/' \
-H 'x-adevinta-session-id: 94c2b83f-4058-4d88-8930-62c1eed6737c' \
-H 'x-schibsted-tenant: coches' \
--data-raw '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"year"},"filters":{"isFinanced":false,"price":{"from":null,"to":null},"bodyTypeIds":[],"categories":{"category1Ids":[2500]},"contractId":0,"drivenWheelsIds":[],"environmentalLabels":[],"equipments":[],"fuelTypeIds":[],"hasPhoto":null,"hasStock":null,"hasWarranty":null,"hp":{"from":null,"to":null},"isCertified":false,"km":{"from":null,"to":null},"luggageCapacity":{"from":null,"to":null},"onlyPeninsula":false,"offerTypeIds":[1],"provinceIds":[],"sellerTypeId":0,"transmissionTypeId":0,"year":{"from":null,"to":null}}}' \
--compressed
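Deciding which headers count as unnecessary in Step 2 can be done by trial and error: drop one header, resend, and keep it out if the request still succeeds. A rough sketch of that loop follows. `minimize_headers`, `still_works`, and `fake_check` are hypothetical names introduced here, and the rule inside `fake_check` is an invented stand-in so the example runs offline; it is not a statement of what the server actually requires.

```python
from typing import Callable, Dict


def minimize_headers(
    headers: Dict[str, str],
    still_works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Try removing each header in turn; keep it removed if the check passes."""
    kept = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial
    return kept


def fake_check(trial: Dict[str, str]) -> bool:
    # Invented rule for the offline demo: pretend these two headers are required.
    return "content-type" in trial and "x-schibsted-tenant" in trial


copied = {
    "accept": "application/json, text/plain, */*",
    "accept-language": "en-US,en;q=0.9",
    "content-type": "application/json",
    "origin": "https://www.coches.net",
    "x-schibsted-tenant": "coches",
}

minimal = minimize_headers(copied, fake_check)
print(minimal)
```

In real use, `still_works` would send the POST with the trial headers and return whether the status code is 200. Keep a short pause between attempts so the site is not hammered.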
Step 3: Use https://curlconverter.com/python/ to convert the cURL request above into Python code, then run it to get the desired JSON response.
import requests

headers = {
    'authority': 'web.gw.coches.net',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/json',
    'origin': 'https://www.coches.net',
    'referer': 'https://www.coches.net/',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'x-adevinta-page-url': 'https://www.coches.net/km-0/',
    'x-adevinta-referer': 'https://www.coches.net/',
    'x-adevinta-session-id': '94c2b83f-4058-4d88-8930-62c1eed6737c',
    'x-schibsted-tenant': 'coches',
}

json_data = {
    'pagination': {
        'page': 1,
        'size': 30,
    },
    'sort': {
        'order': 'desc',
        'term': 'year',
    },
    'filters': {
        'isFinanced': False,
        'price': {
            'from': None,
            'to': None,
        },
        'bodyTypeIds': [],
        'categories': {
            'category1Ids': [
                2500,
            ],
        },
        'contractId': 0,
        'drivenWheelsIds': [],
        'environmentalLabels': [],
        'equipments': [],
        'fuelTypeIds': [],
        'hasPhoto': None,
        'hasStock': None,
        'hasWarranty': None,
        'hp': {
            'from': None,
            'to': None,
        },
        'isCertified': False,
        'km': {
            'from': None,
            'to': None,
        },
        'luggageCapacity': {
            'from': None,
            'to': None,
        },
        'onlyPeninsula': False,
        'offerTypeIds': [
            1,
        ],
        'provinceIds': [],
        'sellerTypeId': 0,
        'transmissionTypeId': 0,
        'year': {
            'from': None,
            'to': None,
        },
    },
}
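
# Send the same headers and JSON payload that the browser sent
response = requests.post('https://web.gw.coches.net/search/listing', headers=headers, json=json_data, timeout=30)
if response.status_code == 200:
    # Use a separate name so the request payload in json_data is not overwritten
    data = response.json()
    # Process the JSON data as needed
    print(data)
else:
    print(f"Error: {response.status_code}")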
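
The payload carries a `pagination` object with `page` and `size`, so further pages can presumably be requested by changing `page`. A small sketch of building one payload per page without mutating the original: `payload_for_page` is a hypothetical helper, the base payload below is trimmed for brevity, and it is an assumption (not verified here) that the server honors other page numbers in this way.

```python
import copy


def payload_for_page(base_payload: dict, page: int, size: int = 30) -> dict:
    """Return a deep copy of the payload with its pagination block replaced."""
    payload = copy.deepcopy(base_payload)
    payload["pagination"] = {"page": page, "size": size}
    return payload


# Trimmed stand-in for the full payload dictionary shown above.
base = {
    "pagination": {"page": 1, "size": 30},
    "sort": {"order": "desc", "term": "year"},
    "filters": {"categories": {"category1Ids": [2500]}},
}

page_two = payload_for_page(base, 2)
print(page_two["pagination"])
print(base["pagination"])  # the original payload is left untouched
```

Each resulting dictionary can then be passed as the `json=` argument of `requests.post`, together with the same headers as the request above.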