Web scraping with Python: how do I scrape JSON from a POST request?


I recently started web scraping with Python. I usually make a request and parse the returned HTML. Right now I am scraping the website coches.net.

After inspecting the network traffic under the Fetch/XHR tab, I found a POST request named "listing". Its response contains all the car information in JSON format.

[Screenshot: the listing element]

Note: the listing request only appears after step 2 below, and this holds on any page:

  1. Go to the base website
  2. Click Ordenar: Los más nuevos (Sort by: Newest)

[Screenshot: clicking the sort option makes listing appear]

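There is also a request payload that I think I need to send:

[Screenshot: Payload]

There is also a cookie under the Cookies tab:

[Screenshot: Cookie]

And there is more information under the Headers tab:

[Screenshot: Headers]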

I don't know whether I have to send the payload, the headers, and the cookie. I have never worked with GET/POST requests in any depth.

import requests

user_agents = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'

s = requests.Session()

s.get("https://www.coches.net/segunda-mano/")

s.post("https://web.gw.coches.net/search/listing").text

Output:

'{"timestamp":"2023-06-06T14:24:53.811+00:00","status":400,"error":"Bad Request","path":"/search/listing"}'

(The POST request returns status code 400.)

python json web-scraping post python-requests
1 Answer

Step 1: After the page loads, right-click the listing request in the Network tab of the inspector -> Copy -> Copy as cURL.

Step 2: After removing the unnecessary headers, you are left with the following cURL request:

curl 'https://web.gw.coches.net/search/listing' \
  -H 'authority: web.gw.coches.net' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'content-type: application/json' \
  -H 'origin: https://www.coches.net' \
  -H 'referer: https://www.coches.net/' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'x-adevinta-page-url: https://www.coches.net/km-0/' \
  -H 'x-adevinta-referer: https://www.coches.net/' \
  -H 'x-adevinta-session-id: 94c2b83f-4058-4d88-8930-62c1eed6737c' \
  -H 'x-schibsted-tenant: coches' \
  --data-raw '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"year"},"filters":{"isFinanced":false,"price":{"from":null,"to":null},"bodyTypeIds":[],"categories":{"category1Ids":[2500]},"contractId":0,"drivenWheelsIds":[],"environmentalLabels":[],"equipments":[],"fuelTypeIds":[],"hasPhoto":null,"hasStock":null,"hasWarranty":null,"hp":{"from":null,"to":null},"isCertified":false,"km":{"from":null,"to":null},"luggageCapacity":{"from":null,"to":null},"onlyPeninsula":false,"offerTypeIds":[1],"provinceIds":[],"sellerTypeId":0,"transmissionTypeId":0,"year":{"from":null,"to":null}}}' \
  --compressed
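
Step 2 does not say how to tell which headers are unnecessary. One possible approach, sketched below, is to drop one header at a time and keep the removal only if the request still succeeds. The helper name `prune_headers` and the `still_works` callback are hypothetical and not part of requests or of this answer; in real use the callback would send the POST with the candidate headers and report whether it got a 200. The demo runs offline against a stand-in check.

```python
from typing import Callable, Dict


def prune_headers(
    headers: Dict[str, str],
    still_works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Return a smaller header dict for which still_works() is still True.

    Hypothetical helper: still_works(candidate) should send the request
    with the candidate headers and report whether it succeeded.
    """
    kept = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        candidate = {k: v for k, v in kept.items() if k != name}
        if still_works(candidate):
            kept = candidate  # the request succeeds without it, so drop it
    return kept


if __name__ == "__main__":
    # Offline stand-in for the real server: it pretends that only these two
    # headers are required. This is an assumption made for the demo, not a
    # statement about what coches.net actually checks.
    required = {"content-type", "x-schibsted-tenant"}

    def fake_server_accepts(candidate: Dict[str, str]) -> bool:
        return required <= set(candidate)

    copied = {
        "accept": "application/json, text/plain, */*",
        "content-type": "application/json",
        "origin": "https://www.coches.net",
        "x-schibsted-tenant": "coches",
    }
    print(prune_headers(copied, fake_server_accepts))
```

This is a single greedy pass, so it sends one request per header. If the server's checks depend on combinations of headers, the result may not be the smallest possible set.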

Step 3: Use https://curlconverter.com/python/ to convert the cURL request above into Python code, then run it to get the desired JSON response.

import requests

headers = {
    'authority': 'web.gw.coches.net',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/json',
    'origin': 'https://www.coches.net',
    'referer': 'https://www.coches.net/',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'x-adevinta-page-url': 'https://www.coches.net/km-0/',
    'x-adevinta-referer': 'https://www.coches.net/',
    'x-adevinta-session-id': '94c2b83f-4058-4d88-8930-62c1eed6737c',
    'x-schibsted-tenant': 'coches',
}

json_data = {
    'pagination': {
        'page': 1,
        'size': 30,
    },
    'sort': {
        'order': 'desc',
        'term': 'year',
    },
    'filters': {
        'isFinanced': False,
        'price': {
            'from': None,
            'to': None,
        },
        'bodyTypeIds': [],
        'categories': {
            'category1Ids': [
                2500,
            ],
        },
        'contractId': 0,
        'drivenWheelsIds': [],
        'environmentalLabels': [],
        'equipments': [],
        'fuelTypeIds': [],
        'hasPhoto': None,
        'hasStock': None,
        'hasWarranty': None,
        'hp': {
            'from': None,
            'to': None,
        },
        'isCertified': False,
        'km': {
            'from': None,
            'to': None,
        },
        'luggageCapacity': {
            'from': None,
            'to': None,
        },
        'onlyPeninsula': False,
        'offerTypeIds': [
            1,
        ],
        'provinceIds': [],
        'sellerTypeId': 0,
        'transmissionTypeId': 0,
        'year': {
            'from': None,
            'to': None,
        },
    },
}

# json= serializes the dict into the JSON request body; timeout avoids hanging forever
response = requests.post('https://web.gw.coches.net/search/listing', headers=headers, json=json_data, timeout=30)

if response.status_code == 200:
    data = response.json()
    # Process the JSON data as needed
    print(data)
else:
    print(f"Error: {response.status_code}")
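
The `pagination` object in the payload suggests that further result pages can be requested by changing `page`. The sketch below builds one payload per page without mutating the original dict. The helper names `payload_for_page` and `payloads` are hypothetical, and the question does not show how the API signals the last page, so the page range is left to the caller.

```python
import copy
from typing import Any, Dict, Iterator


def payload_for_page(base: Dict[str, Any], page: int) -> Dict[str, Any]:
    """Return a deep copy of base with pagination.page set to page."""
    payload = copy.deepcopy(base)  # nested dicts are copied, so base stays untouched
    payload.setdefault("pagination", {})["page"] = page
    return payload


def payloads(base: Dict[str, Any], first_page: int, last_page: int) -> Iterator[Dict[str, Any]]:
    """Yield one payload per page, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield payload_for_page(base, page)


if __name__ == "__main__":
    # Shortened stand-in for the json_data dict above (assumption: the real
    # request would use the full dict, filters included).
    base = {
        "pagination": {"page": 1, "size": 30},
        "sort": {"order": "desc", "term": "year"},
    }
    for p in payloads(base, 1, 3):
        print(p["pagination"])
```

Each generated payload can be passed as `json=` to `requests.post` with the same headers as above. Pausing briefly between requests keeps the load on the site low.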