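I'm trying to search for a movie title here: https://classindportal.mj.gov.br/consulta-filmes and scrape the results page. I know this involves an intermediate step of sending a specific request to the site with my search term, but so far I haven't been able to do it.
In Chrome DevTools, the Network tab shows the following: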
Request URL: https://classindportal.mj.gov.br/api/solicitacao-classificacao-consultas/list
Request Method: POST
Status Code: 200 OK
Referrer Policy: strict-origin-when-cross-origin
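The request payload contains a key, tituloBr, whose value is the search term (for example, typing "shrek" into the search bar and pressing Enter sends {'tituloBr': 'shrek'}).
I believe the search works by sending a POST request to the request URL shown above with the data {'tituloBr': 'shrek'}, so I used the requests library as follows: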
import requests
payload = {'tituloBr': 'shrek'}
r = requests.post('https://classindportal.mj.gov.br/api/solicitacao-classificacao-consultas/list', data=payload)
But this returns status code 400, and r.reason shows 'Bad Request'.
I don't see anything wrong with the URL or the data I'm sending, so I'm not sure what the problem is.
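I inspected the page, and it appears you need to supply a token, which you can obtain by sending a POST request to:
https://sso.mj.gov.br/auth/realms/PRD/protocol/openid-connect/token
So fetch a token first, then use it in a second request to the API to search for the movie you want: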
import requests
SEARCH_TERM = "shrek"
token_url = "https://sso.mj.gov.br/auth/realms/PRD/protocol/openid-connect/token"
movies_url = (
"https://classindportal.mj.gov.br/api/solicitacao-classificacao-consultas/list"
)
headers = {
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-US,en;q=0.9,he;q=0.8",
"Authorization": "",  # placeholder; overwritten below with a freshly fetched token
"Connection": "keep-alive",
"Origin": "https://classindportal.mj.gov.br",
"Referer": "https://classindportal.mj.gov.br/consulta-filmes",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
"sec-ch-ua": '"Not A(Brand";v="99", "Google Chrome";v="121", "Chromium";v="121"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
}
json_data = {
"currentPage": 0,
"pageSize": 10,
"sortItem": None,
"totalResults": None,
"itens": None,
"tituloBr": SEARCH_TERM,
"tituloOr": "",
"requerente": "",
"produtor": "",
"editora": "",
"idModulo": 1,
}
token_data = {
"client_id": "classind-consultapublica-frontend",
"client_secret": "4PmaBa8bBeVow40SKFNb7qNHzAxuLoqz",
"grant_type": "client_credentials",
"scope": "classind-backend",
}
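with requests.Session() as session:
    # Step 1: obtain a short-lived access token from the SSO endpoint.
    token = session.post(token_url, data=token_data).json()["access_token"]
    headers["Authorization"] = f"Bearer {token}"
    # Step 2: send the search as a JSON body, authenticated with the token.
    response = session.post(movies_url, json=json_data, headers=headers)
    print(response.json())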
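Besides the token, the code above differs from the original attempt in one more way: the search body is passed with json= rather than data=. The small sketch below only prepares the two requests without sending them, so you can compare what each keyword puts on the wire. Whether the 400 came from the missing token, the form encoding, or both is not something I have confirmed; this only illustrates the difference in encoding.

```python
import requests

URL = "https://classindportal.mj.gov.br/api/solicitacao-classificacao-consultas/list"
payload = {"tituloBr": "shrek"}

# Build the requests without sending them, so the encoding can be inspected offline.
form_req = requests.Request("POST", URL, data=payload).prepare()
json_req = requests.Request("POST", URL, json=payload).prepare()

# data= produces a form-encoded body; json= produces a JSON body.
print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json_req.body)
```

The first line printed shows a form-encoded content type and body, the second a JSON one, which matches the JSON payload the browser sends according to DevTools.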
If you like, you can even convert the data to a Pandas DataFrame:
import pandas as pd
# ...
with requests.Session() as session:
    token = session.post(token_url, data=token_data).json()["access_token"]
    headers["Authorization"] = f"Bearer {token}"
    response = session.post(movies_url, json=json_data, headers=headers)
# The matching records are under the "itens" key of the response.
data = response.json()["itens"]
df = pd.DataFrame(data)
print(df)
Which prints:
id tituloBrasil ... classificacaoAtribuida classificacaoPretendida
0 164346 SHREK ... Livre None
1 164345 SHREK 2 ... Livre None
2 164344 SHREK PARA SEMPRE ... Livre None
3 164343 SHREK TERCEIRO ... Livre None
4 146845 SHREK 2 ... Livre None
5 146844 SHREK TERCEIRO ... Livre None
6 135770 SHREK ... Livre None
7 135769 SHREK 2 ... Livre None
8 135768 SHREK PARA SEMPRE ... Livre None
9 135767 SHREK TERCEIRO ... Livre None
[10 rows x 8 columns]
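The output stops at 10 rows because the payload sets "pageSize" to 10. If you need more results, one option is to walk the pages. The following is only a sketch under assumptions I have not verified against this API: that "currentPage" is a zero-based page index the server honors, and that the last page returns fewer than "pageSize" items (or an empty/None "itens"). The helper names iter_items and make_fetcher are hypothetical, not part of requests or the site.

```python
from typing import Callable, Iterator, Optional


def iter_items(
    fetch_page: Callable[[int], Optional[list]],
    page_size: int,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield items page by page until a short or empty page is returned."""
    for page in range(max_pages):  # max_pages is a safety cap against endless loops
        items = fetch_page(page) or []  # treat None as an empty page
        yield from items
        if len(items) < page_size:
            break


def make_fetcher(session, url, headers, base_payload):
    """Hypothetical adapter: returns fetch_page(page) backed by session.post."""

    def fetch_page(page):
        # Assumption: only "currentPage" has to change between requests.
        payload = {**base_payload, "currentPage": page}
        response = session.post(url, json=payload, headers=headers)
        response.raise_for_status()
        return response.json()["itens"]

    return fetch_page
```

Inside the with block, after the token has been set in headers, this could be used as df = pd.DataFrame(iter_items(make_fetcher(session, movies_url, headers, json_data), json_data["pageSize"])). Keeping the HTTP call behind fetch_page means the paging logic can be checked with a fake function before hitting the real site.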