I am using Python Selenium to scrape the comments and replies for a specific stock (e.g. TSLA) from Yahoo Finance conversation pages. Extracting all comments together with their replies is challenging, because Yahoo Finance requires user interaction to reveal the replies under each comment, and individual comments lack unique identifiers. Handling deleted comments adds further complexity.
Here is the approach I have taken so far.
import requests
import json
# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = json.dumps({
    "conversation_id": "sp_Rba9aFpG_finmb$27444752",  # conversation to read
    "count": 250,
    "offset": 0,
    "sort_by": "newest"  # adjust the sort order as needed
})
api_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0',
    'Content-Type': 'application/json',
    'x-spot-id': "sp_Rba9aFpG",  # Spot ID as per your configuration
    'x-post-id': "finmb$27444752",  # Post ID for the desired conversation
    # Include any other headers required by the API
}
# Make the API request to fetch the conversation data
response = requests.post(api_url, headers=api_headers, data=payload)
# Parse the JSON response and print it
data = response.json()
print(json.dumps(data, indent=4)) # Print the response data formatted for readability
You can keep calling the API in a loop until the comments are exhausted.
import requests
from pprint import pprint
# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = {
    "count": 25,
    "offset": 0,
    "sort_by": "newest",
}
api_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0",
    "Content-Type": "application/json",
    "x-spot-id": "sp_Rba9aFpG",  # Spot ID as per your configuration
    "x-post-id": "finmb$27444752",  # Post ID for the desired conversation
    # Include any other headers required by the API
}
# Make the API request to fetch the conversation data.
# Note: the payload here is a dict, so it must be sent with json=...,
# not data=..., otherwise requests form-encodes it despite the JSON
# Content-Type header.
response = requests.post(api_url, headers=api_headers, json=payload)
comments = []
while True:
    data = response.json()
    comm = data["conversation"]["comments"]
    comments.extend(comm)
    pprint(comm)
    if not data["conversation"]["has_next"]:
        break  # last page: stop only after collecting its comments
    payload["conversation_id"] = data["conversation"]["conversation_id"]
    payload["offset"] = data["conversation"]["offset"]
    response = requests.post(api_url, headers=api_headers, json=payload)
pprint(comments)
That is the code, but for a conversation with roughly 900,000 comments like this one, it is almost impossible to run to completion without some failure. So I would suggest writing the comments into a database such as SQLite as you go, and creating checkpoints (containing the last offset), so that if anything fails you can resume from where you stopped.
There is also the risk that Yahoo bans you from using the API.
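To reduce that risk, you could throttle the loop and retry with exponential backoff when the server pushes back. A minimal sketch, with the caveat that the 429 status code and the helper names are my assumptions, not documented Spot.IM behaviour:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    # Exponential backoff with jitter: ~1s, ~2s, ~4s, ... capped at 60s.
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

def fetch_with_retry(post, max_attempts=5, sleep=time.sleep):
    """post() performs one API call and returns a response object, e.g.
    lambda: requests.post(api_url, headers=api_headers, json=payload)."""
    for attempt in range(max_attempts):
        response = post()
        if response.status_code == 429:  # assumed rate-limit response
            sleep(backoff_delay(attempt))
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("gave up after repeated rate-limit responses")
```

Injecting the `post` callable keeps the retry logic separate from the request details, and adding a short `time.sleep` between successful pages as well is a cheap way to stay polite.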