Extracting all comments from a Yahoo Finance community forum


I am using Python and Selenium to scrape the comments and replies for a specific stock (e.g. TSLA) from a Yahoo Finance conversation page. Extracting every comment together with its replies is challenging: Yahoo Finance requires user interaction to reveal the replies under each comment, and individual comments lack unique identifiers. Handling deleted comments adds further complexity.

Here is the approach I have taken so far.

import requests
import json

# Prepare the payload for the API request using the updated 'spotId' and 'uuid'
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = json.dumps({
  "conversation_id": "sp_Rba9aFpG_finmb$27444752",  # Updated to match the desired format
  "count": 250,
  "offset": 0,
  "sort_by": "newest"  # Assuming you want to sort by the newest; adjust as needed
})

api_headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0',
  'Content-Type': 'application/json',
  'x-spot-id': "sp_Rba9aFpG",  # Spot ID as per your configuration
  'x-post-id': "finmb$27444752",  # Post ID updated to reflect the desired conversation
  # Include any other necessary headers as per the API documentation or your requirements
}

# Make the API request to fetch the conversation data
response = requests.post(api_url, headers=api_headers, data=payload)

# Parse the JSON response and print it
data = response.json()
print(json.dumps(data, indent=4))  # Print the response data formatted for readability
Tags: python, selenium-webdriver, web-scraping
1 Answer

You can keep calling the API in a loop until the comments are exhausted.

import requests
from pprint import pprint

# Endpoint and payload for the Spot.IM conversation API
api_url = "https://api-2-0.spot.im/v1.0.0/conversation/read"
payload = {
    "conversation_id": "sp_Rba9aFpG_finmb$27444752",
    "count": 25,
    "offset": 0,
    "sort_by": "newest",
}

api_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0",
    "Content-Type": "application/json",
    "x-spot-id": "sp_Rba9aFpG",  # Spot ID as per your configuration
    "x-post-id": "finmb$27444752",  # Post ID for the desired conversation
}

comments = []
while True:
    # Send the payload as JSON (json=..., not data=...) so it matches
    # the Content-Type header; data= would form-encode the dict.
    response = requests.post(api_url, headers=api_headers, json=payload)
    response.raise_for_status()
    data = response.json()

    conversation = data["conversation"]
    comments.extend(conversation["comments"])

    if not conversation["has_next"]:
        break  # last page reached; its comments are already collected above

    # Carry the server-provided cursor into the next request
    payload["conversation_id"] = conversation["conversation_id"]
    payload["offset"] = conversation["offset"]

pprint(comments)

That is the code, but for a conversation like this one with roughly 900,000 comments, it is almost impossible to get through without hitting an error at some point.

So I suggest you commit the comments to a database such as SQLite as you go, and create checkpoints (storing the last offset) so that, if anything fails, you can resume from where you stopped.

There is also the concern that Yahoo may ban you from using the API.
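To reduce that risk, you can space requests out and back off when the server signals rate limiting. A generic sketch of such a wrapper (`polite_post` is a hypothetical helper, not part of requests or any Yahoo/Spot.IM API; the status codes retried on are an assumption):

```python
import random
import time

def polite_post(session, url, attempts=5, base_delay=1.0, **kwargs):
    """POST with exponential backoff on rate-limit/server errors.

    Retries on 429 and common transient 5xx codes, sleeping
    base_delay * 2**attempt plus a small random jitter between tries.
    """
    response = None
    for attempt in range(attempts):
        response = session.post(url, **kwargs)
        if response.status_code not in (429, 500, 502, 503):
            return response
        time.sleep(base_delay * (2 ** attempt) + random.random())
    response.raise_for_status()  # give up: surface the last error
    return response
```

You would then replace the bare `requests.post(...)` calls in the loop with `polite_post(requests.Session(), api_url, headers=api_headers, json=payload)`, and add a small fixed `time.sleep` between successful pages as well.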
