Collecting all top-level comments from an r/worldnews live thread

Question · Votes: 0 · Answers: 1

I am a student trying to collect all top-level comments from this r/worldnews live thread for a school research project: https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/. I am coding in Python, using the PRAW API and the pandas library. This is the code I have written so far:

import praw
import pandas as pd

# Assumes an authenticated `reddit` instance, e.g.
# reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []

def process_comment(comment):
    # Keep only real top-level comments (not MoreComments placeholders).
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })

submission.comments.replace_more(limit=None, threshold=0)
for top_level_comment in submission.comments.list():
    process_comment(top_level_comment)

comments_df = pd.DataFrame(comments_list)

However, the code times out when limit=None. Other limits (100, 300, 500) return only about 700 comments. Any help collecting all the top-level comments from this Reddit post would be greatly appreciated.

I have gone through hundreds of pages of documentation and Reddit threads and tried the following techniques:

  • Writing "timeout" handling for the Reddit API, pausing and then resuming comment collection
  • Collecting comments in batches and then calling replace_more again, to no avail

I also went through the Reddit API rate-limit documentation, hoping there was a way to work around these limits.
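The "collect in batches" idea can be sketched as a small helper, assuming the `reddit` and `submission` objects from the code above; the batch size and pause length here are illustrative guesses, not tuned values:

```python
import time

def replace_more_in_batches(submission, batch_size=32, pause=5):
    # replace_more(limit=batch_size) resolves up to batch_size
    # "load more comments" placeholders per call and returns the ones
    # it skipped; looping with a pause between calls keeps each burst
    # of requests small, which is gentler on Reddit's rate limits.
    while submission.comments.replace_more(limit=batch_size):
        time.sleep(pause)
```

Once the loop finishes, `submission.comments.list()` holds the fully expanded tree and the `process_comment` loop above can run unchanged.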
python pandas data-science reddit praw
1 Answer

0 votes

I was able to get around the timeout issue and pull over 190k comments by using a recursive function instead of the replace_more method. Maybe this will help:

url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []

def process_comment(comment):
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })

def gather_comments(comment_list):
    # Expand MoreComments placeholders into `expanded` and process real
    # comments exactly once, so no duplicates end up in comments_list.
    # (Rebinding comment_list mid-loop, as a naive version might, would
    # not change the sequence being iterated.)
    expanded = []
    for comment in comment_list:
        if isinstance(comment, praw.models.MoreComments):
            try:
                expanded.extend(comment.comments())
            except Exception as e:
                print(f"Error replacing MoreComments: {e}")
        else:
            process_comment(comment)

    # Freshly fetched batches can themselves contain MoreComments
    # placeholders, so recurse until nothing is left to expand.
    if expanded:
        gather_comments(expanded)


top_level_comments = list(submission.comments)
gather_comments(top_level_comments)

# Create DataFrame
comments_df = pd.DataFrame(comments_list)
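For a research project you will likely want to persist what you collected; a minimal sketch, where the sample rows and file name are illustrative (in the real script, comments_list is filled by gather_comments above):

```python
import pandas as pd

# Hypothetical rows in the shape process_comment() appends.
comments_list = [
    {"author": "alice", "body": "first", "score": 42, "edited": False,
     "created_utc": 1696800000, "permalink": "https://www.reddit.com/..."},
    {"author": "[deleted]", "body": "second", "score": 7, "edited": False,
     "created_utc": 1696800100, "permalink": "https://www.reddit.com/..."},
]

comments_df = pd.DataFrame(comments_list)
# Highest-scored comments first, then save for later analysis.
comments_df = comments_df.sort_values("score", ascending=False)
comments_df.to_csv("top_level_comments.csv", index=False)
```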