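I'm a student trying to collect all top-level comments from this r/worldnews live thread: https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/ for a school research project. I'm currently coding in Python, using the PRAW API and the pandas library. Here is the code I've written so far: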
import praw
import pandas as pd

# `reddit` is assumed to be an authenticated praw.Reddit instance created earlier
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []

def process_comment(comment):
    # Keep only top-level (root) comments
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })

submission.comments.replace_more(limit=None, threshold=0)
for top_level_comment in submission.comments.list():
    process_comment(top_level_comment)

comments_df = pd.DataFrame(comments_list)
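However, the code times out when limit=None. Using other limits (100, 300, 500) returns only about 700 comments. Any help collecting the top-level comments from this Reddit thread would be greatly appreciated.
I have gone through hundreds of pages of documentation and Reddit threads and tried the following techniques:
I was able to get around the timeout and pull over 190k comments by using a recursive function instead of the replace_more method. Maybe this will help: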
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []

def process_comment(comment):
    # Keep only top-level (root) comments
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })

def gather_comments(comment_list):
    # Snapshot the items so the loop is unaffected by anything fetched while iterating
    for comment in list(comment_list):
        if isinstance(comment, praw.models.MoreComments):
            try:
                # Expand this placeholder once and recurse into whatever it returns,
                # which may itself contain further MoreComments placeholders
                gather_comments(comment.comments())
            except Exception as e:
                print(f"Error replacing MoreComments: {e}")
        else:
            process_comment(comment)

top_level_comments = submission.comments
gather_comments(top_level_comments)

# Create DataFrame
comments_df = pd.DataFrame(comments_list)
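If the chain of MoreComments placeholders gets long, a recursive walk may run into Python's recursion limit. Below is a minimal sketch of the same idea written with a queue instead of recursion. The names collect_top_level, is_placeholder and expand are hypothetical helpers of mine, and FakeComment and FakeMore are stand-ins so the snippet runs without Reddit credentials; none of them are part of PRAW.

```python
from collections import deque


def collect_top_level(forest, is_placeholder, expand):
    """Return the non-placeholder items, expanding each placeholder exactly once.

    forest         -- iterable of comments and placeholders
    is_placeholder -- hypothetical predicate that recognises a placeholder
    expand         -- hypothetical callable that fetches a placeholder's items
    """
    queue = deque(forest)
    collected = []
    while queue:
        item = queue.popleft()
        if is_placeholder(item):
            try:
                # Fetched items may contain further placeholders; queue them too.
                queue.extend(expand(item))
            except Exception as exc:
                print(f"Error expanding placeholder: {exc}")
        else:
            collected.append(item)
    return collected


# Stand-in classes (assumptions for illustration only, not PRAW types).
class FakeComment:
    def __init__(self, body):
        self.body = body


class FakeMore:
    def __init__(self, children):
        self._children = children

    def comments(self):
        return self._children


forest = [
    FakeComment("a"),
    FakeMore([FakeComment("b"), FakeMore([FakeComment("c")])]),
    FakeComment("d"),
]
result = collect_top_level(
    forest,
    is_placeholder=lambda c: isinstance(c, FakeMore),
    expand=lambda more: more.comments(),
)
print([c.body for c in result])
```

With PRAW, I would expect this to be called as collect_top_level(submission.comments, lambda c: isinstance(c, praw.models.MoreComments), lambda more: more.comments()), then passing each result through process_comment as before. Note that expanded items are appended to the end of the queue, so the output order differs from the thread order; sort by created_utc afterwards if order matters.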