Optimizing Reddit data scraping with Python + PRAW

Question

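I'm writing a script that scrapes the subreddit "Am I the Asshole" and generates a set of inputs/labels, in order to train a neural network to read a post and predict whether Reddit would judge the poster to be the asshole. (For now I'm ignoring the grayer labels such as ESH and NAH, although once I have a complete NN training cycle set up, including those labels in scraping and training should be trivial.)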

To reproduce this code you need to set up your own PRAW credentials; I used this geeks for geeks post as a guide. Setup takes about 5 minutes.

The desired output looks like:

{body of the original post: (text), label: (NTA/YTA), some way to find the original post: URL + ID}

in a CSV, with 1 data point per row.

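I use the top comment as the label (comment 0 is always the bot's pinned comment, so the top comment is at index 1). If its text does not contain NTA/YTA, I discard the post, since doing sentiment analysis is beyond the scope of this small project.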

import praw
import pandas as pd
import re

# Read-only instance
# ========== PUT YOUR DETAILS HERE (see g4g post linked above) ========== 
reddit = praw.Reddit(client_id="",                            # your client id
                     client_secret="",      # your client secret
                     user_agent="",                             # your user agent
                     check_for_async=False)       
# =======================================================================
 
subreddit = reddit.subreddit("AmITheAsshole")

# Scraping the top 500 posts of all time
posts = subreddit.top(time_filter="all", limit=500)
 
posts_dict = {"Post Text": [], "Label" : [],
              "ID": [], "Post URL": []
              }
 
for post in posts:
    # Index 0 is the bot's pinned comment, so at least 2 comments are needed
    if len(post.comments) >= 2:
        top_comment = post.comments[1].body

        # probably a better way to do this, but this works sooooo...
        nta_label = re.search("NTA", top_comment)
        yta_label = re.search("YTA", top_comment)
        label = ""

        if nta_label is not None:
            label = "nta"
        elif yta_label is not None:
            label = "yta"

        # check that top comment includes a label, else skip
        if label != "":
            # Text inside a post
            posts_dict["Post Text"].append(post.selftext)

            # Label associated with top comment
            posts_dict["Label"].append(label)

            # Unique ID of post
            posts_dict["ID"].append(post.id)

            # URL of post
            posts_dict["Post URL"].append(post.url)
 
# Saving the data in a pandas dataframe + exporting it
top_posts = pd.DataFrame(posts_dict)
top_posts.to_csv("Posts.csv", index=True)
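
The labelling rule used in the loop above can also be isolated as a small pure function, which makes it easy to check without touching the Reddit API. This is only a sketch: `extract_label` is a hypothetical helper name, and it differs slightly from the script in that it matches `NTA`/`YTA` as whole words and returns whichever verdict appears first, rather than always preferring `NTA`.

```python
import re
from typing import Optional

# Whole-word match, so "NTA" buried inside a longer word is ignored.
# Assumption: verdicts are written in upper case, as in the script above.
_VERDICT = re.compile(r"\b(NTA|YTA)\b")


def extract_label(comment_text: str) -> Optional[str]:
    """Return "nta" or "yta" for the first verdict found, or None if there is none."""
    match = _VERDICT.search(comment_text)
    return match.group(1).lower() if match else None
```

A comment with no verdict (for example one that only says ESH) yields `None`, which corresponds to the "skip this post" branch in the script.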

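My main problem is that this code takes a while to run (about 1 hour for 500 posts), and I need thousands of data points. I can make it work if I have to, but it's not ideal, especially if I run into problems and need more data than originally expected. As it stands, the script is not optimized at all, because I'm not very familiar with the more niche aspects of Python that I suspect could speed this up.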

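Can someone let me know how to optimize this, or whether any significant optimization is possible at all? Or is PRAW just slow, so that I simply need to dedicate a computer to this for the next few days?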
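
One direction that may be worth testing, offered as an assumption rather than a measured result: accessing `post.comments` makes PRAW fetch that submission's comment page, and top posts of all time carry very large comment trees, while the script only ever reads the comment at index 1. PRAW's `Submission` objects expose a `comment_limit` attribute that has to be set before the comments are first accessed. The sketch below applies it inside a hypothetical helper, `collect_rows`; the default of 10 is an arbitrary guess, not a tuned value.

```python
from typing import Any, Dict, Iterable, List


def collect_rows(posts: Iterable[Any], comment_limit: int = 10) -> List[Dict[str, str]]:
    """Build one row per post whose top comment contains NTA or YTA."""
    rows: List[Dict[str, str]] = []
    for post in posts:
        # Assumption: asking for fewer comments makes each fetch cheaper.
        # Must be set before post.comments is accessed for the first time.
        post.comment_limit = comment_limit

        comments = post.comments
        if len(comments) < 2:
            continue  # only the bot's pinned comment (or nothing) is present

        # Index 0 is the pinned bot comment; index 1 is the top comment.
        # getattr guards against entries that have no body attribute.
        body = getattr(comments[1], "body", "")
        if "NTA" in body:
            label = "nta"
        elif "YTA" in body:
            label = "yta"
        else:
            continue  # no usable verdict, skip the post

        rows.append({"Post Text": post.selftext, "Label": label,
                     "ID": post.id, "Post URL": post.url})
    return rows
```

With the script above this would be called as `collect_rows(subreddit.top(time_filter="all", limit=500))`, and the result passed to `pd.DataFrame(...)` before `to_csv`. Timing a run on a small `limit` with and without the `comment_limit` line would show whether the comment fetch is really where the hour goes.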
