Fetching Reddit data into JSON Lines with PRAW


I want to use PRAW to fetch data from Reddit posts and turn it into a JSON Lines file.

What I need is something like this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?"], "response": ["My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "response": ["No, it's still in the game. Use the debug stick to set all sides to `none`"], "id": "gabsj3"}

So context contains ["POST TITLE", "FIRST LEVEL COMMENT", "SECOND LEVEL COMMENT", "ETC..."] and response contains the last-level comment. For this Reddit post, it should look like this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in", "No, it's still in the game. Use the debug stick to set all sides to `none`"], "response": ["Huh, alright"], "id": "gabsj3"}

But my code outputs this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?", "I think we can still use resource packs to change it back into a dot, I don't know so don't quote me on that", "I honestly think the cross redstone looks a bit more like a splatter."], "id": "gabsj3"}

Here is my code:

import praw
import jsonlines

reddit = praw.Reddit(client_id='-', client_secret='-', user_agent='user_agent')

max = 1000
sequence = 1
for post in reddit.subreddit('minecraft').new(limit=max):
    data = []
    title = []
    comment = []
    response = []
    post_id = post.id
    titl = post.title
    # print("https://www.reddit.com/" + post.permalink)

    print("Fetched " + str(sequence) + " posts .. ")
    title.append(titl)
    try:
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=None)
        sequence = sequence + 1

        for top_level_comment in submission.comments:
            cmnt_body = top_level_comment.body
            comment.append(cmnt_body)
            for second_level_comment in top_level_comment.replies:
                response.append(second_level_comment.body)
            context = [title[0], comment[0]]
            data.append({"context": context, "response": response, "id": post_id})
            response = []
            # print(data[0])
            with jsonlines.open('2020-04-30_12.jsonl', mode='a') as writer:
                writer.write(data.pop())
            comment.pop()
        title.pop()

    except Exception:
        pass
1 Answer

This is an interesting way to store the data. I can't say I would use this approach myself, since it involves repeating the same information over and over again.

To accomplish this, you need to maintain a stack containing the current context, and use recursion to fetch each comment's children.

import jsonlines
import praw

reddit = praw.Reddit(...)  # fill in with your authentication


def main():
    for post in reddit.subreddit("minecraft").new(limit=1000):
        dump_replies(replies=post.comments, context=[post.title])


def dump_replies(replies, context):
    for reply in replies:
        if isinstance(reply, praw.models.MoreComments):
            continue

        reply_data = {
            "context": context,
            "response": reply.body,
            "id": reply.submission.id,
        }
        with jsonlines.open("2020-04-30_12.jsonl", mode="a") as writer:
            writer.write(reply_data)

        context.append(reply.body)
        dump_replies(reply.replies, context)
        context.pop()


main()

Before each recursive call, we append the current item's body to the context list, and we remove it again after the recursion returns. This builds up a stack representing the path to the current comment. Then, for each comment, we dump its context, its body, and its submission ID.
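
If it helps to see the stack pattern in isolation, here is a minimal sketch of the same append/recurse/pop idea applied to plain dicts instead of PRAW comment objects (the toy tree below is made up purely for illustration):

def walk(replies, context):
    for reply in replies:
        # Emit the current reply together with the path leading to it.
        print({"context": list(context), "response": reply["body"]})
        # Push the reply onto the stack while visiting its children, then pop it.
        context.append(reply["body"])
        walk(reply["children"], context)
        context.pop()


tree = [
    {"body": "Debug Stick?", "children": [
        {"body": "My guess is the dot is flat out gone", "children": []},
    ]},
]
walk(tree, ["Cross your redstone wires - Snapshot 20w18a is out"])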

Note that this dumps nothing for posts that have no comments, which seems consistent with the approach in your sample data (since each line represents a comment that is a reply to something else).
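
As a quick sanity check, the resulting file can be read back with the same jsonlines library; a minimal sketch, assuming the filename used in the code above:

import jsonlines

with jsonlines.open("2020-04-30_12.jsonl") as reader:
    for record in reader:
        # Each line holds the path to one comment plus that comment's body.
        print(record["id"], len(record["context"]), record["response"][:40])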
