Scraping comments from a dynamically changing website

Question (votes: 0, answers: 1)

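I need to scrape/parse the comments from a post on the Debank website, for example https://debank.com/stream/2057406. The problem is that the HTML changes as I scroll: comments are rendered in real time. Each comment has its own index, and as I scroll, new comments are added to the page while earlier ones disappear from it. For example, the first comment has the attribute data-index="0", the second has data-index="1", and it works like this: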

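The page loads and I scroll down a little to reach the comment section. With the developer tools (F12) open, the comments and their indices are visible:

data-index="0" data-index="1" data-index="2" data-index="3"
Scroll further:
data-index="1" data-index="2" data-index="3" data-index="4"
Scroll further:
data-index="2" data-index="3" data-index="4" data-index="5"
and so on.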

I tried requests, Selenium, and even downloading the page source and parsing it locally, but I only ever get the first few comments, and I need all of them.
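The sliding window of data-index values described above means that any single snapshot of the page holds only a handful of comments. If one did stay with a browser-driven approach, each snapshot taken after a scroll step would have to be merged with the earlier ones, keyed on data-index. The sketch below only illustrates that merging step on simulated data; merge_snapshots and the (index, text) snapshot format are hypothetical names made up for the illustration, not anything Debank or Selenium provides.

```python
from typing import Dict, Iterable, List, Tuple

# One snapshot = the (data-index, comment text) pairs visible at one moment.
Snapshot = List[Tuple[int, str]]


def merge_snapshots(snapshots: Iterable[Snapshot]) -> List[str]:
    """Combine overlapping snapshots into one list ordered by data-index."""
    seen: Dict[int, str] = {}
    for snapshot in snapshots:
        for index, text in snapshot:
            # Keep the first text seen for each data-index; repeats are ignored.
            seen.setdefault(index, text)
    return [seen[i] for i in sorted(seen)]


# Simulated snapshots mirroring the windows 0-3, 1-4 and 2-5 from the question.
snapshots = [
    [(i, f"comment {i}") for i in range(0, 4)],
    [(i, f"comment {i}") for i in range(1, 5)],
    [(i, f"comment {i}") for i in range(2, 6)],
]
all_comments = merge_snapshots(snapshots)
print(len(all_comments))  # 6
```

In a real scraper the snapshots would come from reading the currently rendered comment elements after every scroll, which is slower and more fragile than calling an API directly.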

python parsing web-scraping
1 Answer
0 votes

You can try using their REST API (you will probably also need to adjust the HTTP x-api-* headers):

import requests

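# Post ID, taken from the URL https://debank.com/stream/2057406
id_ = 2057406
# {id} and {start} are filled in per request; limit and order_by are fixed here
comments_url = "https://api.debank.com/article/comments?id={id}&start={start}&limit=20&order_by=-trust_degree"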

headers = {
    "x-api-nonce": "n_XaAx0SDDZtZFMoTbw7KXX8PTERJWl5pFFGcCs6lb",
    "x-api-sign": "6e9ceb1c9431a2d47cf549a3910e3df91eb6b64b58453b2c6e6906c659f34f05",
    "x-api-ts": "1715028741",
    "x-api-ver": "v2",
}

# Request the first batch of comments (start=0) and print each comment's text
data = requests.get(comments_url.format(id=id_, start=0), headers=headers).json()
for c in data["data"]["comments"]:
    print(c["content"])
    print()

Prints:

I always follow back on "Hi Message"

Hopefully a real tokens airdrop and not another points system

Same here I unfollowed a bunch of addresses and the more surprising was that some sent me a hi to be followed and still unfollowed me :D

I must be annoying them a lot with my posts 😅

If there are points like on Rabby, it depends on whether they admit it once or whether it will be open to further collection. And if so, what will they continue to award them for and how long will it last? I would prefer if they counted once and closed the Airdrop collection.

really hope they won't announce a point system after a year farming it 😅

unfollow

Do you trust airdrop ? I feel like 2024 airdrop are just bad

👀

...
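The request above fetches only one batch (start=0, limit=20). To collect every comment, the same call could be repeated with a growing start value. The sketch below assumes that start is a plain offset into the comment list and that the API returns an empty list once the comments run out; neither is confirmed by the answer, so treat both as guesses to verify. iter_comments, fetch_page and max_pages are hypothetical names, and the fetcher here is a local fake so the example runs without network access.

```python
from typing import Callable, Iterator, List


def iter_comments(
    fetch_page: Callable[[int], List[dict]],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield comments batch by batch until an empty batch comes back.

    fetch_page(start) must return the list of comment dicts for that offset.
    Assumption: `start` is an offset and an empty list marks the end.
    """
    start = 0
    for _ in range(max_pages):  # safety cap in case the end is never signalled
        page = fetch_page(start)
        if not page:
            break
        yield from page
        start += len(page)  # advance by the number of comments received


# Fake data source standing in for the HTTP call: 45 comments, 20 per batch.
fake_comments = [{"content": f"comment {i}"} for i in range(45)]


def fake_fetch(start: int) -> List[dict]:
    return fake_comments[start:start + 20]


collected = [c["content"] for c in iter_comments(fake_fetch)]
print(len(collected))  # 45
```

With the real API, fetch_page could wrap the requests.get call from the answer and return data["data"]["comments"] for the given start. Adding a short pause between requests would be a sensible precaution.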