Selenium: clicking buttons to scrape a web page

Problem description · Votes: 0 · Answers: 1

I'm trying to use Selenium to scrape data from a page that requires clicking each round to reveal more data, but I'm very inexperienced with Selenium and can't locate the element I want to scrape from.

I'm running Selenium on Google Colab and locating the element by XPath, but it doesn't seem to find it:

import google_colab_selenium as gs  # assumed import for `gs` (Colab-friendly Selenium wrapper)
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = gs.Chrome(options=options)
driver.get('https://dropstab.com/coins/centrifuge/fundraising')
button = driver.find_element(by=By.XPATH, value='/html/body/div/div[1]/div/div[2]/main/div/article/div/div/section/div/div[1]/section[1]/div/div[1]/button')
button.click()

For reference, the buttons in question are the ones for each fundraising round (Series A, Venture Round, etc.) that appear as you scroll down the page.

python selenium-webdriver web-scraping google-colaboratory
1 answer

0 votes

You can get the fundraising information more easily by parsing the JSON data embedded in the page, for example:

import json

import requests
from bs4 import BeautifulSoup

url = "https://dropstab.com/coins/centrifuge/fundraising"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

fundraising = data["props"]["pageProps"]["coin"]["fundraising"]

# print(json.dumps(fundraising, indent=4))

for s in fundraising["sales"]:
    print(s["name"], s["raised"])
    # ... print other info here
    print()

This prints:

Series A 15000000

Venture Round 4000000

Funding Round 3000000

Community Grants None

Early Ecosystem None

Rewards & Grants None

Core Contributors None

Total Backers None

Foundation Endowment None

Development Grants 1800000

Venture Round 4300000

Strategic Round 3700000

Seed 3800000

Main Sale Option 2 8882500

Main Sale Option 1 9350000
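For reference, the dictionary path used above (`props → pageProps → coin → fundraising → sales`) can be exercised against a minimal mock of the embedded `__NEXT_DATA__` payload; the sample values below are made up for illustration, not real site data:

```python
import json

# Minimal mock of the <script id="__NEXT_DATA__"> payload (hypothetical
# values) to illustrate the lookup path used in the answer above.
mock_next_data = json.dumps({
    "props": {
        "pageProps": {
            "coin": {
                "fundraising": {
                    "sales": [
                        {"name": "Series A", "raised": 15000000},
                        {"name": "Seed", "raised": None},
                    ]
                }
            }
        }
    }
})

data = json.loads(mock_next_data)
sales = data["props"]["pageProps"]["coin"]["fundraising"]["sales"]
for s in sales:
    print(s["name"], s["raised"])
```

Note that `raised` can be `None` (as in the real output above for rounds like "Community Grants"), so any downstream arithmetic should guard against that.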