How to get <a> tags in nested HTML using BeautifulSoup4


I want to access the href links. My HTML, however, has a nested structure.

I am trying to do this with BeautifulSoup4, but I am new to web scraping. The code I am using is:

import requests
from bs4 import BeautifulSoup
import time

url = "https://openfinancebrasil.atlassian.net/wiki/spaces/OF/pages/17368301/DA+API+-+Canais+de+Atendimento"


response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    page_body = soup.find_all('div', class_= '_1bsb1osq _19pkidpf _2hwx1wug _otyridpf _18u01wug') 
    for p in page_body:
        print(p.find_all('a'))
else:
    print(f"Failed to retrieve content. Status Code: {response.status_code}")

However, my print statement shows an empty list:

[]

My question is: is there a way to access this element directly?

python web-scraping beautifulsoup
1 Answer

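The data you need is loaded dynamically, so BeautifulSoup alone cannot scrape it: requests only returns the initial HTML, before the page's JavaScript has inserted the links. You can either use Selenium or fetch the data from the API request the page makes. Here I used Selenium together with BeautifulSoup, and it works fine.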
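As a rough way to confirm that the links are missing from the static HTML rather than hidden behind a wrong selector, you can count selector matches in the raw response and compare with the rendered page. The sketch below uses two hypothetical HTML strings as stand-ins for `response.text` and `driver.page_source`; the helper name `count_static_links` and the sample markup are assumptions made for illustration, not the real page content.

```python
from bs4 import BeautifulSoup


def count_static_links(html: str, selector: str) -> int:
    """Return how many elements in the given HTML match the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


# Hypothetical stand-in for requests.get(url).text:
# an empty container that JavaScript would fill in later.
static_html = '<div id="root"></div><script src="app.js"></script>'

# Hypothetical stand-in for driver.page_source after the browser rendered the page.
rendered_html = (
    '<div id="root"><ul class="childpages-macro">'
    '<li><a href="/wiki/spaces/OF/pages/223773060">Page</a></li>'
    "</ul></div>"
)

selector = "ul.childpages-macro li a"
print(count_static_links(static_html, selector))    # 0: nothing to scrape yet
print(count_static_links(rendered_html, selector))  # 1: link exists after rendering
```

If the count is zero for the raw response but non-zero for the rendered source, the content is most likely injected by JavaScript, and a browser-driven approach is a reasonable next step.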

Script:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Keep Chrome open after the script finishes
options.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Load the page in a real browser so that its JavaScript runs
driver.get("https://openfinancebrasil.atlassian.net/wiki/spaces/OF/pages/17368301/DA+API+-+Canais+de+Atendimento")
# Fixed pause to give the dynamic content time to load
time.sleep(3)

# Parse the rendered HTML (the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(driver.page_source, 'lxml')

# Each <li> in the child-pages list holds one link
page_body = soup.select('ul.childpages-macro.conf-macro.output-block li')
for p in page_body:
    if p.a:  # skip list items that contain no <a> tag
        print(p.a.get('href'))

Output:

/wiki/spaces/OF/pages/223773060
/wiki/spaces/OF/pages/297533441
/wiki/spaces/OF/pages/297533461
/wiki/spaces/OF/pages/297533518
/wiki/spaces/OF/pages/297533542
/wiki/spaces/OF/pages/297533567
/wiki/spaces/OF/pages/17368404
/wiki/spaces/OF/pages/17368427
/wiki/spaces/OF/pages/17368487
/wiki/spaces/OF/pages/17368514
/wiki/spaces/OF/pages/17368537/v1.0.1+-+Canais+de+Atendimentos
/wiki/spaces/OF/pages/17368560
/wiki/spaces/OF/pages/17368587/v1.0.0-rc5.2+-+Canais+de+Atendimentos
/wiki/spaces/OF/pages/17368610
/wiki/spaces/OF/pages/223805833
/wiki/spaces/OF/pages/223805853
/wiki/spaces/OF/pages/223805910
/wiki/spaces/OF/pages/223805934
/wiki/spaces/OF/pages/266895490
/wiki/spaces/OF/pages/266895510
/wiki/spaces/OF/pages/266895567
/wiki/spaces/OF/pages/266895591
/wiki/spaces/OF/pages/282886145
/wiki/spaces/OF/pages/282886165
/wiki/spaces/OF/pages/282886222
/wiki/spaces/OF/pages/282886246
/wiki/spaces/OF/pages/283901953
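
The hrefs printed above are relative paths. If you need full URLs, one option is to join them with the site's base address using `urljoin` from the standard library. This is a minimal sketch: the `BASE_URL` constant and the `relative_links` list are names introduced here for illustration, with the host taken from the page URL in the script and two of the paths copied from the output.

```python
from urllib.parse import urljoin

# Host of the wiki page used in the script above
BASE_URL = "https://openfinancebrasil.atlassian.net"

# Two of the relative hrefs from the output, used as sample input
relative_links = [
    "/wiki/spaces/OF/pages/223773060",
    "/wiki/spaces/OF/pages/17368537/v1.0.1+-+Canais+de+Atendimentos",
]

# urljoin resolves each path against the base, as a browser would
absolute_links = [urljoin(BASE_URL, href) for href in relative_links]

for link in absolute_links:
    print(link)
```

In the scraping loop, the same call could be applied to each `p.a.get('href')` value before printing or storing it.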