I want to access the href links, but my HTML is a nested structure, as shown in the image below.
I am trying to do this with BeautifulSoup4, but I am new to web scraping. The code I am using is:
import requests
from bs4 import BeautifulSoup
import time
url = "https://openfinancebrasil.atlassian.net/wiki/spaces/OF/pages/17368301/DA+API+-+Canais+de+Atendimento"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    page_body = soup.find_all('div', class_='_1bsb1osq _19pkidpf _2hwx1wug _otyridpf _18u01wug')
    for p in page_body:
        print(p.find_all('a'))
else:
    print(f"Failed to retrieve content. Status Code: {response.status_code}")
However, my print statement shows an empty list:
[]
My question is: is there a way to access this element directly?
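The data you need is loaded dynamically, so BeautifulSoup alone cannot scrape it. You can either use Selenium or get the data from an API request. Here I used Selenium with BeautifulSoup, and it works fine now.
Script: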
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Keep the Chrome window open after the script finishes
options.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://openfinancebrasil.atlassian.net/wiki/spaces/OF/pages/17368301/DA+API+-+Canais+de+Atendimento")
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'lxml')
page_body = soup.select('ul.childpages-macro.conf-macro.output-block li')
for p in page_body:
    print(p.a.get('href'))
Output:
/wiki/spaces/OF/pages/223773060
/wiki/spaces/OF/pages/297533441
/wiki/spaces/OF/pages/297533461
/wiki/spaces/OF/pages/297533518
/wiki/spaces/OF/pages/297533542
/wiki/spaces/OF/pages/297533567
/wiki/spaces/OF/pages/17368404
/wiki/spaces/OF/pages/17368427
/wiki/spaces/OF/pages/17368487
/wiki/spaces/OF/pages/17368514
/wiki/spaces/OF/pages/17368537/v1.0.1+-+Canais+de+Atendimentos
/wiki/spaces/OF/pages/17368560
/wiki/spaces/OF/pages/17368587/v1.0.0-rc5.2+-+Canais+de+Atendimentos
/wiki/spaces/OF/pages/17368610
/wiki/spaces/OF/pages/223805833
/wiki/spaces/OF/pages/223805853
/wiki/spaces/OF/pages/223805910
/wiki/spaces/OF/pages/223805934
/wiki/spaces/OF/pages/266895490
/wiki/spaces/OF/pages/266895510
/wiki/spaces/OF/pages/266895567
/wiki/spaces/OF/pages/266895591
/wiki/spaces/OF/pages/282886145
/wiki/spaces/OF/pages/282886165
/wiki/spaces/OF/pages/282886222
/wiki/spaces/OF/pages/282886246
/wiki/spaces/OF/pages/283901953
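The hrefs printed above are relative paths, and the fixed `time.sleep(3)` can be too short on a slow connection. As a rough sketch (not tested against the live page), the variant below waits explicitly for the child-page list to appear and turns each href into an absolute URL. The helper names `extract_links` and `fetch_rendered_html` are hypothetical, the selector is a shortened form of the one in the script, and the bare `webdriver.Chrome()` call assumes Selenium 4.6 or later, which can locate a driver on its own.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = (
    "https://openfinancebrasil.atlassian.net/wiki/spaces/OF/pages/"
    "17368301/DA+API+-+Canais+de+Atendimento"
)
# Shortened form of the selector from the script above (an assumption:
# it matches every anchor inside the child-pages list, not only the
# first anchor of each <li>).
LINK_SELECTOR = "ul.childpages-macro li a[href]"


def extract_links(html, base_url):
    """Return absolute URLs for every child-page link found in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select(LINK_SELECTOR)]


def fetch_rendered_html(url, timeout=15):
    """Load url in Chrome and return the page source once the list exists."""
    # Imported lazily so extract_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one link is in the DOM, or raise
        # TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, LINK_SELECTOR))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling `extract_links(fetch_rendered_html(PAGE_URL), PAGE_URL)` should then give a list of full URLs. Keeping the parsing in its own function also means you can try the selector on a saved copy of the HTML without starting a browser.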