Web scraping company names with BeautifulSoup

Question

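I am trying to extract company names from a website, but I can't get any results. I'm not sure whether I've selected the wrong class or am doing something else wrong.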

Website: https://wineparis-vinexpo.com/newfront/search/exhibitors

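The companies I want are the ones listed under the "Exhibitors" section. The first company on the list is "0.0% Sober Spirits", followed by "1884 Dumangin J. Fils", and so on.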

import requests
from bs4 import BeautifulSoup

URL = "https://wineparis-vinexpo.com/newfront/search/exhibitors"
page = requests.post(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="__next")
ex_elements = results.find_all("div", class_="MuiBox-root css-k008qs")

for ex_element in ex_elements:
    company = ex_element.find("div", class_="MuiBox-root css-k008qs")
    print(company.text.strip())
    print()

Also, the list spans multiple pages, but I haven't tackled that yet.

Any help would be greatly appreciated.

Answer 1

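The content you are looking for is loaded after the page has already been rendered. The requests module only fetches the initial HTML and does not run the scripts that load the rest, so it never sees that content.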
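To illustrate why the loop in the question prints nothing, here is a minimal sketch. `STATIC_HTML` is a made-up stand-in for the kind of response a JavaScript-rendered page returns to requests, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Assumption: a simplified, made-up stand-in for the initial HTML of a
# JavaScript-rendered page. The root container is present but still empty.
STATIC_HTML = """
<html>
  <body>
    <div id="__next"></div>
    <script src="/app.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
root = soup.find(id="__next")

# Same lookup as in the question: the container is found, but it has no
# matching children because that markup is only added later by JavaScript.
matches = root.find_all("div", class_="MuiBox-root css-k008qs")
print(len(matches))  # prints 0
```

A quick way to check the real response is to search `page.text` for a company name you can see in the browser. If it isn't there, no BeautifulSoup selector will find it.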

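One way around this is to use selenium, which drives a real browser, so you can wait for the page to finish loading and only then read the content. Selenium will also make loading the subsequent pages easier.

To set up selenium, install the package with pip (or your preferred package manager):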

pip install selenium

Then install the web driver for the browser of your choice:

https://sites.google.com/a/chromium.org/chromedriver/downloads
https://github.com/mozilla/geckodriver/releases
https://webkit.org/blog/6900/webdriver-support-in-safari-10/

After that, import selenium and the parts of it you need:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("https://wineparis-vinexpo.com/newfront/search/exhibitors")

# Keep retrying element lookups for up to 5 seconds before giving up
driver.implicitly_wait(5)

# "idHere" is a placeholder: replace it with the ID of the element you want
ex_elements = driver.find_element(by=By.ID, value="idHere")

driver.quit()  # Quit selenium driver after use
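If you would rather keep parsing with BeautifulSoup, one option is to wait explicitly for the names to appear and then pass the rendered HTML to it. The following is only a sketch: `NAME_SELECTOR`, `extract_names`, and `scrape_rendered_names` are hypothetical names, and the selector is a placeholder you would need to replace after inspecting the rendered page in your browser's developer tools.

```python
from bs4 import BeautifulSoup

# Assumption: placeholder selector, not taken from the real site.
NAME_SELECTOR = "div.exhibitor-name"


def extract_names(html, selector=NAME_SELECTOR):
    """Return the stripped text of every element matching the selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]


def scrape_rendered_names(url, selector=NAME_SELECTOR, timeout=10):
    """Load the URL in Chrome, wait for the names to appear, then parse them."""
    # Imported here so extract_names() stays usable without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching element is in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        # page_source is the DOM after JavaScript has run.
        return extract_names(driver.page_source, selector)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `scrape_rendered_names("https://wineparis-vinexpo.com/newfront/search/exhibitors")` with a working selector would return the names on the first page as a list of strings. Splitting the parsing into `extract_names` lets you test the selector against saved HTML without opening a browser.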

More information on how to use it can be found in the selenium documentation:

https://www.selenium.dev/documentation/
