Web scraping with BeautifulSoup only gives a NoneType error


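From this link: https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance?year=9a9aaf13-af33-47f6-9150-8eabe38f0aa8

I want to scrape the course codes under the "Course content" section. However, every version of the code I run, and every change I make to it, ends with the same error message about "NoneType".

Here is a sample of the HTML: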

`<div class="sc-fXSgeo BFRgs">
  <div class="courseList">
    <div class="sc-esYiGF ikRlqb ui-card ui-card--course">
      <div class="codeUnitContainer"><div class="code">FM100</div>
      <div class="unit">Half unit</div>
    </div>
    <div class="card__content">
      <h4 class="card__title">
<a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm" rel="noopener noreferrer" target="_blank">Introduction to Finance</a>
</h4>
</div>`

Could you write code that works? I don't know where the problem is.

course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})

python html loops beautifulsoup python-requests
1 answer

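This page uses JavaScript to load this section, so HTML fetched without a browser (for example with requests) most likely does not contain the courseList element. In that case soup.find() returns None, and calling .find_all() on None raises the "NoneType" error. You may need Selenium to control a real web browser to get the rendered content.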
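To make the failure mode concrete, here is a minimal sketch of checking for None before calling find_all(). The static_html string is a hypothetical stand-in for what a plain HTTP request might return, and find_course_cards is an illustrative helper name, not something from the original code. It uses the built-in "html.parser" so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by a plain HTTP request:
# the JavaScript-rendered course list is missing.
static_html = "<html><body><div id='root'></div></body></html>"

# Hypothetical stand-in for the HTML after the browser has rendered the page.
rendered_html = (
    "<div class='courseList'>"
    "<div class='ui-card ui-card--course'><div class='code'>FM100</div></div>"
    "</div>"
)


def find_course_cards(html):
    """Return the course cards, or an empty list if the container is absent."""
    soup = BeautifulSoup(html, "html.parser")
    course_list = soup.find("div", class_="courseList")
    if course_list is None:
        # soup.find() returns None when nothing matches; calling
        # .find_all() on it would raise AttributeError.
        return []
    return course_list.find_all("div", class_="ui-card--course")


print(len(find_course_cards(static_html)))
print(len(find_course_cards(rendered_html)))
```

An empty result from this kind of check is a hint that the content is added later by JavaScript rather than that the selector is wrong.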

Once you have a driver, you can get the rendered HTML from it and pass it to BeautifulSoup:

soup = BeautifulSoup(driver.page_source, 'html5lib')

Or you can use Selenium itself to search for the data:

course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')

Minimal working code:

#!/usr/bin/env python3

from selenium import webdriver
from selenium.webdriver.firefox.service import Service  # Firefox is used below, so use its Service class
from selenium.webdriver.common.by import By
#from selenium.webdriver.common.keys import Keys
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException

#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

import time

#import undetected_chromedriver as uc

# ---

import selenium
print('Selenium:', selenium.__version__)

# ---

url = 'https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance'

#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

#driver = uc.Chrome(executable_path='/home/furas/bin/chromedriver', service_args=['--quiet'])
#driver = uc.Chrome()

#driver.maximize_window()

driver.get(url)
#driver.get("data:text/html;charset=utf-8," + html)

# ---

time.sleep(5)

#text_box.send_keys(Keys.ARROW_DOWN)

#wait = WebDriverWait(driver, 10)
#all_items = wait.until(EC.visibility_of_element_located((By.XPATH, "//a")))
#all_items = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//a")))

# ---
 
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html5lib')

course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})

print('--- code ---', len(course_code))
for item in course_code:
    print(item.get_text(strip=True, separator='\n').split('\n'))

# ---

course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
#course_code = course_list.find_elements(By.XPATH, './/div[@class="sc-esYiGF ikRlqb ui-card ui-card--course"]')
course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
#course_code = driver.find_elements(By.CSS_SELECTOR, 'div.courseList div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
print('--- code ---', len(course_code))
for item in course_code:
    print(item.text.split('\n'))

Result:

It seems that BeautifulSoup finds more elements (9 versus 7 with Selenium).

--- code --- 9
['FM100', 'Half unit', 'Introduction to Finance']
['EC1A3', 'Half unit', 'Microeconomics I']
['EC1B3', 'Half unit', 'Macroeconomics I']
['ST102', 'One unit', 'Elementary Statistical Theory']
['FM102', 'Half unit', 'Quantitative Methods for Finance']
['MA108', 'Half unit', 'Methods in calculus and linear algebra']
['LSE100', 'Half unit', 'The LSE Course']
['AC102', 'Half unit', 'Elements of Financial Accounting']
['ST101', 'Half unit', 'Programming for Data Science']
--- code --- 7
['FM100', 'Half unit', 'Introduction to Finance']
['EC1A3', 'Half unit', 'Microeconomics I']
['EC1B3', 'Half unit', 'Macroeconomics I']
['ST102', 'One unit', 'Elementary Statistical Theory']
['FM102', 'Half unit', 'Quantitative Methods for Finance']
['MA108', 'Half unit', 'Methods in calculus and linear algebra']
['LSE100', 'Half unit', 'The LSE Course']
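Since the question asks only for the course codes, here is a short sketch that pulls them out by the "code" class shown in the HTML sample. The sample_html string is a tidied copy of that sample, and extract_course_codes is a hypothetical helper name. The selector deliberately avoids class names such as sc-esYiGF and ikRlqb, on the assumption that they are auto-generated and may change.

```python
from bs4 import BeautifulSoup

# Tidied copy of the HTML sample from the question, used here as test input.
sample_html = """
<div class="courseList">
  <div class="sc-esYiGF ikRlqb ui-card ui-card--course">
    <div class="codeUnitContainer">
      <div class="code">FM100</div>
      <div class="unit">Half unit</div>
    </div>
    <div class="card__content">
      <h4 class="card__title">
        <a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm">Introduction to Finance</a>
      </h4>
    </div>
  </div>
</div>
"""


def extract_course_codes(html):
    """Return the text of every div.code found inside div.courseList."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.select("div.courseList div.code")]


print(extract_course_codes(sample_html))  # prints ['FM100']
```

In the full script, the same function could be given driver.page_source instead of sample_html.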