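I want to scrape the course codes under the "Course content" section. However, every version of the code I run, no matter how I change it, fails with an error mentioning "NoneType".
Here is a sample of the HTML: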
`<div class="sc-fXSgeo BFRgs">
<div class="courseList">
<div class="sc-esYiGF ikRlqb ui-card ui-card--course">
<div class="codeUnitContainer"><div class="code">FM100</div>
<div class="unit">Half unit</div>
</div>
<div class="card__content">
<h4 class="card__title">
<a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm" rel="noopener noreferrer" target="_blank">Introduction to Finance</a>
</h4>
</div>`
Could you write code that works? I can't tell where the problem is.
course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
This page uses JavaScript to load this section, so you may need Selenium to control a real web browser in order to get it.
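This also explains the error in the question: `soup.find()` returns `None` when nothing matches, and calling `.find_all()` on `None` fails. A minimal sketch of that failure mode, assuming the HTML received without JavaScript simply lacks the `courseList` div (the `static_html` string is a made-up stand-in, not the real response):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for HTML fetched without running JavaScript:
# the "courseList" div has not been rendered yet.
static_html = '<div id="root"></div>'

soup = BeautifulSoup(static_html, "html.parser")
course_list = soup.find("div", attrs={"class": "courseList"})

if course_list is None:
    # Calling course_list.find_all(...) here would raise AttributeError
    # because course_list is None.
    cards = []
    print("courseList not found in the HTML")
else:
    cards = course_list.find_all("div", attrs={"class": "ui-card--course"})
```

Checking for `None` before calling `.find_all()` will not make the data appear, but it turns the crash into a clear message that the section is missing from the HTML you received.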
Once you have a `driver`, you can get the HTML and pass it to BeautifulSoup:
soup = BeautifulSoup(driver.page_source, 'html5lib')
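If only the course codes are needed, selecting the `code` div directly may be simpler. A sketch that uses the HTML sample from the question in place of `driver.page_source` (closing tags were added here so the snippet is complete, and the built-in `html.parser` is used instead of `html5lib` to avoid an extra dependency):

```python
from bs4 import BeautifulSoup

# HTML sample from the question, completed with closing tags.
sample_html = """
<div class="courseList">
  <div class="sc-esYiGF ikRlqb ui-card ui-card--course">
    <div class="codeUnitContainer"><div class="code">FM100</div>
      <div class="unit">Half unit</div>
    </div>
    <div class="card__content">
      <h4 class="card__title">
        <a href="https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2023_FM100.htm">Introduction to Finance</a>
      </h4>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select every "code" div nested anywhere inside the "courseList" div.
codes = [div.get_text(strip=True) for div in soup.select("div.courseList div.code")]
print(codes)
```

The selector relies only on `courseList` and `code`. Class names such as `sc-esYiGF` and `ikRlqb` look auto-generated and may change between site builds, so it is probably safer not to depend on them.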
Or you can use Selenium itself to search for the data:
course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
Minimal working code:
#!/usr/bin/env python3
from selenium import webdriver
from selenium.webdriver.firefox.service import Service  # Firefox (geckodriver) service, matching webdriver.Firefox below
from selenium.webdriver.common.by import By
#from selenium.webdriver.common.keys import Keys
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import time
#import undetected_chromedriver as uc
# ---
import selenium
print('Selenium:', selenium.__version__)
# ---
url = 'https://www.lse.ac.uk/study-at-lse/undergraduate/bsc-finance'
#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
#driver = uc.Chrome(executable_path='/home/furas/bin/chromedriver', service_args=['--quiet'])
#driver = uc.Chrome()
#driver.maximize_window()
driver.get(url)
#driver.get("data:text/html;charset=utf-8," + html)
# ---
time.sleep(5)  # crude fixed wait so JavaScript has time to render the course list
#text_box.send_keys(Keys.ARROW_DOWN)
#wait = WebDriverWait(driver, 10)
#all_items = wait.until(EC.visibility_of_element_located((By.XPATH, "//a")))
#all_items = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//a")))
# ---
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html5lib')
course_list = soup.find("div", attrs={'class': "courseList"})
course_code = course_list.find_all("div", attrs={'class': "sc-esYiGF ikRlqb ui-card ui-card--course"})
print('--- code ---', len(course_code))
for item in course_code:
    print(item.get_text(strip=True, separator='\n').split('\n'))
# ---
course_list = driver.find_element(By.XPATH, '//div[@class="courseList"]')
#course_code = course_list.find_elements(By.XPATH, './/div[@class="sc-esYiGF ikRlqb ui-card ui-card--course"]')
course_code = course_list.find_elements(By.CSS_SELECTOR, 'div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
#course_code = driver.find_elements(By.CSS_SELECTOR, 'div.courseList div.sc-esYiGF.ikRlqb.ui-card.ui-card--course')
print('--- code ---', len(course_code))
for item in course_code:
    print(item.text.split('\n'))
# ---
driver.quit()  # close the browser when done
Result:
It seems BeautifulSoup found more elements (9, versus 7 with Selenium):
--- code --- 9
['FM100', 'Half unit', 'Introduction to Finance']
['EC1A3', 'Half unit', 'Microeconomics I']
['EC1B3', 'Half unit', 'Macroeconomics I']
['ST102', 'One unit', 'Elementary Statistical Theory']
['FM102', 'Half unit', 'Quantitative Methods for Finance']
['MA108', 'Half unit', 'Methods in calculus and linear algebra']
['LSE100', 'Half unit', 'The LSE Course']
['AC102', 'Half unit', 'Elements of Financial Accounting']
['ST101', 'Half unit', 'Programming for Data Science']
--- code --- 7
['FM100', 'Half unit', 'Introduction to Finance']
['EC1A3', 'Half unit', 'Microeconomics I']
['EC1B3', 'Half unit', 'Macroeconomics I']
['ST102', 'One unit', 'Elementary Statistical Theory']
['FM102', 'Half unit', 'Quantitative Methods for Finance']
['MA108', 'Half unit', 'Methods in calculus and linear algebra']
['LSE100', 'Half unit', 'The LSE Course']
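Since the question asked only for the course codes, the first item of each printed list is what matters. A small sketch, using three rows copied from the output above, that keeps just the code or builds one dict per course (the key names `code`, `unit` and `title` are my own choice):

```python
# Three rows copied from the printed output.
rows = [
    ['FM100', 'Half unit', 'Introduction to Finance'],
    ['EC1A3', 'Half unit', 'Microeconomics I'],
    ['ST102', 'One unit', 'Elementary Statistical Theory'],
]

# Only the course codes.
codes = [row[0] for row in rows]

# Or one dict per course, pairing each field with a label.
courses = [dict(zip(("code", "unit", "title"), row)) for row in rows]

print(codes)
print(courses[0])
```

This assumes every card yields exactly three text parts in the same order, which holds for the rows shown but is not guaranteed for every card on the page.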